Form spec proposal: configure audio quality for internal recording

This spec proposal relates to the new in-app recording feature discussed here.

We propose making the audio recording quality configurable in form design. This would allow the form designer to make an informed decision based on the analysis to be done and doesn't add to the complexity of configuring clients.

Introduce a key named quality to the parameters column. This matches the pattern established by max-pixels or audit parameters. This key would only be applicable to questions of type audio.

| type | name | label | parameters |
| --- | --- | --- | --- |
| audio | my_recording | Label | quality=voice-only |

Introduce a bind attribute in the odk namespace with name quality applicable to fields with bind type binary. It would be ignored for binary questions with mediatype other than audio/*.

<bind nodeset="/data/my_audio" type="binary" odk:quality="voice-only"/>
<upload mediatype="audio/*" ref="/data/my_audio">
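As a rough sketch of how a client might read this attribute: the snippet below pairs each bind with its upload and keeps the quality only when the mediatype is audio/*. The wrapper `<root>` element, the second (photo) field, the function name, and both namespace URIs are assumptions made for illustration; the proposal itself only names the odk prefix.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the proposal only refers to the "odk" prefix.
ODK_NS = "http://www.opendatakit.org/xforms"
XFORMS_NS = "http://www.w3.org/2002/xforms"

# Simplified, hypothetical form fragment (a real XForm has more structure).
FORM = f"""
<root xmlns="{XFORMS_NS}" xmlns:odk="{ODK_NS}">
  <bind nodeset="/data/my_audio" type="binary" odk:quality="voice-only"/>
  <bind nodeset="/data/my_photo" type="binary" odk:quality="voice-only"/>
  <upload mediatype="audio/*" ref="/data/my_audio"/>
  <upload mediatype="image/*" ref="/data/my_photo"/>
</root>
"""


def audio_quality_by_nodeset(xml_text):
    """Return {nodeset: quality} for binary binds whose upload is audio/*."""
    root = ET.fromstring(xml_text)
    # Map each upload's ref to its mediatype.
    mediatypes = {
        upload.get("ref"): upload.get("mediatype", "")
        for upload in root.iter(f"{{{XFORMS_NS}}}upload")
    }
    result = {}
    for bind in root.iter(f"{{{XFORMS_NS}}}bind"):
        quality = bind.get(f"{{{ODK_NS}}}quality")
        nodeset = bind.get("nodeset")
        if bind.get("type") != "binary" or quality is None:
            continue
        # Ignore the attribute unless the matching upload records audio.
        if mediatypes.get(nodeset, "").startswith("audio/"):
            result[nodeset] = quality
    return result


print(audio_quality_by_nodeset(FORM))
```

Here the photo field carries the attribute too, but it is dropped because its mediatype is image/*.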

Only the following string literals would be allowed as values:

  • voice-only: minimizes file size by optimizing for voice. Only appropriate for one speaker/participant at a time with minimal background noise.
  • low: allows voice recording in noisier backgrounds but not great for detailed sounds.
  • normal: high enough quality for most applications while keeping file size low.
  • external: recording will be delegated to an external app (same as current behaviour in Collect 1.28)
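
| value | extension | codec | channels | sample rate | bitrate | file size |
| --- | --- | --- | --- | --- | --- | --- |
| voice-only | .amr | AMR | mono | 8kHz | 12.2kbps | ~5MB/hour |
| low | .m4a | AAC | mono | 32kHz | 24kbps | ~11MB/hour |
| normal | .m4a | AAC | mono | 32kHz | 64kbps | ~30MB/hour |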
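To show where the file size column comes from, here is a minimal sketch that stores the table's settings in a lookup and derives megabytes per hour from the bitrate alone. The class and function names are hypothetical, and the calculation ignores container overhead, so it lands slightly off the rounded figures in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingProfile:
    extension: str
    codec: str
    sample_rate_hz: int
    bitrate_kbps: float


# Values copied from the table; "external" has no profile because
# recording is handed off to another app.
PROFILES = {
    "voice-only": RecordingProfile(".amr", "AMR", 8_000, 12.2),
    "low": RecordingProfile(".m4a", "AAC", 32_000, 24),
    "normal": RecordingProfile(".m4a", "AAC", 32_000, 64),
}


def megabytes_per_hour(bitrate_kbps):
    """Audio payload only: kilobits/s -> megabytes/hour (1 MB = 1000 kB)."""
    return bitrate_kbps * 3600 / 8 / 1000


for name, profile in PROFILES.items():
    print(f"{name}: ~{megabytes_per_hour(profile.bitrate_kbps):.1f} MB/hour")
```

The same function gives a quick way to estimate the cost of any other bitrate that comes up in discussion.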

Decisions around the details of the different quality settings (codec, container, bitrate etc) were based on the defaults we’ve seen in Sony’s (now deprecated but very popular) and Google’s Android recorder apps. From conversations on the forum and with potential users, AMR was identified as a good choice for low storage, voice optimized recordings that still work with transcription services. For the moment we’ve chosen not to offer a PCM/compressed lossless option as we’ve not seen many use cases that require it and it would be more work to implement. If people need this they could continue to use external, but we also want to make sure that high quality recordings could be added later as a contribution.

By default, the recording quality will be normal. We propose making the use of an external app configurable in Collect settings but not the quality. This will give the form designer control over the file type and size when using the internal recorder.


Minor comment - this feels very much like how we convey image size preferences via orx:max-pixels. Any reason not to use the same "orx:" scheme prefix?

I'd say we made a mistake with max-pixels. The orx namespace is also used by CommCare and perhaps other tools, so we risk collisions/incompatibilities using it now that we no longer have an OpenRosa consortium. We've since updated the spec to clarify:

For any new additions not defined in another specification, the "" namespace is now preferred. It is assigned the odk prefix in this documentation.

Hopefully that sounds right to you but do say if not!

:+1: Not perpetuating past indiscretions sounds like a good enough reason to me... :slight_smile:

Thanks to the TAB members for the discussion on Wednesday! A lot of good questions and concerns raised around audio quality and configuration. As we chatted about, it makes sense to just use quality as the parameter name - I'll update the post to reflect that.

We agreed that, as a follow up from the approval of this, we'd look into the following:

  1. Investigate adding "low" and "high" quality options
  2. Investigate adding overrides in Collect to allow for situations where the enumerator needs to change audio quality to get a recording (noisier environment than expected etc)

On 1), we'll do some research and work out what the configurations for these would be. We'll initially look into offering a "low" quality that's just AAC at 32kbps (around 14MB/hour). We'll experiment with this but it'd also be good to hear about any examples where people have used this (or any lossy encoding at a similar bitrate) and why. I'd really like to be able to give people examples of when to use each quality setting, so having a good understanding of the drawbacks/advantages of each is important.

When chatting this over, @LN pointed out to me that for 2) we can actually use form logic as an alternative to adding client settings:

| type | name | label | parameters | relevance |
| --- | --- | --- | --- | --- |
| select_one yes_no | is_quiet | Are you in a quiet place? | | |
| audio | recording_voice_only | Please record | quality='voice-only' | ${is_quiet} = 'yes' |
| audio | recording_normal | Please record | quality='normal' | ${is_quiet} = 'no' |
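To make the effect of that form logic concrete, here is a small sketch of what it boils down to: the answer to is_quiet decides which audio question is relevant, and so which quality gets used. The data structure and function name are hypothetical stand-ins for the form rows, not how any client actually evaluates relevance.

```python
# Hypothetical stand-in for the form rows: each audio question carries its
# quality parameter and the is_quiet answer that makes it relevant.
AUDIO_QUESTIONS = [
    {"name": "recording_voice_only", "quality": "voice-only", "relevant_when": "yes"},
    {"name": "recording_normal", "quality": "normal", "relevant_when": "no"},
]


def relevant_audio_question(is_quiet):
    """Return the audio question shown for a given is_quiet answer, or None."""
    for question in AUDIO_QUESTIONS:
        if question["relevant_when"] == is_quiet:
            return question
    return None


print(relevant_audio_question("yes")["quality"])
print(relevant_audio_question("no")["quality"])
```

The enumerator never picks a quality directly; they just answer the question about their surroundings.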

I also noticed that @martijnr gave this a :heart: but wanted to double check that everything about this seems ok from Enketo's point of view? Our thinking is this would be optional for the client to implement, but I'm interested to hear your input if there's anything that could be added/improved upon for the Enketo side of things.


Thanks Callum!

It looks good to me! Wrt Enketo, and browser clients in general, the only potential issue would be the supported formats and codecs that correspond to the different quality values for this feature. It would be nice if they could be the same as what ODK Collect will support. I don't know much about browser audio recording tbh.

Is KoBoToolbox still planning to (finish) add(ing) audio recording to Enketo @Tino_Kreutzer? If so, could one of the devs comment on formats and codecs?


After doing some research and testing we're thinking that we could add a low quality that records AAC at 24kbps. This would have a file size of ~11MB/hour. We had initially looked at using 32kbps, but after testing 24 and 32 vs AMR (at 12.2), it seems we still get a nice boost in detail in 24kbps AAC over AMR (and obviously still a large file size saving over normal).

Given all that, I think we can add low to the values for quality unless anyone has any concerns?

Yes, but not in the near term because the dev working on this has left us. So there are no commitments on that side regarding the spec.

@jnm may have other thoughts though.

@seadowg, great to see the additional level in there. So in summary you're proposing to add the second row to the original table:

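| value | extension | codec | channels | sample rate | bitrate | file size |
| --- | --- | --- | --- | --- | --- | --- |
| voice-only | .amr | AMR | mono | 8kHz | 12.2kbps | ~5MB/hour |
| low | .m4a | AAC | mono | 32kHz (?) | 24kbps | ~11MB/hour |
| normal | .m4a | AAC | mono | 32kHz | 64kbps | ~30MB/hour |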

In your testing, did you find a noticeable difference in speaker/voice quality between voice-only and low, and between low and normal?

I would just like to confirm if there are strong reasons against adding another level for when "normal" may not be high enough quality (to capture use cases that are not simple voice-related, e.g. recording a sample of music, birdsong, etc.). This could be using 44kHz and 192kbps.


Here's a zip of one of the test recordings I made at AMR, 24kbps AAC and WAV as an example: (2.1 MB)

Basically we found with heavier background noise that 24 had an advantage over AMR. It also feels like 24 would be usable in situations where you're interested in picking out what's happening in the background for context (whereas AMR makes that hard). As a side note we were pretty impressed with AMR overall and found even with a couple of people speaking over each other it worked fine.

Oh, sorry, I didn't mean to suggest we were dismissing that! I started testing out "low" first and haven't played around with a "high" setting yet. For use cases like that (recording music samples, birdsong), do you have contacts we could talk to? I think with "low" we're still really focused on voices, but I'm less clear on what use cases/level of detail might drive out a "high" quality.

We did do a quick test comparing recordings at higher AAC bitrates and found we didn't see much of a difference between 64 and 320, but did see a difference between 320 AAC and lossless (WAV). I'd want to get some examples of sounds people are trying to capture though, so we're testing the right things.
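For a sense of the storage trade-off behind that comparison, here's a back-of-the-envelope sketch of uncompressed WAV size against the AAC bitrates mentioned. The WAV settings (44.1kHz, 16-bit, mono) are an assumption, since the thread doesn't say what the test recording used, and the function names are made up for the example.

```python
def pcm_megabytes_per_hour(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM size: samples/s * bytes/sample * channels * 3600 s."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * 3600 / 1_000_000


def compressed_megabytes_per_hour(bitrate_kbps):
    """Constant-bitrate payload size, ignoring container overhead."""
    return bitrate_kbps * 3600 / 8 / 1000


# Assumed WAV settings (not given in the thread): 44.1 kHz, 16-bit, mono.
wav = pcm_megabytes_per_hour(44_100, 16, 1)
aac_320 = compressed_megabytes_per_hour(320)
aac_64 = compressed_megabytes_per_hour(64)
print(f"WAV ~{wav:.0f} MB/hour, AAC 320 ~{aac_320:.0f}, AAC 64 ~{aac_64:.0f}")
```

Under those assumptions lossless costs roughly twice as much storage as 320kbps AAC and around ten times as much as the normal setting.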

High quality examples could be sightings (or hearings) of native fauna, e.g. bird or frog calls. Some might lend themselves to postprocessing / ML given suitable quality.
There are probably some apps out there that already capture sound like this, maybe "Birds in my backyard".
Another use case could be sound recordings of noise complaints for local council apps.

'Bird', or 'Frog' huh... (yeah right)

Personally, I'd be interested to know how well these different quality levels do at, oh, perhaps acquiring Testudine calls... :wink:


Oh with turtles the quality doesn't matter as we record at night (too dark) :smile:
Also, happy birthday Gareth!

Is that something you've done before? I'd be interested to hear what you'd require for doing analysis like that.

I haven't used audio recordings in my own projects yet, but I have worked on data where "sightings" include encounters where a human observer recognizes characteristic bird calls from a distance.
These calls might be so transient that by the time an observer whips out ODK Collect, they're gone. Here, having audio clips of known calls (as already in spec) could help with picking the right species.

I've heard of apps recording frog presence, and frogs are identified through their calls. Frogs call long enough to record them.
I could imagine ODK Collect recordings of frog calls being analyzed against a library of calls of known species. This could be done using ML (Shazam for frogs) or through frog experts reviewing the audio. Substitute your favourite taxon here :wink:

Citizen science projects in general are limited by the expertise of the observers, and the best data we get from those projects are uninterpreted data - auto geolocation, auto timestamp, photos, videos, audio recordings.
Media can often be analyzed later, and often contains things we're not even looking for in the field, but might be of interest much later. Example: underwater photos of coral reef transects are analyzed for the hard coral cover, but later someone else might revisit the images to count sponges.
So having "sufficiently high quality for a potential but not yet known future use" recordings would be useful.

This is definitely something I've been thinking about and it would lean towards supporting a lossless format. If you have any contacts you think I could speak to more about this feel free to DM me here or on Slack!
