Form spec proposal: configure audio quality for internal recording

seadowg · November 9, 2020, 1:22pm

This spec proposal relates to the new in-app recording feature discussed here.

We propose making the audio recording quality configurable in form design. This would allow the form designer to make an informed decision based on the analysis to be done, and it wouldn't add to the complexity of configuring clients.

XLSForm
Introduce a key named quality to the parameters column. This matches the pattern established by max-pixels or audit parameters. This key would only be applicable to questions of type audio.

type	name	label	parameters
audio	my_recording	Label	quality=voice-only
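
As a rough illustration of how a converter might read this column, here is a minimal Python sketch. The helper name `parse_parameters` is hypothetical and not taken from any XLSForm tool; it assumes parameters are whitespace-separated `key=value` pairs and tolerates optional quotes around the value.

```python
def parse_parameters(cell):
    """Split an XLSForm parameters cell into a dict (hypothetical helper)."""
    params = {}
    for token in cell.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        # Drop optional single or double quotes around the value.
        params[key.strip()] = value.strip().strip("'\"")
    return params


print(parse_parameters("quality=voice-only"))
```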

XForm
Introduce a bind attribute in the odk namespace with name quality applicable to fields with bind type binary. It would be ignored for binary questions with mediatype other than audio/*.

<bind nodeset="/data/my_audio" type="binary" odk:quality="voice-only"/>
...
<upload mediatype="audio/*" ref="/data/my_audio">
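
To make the "ignored for non-audio mediatypes" rule concrete, here is a minimal Python sketch of how a client might read the attribute. It is an illustration only, not how Collect or Enketo do it: the function name, the wrapper document in `SAMPLE`, and the use of `xml.etree.ElementTree` are all assumptions. The odk namespace URI is the one given in the XForms spec.

```python
import xml.etree.ElementTree as ET

ODK_NS = "http://www.opendatakit.org/xforms"  # assigned the odk prefix
XFORMS_NS = "http://www.w3.org/2002/xforms"

# Assumed minimal form wrapped around the proposal's two elements.
SAMPLE = f"""
<h:html xmlns="{XFORMS_NS}" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:odk="{ODK_NS}">
  <h:head>
    <model>
      <bind nodeset="/data/my_audio" type="binary" odk:quality="voice-only"/>
    </model>
  </h:head>
  <h:body>
    <upload mediatype="audio/*" ref="/data/my_audio"/>
  </h:body>
</h:html>
"""


def audio_quality_by_node(xform_xml):
    """Map nodeset -> quality for binary binds backed by an audio upload."""
    root = ET.fromstring(xform_xml.strip())
    # Refs of upload controls whose mediatype is audio/*.
    audio_refs = {
        el.get("ref")
        for el in root.iter(f"{{{XFORMS_NS}}}upload")
        if (el.get("mediatype") or "").startswith("audio/")
    }
    result = {}
    for bind in root.iter(f"{{{XFORMS_NS}}}bind"):
        quality = bind.get(f"{{{ODK_NS}}}quality")
        nodeset = bind.get("nodeset")
        # The attribute is ignored unless the bind is binary and audio-backed.
        if bind.get("type") == "binary" and nodeset in audio_refs and quality:
            result[nodeset] = quality
    return result


print(audio_quality_by_node(SAMPLE))
```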

Values
Only the following string literals would be allowed as values:

voice-only: minimizes file size by optimizing for voice. Only appropriate for one speaker/participant at a time with minimal background noise.
low: allows voice recording in noisier backgrounds but not great for detailed sounds.
normal: high enough quality for most applications while keeping file size low.
external: recording will be delegated to an external app (same as current behaviour in Collect 1.28)

value	extension	codec	channels	sample rate	bitrate	file size
`voice-only`	.amr	AMR	mono	8kHz	12.2kbps	~5MB/hour
`low`	.m4a	AAC	mono	32kHz	24kbps	~11MB/hour
`normal`	.m4a	AAC	mono	32kHz	64kbps	~30MB/hour
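
The file size column follows directly from the bitrate. A quick Python check, assuming a constant bitrate, decimal megabytes, and no container overhead (the function name is just for illustration):

```python
def mb_per_hour(bitrate_kbps):
    """Approximate audio payload size for one hour at a constant bitrate."""
    bits_per_hour = bitrate_kbps * 1000 * 3600
    return bits_per_hour / 8 / 1_000_000  # bits -> bytes -> decimal MB


for name, kbps in [("voice-only", 12.2), ("low", 24), ("normal", 64)]:
    print(f"{name}: {mb_per_hour(kbps):.1f} MB/hour")
```

This gives about 5.5, 10.8 and 28.8 MB/hour, in line with the rounded figures in the table.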

Decisions around the details of the different quality settings (codec, container, bitrate etc) were based on the defaults we’ve seen in Sony’s (now deprecated but very popular) and Google’s Android recorder apps. From conversations on the forum and with potential users, AMR was identified as a good choice for low storage, voice optimized recordings that still work with transcription services. For the moment we’ve chosen not to offer a PCM/compressed lossless option as we’ve not seen many use cases that require it and it would be more work to implement. If people need this they could continue to use external, but we also want to make sure that high quality recordings could be added later as a contribution.

By default, the recording quality will be normal. We propose making the use of an external app configurable in Collect settings but not the quality. This will give the form designer control over the file type and size when using the internal recorder.
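
A minimal sketch of how a client could apply the default and the allowed values, in Python for brevity. The names are hypothetical, and raising on an unknown value is an assumption: the proposal doesn't say how clients should handle invalid values, and falling back to the default would be another reasonable choice.

```python
ALLOWED_QUALITIES = ("voice-only", "low", "normal", "external")
DEFAULT_QUALITY = "normal"


def resolve_quality(value=None):
    """Return the quality to record with, given the form's odk:quality value."""
    if value is None or value == "":
        # No quality set in the form: use the proposed default.
        return DEFAULT_QUALITY
    if value not in ALLOWED_QUALITIES:
        # Assumption: reject unknown values rather than guessing.
        raise ValueError(
            f"unknown quality {value!r}; expected one of {ALLOWED_QUALITIES}"
        )
    return value


print(resolve_quality())              # form did not set a quality
print(resolve_quality("voice-only"))  # form asked for voice-only
```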

Xiphware · November 9, 2020, 8:30pm

Minor comment - this feels very much like how we convey image size preferences via orx:max-pixels. Any reason not to use the same "orx:" scheme prefix?

LN · November 10, 2020, 5:51am

I'd say we made a mistake with max-pixels. http://openrosa.org/xforms is also used by CommCare and perhaps other tools so we risk collision/incompatibilities using it now that we no longer have an OpenRosa consortium. We've since updated https://getodk.github.io/xforms-spec/#namespaces to clarify:

For any new additions not defined in another specification, the "http://www.opendatakit.org/xforms" namespace is now preferred. It is assigned the odk prefix in this documentation.

Hopefully that sounds right to you but do say if not!

Xiphware · November 10, 2020, 8:11pm

Not perpetuating past indiscretions sounds like a good enough reason to me...

seadowg · November 13, 2020, 1:46pm

Thanks to the TAB members for the discussion on Wednesday! A lot of good questions and concerns raised around audio quality and configuration. As we chatted about, it makes sense to just use quality as the parameter name - I'll update the post to reflect that.

We agreed that as a follow up from the approval of this we'd look into the following:

1) Investigate adding "low" and "high" quality options
2) Investigate adding overrides in Collect to allow for situations where the enumerator needs to change audio quality to get a recording (noisier environment than expected etc)

On 1), we'll do some research and work out what the configurations for these would be. We'll initially look into offering "low" quality that's just AAC at 32kbps (around 14MB/hour). We'll experiment with this but it'd also be good to hear about any examples where people have used this (or any lossy encoding at a similar bitrate) and why. I'd really like to be able to give people examples of when to use each quality setting so having a good understanding of the drawbacks/advantages for each is important.

When chatting this over, @LN pointed out to me that for 2) we can actually use form logic as an alternative to adding client settings:

type	name	label	parameters	relevance
select_one yes_no	is_quiet	Are you in a quiet place?
audio	recording_voice_only	Please record	quality='voice-only'	${is_quiet} = 'yes'
audio	recording_normal	Please record	quality='normal'	${is_quiet} = 'no'

I also noticed that @martijnr gave this a like, but I wanted to double check that everything about this seems ok from Enketo's point of view. Our thinking is this would be optional for the client to implement, but I'm interested to hear your input if there's anything that could be added/improved upon for the Enketo side of things.

martijnr · November 13, 2020, 5:25pm

Thanks Callum!

It looks good to me! Wrt Enketo, and browser-clients in general, the only potential issue would be the supported formats and codecs that correspond to the different quality values for this feature. It would be nice if they could be the same as what ODK Collect will support. I don't know much about browser audio recording tbh.

Is KoBoToolbox still planning to (finish) add(ing) audio recording to Enketo @Tino_Kreutzer? If so, could one of the devs comment on formats and codecs?

seadowg · November 18, 2020, 6:37pm

After doing some research and testing we're thinking that we could add a low quality that records AAC at 24kbps. This would have a file size of ~11MB/hour. We had initially looked at using 32kbps but after testing 24 and 32 vs AMR (at 12.2), it seems we still get a nice boost in detail in a 24kbps AAC over AMR (and obviously still a large file size saving from normal).

Given all that, I think we can add low to the values for quality unless anyone has any concerns?

Tino_Kreutzer · November 18, 2020, 7:48pm

Yes, but not in the near term because the dev working on this has left us. So there are no commitments on that side regarding the spec.

@jnm may have other thoughts though.

Tino_Kreutzer · November 18, 2020, 8:26pm

@seadowg, great to see the additional level in there. So in summary you're proposing to add the second row to the original table:

value	extension	codec	channels	sample rate	bitrate	file size
`voice-only`	.amr	AMR	mono	8kHz	12.2kbps	~5MB/hour
`low`	.m4a	AAC	mono	32kHz (?)	24kbps	~11MB/hour
`normal`	.m4a	AAC	mono	32kHz	64kbps	~30MB/hour

In your testing, did you find a noticeable difference in speaker/voice quality between voice-only and low, and between low and normal?

I would just like to confirm if there are strong reasons against adding another level for when "normal" may not be high enough quality (to capture use cases that are not simple voice-related, e.g. recording a sample of music, birdsong, etc.). This could be using 44kHz and 192kbps.

seadowg · November 19, 2020, 9:23am

Here's a zip of one of the test recordings I made at AMR, 24kbps AAC and WAV as an example:

street_recording.zip (2.1 MB)

Basically we found with heavier background noise that 24 had an advantage over AMR. It also feels like 24 would be usable in situations where you're interested in picking out what's happening in the background for context (whereas AMR makes that hard). As a side note we were pretty impressed with AMR overall and found even with a couple of people speaking over each other it worked fine.

Oh, sorry I didn't mean to suggest we were dismissing that! I started testing out "low" first and haven't played around with a "high" setting yet. For use cases like that (recording music samples, birdsongs) do you have contacts we could talk to? I think with "low" we're still really focused on voices, but I'm less clear on what use cases/level of detail might drive out a "high" quality.

We did do a quick test comparing recordings at higher AAC bitrates and found we didn't see much of a difference between 64 and 320 but did see a difference between 320 AAC and lossless (WAV). I'd want to get some examples of the sounds people are trying to capture though, so we're testing the right things.

Florian_May · November 20, 2020, 12:04am

High quality examples could be sightings (or hearings) of native fauna, e.g. bird or frog calls. Some might lend themselves to postprocessing / ML given suitable quality.
There are probably some apps out there that already capture sound like this, maybe "Birds in my backyard".
Another use case could be sound recording of noise complaints for local council apps.

Xiphware · November 20, 2020, 7:03am

'Bird', or 'Frog' huh... (yeah right)

Personally, I'd be interested to know how good these different quality levels are at, oh, perhaps acquiring Testudine calls...

Florian_May · November 20, 2020, 10:57am

Oh with turtles the quality doesn't matter as we record at night (too dark)
Also, happy birthday Gareth!

seadowg · November 20, 2020, 2:34pm

Is that something you've done before? I'd be interested to hear what you'd require for doing analysis like that.

Florian_May · November 20, 2020, 11:33pm

I haven't used audio recordings in my own projects yet, but I have worked on data where "sightings" include encounters where a human observer recognizes characteristic bird calls from a distance.
These calls might be so transient that by the time an observer whips out ODK Collect, they're gone. Here, having audio clips of known calls (as already in spec) could help picking the right species.

I've heard of apps recording frog presence, and frogs are identified through their calls. Frogs call long enough to record them.
I could imagine ODK Collect recordings of frog calls being analyzed against a library of calls of known species. This could be done using ML (Shazam for frogs) or through frog experts reviewing the audio. Substitute your favourite taxon here

Citizen science projects in general are limited by the expertise of the observers, and the best data we get from those projects are uninterpreted data - auto geolocation, auto timestamp, photos, videos, audio recordings.
Media can often be analyzed later, and often contains things we're not even looking for in the field, but might be of interest much later. Example: underwater photos of coral reef transects are analyzed for the hard coral cover, but later someone else might revisit the images to count sponges.
So having "sufficiently high quality for a potential but not yet known future use" recordings would be useful.

seadowg · November 23, 2020, 10:13am

This is definitely something I've been thinking about and it would lean towards supporting a lossless format. If you have any contacts you think I could speak to more about this feel free to DM me here or on Slack!

Tino_Kreutzer · November 25, 2020, 8:25pm

@seadowg Could you share the audio files you used for this? I agree that for recording pure voice the difference can be hard to tell but am wondering what you tested (with background noises etc.).

Here's a simple test from a music file, saved as 64 and 192. The difference is obvious, so no need to make this a blind test. music-64 and 192.zip (635.8 KB)

jnm · November 26, 2020, 12:19am

I wanted to offer some feedback with an eye (ear?) toward using automated transcription tools. Google's best practices for their Speech-to-Text API say:

For optimal results...Use a lossless codec to record and transmit audio. FLAC or LINEAR16 is recommended. If your application must use a lossy codec to conserve bandwidth, we recommend the AMR_WB, OGG_OPUS or SPEEX_WITH_HEADER_BYTE codecs, in that preferred order.

"AMR_WB" is defined as "Adaptive Multi-Rate Wideband" with a sample rate of 16kHz. Plain ol' AMR (narrow band, as used in street-amr.amr) is not recommended at all. Possibly helpful—Wikipedia claims that Android provides a mechanism to encode AMR Wideband:

For encoding, another open-source library exists as well, provided by VisualOn. It is included in the Android mobile operating system.

IBM is more cavalier about the situation and oddly recommends only a sample depth, not a particular codec, sample rate, or bitrate:

With Speech to Text, you can safely use lossy compression to maximize the amount of audio that you can send to the service with a recognition request. Because the dynamic range of the human voice is more limited than, say, music, speech can accommodate a bit rate that is much lower than other types of audio. For speech recognition, IBM® recommends that you use 16 bits per sample for your audio and employ a format that compresses the audio data.

Although I am but a mere developer and can't speak directly about use cases in the field, I do think having a CD-quality, archival option is important in general—so thumbs up to the following:

PS: @Tino_Kreutzer, your 64 vs. 192 kbps test uses AC3, not AAC. That said, I can definitely hear the difference between 64 and 320 kbps AAC at a 32 kHz sampling rate. Here's a brief speech sample I recorded as 44.1 kHz WAV on a Pixel XL (first-generation, using this app) and then encoded with ffmpeg using:

ffmpeg -i source.wav -ar 32k -b:a 64k '64 kbps 32 khz.mp4'
64 kbps 32 khz.zip (377.0 KB)
ffmpeg -i source.wav -ar 32k -b:a 320k '320 kbps 32 khz.mp4'
320 kbps 32 khz.zip (938.2 KB)

For fun, I also subtracted (by inverting and adding) the decoded 64 kbps waveform from the decoded 320 kbps waveform, yielding a difference that gives a sense of the content lost by encoding at a lower bitrate: 64 kbps 32 khz -subtracted from- 320 kbps 32 khz.zip (2.1 MB).
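
For anyone who wants to try the same comparison, the "invert and add" step can be sketched in a few lines of Python. This is only an illustration on made-up sample values, not the exact workflow used for the file above. It assumes both recordings have already been decoded to PCM at the same sample rate and are time-aligned sample for sample; lossy encoders can add padding at the start, so real files may need trimming first.

```python
def difference_signal(reference, degraded):
    """Invert `degraded` and add it to `reference`, sample by sample."""
    if len(reference) != len(degraded):
        raise ValueError("signals must be the same length and time-aligned")
    inverted = [-s for s in degraded]
    return [r + i for r, i in zip(reference, inverted)]


# Made-up 16-bit sample values standing in for the two decoded files.
reference = [0, 1000, 2000, 1000, 0, -1000]
degraded = [0, 990, 1985, 1010, 0, -1000]
print(difference_signal(reference, degraded))  # [0, 10, 15, -10, 0, 0]
```

Whatever is left in the result is the content that differs between the two encodings; identical inputs give all zeros.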

PPS: Do we get to dodge the concerns expressed earlier about codec patents and licensing because the encoding is handled by the Android OS and therefore not our problem?

LN · December 1, 2020, 6:07am

@seadowg captured speech while walking on a busy road, two people talking over each other in a park and an outdoor fountain in a courtyard -- I'll let him share some of those. We evaluated them thinking about information content. That is, can all the same background sounds, intonation, etc, be heard. I think @jnm's example is representative of our findings: beyond 64kbps, a difference is audible but there isn't any content difference for most sounds. @jnm what did you take away from your experiment?

Then the question becomes what the purpose of a higher-quality recording would be. Two bad outcomes would be that we pick a bitrate that does not capture any meaningfully greater amount of information but results in file sizes that are impractically large (e.g. 128kbps vs 192kbps) or that we pick a bitrate that doesn't capture enough (e.g. 192kbps vs 256kbps).

128kbps seems to be the threshold beyond which most people can't hear the difference for music, especially with a variable bit rate. But my experience doing A/B tests suggests it really depends on genre/instruments/etc. I also don't know how comparable the listening experience between music and field recordings would be. That's where knowing more about how recordings might be used would really help pick a good high option. But if others are feeling good about 192kbps, I would not push back.

It sounds like there is broad agreement that adding a lossless option would be beneficial. It seems that would pretty clearly be FLAC as m4a, 44.1 kHz, 16-bit. I think it would probably be worth doing stereo for those who might have external mics. Maybe that's one to add for this initial release?
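
For a sense of scale, the uncompressed PCM data rate for those settings can be worked out as below. This is an upper bound only: FLAC is lossless compression, so actual files would be smaller by an amount that depends on the content, and no compression ratio is assumed here. The function name is just for illustration.

```python
def pcm_mb_per_hour(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM size for one hour of audio, in decimal megabytes."""
    bytes_per_second = sample_rate_hz * bit_depth * channels / 8
    return bytes_per_second * 3600 / 1_000_000


print(f"stereo: {pcm_mb_per_hour(44_100, 16, 2):.2f} MB/hour")
print(f"mono:   {pcm_mb_per_hour(44_100, 16, 1):.2f} MB/hour")
```

That works out to about 635 MB/hour for stereo and about 318 MB/hour for mono before compression, compared with ~30MB/hour for normal.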

That's right.

AMR wideband gets to about 11MB/hour in the testing I've done so it loses the major benefit of AMR narrowband which is to dramatically reduce file size. My understanding is that it's also supported by a lot fewer tools so may not be convenient for most users. @aurdipas I think you said that when you've used amr you had about ~5MB/hour files, right? Do you know for sure whether you used narrowband or wideband?

A quick sampling of online speech-to-text services is more in line with IBM. Are you advocating for switching to AMR WB, @jnm?

aurdipas · December 1, 2020, 9:48am

Not sure whether it was narrowband or wideband. It was the default amr coming with the app.