Form spec proposal: configure audio quality for internal recording

jnm · November 26, 2020, 12:19am

I wanted to offer some feedback with an eye (ear?) toward using automated transcription tools. Google's best practices for their Speech-to-Text API say:

For optimal results...Use a lossless codec to record and transmit audio. FLAC or LINEAR16 is recommended. If your application must use a lossy codec to conserve bandwidth, we recommend the AMR_WB, OGG_OPUS or SPEEX_WITH_HEADER_BYTE codecs, in that preferred order.

"AMR_WB" is defined as "Adaptive Multi-Rate Wideband" with a sample rate of 16kHz. Plain ol' AMR (narrow band, as used in street-amr.amr) is not recommended at all. Possibly helpful—Wikipedia claims that Android provides a mechanism to encode AMR Wideband:

For encoding, another open-source library exists as well, provided by VisualOn. It is included in the Android mobile operating system.
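
As a rough sketch of how Google's stated preference order could drive the choice of recording format: the function below picks the most-preferred codec from whatever a device reports as available. The names `GOOGLE_STT_PREFERENCE` and `pick_codec` are hypothetical, and putting FLAC ahead of LINEAR16 is an arbitrary choice, since the quoted guidance does not rank the two lossless options against each other.

```python
# Order taken from the Google guidance quoted above: lossless codecs first,
# then the lossy codecs "in that preferred order".
# FLAC before LINEAR16 is an assumption; the guidance does not rank them.
GOOGLE_STT_PREFERENCE = [
    "FLAC",
    "LINEAR16",
    "AMR_WB",
    "OGG_OPUS",
    "SPEEX_WITH_HEADER_BYTE",
]


def pick_codec(available):
    """Return the most-preferred codec that is available, or None.

    `available` is any iterable of codec names. Codecs outside the
    preference list (such as narrowband "AMR") are never selected.
    """
    offered = set(available)
    for codec in GOOGLE_STT_PREFERENCE:
        if codec in offered:
            return codec
    return None
```

For example, `pick_codec(["AMR", "OGG_OPUS", "AMR_WB"])` selects `"AMR_WB"`, while a device that only offers narrowband `"AMR"` gets `None`.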

IBM is more cavalier about the situation and oddly recommends only a bit depth, not a particular codec, sample rate, or bitrate:

With Speech to Text, you can safely use lossy compression to maximize the amount of audio that you can send to the service with a recognition request. Because the dynamic range of the human voice is more limited than, say, music, speech can accommodate a bit rate that is much lower than other types of audio. For speech recognition, IBM® recommends that you use 16 bits per sample for your audio and employ a format that compresses the audio data.
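
To put IBM's point about bit rates in numbers, here is a small sketch of the arithmetic. Uncompressed PCM needs sample rate × bits per sample × channels bits per second. The 16 kHz and 44.1 kHz rates come from this thread; mono recording and the helper names are assumptions made for illustration.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(pcm_bps, encoded_bps):
    """How many times smaller the encoded stream is than raw PCM."""
    return pcm_bps / encoded_bps


# 16-bit mono at the 16 kHz wideband rate, and at 44.1 kHz (CD rate).
wideband_bps = pcm_bit_rate(16_000, 16)
cd_mono_bps = pcm_bit_rate(44_100, 16)

print(wideband_bps)                             # 256000
print(cd_mono_bps)                              # 705600
print(compression_ratio(wideband_bps, 64_000))  # 4.0
```

So a 64 kbps lossy stream is about a quarter of the size of 16-bit, 16 kHz mono PCM, and roughly an eleventh of 16-bit, 44.1 kHz mono PCM.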

Although I am but a mere developer and can't speak directly about use cases in the field, I do think having a CD-quality, archival option is important in general—so thumbs up to the following:

PS: @Tino_Kreutzer, your 64 vs. 192 kbps test uses AC3, not AAC. That said, I can definitely hear the difference between 64 and 320 kbps AAC at a 32 kHz sampling rate. Here's a brief speech sample I recorded as 44.1 kHz WAV on a Pixel XL (first-generation, using this app) and then encoded with ffmpeg using:

ffmpeg -i source.wav -ar 32k -b:a 64k '64 kbps 32 khz.mp4'
64 kbps 32 khz.zip (377.0 KB)
ffmpeg -i source.wav -ar 32k -b:a 320k '320 kbps 32 khz.mp4'
320 kbps 32 khz.zip (938.2 KB)

For fun, I also subtracted (by inverting and adding) the decoded 64 kbps waveform from the decoded 320 kbps waveform, yielding a difference that gives a sense of the content lost by encoding at a lower bitrate: 64 kbps 32 khz -subtracted from- 320 kbps 32 khz.zip (2.1 MB).
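
For anyone who wants to repeat the subtraction, here is a minimal sketch using only the Python standard library. It is not necessarily how the attached file was produced. It assumes both encodes have already been decoded back to 16-bit mono PCM WAV files and that the two decodes are time-aligned; the function names are hypothetical.

```python
import array
import sys
import wave

INT16_MIN, INT16_MAX = -32768, 32767


def read_pcm16(path):
    """Read a 16-bit PCM WAV file into an array of signed samples."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples


def subtract(minuend, subtrahend):
    """Invert `subtrahend` and add it to `minuend`, sample by sample.

    The result is truncated to the shorter input and clipped to the
    16-bit range.
    """
    n = min(len(minuend), len(subtrahend))
    out = array.array("h")
    for a, b in zip(minuend[:n], subtrahend[:n]):
        out.append(max(INT16_MIN, min(INT16_MAX, a + (-b))))
    return out
```

With two decoded files on disk, `subtract(read_pcm16(high_path), read_pcm16(low_path))` gives the residual samples, where `high_path` and `low_path` stand in for the decoded 320 kbps and 64 kbps files.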

PPS: Do we get to dodge the concerns expressed earlier about codec patents and licensing because the encoding is handled by the Android OS and therefore not our problem?