Form spec proposal: add background audio recording

This is a specification to support recording background audio. As described in the feature post, we intend to release a first version of this feature in Collect v1.30 which would only include starting recording at the beginning of form entry and stopping when the form is exited. However, we are proposing a more comprehensive spec here so that the feature can be expanded on in the future.

Related: start-geopoint, audit, built-in audio recording

Add a new background-audio type. The text in the name column is chosen by the user and is the name of the field that has the audio recording. The only other column that can be used with this type is the parameters column which accepts a key named quality with values defined in the documentation. If the key is omitted, quality defaults to voice-only. Specifying the external quality results in a form conversion error.

type              name          parameters
background-audio  my_recording  quality=low

Add a new bind attribute odk:background-audio that applies to the binary type. If set to true for a binary question, that question is automatically populated by the client with a background audio recording. Client behavior is undefined if an audio binary question has the odk:background-audio attribute set to true and is bound to a body element (XLSForm would never generate this). The XLSForm above would result in the following XForms output:

<bind nodeset="/data/my_recording" type="binary" odk:background-audio="true" odk:quality="low"/>

The odk:quality attribute is considered for binary fields with odk:background-audio set to true. Values voice-only (default), normal and low are accepted and everything else is ignored.
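As a rough illustration of the rule above, here is a minimal Python sketch of how a client might resolve the quality for a bind. It is not taken from any client implementation. The function name `resolve_quality` is hypothetical, the namespace URI bound to the `odk` prefix is an assumption, and treating "ignored" as "fall back to the default" is one possible reading of the text.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: the namespace URI that the odk prefix is bound to in the form.
ODK_NS = "http://www.opendatakit.org/xforms"
ACCEPTED_QUALITIES = {"voice-only", "normal", "low"}
DEFAULT_QUALITY = "voice-only"


def resolve_quality(bind_xml: str) -> Optional[str]:
    """Return the recording quality for a bind, or None if the bind
    is not a background audio bind at all."""
    bind = ET.fromstring(bind_xml)
    if bind.get("type") != "binary":
        return None
    if bind.get(f"{{{ODK_NS}}}background-audio") != "true":
        return None
    quality = bind.get(f"{{{ODK_NS}}}quality")
    # Assumption: an unrecognized or missing value falls back to the default.
    return quality if quality in ACCEPTED_QUALITIES else DEFAULT_QUALITY


example = (
    f'<bind xmlns:odk="{ODK_NS}" nodeset="/data/my_recording" '
    'type="binary" odk:background-audio="true" odk:quality="low"/>'
)
print(resolve_quality(example))  # prints "low"
```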

There can be multiple binary fields with the odk:background-audio attribute set to true, but they will all be populated with the same filename.
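To make the "same filename" behavior concrete, here is a small Python sketch under stated assumptions: the helper names, the namespace URI and the filename `recording.m4a` are all hypothetical, and the example model omits the XForms default namespace for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumption: the namespace URI that the odk prefix is bound to in the form.
ODK_NS = "http://www.opendatakit.org/xforms"


def background_audio_nodesets(model_xml: str) -> List[str]:
    """Collect the nodeset of every binary bind flagged for background audio."""
    root = ET.fromstring(model_xml)
    attr = f"{{{ODK_NS}}}background-audio"
    return [
        bind.get("nodeset")
        for bind in root.iter("bind")
        if bind.get("type") == "binary" and bind.get(attr) == "true"
    ]


def populate(model_xml: str, filename: str) -> Dict[str, str]:
    """Map every flagged field to the one shared recording filename."""
    return {nodeset: filename for nodeset in background_audio_nodesets(model_xml)}


model = f"""
<model xmlns:odk="{ODK_NS}">
  <bind nodeset="/data/rec_a" type="binary" odk:background-audio="true"/>
  <bind nodeset="/data/rec_b" type="binary" odk:background-audio="true"/>
  <bind nodeset="/data/photo" type="binary"/>
</model>
"""
print(populate(model, "recording.m4a"))
```

Both flagged fields end up pointing at the same file, and the ordinary binary field is left alone.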

This design leaves the door open for additional XLSForm parameters/XForms bind attributes to configure recordings. Consider the following:

type              name          parameters
background-audio  introduction  start=${intro}; end=${q1}; probability=.1
background-audio  conclusion    start=${q17}; end=${thanks}

This might mean that for approximately one out of every 10 form filling sessions, the question with name intro, the question named q1, and all questions in between will be recorded. For all form filling sessions, the question named q17, the question named thanks and all questions in between will be recorded in a separate file.
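Purely as a sketch of how such a parameters cell could be interpreted, the Python below parses the `key=value` pairs and makes the per-session sampling decision. None of this syntax is specified yet. The function names are hypothetical, and treating a missing probability as 1 (always record) is an assumption based on the conclusion row above.

```python
import random
import re
from typing import Dict


def parse_parameters(cell: str) -> Dict[str, str]:
    """Split a parameters cell like 'a=1; b=2' into a dict of raw strings."""
    params: Dict[str, str] = {}
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        params[key.strip()] = value.strip()
    return params


def question_name(reference: str) -> str:
    """Turn an XLSForm reference such as '${intro}' into 'intro'."""
    match = re.fullmatch(r"\$\{([^}]+)\}", reference)
    if match is None:
        raise ValueError(f"not a question reference: {reference!r}")
    return match.group(1)


def should_record(params: Dict[str, str], rng: random.Random) -> bool:
    """Decide once per form filling session whether this range is recorded.
    Assumption: a missing probability means the range is always recorded."""
    probability = float(params.get("probability", "1"))
    return rng.random() < probability


params = parse_parameters("start=${intro}; end=${q1}; probability=.1")
print(question_name(params["start"]), question_name(params["end"]))  # prints "intro q1"
```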


Would label::... columns not work with these questions? It can sometimes be useful to give different human-readable labels to data columns which can be used for visualization / exporting into different languages. I believe this is possible for other hidden questions like calculations. Not a big issue though.

Partial audio recordings
This is not part of the spec yet obviously. As mentioned elsewhere, from a user perspective I would find it more intuitive to specify the start / end of a background recording by adding a start / end point to the form, rather than what you proposed in the Extensions section. This could work like groups, e.g.

type                    name
background-audio-start  recording1

They would not because there would be no body element for the label to be nested in. They do work with calculations but then the calculation is displayed by the form filling client (and should always be marked as read-only unless the calculation is wrapped in once() but really default is the way to go now!).

This is also what @danbjoseph proposed here. I want to gently push back on this being simpler. It implies that you'd be thinking about audio recording as you author the form and probably not modifying the recording start/end. I've been imagining adding audio recording as a step that happens after a form is complete. That is, I'd like to test the form, make sure all my logic makes sense, get a feel for how it flows, and once it's final, define what portions I'd like to record. I also imagine I might adjust when recordings start and end as I try the form out and having those defined together would make that process easier.

One additional idea you might like that @seadowg and I discussed is exposing question and group attributes for cases where just a single question or group should be recorded.

If the general sense is that start/end is more intuitive, it should be possible.

Another related theme that came out of the TAB call is whether there would be a benefit to using the event/action mechanism we have instead of bind attributes. I told @Xiphware I'd write that up for consideration in the next couple of days. That would correspond more closely to this start/end XLSForm concept.

Below is an alternative XForms concept that uses events and actions. This would require introducing either a new action for recording audio (e.g. odk:startaudio) or, as I've shown below, a combination of a generic background recording action (e.g. odk:startrecording) and an attribute to indicate the type of recording (e.g. odk:type="audio").

  <bind nodeset="/data/my_recording" type="binary"/>
  <odk:startrecording event="xforms-ready" ref="/data/my_recording"  odk:type="audio" odk:quality="low"/>

Adding recording for a range would require introducing an event for a question being reached and a stoprecording action (the recording implicitly stops on form exit in the example above). (Side note, I don't know if it would really make sense to allow partial recording within a field-list or an Enketo form not in pages mode but it would be possible by doing something like using the value of the immediately preceding question changing as triggering this new event.)

  <bind nodeset="/data/my_recording" type="binary"/>
  <input ref="/data/q1">
    <odk:startrecording ref="/data/my_recording" event="odk-question-reached" odk:type="audio" odk:quality="low"/>
  </input>
  <input ref="/data/thanks">
    <odk:stoprecording ref="/data/my_recording" event="odk-question-reached"/>
  </input>

The purist side of me really likes this. It's extremely flexible and powerful and it's consistent with concepts we've already introduced. The pragmatist side of me is concerned. I think that in XLSForm we'd only expose triggering on the odk-question-reached event and we could do a fair amount of validation at that level. But to really follow the specs clients would need to be able to handle actions being triggered by all the events we support, mismatched start/stop actions, etc. That may be handled by our existing generic support for events and actions but I have a feeling that there will be issues. For example, xforms-value-changed events can be triggered in really quick succession and that might cause instability for media recording. If we're unlikely to expose that functionality in XLSForm anyway, I'd rather avoid it.

One of the big points @Xiphware brought up during the TAB call is that there might be other kinds of background recording and that we'd want to introduce an approach that can be extended to e.g. video, locations, humidity, etc. I don't think that there's a big advantage of one approach vs. the other for this.

  • In the original attributes approach I outlined, we'd have to introduce a new attribute name for each type of data to record. Alternately, the attribute name could be something like odk:auto-populate with values like audio, location, video (which I think I prefer).
  • In the event/action approach defined above we'd need to introduce either new actions or new types for each new type of data to record.

I think of the attributes approach as a flattened version of the actions/events one that implicitly always uses a "question reached" event to trigger recording start and stop. It's less powerful but I think that's an asset because we get more control over what can be expressed by a form. It's also generally simpler to only have to deal with the bind rather than having information about the recording in actions as well.


I want to make sure that @martijnr has seen this thread and has a chance to react.

It would also be great to hear from @Tino_Kreutzer and @danbjoseph about their current thoughts regarding the XLSForm syntax after my response here.

Thanks @LN for the heads up and for this proposal. Sorry, I had not seen it. I'll focus on the XForms side.

Functionally this seems closest to our existing preload items, and since we'd like to eventually deprecate those and replace them with setvalue actions (for consistency), I think your 2 setvalue proposals make the most sense.

I like and would very much prefer the simple <odk:startrecording> action as sibling of <bind> but as you mentioned this would depend on whether we can accept not having fine-grained start and stop control.

If this start/stop functionality really is required (really?), there is an issue with how to determine when a question is reached (the trigger question may never get focus or a value, as the user may skip it, so it would require lots of required/field-list logic to make it all work). Depending on how this could be implemented in ODK, I'm wondering if something like odk-page-shown would be precise enough and perhaps reflect better how it would be implemented anyway.

Is your primary dislike of the attribute-based option that it's not consistent? That was also one of @Xiphware's complaints. I'd argue this case is very similar to audit. In that case, we used a fixed node name to signal to clients that a certain file node should be populated by an audit. If we'd thought of it at the time I might have preferred using a bind attribute. Then the audit is configured through bind attributes which would be the same. The major difference between the two is that there's exactly one audit versus possibly multiple background recordings. Other than that they do feel fairly analogous in that we're asking the client to populate a certain field with a particular kind of data.

Yes, from what we've heard from users, it's quite important. Imagine you have a 3000-question survey and there's one section you suspect is not being handled consistently. It would be much better to ask for those few questions to be recorded than the whole survey.

Conceptually, users want to specify a range of questions and know that whenever the enumerator is operating within that range, audio is being recorded. I've been imagining that clients would pre-compute identifiers (e.g. XPath paths) for all questions between the specified start and end. They'd initiate recording when any of those nodes is "reached" (for whatever that means for the specific client and view) and stop recording when any node not in the set is "reached." This implementation concept is one of the things that has me looking away from actions/events -- we likely wouldn't actually use the events. Instead it would make the pre-computation work harder than if the information about start and end were available in the same place.
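The pre-computation idea described above can be sketched in a few lines of Python. This is only an illustration, not Collect's implementation: the class and function names are hypothetical, and it assumes the client can produce a flat, ordered list of question paths for the form.

```python
from typing import List, Set


def questions_in_range(ordered_paths: List[str], start: str, end: str) -> Set[str]:
    """Pre-compute every question path from start to end, inclusive."""
    first = ordered_paths.index(start)
    last = ordered_paths.index(end)
    if first > last:
        raise ValueError("start must not come after end in form order")
    return set(ordered_paths[first:last + 1])


class RangeRecorder:
    """Records while the most recently reached question is inside the range."""

    def __init__(self, ordered_paths: List[str], start: str, end: str) -> None:
        self._in_range = questions_in_range(ordered_paths, start, end)
        self.recording = False

    def on_question_reached(self, path: str) -> bool:
        # "Reached" is whatever the client defines it to be for its view.
        self.recording = path in self._in_range
        return self.recording


order = ["/data/intro", "/data/q1", "/data/q2", "/data/q3", "/data/thanks"]
recorder = RangeRecorder(order, "/data/q1", "/data/q3")
print(recorder.on_question_reached("/data/intro"))   # prints False
print(recorder.on_question_reached("/data/q2"))      # prints True (jumped into the range)
print(recorder.on_question_reached("/data/thanks"))  # prints False
```

Because the decision depends only on set membership, jumping into the middle of the range or out of it is handled the same way as linear navigation.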

That would be a better name if we're pretty confident clients with single-page views wouldn't want to use focus, presence on screen or value change in the same context. On the Collect side, since we allow non-linear paths through a form, we'd likely do the kind of implementation I described above. In other words, we wouldn't really use existing action and event implementations.

EDIT: maybe something like odk-question-reached-or-passed or odk-page-shown-or-passed would capture the concept I'm describing?

That's what still bothers me. Obviously, the desired behavior - at least for a form designer - is that they can (somehow) explicitly state the specific conditions when to auto-start and end audio recording. So this needs to be conveyed explicitly in the form definition and cannot otherwise be client specific.

In Collect there is a reliable expectation of overt user interaction involved around flipping between each question (although there is nothing in the XForm definition stating this...), but in more paged/web interfaces like Enketo (or iXForms for that matter) form navigation and interaction is more free form; there's really no explicit assumption that can be made about the order users may fill in questions [short of overloading the form definition with relevant dependencies...]. Even triggering it around 'entering' or 'exiting' a group is problematic: e.g. in iXForms groups are merely used to tell the form renderer to show these questions within a new tableview section, so depending on the screen size you can readily have multiple groups visible at once.

I'm not totally sure about this. This kind of artifact would be a supporting artifact for training or quality control and so I think it's more about getting some ability to cut down on what's recorded and there is likely some tolerance on exactly what is included. Naturally each client would need to explicitly document their behavior but I don't think this is a case where they all need to behave precisely the same way.

An alternative that I'd be open to is to say that we expect the start/end concepts will only be defined as applied to questions that each take up a whole screen (e.g. Collect not in field-lists or Enketo pages mode). Otherwise any specified start/end would be ignored.

That said, we still need to handle jumping around between questions and the event/action model doesn't seem very well suited to that.

Maybe that's worth going back to our feature advocates about; i.e. is there an expectation that initiating audio recording is something the form designer has a (high?) degree of control over [which they may want if it's for, say, auditing purposes], or something more left up to user/enumerator discretion...

I'm definitely not suggesting that the enumerator should have control over the recording beyond with how they navigate the form. The question is more whether it'd be ok for two different clients with the same form to have slightly different recording triggers based on what their display modality is. Each client would still precisely define what it does in its documentation. In other words, is it realistic/desirable to come up with an event type that triggers recording or can we say "clients will find a reasonable way to capture audio roughly in the range of questions from start to end and will make sure that their strategy is precisely defined?"

(Some of my initial input was meant to just be questions to better understand the proposal, not necessarily recommendations for any particular implementation.)
How does the proposal as it stands now affect the scenario mentioned on the TAB call, in which a user may tend to advance to the next question/screen before the interviewee finishes talking? If you swipe to advance or click to navigate to a new question that is outside of the recording range, would the client show a confirmation prompt notifying you that the recording will be stopped if you continue?

Yes, totally. I had completely forgotten about the audit feature. I have no objection to this more magical way of implementing this feature (but in the <orx:meta> block with a fixed nodeName). The multiple-file generation issue you mentioned is something to solve indeed.

Using bind attributes instead of meta node attributes also seems better to me, but we'd have to also do that for the existing audit feature, then... right?

Good discussion about start/end. It's pretty difficult to figure out. Nothing new to contribute there yet.

Y'all are really pushing my brain, thank you!

That's a good idea and might be the way to go if it's really critical that no part of the question is missed. My immediate reaction is that it would be pretty disruptive. As an alternative, documentation could suggest things like starting the recording one or more questions before what's really important and ending after. We can also show example forms with a screen that asks the data collector to acknowledge that the next N questions will be recorded and/or get consent (like our foreground recording demo form). For Collect, we can also suggest disabling settings for moving backwards and going to the jump view to force fully linear navigation through the form. Using documentation this way would allow for more flexibility depending on the context.

We could. The special thing about audit that I don't think would be the case for any other kind of automatic background population is that there can only ever be one. So it feels ok that it has its own special spec.

But if consistency is really important there, it should be ok for Collect to support both the fixed audit name in orx:meta and a specific attribute that applies to binary fields without a body element anywhere in the form (e.g. odk-audit=true or maybe even odk-background-populate=audit). We could do what we've done before and eventually shift docs and pyxform to the preferred spec. I think that approach can address the multiple-file requirement: we can allow any number of fields with the attribute. That would also be fine with audit, we'd just say that all would be populated with the same filename (same as with background audio without a range specified).

Here's another concept tying in various parts of the conversation above. It sounds like we generally like an action for the case of a single recording that captures everything. Perhaps we could do that and introduce the artificial restriction that it can only be triggered by xforms-ready (and thus can't nest in the body).

  <odk:recordaudio event="xforms-ready" ref="/data/my_recording" odk:quality="low"/>

I've used odk:recordaudio instead of odk:startrecording with a type like above because I now think it's easier to read and more coherent with implementation. I can't imagine that e.g. audio recording and video recording would be able to share a lot in any platform, even less so other types of data we've discussed like humidity readings.

For recording only in a question range, we could specify attributes on the action:

  <odk:recordaudio event="xforms-ready" ref="/data/my_recording" odk:quality="low" odk:start="/data/q1" odk:end="/data/q13" />

This would mean "start the recording engine when the form loads and actually record audio when the enumerator is interacting with a question between /data/q1 and /data/q13 inclusively." Alternately, as came up in conversation with @Xiphware above, it could mean "start the recording engine when the form loads if and only if /data/q1 and /data/q13 are questions displayed on their own screens (otherwise ignore the whole thing). Actually record when a question between /data/q1 and /data/q13 inclusively is on screen."
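For illustration only, here is a Python sketch of how a client might read such an action into a configuration object, covering both the whole-form case and the range case. The element and attribute names follow the examples above. The dataclass, the function name and the namespace URI are assumptions, and rejecting any event other than xforms-ready mirrors the artificial restriction floated earlier rather than a settled rule.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Assumption: the namespace URI that the odk prefix is bound to in the form.
ODK_NS = "http://www.opendatakit.org/xforms"


@dataclass
class RecordAudioConfig:
    ref: str
    quality: Optional[str]
    start: Optional[str]  # None means "record the whole form"
    end: Optional[str]


def parse_record_audio(action_xml: str) -> RecordAudioConfig:
    action = ET.fromstring(action_xml)
    if action.tag != f"{{{ODK_NS}}}recordaudio":
        raise ValueError("not an odk:recordaudio action")
    if action.get("event") != "xforms-ready":
        # The restriction discussed above: only xforms-ready may trigger it.
        raise ValueError("odk:recordaudio may only be triggered by xforms-ready")
    start = action.get(f"{{{ODK_NS}}}start")
    end = action.get(f"{{{ODK_NS}}}end")
    if (start is None) != (end is None):
        raise ValueError("odk:start and odk:end must be given together")
    return RecordAudioConfig(
        ref=action.get("ref"),
        quality=action.get(f"{{{ODK_NS}}}quality"),
        start=start,
        end=end,
    )


ranged = (
    f'<odk:recordaudio xmlns:odk="{ODK_NS}" event="xforms-ready" '
    'ref="/data/my_recording" odk:quality="low" '
    'odk:start="/data/q1" odk:end="/data/q13"/>'
)
print(parse_record_audio(ranged))
```

Keeping start and end on the same element as the ref means a client can compute the recorded range without walking the body for nested actions.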

@LN I like the idea of moving the onus on documentation and advice on how best to use the feature. In my experience, the need to record audio for a specific section always assumes that users move in a linear way, or at least jump to a section and move on linearly from there.

I'm in favor of just documenting whatever the final behavior is for each client for someone jumping into the middle of a recording section. We already manage this with many subtle differences between Collect and Enketo so this shouldn't be an issue from my perspective.


Hello spec enthusiasts! Team Collect has completed most of an implementation for this functionality and is now blocked on the specification. I think this is a largely unprecedented scenario because we typically wait to have an approved spec before implementing. In this case, our commitments and constraints are such that it made sense to prioritize the implementation. Our ideal would be to release it in 2 weeks which would require a finalized spec by this Friday at the latest. I don't want to rush the process and if there are strong feelings that we need to continue consideration, we can delay. On the other hand, if folks don't feel very strongly, perhaps we can try to get to a quick conclusion. I think this is fairly self-contained and a less than optimal spec here seems unlikely to have much impact.

Are there any other clients that are very eager to implement background audio recording (@martijnr, @Xiphware)? If so, do you think recording a limited range of questions would be relevant in that context?

On the XLSForm side, we are considering a parameter-based option and a nesting option.

On the XForms side, we have three concepts: one purely attribute-based, one purely action/event-based and a hybrid.

If I had to make a decision immediately, I would go with the initial attribute-based proposal for both XForms and XLSForm. There's a lot I like about the XForms hybrid concept and would be happy if we went with it but I am somewhat uncomfortable with having to restrict events to only xforms-ready. I continue to believe that introducing an action that can be triggered by any event adds a lot of complexity and I'm not seeing benefits that would make it worthwhile.

not from me at this time... (far bigger features I need to catch up on first! ;) )

@Xiphware and I had some back and forth today and he's overall preferring the hybrid approach. He has highlighted that he is thinking about event-triggered recording because he is interested in user-directed data capture with something like events from a client pause/record button triggering data capture.

One additional possible wrinkle I want to mention if we do go in an action/event direction is that xforms-ready is still not well-defined. We had a long conversation about it when we added odk-instance-first-load. Upon reflection, I think that the question about whether xforms-ready is fired or not on reentry into a saved form might be irrelevant in W3C XForms because the assumption is that forms are filled out while online and that there's no "save as draft" concept. Either way, we'd need to decide whether we re-introduce xforms-ready with a different meaning than what it used to have or introduce a new event.

Sounds good to me to use the hybrid approach.

Wrt the event to use, you're thinking the event should fire for both an empty form and when loading a draft record, right?

What would we want to happen if a user loads a draft record in both whole-form and section scenarios? Append new audio to an existing file?

Yes, exactly.

Yes in both whole-form and section scenarios.