Form spec proposal: add background audio recording

This is a specification to support recording background audio. As described in the feature post, we intend to release a first version of this feature in Collect v1.30 which would only include starting recording at the beginning of form entry and stopping when the form is exited. However, we are proposing a more comprehensive spec here so that the feature can be expanded on in the future.

Related: start-geopoint, audit, built-in audio recording

Add a new background-audio type. The text in the name column is chosen by the user and is the name of the field that has the audio recording. The only other column that can be used with this type is the parameters column which accepts a key named quality with values defined in the documentation. If the key is omitted, quality defaults to voice-only. Specifying the external quality results in a form conversion error.

type              name          parameters
background-audio  my_recording  quality=low

Add a new bind attribute odk:background-audio that applies to the binary type. If set to true for a binary question, that question is automatically populated by the client with a background audio recording. Client behavior is undefined if an audio binary question has the odk:background-audio attribute set to true and is bound to a body element (XLSForm would never generate this). The XLSForm above would result in the following XForms output:

<bind nodeset="/data/my_recording" type="binary" odk:background-audio="true" odk:quality="low"/>
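The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the actual XLSForm converter: `parse_parameters` and `background_audio_bind` are hypothetical names, and the choice to pass unrecognized quality values through (rather than reject them) is an assumption, since the proposal only says that `external` is a conversion error.

```python
# Hypothetical sketch: turn one XLSForm row of type background-audio into
# the attributes of an XForms <bind>. Not the real converter.


def parse_parameters(raw: str) -> dict:
    """Parse 'key=value' pairs separated by whitespace or semicolons."""
    params = {}
    for token in raw.replace(";", " ").split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"Malformed parameter: {token!r}")
        params[key] = value
    return params


def background_audio_bind(name: str, parameters: str = "", root: str = "data") -> dict:
    """Build the bind attributes for a background-audio row."""
    params = parse_parameters(parameters)

    # Per the proposal, quality is the only key accepted for this type.
    unknown = set(params) - {"quality"}
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")

    bind = {
        "nodeset": f"/{root}/{name}",
        "type": "binary",
        "odk:background-audio": "true",
    }

    quality = params.get("quality")
    if quality == "external":
        # Stated in the proposal: 'external' is a form conversion error.
        raise ValueError("quality=external is not valid for background-audio")
    if quality is not None:
        # Assumption: other values are passed through unchanged.
        bind["odk:quality"] = quality
    # If quality is omitted, no attribute is written here and the
    # voice-only default applies.
    return bind


print(background_audio_bind("my_recording", "quality=low"))
```

Leaving `odk:quality` off when the key is omitted relies on the client applying the voice-only default; emitting `odk:quality="voice-only"` explicitly would be an equally plausible choice.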

The odk:quality attribute is considered for binary fields with odk:background-audio set to true. Values voice-only (default), normal and low are accepted and everything else is ignored.

There can be multiple binary fields with the odk:background-audio attribute set to true but they will all be populated with the same filename.
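On the client side, the bind handling described above might look something like this. It is only a sketch under assumptions: the function names are hypothetical, the namespace URI used for the odk: prefix is assumed, and "ignored" is interpreted here as "fall back to the voice-only default".

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the odk: prefix.
ODK_NS = "http://www.opendatakit.org/xforms"
ACCEPTED_QUALITIES = {"voice-only", "normal", "low"}
DEFAULT_QUALITY = "voice-only"


def resolve_quality(value):
    """Return an accepted quality; anything else falls back to the default."""
    return value if value in ACCEPTED_QUALITIES else DEFAULT_QUALITY


def background_audio_fields(xform_xml: str) -> dict:
    """Map nodeset -> quality for binary binds flagged as background audio."""
    root = ET.fromstring(xform_xml.strip())
    fields = {}
    for elem in root.iter():
        # Compare the local name so the XForms namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "bind":
            continue
        if elem.get("type") != "binary":
            continue
        if elem.get(f"{{{ODK_NS}}}background-audio") != "true":
            continue
        fields[elem.get("nodeset")] = resolve_quality(elem.get(f"{{{ODK_NS}}}quality"))
    return fields


def populate(fields: dict, filename: str) -> dict:
    """Every flagged field receives the same recording filename."""
    return {nodeset: filename for nodeset in fields}


SAMPLE = """
<model xmlns="http://www.w3.org/2002/xforms"
       xmlns:odk="http://www.opendatakit.org/xforms">
  <bind nodeset="/data/my_recording" type="binary"
        odk:background-audio="true" odk:quality="low"/>
  <bind nodeset="/data/second_copy" type="binary"
        odk:background-audio="true" odk:quality="hifi"/>
  <bind nodeset="/data/photo" type="binary"/>
</model>
"""

fields = background_audio_fields(SAMPLE)
print(fields)
print(populate(fields, "recording.m4a"))  # filename is a made-up example
```

The sketch does not address what should happen when two flagged fields request different qualities; the proposal does not say.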

This design leaves the door open for additional XLSForm parameters/XForms bind attributes to configure recordings. Consider the following:

type              name          parameters
background-audio  introduction  start=${intro}; end=${q1}; probability=.1
background-audio  conclusion    start=${q17}; end=${thanks}

This might mean that for approximately one out of every 10 form filling sessions, the question with name intro, the question named q1, and all questions in between will be recorded. For all form filling sessions, the question named q17, the question named thanks and all questions in between will be recorded in a separate file.
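To make the probability idea concrete, here is a rough sketch of how a client could decide, once per form filling session, which ranges to record. The `RecordingRange` type and `active_ranges` function are hypothetical, and treating an omitted probability as 1.0 is an assumption drawn from the "for all form filling sessions" reading above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingRange:
    name: str            # field that receives the audio file
    start: str           # name of the first question in the range
    end: str             # name of the last question in the range
    probability: float = 1.0  # assumed default: always record


def active_ranges(ranges, rng):
    """Choose, once per session, which ranges will be recorded."""
    return [r for r in ranges if rng.random() < r.probability]


ranges = [
    RecordingRange("introduction", "intro", "q1", probability=0.1),
    RecordingRange("conclusion", "q17", "thanks"),
]

# Simulate many sessions to check the sampling behaves as described.
rng = random.Random(0)
sessions = 10_000
hits = sum(
    any(r.name == "introduction" for r in active_ranges(ranges, rng))
    for _ in range(sessions)
)
print(f"introduction recorded in {hits / sessions:.1%} of sessions")
```

Making the decision once per session, rather than each time the range is entered, avoids a recording that covers only some of the visits to the same questions.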


Would label::... columns not work with these questions? It can sometimes be useful to give different human-readable labels to data columns which can be used for visualization / exporting into different languages. I believe this is possible for other hidden questions like calculations. Not a big issue though.

Partial audio recordings
This is not part of the spec yet obviously. As mentioned elsewhere, from a user perspective I would find it more intuitive to specify the start / end of a background recording by adding a start / end point to the form, rather than what you proposed in the Extensions section. This could work like groups, e.g.

type                    name
background-audio-start  recording1

They would not because there would be no body element for the label to be nested in. They do work with calculations but then the calculation is displayed by the form filling client (and should always be marked as read-only unless the calculation is wrapped in once() but really default is the way to go now!).

This is also what @danbjoseph proposed here. I want to gently push back on this being simpler. It implies that you'd be thinking about audio recording as you author the form and probably not modifying the recording start/end. I've been imagining adding audio recording as a step that happens after a form is complete. That is, I'd like to test the form, make sure all my logic makes sense, get a feel for how it flows, and once it's final, define what portions I'd like to record. I also imagine I might adjust when recordings start and end as I try the form out and having those defined together would make that process easier.

One additional idea you might like that @seadowg and I discussed is exposing question and group attributes for cases where just a single question or group should be recorded.

If the general sense is that start/end is more intuitive, it should be possible.

Another related theme that came out of the TAB call is whether there would be a benefit to using the event/action mechanism we have instead of bind attributes. I told @Xiphware I'd write that up for consideration in the next couple of days. That would correspond more closely to this start/end XLSForm concept.

Below is an alternative XForms concept that uses events and actions. This would require introducing either a new action for recording audio (e.g. odk:startaudio) or, as I've shown below, a combination of a generic background recording action (e.g. odk:startrecording) and an attribute to indicate the type of recording (e.g. odk:type="audio").

  <bind nodeset="/data/my_recording" type="binary"/>
  <odk:startrecording event="xforms-ready" ref="/data/my_recording"  odk:type="audio" odk:quality="low"/>

Adding recording for a range would require introducing an event for a question being reached and a stoprecording action (the recording implicitly stops on form exit in the example above). (Side note, I don't know if it would really make sense to allow partial recording within a field-list or an Enketo form not in pages mode but it would be possible by doing something like using the value of the immediately preceding question changing as triggering this new event.)

  <bind nodeset="/data/my_recording" type="binary"/>
  <input ref="/data/q1">
    <odk:startrecording ref="/data/my_recording" event="odk-question-reached" odk:type="audio" odk:quality="low"/>
  </input>
  <input ref="/data/thanks">
    <odk:stoprecording ref="/data/my_recording" event="odk-question-reached"/>
  </input>

The purist side of me really likes this. It's extremely flexible and powerful and it's consistent with concepts we've already introduced. The pragmatist side of me is concerned. I think that in XLSForm we'd only expose triggering on the odk-question-reached event and we could do a fair amount of validation at that level. But to really follow the specs clients would need to be able to handle actions being triggered by all the events we support, mismatched start/stop actions, etc. That may be handled by our existing generic support for events and actions but I have a feeling that there will be issues. For example, xforms-value-changed events can be triggered in really quick succession and that might cause instability for media recording. If we're unlikely to expose that functionality in XLSForm anyway, I'd rather avoid it.

One of the big points @Xiphware brought up during the TAB call is that there might be other kinds of background recording and that we'd want to introduce an approach that can be extended to e.g. video, locations, humidity, etc. I don't think that there's a big advantage of one approach vs. the other for this.

  • In the original attributes approach I outlined, we'd have to introduce a new attribute name for each type of data to record. Alternately, the attribute name could be something like odk:auto-populate with values like audio, location, video (which I think I prefer).
  • In the event/action approach defined above we'd need to introduce either new actions or new types for each new type of data to record.

I think of the attributes approach as a flattened version of the actions/events one that implicitly always uses a "question reached" event to trigger recording start and stop. It's less powerful but I think that's an asset because we get more control over what can be expressed by a form. It's also generally simpler to only have to deal with the bind rather than having information about the recording in actions as well.


I want to make sure that @martijnr has seen this thread and has a chance to react.

It would also be great to hear from @Tino_Kreutzer and @danbjoseph about their current thoughts regarding the XLSForm syntax after my response here.

Thanks @LN for the heads up and for this proposal. Sorry, I had not seen it. I'll focus on the XForms side.

Functionally this seems closest to our existing preload items, and since we'd like to eventually deprecate those and replace them with setvalue actions (for consistency), I think your 2 setvalue proposals make the most sense.

I like and would very much prefer the simple <odk:startrecording> action as sibling of <bind> but as you mentioned this would depend on whether we can accept not having fine-grained start and stop control.

If this start/stop functionality really is required (really?), there is an issue with how to determine when a question is reached (the trigger question may never get focus or a value, as the user may skip it - so it would require lots of require/field-list logic to make it all work). Depending on how this could be implemented in ODK, I'm wondering if something like odk-page-shown would be precise enough and perhaps reflect better how it would be implemented anyway.

Is your primary dislike of the attribute-based option that it's not consistent? That was also one of @Xiphware's complaints. I'd argue this case is very similar to audit. In that case, we used a fixed node name to signal to clients that a certain file node should be populated by an audit. If we'd thought of it at the time I might have preferred using a bind attribute. Then the audit is configured through bind attributes which would be the same. The major difference between the two is that there's exactly one audit versus possibly multiple background recordings. Other than that they do feel fairly analogous in that we're asking the client to populate a certain field with a particular kind of data.

Yes, from what we've heard from users, it's quite important. Imagine you have a 3000-question survey and there's one section you suspect is not being handled consistently. It would be much better to ask for those few questions to be recorded than the whole survey.

Conceptually, users want to specify a range of questions and know that whenever the enumerator is operating within that range, audio is being recorded. I've been imagining that clients would pre-compute identifiers (e.g. XPath paths) for all questions between the specified start and end. They'd initiate recording when any of those nodes is "reached" (for whatever that means for the specific client and view) and stop recording when any node not in the set is "reached." This implementation concept is one of the things that has me looking away from actions/events -- we likely wouldn't actually use the events. Instead it would make the pre-computation work harder than if the information about start and end were available in the same place.
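The pre-computation idea above could be sketched like this. The names are hypothetical, the form is reduced to a flat, ordered list of question paths, and what "reached" means is left to the client, so `on_question_reached` is just a stand-in for whatever navigation hook a client has.

```python
def questions_in_range(ordered_paths, start, end):
    """Pre-compute the set of question paths from start to end, inclusive."""
    i, j = ordered_paths.index(start), ordered_paths.index(end)
    if i > j:
        raise ValueError("start must not come after end in form order")
    return frozenset(ordered_paths[i : j + 1])


class RangeRecorder:
    """Start recording when a question in the range is reached and stop
    when a question outside the range is reached."""

    def __init__(self, in_range):
        self.in_range = in_range
        self.recording = False
        self.log = []  # stands in for calls to a real audio recorder

    def on_question_reached(self, path):
        should_record = path in self.in_range
        if should_record and not self.recording:
            self.log.append(("start", path))
        elif not should_record and self.recording:
            self.log.append(("stop", path))
        self.recording = should_record


form_order = ["/data/intro", "/data/q1", "/data/q2", "/data/q3"]
recorder = RangeRecorder(questions_in_range(form_order, "/data/q1", "/data/q2"))

# Non-linear navigation: forward through the form, then a jump back to q1.
for path in ["/data/intro", "/data/q1", "/data/q2", "/data/q3", "/data/q1"]:
    recorder.on_question_reached(path)

print(recorder.log)
```

Because the decision depends only on whether the current question is in the set, jumping around the form needs no matching of start and stop actions. Whether a resumed recording appends to the same file or starts a new one is not covered by this sketch.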

That would be a better name if we're pretty confident clients with single-page views wouldn't want to use focus, presence on screen or value change in the same context. On the Collect side, since we allow non-linear paths through a form, we'd likely do the kind of implementation I described above. In other words, we wouldn't really use existing action and event implementations.

EDIT: maybe something like odk-question-reached-or-passed or odk-page-shown-or-passed would capture the concept I'm describing?

That's what still bothers me. Obviously, the desired behavior - at least for a form designer - is that they can (somehow) explicitly state the specific conditions when to auto-start and end audio recording. So this needs to be conveyed explicitly in the form definition and cannot otherwise be client specific.

In Collect there is a reliable expectation of overt user interaction involved around flipping between each question (although there is nothing in the XForm definition stating this...), but in more paged/web interfaces like Enketo (or iXForms for that matter) form navigation and interaction is more free form; there's really no explicit assumption that can be made about the order in which users may fill in questions [short of overloading the form definition with relevant dependencies...]. Even triggering it around 'entering' or 'exiting' a group is problematic: e.g. in iXForms groups are merely used to tell the form renderer to show these questions within a new tableview section, so depending on the screen size you can readily have multiple groups visible at once.

I'm not totally sure about this. This kind of artifact would be a supporting artifact for training or quality control and so I think it's more about getting some ability to cut down on what's recorded and there is likely some tolerance on exactly what is included. Naturally each client would need to explicitly document their behavior but I don't think this is a case where they all need to behave precisely the same way.

An alternative that I'd be open to is to say that we expect the start/end concepts will only be defined as applied to questions that each take up a whole screen (e.g. Collect not in field-lists or Enketo pages mode). Otherwise any specified start/end would be ignored.

That said, we still need to handle jumping around between questions and the event/action model doesn't seem very well suited to that.

Maybe that's worth going back to our feature advocates about; i.e. is there an expectation that initiating audio recording is something the form designer has a (high?) degree of control over [which they may want if it's for, say, auditing purposes], or something more left up to user/enumerator discretion...

I'm definitely not suggesting that the enumerator should have control over the recording beyond how they navigate the form. The question is more whether it'd be ok for two different clients with the same form to have slightly different recording triggers based on what their display modality is. Each client would still precisely define what it does in its documentation. In other words, is it realistic/desirable to come up with an event type that triggers recording or can we say "clients will find a reasonable way to capture audio roughly in the range of questions from start to end and will make sure that their strategy is precisely defined?"

(Some of my initial input was meant to just be questions to better understand the proposal, not necessarily recommendations for any particular implementation.)
How does the proposal as it stands now affect the scenario mentioned on the TAB call, in which a user may tend to advance to the next question/screen before the interviewee finishes talking? If you swipe to advance or click to navigate to a new question that is outside of the recording range, would the client show a confirmation prompt notifying you that the recording will be stopped if you continue?