ODK Collect sends more media than expected

We are implementing an ODK form that collects a series of audio recordings and images in very remote locations. The resources for this report can be found in this Google Drive folder: https://drive.google.com/drive/folders/1tQbAu3kMp8yFbGOrOKT5sVXb5HwwD_GW?usp=sharing

ODK Collect V2023.2.4 sent a submission that we named f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml. The XML file has the following files as part of its data:
1718391240535.m4a
1718391246261.m4a
1718391254240.m4a
1718391261247.m4a
1718391267697.m4a
1718391273488.m4a
1718085675352.jpg
1718085666941.jpg
1718085662005.jpg
1718084667038.jpg

ODK sent this data over 4 POST requests with isIncomplete as part of the POST keys. Here are the files that ODK Collect sent in each POST:
POST #1:
f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml
1718084667038.jpg
1718084701415.m4a
1718084705211.m4a
1718084710805.m4a
1718084715704.m4a
1718084718828.m4a
1718084722522.m4a

POST #2:
f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml
1718085666941.jpg

POST #3:
f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml
1718085675352.jpg
1718391246261.m4a

POST #4:
f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml
1718085662005.jpg
1718391240535.m4a
1718391261247.m4a
1718391254240.m4a
1718391267697.m4a
1718391273488.m4a

ODK is sending 16 media files, 6 more than expected. The 6 extra files are entirely different files, not duplicates of the 10 expected ones.
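For anyone who wants to reproduce the count, here is a minimal sketch of the comparison using the file names listed above. The variable names (`expected`, `received_by_post`) are only illustrative and are not part of our server code.

```python
# Media files referenced by the submission XML (copied from the list above).
expected = {
    "1718391240535.m4a", "1718391246261.m4a", "1718391254240.m4a",
    "1718391261247.m4a", "1718391267697.m4a", "1718391273488.m4a",
    "1718085675352.jpg", "1718085666941.jpg", "1718085662005.jpg",
    "1718084667038.jpg",
}

# Media files received in each POST (the repeated XML file is left out).
received_by_post = {
    1: ["1718084667038.jpg", "1718084701415.m4a", "1718084705211.m4a",
        "1718084710805.m4a", "1718084715704.m4a", "1718084718828.m4a",
        "1718084722522.m4a"],
    2: ["1718085666941.jpg"],
    3: ["1718085675352.jpg", "1718391246261.m4a"],
    4: ["1718085662005.jpg", "1718391240535.m4a", "1718391261247.m4a",
        "1718391254240.m4a", "1718391267697.m4a", "1718391273488.m4a"],
}

# Flatten the per-POST lists into one set and compare with the expected set.
received = {name for names in received_by_post.values() for name in names}
extra = sorted(received - expected)    # sent by Collect but not in the XML
missing = sorted(expected - received)  # in the XML but never sent

print(len(received), "received,", len(extra), "extra,", len(missing), "missing")
for name in extra:
    print("extra:", name)
```

All of the extra files are .m4a files from POST #1.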

We haven't found a way to reproduce the error, but it has happened repeatedly and seemingly at random during our data collection exercise. We are collecting about 10,000 submissions, and around 230 of them show this problem.

We use Python Pyramid to process ODK requests. This is the code that logs the files in each POST request. It is the first thing that we do:

for key in request.POST.keys():
    if key != "isIncomplete":
        filename = request.POST[key].filename
        print(filename)

Each POST comes with f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml. We also verified that every copy of f7e6d09c-caa2-47a8-94c3-9cbe843bbd9b.xml is identical (same MD5 checksum). So, we are receiving the same submission in 4 parts.
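A minimal sketch of that kind of checksum comparison is shown below. The helper names (`md5_hex`, `all_identical`) are hypothetical, and the sample bytes are placeholders rather than our real submission; in a Pyramid view the bytes would presumably come from something like `request.POST[key].file.read()`, which is an assumption about how the upload is exposed.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def all_identical(parts) -> bool:
    """True when every byte string in `parts` has the same MD5 digest."""
    return len({md5_hex(part) for part in parts}) <= 1


# Placeholder content standing in for the XML received in each of the 4 POSTs.
xml_copies = [b"<data id='example'/>"] * 4
print(all_identical(xml_copies))
```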

Has anyone encountered this problem before?

Collect sends all media saved to the submission folder even if it's not referenced in the data submission. We don't expect there to be extra files sent often but in some unexpected contexts it can prevent data loss.

If I recall correctly, Collect keeps audio files that are interrupted by a crash. The first thing I would do is ask folks who captured that data whether they experienced any kind of crash or unexpected behavior.

That said, we don't expect such a high crash rate, so I'm not very confident in this theory. The second thing I would do is sort the list of audio files for a submission that had extra ones and listen to one that sorts right before an expected one. The filenames are Unix timestamps, so they also give you a sense of the sequence of events. Maybe Collect doesn't clean up deleted audio files or something like that. The contents of the files might give you a clue.
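The sorting step could be sketched roughly like this, using the audio file names from the original post. It assumes the numeric part of each name is a Unix timestamp in milliseconds (inferred from the 13-digit length, not confirmed), and `capture_time` is a hypothetical helper.

```python
from datetime import datetime, timezone


def capture_time(filename: str) -> datetime:
    """Parse the numeric stem as a Unix timestamp in milliseconds (assumed)."""
    stem = filename.rsplit(".", 1)[0]
    return datetime.fromtimestamp(int(stem) / 1000, tz=timezone.utc)


# Audio files referenced by the submission XML.
expected_audio = {
    "1718391240535.m4a", "1718391246261.m4a", "1718391254240.m4a",
    "1718391261247.m4a", "1718391267697.m4a", "1718391273488.m4a",
}

# Every audio file received across the four POSTs.
received_audio = [
    "1718084701415.m4a", "1718084705211.m4a", "1718084710805.m4a",
    "1718084715704.m4a", "1718084718828.m4a", "1718084722522.m4a",
    "1718391246261.m4a", "1718391240535.m4a", "1718391261247.m4a",
    "1718391254240.m4a", "1718391267697.m4a", "1718391273488.m4a",
]

# Order the files by capture time and flag the ones the XML does not reference.
timeline = sorted(received_audio, key=capture_time)
for name in timeline:
    flag = "expected" if name in expected_audio else "EXTRA"
    print(capture_time(name).isoformat(timespec="seconds"), name, flag)
```

Reading the output top to bottom shows where the extra recordings fall relative to the expected ones, which helps pick which extra file to listen to first.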

Receiving the submission XML in every POST is expected: https://docs.getodk.org/openrosa-form-submission/#rationale-for-sending-the-form-s-xml-submission
