Submit forms to Aggregate using 'compact-tag' syntax

The compact-tag syntax offers large data savings over standard XML form submission.

For example, using the example from the ODK docs:

SMS Submission in compact-tag format (26 bytes)

fn Chrissy ln Roberts a 38

XML Submission of same data (434 bytes)

<first_name>Chrissy</first_name><last_name>Roberts</last_name>38<n0:meta xmlns:n0="http://openrosa.org/xforms"><n0:instanceID>uuid:587d3c39-41d4-4143-9159-b1ad2e76fbbc</n0:instanceID></n0:meta>
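To make the comparison concrete, here is a minimal Python sketch that splits the compact message above into tag/value pairs and counts its bytes. It assumes tags and values never contain spaces, as in this example; `parse_compact` is a hypothetical helper name, not part of ODK or Aggregate.

```python
def parse_compact(message: str) -> dict[str, str]:
    """Split a whitespace-separated 'tag value tag value ...' string into a dict.

    Assumption: neither tags nor values contain spaces.
    """
    tokens = message.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (tag/value pairs)")
    # Even positions are tags, odd positions are the matching values.
    return dict(zip(tokens[0::2], tokens[1::2]))


message = "fn Chrissy ln Roberts a 38"
record = parse_compact(message)
size_in_bytes = len(message.encode("utf-8"))

print(record)
print(size_in_bytes, "bytes")
```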

The absolute rate-limiting step for many projects is the amount of data being sent over very limited bandwidth. Most of the XML data submitted is actually just the form structure itself, so is there a way to excise this?

For work in low-connectivity settings, it would be great to be able to send the data to Aggregate via the web but in a compact format such as the one used for SMS.

Agreed that bandwidth is a limiting factor, but perhaps there are other approaches that would work. For example, Aggregate (on Tomcat only, I believe) supports gzipped submissions, which reduce the submission size without any additional software development.

Can we ground this in a real scenario? How much bandwidth do you typically have in the field and how much data are you generating?

I agree - there are probably less intrusive means to accomplish this. For example, just using JSON instead of XML - simply as the transport data format - is well known, simple, reversible, etc. In the above example (and if we also choose to eliminate the uuid attribute) the JSON can look like:

{"first_name":"Chrissy","last_name":"Roberts","#y":"38"}

56 chars compares quite favorably to 26, and involves no loss of tag info.
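As a rough illustration of the "reversible" point, the sketch below converts a flat instance to minified JSON and back using only the Python standard library. The `<data>` root and the `<age>` tag name are assumptions made for the example, and the helper names are hypothetical; nested groups and repeats are not handled.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical flat instance; the <data> root and <age> tag are assumptions.
xml_payload = (
    "<data><first_name>Chrissy</first_name>"
    "<last_name>Roberts</last_name><age>38</age></data>"
)


def xml_to_json(xml_text: str) -> str:
    """Turn the direct children of the root element into a minified JSON object."""
    root = ET.fromstring(xml_text)
    record = {child.tag: (child.text or "") for child in root}
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    return json.dumps(record, separators=(",", ":"))


def json_to_xml(json_text: str, root_tag: str = "data") -> str:
    """Rebuild a flat XML instance from the JSON object."""
    root = ET.Element(root_tag)
    for tag, value in json.loads(json_text).items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


compact_json = xml_to_json(xml_payload)
restored_xml = json_to_xml(compact_json)

print(len(xml_payload), "chars as XML")
print(len(compact_json), "chars as JSON")
```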

Thanks Yaw and Gareth for helpful discussions,

So the real world scenario we are facing now is that data are being sent via satellite phone from places where there is effectively zero terrestrial coverage of either phone lines or GPRS/1-4G.

Depending on the model, data plan and level of interference to the signal, the speed of satellite data is between say 12 kb/min and 800 kb/min (more often than not at the lower end). Cost for this is about $1-$5 a minute regardless of the amount of data actually sent.

Total amount of data across project lifespan is upwards of 100MB/month, but this is being sent on a sat-phone that is shared by many users and so time on the modem is at a premium in many senses.

I admit that I didn't know about gzip on a Tomcat server (and I don't know how I would implement this), but I assume that you wouldn't get a massive level of compression on these text files and you'd still be looking at 100s rather than 10s of bytes for this data. The JSON format seems sensible, though as a non-coder I have no idea how feasible any of this really is.

The ideal situation in my mind would be that (a) data sent to server is created in a more compact format such as the SMS or JSON formats AND (b) it is also compressed for sending to server.

As examples go, I freely admit that using the context of a health emergency in the most remote parts of the world is the extreme, but it is cases like this that highlight the need the most. Outside of emergency response work, our experience is that many of the projects we have facilitated for 'normal' research projects have had the same issues. The number one issue we have is data sitting on devices when it should be on servers. Most data from low and middle income country settings is being sent over prepay sim cards rather than wifi broadband, so imho every byte costs time and money.

Best
Chrissy h

Another thing you might try in the interim, which requires no code changes: if you know your forms are going to be deployed in this sort of situation, construct them deliberately using extremely abbreviated (single character?) tags, don't include any metadata, etc. That alone would reduce the above XML to:

<data><a>Chrissy</a><b>Roberts</b>38</data>

(which is arguably no more obscure than "fn", "ln"...) So long as your backend knows what these tags refer to, the end user never actually sees them in Collect.

This is definitely a good solution to the immediate problem and gets the bytes down so thanks very much for the pointer.

It would still, however, be preferable to be able to bind an abbreviated form to a more descriptive set of keys in the main form on Aggregate. I'm no fan of data sets that can only be interpreted once they've been converted and reconstituted using a data dictionary.

For example:

FORM STRUCTURE

KEY1              KEY2  DATA
DESCRIPTIVE.NAME  AA    A PIECE OF DATA
person.ID         AB    RSS-10001
gender            AC    MALE

DATA THAT GETS SENT OVER WEB

<data><AA>A PIECE OF DATA</AA><AB>RSS-10001</AB><AC>MALE</AC></data>

DATA THAT YOU GET IN THE CSV FILE

DESCRIPTIVE.NAME  person.ID  gender
A PIECE OF DATA   RSS-10001  MALE
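Until something like this exists server-side, the same binding can be approximated after export. The following is a minimal Python sketch, assuming the short-to-descriptive mapping is kept alongside the form; `KEY_MAP` and `expand_submission` are hypothetical names, and this is not something Aggregate does for you.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Short key -> descriptive key, taken from the FORM STRUCTURE example above.
KEY_MAP = {"AA": "DESCRIPTIVE.NAME", "AB": "person.ID", "AC": "gender"}


def expand_submission(xml_text: str, key_map: dict[str, str]) -> dict[str, str]:
    """Replace abbreviated tags with their descriptive names.

    Tags missing from key_map are kept unchanged.
    """
    root = ET.fromstring(xml_text)
    return {key_map.get(child.tag, child.tag): (child.text or "") for child in root}


submission = "<data><AA>A PIECE OF DATA</AA><AB>RSS-10001</AB><AC>MALE</AC></data>"
row = expand_submission(submission, KEY_MAP)

# Write the expanded row as CSV (to an in-memory buffer for the example).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(KEY_MAP.values()))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```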

You'll get maybe 7:1 compression with gzip. Assuming you are hosting with Tomcat directly and don't have an nginx proxy, enabling it requires adding compression="on" to the Connector element in your server.xml file. https://examples.javacodegeeks.com/enterprise-java/tomcat/enable-gzip-compression-apache-tomcat and https://community.jaspersoft.com/wiki/how-compress-http-responses-tomcat-level are good additional resources.

Again, not perfect, but every little bit helps.
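If you want a feel for what gzip does to this kind of text before touching the server, here is a small Python sketch using only the standard library. The records are made up (the `<c>` tag is hypothetical), and the batch is just a stand-in for a larger, repetitive payload; the actual ratio depends entirely on your data. Note that gzip adds a fixed header and trailer of about 18 bytes, so a single tiny record may not shrink at all.

```python
import gzip


def make_record(i: int) -> str:
    """Build one small, made-up instance using abbreviated tags."""
    return f"<data><a>Name{i}</a><b>Surname{i}</b><c>{20 + i % 50}</c></data>"


single = make_record(0).encode("utf-8")
# 200 similar records concatenated, standing in for a larger payload.
batch = "".join(make_record(i) for i in range(200)).encode("utf-8")

sizes = {}
for name, payload in (("single", single), ("batch", batch)):
    packed = gzip.compress(payload)
    # Compression is lossless: decompressing gives back the original bytes.
    assert gzip.decompress(packed) == payload
    sizes[name] = (len(payload), len(packed))
    print(f"{name}: {len(payload)} -> {len(packed)} bytes")
```

The repeated tag structure is what compresses well, so the gain shows up on the batch rather than on the single record.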

If your situation is one where [obscure reference to Monty Python follows...] every byte is sacred, then I might suggest that you seriously consider looking into writing - or at least editing - your actual resulting XML form(s). Tools like XLSForm, KoboToolbox, etc. make it easy for novices to write forms, but they make no particular attempt at optimizing the size of the resulting XML instance. E.g., they create hierarchical groups in the XML instance that reflect any groupings you made of controls, but these are not strictly necessary (except for repeat groups) and only add to the XML payload.

Basically, you may be able to get some additional savings with handwritten XML forms. It's a bit of a learning curve, but if you are literally paying for every byte it may be worth the investment.

Well, you actually already have pretty much all the information you need to 'reconstruct' a full(er) description of (most of) the keys - namely the XForm definition itself! Each of the now-abbreviated instance XML elements has a unique nodepath - e.g. /data/aa - which can conceivably be used to find the matching control/question in the XForm, from which you can get the control's full label text. This correspondence can be made either directly via the control's ref property, or indirectly via the nodeset property of the control's associated binding.

OK, I'm getting pretty down-and-dirty into the XML here, and it may take a bit of fancy programming to process the original XForm definition against your submission to pull out the right label for each, but I guess my point is that the data you need may already be present. [note: if you don't want to use the question's full label text, you could conceivably put your more descriptive property tags as a hint for each control, and pull out the hint text instead].
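A rough Python sketch of that lookup, under some loud assumptions: the XForm below is a made-up minimal form, labels are inline text (real forms often use jr:itext() references, which this does not resolve), and only the direct ref route is handled, not the indirect route via a binding. `labels_by_path` and `describe` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up minimal XForm with abbreviated instance tags and inline labels.
XFORM = """<h:html xmlns="http://www.w3.org/2002/xforms"
        xmlns:h="http://www.w3.org/1999/xhtml">
  <h:head>
    <model>
      <instance><data id="demo"><aa/><ab/></data></instance>
      <bind nodeset="/data/aa" type="string"/>
      <bind nodeset="/data/ab" type="string"/>
    </model>
  </h:head>
  <h:body>
    <input ref="/data/aa"><label>First name</label></input>
    <input ref="/data/ab"><label>Last name</label></input>
  </h:body>
</h:html>"""

XF = "{http://www.w3.org/2002/xforms}"  # XForms namespace in ElementTree notation


def labels_by_path(xform_text: str) -> dict[str, str]:
    """Map each control's ref (a nodepath) to its inline label text."""
    labels = {}
    for control in ET.fromstring(xform_text).iter():
        ref = control.get("ref")
        label = control.find(f"{XF}label")
        if ref and label is not None and label.text:
            labels[ref] = label.text.strip()
    return labels


def describe(submission_text: str, labels: dict[str, str]) -> dict[str, str]:
    """Replace each top-level instance tag with the label found for its nodepath."""
    root = ET.fromstring(submission_text)
    return {
        labels.get(f"/{root.tag}/{child.tag}", child.tag): (child.text or "")
        for child in root
    }


labels = labels_by_path(XFORM)
described = describe("<data><aa>Chrissy</aa><ab>Roberts</ab></data>", labels)
print(described)
```

To use hint text instead of labels, the same walk could look for a hint child in place of label.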

To follow up on this topic, there is some recent spec work being done on defining an abbreviated format for submitting ODK XForms, specifically for media such as SMS; see Compact Record Representation.
