External data - current state of affairs

tomsmyth · August 14, 2018, 1:28pm

Hello friends. Can anyone point me in the direction of the latest on external data? I remember there had been talk of somehow unifying the disparate worlds of pulldata, external itemsets, etc.

I am running into this limitation: https://github.com/opendatakit/collect/issues/179

and wondering what the most sensible route to overcoming that might be.

LN · August 15, 2018, 12:37am

Which are you trying to use? If you're using search()/pulldata(), what @andrew (I think!) suggested in the follow-on post is still your best bet.

LN · August 29, 2018, 11:59pm

Hopefully that quick info was enough to help you address your immediate problem, @tomsmyth. I do also want to give a broader update about large datasets in ODK since I don't think that information exists in one place currently. This question ties in to conversations happening about more sophisticated data workflows. In particular, the TSC has been exploring options for adding some kind of case management and that will likely rely on a case list managed as an external document. @adam.butler has been working on a spec at https://github.com/opendatakit/roadmap/issues/23 and I know @Tino_Kreutzer and the folks at Kobo are also interested in this broad topic. @martijnr has also been involved in related conversations and generally really helped me understand the state of things.

I first want to point out that large datasets directly in the form have gotten a lot more performant over the last year thanks to work by @dcbriccetti. So I do recommend trying an internal dataset first as this is simpler to deal with and non-controversial. Those datasets can be updated without having to change the form_id and Collect can can automatically pull those updates (see general settings > Form management > form updates). Depending on the device and specific usage, this can be performant enough for 1000s of data elements.

There are currently 3 incompatible approaches to external data supported by ODK Collect.

External secondary instances

As documented in the ODK XForms specification. This approach is entirely consistent with how datasets in the form are handled. That means everything that works with internal choices also works with external choices including choice filters for cascading selects and any kind of complex XPath querying. This is the only option for external data that is part of the ODK XForms spec. It is supported in ODK Collect and in Enketo. It does have two major downsides: clumsiness of the XML format and performance.

XML is not reasonable to create by hand. For this to be a viable way to represent external data, there needs to be some way to go from tabular data to an XML doc. This could be done server-side, for example. XLSForm could also be able to generate XML external instances just like it builds CSVs for external itemsets. Related to all this is @martijnr's proposal to formally introduce a way to attach a CSV file that would be queryable in the same way as the XML (by, for example, generating the XML on the device). The proposed spec is described in xforms-spec#88.

This works well for mid-size datasets but because the whole XML doc is currently represented and queried in memory in the Collect implementation, it can become quite slow for large datasets or slow devices. @dcbriccetti made some great improvements to performance about a year ago and there are probably more to make without big changes to the code structure. There are likely larger changes possible such as not holding the whole doc in memory or maybe even using a database.

`search()` and `pulldata()`

As documented in the XLSForm spec and with more details here. These predate external secondary instances and have no grounding in the ODK XForms spec. As I understand it, the two major design criteria were fast loading of large datasets and minimal changes needed to existing tools, especially pyxform. I think that's what led to the unexpected choice to make search() an appearance. Implementation-wise, this approach takes a CSV, loads it into a database table and then makes queries against it. search() entirely bypasses the form structure and injects the values from a database into a list of options for a select. pulldata() behaves roughly like a normal XPath function and makes it possible to pull a single value out of the database and use it in some expression.

These are performant but have limitations like the one @tomsmyth pointed to because the data is not part of the form. SurveyCTO (company that builds tools on top of ODK) contributed this and I know they have implemented some form of case management on top of it.

Fast external itemsets

As documented in the XLSForm spec. This implementation also predates external secondary instances and I think was added roughly at the same time as search()/pulldata() to solve the same problem of making large choice sets performant. It also has no grounding in the ODK XForms spec and works in a similar way -- Collect puts the contents of a CSV named itemsets.csv into a database and then queries it. It's more limited than search()/pulldata() and narrowly solves the problem of speeding up cascading selects. It only works with select ones (not even select multiples). I don't think it would be appropriate for representing general datasets like case lists.

tomsmyth · September 5, 2018, 5:50pm

Terrific write-up, thank you.

cooperka · December 10, 2018, 3:44pm

Update on the state of affairs:

We've encountered a couple JavaRosa bugs filed here preventing us from using external secondary instances.

Based on the docs I've read it sounds like external selects behave in effectively the same way as internal selects, thus would be expected to have similar performance, but according to benchmarks done by @dcbriccetti here it looks like external was much faster (at least prior to his performance improvements about a year ago).

I recently benchmarked internal selects to be significantly slower than his original measurements (at least up to several thousand items), though the time currently increases linearly rather than polynomially. A middling phone took 25 seconds to load only 1500 items. He didn't specify what device he used originally.

His original data (September 2017):

My current data (December 2018) based on Galaxy Nexus, Android 6.0, 1 GB RAM:

We're considering whether our client is going to need this to be improved; if so we'll pursue further. The next step would probably be to profile and figure out if there are bottlenecks or how else we can improve performance.

dcbriccetti · December 13, 2018, 6:44pm

Hi Kevin. Thanks for your detailed investigative work.

@ggalmazor, you were asking how I made that chart. I ran timesParsingLargeInternalSecondaryInstanceFiles and timesParsingLargeExternalSecondaryInstanceFiles in XFormParserTest. They produce output like

Children Seconds
1,000 0.126
1,258 0.01

I graphed those values with Apple’s Numbers.

yanokwa · March 26, 2019, 1:19am

In order to effectively plan for the deprecation of some of the external data methods, we collected anonymized data from Collect over the last eight months. Here are the findings...

7,800 unique forms with external data

The method break down is as follows:

4,100 unique forms IDs using pulldata()
2,000 unique form IDs forms using external itemsets
3,600 unique forms IDs using search()

Across the data set, there were 7,800 unique form IDs. This is less than the sum of all form IDs because some forms likely used both pulldata() and search().

pulldata dominates the per question events

The three methods were called 82.6 million times. The percentages:

pulldata(): 90%
external itemsets: 5%
search(): 5%

pulldata() tends to be used multiple times in a form which is why it dominates these events.

pulldata has the majority of per form events

There were 5.4 million unique method/form ID pairings. The percentages:

pulldata(): 61%
external itemsets: 23%
search(): 15%

I interpret this data to mean that most forms use pulldata, but external itemsets and search are pretty popular.

Conclusions

We did not track how many unique form IDs are used in Collect to compare the 7,800 figure against, but it doesn't feel like a small number to me. If we want to deprecate these methods, we need to give people lots of time to make the transition.

External data - current state of affairs

External secondary instances

search() and pulldata()

Fast external itemsets

`search()` and `pulldata()`