ODK XForms spec proposal: add pulldata as custom XPath function

LN · April 1, 2019, 5:05pm

pulldata() was introduced to Collect on top of functionality that loads a CSV into a database. pulldata() queries that database to pull out a single value. There's more about it in the thread on the state of external data.

The original spec for pulldata() and the underlying database feature states that any CSV included with a form would be parsed and available for querying based on the original csv's filename. This is not in the spirit of the ODK XForms spec because the form has no explicit knowledge of the external data file. Some years back, @martijnr and @Ukang_a_Dickson made changes to pyxform such that if the XLSForm uses pulldata, the corresponding XForm explicitly declares the existence of the CSV file queried by pulldata. This opens the door for pulldata to simply be a convenience custom XPath function that hides some of the complexity of XPath queries (much like indexed-repeat).

Given that pulldata() is already in broad usage and can be implemented consistently across spec-compliant clients, I'd like to propose that it become an official part of the specification. That would help make sure it's thoroughly documented which it isn't right now.

If the form describes an external CSV instance:
<instance id=<instance name> src="jr://file-csv/<filename>" />

The CSV can be queried thus:
pulldata(<instance name>, <return column>, <query column>, <query value>)

For example:
<instance id="foo" src="jr://file-csv/my-cool-data.csv" />
pulldata("foo", "Age", "Name", "Bruce Wayne")

This implies that the my-cool-data.csv file has "Age" and "Name" columns. The function call would return the "Age" value of the row with "Name" "Bruce Wayne."

The example above includes one deviation from current usage in Collect: currently, the instance name must match the filename. That is, for my-cool-data.csv, the instance ID needs to be my-cool-data (or to be more precise, the first parameter actually represents the CSV filename). That restriction doesn't make sense in the ODK XForms world so I suggest we do without it. In practice, pyxform would always make them match.

martijnr · April 1, 2019, 5:47pm

Thanks for this proposal!

This opens the door for pulldata to simply be a convenience custom XPath function that hides some of the complexity of XPath queries (much like indexed-repeat).

This is great. In order to be a real XPath function, I think the function should be able to operate on XML data (as well) which would be an extension of its current use in ODK Collect. I am thinking that after adoption of the proposal to add CSV data, we could explain pulldata('a', 'b', 'c', 'd') as a shortcut for:

instance('a')/root/item[c=d]/b (fast but restrictive wrt XML data structure)

or

instance('a')//*[c=d]/b (slower, but more flexible wrt XML data structure)

Note, these changes don't affect the CSV data support @LN outlined above. (Also note that the first option is how it is has been implemented in Enketo, but the second option would be compatible with the first.)

I agree.

Xiphware · April 1, 2019, 8:12pm

Ha, I was just thinking last night about pretty much this! Basically, making pulldate() a generic XPath function that can be used to basically 'lookup' data (from a table) from an arbitrary source, be it an external CVS or XML data structure (eg nodeset). It certianly 'feels' better as a XPath function to me, and hopefully we can define it in such a way that it can still be used in its existing context.

I was also thinking about if/how this could be extended somehow to choice-filter() [?] to perhaps somehow have a (single?)general purpose XPath function to perform the function of a 'table lookup', from a specified data source. The later returning perhaps returning a nodeset to use as an select/select1 item list.

Unfortuantely I feel asleep before getting much further...

LN · April 2, 2019, 12:28am

Ah, yes, of course!

I had not considered this. I haven't had a chance to think through it deeply yet so maybe this is not a legitimate concern but I wonder what the performance implications would be for the common /root/item case. Is it just when there's a gnarly hierarchy that it gets slow or would even the simple case get a performance hit (possibly implementation-specific to some extent)?

Here's a quick explanation of what that means for those who are interested but may not be deeply familiar with XPath. To make CSVs XPath queriable, a standard transformation is applied as described here. Basically, each row is put in an item block with child elements for each column. But if someone is authoring XML themselves, they could have a much more complex structure. What @martijnr is suggesting is allowing pulldata to find an XML element with name <query column> and child <query value> anywhere in the XML document and return a sibling of <query column> with name <return column>. (Hopefully I got that right!)

LN · April 9, 2019, 12:50am

A post was split to a new topic: ODK XForms spec proposal: introduce a virtual instance for current state

adam.butler · April 10, 2019, 2:48pm

Thinking about this in relation to the ideas that came up in the last TSC meeting re. using static analysis to decide on an optimal approach to data storage (i.e. use a database if the XPath is simple enough to be easily convertible to SQL)... I think that the generic pulldata use case should automatically use SQLite, as is the case in the current implementation. In order for this to be the case, I think that we would not be able to use arbitrary levels of nesting.

So I think (and please correct me if I'm wrong) that this can be trivially converted to SQL:

instance('a')/root/item[c=d]/b

while this can't:

instance('a')//*[c=d]/b

Assuming that I've got my XPath complexity analysis right, this means that we would need to use the former, simpler version.

LN · April 10, 2019, 4:35pm

The explicit expression is indeed trivial to convert to SQL. I believe you are correct that the more general one can't readily be converted since there may or may not be joins involved.

That said, conversion or not to SQL would not be part of the specification, it would be a client decision. It's hard not to blur the lines between spec and implementation!

Now that I've let this sit in my brain a little while, I think there's a spec-level case to make for the simpler expression. I think saying pulldata is a shortcut for instance('a')/root/item[c=d]/b is much easier to explain and doesn't reduce flexibility for users who need it. That is, if someone is generating XML documents with a custom schema that benefits from a generic query like instance('a')//*[c=d]/b, then why do they need a convenience function?

In practice, my guess is that pulldata will mostly continue to be used to query CSVs because again, folks who are generating XML documents probably have some understanding of XPath. An XPath expression is very explicit whereas the 4 parameters of pulldata are a bit mysterious and I always have to look up what order they go in.

@martijnr, do you want to make a case for the more flexible expression?

martijnr · April 10, 2019, 5:03pm

No, I agree. Thanks. My vote is also for the less flexible expression (pure XPath performance + Collect implementation advantages, and I think you're right that advanced XML enthusiasts likely wouldn't need this function).

LN · April 23, 2019, 7:34pm

Thanks, all! Here is the spec documentation: https://opendatakit.github.io/xforms-spec/#fn:pulldata