Let say I have a HH survey with a repeat part. I can use $filter option to download only those submissions I need
https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions?$filter=__system/submissionDate le 2020-01-31T23:59:59.999Z
Now I need to download the corresponding part of the submission of the repeat table Submissions.main.hq. In this table have fields: age, name, __id and __Submissions-id
Using any of these fields with $filter doesn't work:
## Not working
https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions.main.hq?$filter=age le 15
## Not working either
https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions.main.hq?$filter=__Submissions-id in ({{comma_separated_list_of_ids}})
Hi @Odil! There's currently only limited support for filtering OData, particularly for filtering subtables.
In general, we don't plan to support filtering on form fields (like age) in the near future. See this forum post for some of the reasoning behind that. However, we support filtering the primary table on specific metadata fields.
We would like to add the ability to filter subtables on those metadata fields. We have a GitHub issue for that feature here, and there was related discussion in this forum topic. With that feature in place, you would be able to filter a subtable on __system/submissionDate. However, we aren't sure yet what the right approach or syntax within OData would be for that.
It is currently possible to fetch a subtable for a single Submission:
Thank you @Matthew_White for these comments. It's helpful. I was trying to use $filter to download only the latest data. If we can use __system/submissionDate also for subtables in the new future that would be really great.
For the time being, to download the latest data I will use OData document parameters $count and $top
This will bring the total number of the data in subtable,
...
"@odata.count": 6
With the next query I can download the submissions from the top which should be equal to the number: [total number of the data in the server] - [count of I downloaded already].
This is the only solution I could find to download only the latest data from OData tables. Hope this solution will help others too.
FWIW for my ETL pipeline I download all submissions every day with ruODK::odata_submission_get, which also downloads media attachments to a local folder. Each run only adds new media attachments, and the actual data only a few seconds (we have 100k+ records, so nowhere near LSHTM scales).
This approach works only because I persist the downloaded media files between runs.
I split the data into new (create), existing (update or skip) and existing and changed downstream (always skip) using an excerpt of the downstream database. The splitting takes a second.
This is to demonstrate my hack around subsetting subtables.
Hi @Matthew_White , I am still trying to pull only fresh data from central in order to not download every hour all the data collected since march (Propagate submission date to child tables - #3 by mathieubossaert).
As @Florian_May said, getting all data every day or every hour works fine, but I am looking for a solution that do not consume resources for nothing in our "real time" data collection workflow)
So a few days ago I read your post with most attention than I did and I thought I found the solution.
I adapted my sql function to get submissionId from central that are relevant to filters I give (with submissionDate, updatedAt et reviewState parameter.
It's works fine but I fear it only works for a "one level" repeat form ?
My form contains a "observations" repeat nested in the "location" one. So I achieve to automatically get data for the significant instanceId for both submission and location table but not for observations one.
No. but each "parent" table contains the path to repeat tables. The "subtablename@odata.navigationLink" attribute/column gives me the information I need to pull the data form the subtable !
As a alternative solution to filter data subtables, and as a continuation of this start-up, I now have a first version of a pyODK function that meets our needs (getting only fresh data from Central)
For a given filter, it returns a dictionary containing for each "table" of the form (submission tables + repetition) the json data corresponding to the filter.
It looks for nested navigation links in each pulled json and calls the function recursively on them.
I think it's not really optimized and can certainly be improved, so anyone with more python experience than me is welcome to review it !
def fresh_data_only(project_id, form_id, path, criteria, datas):
from pyodk.client import Client
import re, json
client = Client()
client.open()
if path == '':
return None
url = 'projects/'+project_id+'/forms/'+form_id+'.svc/'+path+'?$filter='+criteria
if re.match(r"Submissions\?.*", path) or re.match(r".*\)$", path):
tablename = 'submissions'
else:
tablename = path.rsplit('/')[-1]
response = client.get(url)
value = response.json()['value']
navigationlinks = re.findall(r'(\'\w+@odata\.navigationLink\'):\s+([^\}]+)', str(value))
for (key, link) in navigationlinks:
link = link.replace("'", "'").replace('"','')
fresh_data_only(project_id, form_id, link, '', datas)
if tablename in datas.keys():
datas[tablename] += value
else:
datas[tablename] = value
json_datas = json.dumps(datas)
return json_datas
fresh_data_only(project_id = '5', form_id = 'Sicen_2022', path = 'Submissions', criteria ='__system/submissionDate ge 2023-03-03',datas = {})
The result in PostgreSQL, when this function is used into a pl/python function is a table with two columns :
the table name
and its json
The next step is to integrate this into central2pg to get a lightweight, and high frequency workflow, between Central and our databases.