Using $filter with repeat tables to download OData

Let's say I have a household (HH) survey with a repeat group. I can use the $filter option to download only the submissions I need:


https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions?$filter=__system/submissionDate le 2020-01-31T23:59:59.999Z

Now I need to download the corresponding records from the repeat table Submissions.main.hq. This table has the fields age, name, __id, and __Submissions-id.

Using any of these fields with $filter doesn't work:

## Not working
https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions.main.hq?$filter=age le 15

## Not working either
https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions.main.hq?$filter=__Submissions-id in ({{comma_separated_list_of_ids}})

Do you have any suggestions?

Hi @Odil! There's currently only limited support for filtering OData, particularly for filtering subtables.

In general, we don't plan to support filtering on form fields (like age) in the near future. See this forum post for some of the reasoning behind that. However, we support filtering the primary table on specific metadata fields.

We would like to add the ability to filter subtables on those metadata fields. We have a GitHub issue for that feature here, and there was related discussion in this forum topic. With that feature in place, you would be able to filter a subtable on __system/submissionDate. However, we aren't sure yet what the right approach or syntax within OData would be for that.

It is currently possible to fetch a subtable for a single Submission:

/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions('{{instanceId}}')/{{repeatName}}

(Note that xmlFormId and instanceId should be percent-encoded.) However, that doesn't provide a way to fetch data for multiple Submissions at once.
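
For illustration, a minimal Python sketch of that single-submission call (using requests; the server, credentials, and the repeat path are assumed example values, not from this thread):

import requests
from urllib.parse import quote

base = 'https://myserver/v1/projects/7/forms'                 # assumed values
xml_form_id = quote('my.form', safe='')                       # percent-encode the xmlFormId
instance_id = quote('uuid:85cb9aff-005e-4edd-9739-dc9c1a829c44', safe='')
repeat_path = 'main/hq'   # assumed path, derived from the table name Submissions.main.hq

url = f"{base}/{xml_form_id}.svc/Submissions('{instance_id}')/{repeat_path}"
rows = requests.get(url, auth=('user@example.com', 'password')).json()['value']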

Hope that helps!


Thank you @Matthew_White for these comments. It's helpful. I was trying to use $filter to download only the latest data. If we could also use __system/submissionDate for subtables in the near future, that would be really great.
For the time being, to download the latest data I will use the OData query parameters $count and $top:

...{{xmlFormId}}.svc/{{subtable}}?$count=true&$top=0

This returns the total number of records in the subtable:

...
"@odata.count": 6

With the next query I can download the newest submissions from the top by setting $top to [total number of records on the server] - [number of records I have already downloaded].

This is the only solution I could find to download only the latest data from OData tables. I hope this solution will help others too.
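
In Python, the bookkeeping for this trick could look roughly like this (a sketch of the approach above; the server, credentials, and the persisted local_count are assumed):

import requests

svc = 'https://myserver/v1/projects/7/forms/myform.svc'   # assumed values
subtable = 'Submissions.main.hq'
auth = ('user@example.com', 'password')

# Step 1: total number of records on the server, without downloading any data.
total = requests.get(f'{svc}/{subtable}',
                     params={'$count': 'true', '$top': 0}, auth=auth).json()['@odata.count']

# Step 2: fetch only the difference between the server total and what was already
# downloaded (assumes the ordering described above; $skip would be the append-order variant).
local_count = 4                                           # persisted from the previous run
new_rows = requests.get(f'{svc}/{subtable}',
                        params={'$top': total - local_count}, auth=auth).json()['value']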


Hi @Odil

Depending on the amount of data you expect in your repeat groups, it may also be an option to use the $expand=* query parameter to obtain the repeat data directly with the main Submissions query. See: https://odkcentral.docs.apiary.io/#reference/odata-endpoints/odata-form-service/data-document
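
For example, combining it with the filter from the first post would look something like:

https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions?$expand=*&$filter=__system/submissionDate le 2020-01-31T23:59:59.999Z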

This is available since ODK Central v1.2.


Yes, you are right. But this is not a solution when you want to download each table separately.

FWIW, for my ETL pipeline I download all submissions every day with ruODK::odata_submission_get, which also downloads media attachments to a local folder. Each run only adds new media attachments, and downloading the actual data takes only a few seconds (we have 100k+ records, so nowhere near LSHTM scale).

This approach works only because I persist the downloaded media files between runs.

I split the data into new (create), existing (update or skip), and existing-but-changed-downstream (always skip) using an excerpt of the downstream database, as sketched below. The splitting takes a second.
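
A rough sketch of that split (in Python here; the IDs and variable names are illustrative, not ruODK code):

rows = [{'__id': 'uuid:1'}, {'__id': 'uuid:2'}, {'__id': 'uuid:3'}]   # pulled from OData
existing_ids = {'uuid:2', 'uuid:3'}   # excerpt of the downstream database
frozen_ids = {'uuid:3'}               # changed downstream: always skip

downloaded = {r['__id']: r for r in rows}
create = [r for i, r in downloaded.items() if i not in existing_ids]                      # uuid:1
update = [r for i, r in downloaded.items() if i in existing_ids and i not in frozen_ids]  # uuid:2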

All of this is just to demonstrate my workaround for subsetting subtables.


Hi @Matthew_White, I am still trying to pull only fresh data from Central, so that I don't download every hour all the data collected since March (Propagate submission date to child tables - #3 by mathieubossaert).
As @Florian_May said, getting all the data every day or every hour works fine, but I am looking for a solution that does not consume resources unnecessarily in our "real time" data collection workflow.

So a few days ago I read your post with more attention than I had before, and I think I found the solution.

I adapted my SQL function to get from Central only the submission IDs that match the filters I give (using the submissionDate, updatedAt, and reviewState metadata fields).
It works fine, but I fear it only works for a "one level" repeat form?
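
For reference, such a metadata filter on the Submissions table can combine those fields along these lines (syntax per the Central OData docs; the values are examples):

https://{{myserver}}/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/Submissions?$filter=__system/submissionDate ge 2023-03-03T00:00:00Z and __system/reviewState eq 'approved'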

My form contains an "observations" repeat nested inside the "location" one. So I managed to automatically get the data for the relevant instanceIds for both the Submissions and location tables, but not for the observations one.

Am I wrong?

@Matthew_White you can forget my question :wink:

No. But each "parent" table contains the path to its repeat tables. The "subtablename@odata.navigationLink" attribute/column gives me the information I need to pull the data from the subtable!

/v1/projects/{{projectId}}/forms/{{xmlFormId}}.svc/{{subtablename@odata.navigationLink}}/{{repeatName}}

So I think I am close to a solution :slight_smile:
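
For illustration, the navigationLink attribute in each parent record holds a relative path looking something like this (field name and UUID are made up):

...
"location@odata.navigationLink": "Submissions('uuid:85cb9aff-005e-4edd-9739-dc9c1a829c44')/location"
...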

As an alternative solution for filtering subtable data, and as a continuation of this starting point, I now have a first version of a pyODK function that meets our needs (getting only fresh data from Central).
For a given filter, it returns a dictionary containing, for each "table" of the form (the Submissions table plus the repeat tables), the JSON data matching the filter.
It looks for nested navigation links in each pulled JSON document and calls itself recursively on them.

I think it's not really optimized and can certainly be improved, so anyone with more Python experience than me is welcome to review it!

from pyodk.client import Client
import json


def _find_links(node):
    # Recursively collect every @odata.navigationLink value, including links
    # nested inside groups.
    links = []
    if isinstance(node, dict):
        for key, val in node.items():
            if key.endswith('@odata.navigationLink'):
                links.append(val)
            else:
                links.extend(_find_links(val))
    elif isinstance(node, list):
        for item in node:
            links.extend(_find_links(item))
    return links


def fresh_data_only(project_id, form_id, path, criteria, datas):
    # Recursively pull the filtered submissions and the matching rows of every
    # repeat table, following the navigation links found in each record.
    if path == '':
        return None

    client = Client()
    client.open()

    url = 'projects/' + project_id + '/forms/' + form_id + '.svc/' + path
    if criteria != '':
        url += '?$filter=' + criteria

    # The Submissions document itself (or a path ending in an instanceId in
    # parentheses) feeds the 'submissions' table; otherwise use the repeat name.
    if path.split('?')[0] == 'Submissions' or path.endswith(')'):
        tablename = 'submissions'
    else:
        tablename = path.rsplit('/', 1)[-1]

    response = client.get(url)
    value = response.json()['value']

    # Follow each navigation link so nested repeats are fetched too; the
    # recursion bottoms out on tables whose records carry no links.
    for link in _find_links(value):
        fresh_data_only(project_id, form_id, link, '', datas)

    if tablename in datas:
        datas[tablename] += value
    else:
        datas[tablename] = value

    return json.dumps(datas)


fresh_data_only(project_id='5', form_id='Sicen_2022', path='Submissions',
                criteria='__system/submissionDate ge 2023-03-03', datas={})

When this function is used inside a PL/Python function, the result in PostgreSQL is a table with two columns:

  • the table name
  • and its JSON data
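
On the Python side, those two columns simply come from iterating over the returned dictionary, along the lines of:

import json

result = fresh_data_only(project_id='5', form_id='Sicen_2022', path='Submissions',
                         criteria='__system/submissionDate ge 2023-03-03', datas={})

# one (table_name, json_payload) row per form table, ready for PL/Python to return
rows = [(table, json.dumps(records)) for table, records in json.loads(result).items()]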

The next step is to integrate this into central2pg to get a lightweight, high-frequency workflow between Central and our databases. :slight_smile:


A first version of this new set of PostgreSQL functions is out :slight_smile: