Filter Data in ODK Central

Vinod_Rathore · September 26, 2020, 11:16am

Is it possible to make filter data based on selected Indicators and share this filter data to users they have the rights to view the data .

Matthew_White · September 26, 2020, 4:59pm

Hi @Vinod_Rathore!

As of the current version (v1.0), Central does not have the ability to filter data. In v1.1, we plan to add the ability to filter by submission date and submitter.

Could you say more about how and with whom you wish the data to be shared? Who are the users who have rights to view the data? Why will they be viewing filtered data rather than the entire data set — what sort of indicators will be used? Will they view the data in Central or in a separate service/application? It'd be helpful if you could describe your high-level need in as much detail as possible.

You'll also find some related discussion in this forum thread:

Vinod_Rathore · September 27, 2020, 4:59am

Details of my need as per following Example . : -

I want to create a data collection project for monitoring of health services in 5 districts.

I will create a single form for all these 5 Districts and create data collector user, and 5 District supervisors.

I want to all data collectors send data through ODK in single form and their district supervisors can access his/her district data. so they can analyze, monitor and guide their team.

dmenne · September 27, 2020, 9:59am

Your question addressed @Vinod_Rathore, but maybe I should give you my 5 cents:

Example 1: Get all records from patient 1234, and create a time series of the blood cell count. Currently, I have to do dirty tricks to keep a local warehousing database that is updated regularily skipping the number of previously retrieved items (since skipping is the only supported filtering). This use-case would not be covered by your suggestion. Why do you want to limit the query to submitter/date? There are libraries around that do it, and KOBO has it implemented.

Example 2: Whenever a new record is uploaded, we want a report to be sent to an email. The current solution uses a trigger event on the database, which posts to MQTT for further processing. While this works, it is a maintenance nightmare. When I can ask for the records generated within the last minute, I could run a much more robust chron-query. Alternatively, being able to generated an event on posts as KOBO has would be nice, but it certainly is more complicated.

Surprisingly, Listing all users has a query string.

Keeping the local SQLite database up to date and queryable needed about 1000 lines of code which I would happily scrap if a query is available. Please, keep me updated when you have plans on the syntax in the REST API. I am currently writing an R package for the REST API to complement ruODK, and including a query would be a big gain.

dmenne · September 27, 2020, 2:32pm

I had a look at the database structure, and I begin to understand why an intelligent query is not possible on the data. Submitter and Time are in table submissions and are simple fields that can be queried. Data are in submissions_defs, field xml which surprisingly is of type text, not xml. I have not tried it yet, but as far I understand PG, XPATH queries are not possible in text field.

If I correctly understand this, it means that there will be no extended querying possible in the next future.

Too bad. Please correct me when I am wrong.

Matthew_White · September 27, 2020, 4:49pm

We're planning to start with filtering on submission date and submitter. (@dmenne, you're right that this submission metadata is easier to access.) However, we've discussed expanding to filtering on any form field, though there's no timeline right now. It's very useful to learn more about your use cases and the ways in which this functionality would be helpful.

@dmenne, I think there are a few sources of complexity here:

Parsing the OData filter expression
Selecting the field
Indexing the field so that filtering is faster than pulling all rows

@issa discusses (1) a little here.

You're right that (2) is more challenging for form fields than it is for submission metadata like submission date and submitter. I think you're right that the Postgres xpath() function isn't designed for text values, but Postgres also allows conversion between text and xml, and we could maybe also alter the type of that column.

@dmenne, for your example 2, isn't filtering on submission date what you need there?

dmenne · September 27, 2020, 6:13pm

Example 2: filtering on submission data is fine and removes some awkward code to keep the last skip number. A real filtering on data however would make the whole sqlite warehouse redundant.

@issa you mentioned:

Why not leave the sorting out of the awkwardness to the API-user? XPATH is not funny to create, but would be an easy way out - return XML in a first step in the API. If you really must, you can add higher level queries later.

I have tried to convert the text to xml, hoping that each xml is implicitly converted to text. No success yet, but will try again tomorrow.

Florian_May · September 29, 2020, 9:19am

@dmenne would it make sense and would you be interested in joining forces and integrating your functions into ruODK? I'd gladly review PRs and you'd benefit from a lot of work and infrastructure that's already implemented in ruODK. Just an option for your consideration

dmenne · September 29, 2020, 9:45am

I anyway wanted to do this, but I fear I do not have the time to create a full CRAN-ready package.

My goal is a bit different from ruODK - yours is a high-level package, mine is planned to be a straightforeward R-Interface to the REST API only. It uses almost no additional packages besides httr and directly related ones, most importantly no tidyverse. I am not an ideological anti-tidyverser, using it a lot in teaching and in reporting, but I believe that low-level API packages should be lean.

It could be that the implementation of the unnesting will be much less complete than your's, because is it MUCH simpler in tidyverse.

So using ruODK in odkapi is not useful, the inverse might be helpful. Using ruODK is messy for me because of the missing filtering functions in the API, converting the XML to xml-Format in the database looks promising though for lovers of XPATH.

I need the API to build an end-user (clinical staff) proof form upload with individual reporting. So there are many additional problems, for example uploading of one patient name (dynamically) by sneaking with ssh into Android, running in a setting without internet connection and custom HTTPS. I am using an custom build of ODK Collect with burnt-in certificate, because Android > 5 does not accept global ones,

dmenne · September 29, 2020, 1:08pm

Today, I did some experiments to show that an API for a flexible query of submissions is not that difficult.

To avoid killing the app, I added a column xxml to submissions_attachment:

update submission_defs set xxml = XMLPARSE(CONTENT(xml))

This converts the header to a element: Using some postgresql options, a slightly different treatment of the header is possible.

<data xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema" id="Endoskopie_Protokoll" version="Endoskopie Protokoll">
  <pat_sequence_nr>1</pat_sequence_nr>

... 
</data>

An xpath query gave what I wanted: formDefId is a regular field

SELECT xpath('/data[pat_no_nr = 102]/protokoll/blutdruck/text()', xxml) as blutdruck
from submission_defs
where submission_defs."formDefId" = 7

This is similar to

Select /data/protocol/blutdruck from xxml where /data/pat_no_nr=102

Considering all specia" cases not needed. Writing XPATH is as funny as writing regular expressions, but both work, and additional higher level API function are possible, but it is a first step.

It would be better to make the xml column of tyle xml, not text - the header stufs is the main hurdle I see. The xml column can be read as text without any changes; the following works.

SELECT xxml from submission_defs
where submission_defs."formDefId" = 7

Matthew_White · October 2, 2020, 4:53pm

@LN just wrote up a detailed description of our thinking around what Central can offer in terms of data analysis. I wanted to highlight a couple of paragraphs that I think are especially relevant to this topic:

ODK Central clarifications required

One of the early decisions we made was to store incoming form submissions as XML blobs rather than splitting them into database tables as Aggregate does. This is a decision that we did not make lightly. It has helped our small team make quicker progress and has ensured speed and stability as submissions come in. We learned from Aggregate and other systems that splitting records is a big source of code complexity, bugs, and performance bottlenecks.

The tradeoff is that this limits the analysis that can be done directly on Central -- the data is not organized for any kind of fast operations across the dataset. Additional implications are that directly connecting to the Central database for analysis is not practical and that we don't provide a performant API for open-ended data querying. … [W]hat we learned from Aggregate is that most people need to rely on external tools for analysis anyway.

I also wanted to link to What's coming in Central over the next few years, which discusses future directions for Central.

@dmenne, I'm realizing that your example may relate to ongoing discussions around entity-based data collection. It seems possible that entity-based data collection would simplify your workflow. Related to that is this section from @LN's post:

Matthew_White · October 2, 2020, 5:01pm

@Vinod_Rathore, have you explored the Central OData feed? This allows you to connect Central to tools like Excel or Power BI. You should be able to filter the data within those tools. For example, I think each district supervisor could use the same OData feed, but apply different filters.