ODK Central OData to R

dr_michaelmarks · October 11, 2018, 11:06am

What is the problem? Please be detailed.
We are beta-testing ODK Central.
I am trying to pull data from the OData feed into either R (ideally) or Excel for downstream analysis.

In R I am using the following script:
library(httr)  # provides GET(), authenticate(), stop_for_status(), content()
url <- "https://odk-survey.lshtm.ac.uk/v1/forms/AIM_Day_1.svc"
r <- GET(url, authenticate("USERNAME", "PASSWORD"))
stop_for_status(r)
content(r)

But the output I get is not the dataset but instead:

$@odata.context

[1] "https://odk-survey.lshtm.ac.uk/v1/forms/AIM_Day_1.svc/$metadata"

$value
$value[[1]]
$value[[1]]$name
[1] "Submissions"

$value[[1]]$kind
[1] "EntitySet"

$value[[1]]$url
[1] "Submissions"

So then I tried it more simply in Excel
Data>From Other Sources>OData Feed
Insert correct URL
Insert same username/password used in the R pull attempt
Then I get a "403 Forbidden" error saying it couldn't connect

What username/password should I be supplying to the OData pulls? I presume it is the web user account that lets me log in to Central.

What ODK tool and version are you using? And on what device and operating system version?
ODKCentral beta
Windows 10
R 3.5
Excel 2016

What have you tried to fix the problem?
Tried both R and Excel
Tried a different R package, but it had no option for supplying a username/password to the OData pull, so it didn't work

issa · October 11, 2018, 10:02pm

So, what you're seeing is the OData service document—it's essentially a directory that catalogs all the available data on the service. What it's telling you here is that there is an EntitySet (table) called Submissions available at the sub-URL Submissions. Of course, this is mostly useful for programmatic access, particularly when the available options are not known in advance.

In your case, I'd not bother with the Service Document at all and just query this URL instead: https://odk-survey.lshtm.ac.uk/v1/forms/AIM_Day_1.svc/Submissions and that should get you a JSON document with all the data on record. Then you can forget the fact that it's OData at all in the first place and just treat it like any other JSON data source.
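
A minimal sketch of that approach in R, assuming the httr and jsonlite packages are installed. The helper names fetch_submissions_json and odata_to_df are made up for illustration, and the sample payload is invented so the parsing step can be tried without a server; a real response has the same outer shape, with the rows under "value".

```r
library(jsonlite)

# Download an OData table as raw JSON text (uses httr; not called below).
fetch_submissions_json <- function(url, user, password) {
  r <- httr::GET(url, httr::authenticate(user, password))
  httr::stop_for_status(r)
  httr::content(r, as = "text", encoding = "UTF-8")
}

# Turn an OData JSON payload into a data frame. The rows sit under "value";
# flatten = TRUE spreads nested groups into columns like "__system.submitterName".
odata_to_df <- function(json_text) {
  parsed <- fromJSON(json_text, flatten = TRUE)
  parsed$value
}

# Invented payload standing in for a real /Submissions response.
sample_json <- '{
  "@odata.context": "https://example.org/v1/forms/demo.svc/$metadata#Submissions",
  "value": [
    {"__id": "uuid:a", "age": 4, "__system": {"submitterName": "alice"}},
    {"__id": "uuid:b", "age": 7, "__system": {"submitterName": "bob"}}
  ]
}'
submissions <- odata_to_df(sample_json)
```

Against the real server this would be called as odata_to_df(fetch_submissions_json("https://odk-survey.lshtm.ac.uk/v1/forms/AIM_Day_1.svc/Submissions", "USERNAME", "PASSWORD")).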

I'm not sure why Excel is having problems with your login; you seem to have the right information in there. Power BI works for me when I try it with my credentials, and the two are theoretically the same under the covers with regard to OData.

dr_michaelmarks · October 12, 2018, 8:32am

Yup ok, that is a step forward - I just need to work out next how to get R to parse that data into a meaningful dataframe I guess

Florian_May · May 28, 2019, 9:19am

@dr_michaelmarks @issa
Apologies for jumping in on this topic!
I wanted to do the same thing, but the available R packages were rather meagre and didn't really work for me. So here's an attempt: https://dbca-wa.github.io/ruODK/ including a worked example.
One missing feature so far is to support pagination and forms with many submissions. Is such a form out there anywhere?
What other features would you like to see included?
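
For illustration, one way pagination could be handled is to request fixed-size pages until a short page comes back. This is only a sketch, not how ruODK does it: it assumes the server honours the standard OData $top and $skip query options, and fetch_all_pages and get_page are hypothetical names. The page getter is passed in as a function so the loop can be tried against a local stand-in.

```r
# Collect all rows by requesting successive pages until one comes back short.
# get_page(skip, top) is a hypothetical callback that returns a data frame.
fetch_all_pages <- function(get_page, page_size = 250) {
  pages <- list()
  skip <- 0
  repeat {
    page <- get_page(skip, page_size)
    if (nrow(page) > 0) pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break  # short or empty page: nothing left
    skip <- skip + page_size
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Stand-in for a server: serves slices of a local 7-row table.
fake_table <- data.frame(id = 1:7, value = letters[1:7], stringsAsFactors = FALSE)
fake_get_page <- function(skip, top) {
  if (skip >= nrow(fake_table)) return(fake_table[0, ])
  fake_table[(skip + 1):min(skip + top, nrow(fake_table)), ]
}

all_rows <- fetch_all_pages(fake_get_page, page_size = 3)
```

A real get_page would issue the HTTP request with $top and $skip as query parameters and return the parsed "value" table.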

LN · May 28, 2019, 9:00pm

Especially in these trying times, it is important to ask: “R U ODK?”

Great contribution, @Florian_May.

Stuart · June 7, 2019, 7:44pm

I've got a solution which isn't too much of a coding nightmare. If you use the /Submissions URL to grab the 'pre-loop' data and the /Submissions.repeat URL to grab the repeat data, you can turn each JSON response into a data frame using:

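library(httr)      # content()
library(jsonlite)  # fromJSON()
# r is the response from GET() on the /Submissions.repeat URL
text_content <- content(r, "text", encoding = "UTF-8")
json_content <- fromJSON(text_content, flatten = FALSE)
repeats <- as.data.frame(json_content$value)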

Then build meta the same way from the /Submissions response and left_join (from dplyr) the two together, i.e.

left_join(meta, repeats, by = c("__id" = "__Submissions-id"))
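
To see what that join does, here is a small self-contained sketch with invented data. It uses base R's merge() with all.x = TRUE as a dependency-free stand-in for dplyr's left_join; the columns village and child_age are made up, while __id and __Submissions-id are the key names used above. Renaming the repeat table's own __id first is an extra step added here to avoid two columns with the same name after the join.

```r
# Parent table: one row per submission (invented values).
meta <- data.frame(
  `__id` = c("uuid:a", "uuid:b", "uuid:c"),
  village = c("North", "South", "East"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Repeat table: one row per repeat instance, pointing back at its submission.
repeats <- data.frame(
  `__id` = c("r1", "r2", "r3"),
  `__Submissions-id` = c("uuid:a", "uuid:a", "uuid:b"),
  child_age = c(3, 5, 2),
  check.names = FALSE, stringsAsFactors = FALSE
)

# The repeat rows carry their own "__id"; rename it so it cannot clash
# with the parent key after the join.
names(repeats)[names(repeats) == "__id"] <- "repeat_id"

# Left join: keep every submission, even those with no repeat rows.
merged <- merge(meta, repeats,
                by.x = "__id", by.y = "__Submissions-id",
                all.x = TRUE)
```

Submission uuid:a appears twice (one row per repeat instance), and uuid:c, which has no repeats, is kept with NA in the repeat columns.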

Florian_May · September 26, 2019, 6:28am

@dr_michaelmarks quick update: ruODK just got a bit simpler to use, and I've added an Rmd template for a rolling start.

workshop with example use of the template Rmd: https://github.com/dbca-wa/urODK

OData vignette: https://dbca-wa.github.io/ruODK/articles/odata.html

dr_michaelmarks · September 26, 2019, 8:21am

Thanks - it's a very nice package.
Our issue at present is that we almost exclusively use encrypted data (as it's mostly health data), which is a barrier to using it for most of our projects.

Florian_May · September 26, 2019, 8:30am

Noted, thanks! https://github.com/dbca-wa/ruODK/issues/30
Haven't gotten around to it, but probably time to tackle encryption.

Thalie · September 29, 2020, 1:13pm

Hello @Florian_May, I would be interested to know if you have had time to make progress with encrypted projects, as my understanding is that ruODK relies on the OData service which is not available anymore when a project is encrypted.
This would be super relevant in all projects involving health data or where personally identifiable information is collected (e.g., we may want to use R to generate a phone follow-up log based on patient data while ensuring data integrity and security).

Florian_May · September 30, 2020, 4:16am

Hi @Thalie,
I haven't had bandwidth to address encryption yet, but am open to PRs or any other helpful input. Still on my backlog to implement!

ruODK supports all API endpoints useful to data access, not only OData, but also the RESTful API and the CSV/ZIP download. I found OData to be most suitable to handle, as repeats come in separate tables, whereas REST returns repeats as nested records, a nightmare to rectangle.
I'm aiming to implement new endpoints when they are added.

ruODK focuses on data access and less on management, as ODK Central's GUI is doing a great job at management.

Could I learn more about your use cases? Would you go via ZIP > CSV or via REST?

Thalie · September 30, 2020, 9:43am

Thanks for your answer. ODK Central is indeed great at data management, and I am planning to use it for all classical DM tasks (user access, audit trail, data edits when available, etc.) directly on the server (i.e., not via the REST API). However, I have a very specific research setup in which I need to generate customised lists of participants and reconcile data within and between forms. Data quality checks and de-identification routines will also run on information merged from several forms. The idea is then to rely on the R ecosystem to integrate, within a single processing / visualisation platform, all the functions that are too project-specific for ODK Central. This seems to me a very appealing design that complements ODK Central very efficiently; direct access to the server via ruODK also ensures high data integrity and limits the risk of errors. With this design, only the users involved in DM / data collection will directly interact with ODK Central. All other users will interact with the R platform. In particular, manual data exports used for analysis (i.e., processed de-identified data) will be generated with R, and not with ODK Central (which holds the "raw" data).

Some background info on the project
In each country, we will collect data on ~50,000 children with acute diseases in primary health care facilities (facility form) for 1 year. Data from the facility form will be used to generate a list of participants to be called by phone for a centralised follow-up (day 7 and day 28 follow-up forms). Data from the day 7 follow-up form will be used to generate a list of participants who were hospitalised during the follow-up period and for whom we will (retrospectively) collect clinical data in referral hospitals (hospitalisation form). Participants will be given a QR code ID and may visit the study facilities several times during the follow-up period (either the facility of enrolment or another facility; the same facility form will be used to collect the data of these repeat visits), and may be enrolled multiple times in the study with a different QR code ID. We also have additional studies linked to this main data collection (but of lower complexity).

If I now only focus on the facility form use case

The form contains both personally identifiable information (PII) and sensitive health data, and hence encryption is desirable. The form could possibly be split in two, using the QR code ID as the link between both, but this would be more cumbersome for our data collectors and, since encryption is a project setting in ODK Central, it does not seem to change how the data will eventually be handled.
There are indeed 2 repeat questions among the health data, but I have not yet looked at how these questions are imported. There are no repeat questions in the PII.
So far during my various tests, I have been using ruODK with either OData or the RESTful API. My guess is that the ZIP > CSV approach could become quite tedious when the database is more populated and we approach the end of the study, given the scale of the data collection, but I may be wrong here.
Data accessed from ODK Central would then be handled in two different dataframe structures in R: the first one containing PII and the second one containing (de-identified) health data associated with a (randomly generated) research ID and an encrypted sqlite mapping database for storing the link between the initial data collection ID and the research ID.
Personally identifiable information - I would use these data for generating the follow-up lists on a daily basis and reconciling entries that relate to the same participant. When exported, these data will be encrypted with restricted access rights (for data managers / monitors if a problem is detected; data collectors will temporarily access a subset of PII about participants to be contacted for follow-up or for whom hospitalisation data need to be retrieved).
Health data - I would merge these data with the corresponding follow-up, and possible hospitalisation and repeat visit entries, run quality /de-identification checks that will issue a data quality report, and prepare for a CSV export.
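
As a rough illustration of the two-data-frame split described above, the following base R sketch separates PII from health data and links them through a randomly generated research ID. All column names, values, and the make_research_ids helper are hypothetical. The mapping table is kept in memory here, whereas the plan above is to store it in an encrypted SQLite database; sample() is also not a cryptographically secure generator, so this shows only the data flow.

```r
set.seed(1)  # only to make this illustration reproducible

# Hypothetical facility-form extract (invented columns and values).
facility <- data.frame(
  qr_id = c("QR001", "QR002", "QR003"),
  child_name = c("A. Example", "B. Example", "C. Example"),
  phone = c("000-0001", "000-0002", "000-0003"),
  temperature = c(38.2, 37.1, 39.0),
  diagnosis = c("malaria", "cough", "fever"),
  stringsAsFactors = FALSE
)
pii_cols <- c("child_name", "phone")

# Draw n distinct random research IDs ("R" followed by 8 characters).
make_research_ids <- function(n) {
  ids <- character(0)
  while (length(ids) < n) {
    candidate <- paste0("R", paste(sample(c(LETTERS, 0:9), 8, replace = TRUE),
                                   collapse = ""))
    ids <- unique(c(ids, candidate))
  }
  ids
}

# Mapping between the collection ID and the research ID (to be stored securely).
id_map <- data.frame(qr_id = facility$qr_id,
                     research_id = make_research_ids(nrow(facility)),
                     stringsAsFactors = FALSE)

# PII table keeps the collection ID; health table carries only the research ID.
pii <- facility[, c("qr_id", pii_cols)]
health <- merge(id_map, facility[, setdiff(names(facility), pii_cols)], by = "qr_id")
health$qr_id <- NULL
```

Only id_map can link a row of health back to a row of pii, which is why it is the piece that needs the strictest access control.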