Metadata files with extremely frequent updates

joebrew · September 18, 2022, 1:35pm

A health research project my team is working on requires (a) a "household" form with information about a household and its constituent members and (b) approximately a dozen "individual" forms with information about only one person. Both forms will be collected repeatedly for all individuals and households over the course of many months. Both forms require metadata about households and individuals, so that the fieldworker can pick from dropdowns of households and individuals.

The metadata must be up to date. For example, if a new person comes into the study (through migration or birth), that person must appear in the selectable dropdowns of individuals. Similarly, if a person dies or migrates out of the study area, that person must no longer appear in dropdowns. And if a person previously residing in household 1 moves to household 2, that person should now appear in the dropdowns for household 2, and no longer in household 1.

The turnaround time on updates must be fairly quick. For example, a fieldworker should be able to register a new birth via the household form, submit the form, and then carry out an individual visit for the newly registered individual (selecting them from a dropdown in the individual form) in the same visit (i.e., within minutes). Fieldworkers will have internet connectivity.

Both forms will consume identically structured metadata: households.csv (information about the households) and people.csv (information about individuals, such as which household they live in). These metadata will be created by an R script which (a) reads from submissions of the household form, (b) makes modifications to households.csv and people.csv based on those submissions (i.e., removing the dead, associating migrants with their new households, etc.) and (c) pushes the newly updated files as metadata to be associated with each form.

In other words, we are treating households.csv and people.csv as a pseudo-database which undergoes frequent reads (from all forms) and writes (from the household form only).
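The update step described above can be sketched in base R as follows. This is only an illustration, not the project's actual script: the column names (person_id, household_id, name), the function apply_household_updates and all of its arguments are hypothetical.

```r
# Hypothetical columns for people.csv: person_id, household_id, name.
apply_household_updates <- function(people,
                                    deaths = character(0),
                                    out_migrations = character(0),
                                    moves = NULL,
                                    new_people = NULL) {
  # Drop anyone who died or migrated out of the study area.
  gone <- c(deaths, out_migrations)
  people <- people[!people$person_id %in% gone, , drop = FALSE]

  # Reassign people who moved to their new household.
  if (!is.null(moves) && nrow(moves) > 0) {
    idx <- match(moves$person_id, people$person_id)
    found <- !is.na(idx)
    people$household_id[idx[found]] <- moves$new_household_id[found]
  }

  # Append births and in-migrants, skipping IDs that already exist.
  if (!is.null(new_people) && nrow(new_people) > 0) {
    is_new <- !new_people$person_id %in% people$person_id
    new_people <- new_people[is_new, names(people), drop = FALSE]
    people <- rbind(people, new_people)
  }

  rownames(people) <- NULL
  people
}

# Toy data standing in for people.csv and one batch of household submissions.
people <- data.frame(
  person_id = c("p1", "p2", "p3"),
  household_id = c("h1", "h1", "h2"),
  name = c("Ana", "Ben", "Cal"),
  stringsAsFactors = FALSE
)
moves <- data.frame(person_id = "p2", new_household_id = "h2",
                    stringsAsFactors = FALSE)
births <- data.frame(person_id = "p4", household_id = "h1", name = "Dee",
                     stringsAsFactors = FALSE)

updated <- apply_household_updates(people, deaths = "p3",
                                   moves = moves, new_people = births)
print(updated)
```

Keeping the transformation in a pure function like this (data frames in, data frame out) makes it possible to test the update rules separately from the code that talks to the server.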

We have tested this through a working proof of concept. The script which carries out the updates is run every 180 seconds. Everything performs as expected. However, before implementing this in a large project at scale, we have a few questions.

1. When we make changes to a form's metadata, but not to the underlying XLSForm schema, is it necessary to increment the form's version number so as to trigger an update in ODK Collect?
2. Assuming that the answer to the previous question is "yes", should we anticipate any problems from having thousands of versions of each form? For example, if the study lasts 1 year, and we want to carry out an update every 2 minutes, that would be 262,800 form versions. Will we run into storage/memory/performance issues on ODK Central?
3. Is there a wiser way to satisfy the requirement of frequently updated external demographic data than what I have described above?
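
The version count quoted above can be checked with a couple of lines of R. The helper name versions_per_year is hypothetical, and the calculation assumes that every scheduled run publishes a new version.

```r
# Number of form versions created in a study period if every scheduled
# run publishes a new version.
versions_per_year <- function(interval_minutes, days = 365) {
  minutes_in_period <- days * 24 * 60
  minutes_in_period / interval_minutes
}

versions_per_year(2)  # 262800, the figure quoted above
versions_per_year(3)  # 175200, for the 180-second schedule of the proof of concept
```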

Thanks for any input.

joebrew · September 18, 2022, 1:42pm

In case it is helpful in understanding the nature of the question, see the links below:

Forms

The household XLSform
The individual XLSform

Metadata
people.csv (3.4 KB)
households.csv (92 Bytes)

Script for running the update

github.com

databrew/bk/blob/main/misc/proof_of_concept_remote_database/01_update.R

message('Starting process')
library(googledrive)
library(gsheet)
library(dplyr)
library(ruODK)
library(yaml)
library(readr)
# Useful docs on API: https://odkcentral.docs.apiary.io/# 

# Create data or use the already submitted data?
first_time <- FALSE

# Overwrite the form definition or just update the metadata
overwrite_form_definition <- FALSE

# Configure ruODK and file paths
credentials_file <- '../../credentials/credentials.yaml'
creds <- yaml::yaml.load_file(credentials_file)
ruODK::ru_setup(
  svc = 'https://databrew.org/v1/projects/17/forms/household.svc',

This file has been truncated.

aurdipas · September 19, 2022, 12:06pm

hi @joebrew

Collect treats a form as "updated" if the form definition or the media files on the server differ from those on the device. It won't redownload media files it already has on the device, however: it compares the hashes of local files with the hashes the server returns for the media files and skips any file it already has. This optimization requires the server you're using to implement the OpenRosa spec correctly, of course. Central does this, but a custom server might not. If you're using "Exactly Match Server" for form management, this should all be happening in the background.

Extracted from the post "When are media files re-downloaded?"
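
The hash comparison described in that quote can be illustrated with a short R sketch. It is not Collect's actual implementation. The function name needs_download is hypothetical, and the "md5:<hex digest>" hash format is assumed from the OpenRosa form manifest.

```r
# Returns TRUE when the file must be (re)downloaded: it is missing locally,
# or its MD5 differs from the hash advertised by the server.
# `server_hash` is assumed to look like "md5:<hex digest>".
needs_download <- function(local_path, server_hash) {
  if (!file.exists(local_path)) {
    return(TRUE)
  }
  local_hash <- paste0("md5:", unname(tools::md5sum(local_path)))
  !identical(local_hash, server_hash)
}

# A temporary file stands in for people.csv on the device.
tmp <- tempfile(fileext = ".csv")
writeLines(c("person_id,household_id", "p1,h1"), tmp)
server_hash <- paste0("md5:", unname(tools::md5sum(tmp)))

unchanged <- needs_download(tmp, server_hash)  # same content, so no download

# Simulate the device holding an older copy than the server.
writeLines(c("person_id,household_id", "p1,h2"), tmp)
changed <- needs_download(tmp, server_hash)    # content differs, so download
```

The practical consequence for the workflow in this thread is that only a CSV whose content actually changed should cost the device a download, even when the form version is bumped.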

LN · September 20, 2022, 4:48am

As @aurdipas states, the form version requirement is on Central's part.

I've been involved in projects like this that have had updates in the low tens of thousands. The form version screen in the Central frontend gets slow because there's no paging there, but everything else works. Depending on the size of your entity lists it can take up a lot of storage. You can elect to remove some stale form attachments though that has to be done directly from the database and is risky.

One thing I see missing in your script is a condition to only update if there are entities to update. This could be as simple as pulling submissions since the last update and exiting if there are no new ones to process. Since you likely won't have folks in the field every minute of every day this would greatly reduce the number of new versions made.
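
A minimal sketch of that guard in base R, assuming the submissions have already been pulled into a data frame. The column names submitted_at and instance_id and the functions new_since and run_update_if_needed are hypothetical; in a scheduled script, last_run would need to be persisted between runs (for example, in a small file).

```r
# `submissions` is assumed to be a data frame with a POSIXct column
# `submitted_at` (a hypothetical name; use whatever your export provides).
new_since <- function(submissions, last_run) {
  submissions[submissions$submitted_at > last_run, , drop = FALSE]
}

# Calls `update_fn` only when there is something new to process.
# Returns TRUE (invisibly) if the update ran, FALSE otherwise.
run_update_if_needed <- function(submissions, last_run, update_fn) {
  fresh <- new_since(submissions, last_run)
  if (nrow(fresh) == 0) {
    message("No new submissions; skipping update")
    return(invisible(FALSE))
  }
  update_fn(fresh)  # rebuild the CSVs and publish a new form version here
  invisible(TRUE)
}

submissions <- data.frame(
  instance_id = c("a", "b"),
  submitted_at = as.POSIXct(c("2022-09-20 08:00:00", "2022-09-20 09:30:00"),
                            tz = "UTC"),
  stringsAsFactors = FALSE
)
last_run <- as.POSIXct("2022-09-20 09:00:00", tz = "UTC")

# Stand-in for the real update: just record which submissions were processed.
processed <- character(0)
record <- function(x) processed <<- c(processed, x$instance_id)

ran <- run_update_if_needed(submissions, last_run, record)
ran_again <- run_update_if_needed(
  submissions, as.POSIXct("2022-09-20 10:00:00", tz = "UTC"), record
)
```

With this guard, a new form version is created only for runs in which at least one household submission arrived since the previous run.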

This is currently your best bet. We are working on automating this kind of workflow as described in Entity-based data collection.

joebrew · September 20, 2022, 9:59am

Thanks so much for the quick replies, @aurdipas and @LN . Very helpful.

Hélène, thanks for the tip on updating only after changes/submissions. Very wise.

My questions have been fully answered (merci!). I have some follow-up based on what Hélène wrote:

Depending on the size of your entity lists it can take up a lot of storage. You can elect to remove some stale form attachments though that has to be done directly from the database and is risky.

1. Does "take up a lot of storage" refer to (a) the Android device, (b) the Central server, or both? I'm assuming (b) only, but have not found any documentation indicating this. I also hope it's (b) only, as I think Android 11 makes managing clean-up on the device nearly impossible (?).
2. Our entity lists are likely going to be fairly large: 10-20k households with as many as 100k individuals, along with approximately a dozen variables on each. Therefore, I think we will likely want to "remove some stale form attachments" as we go, even though this "has to be done directly from the database and is risky" (and we'll likely want to automate this). I've looked but have not found any examples or documentation regarding how to go about removing stale metadata/versions. Any tips and/or resources that I may have missed?
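
For a rough sense of scale, a back-of-the-envelope estimate can be written in R. Everything beyond the 100k individuals and 12 variables is an assumption made for illustration: the 10 bytes per field, the 10,000 retained versions, and the premise that every version keeps its own full copy of the file. The function name estimate_storage_mb is hypothetical.

```r
# Rough storage estimate, assuming each version stores a full copy of the CSV.
estimate_storage_mb <- function(n_rows, n_cols, bytes_per_field, n_versions) {
  per_file_bytes <- n_rows * n_cols * bytes_per_field
  c(per_file_mb = per_file_bytes / 1e6,
    total_mb = per_file_bytes * n_versions / 1e6)
}

# 100k individuals and 12 variables (from the post above); 10 bytes per field
# and 10,000 retained versions are illustrative assumptions.
estimate <- estimate_storage_mb(1e5, 12, 10, 1e4)
print(estimate)
```

Under these assumptions a single people.csv is about 12 MB and 10,000 retained copies come to about 120 GB, so the number of versions kept matters far more than the size of any one file.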

Again, thank you!!!

LN · October 31, 2022, 8:46pm

Thanks for the nice Tweet which reminded me that I never answered your follow-up questions.

I was thinking mostly about Central but it is in fact both. Currently Collect stores every version of a form so that it can show/edit submissions with the schema it was made from. This is something we would like to make more intelligent but haven't gotten around to.

From Collect, you can go to Settings > Project management > Reset and select "Blank forms" to reclaim space. Please note there is no undo!

None at the moment and we try not to say too much about the database since it's not a stable interface. Is this something you'd still like some hints on or have you figured out a workable solution?

joebrew · November 1, 2022, 6:05pm

Thanks @LN ! Your response is useful and appreciated.

Is this something you'd still like some hints on or have you figured out a workable solution?

Right now, we're still in the design phase. But it's likely that we'll finalize design decisions soon (at which time I'll come back with more questions). Cheers!