ODK Aggregate data wrangling for App Engine

Hi, ODK hive mind!

Recently, I've been involved in some data management and extraction operations and I wanted to share some methods and tools that I think are neat and useful.

If you know about other processes, please, comment and I'll add them to the main post.

ODK Briefcase

Best case scenario, you can connect to your ODK Aggregate server using Briefcase. I'd recommend to always try this first. If this works, you will be able to dump all your data into CSV files and push data to another Aggregate server, no fuss.

Backup and restore Aggregate data

Aggregate commonly uses the DataStore service to store data when it's deployed in AppEngine. It turns out that there is great documentation explaining how to go about Extracting and Importing Entities in DataStore. This is the official way to do backups and data restoration in DataStore.

You'll need to install the Google Cloud SDK and it involves executing some command line commands, but the documentation is very detailed and it gives step by step instructions to get your backups going.

Note that, by restoring a backup, pre-existing data will be completely replaced by the backup data.

Advanced analytics with Google BigQuery

Aggregate commonly uses the DataStore service to store data when it's deployed in AppEngine. If you know SQL, you can do some advanced analytics using Google BigQuery's support for joining tables, CTEs, rollup grouping sets, and window functions, among other cool SQL tricks.

You can follow the instructions for Extracting and Importing Entities to dump individual DataStore kinds into a Google Storage bucket and then load them into BigQuery tables to run SQL queries on them. There is an official document about Loading Data From Cloud Datastore Exports that covers the whole process.

This has the added benefit that, since it's a separate system from AppEngine/DataStore, it doesn't affect the performance of your Aggregate server.

Run GQL queries on DataStore

Sometimes you just need to quickly check something or export an excerpt of an Aggregate table and you already know some SQL. The DataStore Entities browser provides a web interface to run GQL queries but it can be hard to work with lots of entities and columns.

There's a command line tool called DSIO that works really well and lets you run GQL queries and get the results through the standard output or into a file. in YAML, CSV, and TSV formats, which can be very handy.

DSIO requires a Go programming language environment setup. If you don't have that, you can use a containerized version of DSIO created by yours truly that saves you all the fuss thanks to some Docker magic. You will need to install Docker, though.

Advanced DataStore maintenance

Warning: the steps described in this section could potentially destroy data and make your Aggregate deployment inoperable.

DSIO (discussed in the previous section) lets you run "upsert" (update and/or insert) operations in DataStore. The process can be roughly described like this:

  1. Extract data of a DataStore kind into a YAML file, which includes its schema description
  2. Make changes to the exported files: modify the schema and/or entities, add new entities, etc.
  3. Upload ("upsert") data back into the DataStore kind

This will create any new entities or modify existing ones, according to the changes you make to the YAML file. You can modify pre-existing data, add or remove fields to pre-existing entities or even create new entities. For more information about this, refer to DSIO's documentation.

6 Likes

A post was split to a new topic: How to import ODK data from Aggregate to Google BigQuery