An Aggregate data maintenance case

(This is a follow-up post to the Aggregate data wrangling compendium post.)

Recently, I've been involved in some Aggregate maintenance work that required some data wrangling. The Aggregate server was having issues publishing data to Google Fusion Tables (R.I.P.). It had published data up to a certain point, but newer data was no longer being published.

First intervention

The first analysis of this issue revealed critical errors related to lock acquisition in AppEngine/DataStore, which suggested degraded database operation. Since the Aggregate version was relatively old (v1.4.x), interaction with Briefcase still worked, and the number of forms (3) and submissions (around 15k) was manageable, we decided to address this by resetting the Aggregate instance and upgrading it to the latest v1.7.1.

These are the steps we followed (a sketch of the commands for steps 1, 2, and 8 follows the list):

  1. Backup the DataStore
  2. Pull all forms and submissions with Briefcase
  3. Stop all activity on Aggregate
  4. Take note of all publishers and of the Aggregate configuration and credentials
  5. Use the DataStore Entity Manager to delete all entities
  6. Install the latest Aggregate
  7. Manually restore the previous Aggregate configuration and credentials, and check that everything works as expected
  8. Push all forms with Briefcase
  9. Create new publishers
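
For reference, here's a minimal sketch of how steps 1, 2, and 8 could look from the command line, assuming the gcloud SDK and the Briefcase CLI are available. The bucket, form ID, URL, and credentials are placeholders, and Briefcase flag names can vary between versions, so check them against your Briefcase build:

```
# Hypothetical sketch; bucket, form id, URL and credentials are placeholders.

# Step 1: back up the DataStore to a Cloud Storage bucket
gcloud datastore export gs://my-backup-bucket

# Step 2: pull a form and its submissions from Aggregate
java -jar ODK-Briefcase-v1.x.jar --pull_aggregate \
  --form_id my_form \
  --storage_directory /path/to/briefcase-storage \
  --aggregate_url https://myapp.appspot.com \
  --odk_username my_user \
  --odk_password my_password

# Step 8: push the form and its submissions to the new instance
java -jar ODK-Briefcase-v1.x.jar --push_aggregate \
  --form_id my_form \
  --storage_directory /path/to/briefcase-storage \
  --aggregate_url https://myapp.appspot.com \
  --odk_username my_user \
  --odk_password my_password
```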

Before performing any destructive action on the server, I simulated the whole process locally. This revealed issues with some submissions that, for some reason, had submission dates from the year 1900. Aggregate wasn't able to parse these dates, which produced a partial push in step 8.

To deal with this, I used sed to search and replace these submission dates with dates in the year 1980 across all the XML files in the Briefcase storage directory. After doing this, all submissions were pushed successfully. I also filed an issue to get this fixed in Aggregate.
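
The fix was along these lines. A minimal sketch, assuming GNU sed and a placeholder storage path; a real run would use a pattern anchored to the actual date values in the data:

```
# Rewrite year-1900 submission dates to 1980 in every XML file under the
# Briefcase storage directory. The path and date pattern are illustrative.
find "/path/to/ODK Briefcase Storage" -name '*.xml' -print0 |
  xargs -0 sed -i 's/1900-/1980-/g'
```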

Selective data removal

The user also reported that they wanted to remove some submissions matching certain criteria. This proved to be something we currently can't do on Aggregate instances deployed on AppEngine.

One solution would have been to repeat the steps from the first intervention, removing the selected submissions before pushing them back to Aggregate.

On Aggregate instances deployed on Tomcat, it would have been possible to do this by running SQL DELETE statements for the target submissions, as sketched below.
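
For example, on a Tomcat deployment backed by PostgreSQL, that could look roughly like this. This is a hypothetical sketch: Aggregate generates one set of tables per form, so the table name, column name, and criteria below are placeholders that would need to be checked against the actual schema:

```
# Hypothetical sketch for a Tomcat + PostgreSQL deployment. Back up the
# database first, and remember that repeat groups live in child tables
# whose related rows would need deleting too.
psql -U aggregate_user -d aggregate_db -c \
  "DELETE FROM my_form_core WHERE _submission_date < '2015-01-01';"
```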

Second intervention

After resetting the Aggregate server, the critical errors related to lock acquisition went away, but the user reported the same publisher issue as before: after some time, submissions were no longer being published to Fusion Tables.

Although the cause of this issue is still unknown, we decided to modify the publisher configurations to trick Aggregate into thinking it needed to resume publishing from older submissions.

Publishers primarily use the date_marked_as_complete metadata field to track the newest submission they have published. When more than one submission shares that date, the submission's UID is used as a tie-breaker.

We reconfigured the publisher's LAST_STREAMING_KEY and LAST_STREAMING_PERSISTENCE_CURSOR fields to match the instanceID and date_marked_as_complete values of the last submission actually published to Fusion Tables, then waited for the scheduler to kick in and resume publishing the missing submissions.

This strategy worked: all the missing submissions eventually appeared in Fusion Tables. It could probably be used with other publishers as well.


Thanks for documenting the process in such great detail. My hope is that this can help someone (often ourselves in the future)!
