Briefcase: Export from last

1. What is the general goal of the feature?
Thanks to recent ODK Briefcase releases, we have three targeted pull/export features:

1] start-from-date to resume a pull operation from a specific date
2] start-from-last to resume a pull operation, picking up from the position of the last pull
3] export-from-date, which runs an export operation from a specific date

What's now logically missing is the export counterpart to start-from-last,
i.e.

4] export-from-last, which runs an export operation, picking up from the position of the last export

I can also see a reason to include an option:

5] smart-export, which looks at the target file, then either

  1. creates the target file if it is not there and exports from the first submission, or
  2. identifies the meta-instanceID of the last exported submission and then exports from the next submission
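
To make the intent concrete, here's a rough sketch of that decision in shell terms (TARGET_CSV and the column holding the meta-instanceID are placeholders for illustration, not anything Briefcase does today):

#!/usr/bin/env bash
# Sketch only: illustrates the proposed smart-export branching.
TARGET_CSV="output/my_form.csv"

if [ ! -f "$TARGET_CSV" ]; then
	echo "No target file: create it and export from the first submission"
else
	# assume the meta-instanceID sits in the second column of the exported CSV
	last_id=$(tail -n 1 "$TARGET_CSV" | cut -d, -f2)
	echo "Resume export from the submission after $last_id"
fi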

2. What are some example use cases for this feature?

A] Export-from-last

When working with long-lived and heavily used forms, export time becomes a significant issue when using the behaviour that deletes the old CSV and exports all data to a new one.

Daily export of ~750,000 forms to 17 CSVs currently takes about 5-7 hours on our largest project.

An alternative solution we have tried is to use the append behaviour and stitch new submissions from an arbitrarily recent date (e.g. the last 5 days) onto the end of the existing file, but this leaves duplicates that need to be removed with downstream analysis. Basic Unix processes such as sort | uniq change the line order of the resulting file (headers are also affected), so they are not ideal.
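
For what it's worth, an order-preserving alternative to sort | uniq that we may experiment with is awk's seen-line idiom (a minimal sketch; file names are placeholders):

# Print each line only the first time it appears, keeping the header and the
# original line order intact (holds the seen lines in memory).
awk '!seen[$0]++' stitched.csv > deduplicated.csv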

Using export-from-date only works at the granularity of a day, so if we pulled twice a day we would potentially miss or duplicate some records.

An export-from-last would have a fairly good use case for all export activities, but especially for long-lived forms managed via the CLI.

B] smart-export

When a system failure occurs, or when handing the system over to another operator, the ODK Briefcase database can be recreated quite quickly by copying the ODK XMLs from a backup drive to a new machine, but the start-from-last or export-from-last position is lost when this happens. By implementing a system that can look at the target file and identify the appropriate resume point, the full system could be rapidly recreated/replicated after a system failure by:

  1. Copying the xmls folder to new machine
  2. Copying the target CSV folder to the new machine

Another use case for smart-export: I set up a system and run it for six months. Then I need to go on a journey, and @dr_michaelmarks wants to run the system while I am away. I copy the whole ODK Briefcase directory and CSV files onto a hard drive and give them to Michael. Michael's copy of Briefcase then just figures everything out (resume points for pull and export) and picks up where I left off, without having to first run long pulls or exports.

I think the resume data are currently stored somewhere in the system's Java preferences, but moving them to within the ODK Briefcase folder would allow the resume info to be carried within the folder, meaning it could be passed between two systems.

3. What can you contribute to making this feature a reality?
Beta testing, discussion

Thanks, @chrissyhroberts! The new features you're suggesting sound super useful.

First, we need to be aware of the differences between the pull and export operations in relation to the dates they use.

  • The pull operation uses the "last update" date of a submission, which is related to the date&time a submission is physically sent over the network to Aggregate, regardless of when it was completed.

    This "last update" date&time must not be confused with the submission date inside the metadata block of the submission's XML contents. Although both date&times can be close or even the same, on paper we can't make the assumption that "last update date" = "submission date".

    The "last update" date is not part of a submission's XML contents.

    This is imposed by the Briefcase - Aggregate API.

  • The export operation uses the submission date in the metadata block of the submission's XML contents.

Now that we've covered that, here are my thoughts about the features you're suggesting:

  • At this moment, it would be easier for us to implement the smart-export feature. I'd suggest calling it smart-append instead, and it would be incompatible with the --overwrite_csv_export flag.

    I reckon that the smart-export|append flag would solve both use cases you have described. Do you agree?

  • As an alternative that requires no changes to Briefcase, one could take the last line of the previously exported CSV file with something like awk '/./{line=$0} END{print line}' my_file.txt, extract (with awk again) the first column's value, and pipe it to the existing --export_start_date argument (see the sketch below).

    We might need to change the argument to accept ISO8601 datetimes, though.
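
Put together, that workaround might look something like this sketch (the jar name is a placeholder, and it assumes the first column of my_file.txt holds an ISO8601 submission date):

#!/usr/bin/env bash
# Sketch of the no-changes-to-Briefcase workaround described above.

# Last non-empty line of the previously exported CSV.
last_line=$(awk '/./{line=$0} END{print line}' my_file.txt)

# First column of that line, truncated to the date part (YYYY-MM-DD).
last_date=$(echo "$last_line" | awk -F, '{print $1}' | cut -c1-10)

# Pipe it into the existing export argument (plus the usual export flags).
java -jar briefcase.jar -e --export_start_date "$last_date"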

Hi @ggalmazor, smart_append sounds perfect, and I agree it would do the job of knowing what is at the bottom of the current exported CSV and then adding the new stuff. The awk-based solution would be simple, but presumably it would still have issues with overlap on dates unless export_start_date were adapted to allow a full timestamp. I think you'd never need the latter if you had smart_append.


Agreed. I guess we can do one, the other, or both things:

  1. Change the --export_start_date to allow ISO8601 dates and datetimes (assuming start of day 00:00:00.000 when providing a date without time)
  2. Add a new --smart_append feature as described in this thread

We could probably have 1. done much faster than 2.

Quite by coincidence, I have just been experiencing some epic problems with export.
I now have a form at about 130,000 submissions, and export is failing, presumably because RAM is running out or some similar issue (I haven't nailed it down yet).

I've been playing with the existing export-from-date feature, and it runs very slowly when the ODK Briefcase Storage contains lots of submissions in the instances folder. I guess that's unsurprising, as it no doubt has to go all the way through the hundreds of thousands of submissions to find the ones with specific date tags...

At which point I start thinking out loud...

...and it got me thinking that the biggest issue we've faced when working with long-lasting forms is that the software often has to read through a massive folder of instances to find the most recent ones. As a result, I am starting to think about a system for archiving the older instances.

It would work something like this:

  1. Pull forms from server
  2. Export forms to data.CSV
  3. Backup data.CSV file to backups/yyyymmddhhmmss.csv.bck
  4. Move instances from instances folder to archive/yyyymmddhhmmss/ folder
  5. Wait until tomorrow
  6. Pull forms from server to the now-empty instances folder using --sfl
  7. Export forms to CSV (appending to file data.CSV)
  8. Backup data.CSV file to archive/yyyymmddhhmmss.csv.bck
  9. Move instances from instances folder to archive/yyyymmddhhmmss/ folder
  10. Repeat....

Benefits of this are:

  • You can have an archive of instances, arranged by pull datestamp, making it easier to find specific data points, especially when encrypted
  • You can incrementally append new instances to the data.csv file, saving export time previously needed to scan through big instances folders
  • Automated incremental backup of the CSV file, stored alongside the instances, allowing easy recovery of data: the worst-case scenario is that you move the last increment of data.CSV from the most recent archived pull back to the main folder and pull using --start-from-date; then it presumably goes back to being able to use --sfl again
  • Data storage is moved away from the folder used to handle fresh pulls/exports, putting a firebreak between old and current data.

Also

  • Could in principle use the archive folder names to grab the last pull date for a smart pull

i.e. Briefcase could be run at 2019-06-10-23:12:00; it reads through the folders in archive and finds...

archive/2019-06-02-12:53:56
archive/2019-06-04-23:12:12
archive/2019-06-07-03:55:11

...and figures out that the last pull was at 2019-06-07-03:55:11
Then it pulls from 2019-06-07-03:55:12
Exports to data.CSV
Creates a new folder

archive/2019-06-10-23:12:00
and finally copies all the recently pulled instances and data.CSV into that folder, renaming data.CSV to
2019-06-10-23:12:00.CSV.bck along the way.
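
In shell terms, the figure-out-the-last-pull step could be as simple as this sketch (assuming the archive/ folder naming above):

# The timestamped folder names sort lexicographically in date order, so the
# newest archive folder gives the last pull time.
last_pull=$(ls archive/ | sort | tail -n 1)
echo "Last pull was at $last_pull"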

Does this sound like madness to you, @ggalmazor?
I'm thinking it's a coin toss: either I haven't slept enough, or this could be great.

I'll probably mock something up using the existing CLI and some Unix tooling, as I think it's possible with those. I'll post here if it seems to work, but I wonder if this could be an interesting thing to think about in Briefcase.

#!/usr/bin/env bash

# requires the server password in the top line of a file called serverpass.txt, e.g.
#	ODKPASSWORD=12334435dm

#########################################################################################
## USER DEFINED VARIABLES
#########################################################################################
ODK_STORAGE_PATH="Data"
ODK_EXPORT_PATH="Data/Output"
URL="https://xxx.yyy"
ODKUSERNAME="user"
PEM="keys/ODK.PRIVATE.KEY.xxxx.pem"

#########################################################################################
#########################################################################################

# GET TODAY'S DATE
PULLTIME=$(date +"%Y-%m-%d")
PULLTIME2=$(date +"%Y-%m-%d_%H_%M_%S")

echo Today is "$PULLTIME"

# GET YESTERDAY'S DATE (BSD/macOS date syntax)
EXPORTLIMIT=$(date -v -1d +"%Y-%m-%d")

echo Export limit is "$EXPORTLIMIT"


## Get the password from the file serverpass.txt
source serverpass.txt
echo password is "$ODKPASSWORD"

declare -a arr=(
#########################################################################################
## ADD ODK FORM IDs BELOW THIS LINE  (FORMID;FORMNAME) replacing spaces and . with _
#########################################################################################
"Eggs47a;Some_Egg_analysis"
"Dateformat;A_form_about_dates"
#########################################################################################
#########################################################################################
                )

## now loop through the above array and perform functions on every form ID
for i in "${arr[@]}"
	do
			FORM_ID=$(echo "$i" | cut -d ";" -f1)
			FORM_NAME=$(echo "$i" | cut -d ";" -f2)
			FORM_DIR="$ODK_STORAGE_PATH/ODK Briefcase Storage/forms/$FORM_NAME"
			echo Working on form "$FORM_ID"
			echo Form Name : "$FORM_NAME"
			echo Form ID :  "$FORM_ID"

			#read in the next pull date from nextpull.txt, creating it if missing
			if [ ! -f "$FORM_DIR"/nextpull.txt ]; then
				echo "Next pull not found! Creating 1970-01-01"
				mkdir -p "$FORM_DIR"
				echo "NEXTPULL=1970-01-01" > "$FORM_DIR"/nextpull.txt
			fi
			source "$FORM_DIR"/nextpull.txt
			echo Date of next pull "$NEXTPULL"
				
			#make the archive folder for this pull
			mkdir -p "$FORM_DIR"/archive/"$PULLTIME2"
		
			#pull data using briefcase -sfd option and $NEXTPULL
			echo Pulling data from "$NEXTPULL" onwards
			java -jar ODK-Briefcase-v1.15.0.jar -sfd "$NEXTPULL" -plla -pp -id "$FORM_ID" -sd "$ODK_STORAGE_PATH" -url "$URL" -u "$ODKUSERNAME" -p "$ODKPASSWORD"
		
			#export data from form using -start $NEXTPULL and -end $EXPORTLIMIT (i.e. start at last day when pull was done and end at yesterday)
			echo exporting new instances since "$NEXTPULL" and up to "$EXPORTLIMIT"
			java -jar ODK-Briefcase-v1.15.0.jar -e -ed "$ODK_EXPORT_PATH"/ -start "$NEXTPULL" -end "$EXPORTLIMIT" -sd "$ODK_STORAGE_PATH" -id "$FORM_ID" -f "$FORM_ID".csv -pf "$PEM"
		
			#move anything pulled today to the timestamped archive folder
			echo moving instances to archive
			find "$FORM_DIR"/instances/ -mindepth 1 -maxdepth 1 -exec mv -f {} "$FORM_DIR"/archive/"$PULLTIME2"/ \;
			mkdir -p "$FORM_DIR"/instances

			#write today's date to the nextpull.txt file. The next pull will resume at today's date
			echo writing next pulltime to nextpull.txt
			echo "NEXTPULL=$PULLTIME" > "$FORM_DIR"/nextpull.txt

			#make a backup of the CSV file
			echo making backup copy of CSV file
			cp "$ODK_EXPORT_PATH"/"$FORM_ID".csv "$FORM_DIR"/archive/"$PULLTIME2"/"$PULLTIME2".csv
	done


I've been running this for a day now and pull/export time cut from around 9 hours to 12 minutes.
Seems to be an effective way to manage data.

Currently this only works on a daily basis, but making export-from-date work with full ISO timestamps should make the system pick up exactly at the last pull time. At the moment it only exports up to the start of the current day (i.e. it stops at 23:59:59 the day before), or else we'd get duplicates in the CSV.


Hey, @chrissyhroberts!

I did a first read of this some days ago and everything looked fine. I think there's good ground for great improvements (once again, thanks for your continued feedback!).

I just wanted to tell you that I'm busy with the release of Briefcase v1.16 and have little time for extra stuff, but I'm planning to study your proposal and have some feedback during the next weeks.

I've posted a more detailed version of this, with links to the GitHub repo, at


Impressive work as always, @chrissyhroberts. Thanks for sharing!

It seems like we can make some easy improvements (correct me if I'm wrong, @ggalmazor) to Briefcase to reduce the work you have to do with OBD.

The easiest is this: if -sfl and -sfd could persist their cursors to the storage directory, that would remove the need to re-pull if you move the data between machines. Correct?

That would be an immensely useful general-purpose change. I don't know much about how cursor position storage works, but a human-modifiable file would provide an additional level of user control over the system, i.e. as with the OBD scripts, where we can simply open a file and change it to manipulate the behaviour of the next pull.

The other thing that would make a big difference would be to make -sfd and the export from/to dates work with more granular timestamps, i.e. dd-mm-yyyy-hh-mm-ss instead of dd-mm-yyyy. I think we've discussed this before elsewhere, but it's worth highlighting here as a use case.

At present, the key limitation of these functions is that they can be controlled only to the day, meaning that in the OBD workflow presented above, if I run the scripts today at 1200, I have to truncate exports at midnight this morning if I don't want duplicate instances when I then run the scripts again at 1400; i.e. any reports have to be limited to yesterday in order to prevent duplication of records in the CSV.

In extended use, a more granular export timestamp would allow one to perform a smart export by including a line in the bash scripts that grabs the last line of the previous CSV file and then greps out the timestamp of the last export, i.e. tail -n 1 countries.txt | grep %d%M%Y-%h%m%s (or something to that effect; see the sketch below). Armed with that, the scripts could set the export datetime to one second after the last pull in order to seamlessly export exactly the new data that have arrived since the last pull. This is in contrast to the current behaviour, which is to export data that have arrived up until the current calendar date.

This way we could run manual scripts in an ad hoc manner and automated scripts on an interval less than one day, all without the fear of introducing duplicate lines to the CSV files.
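
Concretely, that last-timestamp grab might look like this sketch (it assumes the first CSV column is an ISO8601 submission timestamp, and an export_start_date that accepts full datetimes, which doesn't exist yet):

# Timestamp of the last exported row, e.g. 2019-08-11T22:54:32
last_ts=$(tail -n 1 data.csv | cut -d, -f1 | cut -c1-19)

# Add one second so the next export resumes just after the last one
# (BSD/macOS date syntax, matching the script above).
next_ts=$(date -j -f "%Y-%m-%dT%H:%M:%S" -v+1S "$last_ts" +"%Y-%m-%dT%H:%M:%S")

echo "The next export would start from $next_ts"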


Hi, @chrissyhroberts, and others!

I wanted to give y'all an update. Since we first talked about this in June, we've been busy with v1.16, but we now have cycles for this :).

So far, Briefcase will write into a metadata.json file at the form's location with information about the form, the last pull cursor and the last exported submission. This is an example:

{
  "key" : {
    "name" : "RoadCrossingStructures_EGR",
    "id" : "build_RoadCrossingStructures-EGR_1561322641",
    "version" : null
  },
  "storageDirectory" : "/tmp/ODK Briefcase Storage/forms/RoadCrossingStructures_EGR",
  "hasBeenPulled" : false,
  "cursor" : {
    "type" : "aggregate",
    "value" : "<cursor xmlns=\"http://www.opendatakit.org/cursor\"><attributeName>_LAST_UPDATE_DATE</attributeName><attributeValue>2010-01-01T00:00:00.000+0800</attributeValue><uriLastReturnedValue>uuid:f2766df0-a49c-4615-8d56-c4159a4df00a</uriLastReturnedValue><isForwardCursor>true</isForwardCursor></cursor>"
  },
  "lastExportedSubmission" : {
    "instanceId" : "uuid:89008a18-de9e-4a57-8bcb-fdd6fa10824a",
    "submissionDate" : "2019-08-11T22:54:32.201Z",
    "exportDateTime" : "2019-08-12T09:18:08.944131+02:00"
  }
}

When pulling with the --start_from_last flag, the cursor will be read from this file.
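
For reference, a resumed pull with the flags already shown in this thread might look like this (jar name, URL, and credentials are placeholders):

java -jar briefcase.jar -plla -sfl -id my_form_id -sd "Data" -url "https://xxx.yyy" -u user -p secret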

Now I'm implementing the --smart_append flag :)
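
If you want to script around this in the meantime, the resume point can be read straight from that file; a sketch (jq is my assumption here, and the path is the example form's storage directory from above):

# Read the last exported submission's date from metadata.json (requires jq).
jq -r '.lastExportedSubmission.submissionDate' \
  "/tmp/ODK Briefcase Storage/forms/RoadCrossingStructures_EGR/metadata.json"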

@chrissyhroberts and @dr_michaelmarks, ODK Briefcase v1.17 Beta has the smart append feature and more portable storage directories. Give it a try and leave your feedback on the pre-releases page!


I've also created https://github.com/opendatakit/briefcase/issues/809 to track the change to allow datetimes in the date range filters in the CLI.