Hi, I am waking this old thread up as it has useful info on how to start a Briefcase query on old data from a point other than the start of time.
I have a lot of data on server, but would prefer not to purge as (a) it is safe there and (b) international partners may want direct download access
When I use Briefcase to pull, the app spends a very long time querying old forms that have already been downloaded.
I'd like to start querying at time-point at or near where the last pull left off (or to specify a date/time)
The idea in the thread above, using a push to set the time-point for the next pull run, seems nice, but it is unclear
(i) whether this would require dummy data to be added to the server
(ii) how to do it using Briefcase / Briefcase CLI
(iii) whether Briefcase would remember the correct time-point if I used Briefcase for another server between pulls
Would there be scope for a future feature in Briefcase, "Start query at date: "?
I've also seen how pulling data from servers with lots of submissions can be slow. It looks like there's room for optimization when users know which submission was the last one they pulled to their computers.
I understand this would filter submissions by the submission date field.
I think we should explore what happens when submissions are delayed and don't get included in pulls like this.
Maybe I do a pull every day at 3am to get yesterday's submissions, but a submission sent yesterday arrives at Aggregate at 10am, after my script has already run.
Could weird scenarios like this one take place when using Briefcase to pull submissions from Collect and push them to Aggregate?
Thinking about your scenario, I would always set the start date to a couple of days or even weeks before the last pull. That way it can double-check for these stray submissions, but still not have to go back to the very beginning of time.
For instance, I am on day 100 of a study and running through days 1-90 is pretty pointless, but checking days 91-100 could catch new submissions whilst still saving 90% of the time.
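To put numbers on that, here is a minimal Python sketch of the lookback arithmetic. The function names and the calendar dates are made up for illustration; nothing here is an existing Briefcase option.

```python
from datetime import date, timedelta


def query_start(last_pull: date, lookback_days: int) -> date:
    """Date to start checking from: the last pull minus a safety margin."""
    return last_pull - timedelta(days=lookback_days)


def fraction_skipped(study_day: int, lookback_days: int) -> float:
    """Share of study days that no longer need to be re-checked."""
    checked = min(lookback_days, study_day)
    return (study_day - checked) / study_day


# Hypothetical dates, for illustration only.
last_pull = date(2020, 4, 9)
print(query_start(last_pull, lookback_days=10))            # 2020-03-30
print(fraction_skipped(study_day=100, lookback_days=10))   # 0.9
```

A longer lookback catches later stragglers at the cost of re-checking more days, so the margin is a trade-off each project would have to pick.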
In order to get the first batch of submissions, Briefcase sends Aggregate an empty cursor
Briefcase will continue asking for more batches until an empty batch arrives, which ends the pull operation
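As a rough illustration of that loop, here is a small Python sketch against an in-memory stand-in for the server. Everything in it is hypothetical: `fetch_batch`, the batch size, and the cursor format (a plain offset) are assumptions made for the example, not Aggregate's real API or cursor.

```python
from typing import List, Tuple

BATCH_SIZE = 3  # hypothetical batch size, kept small for the demo

# Hypothetical in-memory stand-in for the submissions held by the server.
SUBMISSIONS = [f"uuid:{i:03d}" for i in range(7)]


def fetch_batch(cursor: str) -> Tuple[List[str], str]:
    """Return (batch, next_cursor). An empty cursor means 'from the start'."""
    start = int(cursor) if cursor else 0
    batch = SUBMISSIONS[start:start + BATCH_SIZE]
    return batch, str(start + len(batch))


def pull_all() -> Tuple[List[str], int]:
    """Ask for batches until an empty one arrives; count the requests made."""
    cursor = ""  # empty cursor for the first batch
    pulled: List[str] = []
    requests = 0
    while True:
        batch, cursor = fetch_batch(cursor)
        requests += 1
        if not batch:  # an empty batch ends the pull
            break
        pulled.extend(batch)
    return pulled, requests


pulled, requests = pull_all()
print(len(pulled), requests)  # 7 4
```

With 7 submissions and batches of 3, the pull takes 4 requests: three that return data and a final one that returns the empty batch.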
Here's an idea:
Briefcase stores the last cursor used for each form
We add a checkbox to "resume" the last pull in the Pull tab
When that's enabled, Briefcase sends the stored cursor instead of an empty one, effectively resuming the pull operation.
We could even build arbitrary cursors to resume pulls starting from different submissions, but the idea above seems like the smallest possible increment that brings value and would let us test this on the field.
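To make the idea concrete, here is a minimal Python sketch of a pull that can resume from a stored cursor. It is not Briefcase code: `FakeServer`, `pull`, the per-form `cursor_store` dict and the offset-style cursor are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

BATCH_SIZE = 2  # hypothetical batch size, kept small for the demo


class FakeServer:
    """Hypothetical in-memory stand-in for a server holding one form."""

    def __init__(self, submissions: List[str]) -> None:
        self.submissions = list(submissions)

    def fetch_batch(self, cursor: str) -> Tuple[List[str], str]:
        # An empty cursor means 'from the start'; otherwise it is an offset.
        start = int(cursor) if cursor else 0
        batch = self.submissions[start:start + BATCH_SIZE]
        return batch, str(start + len(batch))


def pull(server: FakeServer, form_id: str,
         cursor_store: Dict[str, str], resume: bool) -> Tuple[List[str], int]:
    """Pull batches until an empty one arrives, then store the last cursor."""
    cursor = cursor_store.get(form_id, "") if resume else ""
    pulled: List[str] = []
    requests = 0
    while True:
        batch, next_cursor = server.fetch_batch(cursor)
        requests += 1
        if not batch:
            break
        pulled.extend(batch)
        cursor = next_cursor
    cursor_store[form_id] = cursor  # remembered per form for the next run
    return pulled, requests


server = FakeServer(["a", "b", "c", "d"])
store: Dict[str, str] = {}
first = pull(server, "my-form", store, resume=True)    # nothing stored yet
server.submissions.append("e")                         # a new submission arrives
resumed = pull(server, "my-form", store, resume=True)  # starts at the stored cursor
full = pull(server, "my-form", store, resume=False)    # checkbox off: full pull
print(first, resumed, full)
```

In this toy run the resumed pull needs 2 requests and returns only the new submission, while the full pull needs 4 requests and returns all 5.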
Sounds like an excellent idea. I assume because the cursor is based on UUID that any new submissions that had taken a while to arrive would be in the new batches of data even if they had 'submission dates' that predated time of last pull.
In fact, the UUID is a secondary criterion used to filter what's part of a batch after we get the ordered list of submissions based on the last update date (this is the primary criterion).
Since Aggregate uses the last update date (which is metadata that Aggregate adds to every submission), instead of the submission date, we could expect those delayed submissions to make it to the batch, since their last update date would be effectively their reception date.
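Here is a small Python sketch of that ordering, to show why a delayed submission would still be picked up. The `Submission` fields, the tuple-shaped cursor and `next_batch` are simplifications assumed for the example; Aggregate's real cursor and query are not shown in this thread.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Submission:
    uuid: str
    submission_date: datetime  # the date users tend to think about
    last_update: datetime      # metadata set by the server on reception


# Hypothetical cursor: (last update date, UUID) of the last submission pulled.
Cursor = Tuple[datetime, str]


def next_batch(subs: List[Submission], cursor: Optional[Cursor],
               size: int) -> List[Submission]:
    """Order by last update date, break ties by UUID, continue after cursor."""
    ordered = sorted(subs, key=lambda s: (s.last_update, s.uuid))
    if cursor is not None:
        ordered = [s for s in ordered if (s.last_update, s.uuid) > cursor]
    return ordered[:size]


a = Submission("uuid:a", datetime(2020, 1, 1, 9), datetime(2020, 1, 1, 9))
b = Submission("uuid:b", datetime(2020, 1, 1, 12), datetime(2020, 1, 1, 12))
# Sent on Jan 1 but only received on Jan 2 at 10am, after the 3am pull.
late = Submission("uuid:late", datetime(2020, 1, 1, 18), datetime(2020, 1, 2, 10))

cursor_after_3am_pull: Cursor = (b.last_update, b.uuid)
batch = next_batch([a, b, late], cursor_after_3am_pull, size=100)
print([s.uuid for s in batch])  # ['uuid:late']
```

A filter on `submission_date` alone would have treated the late submission as belonging to Jan 1 and skipped it; ordering on `last_update` places it after the cursor.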
Turning this feature on by default could affect users with other workflows/workloads, but I like the idea of moving the checkbox to the Settings tab, since it would also be used when exporting with the "pull before export" conf param.
Could we settle on having a new "Always try to resume pulls" checkbox on the Settings tab and having it disabled by default?
Shipping it disabled by default feels safer because it has no downside at all. Enabling it by default, on the other hand, could confuse users with different workloads/workflows e.g. they could miss submissions until they realize they need to force full pulls.
I don't understand enough about the technical aspects, but sometimes we certainly set our workflows to delete downloaded XMLs after running analysis, and so in that situation we want to be able to redo a full pull.
Equally, for our very large datasets we don't do this, so we want to be able to pull from the last point only.
I can see arguments in both directions for the default setting.
Just a quick heads up to tell you that we've released Briefcase v1.14.0-beta.0 with a new "resume last pull" feature that we need help testing
The new feature will save time and resources by skipping all the instanceID batches that have already been pulled. In a form with 10 000 submissions, that means saving 100 HTTP requests and not having to check 10 000 submissions against the local instance database.
More details at the usual places:
If there's no major issue with the release, we will be releasing it next week.