Pull submissions starting from date

Hi, I am waking this old thread up as it has useful info on how to start a Briefcase query on old data from a point other than the start of time.

Here's why

  1. I have a lot of data on the server, but would prefer not to purge it as (a) it is safe there and (b) international partners may want direct download access

  2. When I use Briefcase to pull, the app spends a long, long time querying old submissions that have already been downloaded

  3. I'd like to start querying at a time-point at or near where the last pull left off (or to specify a date/time)

The idea in the thread above, to use a push to set the time-point for the next pull run, seems nice, but it is unclear:

(i) whether this would require dummy data to be added to the server
(ii) how to do it using Briefcase / the Briefcase CLI
(iii) whether Briefcase would remember the correct time-point if I used Briefcase with another server between pulls

Would there be scope for a future feature in Briefcase: "Start query at date:"?

1 Like

Given that I work with @chrissyhroberts, it is perhaps unsurprising that I agree it would be very useful to be able to say "Pull from X date" and do so from the command line interface.
Michael

Hi, @chrissyhroberts, @dr_michaelmarks!

I've split the post to discuss this as a new feature for Aggregate and Briefcase. Would you care to edit your post and give a little more context?

I've also seen how pulling data from servers with lots of submissions can be slow. It looks like there's room for optimization when users know which submission they last pulled to their computers.

I understand this would filter submissions by the submission date field.

I think we should explore what happens when submissions are delayed and don't get included in pulls like this.

Scenario:
Say I do a pull every day at 3am to get yesterday's submissions, but a submission sent yesterday only arrives at Aggregate at 10am, after my script has already run.

Could weird scenarios like this one take place when using Briefcase to pull submissions from Collect and push them to Aggregate?

Hi,
Thinking about your scenario, I would always set the date to start checking a couple of days or even weeks prior to the last pull. That way it can double-check for these stray submissions, but still not have to go back to the very beginning of time.

For instance, if I am on day 100 of a study, running through days 1-90 is pretty pointless, but 91-100 could catch late-arriving submissions whilst still saving 90% of the time.
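
To make that overlap idea concrete, here's a minimal sketch in Java (not Briefcase code; the last-pull date bookkeeping is assumed to be handled by your own wrapper script or config) of computing a start date with a safety margin:

    import java.time.LocalDate;

    // Minimal sketch: compute a pull start date with a safety margin.
    // The lastPullDate value is assumed to be recorded by your own
    // wrapper script or config; Briefcase does not provide it.
    public class PullWindow {
      static final int OVERLAP_DAYS = 7; // re-check the last week of submissions

      static LocalDate startDateFor(LocalDate lastPullDate) {
        return lastPullDate.minusDays(OVERLAP_DAYS);
      }

      public static void main(String[] args) {
        LocalDate lastPull = LocalDate.of(2018, 11, 1);
        System.out.println("Start querying at: " + startDateFor(lastPull)); // 2018-10-25
      }
    }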

1 Like

I've been studying the code and I have some insights regarding this feature proposal:

  • Briefcase pulls data from Aggregate in batches of a maximum of 100 submissions each.

  • A batch consists of an XML document with:

    • A list of submission instanceIDs
    • A "cursor" that can be used to get the next batch
  • A cursor is an XML document with:

    • The field used to order submissions. Currently we're using LAST_UPDATE_DATE
    • The last update date (ISO8601) of the last submission from the previous batch
    • The instanceID of the last submission from the previous batch
    • A boolean telling whether the cursor is a forward cursor (the purpose of this is yet to be determined)
    <cursor xmlns="http://www.opendatakit.org/cursor">
      <attributeName>_LAST_UPDATE_DATE</attributeName>
      <attributeValue>2018-11-07T14:43:24.644+0000</attributeValue>
      <uriLastReturnedValue>uuid:d9b67b6f-2058-469b-8cb1-c86b9c34b632</uriLastReturnedValue>
      <isForwardCursor>true</isForwardCursor>
    </cursor>
    
  • In order to get the first batch of submissions, Briefcase sends Aggregate an empty cursor

  • Briefcase will continue asking for more batches until an empty batch arrives, which ends the pull operation
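
To illustrate the loop described above, here is a rough Java sketch; the AggregateClient and Batch types are hypothetical stand-ins, not the actual Briefcase classes:

    import java.util.List;

    // Hypothetical stand-in for the HTTP client that requests instanceID batches.
    interface AggregateClient {
      Batch fetchIdBatch(String formId, String cursorXml);
    }

    // A batch: a list of instanceIDs plus the cursor that points to the next batch.
    class Batch {
      List<String> instanceIds = List.of();
      String cursorXml = "";

      boolean isEmpty() {
        return instanceIds.isEmpty();
      }
    }

    class PullLoop {
      // Start with an empty cursor to get the first batch, then keep asking
      // for more batches until an empty one arrives, which ends the pull.
      static void pullAll(AggregateClient client, String formId) {
        String cursor = "";
        Batch batch = client.fetchIdBatch(formId, cursor);
        while (!batch.isEmpty()) {
          for (String instanceId : batch.instanceIds) {
            // ...download and store the submission for instanceId here...
          }
          cursor = batch.cursorXml; // cursor returned with the last batch
          batch = client.fetchIdBatch(formId, cursor);
        }
      }
    }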

Here's an idea:

  • Briefcase stores the last cursor used for each form
  • We add a checkbox to "resume" the last pull in the Pull tab
  • When that's enabled, Briefcase sends the stored cursor instead of an empty one, effectively resuming the pull operation.

We could even build arbitrary cursors to resume pulls starting from different submissions, but the idea above seems like the smallest possible increment that brings value and would let us test this in the field.
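
As a rough sketch of that idea (file names and layout are assumptions, not Briefcase's actual storage), the per-form cursor could be persisted and reused like this:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical sketch of the "resume last pull" idea: persist the last
    // cursor for each form and reuse it on the next pull.
    class CursorStore {
      private final Path storageDir;

      CursorStore(Path storageDir) {
        this.storageDir = storageDir;
      }

      // Called after a pull completes, with the cursor from the last batch.
      void save(String formId, String cursorXml) throws IOException {
        Files.writeString(storageDir.resolve(formId + ".cursor.xml"), cursorXml);
      }

      // Returns the stored cursor if "resume" is enabled and one exists,
      // or an empty cursor (full pull) otherwise.
      String loadOrEmpty(String formId, boolean resumeEnabled) throws IOException {
        Path file = storageDir.resolve(formId + ".cursor.xml");
        if (resumeEnabled && Files.exists(file))
          return Files.readString(file);
        return "";
      }
    }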

Sounds like an excellent idea. I assume that, because the cursor is based on the UUID, any new submissions that had taken a while to arrive would be in the new batches of data even if they had 'submission dates' that predated the time of the last pull.

In fact, the UUID is a secondary criterion, used to decide which submissions are part of a batch after we get the list ordered by the last update date (the primary criterion).

Since Aggregate uses the last update date (metadata that Aggregate adds to every submission) instead of the submission date, we can expect those delayed submissions to make it into a batch, since their last update date is effectively their reception date.
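
For illustration, the ordering described above amounts to something like the comparator below; Submission is a hypothetical stand-in for whatever Aggregate stores internally:

    import java.time.OffsetDateTime;
    import java.util.Comparator;

    // Hypothetical stand-in for a stored submission record.
    class Submission {
      OffsetDateTime lastUpdateDate; // set by Aggregate when the submission is received/updated
      String instanceId;             // e.g. "uuid:d9b67b6f-2058-469b-8cb1-c86b9c34b632"

      Submission(OffsetDateTime lastUpdateDate, String instanceId) {
        this.lastUpdateDate = lastUpdateDate;
        this.instanceId = instanceId;
      }
    }

    class BatchOrdering {
      // Primary criterion: last update date. Secondary criterion: instanceID.
      static final Comparator<Submission> BATCH_ORDER =
          Comparator.comparing((Submission s) -> s.lastUpdateDate)
                    .thenComparing(s -> s.instanceId);
    }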

1 Like

I think we have the grounds for a new feature here. I'll document this in an issue so that we can discuss the coding aspects more comfortably.

2 Likes

@ggalmazor I'm really liking this idea and I'm moving it to the Features category.

Instead of the checkbox, it seems like we should just make things resumable by default? Seems pretty safe. And if we are worried we could add a global setting to do full downloads.

1 Like

Making this feature on by default could affect users with other workflows/workloads, but I like the idea of moving the checkbox to the Settings tab, since it would also be used when exporting with the "pull before export" configuration parameter.

Could we settle on having a new "Always try to resume pulls" checkbox on the Settings tab and having it disabled by default?

Maybe I'm slow, but what exact problems are you expecting?

Shipping it disabled by default feels safer because it has no downside at all. Enabling it by default, on the other hand, could confuse users with different workloads/workflows, e.g. they could miss submissions until they realize they need to force full pulls.

I don't understand enough about the technical aspects, but sometimes we set our workflows to delete the downloaded XMLs after running analysis, and in that situation we want people to redo a full pull.
Equally, for our very large datasets we don't do this, so we want to be able to pull from the last point only.

I can see arguments in both directions for the default setting.

1 Like

An additional point: we would want to be able to control this using the command line interface, as we use it to run all our automated scripts.

Thanks, @dr_michaelmarks! I'll add it to https://github.com/opendatakit/briefcase/issues/681

Just a quick heads up to tell you that we've released Briefcase v1.14.0-beta.0 with a new "resume last pull" feature that we need help testing :wink:

The new feature will save time and resources by skipping all the instanceID batches that have already been pulled. Since each batch holds at most 100 instanceIDs, in a form with 10 000 submissions that means saving about 100 HTTP requests and 10 000 lookups in the local instance database.

More details at the usual places:

If there are no major issues with the release, we will release it next week.

1 Like