Fast briefcase download?

Hi all,

There was a recent question on the users list asking how to export faster
from Aggregate using Briefcase:
https://groups.google.com/forum/#!topic/opendatakit/kHLMgxF334U

Ultimately, the user was directed to another thread where Mitch suggested
that pushing may speed up pulling "a bit":
https://groups.google.com/forum/#!msg/opendatakit/awkgN2psOZo/MzZZvDRiBwAJ

I was curious, so I took a look at the code. From what I can tell, both
push and pull get more expensive as the submission count grows because
they both consider the entire remote submission list.

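For upload:
https://github.com/opendatakit/briefcase/blob/b81c79384894939cb77d1f8877b4bcbcb3e6327f/src/org/opendatakit/briefcase/util/ServerUploader.java#L220

For download:
https://github.com/opendatakit/briefcase/blob/b81c79384894939cb77d1f8877b4bcbcb3e6327f/src/org/opendatakit/briefcase/util/ServerFetcher.java#L178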

If I understand correctly, the list is fetched serially in small chunks of
100. So for the user's 50,000 submissions, just fetching the list takes
around 500 HTTP requests. Additionally, based on that list, multiple I/O
operations are performed for each submission (some file system, some
network).
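
To put a number on the listing cost described above, here is a minimal Python sketch. It is not Briefcase code: `fetch_chunk`, `make_fake_server`, and the integer cursor are hypothetical stand-ins, and the chunk size of 100 comes from the reading above. It only models a client that has to walk the whole list serially, one chunk at a time.

```python
import math

CHUNK_SIZE = 100  # chunk size as described above (assumption from reading the code)


def requests_needed(total_submissions, chunk_size=CHUNK_SIZE):
    """Number of list requests needed to see every submission ID."""
    return math.ceil(total_submissions / chunk_size)


def make_fake_server(total):
    """Hypothetical in-memory server that hands out IDs in chunks."""
    all_ids = [f"uuid:{i}" for i in range(total)]

    def fetch_chunk(cursor, limit):
        start = cursor or 0
        end = min(start + limit, total)
        # A cursor of None signals that the list is exhausted.
        next_cursor = end if end < total else None
        return all_ids[start:end], next_cursor

    return fetch_chunk


def list_all_ids(fetch_chunk, chunk_size=CHUNK_SIZE):
    """Walk the whole list serially; return the IDs and the request count."""
    ids = []
    requests = 0
    cursor = None
    while True:
        chunk, cursor = fetch_chunk(cursor, chunk_size)
        requests += 1
        ids.extend(chunk)
        if cursor is None:
            break
    return ids, requests
```

Calling `list_all_ids(make_fake_server(50000))` walks 500 chunks before any single submission is processed, which matches the estimate above. A real server may need an extra request to discover the end of the list.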

I'm not sure what metadata Mitch was referring to when he stated that
pushing may help a bit, but from what I can see the biggest speedup is
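skipping a submission entirely:
https://github.com/opendatakit/briefcase/blob/b81c79384894939cb77d1f8877b4bcbcb3e6327f/src/org/opendatakit/briefcase/util/ServerFetcher.java#L350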

However, it appears that the data driving this, the recorded_instances
table, only has new records added during download. Upload only removes
entries. So it doesn't appear that a push would not improve things.
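
The bookkeeping argument can be made concrete with a toy model. This is a sketch of the behavior as read from the code, not the actual implementation: the `RecordedInstances` class and its `pull`/`push` methods are hypothetical names, and a Python set stands in for the recorded_instances table. It assumes a pull skips any instance already recorded and records what it downloads, while a push only removes entries.

```python
class RecordedInstances:
    """Toy stand-in for the recorded_instances table (hypothetical API)."""

    def __init__(self):
        self.recorded = set()

    def pull(self, remote_ids):
        """Download everything not yet recorded; return how many were fetched."""
        downloaded = 0
        for instance_id in remote_ids:
            if instance_id in self.recorded:
                continue  # the cheap path: skip the submission entirely
            downloaded += 1
            self.recorded.add(instance_id)  # only a download adds a record
        return downloaded

    def push(self, local_ids):
        """Upload local submissions; in this reading it only removes records."""
        for instance_id in local_ids:
            self.recorded.discard(instance_id)


table = RecordedInstances()
remote = ["uuid:a", "uuid:b", "uuid:c"]
first_pull = table.pull(remote)    # nothing recorded yet, so all are fetched
second_pull = table.pull(remote)   # everything recorded, so all are skipped
table.push(remote)                 # removes the records again
third_pull = table.pull(remote)    # nothing left to skip on
```

Under these assumptions a push never adds an entry that a later pull could use to skip work, so it cannot make the pull any cheaper.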

Since I can't tell what metadata Mitch was referring to in his post, I
can't tell if he was referring to something else entirely. Does anyone know
why a push makes a pull faster?

Thanks,

Brent

Sorry, an unfortunate typo. That should have been:

However, it appears that the data driving this, the recorded_instances
table, only has new records added during download. Upload only removes
entries. So it doesn't appear that a push would improve things.

On Tue, Aug 30, 2016 at 4:50 PM, Brent Atkinson wrote:

Hi all,

There was a recent question on the user's list asking about how they
export faster from Aggregate using Briefcase:
https://groups.google.com/forum/#!topic/opendatakit/kHLMgxF334U

Ultimately, the user was directed to another thread where Mitch suggested
that pushing may speed up pulling "a bit":
https://groups.google.com/forum/#!msg/opendatakit/awkgN2psOZo/MzZZvDRiBwAJ

I was curious, so I took a look at the code. From what I can tell, both
push and pull get more expensive as the submission count grows because
they both consider the entire remote submission list.

For upload:
https://github.com/opendatakit/briefcase/blob/b81c79384894939cb77d1f8877b4bcbcb3e6327f/src/org/opendatakit/briefcase/util/ServerUploader.java#L220

For download:
https://github.com/opendatakit/briefcase/blob/b81c79384894939cb77d1f8877b4bcbcb3e6327f/src/org/opendatakit/briefcase/util/ServerFetcher.java#L178

If I understand correctly, the list is fetched serially in small chunks of
100. So for the user's 50,000 submissions, just fetching the list takes
around 500 HTTP requests. Additionally, based on that list multiple I/O
operations are performed for each submission (some file system, some
network).

I'm not sure what metadata Mitch was referring to when he stated that
pushing may help a bit, but from what I can see the biggest speedup is
skipping a submission entirely:

https://github.com/opendatakit/briefcase/blob/b81c79384894939cb77d1f8877b4bcbcb3e6327f/src/org/opendatakit/briefcase/util/ServerFetcher.java#L350

However, it appears that the data driving this, the recorded_instances
table, only has new records added during download. Upload only removes
entries. So it doesn't appear that a push would not improve things.

Since I can't tell what metadata Mitch was referring to in his post, I
can't tell if he was referring to something else entirely. Does anyone know
why a push makes a pull faster?

Thanks,

Brent