I was curious, so I took a look at the code. From what I can tell, it appears
both push and pull get more expensive as the submission count grows
because they both consider the entire remote submission list.
For upload:
For download:
If I understand correctly, the list is fetched serially in small chunks of
100. So for the user's 50,000 submissions, just fetching the list takes
around 500 HTTP requests. Additionally, based on that list multiple I/O
operations are performed for each submission (some file system, some
network).
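As a back-of-the-envelope sketch of the listing cost alone (the chunk size of 100 comes from the observations above; the function name is mine, not Briefcase code):

```python
import math

CHUNK_SIZE = 100  # chunk size observed above; assumed fixed

def list_requests_needed(total_submissions, chunk_size=CHUNK_SIZE):
    """HTTP requests needed just to page serially through the id list."""
    return math.ceil(total_submissions / chunk_size)

# For the user's 50,000 submissions:
print(list_requests_needed(50_000))  # 500 serial round trips
```

Because the chunks are fetched serially, the listing time grows linearly with the submission count, before any of the per-submission I/O even starts.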
I'm not sure what metadata Mitch was referring to when he stated that
pushing may help a bit, but from what I can see the biggest speedup is
skipping a submission entirely:
However, it appears that the data driving this, the recorded_instances
table, only has new records added during download. Upload only removes
entries. So it doesn't appear that a push would not improve things.
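To make that concrete, here is an illustrative sketch (my own model, not Briefcase's actual code) of the bookkeeping described above, with the recorded_instances table stood in by a plain set:

```python
recorded_instances = set()  # stand-in for the real recorded_instances table

def pull(remote_instance_ids):
    """Download submissions, skipping any already recorded locally."""
    downloaded = []
    for instance_id in remote_instance_ids:
        if instance_id in recorded_instances:
            continue  # the big speedup: skip the submission entirely
        downloaded.append(instance_id)       # stand-in for the real transfer
        recorded_instances.add(instance_id)  # download is the only writer
    return downloaded

def push(local_instance_ids):
    """Upload only removes entries, so it never seeds future skips."""
    for instance_id in local_instance_ids:
        recorded_instances.discard(instance_id)

pull(["uuid:a", "uuid:b"])            # downloads and records both
pull(["uuid:a", "uuid:b", "uuid:c"])  # skips a and b, fetches only c
push(["uuid:a"])                      # a later pull would re-fetch uuid:a
```

Under this model, a push can only shrink the skip data that makes a subsequent pull fast, which is why a push helping a pull is puzzling.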
Since I can't tell what metadata Mitch was referring to in his post, he may
have meant something else entirely. Does anyone know why a push makes a pull
faster?
Sorry, an unfortunate typo. That should have been:
However, it appears that the data driving this, the recorded_instances
table, only has new records added during download. Upload only removes
entries. So it doesn't appear that a push would improve things.
···
On Tue, Aug 30, 2016 at 4:50 PM, Brent Atkinson wrote:
Ultimately, the user was directed to another thread where Mitch suggested
that pushing may speed up pulling "a bit":
https://groups.google.com/forum/#!msg/opendatakit/awkgN2psOZo/MzZZvDRiBwAJ
I was curious, so I took a look at the code. From what I can tell, it
appears both push and pull get more expensive as the submission count
grows because they both consider the entire remote submission list.
If I understand correctly, the list is fetched serially in small chunks of
100. So for the user's 50,000 submissions, just fetching the list takes
around 500 HTTP requests. Additionally, based on that list multiple I/O
operations are performed for each submission (some file system, some
network).
I'm not sure what metadata Mitch was referring to when he stated that
pushing may help a bit, but from what I can see the biggest speedup is
skipping a submission entirely:
However, it appears that the data driving this, the recorded_instances
table, only has new records added during download. Upload only removes
entries. So it doesn't appear that a push would not improve things.
Since I can't tell what metadata Mitch was referring to in his post, he may
have meant something else entirely. Does anyone know why a push makes a pull
faster?