Ways to speed up ODK Briefcase

We are using ODK for a malaria testing campaign. Everything in terms of
data collection is running well; however, there are going to be a rather
large number (we anticipate 50,000) of encrypted records to download from our
Aggregate instance, and the process with ODK Briefcase is very slow even
though the individual forms are relatively small. Is there any way to speed
up the download process?

Assuming you always use the same ODK Briefcase Storage Location, the
download will resume from the point at which it last stopped -- it keeps
track of the last set of data pulled from the server, and resumes the pull
from after that point.

So I don't think it can be made any more efficient.

If you delete or use a new ODK Briefcase Storage Location, then you will
always be pulling every record down, and that will take a long time.
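
To picture why the storage location matters, here is a minimal sketch of a
resumable pull. This is not Briefcase's actual code: the `cursor.json` file
name, `pull`, and `fake_fetch` are made-up names that only model the
behaviour described above.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the records held by the server (illustrative data only).
SERVER = [f"record-{i}" for i in range(10)]


def fake_fetch(cursor, size=4):
    """Hypothetical server call: return up to `size` records after `cursor`."""
    return SERVER[cursor:cursor + size]


def pull(storage_dir, fetch_batch):
    """Pull all records not yet seen, resuming from the saved position."""
    cursor_file = Path(storage_dir) / "cursor.json"  # hypothetical file name
    cursor = json.loads(cursor_file.read_text()) if cursor_file.exists() else 0
    pulled = []
    while True:
        records = fetch_batch(cursor)
        if not records:
            return pulled
        pulled.extend(records)
        cursor += len(records)
        # Persist progress so an interrupted pull can resume from here.
        cursor_file.write_text(json.dumps(cursor))


storage = tempfile.mkdtemp()
first = pull(storage, fake_fetch)               # pulls everything
second = pull(storage, fake_fetch)              # same location: nothing left
fresh = pull(tempfile.mkdtemp(), fake_fetch)    # new location: starts over
```

In this model a second pull against the same directory has nothing left to
fetch, while a pull against a fresh directory starts again from the first
record.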

Mitch

··· On Fri, Jun 7, 2013 at 4:21 AM, dj_bridges wrote:

--

Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en


You received this message because you are subscribed to the Google Groups
"ODK Community" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to opendatakit+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Mitch Sundt
Software Engineer
University of Washington
mitchellsundt@gmail.com

Mitch,

Would parallelizing the process work? So split your enumerators into n
groups and have n instances of Aggregate being hit by n copies of
Briefcase on n computers.

Yaw

··· -- Need ODK help? Go to http://nafundi.com for custom features, form design, implementation support, and user training for ODK.

The limitation will be the bandwidth and throughput and/or
transmission-rate throttling of the network down to the client -- not the
speed of Google infrastructure to vend the data, so this would not help.

Google's BigTable will also make this slower than a real database, but the
dominating component will be network bandwidth.

Mitch

I am by no means an expert here, but fetching each form individually seems inefficient and time consuming. Is there a reason the client could not open multiple connections to the server and request the forms in parallel? All of the end-points should be known from the submission list, so it would seem to be just a matter of splitting the download list into multiple parts and starting a download thread for each part.
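
A rough sketch of that idea, under the assumption that each submission can be
fetched independently. `fetch_submission` is a placeholder rather than a real
Briefcase or Aggregate call, and the `uuid:` identifiers are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def fetch_submission(submission_id):
    # Placeholder for one HTTP request; a real client would download here.
    return f"payload for {submission_id}"


def download_chunk(chunk):
    """Download one part of the list, one submission after another."""
    return [fetch_submission(s) for s in chunk]


def download_all(submission_ids, threads=4):
    """Give each thread one chunk and return the payloads in list order."""
    chunks = split(submission_ids, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(download_chunk, chunks))  # keeps chunk order
    return [payload for chunk in results for payload in chunk]


ids = [f"uuid:{i}" for i in range(10)]
payloads = download_all(ids, threads=3)
```

Whether this actually helps depends on where the time goes: threads hide
per-request latency, but they cannot add bandwidth.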

Regards,
Jason

Thanks for all the comments.

While I know we certainly have less than perfect network speeds, it does
seem that, for the size of the download, it progresses far slower than an
equivalent download would. Then again, perhaps I am just downloading at a
time when the network is slow. As you point out, Mitch, at least one can
resume from the previous point rather than having to start from scratch
each time.

Cheers
Dan

Just an update. I managed to replicate the server to a local machine, and the forms are downloading at a MUCH faster rate. I guess this is due to network latency between our server (Ireland) and the Briefcase clients here (Zambia). Not much we can do about that unfortunately, but it would be very nice to have a way to download all of this in bulk somehow. The CSV exports are very fast, and the exported CSV file downloads reasonably quickly, but the multiple requests of the ServerFetcher seem to slow things down in our situation for some reason. Hopefully my other post (about the Briefcase decryption) will go away after I manage to get all the forms.

Hi there. I am also working with Dan on this project, and am responsible for the server. We are using an Amazon instance, and I do not think the server bandwidth is the issue.
Would it be possible to mirror to a local server to speed up this process? Bandwidth consumption is about 5 kB/s, far below the bandwidth of the DSL line I am working on. It takes about 2 seconds per record, so we are looking at a day or more of downloads for the entire record set. I am just not convinced that bandwidth is the real issue here.

Thanks,
Jason
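
The figures above can be checked with a few lines of arithmetic. The record
count, per-record time and transfer rate are taken from the posts; the
per-record size and total volume are derived values that assume the 5 kB/s
rate holds for the full 2 seconds and that 1 MB = 1000 kB.

```python
records = 50_000          # anticipated number of submissions
seconds_per_record = 2    # observed time per record
rate_kb_per_s = 5         # observed bandwidth consumption

total_seconds = records * seconds_per_record
total_hours = total_seconds / 3600

# Derived, not measured: data moved per record and overall.
kb_per_record = rate_kb_per_s * seconds_per_record
total_mb = records * kb_per_record / 1000

print(f"{total_hours:.1f} hours, about {total_mb:.0f} MB in total")
# prints: 27.8 hours, about 500 MB in total
```

At roughly 10 kB per record, the time per record looks dominated by waiting
rather than by data transfer, which is consistent with the doubt about
bandwidth being the real issue.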

Issuing multiple concurrent requests could speed up downloads provided
there is enough network bandwidth to your computer. The transmission of
the data request, set-up and access to that data on the server accounts for
a negligible amount of bandwidth and delay.

However, for most users, network bandwidth is the limiting factor. If you
open 10 simultaneous downloads, and if your network only has capacity for
at most 2 concurrent download streams, each of these 10 downloads is going
to take up to 5 times longer (because they are all interleaved and there is
no extra capacity).

We did not add multiple concurrent requests because it can severely degrade
Google App Engine interactions. Google App Engine has a 60-second
transmission limit, after which the Google infrastructure may cut off the
transmission. When the network has very little bandwidth, interleaving
requests stretches out the individual transmissions and can trip that limit
and lead to failures.
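
A back-of-the-envelope model of that effect, assuming the link is shared
evenly between requests (a simplification). The 60-second figure comes from
the paragraph above; the 15-second base transfer time in the usage lines is
a made-up value.

```python
LIMIT_SECONDS = 60  # the App Engine transmission limit mentioned above


def stretched_seconds(base_seconds, concurrent, capacity_streams):
    """Duration of one request when `concurrent` requests share a link that
    can carry `capacity_streams` at full speed (simple fair-share model)."""
    slowdown = max(1.0, concurrent / capacity_streams)
    return base_seconds * slowdown


def trips_limit(base_seconds, concurrent, capacity_streams):
    """True if the stretched request would exceed the transmission limit."""
    return stretched_seconds(base_seconds, concurrent, capacity_streams) > LIMIT_SECONDS


# 10 downloads on a link with room for 2: each takes 5 times longer.
slow = stretched_seconds(2, concurrent=10, capacity_streams=2)    # 10.0
# A hypothetical 15-second transfer stretched 5x becomes 75 s, over the limit.
fails = trips_limit(15, concurrent=10, capacity_streams=2)        # True
```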

Thus, any concurrent-request logic needs to adjust its level of concurrency
to account for the available bandwidth. If someone were to code up a
solution, we would review it and consider adding it to the code base, but
there should be a way to disable this and force one-by-one access.
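
One possible shape for such logic, as a sketch only. It uses
additive-increase/multiplicative-decrease, a technique borrowed from
congestion control and not something proposed in this thread. The class
name, the `serial_only` switch and the "half the limit" threshold are all
assumptions.

```python
class ConcurrencyController:
    """Adjust the number of parallel requests from observed request times."""

    def __init__(self, max_workers=8, limit_seconds=60.0, serial_only=False):
        # serial_only forces one-by-one access regardless of measurements.
        self.max_workers = 1 if serial_only else max_workers
        self.limit_seconds = limit_seconds
        self.workers = 1  # always start with one request at a time

    def record(self, request_seconds):
        """Update and return the allowed concurrency after one request."""
        if request_seconds > 0.5 * self.limit_seconds:
            # Getting close to the cutoff: back off quickly.
            self.workers = max(1, self.workers // 2)
        elif self.workers < self.max_workers:
            # Plenty of headroom: probe with one more parallel request.
            self.workers += 1
        return self.workers


controller = ConcurrencyController(max_workers=4)
history = [controller.record(t) for t in (2, 2, 2, 2, 40, 40, 40)]
# history == [2, 3, 4, 4, 2, 1, 1]
```

A downloader would call `record` after each completed request and size its
worker pool from the returned value.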

Mitch

Sounds like a Tomcat/server configuration issue.

The only thing you need to copy is the MySQL database, including all of its
blob storage. Once you have that replicated, you can set up any ODK
Aggregate to point to that new database, and then start using it (you
should probably use the create_user_and_db.sql script that the installer
creates before replicating the AWS database into that new instance).

When setting up the new (local) ODK Aggregate, if you specify the same 'ODK
Aggregate Instance Name', then you should not see any change, other than in
the hostname of the server. Everything should work as-is. If you change
that name, then the passwords for all the ODK Aggregate usernames will need
to be reset (they will all be invalidated if the name is not identical to
the original installation).

Mitch

Hi Mitch,
Thanks for the quick reply.

On Friday, August 9, 2013 8:51:26 PM UTC+2, Mitch Sundt wrote:
> Sounds like a Tomcat/server configuration issue.

Could you be more explicit about what the configuration issue might be? CPU usage is low. The server does not appear to be stressed, and ODK Aggregate runs fast through the UI, so I am not sure what the bottleneck could be.

I can certainly try and replicate to a local server and see what happens.

I ran up against this as well: at about record 530 of 50,000+, the downloads stop.

09-Aug-2013 20:56:12 org.opendatakit.aggregate.parser.BaseFormParserForJavaRosa$XFormParserWithBindEnhancements parseBind
INFO: Calling handle found value null
Warning: 1 Unrecognized attributes found in Element [input] and will be ignored: [accuracyThreshold] Location:

Problem found at nodeset: /html/body/group[@appearance=field-list][@ref=/data/Location]/input
With element <input ref="/data/Location/GPS" accuracyThreshold="8">

Parsing form...
Title: "Encrypted Form"
09-Aug-2013 20:56:12 org.opendatakit.aggregate.parser.BaseFormParserForJavaRosa$XFormParserWithBindEnhancements parseBind
INFO: Calling handle found value 2048
09-Aug-2013 20:56:12 org.opendatakit.aggregate.parser.BaseFormParserForJavaRosa$XFormParserWithBindEnhancements parseBind
INFO: Calling handle found value null
09-Aug-2013 20:56:12 org.opendatakit.aggregate.parser.BaseFormParserForJavaRosa$XFormParserWithBindEnhancements parseBind
INFO: Calling handle found value null
09-Aug-2013 20:56:12 org.opendatakit.aggregate.parser.BaseFormParserForJavaRosa$XFormParserWithBindEnhancements parseBind
INFO: Calling handle found value null
09-Aug-2013 20:56:12 org.opendatakit.aggregate.parser.BaseFormParserForJavaRosa$XFormParserWithBindEnhancements parseBind
INFO: Calling handle found value 2048

Are you using the latest versions of ODK Aggregate and ODK Briefcase?

My initial thought is that the database thread pool and connection pool are
somehow not being reaped. The stall could be due to running out of file
handles on the operating system. After a connection is released, there is
an idle period before it is reused, and a further period before it is
closed.

You could also check that there aren't an excessive number of connections
held open by Briefcase back to the server. The pull should be fully
serialized, so you should only see a few connections active (e.g., with
netstat).
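
A small sketch of that check, assuming a Linux server where `netstat -tn`
prints the usual six-column TCP table. Port 8080 is an assumed Tomcat port,
and the sample rows use documentation-only addresses, not real output from
this deployment.

```python
import subprocess


def count_established(netstat_output, port):
    """Count ESTABLISHED TCP connections whose local or remote port is `port`."""
    wanted = str(port)
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        if wanted in (local.rsplit(":", 1)[-1], foreign.rsplit(":", 1)[-1]):
            count += 1
    return count


def live_netstat():
    """Run `netstat -tn` and return its output (needs netstat installed)."""
    result = subprocess.run(["netstat", "-tn"], capture_output=True,
                            text=True, check=True)
    return result.stdout


# Made-up sample in the same layout, so the parser can be tried offline.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.5:8080          198.51.100.7:51234      ESTABLISHED
tcp        0      0 192.0.2.5:8080          198.51.100.7:51240      TIME_WAIT
tcp        0      0 192.0.2.5:22            198.51.100.7:50022      ESTABLISHED
"""

active = count_established(SAMPLE, 8080)  # 1
```

On the server itself, `count_established(live_netstat(), 8080)` would give
the live figure; a fully serialized pull should keep it small.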

Hi Mitch,

We are using 1.3.1 on the server and Briefcase 1.3.2. The back-end database is PostgreSQL 9.2. Everything runs on an AWS Linux server.

I checked both netstat and the DB. I have 100 connections in the DB, of which at most about 61 seem to be active during a form download (normally around 40). There is nothing else running on the server besides Tomcat, Postgres and Nginx (reverse proxy). I went ahead and increased the number of files as documented in another post, but see no appreciable difference in throughput.

I am wondering if this has anything to do with the max_allowed_packet setting which was mentioned in that other post. There is no equivalent parameter in Postgres that I know of. I could try upping the number of connections in Postgres, but this does not seem to be the issue. :-/

Regards,
Jason

Below is a small snippet from the logs during the download process, which might help in terms of viewing the timing.

INFO: FormCache: fetching new list of Forms
Aug 12, 2013 7:42:58 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: incoming- last Fetch: -7602 [S: -231779 Eq: -173810 Fs: 726190] futureMillis: 726190
Aug 12, 2013 7:42:58 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: -fetched- last Fetch: 0 [S: -231779 Eq: -173810 Fs: 726190] futureMillis: -1
Aug 12, 2013 7:43:05 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: fetching new list of Forms
Aug 12, 2013 7:43:05 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: incoming- last Fetch: -6444 [S: -238223 Eq: -180254 Fs: 719746] futureMillis: 719746
Aug 12, 2013 7:43:05 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: -fetched- last Fetch: 0 [S: -238223 Eq: -180254 Fs: 719746] futureMillis: -1
Aug 12, 2013 7:43:06 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: using cached list of Forms
Aug 12, 2013 7:43:06 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: using cached list of Forms
Aug 12, 2013 7:43:08 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: fetching new list of Forms
Aug 12, 2013 7:43:08 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: incoming- last Fetch: -3551 [S: -241774 Eq: -183805 Fs: 716195] futureMillis: 716195
Aug 12, 2013 7:43:08 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: -fetched- last Fetch: 0 [S: -241774 Eq: -183805 Fs: 716195] futureMillis: -1
Aug 12, 2013 7:43:12 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: fetching new list of Forms
Aug 12, 2013 7:43:12 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: incoming- last Fetch: -3227 [S: -245001 Eq: -187032 Fs: 712968] futureMillis: 712968
Aug 12, 2013 7:43:12 AM org.opendatakit.aggregate.util.BackendActionsTable logValues
INFO: -fetched- last Fetch: 0 [S: -245001 Eq: -187032 Fs: 712968] futureMillis: -1
Aug 12, 2013 7:43:12 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: using cached list of Forms
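
To read the timing out of a log like this, one option is to compute the gaps
between consecutive "fetching new list of Forms" events. The sketch below
assumes the two-line java.util.logging layout shown above and an English
locale for month names. `fetch_times` is a made-up helper, and the sample is
a shortened excerpt of the snippet.

```python
from datetime import datetime

STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"  # e.g. "Aug 12, 2013 7:42:58 AM"
MARKER = "FormCache: fetching new list of Forms"


def fetch_times(log_text):
    """Return the timestamps of the header lines preceding each MARKER line."""
    times, previous = [], None
    for line in log_text.splitlines():
        if MARKER in line and previous:
            stamp = " ".join(previous.split()[:5])
            try:
                times.append(datetime.strptime(stamp, STAMP_FORMAT))
            except ValueError:
                pass  # previous line did not start with a timestamp
        previous = line
    return times


SAMPLE_LOG = """\
Aug 12, 2013 7:43:05 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: fetching new list of Forms
Aug 12, 2013 7:43:06 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: using cached list of Forms
Aug 12, 2013 7:43:08 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: fetching new list of Forms
Aug 12, 2013 7:43:12 AM org.opendatakit.aggregate.form.FormFactory internalGetForms
INFO: FormCache: fetching new list of Forms
"""

times = fetch_times(SAMPLE_LOG)
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)  # [3.0, 4.0]
```

Gaps of a few seconds between form-list fetches are in the same range as the
roughly 2 seconds per record reported earlier in the thread.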