Fast export of data from Aggregate server

Gunnarroe · August 30, 2016, 11:46am

I am currently running an Aggregate server and need to export a csv of all records every hour. So far I have been running briefcase on the same server to download all the records and export a csv file. Due to the large number of records in the forms this has now become way too slow and I therefore need a faster way of exporting the data to csv.

What is the fastest way to export the data, keeping in mind that the export needs to be automated and will run on the same server as Aggregate? Is there any API functionality for Aggregate's CSV export? Maybe it would be possible to access the CSV files generated by Aggregate on the server's file system? Any help to speed up this process would be greatly appreciated.

Gunnar

Batkinson · August 30, 2016, 2:18pm

Gunnar,

Is there a reason you need to re-export the same records every time?
As you are finding, such a process gets slower as the number of records
increases. If you are doing this for automation, it makes more sense to
have Aggregate publish new submissions rather than re-exporting
everything every hour.

https://opendatakit.org/use/aggregate/data-transfer/#Publishing

Brent


--

Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

You received this message because you are subscribed to the Google Groups
"ODK Community" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to opendatakit+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gunnarroe · August 30, 2016, 2:40pm

We have a fairly complicated data pipeline where we need complete csv files of all the records every hour. In the future we are planning to rewrite this whole process and will probably then use the publish feature in Aggregate.

Since this will take some significant development time to set up I was just hoping there was a way I could speed up the way we are currently exporting our data.

Thanks,

Gunnar


Batkinson · August 30, 2016, 3:16pm

Hi Gunnar,

So, depending on that constraint you may not have many options. It might be
possible to get a minor speedup, but until you can eliminate the constraint
on exporting all records, you can't escape the problem of that slow-down.
Apologies if you know all this, but here are the options I can think of:

1. Use an SQL tool to export form records directly from Aggregate's
   database
2. Use the bulk export from Aggregate's web site directly rather than
   going through Briefcase

Both of these approaches have trade-offs.

For the first, Aggregate's internal database format is simple enough for
flat forms (ones with simple questions and no nested structures), but in
many cases it isn't easy to generate the same CSV that Briefcase generates.
For example, repeats and multi-select questions are stored in separate
tables. You will have to run multiple queries per form to generate a single
unified CSV depending on the form. The performance gains may not be worth
the effort.
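
To give a feel for the multi-table problem, here is a rough sketch in Python. It does not use Aggregate's real schema: the tables `household` and `household_member` and the columns `uri`, `parent_uri`, `village` and `member_name` are all made up for illustration, and it runs against an in-memory SQLite database rather than Aggregate's actual database. It only shows the general shape of joining a repeat table back onto its parent to get one CSV row per submission.

```python
import csv
import io
import sqlite3


def make_demo_db():
    """Build an in-memory database with a made-up parent table and repeat table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE household (uri TEXT PRIMARY KEY, village TEXT);
        CREATE TABLE household_member (parent_uri TEXT, member_name TEXT);
        INSERT INTO household VALUES ('uuid:1', 'Kibera'), ('uuid:2', 'Mathare');
        INSERT INTO household_member VALUES ('uuid:1', 'Asha'), ('uuid:1', 'Juma');
        """
    )
    return conn


def export_flat_csv(conn):
    """Return CSV text with one row per submission.

    Values from the (hypothetical) repeat table are collapsed into a single
    ';'-separated column. Submissions without repeat rows get an empty cell.
    """
    rows = conn.execute(
        """
        SELECT p.uri, p.village, GROUP_CONCAT(c.member_name, ';') AS members
        FROM household AS p
        LEFT JOIN household_member AS c ON c.parent_uri = p.uri
        GROUP BY p.uri, p.village
        ORDER BY p.uri
        """
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["uri", "village", "members"])
    writer.writerows(rows)  # None (no repeat rows) is written as an empty cell
    return out.getvalue()


if __name__ == "__main__":
    print(export_flat_csv(make_demo_db()))
```

Even in this toy case the output is not what Briefcase would produce, since Briefcase writes repeats to separate CSV files. A real form with several repeats and multi-selects would need one such join per child table.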

For the second, it might be possible to get some speedup by scraping the
web-based CSV export from Aggregate. However, I think the format isn't
exactly the same as what Briefcase generates for CSV and it may be more
fragile. Also, I'm not entirely sure the gains would be worth the
additional complexity depending on the forms you're using.

If anyone else in the community knows better, correct me mercilessly.

Brent


yanokwa · August 30, 2016, 3:25pm

Brent,

Good options, but my gut is that SQLing and scraping are too much work.

One option is to tweak the Briefcase code to push after each pull and
that should speed things up. Mitch suggested this a few weeks back:
https://groups.google.com/d/msg/opendatakit/awkgN2psOZo/MzZZvDRiBwAJ.
A code contribution here would be helpful to the community.

Another option is to use the publishing system to stream submissions
to a JSON endpoint and do with it as you will. You can also stream the
data to Spreadsheets or Fusion Tables and export the data using those
APIs. https://opendatakit.org/use/aggregate/data-transfer has a write
up on what you can get out of each of these options.
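
As a rough illustration of the JSON endpoint idea, here is a minimal Python sketch of a receiver that appends each published submission to a CSV as it arrives, so the hourly job only has to copy a file. Several things here are assumptions and not taken from the Aggregate documentation: the payload is assumed to be a JSON object with a `data` list of flat records, and the field names in `FIELDS`, the output path `submissions.csv` and port 8000 are hypothetical. Check the actual payload your Aggregate version sends before relying on this.

```python
import csv
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

CSV_PATH = "submissions.csv"            # hypothetical output file
FIELDS = ["instanceID", "name", "age"]  # hypothetical form fields


def append_submissions(payload, csv_path=CSV_PATH, fields=FIELDS):
    """Append every record in payload["data"] to csv_path; return rows written.

    Assumes (not confirmed) that the publisher sends {"data": [ {...}, ... ]}.
    Unknown keys are ignored and missing keys are written as empty cells.
    """
    records = payload.get("data", [])
    new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=fields, extrasaction="ignore", restval=""
        )
        if new_file:
            writer.writeheader()
        writer.writerows(records)
    return len(records)


class PublishHandler(BaseHTTPRequestHandler):
    """Accept POSTed JSON and append its records to the CSV."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        append_submissions(payload)
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Run the receiver until interrupted."""
    HTTPServer(("", port), PublishHandler).serve_forever()
```

Calling `serve()` starts the receiver. Note that this only handles flat records; repeats would need their own handling, and there is no authentication or de-duplication here.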

Yaw



Gunnarroe · August 30, 2016, 4:15pm

Thanks Brent and Yaw,

Those are helpful suggestions. I do not have much Java experience, but I wrote a very simple Python script to try to download the data myself. What I noticed was that the actual download was no faster than Briefcase, but that it might be possible to speed up the process when many of the records have already been downloaded.

So essentially I now just rerun Briefcase every hour using the same storage folder. It does run slightly faster when this storage folder already includes almost all of the records, but with our roughly 50,000 submissions it still takes at least half an hour to download just the new submissions. Are there any speedups to be gained here?
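
The "skip what is already downloaded" idea can be sketched roughly like this in Python. This is only an illustration, not how Briefcase works internally: the layout assumed here (one folder per submission, named after the instance ID with the colon removed) and the helper names `folder_name` and `missing_instance_ids` are hypothetical, and getting the list of IDs from the server is left out entirely.

```python
import os


def folder_name(instance_id):
    """Map an instance ID to a folder name (hypothetical rule: drop colons)."""
    return instance_id.replace(":", "")


def missing_instance_ids(server_ids, instances_dir):
    """Return the IDs from server_ids that have no folder under instances_dir.

    The result keeps the order of server_ids. A missing instances_dir is
    treated as "nothing downloaded yet".
    """
    if os.path.isdir(instances_dir):
        local = set(os.listdir(instances_dir))
    else:
        local = set()
    return [i for i in server_ids if folder_name(i) not in local]
```

With something like this, only the IDs returned by `missing_instance_ids` would need to be fetched. Listing all IDs on the server still takes time that grows with the number of submissions, so it does not remove the underlying cost.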

Thanks for all the help

Gunnar


Gunnarroe · August 30, 2016, 5:15pm

Thanks


Batkinson · August 30, 2016, 5:06pm

Gunnar,

I think you got your answer:

1. What you're doing is inherently slow - prefer publishing over bulk
   exporting
2. It may be possible to speed up Briefcase by issuing a push after
   pulling
   (https://groups.google.com/d/msg/opendatakit/awkgN2psOZo/MzZZvDRiBwAJ)

Brent

···

On Tue, Aug 30, 2016 at 12:15 PM, wrote:

Thanks Brent and Yaw,

those are helpful suggestions. I do not have much Java experience, but I
wrote some a very simple python script to try to download the data my self.
What I noticed was that the actual download was no faster than briefcase,
but that it might be possible to speed up the process when many of the
records have already been downloaded.

So essentially I now just rerun briefcase every hour using the same
storage folder. It does run slightly faster when this storage folder
already includes almost all of the records, but for our about 50000
submissions it still takes at least half an hour to download just the new
submissions. Are there any speedups to be gained here?

Thanks for all the help

Gunnar

On Tuesday, 30 August 2016 18:26:06 UTC+3, Yaw Anokwa wrote:

Brent,

Good options, but my gut is that SQLing and scrapping are too much work.

One option is to tweak the Briefcase code to push after each pull and
that should speed things up. Mitch suggested this a few weeks back:
https://groups.google.com/d/msg/opendatakit/awkgN2psOZo/MzZZvDRiBwAJ.
A code contribution here would be helpful to the community.

Another option is to use the publishing system to stream submissions
to a JSON endpoint and do with it as you will. You can also stream the
data to Spreadsheets or Fusion Tables and export the data using those
APIs. https://opendatakit.org/use/aggregate/data-transfer has a write
up on what you can get out of each of these options.

Yaw

Need ODK consultants? Nafundi provides form design, server setup,
in-field training, and software development for ODK. Go to
https://nafundi.com to get started.

On Tue, Aug 30, 2016 at 3:16 PM, Brent Atkinson brent.atkinson@gmail.com wrote:

Hi Gunnar,

So, depending on that constraint you may not have many options. It
might be
possible to get a minor speedup, but until you can eliminate the
constraint
on exporting all records, you can't escape the problem of that
slow-down.
Apologies if you know all this, but here are the options I can think
of:

Use an SQL tool to export form records directly from Aggregate's
database
Try to use bulk export from Aggregate's web site directly rather going
through Briefcase

Both of these approaches have trade-offs.

For the first, Aggregate's internal database format is simple enough for
flat forms (ones with simple questions and no nested structures), but in
many cases it isn't easy to generate the same CSV that Briefcase generates.
For example, repeats and multi-select questions are stored in separate
tables. Depending on the form, you will have to run multiple queries to
generate a single unified CSV. The performance gains may not be worth
the effort.
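
As a rough illustration of why repeats complicate the SQL route, the sketch below joins a parent table to one repeat table and writes a single CSV. It uses an in-memory SQLite database so it runs anywhere; a real Aggregate install sits on MySQL or PostgreSQL. The table names (DEMO_CORE, DEMO_MEMBER), the data columns, and the link columns (_URI, _PARENT_AURI, _ORDINAL_NUMBER) are assumptions for illustration, so inspect your own schema first.

```python
import csv
import io
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for a form with one repeat group (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DEMO_CORE (_URI TEXT PRIMARY KEY, VILLAGE TEXT);
        CREATE TABLE DEMO_MEMBER (
            _URI TEXT PRIMARY KEY,
            _PARENT_AURI TEXT,
            _ORDINAL_NUMBER INTEGER,
            NAME TEXT,
            AGE INTEGER
        );
        INSERT INTO DEMO_CORE VALUES ('uuid:1', 'Kisumu'), ('uuid:2', 'Nakuru');
        INSERT INTO DEMO_MEMBER VALUES
            ('uuid:m1', 'uuid:1', 1, 'Ann', 34),
            ('uuid:m2', 'uuid:1', 2, 'Ben', 7);
    """)
    return conn


def export_joined_csv(conn, out):
    """Write one CSV row per repeat entry, joined to its parent submission."""
    query = """
        SELECT core._URI, core.VILLAGE, member.NAME, member.AGE
        FROM DEMO_CORE AS core
        LEFT JOIN DEMO_MEMBER AS member
          ON member._PARENT_AURI = core._URI
        ORDER BY core._URI, member._ORDINAL_NUMBER
    """
    writer = csv.writer(out)
    writer.writerow(["uri", "village", "member_name", "member_age"])
    rows = conn.execute(query).fetchall()
    # Submissions with no repeat entries still appear once, with empty cells.
    writer.writerows(rows)
    return len(rows)
```

Calling export_joined_csv(make_demo_db(), io.StringIO()) produces three data rows: two for the first submission and one, with empty member columns, for the second. Every extra repeat group or multi-select question needs its own join or its own query, which is where the effort grows.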

For the second, it might be possible to get some speedup by scraping the
web-based CSV export from Aggregate. However, I think the format isn't
exactly the same as what Briefcase generates for CSV and it may be more
fragile. Also, I'm not entirely sure the gains would be worth the additional
complexity depending on the forms you're using.

If anyone else in the community knows better, correct me mercilessly.

Brent

On Tue, Aug 30, 2016 at 10:40 AM, gunnarroe@gmail.com wrote:

We have a fairly complicated data pipeline where we need complete csv
files of all the records every hour. In the future we are planning to
rewrite this whole process and will probably then use the publish
feature in Aggregate.

Since this will take some significant development time to set up I was
just hoping there was a way I could speed up the way we are currently
exporting our data.

Thanks,

Gunnar

On Tuesday, 30 August 2016 17:18:50 UTC+3, Brent Atkinson wrote:

Gunnar,

Is there a reason you are needing to re-export the same records every
time? As you are finding, such a process will be increasingly slower as the
number of records increase. If you are doing this for automation, it makes
more sense to have Aggregate publish new submissions rather than
re-exporting everything, every hour.

https://opendatakit.org/use/aggregate/data-transfer/#Publishing

Brent
