Massive dataset - Need optimization advice

Saad · March 23, 2020, 8:18am

Hi,

I have a project that has gathered a massive number of form submissions, around 100k+. Each form has 4 or more images as well. On disk, the data takes up over 300GB and is still growing. The problem is that this amount of data now exceeds the limits of many resources. I have been steadily increasing server CPU and memory (in the cloud), but it has reached the point where Tomcat is giving up.

I need advice on the following challenges:

1. How should I manage such a huge amount of data going forward? Even opening Aggregate takes an eternity because of image loading.
2. Can I offload the saved data somewhere to make space for new incoming data and make the server lighter? I usually take backups via Briefcase. But Briefcase only helps me take a backup, and I still need to put the data somewhere to extract it properly (make CSV, visualize on a map). Maybe a separate standalone offline Aggregate?
3. Extending point 2 above, is it possible to push the data from Briefcase to another Aggregate server that does not have the URL of the main project (since the primary URL goes to the main server, of course)?
4. Is there any load-balancing mechanism I can put in? At which level (Aggregate, Tomcat, etc.)? I am using AWS cloud.

And the final question: can ODK Central help in such a scenario, or with massive data handling in general? I have been putting off learning ODK Central for some time, but if it is the way forward, then now is the time for me to do it.

Many thanks!
Saad

LN · March 25, 2020, 9:40pm

Hi @Saad and thanks for jumping in to offer support to others. I'd certainly love to learn more about your project in the Showcase when you have a moment.

Roughly how big is each image that is collected? Are you familiar with the max-pixels parameter, which makes it possible to force a scale-down of images as they are taken? Most devices today take huge pictures, which is great when they need to be printed but is unnecessary in most data-collection contexts. Collect also has a corresponding setting that can be changed without needing to deploy a form update.
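As a rough back-of-envelope check, the figures in the first post already bound the average image size. The sketch below uses only those quoted numbers and assumes, for simplicity, that all 300GB is image data, that there are exactly 100,000 submissions, and that each has exactly 4 images. Since the post says "100k+" and "4 or more", the real average is likely smaller.

```python
# Back-of-envelope estimate using only the figures quoted in the first post.
TOTAL_BYTES = 300 * 10**9      # "over 300GB" on disk (decimal GB assumed)
SUBMISSIONS = 100_000          # "around 100k+" submissions
IMAGES_PER_SUBMISSION = 4      # "4 or more images" per form


def average_image_bytes(total_bytes: int, submissions: int, images_each: int) -> float:
    """Upper bound on mean image size, assuming all disk use is images."""
    return total_bytes / (submissions * images_each)


avg = average_image_bytes(TOTAL_BYTES, SUBMISSIONS, IMAGES_PER_SUBMISSION)
print(f"at most ~{avg / 10**6:.2f} MB per image on average")
```

This only gives a ceiling, so measuring a sample of the actual files would be more reliable before deciding how far to scale images down.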

Yes, it is possible to use Briefcase to push data to another Aggregate server, either locally or at another URL. If the documentation isn't enough to help you do that, please start a new topic. I also see you are concerned about subsequent analysis when trying to decide which direction to go in. Can you describe your analysis pipeline?

On load balancing, I don't have experience with how to identify the layer to target, but @yanokwa may have suggestions.

Yes. Central does minimal processing of submissions as they come in, which ensures it can handle a lot of data arriving at the same time with minimal resource usage. It only shows a subset of submission data on the website and does not load all images, so that view also remains fast even with large datasets. We recommend doing analysis through the OData feed, which is live-updating and should also remain fast. Here again, describing your desired analysis would help us guide you towards a good solution.
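To give an idea of what reading the OData feed in pages could look like, here is a minimal Python sketch. It is not an official client. The server URL, project ID, form ID and credentials are placeholders, and the endpoint path, the `$top`/`$skip` paging parameters and the use of HTTP Basic auth reflect my understanding of Central's OData API, so please check them against the Central API documentation for your version.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator


def submissions_url(base_url: str, project_id: int, form_id: str,
                    top: int, skip: int) -> str:
    """Build the URL for one page of the Submissions table (path is assumed)."""
    form = urllib.parse.quote(form_id, safe="")
    path = f"/v1/projects/{project_id}/forms/{form}.svc/Submissions"
    query = urllib.parse.urlencode({"$top": top, "$skip": skip}, safe="$")
    return base_url.rstrip("/") + path + "?" + query


def fetch_json(url: str, email: str, password: str) -> dict:
    """GET a URL with HTTP Basic auth (assumed to be enabled) and parse JSON."""
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def iter_submissions(get_page: Callable[[int, int], dict],
                     page_size: int = 250) -> Iterator[dict]:
    """Yield rows page by page so the whole dataset is never held in memory.

    get_page(top, skip) must return a dict with the rows under "value".
    """
    skip = 0
    while True:
        rows = get_page(page_size, skip).get("value", [])
        yield from rows
        if len(rows) < page_size:  # a short page means there is nothing left
            break
        skip += page_size
```

To use it against a real server, `get_page` could be something like `lambda top, skip: fetch_json(submissions_url("https://central.example.org", 1, "my_form", top, skip), email, password)`, where every one of those values is hypothetical. Keeping the paging loop separate from the HTTP call also makes it easy to try the loop on fake data first.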

One thing you could do is set up Central locally and use Briefcase to push your dataset to it to try it out.