ODK Aggregate stress testing failures

Hi,

Has anyone here done stress testing of ODK Aggregate?

I recently did a deployment in which some 5000+ people accessed the server simultaneously. The server could not handle that much traffic and got very slow. As I was on Amazon's cloud, I blindly upgraded the server 3 or 4 times to a monster-level machine, but the server response still did not improve, even though CPU and RAM stayed at comfortable levels. There were a bunch of errors in the Tomcat logs, but I assume they were mainly due to traffic overload. It left a bad taste of ODK with the audience (it was a showcase training!).

Has anyone faced similar issues? Any tips for optimizing the Tomcat config to handle more traffic?

Thanks,
Saad

Hi @Saad, see Approaches to stress testing ODK servers to try to tease out where the bottlenecks are in your config.

The thing that comes to mind is that Aggregate splits submissions into individual values that are inserted into the DB, so my guess is that the bottleneck might be at the DB layer.

Central doesn't do this sort of submission splitting and so it can be a lot faster. Have you tried a similar deployment with Central? See https://docs.opendatakit.org/central-intro/#notes-on-performance. I'd be interested to see how it performs for you.

Hi @yanokwa,

Thanks for the link. So far, I assume it's the Tomcat server that is becoming the bottleneck. Perhaps it's the concurrent sessions or TPS (transactions per second) limits. It's around 10,000 people trying to access the server at the same time! I have been searching for ways to optimize the Tomcat settings, but no luck so far. If anyone has any ideas, please do share.

Aggregate seems to be working OK for the traffic that does reach it. Occasionally it throws errors such as the following:

  • SEVERE: Unexpected exception: java.io.EOFException: Unexpected EOF read on the socket
  • SEVERE: Unexpected exception: java.net.SocketTimeoutException
  • INFO [http-nio-8080-exec-177] org.apache.coyote.http11.Http11Processor.service Error parsing HTTP request header

I have checked out Central, and it's great. However, I have not been able to develop my expertise with it yet. In addition, since it did not seem stable yet, I could not take the risk of launching the project on it. If it is stable and more optimized, I will spend more time and energy on it.

For now, I would like to get some help with the current situation, especially a better Tomcat config to handle such heavy traffic.

Thanks,
Saad

I haven't spent a ton of time optimizing Tomcat, so caveat emptor. That said, you are using words like "assume" and "perhaps," so I think it'd be worthwhile to dig into the problem more and do some benchmarking before you optimize values.

  1. Get a benchmarking setup with ab so you can test your specific form against the server; there's a rough ab sketch after this list. The stress testing approaches linked earlier should be of some help.
  2. Turn on detailed monitoring on your instances and see if there's anything obvious there as far as incoming traffic being choked off.
  3. Use htop to see if there is anything obviously wrong (maybe you have lots of CPU and RAM on the machine, but they aren't actually allocated to Tomcat?).
  4. Read through the Tomcat, Aggregate, and MySQL logs to see if there is anything obvious.
  5. Confirm where the problem is (network, machine, Tomcat, MySQL) and what kind of problem it is and go from there.
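
For step 1, here's a minimal sketch of what an ab run against Aggregate could look like. The hostname, paths, and request counts are placeholders you'd need to adapt to your deployment (the formList and submission paths depend on where the webapp is mounted).

```
# Rough read-side load test against the OpenRosa form list endpoint.
# Swap in your real server URL and add auth if the forms aren't public.
ab -n 1000 -c 100 https://your-server.example.com/formList

# To exercise the submission path, point ab at a captured multipart
# body from a real submission (file name here is hypothetical):
#   ab -n 500 -c 50 -p submission_body.txt \
#      -T 'multipart/form-data; boundary=----yourboundary' \
#      https://your-server.example.com/submission
```

Ramping up -c while watching requests per second and failed requests should give you a rough idea of where the server starts to fall over.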

For projects with massive traffic, it might be easier to rethink the architecture: put a load balancer in front that splits traffic between a few smaller Tomcat instances on EC2, and have those instances talk to a single large database on RDS.
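
To make that concrete, here's a minimal sketch of the load-balancing piece using nginx as the balancer (an AWS ELB/ALB would work just as well). The IPs, ports, and hostname are placeholders, and each Tomcat instance would be configured to talk to the same RDS database.

```
# /etc/nginx/conf.d/aggregate.conf -- illustrative only
upstream aggregate_backends {
    # Two smaller EC2 instances, each running Tomcat + Aggregate.
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}

server {
    listen 80;
    server_name your-server.example.com;

    location / {
        # Requests are spread round-robin across the backends by default.
        proxy_pass http://aggregate_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```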

As to Central, it's no longer in beta. We consider it stable and ready for production.