Gigantic study - Best server settings

chrissyhroberts · May 8, 2019, 11:33am

We are about to undertake a frankly enormous clinical research study with up to a million participants, which will probably involve around 2.5-5 million form submissions. We want to be sure that the Aggregate server will be able to cope with such huge numbers but it's unclear if we need to do anything special to ensure stability.

I wonder if the dev community has any recommendations for the optimal server end design and settings that would allow us to collect all this data and store it on the server without having to do regular purges.

yanokwa · May 8, 2019, 6:34pm

@chrissyhroberts You all run a fair number of Aggregate servers so I bet you have pretty good instincts. I'd say PostgreSQL, fast SSDs, and a ton of RAM is a good place to start.

Do you have a sense of how many enumerators will be hitting the server and how often? What about the forms themselves? Roughly how many variables do you expect? Will you have lots of repeats? Submissions with images or other large binaries?

issa · May 8, 2019, 6:47pm

depending on some of your answers to Yaw's questions, Central may be an ideal answer for you. it doesn't try to do as much work as Aggregate and it's built on a more modern streaming architecture so it should be more than capable of handling that many submissions without a problem.

to give insight to my opinion,
on the one hand i really wouldn't start worrying about Central on paper until we start hitting the 30-40 million record range, or a very high submission throughput (and in both cases the problem is still solvable with the Central software but operationally we would want to start spreading the load across multiple machines). so really, 5 million should be fine regardless of your answers.
but on the other hand, nobody has battle-tested Central in real life as much as we have Aggregate. there will be more knowledge about what can go wrong and how to fix it if it does.

at any rate, should you decide to try Central out you'll have my attention and effort to make it go smoothly.

ggalmazor · May 9, 2019, 8:53am

That sounds super interesting, @chrissyhroberts!

I think that in such a project, you should focus on scalability for many reasons. Not planning for a scalable infrastructure forces you to make an initial bet and hope that it will be enough, incurring higher costs from day one (when you don't need to support high volumes) and locking you into a hardware stack.

You can look at scalability from different perspectives when dealing with the app server and the database.

I'd highly recommend using a managed database service such as AWS RDS. This will let you easily increase your database resources to match your needs as time passes, deal with backups, replication, etc.

With a reasonable amount of memory, the app server shouldn't be the bottleneck but since it's harder to scale, I'd suggest having a strategy that lets you replace it smoothly. When you need a bigger machine, you can use DNS to switch traffic from the old server to the new one with some down-time, or have a load balancer in front of them for better results, and easier management.

Monitoring the infrastructure should be a priority too. I'd strongly recommend activating your cloud provider's metrics, which can measure stuff like response times from Aggregate, etc.
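Where a provider's built-in metrics aren't available, a small probe can give a rough picture of response times. The following is only a minimal sketch using the Python standard library, not a substitute for proper monitoring; the function name `time_request` and the URL in the comment are hypothetical, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.request
from typing import Callable


def time_request(url: str,
                 opener: Callable = urllib.request.urlopen,
                 timeout: float = 10.0) -> float:
    """Return the seconds taken to fetch `url` and read the full response body."""
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        response.read()  # include body transfer in the measurement
    return time.perf_counter() - start


# Example call (hypothetical URL, replace with your own server address):
#   elapsed = time_request("https://aggregate.example.org/")
#   print(f"response time: {elapsed:.3f} s")
```

Running something like this on a schedule and logging the result would show whether response times creep up as the record count grows.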

Replicating the running app server onto a beefier machine, increasing database resources... all this is fairly easy to manage if you can use a cloud provider. If you have to do this in-house, I'd suggest having the app server and the database on two different hosts, which will let you deal better with infrastructure growth.

If you can't do any of this, then it's easy: I'd go for the biggest machine you can afford. CPUs are not so important, and any modern 2-4 core CPU will be enough (if not, we're the ones doing something wrong :D). In any case, invest first in big NVMe/SSD drives, then memory (at least 8GB for Aggregate, maybe? not sure), then CPU.

chrissyhroberts · May 9, 2019, 3:36pm

Amazing! Thanks for your experience and insight.

@ggalmazor I think that cloud is out as partners want servers based on physical machines in-country. Will feed forward to our IT team your ideas for making stuff scalable.

@issa I think that we should stick with Aggregate for now, as Central is not field-proven and this is literally life or death work, so we need to avoid nasty surprises. Will be very interested in using Central in the future, particularly as we have so many projects and Central seems to have some great features for this.

@yanokwa
About 250 devices, daily uploads.

Forms will be as minimal as possible, but expect mostly 3 KB each, with up to 100 variables.

No repeats, no images and no attachments.

At backend probably three users doing daily pulls.
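For a rough sense of scale, the figures quoted in this thread can be put through a back-of-envelope calculation. This is only a sketch: the study duration is not stated anywhere in the thread, so `STUDY_DAYS` is an assumption, and the result covers raw submission payload only, not whatever overhead the database adds for indexes and metadata.

```python
# Back-of-envelope sizing from the figures quoted in the thread.
SUBMISSIONS_LOW = 2_500_000   # lower bound quoted in the thread
SUBMISSIONS_HIGH = 5_000_000  # upper bound quoted in the thread
FORM_SIZE_KB = 3              # "mostly 3kb each"
DEVICES = 250                 # "About 250 devices"
STUDY_DAYS = 365              # ASSUMPTION: study length is not given in the thread


def raw_payload_gb(submissions: int, form_size_kb: float = FORM_SIZE_KB) -> float:
    """Raw submission payload in decimal GB, ignoring database overhead."""
    return submissions * form_size_kb / 1_000_000


def per_device_per_day(submissions: int,
                       devices: int = DEVICES,
                       days: int = STUDY_DAYS) -> float:
    """Average submissions each device uploads per day over the assumed duration."""
    return submissions / (devices * days)


if __name__ == "__main__":
    for n in (SUBMISSIONS_LOW, SUBMISSIONS_HIGH):
        print(f"{n:,} submissions: ~{raw_payload_gb(n):.1f} GB raw, "
              f"~{per_device_per_day(n):.0f} per device per day")
```

On these numbers the raw payload is roughly 7.5 to 15 GB, so the concern is more about record count and query load than about disk capacity.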

issa · May 9, 2019, 6:25pm

i think given your responses Central would easily handle it, but i understand your concern.

on the other hand, someone has to take a bet on it or it'll never be field-proven.

Mark_Schormann1 · May 9, 2019, 8:01pm

It should be fine as long as you have given some thought to how you will do your purges (if you intend to do so) and how frequently you intend to do them. The main difficulties we have had with Aggregate over the years have been when the servers are burdened with a large number of records. Throwing resources at the servers has mitigated the issue to an extent, but the servers always seem to hit a point where they struggle and can become unresponsive. That said, our forms have been image heavy and have had numerous repeat groups. The answer to this has been reasonably regular purges of old records.

It will be interesting to hear how the project goes.

chrissyhroberts · May 10, 2019, 8:33am

For this particular project we have to de-risk everything to the greatest extent possible but we'll definitely be field-testing Central at scale in the future and would be very happy to liaise with you on some field evaluations. Encrypted forms are an absolute must for all our work, so we've been waiting for that to be implemented before we look at this in depth.

chrissyhroberts · May 10, 2019, 8:37am

Thanks Mark,
We've previously seen the front end become unresponsive at around 500k forms, but setting up a new front end pointing at the same SQL database seemed to fix the issue. The need to share data with partners and the desire to avoid any downtime mean we've tried to avoid purges so far, but they may be necessary for this one.

One thing I don't understand about purging is whether there is a way to clear records from the web interface but leave them in the underlying database. Given the behaviour described above (db fine, front end unresponsive), this would seem a useful way to handle things.

issa · May 10, 2019, 6:51pm

ah, i see. well, you'd be perhaps interested to see our plan for the next release.