Zero Downtime Deployment / High Availability / Cluster

Hi!

The documented way of deploying ODK Central to DigitalOcean seems to have one major downside: while we update the deployment, the services are unavailable.

My question: if we ran several containers in parallel behind load balancers, are there any caveats we would need to know about? For example, are all services stateless, so that they can be run in parallel?

Best regards,
Matthias

In the meantime, I've found that there is some progress on pre-built Docker images here: Host ODK Central Docker images on GitHub container registry

Anyway, the original question is still open.

Re hosted images: since Central includes Enketo, the images have to be built during deployment. Enketo bakes secret keys into the image at build time, so images cannot be shared between deployments. Pre-built images are therefore of limited use.

I'm no longer working on a k8s deployment of ODK Central using pre-built images, because our IT finally gave up on their Kubernetes doctrine and allowed me to run the vanilla setup using docker-compose and a separate Postgres db.
Edit: I'd like to say that using the vanilla images is wonderful for two reasons: first, no additional maintenance is required to translate each new release into a different paradigm (k8s vs. docker-compose); second, the vanilla deployment is rock solid and just works.

Thanks for the update!

Do you know for sure that you need a more complex system architecture? Do you have many hundreds or thousands of data collectors? Are they expected to all be sending many submissions at once? If so, I recommend doing some simulations with test data to see what your bottlenecks actually are.

We typically see around a second of downtime when doing updates and schedule them when we see the least traffic in a typical week. The recommended update process is documented here. Notice that we recommend doing a docker-compose build while everything is still up. The only downtime is what it takes to stop and then start everything.
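For illustration, here is a rough Python sketch of that ordering: build first while the old containers keep serving, and only then stop and start. The exact commands are an assumption on my part (the official upgrade instructions are authoritative and may include further steps such as fetching the new release), and update_central and run_checked are hypothetical helper names, not part of Central.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumed command sequence; verify it against the official upgrade instructions.
BUILD_STEP = ["docker-compose", "build"]      # old containers keep serving during the build
STOP_STEP = ["docker-compose", "stop"]        # downtime starts here
START_STEP = ["docker-compose", "up", "-d"]   # downtime ends once the services are back up


def run_checked(cmd: Sequence[str]) -> None:
    """Run one command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def update_central(run: Callable[[Sequence[str]], None] = run_checked) -> List[List[str]]:
    """Run build, stop, start in that order and return the commands that completed.

    If the build raises, nothing has been stopped yet, so the running
    deployment is left untouched.
    """
    executed: List[List[str]] = []
    for step in (BUILD_STEP, STOP_STEP, START_STEP):
        run(step)
        executed.append(list(step))
    return executed
```

Calling update_central() from the directory that holds the Central docker-compose.yml would run the three steps for real. Because the build is the first step and failures abort the sequence, a broken build never takes the running services down.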

Thanks for the explanation! I was also thinking of high availability in case of server crashes. But for now, we are okay with the current situation. Especially since the configuration is stored inside the container images and would need more work...

That's definitely a good consideration. For better or worse, so far we've almost exclusively seen crashes that have to do with specific submissions that are sent in. That means you likely wouldn't escape them without also segmenting your data somehow, and then you'd really be fighting how the system is structured.

I encourage you to monitor your performance and uptime and communicate back about how things are going for your project. If there are issues, we can discuss some possibilities for approaching them.
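If it helps as a starting point, below is a minimal Python sketch of an uptime probe that uses only the standard library. It is not an official ODK tool; the function names are made up for this example, and the URL in the usage note is a placeholder for your own server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, bool, float]  # (probe start time, success, latency in seconds)


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def probe(url: str, attempts: int, interval: float,
          check: Callable[[str], bool] = http_ok,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> List[Sample]:
    """Check the URL `attempts` times, waiting `interval` seconds between checks."""
    samples: List[Sample] = []
    for i in range(attempts):
        start = clock()
        ok = check(url)
        samples.append((start, ok, clock() - start))
        if i < attempts - 1:
            sleep(interval)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest span in seconds from a first failed probe to the next successful one.

    An outage still ongoing at the last sample is measured up to that sample.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for start, ok, _latency in samples:
        if not ok and outage_start is None:
            outage_start = start
        elif ok and outage_start is not None:
            longest = max(longest, start - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

For example, probe("https://central.example.org/", attempts=600, interval=1.0) started shortly before an update, followed by longest_outage on the result, would give a rough figure for how long the update made the server unreachable. The resolution is limited by the probe interval, so treat the number as an estimate.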