Memory use issue using docker-compose stack, odkcentral-service container

1. What is the issue? Please be detailed.

The docker-compose ODK Central stack gradually consumes all the memory on my server (a 4 GB virtual machine). After about two weeks the server dies due to out-of-memory errors.

Any tips on how to troubleshoot this would be very much appreciated.

2. What steps can we take to reproduce this issue?

  1. Clone the repo: https://github.com/getodk/central
  2. git checkout v2025.2.0
  3. Set .env as shown below
  4. Configure custom SSL certs, since it’s an internal deployment not visible to the internet
  5. Bring up the docker-compose stack according to the self-hosting instructions
  6. Wait 1-2 weeks

3. What have you tried to fix the issue?

  • Made sure the git submodules were correctly installed
  • Wiped the whole docker install and reinstalled from scratch

4. Upload any forms or screenshots you can share publicly below.

Every night the memory use for this container increases:

This is my .env file:

# Use fully qualified domain names. Set to DOMAIN=local if SSL_TYPE=selfsign.
DOMAIN=redacted

# Used for Let's Encrypt expiration emails and Enketo technical support emails
SYSADMIN_EMAIL=redacted

# Options: letsencrypt, customssl, upstream, selfsign
SSL_TYPE=customssl

# Do not change if using SSL_TYPE=letsencrypt
HTTP_PORT=80
HTTPS_PORT=443

NO_PROXY=same as DOMAIN above

# Optional: configure Node
# SERVICE_NODE_OPTIONS=

# Optional: connect to a custom database server
# DB_HOST=
# DB_USER=
# DB_PASSWORD=
# DB_NAME=

# Optional: configure a custom mail server
EMAIL_FROM=noreply@redacted
EMAIL_HOST=redacted
# EMAIL_PORT=
# EMAIL_SECURE=
# EMAIL_IGNORE_TLS=
# EMAIL_USER=
# EMAIL_PASSWORD=

# Optional: configure Single Sign-on with OpenID Connect
# OIDC_ENABLED=
# OIDC_ISSUER_URL=
# OIDC_CLIENT_ID=
# OIDC_CLIENT_SECRET=

# Optional: configure error reporting
# SENTRY_ORG_SUBDOMAIN=
# SENTRY_KEY=
# SENTRY_PROJECT=

# Optional: configure S3-compatible storage for binary files
# S3_SERVER=
# S3_ACCESS_KEY=
# S3_SECRET_KEY=
# S3_BUCKET_NAME=
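
For anyone adapting a .env like the one above: a quick sanity check before bringing the stack up can catch missing keys early. The sketch below is illustrative, not an official ODK requirement list; it builds a small example file, but you would point the loop at your real .env.

```shell
# Sketch: flag missing keys in a .env file before bringing up the stack.
# The example file and the list of keys checked are illustrative assumptions.
cat > /tmp/example.env <<'EOF'
DOMAIN=redacted
SYSADMIN_EMAIL=redacted
SSL_TYPE=customssl
EOF

missing=""
for key in DOMAIN SYSADMIN_EMAIL SSL_TYPE; do
  grep -q "^${key}=" /tmp/example.env || missing="${missing} ${key}"
done

if [ -z "$missing" ]; then
  echo "all required keys present"
else
  echo "missing keys:${missing}"
fi
```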

Hmmmm, does it actually die? So the services do not get restarted? You can’t connect to Central anymore?

Asking because if there were a memory leak in one of the components, then indeed they could be killed by the kernel’s OOM killer, but the supervisor (Docker in this case) would then restart the service. The restarted service could start leaking memory again, so the same thing would happen over time.

Yes it crashes the whole VM if I let it run for about 2 weeks.

I upgraded to the latest tag in the GitHub repo, and the problem persists.

Interestingly, the increase in memory use happens almost exactly at 4, 5, 6, and 7am every day. See graph below from our monitoring software.

The cron jobs in the service container run at 2, 3, 4, and 5, so that's why you're seeing a step-by-step increase.
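
For reference, that nightly schedule would look roughly like the following in the service container's crontab. This is a hedged reconstruction from the hours given above and the script names that appear in the `ps xw` listing later in the thread; the exact minutes and paths are assumptions.

```
0 2 * * * /usr/odk/process-backlog.sh
0 3 * * * /usr/odk/run-analytics.sh
0 4 * * * /usr/odk/purge.sh
0 5 * * * /usr/odk/upload-blobs.sh
```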

What host OS are you running Docker on (e.g., Ubuntu 22 LTS)? And what version of Docker are you using?

This is Debian 12 (fully updated).

Docker version 28.1.1, build 4eba377.

ODK Central version: v2025.2.2

I have to confess that after updating to v2025.2.2, I had to comment out the following line in start-odk.sh, since it would just print “null” to the logs, hang, and stop the service container from initializing.

node ./lib/bin/log-upgrade

Maybe this error is related? This workaround was not required in the earlier version, though.

In any case, it seems like the scripts in the crontab are started every night but never finish. See the output of "ps xw" below. So there must be something wrong with my setup. Any pointers on how to debug this would be much appreciated, as I have very little experience with Node.js.

root@ae0eb5f8357b:/usr/odk# ps xw
PID TTY STAT TIME COMMAND
1 ? Ssl 0:00 npm exec pm2-runtime ./pm2.config.js
36 ? S 0:01 cron -f
55 ? S 0:00 sh -c pm2-runtime ./pm2.config.js
56 ? Sl 20:15 node /usr/odk/node_modules/.bin/pm2-runtime ./pm2.config.js
67 ? Sl 73:50 node /usr/odk/lib/bin/run-server.js
74 ? Sl 74:42 node /usr/odk/lib/bin/run-server.js
81 ? Sl 74:38 node /usr/odk/lib/bin/run-server.js
88 ? Sl 73:56 node /usr/odk/lib/bin/run-server.js
157 ? S 0:00 CRON -f
158 ? Ss 0:00 /bin/sh -c /usr/odk/process-backlog.sh
159 ? S 0:00 /bin/bash -eu /usr/odk/process-backlog.sh
160 ? Sl 2:00 /usr/local/bin/node lib/bin/process-backlog.js
174 ? S 0:00 CRON -f
175 ? Ss 0:00 /bin/sh -c /usr/odk/run-analytics.sh
176 ? S 0:00 /bin/bash -eu /usr/odk/run-analytics.sh
177 ? Sl 1:54 /usr/local/bin/node lib/bin/run-analytics.js
195 ? S 0:00 CRON -f
196 ? Ss 0:00 /bin/sh -c /usr/odk/purge.sh
197 ? S 0:00 /bin/bash -eu /usr/odk/purge.sh
198 ? Sl 6:18 /usr/local/bin/node lib/bin/purge.js
212 ? S 0:00 CRON -f
213 ? Ss 0:00 /bin/sh -c /usr/odk/upload-blobs.sh
214 ? S 0:00 /bin/bash -eu /usr/odk/upload-blobs.sh
215 ? Sl 0:35 /usr/local/bin/node lib/bin/s3.js upload-pending
409 ? S 0:00 CRON -f
410 ? Ss 0:00 /bin/sh -c /usr/odk/process-backlog.sh
411 ? S 0:00 /bin/bash -eu /usr/odk/process-backlog.sh
412 ? Sl 7:31 /usr/local/bin/node lib/bin/process-backlog.js
426 ? S 0:00 CRON -f
427 ? Ss 0:00 /bin/sh -c /usr/odk/run-analytics.sh
428 ? S 0:00 /bin/bash -eu /usr/odk/run-analytics.sh
429 ? Sl 2:36 /usr/local/bin/node lib/bin/run-analytics.js
450 ? S 0:00 CRON -f
451 ? Ss 0:00 /bin/sh -c /usr/odk/purge.sh
452 ? S 0:00 /bin/bash -eu /usr/odk/purge.sh
453 ? Sl 6:16 /usr/local/bin/node lib/bin/purge.js
467 ? S 0:00 CRON -f
468 ? Ss 0:00 /bin/sh -c /usr/odk/upload-blobs.sh
469 ? S 0:00 /bin/bash -eu /usr/odk/upload-blobs.sh
470 ? Sl 7:10 /usr/local/bin/node lib/bin/s3.js upload-pending
607 ? S 0:00 CRON -f
608 ? Ss 0:00 /bin/sh -c /usr/odk/process-backlog.sh
609 ? S 0:00 /bin/bash -eu /usr/odk/process-backlog.sh
610 ? Sl 5:13 /usr/local/bin/node lib/bin/process-backlog.js
624 ? S 0:00 CRON -f
625 ? Ss 0:00 /bin/sh -c /usr/odk/run-analytics.sh
626 ? S 0:00 /bin/bash -eu /usr/odk/run-analytics.sh
627 ? Sl 5:10 /usr/local/bin/node lib/bin/run-analytics.js
645 ? S 0:00 CRON -f
646 ? Ss 0:00 /bin/sh -c /usr/odk/purge.sh
647 ? S 0:00 /bin/bash -eu /usr/odk/purge.sh
648 ? Sl 5:04 /usr/local/bin/node lib/bin/purge.js
662 ? S 0:00 CRON -f
663 ? Ss 0:00 /bin/sh -c /usr/odk/upload-blobs.sh
664 ? S 0:00 /bin/bash -eu /usr/odk/upload-blobs.sh
665 ? Sl 4:58 /usr/local/bin/node lib/bin/s3.js upload-pending
802 ? S 0:00 CRON -f
803 ? Ss 0:00 /bin/sh -c /usr/odk/process-backlog.sh
804 ? S 0:00 /bin/bash -eu /usr/odk/process-backlog.sh
805 ? Sl 2:59 /usr/local/bin/node lib/bin/process-backlog.js
819 ? S 0:00 CRON -f
820 ? Ss 0:00 /bin/sh -c /usr/odk/run-analytics.sh
821 ? S 0:00 /bin/bash -eu /usr/odk/run-analytics.sh
822 ? Sl 2:55 /usr/local/bin/node lib/bin/run-analytics.js
840 ? S 0:00 CRON -f
841 ? Ss 0:00 /bin/sh -c /usr/odk/purge.sh
842 ? S 0:00 /bin/bash -eu /usr/odk/purge.sh
843 ? Sl 2:47 /usr/local/bin/node lib/bin/purge.js
857 ? S 0:00 CRON -f
858 ? Ss 0:00 /bin/sh -c /usr/odk/upload-blobs.sh
859 ? S 0:00 /bin/bash -eu /usr/odk/upload-blobs.sh
860 ? Sl 2:42 /usr/local/bin/node lib/bin/s3.js upload-pending
994 ? S 0:00 CRON -f
995 ? Ss 0:00 /bin/sh -c /usr/odk/process-backlog.sh
996 ? S 0:00 /bin/bash -eu /usr/odk/process-backlog.sh
997 ? Sl 0:45 /usr/local/bin/node lib/bin/process-backlog.js
1011 ? S 0:00 CRON -f
1012 ? Ss 0:00 /bin/sh -c /usr/odk/run-analytics.sh
1013 ? S 0:00 /bin/bash -eu /usr/odk/run-analytics.sh
1014 ? Sl 0:39 /usr/local/bin/node lib/bin/run-analytics.js
1032 ? S 0:00 CRON -f
1033 ? Ss 0:00 /bin/sh -c /usr/odk/purge.sh
1034 ? S 0:00 /bin/bash -eu /usr/odk/purge.sh
1035 ? Sl 0:34 /usr/local/bin/node lib/bin/purge.js
1049 ? S 0:00 CRON -f
1050 ? Ss 0:00 /bin/sh -c /usr/odk/upload-blobs.sh
1051 ? S 0:00 /bin/bash -eu /usr/odk/upload-blobs.sh
1052 ? Sl 0:11 /usr/local/bin/node lib/bin/s3.js upload-pending
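
The pile-up in the listing above can be detected mechanically: each nightly script should appear at most once, so repeated entries mean earlier cron runs never finished. A small sketch, using a few lines copied from the listing as an inline sample (on a live system you would pipe in the output of `ps xw` from inside the service container instead):

```shell
# Sketch: count live instances of one nightly job from `ps` output.
# Sample lines are copied from the listing above; a healthy container
# should show at most one instance of each script.
ps_sample='  160 ?  Sl  2:00 /usr/local/bin/node lib/bin/process-backlog.js
  412 ?  Sl  7:31 /usr/local/bin/node lib/bin/process-backlog.js
  610 ?  Sl  5:13 /usr/local/bin/node lib/bin/process-backlog.js
  177 ?  Sl  1:54 /usr/local/bin/node lib/bin/run-analytics.js'

count=$(printf '%s\n' "$ps_sample" | grep -c 'process-backlog.js')
echo "process-backlog.js instances: ${count}"
```

Three instances here means two earlier runs were still hanging when the latest one started.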

PS: the clock in the Docker container is not synchronized to the host. Is this a problem?

I just read that you are using Sentry for collecting some anonymized usage data. We’re behind a firewall that doesn’t allow outgoing connections, so this could be a problem. I’ll try disabling the Sentry connection to see if that helps.

I disabled Sentry according to the instructions in the DigitalOcean installation notes. Now I can start the container stack without commenting anything out in start-odk.sh, and when I use "docker exec -it" on the odkcentral-service-1 container to run the crontab scripts you showed, they complete quickly without error.

Hopefully this will fix everything. I'll report back tomorrow or next week once I've seen how the memory use develops.

Yep, that fixed it. The problem seems to be that the scripts (both start-odk.sh and the nightly cron jobs) hang when the service container can't reach the ODK Sentry instance on the internet. It would be great if this could be handled gracefully with an error or with a log message.
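
One generic mitigation for this class of problem (a sketch, not how ODK implements its fix) is to bound each cron job with `timeout` so that a call stuck on an unreachable endpoint can't pin a process forever. Here `sleep 60` stands in for a hung job; a real wrapper might use something like `timeout 1h node lib/bin/process-backlog.js`.

```shell
# Sketch: wrap a job in `timeout` so it fails fast instead of hanging forever.
# `sleep 60` simulates a job stuck on an unreachable network endpoint.
if timeout 1 sleep 60; then
  result="job finished"
else
  result="job timed out or failed"
fi
echo "$result"
```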

Thanks for sharing what the root cause was. I've filed it at https://github.com/getodk/central/issues/1347 and we'll fix it as soon as we're able.

Great, thanks for picking it up as a ticket. We have ODK set up for collecting questionnaires and informed consent at a hospital, and the firewall is very restrictive with respect to outbound connections. ODK looks very promising as a replacement for the hacky proprietary solution we have right now.

Hi All,

This also bit me today, and it took a long time to track down. Booting a new installation hung after logging null from the service container, which led me down a wild goose chase of audit issues. Disabling Sentry also resolved the problem, allowing central-service-1 to come alive.

In my case, I was trying to migrate the same ODK instance between two physical machines. I’ll share my experience for reference.

This failure state isn’t easily “googlable”, and I worry it might hit others who don’t understand what’s going on and just give up on the product. Here was my basic logic for tracking this down:

  1. After installing the new instance and restoring the backup (following all the usual instructions), I received the familiar message “This account is already logged in, refresh the page to continue” on the Central website. This message seems to be shown whenever the /session/restore endpoint cannot be reached, even if the whole central-service container has crashed. Since I knew this from previous experience, I checked the logs.
  2. The nginx container reported an upstream failure, trying to access port 8383. Once I figured out this pointed to the service container, I knew something was going on with booting the service container.
  3. After reading the upgrade notes, I realized that maybe the size of my data (it was ~300MB zipped) was causing the audit logs, geotraces, etc, to take too long. After all, the docs suggested that I could budget for an hour of downtime. So, I wiped everything and started over, trying to get a minimal ODK server up and running before restoring the backups.
  4. I followed the instructions for a new install with a fresh domain. I was able to interact with the odk-cmd to create a user, promote it, etc, telling me the service container was at least viable. I didn’t see anything in the logs about failures installing npm modules. This time, without data, the error message when trying to log into Central was a 502 error code, which was less confusing than the previous “you are already logged in” error.
  5. Maybe it was still part of the “expect an hour of downtime” message I read in the upgrade notes, so I left everything running for an hour and came back to try again. No dice.
  6. I checked the forum for anyone having issues with version 2025.4.1, but didn’t find any. I looked at the ODK Central troubleshooting docs, but none of those seemed to apply. What would prevent the server from starting…?
  7. I manually accessed the service container using docker exec -it service bash and started poking around. The last log of the service container was null, which made me think some sort of undefined failure happened when returning values. So, I started putting print statements and console.log()s and running with DEBUG=* node to get as much information as possible throughout the start-odk.sh and subsequent scripts. Like this thread, I came upon the culprit being node ./lib/bin/log-upgrade.
  8. I commented that line out (as a test) and the server started! Weird. Maybe it was doing heavy PostgreSQL operations…? I read the code, but none of those lines seemed to do anything super weird. I started looking through recent build failures on GitHub, but they didn’t match; they all got to the point where the service container was running.
  9. Finally, through some more googling around, I got here.
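
In restrictive network environments like the ones described in this thread, a quick reachability probe can rule Sentry (or any outbound endpoint) in or out before a long debugging session. A sketch using bash's `/dev/tcp` redirection; the hostname below is deliberately unresolvable to show the failure path, so substitute your actual Sentry host.

```shell
# Sketch: probe whether an outbound endpoint is reachable from the container.
# "sentry.example.invalid" never resolves (.invalid is reserved), so this
# demonstrates the unreachable branch; substitute your real Sentry host.
host="sentry.example.invalid"
port=443
if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
  result="reachable: ${host}:${port}"
else
  result="unreachable: ${host}:${port}"
fi
echo "$result"
```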

Based on this experience, I recommend the following:

  1. Add a note to the installation documentation, the troubleshooting documentation, and the Sentry customization documentation, each mentioning this failure state. If ODK cannot reach the Sentry endpoints, the server will not start and will not produce error messages that guide sysadmins toward resolving the issue.
  2. Make the default behavior of ODK Central such that Sentry is disabled when no API key is provided. Currently, Sentry is treated as a core part of the ODK application in a way that feels too tightly coupled. If Sentry’s domains go down, it would still be nice to be able to run our ODK Central servers.
  3. Provide more descriptive error messages on the client when the service container cannot be reached. If a 5xx response comes back from the attempt to connect to the service container, it would be nice to show a user-facing message along the lines of “Error 502: Received a failure when logging into the backend server. Is it accessible?”

With a few pointers, I’d be happy to contribute to any or all of these fixes.

Thanks!

The coupling is not intentional. We’ve got a PR to add some documentation about this behavior, and we’ve also got a hotfix coming shortly so that Central can start up even if it can’t reach Sentry.

Thanks for all the work you did tracking down the issue.

@johannes.toger @jniles thanks for all the info on this. We have options for an immediate fix, and have filed upstream at https://github.com/getsentry/sentry-javascript/issues/18802

Sentry just changed the IPs of its default ingestion endpoints today. Folks with restrictive network environments may need to update their network configurations with the new IP addresses for Central to work until @alxndrsn's patch is released.

We've added documentation about this issue here.
