Memory use issue using docker-compose stack, odkcentral-service container

Hi All,

This also bit me today, and it took a long time to track down. Booting a new installation hung right after the service container logged null, which led me on a wild goose chase through audit issues. Disabling Sentry resolved the problem for me as well, and central-service-1 finally came up.

In my case, I was trying to migrate the same ODK instance between two physical machines. I’ll share my experience for reference.

This failure state isn’t easily “googlable”, and I worry it has hit others who didn’t understand what was going on and simply gave up on the product. Here is the basic logic I followed to track it down:

  1. After installing the new instance and restoring the backup (following all the usual instructions), I got the familiar message “This account is already logged in, refresh the page to continue” on the Central website. This message seems to appear whenever the /session/restore endpoint cannot be reached, even when the whole central-service container has crashed. Since I knew this from previous experience, I checked the logs.
  2. The nginx container reported an upstream failure when trying to reach port 8383. Once I figured out that this port belongs to the service container, I knew something was going wrong while the service container booted.
  3. After reading the upgrade notes, I suspected that the size of my data (~300 MB zipped) was making the processing of audit logs, geotraces, etc. take too long. After all, the docs suggested budgeting for an hour of downtime. So I wiped everything and started over, aiming to get a minimal ODK server up and running before restoring the backups.
  4. I followed the instructions for a new install with a fresh domain. I was able to use odk-cmd to create a user, promote it, etc., which told me the service container was at least viable. I didn’t see anything in the logs about failures installing npm modules. This time, without data, trying to log into Central returned a 502 error, which was less confusing than the earlier “already logged in” message.
  5. Thinking this might still fall under the “expect an hour of downtime” note in the upgrade notes, I left everything running for an hour and came back to try again. No dice.
  6. I checked the forum for anyone having issues with version 2025.4.1, but found nothing relevant. I went through the troubleshooting ODK Central docs, but none of it seemed to apply. What would prevent the server from starting…?
  7. I opened a shell in the service container with docker exec -it service bash and started poking around. The last thing the service container logged was null, which made me think some undefined failure happened while returning values. So I added print statements and console.log() calls and ran things with DEBUG=* node to get as much information as possible out of start-odk.sh and the scripts it calls. As in this thread, I eventually found the culprit: node ./lib/bin/log-upgrade.
  8. I commented that line out (as a test), and the server started! Weird. Maybe it was doing heavy PostgreSQL operations…? I read the code, but nothing in it seemed unusual. I then looked through recent build failures on GitHub, but they didn’t match: they all got to the point where the service container was running.
  9. Finally, after some more googling, I ended up here.
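For anyone retracing steps 2 through 8, a quick connectivity check can narrow down which hop is failing before digging into scripts. The sketch below is only a generic TCP reachability probe written in Python, not an ODK tool. The targets in the usage note (the compose service name service on port 8383, and a Sentry host on 443) are assumptions; substitute the values from your own docker-compose setup and Sentry configuration, and run it from a machine or container on the same network where Python is available.

```python
import socket
import sys
from typing import List


def can_reach(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    DNS failures, refused connections, and timeouts all raise OSError
    subclasses, so they are reported as "unreachable" instead of crashing.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def main(targets: List[str]) -> int:
    """Probe each "host:port" target, print a status line, return 0 if all are reachable."""
    all_ok = True
    for target in targets:
        host, _, port = target.rpartition(":")
        reachable = can_reach(host, int(port))
        all_ok = all_ok and reachable
        print(f"{target}: {'reachable' if reachable else 'UNREACHABLE'}", file=sys.stdout)
    return 0 if all_ok else 1
```

For example, main(["service:8383", "your-sentry-host:443"]) would show at a glance whether nginx's upstream or the outbound Sentry route is the one failing. It only tests TCP reachability, so a host that accepts connections but then hangs will still show as reachable.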

Based on this experience, I recommend the following:

  1. Add a note about this failure state to the installation documentation, the troubleshooting documentation, and the customizing Sentry documentation. If ODK cannot reach the Sentry routes, the server will not start, and it gives sysadmins no error messages to guide them toward a fix.
  2. Perhaps ODK Central should disable Sentry by default when no API key is provided? Currently, Sentry is treated as a core part of the ODK application in a way that feels too tightly coupled. Even if Sentry’s domains go down, it would be nice to still be able to run our ODK Central servers.
  3. Show more descriptive error messages in the client when the service container cannot be reached. If a 5xx error comes back when connecting to the service container, it would be nice to show a user-facing message such as “Error 502: Received a failure when logging into the backend server. Is it accessible?” or something along those lines.
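Recommendation 2 could go one step further: only enable error reporting when an operator supplies a key, and never let the reporter block startup. The Python sketch below illustrates that policy in general terms. It is not how ODK Central's Node service is structured, and the environment variable name ERROR_REPORTING_DSN and the init callback are hypothetical stand-ins for whatever the real Sentry setup uses.

```python
import threading
from typing import Callable, Mapping, Optional


def reporting_dsn(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured DSN, or None when reporting should stay off.

    ERROR_REPORTING_DSN is a hypothetical variable name used for illustration.
    """
    dsn = env.get("ERROR_REPORTING_DSN", "").strip()
    return dsn or None


def init_reporting(
    env: Mapping[str, str],
    init: Callable[[str], None],
    timeout_s: float = 5.0,
    log: Callable[[str], None] = print,
) -> bool:
    """Try to start error reporting without ever blocking or failing startup.

    Returns True only if `init` finished without raising within timeout_s.
    """
    dsn = reporting_dsn(env)
    if dsn is None:
        log("error reporting disabled: no DSN configured")
        return False

    errors: list = []

    def run() -> None:
        try:
            init(dsn)
        except Exception as exc:  # any failure just disables reporting
            errors.append(exc)

    # Daemon thread: a hung init can neither stall boot nor keep the process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        log(f"error reporting disabled: init did not finish in {timeout_s}s")
        return False
    if errors:
        log(f"error reporting disabled: init failed ({errors[0]!r})")
        return False
    return True
```

With this shape, a missing key, a failing init, or an unreachable reporting endpoint each produce a single log line and the server keeps booting. The timeout matters because a hang, unlike an exception, cannot be caught; it can only be bounded.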
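For recommendation 3, the client mostly needs to branch on the HTTP status it gets back from the backend. A minimal Python sketch of that mapping follows. The message wording is only a suggestion, and the actual Central frontend is a JavaScript app, so this illustrates the logic rather than its real code.

```python
from typing import Optional

# Suggested wording for gateway-style failures, where nginx answered but the
# service container behind it did not.
_GATEWAY_MESSAGES = {
    502: "Received a failure when connecting to the backend server. Is it running and accessible?",
    503: "The backend server is temporarily unavailable. It may still be starting up.",
    504: "The backend server did not respond in time. Is it running and accessible?",
}


def backend_error_message(status: int) -> Optional[str]:
    """Return a user-facing message for 5xx responses, or None otherwise."""
    if not 500 <= status <= 599:
        return None
    detail = _GATEWAY_MESSAGES.get(
        status, "The backend server reported an internal error. Check the service logs."
    )
    return f"Error {status}: {detail}"
```

Distinguishing gateway errors (502, 503, 504) from other 5xx codes lets the client say "the backend is unreachable" instead of showing a misleading prompt like the “already logged in” message described in step 1.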

With a few pointers, I’d be happy to contribute to any or all of these fixes.

Thanks!