Severe issue on multiple Central installations working with images - Error 500

Hi,

I am facing the same issue on 2 of my ODK Central installations, both containing a large amount of data. One instance has around 40k+ submissions (each with at least 2 images), while the other has around 6k submissions, each with around 3 images.

I suspect that the issue starts when someone tries to download all data with media images. The Central server then starts giving the following errors:

  • Unable to log in. The message shown is SOMETHING WENT WRONG - ERROR 500.
  • If you already have a logged-in session, the portal starts showing the message: THE USER IS ALREADY LOGGED IN. PLEASE REFRESH THE PAGE.
  • There is no way to log in to the Central portal with any account.

Field workers are also unable to send data to the server. The Collect app starts giving error 500 and CANNOT CONNECT messages.

The Central version is 1.3.3. There is no issue with RAM, CPU, or disk space.

Temporary workaround: I tried everything, but nothing worked until I restarted the whole machine. After that everything started working, but after a couple of days the same thing happened again.

If someone could help, I would be grateful. Let me know if you need me to pull out some logs.

Thanks,
Saad

I get those user login issues every now and then. Deleting browser cookies fixes that.

I've seen issues fixed temporarily through a server restart here on this forum. Are you low on disk space? Running the latest Central version?

Hi @Florian_May

Thanks. That does not work for me. I have tried clearing the cache, changing browsers, and changing computers as well.

A restart is not a good solution, although it is the only one I know so far. There is always a risk of something going wrong and losing all the data. The data is now so large that it's not easy to back it up frequently. There is no issue with RAM, CPU, or disk space. The version is 1.3.3, but I assume 1.4 has similar issues (I saw a thread about it on the forum).

I used ODK Aggregate to its breaking point with huge amounts of data in different projects, and it never broke down. But Central is causing issues. I need a permanent fix for it.

Let me know if I can get you some logs for troubleshooting.

Many thanks,
Saad

We run Central instances with much more data than you have, and they don't have this problem. The issue is most likely with your infrastructure.

Look at your logs. What do they say when you start a download with media?

You say you have no issues with RAM or CPU. At what frequency are you measuring usage? Are you looking at the host only, or also at each container's stats?

Hi @yanokwa,

Thanks, very valid points. My infrastructure is on AWS, and I pick fairly powerful machines (never smaller than t3.large). However, I only check RAM and CPU on the host, not inside the containers.

I would need some hand-holding on collecting logs (which ones and how), and also on how to check container resources.

Many thanks,
Saad

Note that t3.large is burstable and so performance isn't predictable.

Independent of that, a machine with lots of RAM won't help because Node.js (which Central uses) won't use more than 2 GB of RAM unless you specifically allocate more.

See https://docs.getodk.org/central-troubleshooting/#reading-container-logs for how to read logs and check status.
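
For example, with a default Docker-based install, something like the following (run from the central directory, and assuming the default service names that match the central_service_1 / central_postgres_1 containers) shows recent logs and a snapshot of per-container CPU and memory:

cd central
docker-compose logs --tail=100 service     # recent logs from the Central backend container
docker-compose logs --tail=100 nginx       # recent logs from the web server container
docker-compose logs --tail=100 postgres    # recent logs from the database container
docker stats --no-stream                   # one-off snapshot of CPU and memory per container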

See https://docs.getodk.org/central-install-digital-ocean/#increasing-memory-allocation for allocating more memory to the service container.
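
As a general illustration of the Node.js mechanism being described (the Central-specific procedure is in the linked doc; the entry point below is only a placeholder, not Central's actual file), the heap limit is raised with the --max-old-space-size flag:

# Hypothetical example only: allow the Node process up to 4 GB of heap instead of the default
node --max-old-space-size=4096 server.js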

Hi,

I picked up these logs from the same issue happening recently.

::ffff:172.18.0.9 - abc@123.com [27/Sep/2022:09:29:30 +0000] "GET /v1/projects/2/forms/daily_activity_tracker/submissions.csv HTTP/1.0" 200 -
::ffff:172.18.0.9 - abc@123.com [27/Sep/2022:09:29:30 +0000] "GET /v1/projects/2/forms/daily_activity_tracker/submissions.csv HTTP/1.0" 200 -
::ffff:172.18.0.9 - abc@123.com [27/Sep/2022:09:29:41 +0000] "GET /v1/projects/2/forms/daily_activity_tracker/submissions.csv HTTP/1.0" 500 271
::ffff:172.18.0.9 - abc@123.com [27/Sep/2022:09:29:41 +0000] "GET /v1/projects/2/forms/daily_activity_tracker/submissions.csv HTTP/1.0" 500 271

It seems the download that was initiated is causing the issue. I have an 8 GB dual-core server which shows below 5% RAM usage all the time, even when the issue occurs. The service containers also don't seem to be overloaded with memory, as far as I can tell. The version of Central I am using is 1.3.3.

Any help would be appreciated.

Many thanks,
Saad

What version of Central are you running?

Are you using the default database configuration? Do you have any database monitoring you can refer to? What does CPU usage look like? Have you looked at the links provided above?

Hi,

  • The Central version is 1.3.3.
  • Yes, the default database config. I followed the default installation instructions given on the ODK website exactly, with no customization on my side.
  • I don't have any database monitoring; it's all default Docker containers. Do you suspect it's a database issue? I can pick up the database container logs and share them if you want.
  • CPU usage remains under 5% all the time, even when the error occurs.
  • Yes, I have seen the links above. There is no memory issue on the main host or the service container.
  • My server is on AWS, running Ubuntu, with 8 GB of RAM and around 40 GB of ODK data.

Regards,
Saad

Sorry, looks like I missed that in your original message. We've made a lot of changes related to database connections since then. If you can make a backup and upgrade to v1.5.3, I think that's likely to solve the problem.

My best guess is that you're running into the issue reported in this thread. You can read through the thread for some ideas to get useful logging. In particular, this thread may be helpful. Once you get 500s, it's safe to run docker-compose stop && docker-compose up --detach to restart.

Thanks. Yes, an upgrade is on my mind (I am pretty much a new-version enthusiast!). But since it's a production machine, I am a bit hesitant to upgrade for fear of losing data. Backing up the data is also an issue, because pulling 40 GB of image data out through the ODK interface does not really work for me. When the server goes into the broken (error 500) state, the only way I know to restore it is rebooting.

I have seen these threads in my research on the support forum. However, I was hoping to pinpoint the issue and get confirmation that the upgrade is really going to solve this problem.

Please also suggest a manageable method of exporting the data.

Thanks @LN,
Saad

You can turn off the AWS machine and then take a snapshot. If anything at all goes wrong, you can revert. For extra redundancy, you could do a pg_dump to disk and then upload it to S3, which will take a while. But snapshots are generally considered reliable and have the advantage of being simple.
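
A minimal sketch of that pg_dump-to-S3 idea, assuming the default bundled postgres container with the odk user and database, and an S3 bucket you control (the bucket name is a placeholder; adjust names to your install):

cd central
docker-compose exec -T postgres pg_dump -U odk odk > central-backup.sql   # dump the database, blobs included
gzip central-backup.sql                                                   # compress before upload
aws s3 cp central-backup.sql.gz s3://your-backup-bucket/                  # copy the dump to S3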

The threads about this issue have a number of sample logs. You would be looking for something like this. No confirmation is perfect, but that would give you a pretty good idea of whether that's what you're experiencing.

How frequently is this happening? If it keeps recurring, you can alternatively try the monit approach that @yanokwa suggested here.
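
Purely as an illustration of the general idea behind such a watchdog (shown here as a plain shell script rather than actual monit configuration, with placeholder URL and paths, not the exact setup from the linked post):

#!/bin/sh
# Hypothetical watchdog sketch: replace the URL with an endpoint that returns 200 only when the backend is healthy,
# and the path with your actual Central install directory.
if ! curl -fsS -o /dev/null https://your-central-server/; then
    cd /path/to/central && docker-compose stop && docker-compose up --detach
fi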

Thanks. I am already taking snapshots of the server as a precaution. However, a snapshot doesn't really give me the data itself (i.e. the images), especially if I want to take it outside of the AWS environment (the ultimate safety).

I will have a look at the pg_dump tool. Would it allow me to export the images?

The issue now occurs around 2 or 3 times a week, as the traffic and data have grown. Next time it happens, I will try to pick up all types of logs from all containers and match them against the troubleshooting threads you mentioned. If they match, I will probably plan for the upgrade.

Thanks. I will keep this thread open for a while until I am able to post a solution to help anyone in the future.

Saad

Hi Saad,

Yes, pg_dump includes blob files in the database dump.

Hi @LN and all,

I have my Central instance in the same error 500 state right now. Please let me know which logs I should pick up for troubleshooting and share with you.

Thanks

Logs for central_service_1:

ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
::ffff:172.18.0.8 - - [03/Oct/2022:20:32:05 +0000] "GET /v1/sessions/restore HTTP/1.0" 500 271
::ffff:172.18.0.8 - - [03/Oct/2022:20:32:05 +0000] "GET /v1/sessions/restore HTTP/1.0" 500 271

Logs for central_postgres_1 (I assume it contains lines from a previous reboot as well):

PostgreSQL Database directory appears to contain a database; Skipping initialization

LOG:  database system was shut down at 2022-10-03 12:08:24 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  incomplete startup packet
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction
LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
LOG:  autovacuum launcher shutting down
FATAL:  terminating connection due to administrator command
LOG:  shutting down
LOG:  database system is shut down

PostgreSQL Database directory appears to contain a database; Skipping initialization

LOG:  database system was shut down at 2022-10-03 12:57:13 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  incomplete startup packet
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction

Your issue will most certainly be resolved if you upgrade to the latest version of Central (v1.5.3).

I would encourage you to take a snapshot of your current install, do the upgrade on that snapshot, and confirm that it works. If it does, then you can make a plan to upgrade your production install.
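
For reference, a rough outline of the usual git-based upgrade flow, to be tried on the snapshot copy first (treat this only as a sketch; the exact documented steps for your target version are in the Central upgrade docs):

cd central
docker-compose stop            # stop the running containers
git pull                       # fetch the new Central release
git submodule update -i        # update the client and server submodules
docker-compose build           # rebuild images for the new version
docker-compose up --detach     # start the upgraded stack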

Hi,

I have upgraded the Central server to v1.5.3, and so far I have not faced any issues. So this seems to be the solution.

Many thanks @yanokwa and @LN for your help!

Thanks so much for following up, that’s really helpful!
