I am facing the same issue on two of my ODK Central installations, both holding a fairly large amount of data. One instance has around 40k+ submissions (each with at least 2 images); the other has around 6k submissions with around 3 images each.
I suspect the issue is triggered when someone starts a download of all data with media images. The Central server then starts showing the following errors:
Unable to log in; the message shown is SOMETHING WENT WRONG - ERROR 500.
If you already have a logged-in session, the portal starts showing: THE USER IS ALREADY LOGGED IN. PLEASE REFRESH THE PAGE.
There is no way to log in to the Central portal with any account.
Field workers are also unable to send data to the server; the Collect app starts showing error 500 and CANNOT CONNECT messages.
Central version is 1.3.3. There is no shortage of RAM, CPU, or disk space.
Temporary workaround: I tried everything, but nothing worked until I restarted the whole machine, after which everything worked again. A couple of days later, the same thing happened.
If someone could help, I would be grateful. Let me know if you need me to pull out some logs.
Thanks. That does not work for me. I have tried clearing the cache, changing browsers, and even changing computers.
A restart is not a good solution, although it is the only one I know so far. There is always a risk of something going wrong and losing all the data, and the data is now so large that frequent backups are not easy. There is no shortage of RAM, CPU, or disk space. The version is 1.3.3, but I assume 1.4 has similar issues (I saw a thread about it on the forum).
I used ODK Aggregate to its breaking point with very large amounts of data across different projects, and it never broke down. But Central is causing issues, and I need a permanent fix for it.
Let me know if I can get you some logs for troubleshooting.
Thanks, very valid points. My infrastructure is on AWS, and I pick fairly powerful machines (never lighter than a t3.large). However, I check RAM and CPU only on the host, not inside the containers.
I would need some hand-holding on collecting logs (which ones and how), and also on how to check container resource usage.
Note that t3.large instances are burstable, so their performance isn't predictable.
Independent of that, a machine with lots of RAM won't help, because Node.js (which Central uses) won't use more than about 2 GB of RAM unless you explicitly allocate more.
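For checking resources inside the containers rather than on the host, a short sketch (the Node heap value below is only an example, not a recommendation for your setup):

```shell
# Print a one-shot snapshot of CPU and memory usage per container.
docker stats --no-stream

# Node's heap cap can be raised with its standard flag, e.g. a 4 GB
# old-space limit (value is in MB):
#   NODE_OPTIONS=--max-old-space-size=4096
```
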
It seems the initiated download is causing the issue. I have an 8 GB dual-core server that shows below 5% RAM usage at all times, even when the issue occurs, and as far as I can tell the service containers are not running out of memory either. The Central version I am using is 1.3.3.
Are you using the default database configuration? Do you have any database monitoring you can refer to? What does CPU usage look like? Have you looked at the links provided above?
Yes, default database config. I followed the default installation instructions on the ODK website exactly, with no customization on my side.
I don't have any database monitoring; it's all default Docker containers. Do you suspect it's a database issue? I can pull the database container logs and share them if you want.
CPU usage stays under 5% at all times, even when the error occurs.
Yes, I have seen the links above. There is no memory issue on the main host or the service container.
My server is on AWS, running Ubuntu with 8 GB of RAM and around 40 GB of ODK data.
Sorry, looks like I missed that in your original message. We've made a lot of changes related to database connections since then. If you can make a backup and upgrade to v1.5.3, I think that's likely to solve the problem.
My best guess is that you're running into the issue reported in this thread. You can read through the thread for some ideas on getting useful logging. In particular, this thread may be helpful. Once you get 500s, it's safe to run docker-compose stop && docker-compose up --detach to restart.
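For collecting those logs, a sketch assuming a default docker-compose install (the container names are assumptions; docker ps will list yours):

```shell
# Show the last 200 lines from each Central container.
docker logs --tail 200 central_service_1
docker logs --tail 200 central_nginx_1
docker logs --tail 200 central_postgres_1

# Or, from the compose project directory, follow all containers at once:
docker-compose logs --tail 200 --follow
```
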
Thanks. Yes, an upgrade is on my mind (I am quite the new-version enthusiast!). But since it's a production machine, I am hesitant to upgrade for fear of losing data. Backing up the data is also an issue, because pulling 40 GB of image data out through the ODK interface doesn't really work for me. When the server goes into its broken (error 500) state, the only way to restore it is a reboot.
I had seen those threads while researching the support forum. However, I was hoping for confirmation that I have pinpointed the issue, and that the upgrade really will solve this problem.
Please do suggest a manageable method of exporting the data.
You can turn off the AWS machine and then take a snapshot. If anything at all goes wrong, you can revert. For extra redundancy, you could run pg_dump to disk and then upload the dump to S3, which will take a while. But snapshots are generally considered reliable and have the advantage of being simple.
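A minimal sketch of that pg_dump-to-S3 approach (the container name, database name, user, and bucket are all assumptions, so adjust them for your install):

```shell
# Build a dated filename so repeated backups don't overwrite each other.
FILE="central-backup-$(date +%F).sql.gz"

# Dump the Central database from inside the postgres container,
# compressing on the way to disk (db user and name assumed to be "odk").
docker exec central_postgres_1 pg_dump -U odk odk | gzip > "$FILE"

# Copy the compressed dump to an S3 bucket (bucket name is hypothetical).
aws s3 cp "$FILE" "s3://my-backup-bucket/$FILE"
```
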
The threads about this issue include a number of sample logs. You would be looking for something like this. No confirmation is perfect, but that would give you a pretty good idea of whether it matches what you are experiencing.
Alternatively, you can try the monit approach that @yanokwa suggested here. How frequently is this happening?
Thanks. I am already taking snapshots of the server as a precaution. However, a snapshot doesn't really give me the data itself (i.e. the images), especially if I want to take it outside the AWS environment (the ultimate safety).
I will have a look at the pg_dump tool. Would it allow me to export the images?
The issue now occurs around 2 or 3 times a week, as traffic and data have grown. Next time it happens, I will collect all types of logs from all the containers and compare them against the troubleshooting threads you mentioned. If they match, I will probably plan the upgrade.
Thanks. I will keep this thread open for a while, until I am able to post a solution for the future help of anyone who finds it.
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
::ffff:172.18.0.8 - - [03/Oct/2022:20:32:05 +0000] "GET /v1/sessions/restore HTTP/1.0" 500 271
::ffff:172.18.0.8 - - [03/Oct/2022:20:32:05 +0000] "GET /v1/sessions/restore HTTP/1.0" 500 271
Logs for central_postgres_1 (I assume these include lines from a previous reboot as well):
PostgreSQL Database directory appears to contain a database; Skipping initialization
LOG: database system was shut down at 2022-10-03 12:08:24 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: incomplete startup packet
LOG: unexpected EOF on client connection with an open transaction
LOG: unexpected EOF on client connection with an open transaction
LOG: unexpected EOF on client connection with an open transaction
LOG: unexpected EOF on client connection with an open transaction
LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: autovacuum launcher shutting down
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down
PostgreSQL Database directory appears to contain a database; Skipping initialization
LOG: database system was shut down at 2022-10-03 12:57:13 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: incomplete startup packet
LOG: unexpected EOF on client connection with an open transaction
LOG: unexpected EOF on client connection with an open transaction
Your issue will almost certainly be resolved if you upgrade to the latest version of Central (v1.5.3).
I would encourage you to take a snapshot of your current install, do the upgrade on that snapshot, and confirm that it works. If it does, you can then make a plan to upgrade your production install.
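Roughly, the upgrade on a docker-compose install looks like the following. This is a sketch based on the documented procedure at the time, so verify it against the current ODK Central upgrade docs before running it against production:

```shell
cd central                    # the cloned ODK Central repository
git pull                      # fetch the target release
git submodule update -i       # update the client/server submodules
docker-compose stop           # stop the running containers
docker-compose build          # rebuild images for the new version
docker-compose up --detach    # start the upgraded stack
```
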