I am facing the same issue on two of my ODK Central installations, both containing a fairly large amount of data. One instance has around 40k submissions (each with at least 2 images), while the other has around 6k submissions (each with around 3 images).
I suspect the issue is triggered when someone starts a download of all data with media images. The Central server then starts showing the following symptoms:
Unable to log in; the message shown is SOMETHING WENT WRONG - ERROR 500.
If you already have a logged-in session, the portal shows: THE USER IS ALREADY LOGGED IN. PLEASE REFRESH THE PAGE.
There is no way to log in to the Central portal from any account.
Field workers are also unable to send data to the server; the Collect app starts showing error 500 and CANNOT CONNECT messages.
Central version is 1.3.3. There is no shortage of RAM, CPU, or disk space.
Temporary workaround: nothing else I tried worked, so I had to restart the whole machine, after which everything worked again. But after a couple of days, the same thing happened.
If someone could help, I would be grateful. Let me know if you need me to pull out some logs.
Thanks, but that does not work for me. I have tried clearing the cache, changing browsers, and even changing computers.
Restarting is not a good solution, although it is the only one I know so far. There is always a risk of something going wrong and losing all the data. The data is now so large that it's not easy to back it up frequently. There is no shortage of RAM, CPU, or disk space. The version is 1.3.3, but I assume 1.4 has similar issues (I saw some threads on the forum).
I pushed ODK Aggregate to its limits with huge amounts of data across different projects, and it never broke down. But Central keeps having these issues, and I need a permanent fix.
Let me know if I could get you some logs for troubleshooting.
It seems the initiated download is causing the issue. I have an 8 GB dual-core server that shows below 5% RAM usage at all times, even when the issue occurs. As far as I can tell, the service containers are not running into memory pressure either. The version of Central I am using is 1.3.3.
Sorry, looks like I missed that in your original message. We've made a lot of changes related to database connections since then. If you can make a backup and upgrade to v1.5.3, I think that's likely to solve the problem.
My best guess is that you're running into the issue reported in this thread. You can read through the thread for some ideas to get useful logging. In particular, this thread may be helpful. Once you get 500s, it's safe to run docker-compose stop && docker-compose up --detach to restart.
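For anyone following along, that restart sequence looks something like this. The install directory is an assumption; run it from wherever your Central docker-compose.yml actually lives:

```shell
# Assumed install path -- substitute your own Central checkout.
cd /root/central

# Stop the containers, then bring them back up in the background.
# This restarts Central without rebooting the whole machine.
docker-compose stop && docker-compose up --detach
```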
Thanks. Yes, an upgrade is on my mind (I am pretty much a new-version enthusiast!). But since it's a production machine, I am hesitant to upgrade for fear of losing data. Backing up the data is also an issue, because pulling 40 GB of image data out through the ODK interface does not really work for me. When the server goes into the broken (error 500) state, the only way I've found to restore it is rebooting.
I had seen these threads while researching on the support forum. However, I was hoping to pinpoint the issue and get confirmation that the upgrade will really solve this problem.
Please also suggest a manageable method of exporting the data.
You can turn off the AWS machine and then take a snapshot. If anything at all goes wrong, you can revert. For extra redundancy, you could do a pg_dump to disk and then upload it to S3, which will take a while. But snapshots are generally considered reliable and have the advantage of being simple.
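A rough sketch of the pg_dump-to-S3 route, assuming a default docker-compose install. The service name, database/user names, install path, and bucket are all assumptions on my part; verify them against your own docker-compose.yml before running anything like this:

```shell
cd /root/central  # assumed install path

# Dump the database from inside the postgres service container.
# Service name, user, and database are assumed defaults -- check
# your own configuration.
docker-compose exec -T postgres \
  pg_dump -U odk -d odk --format=custom > central-$(date +%F).dump

# Push the dump off-machine for redundancy (hypothetical bucket).
aws s3 cp "central-$(date +%F).dump" s3://my-backup-bucket/central/
```

The custom format (--format=custom) compresses the dump and lets you restore selectively with pg_restore later.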
The threads about this issue have a number of sample logs. You would be looking for something like this. No confirmation is perfect but that would give you a pretty good idea of whether that's what you experience.
Alternatively, you can try the monit approach that @yanokwa suggested here. How frequently is this happening?
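If you go the monit route, a minimal check might look something like the sketch below. The address, port, and script path are assumptions; the script would just wrap the docker-compose stop/up commands mentioned earlier:

```
# /etc/monit/conf.d/central -- hypothetical monit check.
# Restart Central's containers when the frontend stops answering.
check host central with address 127.0.0.1
  if failed
     port 443
     protocol https
  then exec "/usr/local/bin/restart-central.sh"
```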