Postgres Error causing Internal Server Error 500

Don't do this! Central is intended to be configured and administered entirely with docker-compose, and shut down with docker-compose stop; that is what we show in all our instructions. If you inadvertently do a docker-compose down, you'll have to do a lot of extra work to reattach your database.

https://docs.getodk.org/central-install/#installing-elsewhere has this warning.


Thank you for your reply. To answer your question about what happens in the service log, it's mostly this over and over again:

ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)

... until I restart it. Then everything works fine for a while.

To answer your other questions: yes, it is happening about once per day, and the entire service is down until I restart it.

We are using a local Postgres database Docker container (image postgres:9.6) on the same DigitalOcean droplet as the rest of the services. Its data lives on the host at /var/lib/postgresql/data in a volume called postgres_data.

Thank you for your warning. I will certainly heed your advice.

So at this point I'm thinking it is either:

  1. A result of not shutting down and restarting the service correctly, as mentioned by @yanokwa. This seems unlikely because our Docker configuration stores data in a volume on the host, and we do not appear to have lost any data when we bring the service back up.
  2. A bug, because the system should be able to fail gracefully without killing the entire server if someone tries to publish a form with details that conflict with something already published. IMHO, this seems more likely.

What direction should I head?

If docker-compose down broke your install, it would be like you had a fresh install with no data. It seems like that is not the case. My guess (and @Matthew_White can correct me if I'm wrong) is that there is a database leak.

That's what I'm thinking as well. The service log is often a better guide to what actually went wrong, and the ConnectionError points to the possibility that the database connection pool is full. Given that, I think the behavior you're seeing may be related to this post:

This may be an issue with Central, and we're investigating the issue further. We'll let you know once we have more information (or more questions)!
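
In the meantime, if you want to check whether the pool really is saturated when the 500s start, you can count Central's connections from the Postgres side. Here's a rough sketch using node-postgres, not anything built into Central; the host/port and the odk database, user, and password are assumptions based on a default install, so adjust them to match your docker-compose setup (or run the same SELECT with psql inside the postgres container):

// Sketch only (not part of Central): count the connections held by the Central
// database user, to see whether the pool is saturated when the 500s start.
// Assumes node-postgres (npm install pg) and default odk/odk/odk credentials.
const { Client } = require('pg');

async function main() {
  const client = new Client({
    host: '127.0.0.1',   // assumption: the postgres port is reachable from the host
    port: 5432,
    user: 'odk',
    password: 'odk',
    database: 'odk',
  });
  await client.connect();
  const { rows } = await client.query(
    `SELECT state, count(*) AS connections
       FROM pg_stat_activity
      WHERE usename = 'odk'
      GROUP BY state`
  );
  console.table(rows);  // many rows stuck near the pool limit suggests exhaustion
  await client.end();
}

main().catch((err) => { console.error(err); process.exit(1); });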

Thank you for your response and for your diligent efforts. We would be very grateful for any help you can provide; as you can imagine, this definitely has a business impact for us. When the error happens again, I will post full logs from both the database and the central service here. In the meantime, it appears that you are on the right track, and we are eager to see what you come up with.

Blessings to you in your work!

The problem occurred again. I have attached a log file from the central service covering well before and well after the connection to the database failed. The first sign of failure that I can see is at timestamp 29/Apr/2022:23:23:59.

(attached file removed for privacy)

Can I bump this issue to get an idea of what path we are on toward a solution? I don't feel like we have any power to solve this as users of ODK. If we need to purchase a support agreement, I'm willing to discuss it.

Thanks again for all your help.

We have some ideas that we are exploring and hope to have more for you tomorrow.

Can you confirm whether you have periodic scripts doing API requests?

Hello LN, and thank you for your response. We do not have any periodic scripts that make any API requests. The central service log is attached to one of my previous posts, and records the entire time from when I restarted services to the time it went down. I was hoping that log would be helpful. If there is anything else I can provide, please let me know. I will be happy to provide it to you.

Hello again LN,

As it turns out, we actually do have an Azure Data Factory that is accessing our ODK Central instance using an API call to get CSV files. We are working on getting the exact URL that is called. But I wanted to update this as quickly as possible so you know you might be on the right track with your question.

Will update as soon as I know what URL is being called.

Thanks!

@Matthew_White can now reproduce a similar crash with 10+ simultaneous long-running requests. This would be more likely to happen with requests for larger results, with a slower database, or possibly with a slow connection between Central and the requester. What's the submission order of magnitude? 10k? 100k? 1m?

Had the exact same forms with the same data collector behavior and the same data request all been working on v1.2? Has there been a recent big flurry of submissions?
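
To give a concrete picture of what "simultaneous long-running requests" means, here's a rough sketch that fires a dozen CSV export requests at once. The server URL, token, and project/form in the path are placeholders, and it assumes Node 18+ for the built-in fetch; it's just for illustration, not an official reproduction script:

// Sketch only: fire several CSV export requests at the same time to see whether
// this is enough to exhaust the connection pool. BASE, TOKEN, and the project/form
// in the path are placeholders; take the real export path from the ODK Central API docs.
const N = 12;                                  // a little more than the default pool size
const BASE = 'https://your.central.server';    // placeholder
const TOKEN = process.env.ODK_TOKEN;           // placeholder: a token from POST /v1/sessions

async function exportOnce(i) {
  const res = await fetch(`${BASE}/v1/projects/1/forms/your_form/submissions.csv.zip`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  console.log(`request ${i}: HTTP ${res.status}`);
  await res.arrayBuffer(); // drain the body so the request runs to completion
}

Promise.all(Array.from({ length: N }, (_, i) => exportOnce(i))).catch(console.error);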

Hello LN,

I'm not really sure how to get answers to your questions. We would be happy to provide you with a login to our Digital Ocean droplet so that you can look around. I would rather not post credentials here though. If this is an avenue we can pursue, please feel free to reach out to me via email:

(email masked)

We had previously been on version 1.1, so it was a big upgrade for us. We are going to test tonight to see if our Azure Data Factory request is the thing killing the server.

Thank you again for your support. We very much appreciate it.


We don't see this issue on ODK Cloud, so there is some combination of your infrastructure and usage that is triggering this issue.

The Central team is investigating the possibility of a database leak and we'll update this topic once we learn more. We are also considering releasing a patch to increase the database pool size as a short-term workaround while we try to get to the root cause. More on that later.

In the meantime, here are a few things you can do to reduce the load on the server:

You can also automate restarting the Central service when you get 500s. This is not a great long-term fix, but the restart happens very quickly, so it's something to consider if you don't want to babysit the server.

Assuming you are using Ubuntu, here are the instructions:

1. Install monit:

   apt-get install monit

2. Add /etc/monit/conf.d/central.conf:

   check program central-healthcheck with path /usr/local/bin/central-healthcheck.sh
     if status > 0 for 2 cycles then exec "/bin/bash -c '/usr/bin/docker restart central_service_1'"

3. Add /usr/local/bin/central-healthcheck.sh:

   #!/bin/bash

   # Attempt a login with bogus credentials. A healthy Central answers 401;
   # anything else (e.g., a 500 or no response at all) makes monit restart
   # the service container after two failed cycles.
   CENTRAL_HTTP_CODE=$(curl --silent --insecure --include --request POST --header "Content-Type: application/json" --data-binary "{\"email\": \"email\",\"password\": \"password\"}" --write-out "%{http_code}" --output /dev/null https://127.0.0.1/v1/sessions)
   if [[ $CENTRAL_HTTP_CODE -ne 401 ]]; then
     echo "$CENTRAL_HTTP_CODE" > /dev/stderr
     exit 1
   fi
   exit 0

4. Set permissions and reload monit:

   chmod 0755 /usr/local/bin/central-healthcheck.sh
   chmod 0644 /etc/monit/conf.d/central.conf
   monit reload

Hello Yaw,

Thank you again for your reply here, and for your world-class support. We were able to convert to Power BI API calls for data refreshes and have turned off our Azure Data Factory calls to reduce load on the server. We have also installed monit on our server and begun running it; hopefully it will minimize the business impact of this particular issue.

I will raise the other issue we discussed with our Sr. IT Staff and see what direction they want to go. We are very thankful for your support.

A post was split to a new topic: Rebooting Central automatically in Oracle Linux

Hello ODK Team,

We disabled the nightly process that requests unfiltered submissions in CAREGIVERS, and the service did not need to be restarted. With further experimentation today, and looking at the logs, it appears that the client requesting the unfiltered submissions may have a timeout shorter than the maximum execution time the central service supports. Because the JS code uses a pipeline, when the client closes the stream Node.js raises an ERR_STREAM_PREMATURE_CLOSE error. Below is the stack trace from the logs.

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at ServerResponse.onclose (internal/streams/end-of-stream.js:121:38)
    at ServerResponse.emit (events.js:412:35)
    at ServerResponse.emit (domain.js:532:15)
    at Socket.onServerResponseClose (_http_server.js:221:44)
    at Socket.emit (events.js:412:35)
    at Socket.emit (domain.js:470:12)
    at TCP.<anonymous> (net.js:675:12)
    at TCP.callbackTrampoline (internal/async_hooks.js:131:17)

After that point, the client apparently retries immediately, requesting the same unfiltered submissions.csv file again.

I wonder whether handling the pipeline failure more gracefully, closing the database connection when it fails instead of throwing an unhandled exception, might contribute to better pooled connection management.
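
Something like this rough sketch is what I have in mind; it uses node-postgres and pg-query-stream for illustration rather than Central's actual code, and the query and serialization step are placeholders:

// Sketch: release the pooled client in the pipeline callback, which fires both
// on success and on a premature close of the response stream.
const { pipeline, Transform } = require('stream');
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');

const pool = new Pool(); // connection settings come from the PG* environment variables

async function streamSubmissions(res) {
  const client = await pool.connect();
  const rows = client.query(new QueryStream('SELECT * FROM submissions')); // placeholder query

  // Stand-in for the real CSV serializer: turn row objects into text chunks.
  const serialize = new Transform({
    writableObjectMode: true,
    transform(row, _encoding, callback) {
      callback(null, JSON.stringify(row) + '\n');
    },
  });

  pipeline(rows, serialize, res, (err) => {
    client.release(); // give the connection back even if the client hung up early
    if (err) console.error('export aborted:', err.message);
  });
}

The point is just that the pipeline callback also runs on ERR_STREAM_PREMATURE_CLOSE, so the connection goes back to the pool instead of leaking.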

I might be entirely on the wrong track here. But I thought I would throw it out there.

Thanks for your consideration.


Thanks, @jcwatson11! This is useful information. We've looked into these premature close errors, and as far as we can tell, a premature close error on its own doesn't result in a database connection leak. (Node knows to close the database connection if there's a premature close, even though it logs the error.) However, we have identified a case where enough simultaneous long-running requests will result in a connection leak, and we're hoping to fix it as part of the upcoming v1.5 release. We're also planning to add the ability to configure the database connection pool so that it can support more simultaneous connections. (Note that this is only a good idea for larger servers that have the resources to support a larger pool.) We're currently planning to release v1.5 in early June.

Thanks for the update, Matthew. Yaw had mentioned trying to get a baseline before recommending a patch. Is the v1.5 June release the patch he was talking about? Or is there any possibility of getting something into the master branch that we could update and build to get the fix sooner than the official release?

Again, we very much appreciate your help with all of this. You guys are providing some pretty amazing support.
