Postgres Error causing Internal Server Error 500

We have some ideas that we are exploring and hope to have more for you tomorrow.

Can you confirm whether you have periodic scripts doing API requests?

Hello LN, and thank you for your response. We do not have any periodic scripts that make API requests. The Central service log is attached to one of my previous posts; it records the entire period from when I restarted the services to the time they went down. I was hoping that log would be helpful. If there is anything else I can provide, please let me know and I will be happy to send it.

Hello again LN,

As it turns out, we actually do have an Azure Data Factory that is accessing our ODK Central instance using an API call to get CSV files. We are working on getting the exact URL that is called. But I wanted to update this as quickly as possible so you know you might be on the right track with your question.

Will update as soon as I know what URL is being called.

Thanks!

@Matthew_White can now reproduce a similar crash with 10+ simultaneous long-running requests. This would be more likely to happen with requests for larger results or with a slower database or possibly a slow connection between Central and the requester. What's the submission order of magnitude? 10k? 100k? 1m?

Had the exact same forms, with the same data collector behavior and the same data requests, all been working on v1.2? Has there been a big flurry of submissions recently?

Hello LN,

I'm not really sure how to get answers to your questions. We would be happy to provide you with a login to our DigitalOcean droplet so that you can look around, but I would rather not post credentials here. If this is an avenue we can pursue, please feel free to reach out to me via email:

(email masked)

We had previously been on version 1.1, so it was a big upgrade for us. We are going to test tonight to see if our Azure Data Factory request is the thing killing the server.

Thank you again for your support. We very much appreciate it.


We don't see this issue on ODK Cloud, so there is some combination of your infrastructure and usage that is triggering this issue.

The Central team is investigating the possibility of a database connection leak, and we'll update this topic once we learn more. We are also considering releasing a patch to increase the database pool size as a short-term workaround while we try to get to the root cause. More on that later.

In the meantime, here are a few things you can do to reduce the load on the server:

You can also automate restarting the Central service when you get 500s. This is not a great long-term fix, but the restart happens very quickly, so it's something to consider if you don't want to babysit the server.

Assuming you are using Ubuntu, here are the instructions:

1. Install monit:

   apt-get install monit

2. Add /etc/monit/conf.d/central.conf:

   check program central-healthcheck with path /usr/local/bin/central-healthcheck.sh
     if status > 0 for 2 cycles then exec "/bin/bash -c '/usr/bin/docker restart central_service_1'"

3. Add /usr/local/bin/central-healthcheck.sh:

   #!/bin/bash

   # Post deliberately invalid credentials to the sessions endpoint.
   # A healthy Central rejects them with HTTP 401; any other code
   # (e.g. 500, or 000 if the server is unreachable) fails the check
   # so that monit restarts the service container.
   CENTRAL_HTTP_CODE=$(curl --silent --insecure --include --request POST --header "Content-Type: application/json" --data-binary "{\"email\": \"email\",\"password\": \"password\"}" --write-out "%{http_code}" --output /dev/null https://127.0.0.1/v1/sessions)
   if [[ $CENTRAL_HTTP_CODE -ne 401 ]]; then
     echo "$CENTRAL_HTTP_CODE" > /dev/stderr
     exit 1
   fi
   exit 0

4. Set permissions and reload monit:

   chmod 0755 /usr/local/bin/central-healthcheck.sh
   chmod 0644 /etc/monit/conf.d/central.conf
   monit reload
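
Once monit is running, you can optionally confirm that the configuration parses and that the check has been picked up (standard monit commands, but verify on your system):

   monit -t         # validate the monit configuration
   monit summary    # list monitored checks, including central-healthcheck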

Hello Yaw,

Thank you again for your reply here, and for your world-class support. We were able to convert to Power BI API calls for data refreshing, and we have turned off our Azure Data Factory calls to reduce load on the server. We have also installed and begun to run monit on our server, and hopefully it will minimize the business impact of this particular issue.
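
For reference, Power BI can consume Central's OData feed directly. A rough sketch of checking a form's feed from the shell, in case it helps anyone else set up a similar refresh (the server name, project and form IDs, and credentials below are placeholders):

   # Fetch the OData service document for a form; an OData client such as
   # Power BI can be pointed at the same URL.
   curl --user "web-user@example.org:password" \
     "https://central.example.org/v1/projects/1/forms/FORM_ID.svc"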

I will raise the other issue we discussed with our Sr. IT Staff and see what direction they want to go. We are very thankful for your support.

A post was split to a new topic: Rebooting Central automatically in Oracle Linux

Hello ODK Team,

We disabled the nightly process that requests the unfiltered submissions in CAREGIVERS, and the service did not need to restart. With further experimentation today, and after looking at the logs, it appears that the client requesting the unfiltered submissions may have a timeout that is shorter than the maximum execution time the Central service supports. Because the JavaScript code uses a pipeline, when the client closes the stream there is an ERR_STREAM_PREMATURE_CLOSE error from Node.js. Below is the stack trace from the logs.

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at ServerResponse.onclose (internal/streams/end-of-stream.js:121:38)
    at ServerResponse.emit (events.js:412:35)
    at ServerResponse.emit (domain.js:532:15)
    at Socket.onServerResponseClose (_http_server.js:221:44)
    at Socket.emit (events.js:412:35)
    at Socket.emit (domain.js:470:12)
    at TCP.<anonymous> (net.js:675:12)
    at TCP.callbackTrampoline (internal/async_hooks.js:131:17)

After that point, the client apparently has a retry policy that immediately requests the same unfiltered submissions.csv file.
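
For what it's worth, a similar premature close can be provoked from the command line by requesting the CSV export with a client-side timeout shorter than the time the export takes. A rough sketch (the project ID, form ID, and token are placeholders, not our actual values):

   # Request the CSV export but give up after 30 seconds; if the export
   # takes longer, Central sees the connection close before it finishes.
   curl --insecure --max-time 30 \
     --header "Authorization: Bearer $TOKEN" \
     --output /dev/null \
     "https://127.0.0.1/v1/projects/1/forms/FORM_ID/submissions.csv.zip"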

I wonder whether handling the pipeline failure more gracefully, by closing the database connection when it fails instead of throwing an unhandled exception, might contribute to better pooled connection management.

I might be entirely on the wrong track here. But I thought I would throw it out there.

Thanks for your consideration.


Thanks, @jcwatson11! This is useful information. We've looked into these premature close errors a bit, and as far as we can tell, a premature close error on its own doesn't result in a database connection leak. (Node knows to close the database connection if there's a premature close, even though it's logging that premature close.) However, we have identified a case where, if there are enough simultaneous long-running requests, that will result in a connection leak. We're hoping to fix that case as part of the upcoming v1.5 release. We're also planning to add the ability to configure the database connection pool so that it can support more simultaneous connections. (Note that that is only a good idea for larger servers that have the resources to support a larger connection pool.) We're currently planning to release v1.5 in early June.

Thanks for the update, Matthew. Yaw had mentioned trying to get a baseline before recommending a patch. Is the v1.5 June release the patch he was talking about? Or is there any possibility of getting something into the master branch that we can pull and build to get the fix sooner than the official release?

Again, we very much appreciate your help with all of this. You guys are providing some pretty amazing support.


The v1.5 release will be a full release that will include new functionality. Since we're planning to release v1.5 in just a few weeks, we likely won't release a patch. Instead, we'll be continuing to investigate the root cause of the issue. We're hoping to be able to provide a fix for the case we know about as part of v1.5. We'll also provide a way to configure the size of the database connection pool.

Has your server been needing to restart? If so, how often is it being restarted? Has monit been automating that restart?

Hi Matthew,

Thanks for your response. It makes sense, and we will eagerly await your new release.

We have switched the external process that was triggering the bug to manual, and we are attempting to run it at times of least business impact. However, when we run it, the service apparently still fails. Then monit restarts the service and everything is back to normal within a few minutes, without human intervention, so we are thankful for that. The business impact of this bug has changed from a nightly critical failure to a minimal impact only when we choose to try to get the data.

Yaw still has access to the server, and is welcome to get logs for more specific details if you need them.

Thanks again.

We've made progress investigating this issue, so I wanted to update this topic with the latest information.

We've seen this issue occur when there are 10+ simultaneous long-running requests. We've also confirmed that the issue will sometimes occur after a premature close of a request for the submission data. In both cases, we believe the problem traces back to an issue with the database library we use. I've filed a GitHub issue with that library. Once the library releases an update, we'll be able to update Central as well. Because we're close to releasing v1.5 and will need to wait for an update from the library, that change will likely go out with v1.6.

Until we've made that change, affected users should take a look at Yaw's suggestions above. We believe that this issue should be uncommon. The issue may come up in the following situations:

  • There are 10+ simultaneous long-running requests.
  • There is a premature close of a request for the submission data (either the CSV export or the OData feed); 10+ of these premature closes can lead to this issue.

These situations will be more common when the database is large or when the network connection is slow (either the connection to Central or the connection to the database).
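
As an illustration of the first situation, here is a rough shell sketch of firing a dozen simultaneous export requests at once (the server, project and form IDs, and token are placeholders, and this is not the exact procedure we used to reproduce the issue):

   # Launch 12 concurrent CSV export requests in the background, then wait.
   for i in $(seq 1 12); do
     curl --silent --insecure \
       --header "Authorization: Bearer $TOKEN" \
       --output /dev/null \
       "https://127.0.0.1/v1/projects/1/forms/FORM_ID/submissions.csv.zip" &
   done
   wait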

As part of v1.5, we will also add the ability to configure the size of the database connection pool. We haven't experimented much with this configuration ourselves. However, if you have experience tuning a database connection pool and enough RAM, you can explore whether adjusting the configuration reduces the frequency of this issue.

Thanks to @jcwatson11 and @abhishek for reporting this issue and for helping us investigate. If there are other users who encounter this issue, please let us know below.


Thank you very much for the very detailed posts, Matthew and Yaw.

A context where this issue occurs fairly regularly (about once every two weeks, sometimes slightly more often) has been reported by my colleagues in Tanzania on a server (ODK Central v1.4.2) that is used by different teams for different projects. Several of those projects are encrypted, have large databases containing large media attachments, and run daily automated CSV exports via the API, which may conflict between the different teams. My Tanzanian colleagues suggested that frequent premature request closes may also be happening with manual exports from the server: with large encrypted data, the "Main data table (no repeats)" CSV export does not succeed, so the end user has to select "All data and media files" to retrieve the data, and because the media attachments are large and the network connection can indeed be quite slow, that export is more likely to fail.

In addition to the very useful suggestions from Yaw, and as a temporary "fix", we suggested voluntarily restarting the services on a weekly basis (to "reset" the number of premature closes we may have ahead of us). The goal is to reduce the likelihood of the services going down at a moment when the admins cannot be reached and a restart is urgently needed within the next 15 minutes. My guess is that this would still not prevent failures due to simultaneous requests, so in our case using date range or other filters (where possible) should probably also be encouraged.
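
For anyone who wants to schedule such a restart, here is a minimal cron sketch that reuses the same Docker command and container name as Yaw's monit example above (the cron file path and schedule are only examples; adjust the container name to your install):

   # /etc/cron.d/central-weekly-restart: restart the Central service
   # container every Sunday at 03:00.
   0 3 * * 0  root  /usr/bin/docker restart central_service_1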


Hello Everyone,

We really appreciate the work done on this. From what I can see in the GitHub repository, it looks like there is a new version 1.5.0 released there. Is this the new version that will allow us to adjust the connection pool settings?

I understand from previous posts in this thread that the root cause of the bug actually lies in a third-party library that ODK staff do not control. But being able to adjust the connection pool size would still be helpful.

If 1.5.0 is ready, would you also update your upgrade docs on the website with any special instructions related to that upgrade, including how to adjust the connection pool size?

Thanks again!

The bulk of the changes for this release were on the frontend so there are no special upgrade instructions.

We did add connection pool configuration but haven't had a chance to think through it further or try it in real-world environments so we're not ready to document it formally. Once we do, we'd like to offer some guidance on how to set it.

In the meantime, you can look at this pull request and try it out yourself. If you do experiment with it, please let us know your findings.
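
To give a rough idea, the change boils down to adding a pool-size setting to the backend's database configuration and rebuilding the service. The sketch below uses an illustrative key name and file location, so treat the pull request as the source of truth:

   # In files/service/config.json.template (illustrative location), add a
   # pool-size entry to the "database" block; the key name below is an
   # assumption -- see the pull request for the actual setting:
   #   "database": {
   #     ...,
   #     "maximumPoolSize": 20
   #   }
   # Then rebuild and restart the service container:
   docker-compose build service
   docker-compose up -d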


Hello @LN,

Thank you for your ongoing diligence in following the bug in the gajus/slonik package. I just bumped the issue there in hopes of a meaningful response from the maintainer.

Jon


@Matthew_White and @LN

It looks like the repository maintainer for the connection pool issue has responded, asking for a PR.

I'm afraid I do not understand enough about the issue to create one. From Matthew's bug report, it looked like he might be in a better position to create a PR.

Now that the maintainer has responded, is it possible that ODK staff might be able to look into this issue and submit a PR?

Thank you in advance!

Jon


We'll be looking at it for the next Central release and will provide updates here.