Postgres Error causing Internal Server Error 500

Thanks, @jcwatson11! This is useful information. We've looked into these premature close errors, and as far as we can tell, a premature close error on its own doesn't result in a database connection leak. (Node knows to close the database connection if there's a premature close, even though it's logging that premature close.) However, we have identified a case in which enough simultaneous long-running requests will result in a connection leak. We're hoping to fix that case as part of the upcoming v1.5 release. We're also planning to add the ability to configure the database connection pool so that it can support more simultaneous connections. (Note that this is only a good idea for larger servers that have the resources to support a larger connection pool.) We're currently planning to release v1.5 in early June.

Thanks for the update, Michael. Yaw had mentioned trying to get a baseline before recommending a patch. Is this 1.5 June release the patch that he was talking about? Or is there any possibility of having something in the master branch that we can update/build to get the patch sooner than the official release?

Again, we very much appreciate your help with all of this. You guys are providing some pretty amazing support.

1 Like

The v1.5 release will be a full release that will include new functionality. Since we're planning to release v1.5 in just a few weeks, we likely won't release a patch. Instead, we'll be continuing to investigate the root cause of the issue. We're hoping to be able to provide a fix for the case we know about as part of v1.5. We'll also provide a way to configure the size of the database connection pool.

Has your server been needing to restart? If so, how often is it being restarted? Has monit been automating that restart?

Hi Matthew,

Thanks for your response. It makes sense, and we will eagerly await your new release.

We have switched the external process that was triggering the bug to run manually, and we are attempting to run it at times of least business impact. However, when we run it, the service apparently still fails. Then monit restarts the service and everything is back to normal within a few minutes without human intervention, so we are thankful for that. The business impact of this bug has changed from a nightly critical business failure to a minimal impact that occurs only when we choose to try to get the data.

Yaw still has access to the server, and is welcome to get logs for more specific details if you need them.

Thanks again.

We've made progress investigating this issue, so I wanted to update this topic with the latest information.

We've seen this issue occur when there are 10+ simultaneous long-running requests. We've also confirmed that the issue will sometimes occur after a premature close of a request for submission data. In both cases, we believe the issue traces back to a problem in the database library we use. I've filed a GitHub issue with that library. Once the library releases an update, we'll be able to update Central as well. Because we're close to releasing v1.5 and will need to wait for an update from the library, that change will likely go out with v1.6.

Until we've made that change, affected users should take a look at Yaw's suggestions above. We believe that this issue should be uncommon. The issue may come up in the following situations:

  • There are 10+ simultaneous long-running requests.
  • There is a premature close of a request for the submission data (either the CSV data or the OData feed). 10+ of these premature closes can lead to this issue.

These situations will be more common when the database is large or when the network connection is slow (either the connection to Central or the connection to the database).

As part of v1.5, we will also add the ability to configure the size of the database connection pool. We haven't experimented much with this configuration ourselves. However, if you have experience tuning a database connection pool and enough RAM, you can explore whether adjusting the configuration reduces the frequency of this issue.
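
If you do experiment with that configuration, one way to gauge whether the connection pool is actually being exhausted is to count the connections currently open on the database. Here is a minimal sketch of such a check (run it against the Central database, for example via psql):

-- Count open connections grouped by database and state
-- ('active', 'idle', 'idle in transaction', ...).
SELECT datname, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY datname, state
ORDER BY datname, state;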

Thanks to @jcwatson11 and @abhishek for reporting this issue and for helping us investigate. If there are other users who encounter this issue, please let us know below.

4 Likes

Thank you very much for the very detailed posts, Matthew and Yaw.

My colleagues in Tanzania have reported a context where this issue occurs fairly frequently (about once every two weeks, sometimes slightly more often). Their server (ODK Central v1.4.2) is used by different teams for different projects, including several encrypted projects with large databases containing large media attachments, plus daily automated CSV export requests via the API that may conflict across teams. My Tanzanian colleagues also suggested that frequent premature request closes may be happening with manual exports from the server: the "Main data table (no repeats)" CSV export does not succeed with large encrypted data, so the end user has to select "All data and media files" to retrieve the data, and because the media attachments are large and the network connection can indeed be quite slow, that export is more likely to fail.

In addition to the very useful suggestions from Yaw, and as a temporary "fix", we suggested voluntarily restarting the services on a weekly basis (to "reset" the number of premature closes we may have ahead of us). The goal is to avoid the services going down at a moment when admins cannot be reached and it is absolutely urgent to restart them within the next 15 minutes. My guess is that this would still not prevent failures due to simultaneous requests, so date-range or other filters (where possible) should probably be encouraged in our case.

1 Like

Hello Everyone,

We really appreciate the work done on this. From what I can see in the GitHub repository, it looks like version 1.5.0 has been released. Is this the new version that will allow us to adjust the connection pool settings?

I understand from previous posts in this thread that the root cause of the bug is actually dependent on a third party library that ODK staff do not have control over. But being able to adjust the connection pool size would be helpful.

If 1.5.0 is ready, would you also update the Upgrade docs on the website with any special instructions related to that upgrade, including how to adjust the connection pool size?

Thanks again!

The bulk of the changes for this release were on the frontend so there are no special upgrade instructions.

We did add connection pool configuration, but we haven't had a chance to think it through further or try it in real-world environments, so we're not ready to document it formally. Once we do, we'd like to offer some guidance on how to set it.

In the meantime, you can look at this pull request and try it out yourself. If you do experiment with it, please let us know your findings.

1 Like

Hello @LN,

Thank you for your ongoing diligence in following the bug in the gajus/slonik package. I just bumped the issue there in hopes of a meaningful response from the maintainer.

Jon

2 Likes

@Matthew_White and @LN

It looks like the repository maintainer for the connection pool issue has responded, asking for a PR.

I'm afraid I do not understand enough about the issue to create one. From Matthew's bug report it looked like he might be in a better place to create a PR.

Now that the maintainer has responded, is it possible that ODK staff might be able to look into this issue and submit a PR?

Thank you in advance!

Jon

1 Like

We'll be looking at it for the next Central release and will provide updates here.

Thanks @LN !

Looking forward to it. And we very much appreciate you and the ODK Support staff.

Jon

1 Like

We have been facing this issue 4-5 times a week since mid-August with the last two versions of Central.
The only change on our side is the size of the database, with more submissions and media each day.
Querying the pg_stat_activity table:

SELECT count(pid) AS num_active, query
FROM pg_stat_activity
WHERE age(clock_timestamp(), query_start) > interval '1 hour'
GROUP BY query;

reveals several active queries that are much too old for "simple" SELECT statements, all sharing the same beginning of statement, shown below (occurring when asking for the submission list on the Central web page or calling the API with central2pg functions):

select submissions."id" as "submissions!id", submissions."formId" as "submissions!formId",
  submissions."instanceId" as "submissions!instanceId", submissions."submitterId" as "submissions!submitterId",
  submissions."deviceId" as "submissions!deviceId", submissions."createdAt" as "submissions!createdAt",
  submissions."updatedAt" as "submissions!updatedAt", submissions."reviewState" as "submissions!reviewState",
  submission_defs."id" as "submission_defs!id", submission_defs."submissionId" as "submission_defs!submissionId",
  submission_defs."formDefId" as "submission_defs!formDefId"...

In order to get near-real-time data availability in our database and GIS tools, we call the API for submissions many times every day (each hour for our main form).
Since our campaign started in March, submissions represent a lot of data (even if media are pulled only once), so I think that such frequency and volume must increase the risk of pending connections and stale "active" queries on the database.
I am working on upgrading our central2pg functions to filter data in order to pull only recent data and implement a more lightweight workflow (server resources, bandwidth consumption).
But for the moment I am trying a cron task that terminates such long-running queries. I'll tell you soon if I run into any trouble with that workaround.
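
For reference, here is a minimal sketch of the kind of termination query such a cron task could run (the one-hour threshold and the filter on the statement text are assumptions to adapt to your own context):

-- Terminate backends whose current query has been running for more than one hour.
-- pg_terminate_backend() closes the backend and frees its connection slot;
-- use with care on a production database.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND query_start < clock_timestamp() - interval '1 hour'
  AND query ILIKE 'select submissions.%';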

I see a similar pattern but haven't debugged that properly.

I run a cron job every four hours that downloads all submissions (retaining already downloaded media attachments), then does other tasks (downstream ETL and feeding a dashboard). I should, but don't, download incrementally from the last known submission date, because the subtables don't support OData date filters (I need to double-check that).

I've noticed Central's web UI getting slow or impossible to log in to until I restart the Docker containers.

My issue could well be an exhausted connection pool or some stale db queries.

1 Like

Hi Florian.

I can confirm. We had a discussion with @Matthew_White about it.
I noticed that when I can't log into Central, killing a zombie query to free a database connection is sufficient to connect to Central again, without restarting the Docker containers.

3 Likes

Thanks for reporting these issues, @mathieubossaert and @Florian_May! We've been working on a fix in this area and are planning to include it in the next patch release (v1.5.3). We're hoping to release that next week or possibly the week after.

4 Likes

Related: https://github.com/gajus/slonik/issues/347#issuecomment-1249789813

Looks like the underlying problem in the vendor library was resolved. Is this the vendor update that will be included in the v1.5.3 release?

Yes, @alxndrsn’s patch was accepted and we are preparing a Central release that will include that change.

1 Like

Central v1.5.3 has been released and includes a fix to address this issue. Thank you, @jcwatson11, for reporting this issue!

1 Like

Thanks, @Matthew_White, very much appreciated. I've raised the question of when to plan the upgrade with my team.

1 Like