Postgres Error causing Internal Server Error 500

1. What is the issue? Please be detailed.
The ODK Central Postgres instance is failing; the error log shows the following:

LOG:  database system was shut down at 2022-04-28 15:43:04 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  incomplete startup packet
ERROR:  ODK02:3:'CAREGIVERS-Beta':'21087'
CONTEXT:  PL/pgSQL function check_form_version() line 16 at RAISE
STATEMENT:  
	update form_defs
	set "draftToken"=$1,"enketoId"=$2,"publishedAt"=$3
	
	where "id"=$4
	returning *
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  
	insert into audits ("actorId", action, "acteeId", details, notes, "loggedAt", processed, failures)
	values ($1, $2, $3, $4, $5, clock_timestamp(), $6, 0)
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  delete from submissions where "formId"=$1 and draft=true
ERROR:  ODK02:3:'CAREGIVERS-Beta':'21088'
CONTEXT:  PL/pgSQL function check_form_version() line 16 at RAISE
STATEMENT:  
	update form_defs
	set "draftToken"=$1,"enketoId"=$2,"publishedAt"=$3
	
	where "id"=$4
	returning *
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  
	insert into audits ("actorId", action, "acteeId", details, notes, "loggedAt", processed, failures)
	values ($1, $2, $3, $4, $5, clock_timestamp(), $6, 0)
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  delete from submissions where "formId"=$1 and draft=true
ERROR:  ODK02:3:'IDELA-Beta':'21059'
CONTEXT:  PL/pgSQL function check_form_version() line 16 at RAISE
STATEMENT:  
	update form_defs
	set "draftToken"=$1,"enketoId"=$2,"publishedAt"=$3
	
	where "id"=$4
	returning *
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  
	insert into audits ("actorId", action, "acteeId", details, notes, "loggedAt", processed, failures)
	values ($1, $2, $3, $4, $5, clock_timestamp(), $6, 0)
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  delete from submissions where "formId"=$1 and draft=true
LOG:  unexpected EOF on client connection with an open transaction

2. What steps can we take to reproduce this issue?
Install ODK Central version 1.4.2.
Use it normally.
Observe the periodic error described above, which results in the Postgres server going down.

3. What have you tried to fix the issue?
Restart using docker-compose down / docker-compose up -d

4. Upload any test forms, screenshots, or logs below.
Logs included above.

Most of what I see in the log looks normal to me, for example, this error:

ERROR:  ODK02:3:'CAREGIVERS-Beta':'21087'

This error happens when there is an attempt to publish a Draft Form with the same combination of Project ID, Form ID, and Form version string as an existing published Form version. If you use the ODK Central frontend, this error should be pretty rare, because the frontend will prevent you from publishing a Draft Form with the same version string as a published version of the Form. The error can come up if you delete a Form, then create a new Form in the same Project with the same Form ID. It can also come up if you aren't using the frontend and are using the API instead. If you can determine what's causing these errors, that might be part of the answer.
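
If the API is involved, one way to avoid the conflict is to give the Draft a new, unique version string when publishing it. A rough sketch, with placeholder host, project ID, form ID, and token:

# Publish the current Draft under a new version string (all values below are placeholders).
curl --request POST \
  --header "Authorization: Bearer $TOKEN" \
  "https://central.example.org/v1/projects/$PROJECT_ID/forms/$FORM_ID/draft/publish?version=2022.04.29-1"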

That said, this is a database error that Central knows how to handle. Normally, this error will result in a 409 error, not a 500. The part of the log that is more unexpected is this line:

LOG:  unexpected EOF on client connection with an open transaction

I'm not quite sure what could be causing this error. It might depend on your database setup. Are you using an external database?

How often does this cause the server to go down?

It'd also be useful to see a log of the service container in case there's an error visible there (the docs list the specific command to run).
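
For reference, that's something like the following, run from the directory where Central is installed (service is the container's name in the default docker-compose setup):

cd central   # or wherever your docker-compose.yml lives
docker-compose logs --tail=200 service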


Don’t do this! Central is intended to be configured and administered entirely with docker-compose stop and that is what we show in all our instructions. If you inadvertently do a docker-compose down, you’ll have to do a lot of extra work to reattach your database.

https://docs.getodk.org/central-install/#installing-elsewhere has this warning.
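
For reference, the safe pattern looks like this:

# Safe: stops and restarts the containers in place, keeping their volumes attached.
docker-compose stop
docker-compose start

# Avoid: removes the containers; per the warning above, you may then have to do
# extra work to reattach your database.
# docker-compose down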


Thank you for your reply. To answer your question about what happens in the service log, it's mostly this over and over again:

ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)

... until I restart it. Then everything works fine for a while.

To answer your other questions: yes, it is happening about once per day, and the entire service is down until I restart it.

We are using a local Postgres database Docker container (image postgres:9.6) on the same DigitalOcean droplet as the rest of the services, with its data stored on the host at /var/lib/postgresql/data in a volume called postgres_data.
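
If it's useful to double-check that mapping, something like this shows it (the container name is a guess based on the default compose project name; yours may differ):

# Show the image the database container runs and the volume mounted at /var/lib/postgresql/data.
docker inspect --format '{{.Config.Image}} {{json .Mounts}}' central_postgres_1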

Thank you for your warning. I will certainly heed your advice.

So at this point I'm thinking it is either:

  1. A result of not shutting down and restarting the service correctly, as mentioned by @yanokwa. This seems unlikely, because our Docker configuration stores data in a volume on the host, and we do not appear to have lost any data when bringing the service back up.
    OR
  2. A bug, because the system should fail gracefully, without taking down the entire server, when someone tries to publish a form whose details conflict with something already published. IMHO, this seems more likely.

What direction should I head?

If docker-compose down broke your install, it would be like you had a fresh install with no data. It seems like that is not the case. My guess (and @Matthew_White can correct me if I'm wrong) is that there is a database leak.

That's what I'm thinking as well. The service log is often a better guide to what actually went wrong, and the ConnectionError points to the possibility that the database connection pool is full. Given that, I think the behavior you're seeing may be related to this post:

This may be an issue with Central, and we're investigating the issue further. We'll let you know once we have more information (or more questions)!
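
In the meantime, if you want to check whether connections are piling up on the database side, something like this could help (the container, user, and database names assume a default Docker install of Central):

# Count Postgres backends by state; a pile-up of idle or idle-in-transaction sessions
# would point toward a pool/connection leak.
docker exec -it central_postgres_1 \
  psql -U odk -d odk -c 'select state, count(*) from pg_stat_activity group by state;'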

Thank you for your response and for your diligent efforts. We would be very grateful for any help you can provide; as you can imagine, this has a real business impact for us. When the error happens again, I will post full logs from both the database and the Central service here. In the meantime, it appears that you are on the right track, and we are eager to see what you come up with.

Blessings to you in your work!

The problem occurred again. I have attached a log file of the Central service covering well before and well after the connection to the database failed. The first sign of failure that I can see is at timestamp 29/Apr/2022:23:23:59 in the log.

(attached file removed for privacy)
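
In case it helps anyone following along, the window around that timestamp can be pulled out of the service log with something like:

# Show the service log lines around the first sign of failure (adjust the pattern as needed).
docker-compose logs --no-color service | grep --context=20 "29/Apr/2022:23:23:59"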

Can I bump this issue to get an idea of what path we are on toward a solution? I don't feel like we have any power to solve this as users of ODK. If we need to purchase a support agreement, I'm willing to discuss it.

Thanks again for all your help.

We have some ideas that we are exploring and hope to have more for you tomorrow.

Can you confirm whether you have periodic scripts doing API requests?

Hello LN, and thank you for your response. We do not have any periodic scripts that make any API requests. The central service log is attached to one of my previous posts, and records the entire time from when I restarted services to the time it went down. I was hoping that log would be helpful. If there is anything else I can provide, please let me know. I will be happy to provide it to you.

Hello again LN,

As it turns out, we actually do have an Azure Data Factory that is accessing our ODK Central instance using an API call to get CSV files. We are working on getting the exact URL that is called. But I wanted to update this as quickly as possible so you know you might be on the right track with your question.

Will update as soon as I know what URL is being called.

Thanks!

@Matthew_White can now reproduce a similar crash with 10+ simultaneous long-running requests. This would be more likely to happen with requests for larger results, with a slower database, or possibly with a slow connection between Central and the requester. What's the submission order of magnitude? 10k? 100k? 1M?
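
If it's easier to check directly on the server, a rough per-form count can be pulled from the database (the container, user, and database names below assume a default Docker install):

# Count submissions per form straight from the database.
docker exec -it central_postgres_1 \
  psql -U odk -d odk -c 'select "formId", count(*) from submissions group by "formId" order by 2 desc;'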

Had the exact same forms with the same data collector behavior and the same data request all been working on v1.2? Has there been a recent big flurry of submissions?

Hello LN,

I'm not really sure how to get answers to your questions. We would be happy to provide you with a login to our Digital Ocean droplet so that you can look around. I would rather not post credentials here though. If this is an avenue we can pursue, please feel free to reach out to me via email:

(email masked)

We had previously been on version 1.1, so it was a big upgrade for us. We are going to test tonight to see if our Azure Data Factory request is the thing killing the server.

Thank you again for your support. We very much appreciate it.

1 Like

We don't see this issue on ODK Cloud, so some combination of your infrastructure and usage is triggering it.

The Central team is investigating the possibility of a database leak and we'll update this topic once we learn more. We are also considering releasing a patch to increase the database pool size as a short-term workaround while we try to get to the root cause. More on that later.

In the meantime, here are a few things you can do to reduce the load on the server:

You can also automate restarting the Central service when you get 500s. This is not a great long-term fix, but the restart happens very quickly, so it's something to consider if you don't want to babysit the server.

Assuming you are using Ubuntu, here are the instructions:

Install monit
apt-get install monit;
Add /etc/monit/conf.d/central.conf
check program central-healthcheck with path /usr/local/bin/central-healthcheck.sh
  if status > 0 for 2 cycles then exec "/bin/bash -c '/usr/bin/docker restart central_service_1'"
Add /usr/local/bin/central-healthcheck.sh
#!/bin/bash

# POST an intentionally bad login to the sessions endpoint; a healthy Central responds 401.
CENTRAL_HTTP_CODE=$(curl --silent --insecure --include --request POST --header "Content-Type: application/json" --data-binary "{\"email\": \"email\",\"password\": \"password\"}" --write-out "%{http_code}" --output /dev/null https://127.0.0.1/v1/sessions)
if [[ $CENTRAL_HTTP_CODE -ne 401 ]]; then
  # Any other status (e.g. 500, or 000 when the service is unreachable) means Central is unhealthy.
  echo "$CENTRAL_HTTP_CODE" > /dev/stderr
  exit 1
fi
exit 0
Set permissions and reload monit
chmod 0755 /usr/local/bin/central-healthcheck.sh;
chmod 0644 /etc/monit/conf.d/central.conf;
monit reload;
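Once monit is reloaded, you can confirm it has picked up the check with:
# Show the status of all monitored items, including the new central-healthcheck program.
monit summary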

Hello Yaw,

Thank you again for your reply here, and for your world-class support. We were able to convert to Power BI API calls for data refreshing and we have turned off our Azure Data Factory calls to reduce load on the server. We have also installed and begun to run monit on our server and hopefully it will minimize our business impact for this particular issue.

I will raise the other issue we discussed with our Sr. IT Staff and see what direction they want to go. We are very thankful for your support.

A post was split to a new topic: Rebooting Central automatically in Oracle Linux

Hello ODK Team,

We disabled the nightly process that requests the unfiltered submissions in CAREGIVERS, and the service did not need to be restarted. With further experimentation today, and looking at the logs, it appears that the client requesting the unfiltered submissions may have a timeout that is shorter than the maximum execution time the Central service supports. Because the Node.js code uses a stream pipeline, when the client closes the stream there is an ERR_STREAM_PREMATURE_CLOSE error from Node.js. Below is the stack trace from the logs.

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at ServerResponse.onclose (internal/streams/end-of-stream.js:121:38)
    at ServerResponse.emit (events.js:412:35)
    at ServerResponse.emit (domain.js:532:15)
    at Socket.onServerResponseClose (_http_server.js:221:44)
    at Socket.emit (events.js:412:35)
    at Socket.emit (domain.js:470:12)
    at TCP.<anonymous> (net.js:675:12)
    at TCP.callbackTrampoline (internal/async_hooks.js:131:17)

After that point, the client apparently retries immediately, requesting the same unfiltered submissions.csv file again.
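
For what it's worth, that premature close is easy to provoke from the client side with a short timeout. A rough sketch, with placeholder host, IDs, and token:

# Abort a large unfiltered CSV export after 5 seconds to trigger a premature close server-side.
curl --silent --max-time 5 \
  --header "Authorization: Bearer $TOKEN" \
  --output /dev/null \
  "https://central.example.org/v1/projects/$PROJECT_ID/forms/$FORM_ID/submissions.csv"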

I wonder: if the pipeline failure were handled more gracefully, closing the database connection when the stream errors instead of throwing an unhandled exception, might that lead to better management of the connection pool?

I might be entirely on the wrong track here. But I thought I would throw it out there.

Thanks for your consideration.
