Server configuration with reverse proxy returns timeout and proxy errors

jary · June 8, 2021, 3:07pm

1. What is the problem? Be very detailed.
ODK Central stopped responding after a non-ODK related upgrade of packages. I have been receiving timeout and proxy errors since.

I had a perfectly working Central setup v. 0.8.0 (whole story here) and after doing a general apt-get upgrade, Central started to return 504 Gateway timeout errors.

The packages that could matter (apache, libssl, certbot, iptables, systemd, python) had only security updates, nothing significant.

I upgraded Central to version 1.2.0. Now the following happens:

On plain http, when accessing static content, like version.txt, I get an immediate response. Good, I suppose that means server works, at least partially.
On plain http, when accessing the main page, background gray appears. After 60 seconds the rest of the page with a login window appears. The restore file returns a HTTP 504 Gateway time-out.
On https, for any request, i get a HTTP 500 Proxy Error: The proxy server could not handle the request. Reason: Error during SSL Handshake with remote server

Also, the docker-compose@central service is down, but I have no idea if it is relevant.

Postgresql database works. I can connect manually and perform queries.

2. What app or server are you using and on what device and operating system? Include version numbers.
First, Central 0.8.0, now 1.2.0
Standalone machine with Debian 10, kernel 4.19.0.
The network setup: There is another web server running concurrently on the same machine, so an Apache (2.4.38) reverse proxy is forwarding packets with the address survey.example.com to ports :4480 and :4443.
There is a Let's Encrypt certificate that was already used.

3. What you have you tried to fix the problem?
I upgraded Central to version 1.2.0. Because some settings have changed between versions, I upgraded by ditching my old working central folder and instead git-cloned the repository in a new folder.

I had to stop and delete the current nginx docker container, before starting a new one. Otherwise it would start but was "unhealthy".

My .env file looks like this:

# Use fully qualified domain names. Set to DOMAIN=local if SSL_TYPE=selfsign.
DOMAIN=survey.example.com

# Used for Let's Encrypt expiration emails and Enketo technical support emails
SYSADMIN_EMAIL=mail@example.com

# Options: letsencrypt, customssl, upstream, selfsign
SSL_TYPE=upstream

# Do not change if using SSL_TYPE=letsencrypt
HTTP_PORT=4480
HTTPS_PORT=4443

Having everything configured in .env, I could leave docker-compose.yml in its original state. (Thank you for this change.)

The relevant Apache config:

<VirtualHost *:80>
    ServerName survey.example.com
    ProxyPreserveHost On
    ProxyPass / http://localhost:4480/
    ProxyPassReverse / http://localhost:4480/
</VirtualHost>

<VirtualHost *:443>
    SSLProxyEngine On
    ServerName survey.example.com
    ProxyPreserveHost On
    SSLEngine On                # 
    SSLProxyVerify none         # These settings are 
    SSLProxyCheckPeerCN off     # supposed to help but
    SSLProxyCheckPeerName off   # seem to have no effect.
    SSLProxyCheckPeerExpire off # 
    ProxyPass / https://localhost:4443/
    ProxyPassReverse / https://localhost:4443/

    SSLCertificateFile /etc/letsencrypt/live/my.example.com/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/my.example.com/privkey.pem
    Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>

I tried various combinations of SSL_TYPE and HTTP ports (80 vs 4480 and 443 vs 4443) and this combination kind of worked the best. I'm not really sure what settings to choose. If I have a pre-existing certificate that is from Let's Encrypt, is it upstream or letsencrypt?

4. What steps can we take to reproduce the problem?
This is probably a configuration problem than a straight out error of ODK.
Because plain HTTP works, I think the problem is related to HTTPS misconfiguration. The timeouts can be explained by databases refusing non-secure connections, so fixing those should help with the timeouts as well (or so I hope).

If someone could give me any insight what could go wrong, i would really appreciate it.

5. Anything else we should know or have? If you have a test form or screenshots or logs, attach below.

Some log excerpts that may be useful:

Nginx

2021/06/08 10:16:04 [error] 14#14: *2101 upstream timed out (110: Connection timed out) while connecting to upstream, client: 172.19.0.1, server: , request: "GET /v1/sessions/restore HTTP/1.1", upstream: "http://172.19.0.7:8383/v1/sessions/restore", host: "survey.example.com", referrer: "http://survey.example.com/"

Service

starting cron..
using 4 worker(s) based on available memory (32894260)..
starting server.
unable to connect to ipc-file `naught.ipc`

removing the ipc-file and attempting to continue
(node:59) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
ConnectionError: Connection terminated due to connection timeout
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at processTicksAndRejections (internal/process/task_queues.js:85:5)ConnectionError: Connection terminated due to connection timeout
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at processTicksAndRejections (internal/process/task_queues.js:85:5)Failed to audit-log task success message!
ConnectionError: Connection terminated due to connection timeout
    at Object.createConnection (/usr/odk/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at processTicksAndRejections (internal/process/task_queues.js:85:5)
'{"did":"nothing","because":"no backup configured"}'

Enketo

13:48:20 0|enketo  | Worker 4397 ready for duty at port 8005! (environment: production)
13:50:30 0|enketo  | Error: Redis connection to enketo_redis_main:6379 failed - connect ETIMEDOUT 172.19.0.3:6379
13:50:30 0|enketo  |     at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1144:16) {
13:50:30 0|enketo  |   errno: 'ETIMEDOUT',
13:50:30 0|enketo  |   code: 'ETIMEDOUT',
13:50:30 0|enketo  |   syscall: 'connect',
13:50:30 0|enketo  |   address: '172.19.0.3',
13:50:30 0|enketo  |   port: 6379
13:50:30 0|enketo  | }

Redis Cache and Main

1:C 07 Jun 2021 17:48:58.506 # Configuration loaded
1:M 07 Jun 2021 17:48:58.507 * Running mode=standalone, port=6380.
1:M 07 Jun 2021 17:48:58.507 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 07 Jun 2021 17:48:58.507 # Server initialized
1:M 07 Jun 2021 17:48:58.507 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 07 Jun 2021 17:48:58.507 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 07 Jun 2021 17:48:58.507 * DB loaded from disk: 0.000 seconds
1:M 07 Jun 2021 17:48:58.507 * Ready to accept connections

Apache

[Tue Jun 08 12:23:24.357237 2021] [proxy:error] [pid 33513] [client 233.252.0.0:50800] AH00898: Error during SSL Handshake with remote server returned by /version.txt
[Tue Jun 08 12:23:24.357250 2021] [proxy_http:error] [pid 33513] [client 233.252.0.0:50800] AH01097: pass request body failed to [::1]:4443 (localhost) from 233.252.0.0 ()

Docker-compose via systemd

systemd[1]: Started central via docker-compose.
systemd[14217]: docker-compose@central.service: Changing to the requested working directory failed: No such file or directory
systemd[14217]: docker-compose@central.service: Failed at step CHDIR spawning /usr/bin/docker-compose: No such file or directory
systemd[1]: docker-compose@central.service: Main process exited, code=exited, status=200/CHDIR
systemd[1]: docker-compose@central.service: Failed with result 'exit-code'.
systemd[1]: docker-compose@central.service: Service RestartSec=100ms expired, scheduling restart.
systemd[1]: docker-compose@central.service: Scheduled restart job, restart counter is at 5.
systemd[1]: Stopped central via docker-compose.

docker ps

IMAGE                             COMMAND                  CREATED      STATUS                PORTS                 NAMES
central_service                   "docker-entrypoint..."   25 hours ago Up 20 hours           8383/tcp              central_service_1
central_enketo                    "docker-entrypoint..."   25 hours ago Up 20 hours           8005/tcp              central_enketo_1
itsissa/namshi-smtp:4.89-2.deb9u5 "/bin/entrypoint.s..."   25 hours ago Up 20 hours           25/tcp                central_mail_1
postgres:9.6                      "docker-entrypoint..."   6 days ago   Up 20 hours           5432/tcp              central_postgres_1
getodk/pyxform-http:v1.5.1        "gunicorn --bind 0..."   6 days ago   Up 20 hours                                 central_pyxform_1
redis:5                           "docker-entrypoint..."   6 days ago   Up 20 hours           6379/tcp              central_enketo_redis_cache_1
redis:5                           "docker-entrypoint..."   6 days ago   Up 20 hours           6379/tcp              central_enketo_redis_main_1
central_nginx                     "/bin/bash /script..."   20 hours ago Up 20 hours (healthy) 0.0.0.0:4480->80/tcp, central_nginx_1
      																						  0.0.0.0:4443->443/tcp

yanokwa · June 8, 2021, 4:02pm

Hi @jary! Thanks for laying the groundwork for the upstream SSL features we added to Central v1.2.

Your .env file looks fine to me. I have a very similar config with an nginx proxy routing traffic to two servers. The proxy's nginx config is embedded below. The only difference seems to be I route everything over HTTP, but then in Central's nginx config, we set X-Forwarded-Proto to HTTPS to https.

nginx.conf

server {
    server_name prod.central.example.com;
    location / {
            proxy_pass http://0.0.0.0:8080;
    }
    listen [::]:443 ssl;
    listen 443 ssl;
    ssl_certificate /etc/ssl/certs/example.com.crt;
    ssl_certificate_key /etc/ssl/certs/example.com.key;
    ssl_trusted_certificate /etc/ssl/certs/example.com.pem;
    ssl_protocols TLSv1.2 TLSv1.1 TLSv1;
    ssl_prefer_server_ciphers on;
    ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS";
    ssl_dhparam /etc/dh/nginx.pem;
}

server {
    server_name test.central.example.com
    location / {
            proxy_pass http://0.0.0.0:9080;
    }
    listen [::]:443 ssl;
    listen 443 ssl;
    ssl_certificate /etc/ssl/certs/example.com.crt;
    ssl_certificate_key /etc/ssl/certs/example.com.key;
    ssl_trusted_certificate /etc/ssl/certs/example.com.pem;
    ssl_protocols TLSv1.2 TLSv1.1 TLSv1;
    ssl_prefer_server_ciphers on;
    ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS";
    ssl_dhparam /etc/dh/nginx.pem;
}

server {
    if ($host = prod.central.example.com) {
        return 301 https://$host$request_uri;
    }
    if ($host = test.central.example.com) {
        return 301 https://$host$request_uri;
    }
    listen 80 ;
    listen [::]:80 ;
    server_name prod.central.example.com test.central.example.com;
    return 404;
}

What happens if you disable your on-machine firewall entirely as described at https://docs.getodk.org/central-upgrade/#upgrading-to-central-v0-9? iptables interacts with Docker in strange ways...

Is there any chance you are using an external database? Maybe with special characters in the password? If so, see Connection timeout after upgrading to Central v1.2 - #5 by Matthew_White

Are there any other changes to your ~/central directory?

yanokwa · June 9, 2021, 12:36pm

A post was merged into an existing topic: Enketo says "Could not connect with Server" on preview request

jary · June 9, 2021, 3:28pm

Hello @yanokwa! Thank you for your time. I'm glad that my previous problem was put to good use.

Disabling the firewall resulted in no change, there was still a Proxy Error.

The database is the same from the original Central installation (v 0.8.0). (By the way, it had 130 GB and still ran smoothly.)

In ~/central, I changed .env. I also added the certificates as prescribed here. That was a dead end, so I removed them, rebuilt, no change.

Maybe there is some misconfiguration in some of the docker containers that was taken from the previous system settings. What could I do to perform something as close as possible to a clean reinstall (i.e. preserving the data and access settings)? Some containers can be rm'd, like mail or nginx, but what about others?

jary · June 18, 2021, 11:24pm

Finally, we managed to get it work. It was a tough one, actually there were two layers of problems. Unfortunately, since I wasn't the one who came with the solution, I couldn't write it down accurately. Sorry, future troubleshooter.

First, there was the problem with a certificate causing proxy errors. Re-issuing the certificate may have solved that.

Also, the part in Apache config file pasted above was thrown out.

Now we have a new file /etc/apache2/sites-available/odk-ssl.conf. I know little about it, but it is based on default-ssl.conf. The contents (without comments) are:

<IfModule mod_ssl.c>
SSLStaplingCache "shmcb:logs/ssl_stapling(32768)"
  <VirtualHost survey.example.com:443>
    ServerAdmin webmaster@localhost
    ServerName survey.example.com
    DocumentRoot /var/www/html
	
    Protocols h2 http/1.1
    Header always set Strict-Transport-Security "max-age=63072000"
	
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
	
    SSLEngine on
	
    ProxyRequests   Off
    ProxyPass / http://127.0.0.1:4480/
    ProxyPassReverse / http://127.0.0.1:4480/
    
    <FilesMatch "\.(cgi|shtml|phtml|php)$">
      SSLOptions +StdEnvVars
    </FilesMatch>
    
    <Directory /usr/lib/cgi-bin>
      SSLOptions +StdEnvVars
    </Directory>
    
    Include /etc/letsencrypt/options-ssl-apache.conf
    SSLCertificateFile /etc/letsencrypt/live/survey.example.com/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/survey.example.com/privkey.pem
  </VirtualHost>
</IfModule>

This worked for me, it may or may not work for you. This file is also softlinked in the adjacent sites-enabled folder. Notice that separate certificate files for ODK are stored in /etc/letsencrypt/live/survey.example.com/.

After this the HTTP 500 Proxy Error went away. HTTPS was working, but now every database request timed out after 60 seconds and nothing happened. Enketo complained about a connection to Redis:

Error: Redis connection to enketo_redis_main:6379 failed - connect ETIMEDOUT 172.19.0.3:6379

The firewall was not the source of the problem. Finally, it was pinned down to the fact that something was broken. (No, sorry, we couldn't figure it out.)

What did help was simple:

docker-compose down
docker-compose up -d

Using docker-compose down is normally discouraged, as it causes detaching the database from the application. However, the database is not stored in a container, but in a separate entity called a volume and Docker does not delete volumes easily.*

It is a good idea to mark down the ID of the database volume before downing so you don't have to search for it.

Reattaching the database volume is actually easy.

docker-compose restart and off we go.

* Disclaimer: Of course you value your data and thus you did a backup because you know that things can go wrong, nothing is guaranteed and well-intended instructions from strangers on the internet may not be always right.