Inconsistent results running ODK Central v0.2 Beta

MatthewMac · October 9, 2018, 2:26pm

I'm running into some problems trying to get ODK Central up and running consistently.

If I set up Ubuntu from scratch, install Docker, clone and build Central, the server will invariably come up perfectly. As soon as I restart (either Docker or the OS), the central_nginx_1 container comes up unhealthy. One time (from scratch) central_nginx_1 initially came up unhealthy, then came up healthy after running docker-compose restart... but not after the next stop/up. Sometimes a docker-compose build will get it going, and sometimes making a minor change to .env or config.json.template does.

Am fairly new to Docker, so could be missing something obvious, but any pointers here would be really appreciated.

Matthew

OS: Ubuntu 16.04.5 LTS (x86_64)
Docker version: 18.06.1-ce
Docker compose: 1.20.0 (manual download as the latest version doesn't work with the Central .yaml file)

After installing Ubuntu and Docker...

odk@odk-central:~$ git clone https://github.com/opendatakit/central.git
odk@odk-central:~$ cd central 
odk@odk-central:~/central$ git submodule update -i

odk@odk-central:~/central$ vi .env
     SSL_TYPE=customssl
     DOMAIN=local
     SYSADMIN_EMAIL=<My email address>

odk@odk-central:~/central$ cp ../fullchain.pem files/local/customssl/
odk@odk-central:~/central$ cp ../privkey.pem files/local/customssl/

odk@odk-central:~/central$ docker-compose build
odk@odk-central:~/central$ docker-compose up

Starting central_postgres_1 ... done
Starting central_mail_1     ... done
Starting central_service_1  ... done
Starting central_nginx_1    ... done
Attaching to central_postgres_1, central_mail_1, central_service_1, central_nginx_1
postgres_1  | LOG:  database system was shut down at 2018-10-09 14:16:57 UTC
postgres_1  | LOG:  MultiXact member wraparound protections are now enabled
postgres_1  | LOG:  database system is ready to accept connections
postgres_1  | LOG:  autovacuum launcher started
postgres_1  | LOG:  incomplete startup packet
mail_1      | + sed -ri '
mail_1      |   s/^#?(dc_local_interfaces)=.*/\1='\''[0.0.0.0]:25 ; [::0]:25'\''/;
mail_1      |   s/^#?(dc_other_hostnames)=.*/\1='\'''\''/;
mail_1      |   s/^#?(dc_relay_nets)=.*/\1='\''172.18.0.3\/16'\''/;
mail_1      |   s/^#?(dc_eximconfig_configtype)=.*/\1='\''internet'\''/;
mail_1      | ' /etc/exim4/update-exim4.conf.conf
mail_1      | + update-exim4.conf -v
mail_1      | using non-split configuration scheme from /etc/exim4/exim4.conf.template
mail_1      |     1 LOG: MAIN
mail_1      |     1   exim 4.84_2 daemon started: pid=1, -q15m, listening for SMTP on port 25 (IPv6 and IPv4)
service_1   | wait-for-it.sh: waiting 15 seconds for postgres:5432
service_1   | wait-for-it.sh: postgres:5432 is available after 0 seconds
service_1   | running migrations..
service_1   | starting cron..
service_1   | starting server.
nginx_1     | writing a new nginx configuration file..
nginx_1     | Done with startup
nginx_1     | Run certbot
nginx_1     | + parse_domains
nginx_1     | + sed -n -e s&^\s*ssl_certificate_key\s*\/etc/letsencrypt/live/\(.*\)/privkey.pem;&\1&p /etc/nginx/conf.d/certbot.conf
nginx_1     | + xargs echo
nginx_1     | + sed -n -e s&^\s*ssl_certificate_key\s*\/etc/letsencrypt/live/\(.*\)/privkey.pem;&\1&p /etc/nginx/conf.d/odk.conf
nginx_1     | + xargs echo
nginx_1     | + auto_enable_configs
nginx_1     | + keyfiles_exist /etc/nginx/conf.d/certbot.conf
nginx_1     | + parse_keyfiles /etc/nginx/conf.d/certbot.conf
nginx_1     | + sed -n -e s&^\s*ssl_certificate_key\s*\(.*\);&\1&p /etc/nginx/conf.d/certbot.conf
nginx_1     | + return 0
nginx_1     | + [ conf = nokey ]
nginx_1     | + keyfiles_exist /etc/nginx/conf.d/odk.conf
nginx_1     | + parse_keyfiles /etc/nginx/conf.d/odk.conf
nginx_1     | + sed -n -e s&^\s*ssl_certificate_key\s*\(.*\);&\1&p /etc/nginx/conf.d/odk.conf
nginx_1     | + [ ! -f /etc/customssl/live/local/privkey.pem ]
nginx_1     | + return 0
nginx_1     | + [ conf = nokey ]
nginx_1     | + kill -HUP 16
nginx_1     | + set +x
nginx_1     | /scripts/entrypoint.sh: line 37:    16 Hangup                  nginx -g "daemon off;"
service_1   | unable to connect to ipc-file `naught.ipc`
service_1   |
service_1   | removing the ipc-file and attempting to continue


odk@odk-central:~/central$ docker-compose ps
   Name                     Command                   State                         Ports
---------------------------------------------------------------------------------------------------------------
central_mail_1       /bin/entrypoint.sh exim -b ...   Up               25/tcp
central_nginx_1      /bin/bash /scripts/odk-set ...   Up (unhealthy)   0.0.0.0:443->443/tcp, 0.0.0.0:80->80/tcp
central_postgres_1   docker-entrypoint.sh postgres    Up               5432/tcp
central_service_1    ./wait-for-it.sh postgres: ...   Up               8383/tcp
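
The (unhealthy) state in the listing above comes from the container's Docker health check, and it can be read directly rather than by scanning docker-compose ps. A minimal sketch, assuming the container name central_nginx_1 from the listing; the function name wait_healthy and its retry defaults are made up for illustration:

```shell
# Poll a container's health status until it reports "healthy" or we give up.
# Usage: wait_healthy <container> [tries] [delay_seconds]
wait_healthy() {
  container="$1"
  tries="${2:-30}"
  delay="${3:-2}"
  status=""
  i=0
  while [ "$i" -lt "$tries" ]; do
    # .State.Health.Status is starting, healthy or unhealthy
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "$container is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$container did not become healthy (last status: ${status:-unknown})" >&2
  return 1
}
```

After a docker-compose up -d, something like wait_healthy central_nginx_1 30 2 would then report whether this particular start was one of the good ones.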

issa · October 9, 2018, 9:48pm

So, you've run into something I've also noticed on my development environment. For whatever reason, when using a self-signed or custom SSL certificate, the nginx container will sometimes hang on startup. I'm still digging into why this happens, but my two suggestions in the meantime are:

1. This doesn't appear to be a problem when using the Letsencrypt path, which so far seems to be really resilient to all sorts of abuse on my part. I'd strongly recommend using that approach if at all possible.
2. As you note, the first boot seems to always work. So in the meantime, as a workaround, make sure you keep all your configuration on the host machine; if you run docker-compose rm nginx and then do build or up again, the first spinup will work for you.
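
The workaround in the second suggestion could be scripted roughly as follows. This is a sketch, not part of Central: recreate_nginx is a hypothetical helper, the COMPOSE variable exists only so the command can be swapped out, and the nginx service name is the one used in the thread.

```shell
# Command used to talk to Compose; override it to test without Docker.
COMPOSE="${COMPOSE:-docker-compose}"

# Throw away the nginx container and bring up a fresh one.
recreate_nginx() {
  $COMPOSE stop nginx &&      # rm only removes stopped containers
  $COMPOSE rm -f nginx &&     # -f skips the confirmation prompt
  $COMPOSE build nginx &&
  $COMPOSE up -d nginx        # -d starts it in the background
}
```

Run it from the central directory so that Compose picks up the project's .yaml file and .env.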

MatthewMac · October 10, 2018, 3:29pm

Hey Clint
Thanks - glad it's not just me that's seeing this.

The Letsencrypt route isn't really ideal for us, especially when we get pukka certs for everything else on our domain. I did try (without success) to disable some of the startup scripts from running, as it struck me as unnecessary to run Certbot and re-write the odk.conf file every time the nginx container is started.

Trashing the nginx container after every restart does seem quite drastic, and slow as it needs to rebuild the DH key every time. Even when I do, the first spin-up doesn't always work... or even the second. Hopefully then the little man in there rolling the dice gives in eventually and comes up healthy, but I never have the confidence to be sure.

Once it's up, it's all running really well. Really appreciate the work you're doing on this, and hope we can figure out our way over the initial bumps.

Matthew

issa · October 10, 2018, 8:46pm

Ha, I used to feel the same way about writing the odk.conf file every time until I burned an hour when I changed the conf and didn’t realize it wasn’t updating on the machine when I restarted it. It takes moments to write.

issa · October 11, 2018, 11:46pm

Hey Matthew, I took another look at this issue today. And of course, now that I'm trying to fix it again, the laws of stochasticity dictate that I find it hard to duplicate the problem.

But I do have suspicions and an experimental fix/workaround for you to try, if you've got time. It's available under the cxlt/25 branch, and so if you do a git fetch and git checkout cxlt/25 you should be able to switch to it. Then just docker-compose build nginx (and possibly blow away the nginx container for good measure) and we can see whether it makes a difference for you or not.

When I was trying to repro today pre-fix I got it to hang after maybe 7 or 8 tries; I gave it another 10 tries after the fix and I couldn't get it to fail. Hoping it helps for you too.
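
Counting failures by hand over many restarts gets tedious, so the kind of repro run described above could be automated along these lines. This is a sketch under assumptions: count_unhealthy_starts is a hypothetical helper, the container name central_nginx_1 comes from the docker-compose ps output earlier in the thread, and the COMPOSE and DOCKER variables exist only so the commands can be swapped out.

```shell
COMPOSE="${COMPOSE:-docker-compose}"
DOCKER="${DOCKER:-docker}"

# Restart nginx <runs> times and count how often it fails to become
# healthy within <wait_secs> seconds (default 60).
count_unhealthy_starts() {
  runs="$1"
  wait_secs="${2:-60}"
  failures=0
  run=0
  while [ "$run" -lt "$runs" ]; do
    run=$((run + 1))
    $COMPOSE restart nginx >/dev/null 2>&1
    waited=0
    status=starting
    while [ "$waited" -lt "$wait_secs" ]; do
      status=$($DOCKER inspect --format '{{.State.Health.Status}}' central_nginx_1 2>/dev/null)
      [ "$status" = "healthy" ] && break
      sleep 1
      waited=$((waited + 1))
    done
    [ "$status" = "healthy" ] || failures=$((failures + 1))
  done
  echo "$failures of $runs restarts did not come up healthy"
}
```

Running count_unhealthy_starts 10 before and after switching branches gives a rough comparison, though with an intermittent failure a handful of runs can only suggest, not prove, that a fix worked.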

MatthewMac · October 12, 2018, 4:48pm

Hi Clint - many thanks for cooking this. First few stop/starts it was all looking consistently good, but as soon as I bounced the server it then took three restarts to get nginx to come up healthy again.

Short-term I think we've improved the odds enough to know I can get the server to start, which is great so we can start testing. Looking ahead, the dummy server is probably a bit too much of a bodge for a permanent fix... Can't help feeling there's too many scripts running and dynamic config going on in the startup process - most of it seems for the benefit of LetsEncrypt that people might not be using. I'm pretty new to Docker, so sorry I can't help seriously with the dev yet.

Matthew

MatthewMac · October 12, 2018, 4:55pm

On a completely different topic, is there any way to delete a user once I've created one?
odk-cmd only supports create, promote and user-set-password. Can I put in a feature request for user-delete? Or can I hack a user out from the database with a psql command, although I'd need to figure out the credentials to do this.

Seems the email addresses are case-sensitive. I invited a colleague with capitalization on his email address, and found the invitations and password resets don't work with a mixed case address. Ended up having to create a 'duplicate' user - bizarrely with the same address, but all lower case.
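
One way to sidestep the mixed-case problem described above is to lowercase the address before creating the account. A small sketch; normalize_email is a hypothetical helper, not part of odk-cmd:

```shell
# Print the given email address in all lower case.
normalize_email() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}
```

The result can then be passed to whichever command or form is used to create the user.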

issa · October 12, 2018, 7:38pm

User deletion is on our shortlist of to-dos, but it's not available yet, nor is it currently planned for the upcoming release (sorry).

You can get into psql by doing docker-compose exec postgres psql -U odk

MatthewMac · October 16, 2018, 2:55pm

Thanks for the entry point into Postgres. I'm guessing you probably don't really want people doing this stuff but... I can see where to delete my duplicate user out of the users table. Should I also zap them from actors and actees too? Or am I just going to screw things up...?

issa · October 16, 2018, 3:32pm

I’m fairly certain that if you zap those three it’ll work just fine. The critical pair is users and actors, but it is probably smart to remove the actees record as well.

The most likely scenario for breakage is if the user in question has uploaded forms or other artifacts, but there are foreign key constraints in place that would trip if that is the case.
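
Before deleting anything by hand, it may help to look at the three tables and their foreign key constraints first. A sketch built on the psql command given earlier in the thread; describe_user_tables is a hypothetical helper, the COMPOSE variable exists only so the command can be swapped out, and \d is psql's standard describe command:

```shell
COMPOSE="${COMPOSE:-docker-compose}"

# Show the columns, indexes and foreign keys of the tables involved
# in a user record.
describe_user_tables() {
  for table in users actors actees; do
    $COMPOSE exec postgres psql -U odk -c "\\d $table"
  done
}
```

The output lists which tables reference each of the three, which shows where a manual delete would be blocked by a constraint.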

issa · October 17, 2018, 12:09am

Alright, inspired by your comment that we're running a lot of things you don't need, I put together a second take on fixing the nginx problem; this time we just don't even run the parent scripts at all and we directly start nginx if you are doing self-signed or custom ssl.

You can find the branch at cxlt/25-2. Hopefully it works better this time!
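
The idea behind this second fix, as described above, is a branch on the SSL type: skip the certbot wrapper scripts and start nginx directly when no Let's Encrypt handling is needed. The following only illustrates that decision and is not the code on the cxlt/25-2 branch. startup_command is a hypothetical name, selfsign is an assumed value for the self-signed setting (customssl is the value from the .env shown earlier), and the two commands are taken from the log output earlier in the thread.

```shell
# Print the command that should start nginx for a given SSL_TYPE.
startup_command() {
  case "$1" in
    customssl|selfsign)
      # Certificates are already on disk: no certbot, start nginx directly.
      printf '%s\n' 'nginx -g "daemon off;"'
      ;;
    *)
      # Anything else goes through the parent certbot-aware entrypoint.
      printf '%s\n' '/bin/bash /scripts/entrypoint.sh'
      ;;
  esac
}
```

With fewer moving parts on the custom certificate path, there is less that can hang during startup, which would also fit with the faster startup reported in the next reply.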

MatthewMac · October 17, 2018, 11:46am

Hey Clint - I think you did it here. Pulled in the branch, rm nginx, build and up and now the server comes up every time. Tried a number of restarts and reboots and can't get the thing not to start. And as a bonus (although I haven't timed it) I'd say the server comes up in about half the time too. Great work!

issa · October 17, 2018, 5:09pm

Oh wonderful, that's great to hear. One issue down. I'll file a PR etc soon here.
