PostgreSQL High CPU Usage Issue on ODK Central

1. What is the issue? Please be detailed.

For the past couple of days, our server has been experiencing persistent 100% CPU utilization, primarily caused by PostgreSQL checkpointer processes. This high CPU usage continues for several hours, after which I have to restart the containers to bring the system back to normal. The server then remains stable for a few hours, but eventually, the CPU usage spikes again to 100%, and RAM usage also increases significantly.

I recently upgraded ODK Central from v2024.1.0 to v2024.2.1 to enable S3 feature support. However, the issue has only started appearing over the past few days, even though the upgrade was done a couple of weeks ago.

System Details
Server Specs: 2 CPU cores and 4 GB RAM (3917 MB total, with 824 MB available)

Disk: 60 GB total, 33 GB available

Swap: 1.1 GB of 4 GB in use

Current PostgreSQL Configuration (default)
shared_buffers: 128MB (16384 × 8kB)

checkpoint_timeout: 300s (5 minutes)

max_wal_size: 1024MB

min_wal_size: 80MB

checkpoint_completion_target: 0.9

I have not modified any of these settings.
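For anyone who wants to check the same values on their own instance, they can be read from pg_settings; something like this should work (column names as in PostgreSQL 14, which I believe is what this version of Central ships):

```sql
-- Read the checkpoint-related settings and their units
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'checkpoint_timeout', 'max_wal_size',
               'min_wal_size', 'checkpoint_completion_target');
```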

3. What have you tried to fix the issue?

Observed Symptoms:

PostgreSQL checkpointer processes consistently consume ~100% CPU each

Multiple checkpointer processes run simultaneously

Swap usage increases (1.1 GB of 4 GB used)

The issue persists even though the application appears to function normally

I haven’t made any changes or optimizations yet to resolve this.
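If it helps with diagnosis, this is the kind of query I plan to run inside the database the next time a spike happens, to see which backend types are actually busy (backend_type is available on PostgreSQL 10 and later):

```sql
-- Show what each PostgreSQL backend is doing during a spike
SELECT pid, backend_type, state, wait_event_type, wait_event,
       now() - query_start AS running_for,
       left(query, 80) AS query_snippet
FROM pg_stat_activity
ORDER BY backend_type, query_start;
```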

4. Upload any forms or screenshots you can share publicly below.

I have attached a couple of screenshots below.

Have you started to receive a lot of submissions or blobs in the last few days? What do your container logs say when you get these spikes?


Yes, we've been receiving a high volume of submissions lately, especially those containing blobs. However, the issue has only started appearing over the past couple of days. The first thing I checked was the logs for both the Postgres and service containers, but everything seemed normal—nothing unusual stood out.

The blobs are written to the database first and then removed once they are migrated to S3. Those inserts and deletes generate WAL, and so do the vacuums (including autovacuums) that clean up the resulting dead tuples in heap pages.
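One way to confirm that checkpoints are the pressure point is to look at the checkpoint counters; on PostgreSQL 14 they live in pg_stat_bgwriter (newer releases split them out into pg_stat_checkpointer). A high checkpoints_req relative to checkpoints_timed means checkpoints are being forced by WAL volume rather than the 5-minute timeout:

```sql
-- Compare timed vs. requested (WAL-volume-forced) checkpoints
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,  -- ms spent writing checkpoint buffers
       checkpoint_sync_time,   -- ms spent fsyncing them
       buffers_checkpoint,
       stats_reset
FROM pg_stat_bgwriter;
```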

Try setting up a test environment and experiment with tweaking checkpoint frequency, enabling WAL compression, or using UNLOGGED on the blobs table.
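I haven't tested this on Central specifically, but on a test copy the experiment could look roughly like the following (the table name blobs and the specific values are only illustrative; keep in mind that an UNLOGGED table is truncated after a crash and is not replicated, which is only tolerable here because the blobs are transient until they move to S3):

```sql
-- Checkpoint less often and allow more WAL to accumulate between checkpoints
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '4GB';

-- Compress full-page images written to WAL (boolean setting on PostgreSQL 14)
ALTER SYSTEM SET wal_compression = on;

-- Reload the configuration without restarting the server
SELECT pg_reload_conf();

-- Skip WAL for the blob staging table entirely.
-- WARNING: an unlogged table is emptied after a crash and is not replicated.
ALTER TABLE blobs SET UNLOGGED;
```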

Related: Proposal: better large attachments handling via presigned upload urls for S3 - #3 by LN

Thanks @yanokwa for the suggestions, I'll try them out in a test environment.

Can you share some resources that would help me set up the changes you suggested, especially using UNLOGGED on the blobs table?