Proposal: better handling of large attachments via presigned S3 upload URLs

ODK Central recently gained support for S3. This was a very good first step towards general scalability, and in particular towards handling large files. Thank you!

The current implementation keeps the existing ingestion of files into the database, and adds a periodic cron job (by default once per day) that sends the blobs to S3 and removes them from the database.

The problem

There are several issues with this workflow:

  1. Attachments larger than 1 GiB cannot be uploaded at all – this is not solvable within the current process. Well, you could chunk files into a separate table, just like TOAST does in the first place, but I don't think that's something anyone wants to do.
  2. Files larger than ~256 MiB fail to upload to S3, stopping the whole process – this is solvable within the current process, but probably not trivially.
  3. Database backups are inflated, especially when WALs are archived for point-in-time recovery. While one could time the periodic backups to happen soon after the blobs are uploaded, the WALs will still contain all of the attachment data, likely inflated compared to the actual file sizes as well.
  4. ODK Central backend memory usage – even with uploads effectively limited to around ~256 MiB, significant memory is required to keep the system running. We had to bump ours to 8 GiB, and I bet it could run on a lot less if it weren't for how uploads are handled right now. In particular it seems like the entire attachment has to fit in RAM, and who knows how much that could balloon to if there are multiple uploads happening at once. While this looks like something that could be solved within the current workflow, it might take significant effort depending on how many code paths the attachment handling goes through.
  5. No support for resumable uploads, which means that if the internet connection is not stable the attachment upload has to restart from zero after each disruption. This is more of a problem the larger the attachments are. Again, this could be implemented directly in Central + Collect, but again it is not trivial.

Proposed solution

I'd like to propose a solution that would remove all of those problems and generally guarantee much better long-term scalability – presigned S3 upload URLs. A downside is that this would require changes to ODK Collect as well.

The idea is that instead of clients uploading files to Central, Central would use the S3 API to generate presigned upload URLs and return them to clients. Clients would upload straight to S3, and then tell Central once they are done uploading.
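To make this concrete, here is a minimal sketch of what the Central side could look like, using the AWS SDK for JavaScript v3. The function name, bucket name, key layout and expiry are all made up for illustration, not a claim about how Central would actually structure this:

```typescript
// Sketch only: generate a presigned PUT URL scoped to a single object key.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({
  region: "us-east-1",
  endpoint: process.env.S3_ENDPOINT, // also works with MinIO and other S3-compatible servers
});

// Central would call something like this when a client asks to upload an attachment.
export async function presignAttachmentUpload(blobId: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: "odk-attachments", // hypothetical bucket name
    Key: `blobs/${blobId}`,    // the single path this URL is locked to
  });
  // The returned URL allows exactly one operation (PUT) on exactly one key,
  // and expires after the given number of seconds.
  return getSignedUrl(s3, command, { expiresIn: 15 * 60 });
}
```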

It's important to note that Central would still retain full ownership of the bucket – the presigned URLs include the upload path, and each URL only allows uploading to that specific path.

An added benefit is that S3 servers (both from cloud providers and open source options like MinIO) are very well optimized for handling many concurrent uploads – it's their bread and butter. Not only do they have more stable memory usage, but the S3 upload API also supports multipart (resumable) uploads that could later be implemented in Collect. I bet there are Java/Kotlin SDKs for that.
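For completeness, a rough sketch of the client side under the same assumptions: the client PUTs the file body to the URL it received from Central, then notifies Central. (A resumable flow would instead use the S3 multipart upload API, which is more involved and not shown here; the endpoint paths below are hypothetical.)

```typescript
// Sketch: upload a file directly to S3 using a presigned URL obtained from Central,
// then tell Central the upload is finished. Endpoint paths are made up.
async function uploadAttachment(file: Blob, blobId: string): Promise<void> {
  // 1. Ask Central for a presigned URL (hypothetical endpoint).
  const { url } = await (await fetch(`/v1/attachments/${blobId}/presign`)).json();

  // 2. Upload the bytes straight to S3. Central never sees the file contents.
  const res = await fetch(url, { method: "PUT", body: file });
  if (!res.ok) throw new Error(`S3 upload failed: ${res.status}`);

  // 3. Let Central know the object now exists so it can mark the attachment as present.
  await fetch(`/v1/attachments/${blobId}/uploaded`, { method: "POST" });
}
```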


A fun digression

Fun fact: prior to S3 support, the maximum total size of attachments stored by a single instance of ODK Central (assuming default Postgres settings; tweaking them would not change the fundamental problem) was around 8 TiB (2^32 * ~2000 bytes). Significant performance problems would show up much sooner, though.

The reason is how Postgres handles TOAST values.

Rough outline:

  1. Postgres tries to keep a single row under 1/4 of a page – 2 KiB with the default 8 KiB page size.
  2. If a row is larger, the largest cell value is taken and moved to the TOAST table corresponding to the table the data is being inserted into. This is repeated until the row is under 2 KiB.
  3. When values are entered into the TOAST table, they are chunked so that every row in the TOAST table is under 2 KiB as well.
  4. The TOAST table uses the oid type for chunk_id. oid is a special data type that sources 4-byte unsigned ints from a global counter. That global counter not only services all TOAST tables, but many other system tables as well.
  5. This means that the number of rows in a single TOAST table is limited to 2^32. As each row is one ~2 KiB chunk, the total size that can be stored is around 2^32 * 2 KiB.
  6. The global counter that provides ints for oid will almost certainly wrap around well before then, which means it will start producing duplicate numbers, so Postgres has to keep incrementing it until it finds a number that is not already used in the given TOAST table. This takes more and more time as the number space fills up.
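To make the numbers above tangible, here is a small sketch (using node-postgres; the blobs table name is taken from the discussion above, everything else is illustrative) that looks up the TOAST table backing blobs and reports how many ~2 KiB chunks it currently holds and how large it is on disk:

```typescript
// Sketch: inspect the TOAST table behind Central's blobs table to see how many
// chunks it holds – i.e. how far along the ceiling estimated above it has got.
import { Client } from "pg";

const client = new Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

// Find the TOAST table that backs "blobs" and its on-disk size.
const { rows } = await client.query(
  `SELECT reltoastrelid::regclass AS toast_table,
          pg_size_pretty(pg_relation_size(reltoastrelid::regclass)) AS toast_size
     FROM pg_class
    WHERE relname = 'blobs'`
);
const toastTable = rows[0].toast_table as string; // e.g. pg_toast.pg_toast_16453

// Each row in the TOAST table is one chunk of at most ~2000 bytes.
const { rows: [chunkRow] } = await client.query(
  `SELECT count(*) AS chunks FROM ${toastTable}`
);
console.log(`${toastTable}: ${chunkRow.chunks} chunks, ${rows[0].toast_size} on disk`);

await client.end();
```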

Arguably this is a very poor design choice in Postgres. The Postgres docs explicitly state:

The oid type is currently implemented as an unsigned four-byte integer. Therefore, it is not large enough to provide database-wide uniqueness in large databases, or even in large individual tables.

It's meant to be used in system tables, and while a TOAST table is technically a system table, its size is also bound to the size of the actual data.


Love this idea :+1:

As this would only affect uploads, as far as I can tell it wouldn't cause any backward incompatibility for downloads (the current behaviour would be kept).

Providing an upload URL for the dynamically generated S3 path should be easy enough.

If the proposal is accepted I would be happy to assist with implementation if it helps out (in a few months when I have more time).


Thanks for taking the time to write this up in detail, @mlazowik!

For context, I want to crosslink to an earlier post where you described the media you're collecting:

@spwoodcock what do you see as the biggest benefit of this approach for your needs? I don't think you have very large files, right?

I realize this is a bit of an unusual architecture! Here are some of the things that led us down that path:

  • Typical binaries that we are aware of are quite small. We've seen videos up to several dozen megabytes, but attachments larger than that have not been a target use case. We'll make sure to document this assumption more clearly, and now that you've raised the issue, I'm sure that if others have this need they will chime in here.
  • Desire to keep the changes needed isolated to the server. We currently maintain several clients with a lot on their roadmaps and don't want to take on additional complexity there.
  • Desire to reduce the impact of issues with the connection to the storage service. Because everything gets submitted to Central first, client users don't have to deal with error messages from both Central and the storage service. If something goes wrong with the connection to the storage service, only the Central administrator is affected. Because binaries continue to be ingested in the database, there is minimal time pressure to address issues related to storage service downtime (not an issue with S3 but could be with different systems).

Absolutely. We haven't considered this a major issue because, as I mentioned above, very large files have not so far been a target use case. That said, are you expecting to support point-in-time recovery for the entire duration of your project? I'd generally expect a limited window for storing WALs, perhaps augmented with periodic snapshots. You may also want to look into the UNLOGGED directive, which was introduced a few Postgres versions ago; it could be appropriate for the blobs table when using S3.
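A minimal sketch of what that could look like, assuming the table is literally named blobs. Note that unlogged tables are not crash-safe and are not replicated, so this only makes sense if S3 (or a backup) is treated as the source of truth for blob data:

```typescript
// Sketch: one-off maintenance script that stops the blobs table from being
// written to the WAL. Unlogged tables are truncated after a crash and are not
// replicated, so only do this if the blob data is recoverable from elsewhere.
import { Client } from "pg";

const client = new Client({ connectionString: process.env.DATABASE_URL });
await client.connect();
await client.query("ALTER TABLE blobs SET UNLOGGED;");
await client.end();
```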

That sounds right to me. We'll discuss whether there might be a straightforward path to not having to keep entire files in memory. Like you said, this has implications for projects with small media files being concurrently submitted (much more typical usage).

Yes, agreed that this would be really nice to have.

While I completely see the appeal of this approach, it's not something that we are open to considering for the core at this time. It doesn't fit thematically with our current priorities (see the roadmap) and implies future maintenance we don't want to take on right now.

That said, those priorities may change over time, especially if we learn of other projects with large file needs.

You could consider doing this work in forks. If it ends up looking clean and maintainable with limited risk, it could shift our thinking on taking it on (but no guarantees).

Interesting! Software is hard!

Really great write-up – you make lots of important points, @LN!

Regarding the main benefit of this proposal to me: it's not actually about file size, but about having the media available immediately in S3 instead of needing a sync.

I overrode the cron from 24hr --> 15min on my setup, but it's still not ideal (users or managers may want to view submission photos immediately).

There are two approaches to solve this that I thought of:

  • A webhook on submission that triggers the S3 sync (not a huge fan of this).
  • Direct S3 upload via pre-signing, as outlined in the proposal.

While I agree with your points, I was thinking: could we just make the presign endpoint available as an option, but not necessarily a recommended one? The main problem with doing this is that we can't guarantee the file is uploaded by the user, though (or know when it fails).
