ODK Central recently gained support for S3. This was a very good first step towards general scalability, and in particular towards handling large files. Thank you!
The current implementation preserves the existing ingestion of files into the database, and adds a periodic (by default once per day) cron job that sends the blobs to S3 and removes them from the db.
The problem
There are several issues with this workflow:
- Cannot upload attachments larger than 1 GiB – this is not solvable within the current process. Well, you could chunk files into a separate table, just like TOAST does in the first place, but I don't think that's something anyone wants to do.
- Files larger than ~256 MiB fail to upload to S3, which stops the whole process – this is solvable within the current process, but probably not trivially.
- Database backups are inflated, especially when WALs are archived for point-in-time recovery. While one could time the periodic backups to happen soon after the blob upload, the WALs will still contain all of the attachment data, likely inflated compared to the actual file sizes as well.
- ODK Central backend memory usage – even with uploads effectively limited to around ~256 MiB, significant memory is required to keep the system running. We had to bump ours to 8 GiB, and I bet it could run on a lot less if it weren't for how uploads are handled right now. In particular, it seems like the entire attachment has to fit in RAM, and who knows how much that could balloon to if multiple uploads happen at once. While this looks like something that could be solved within the current workflow, it might take significant effort depending on how many code paths the attachment handling goes through.
- No support for resumable uploads, which means that if the connection is not stable, the attachment upload has to restart from zero after each disruption. This becomes more of a problem the larger the attachments are. Again, this could be implemented directly in Central + Collect, but not trivially.
Proposed solution
I'd like to propose a solution that would remove all of these problems and generally guarantee much better long-term scalability – presigned S3 upload URLs. A downside is that this would require changes to ODK Collect as well.
The idea is that instead of clients uploading files to Central, Central would use the S3 API to generate presigned upload URLs and return them to clients. Clients would upload straight to S3, and later tell Central that they are done uploading.
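As a rough sketch of that client-side flow (the endpoint paths and function below are hypothetical – Central has no such API today, and Collect would implement this in Java/Kotlin rather than TypeScript):

```typescript
// Hypothetical client-side flow: ask Central for a presigned URL,
// upload the file body directly to S3, then confirm with Central.
async function uploadAttachment(centralBase: string, name: string, file: Blob): Promise<void> {
  // 1. Ask Central for a presigned upload URL (made-up endpoint).
  const res = await fetch(`${centralBase}/attachments/${encodeURIComponent(name)}/upload-url`, {
    method: "POST",
  });
  const { url } = await res.json();

  // 2. PUT the bytes straight to S3 – Central never touches them.
  await fetch(url, { method: "PUT", body: file });

  // 3. Tell Central the upload is done so it can record/verify the object.
  await fetch(`${centralBase}/attachments/${encodeURIComponent(name)}/uploaded`, {
    method: "POST",
  });
}
```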
It's important to note that Central would still retain full ownership of the bucket – the presigned URLs include the upload path, and each URL only allows uploading to that specific path.
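On the Central side, generating such a URL is little more than one SDK call. Here is a minimal sketch using the AWS SDK for JavaScript v3 – the bucket name, expiry, and helper are placeholders, not Central's actual code:

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

// Placeholder client config; endpoint + forcePathStyle make this work
// against MinIO as well as AWS S3.
const s3 = new S3Client({
  region: "us-east-1",
  endpoint: process.env.S3_ENDPOINT,
  forcePathStyle: true,
});

// The returned URL only permits a PUT to this exact bucket/key, and only
// until it expires – the server keeps full ownership of the bucket.
async function createUploadUrl(key: string): Promise<string> {
  const command = new PutObjectCommand({ Bucket: "odk-blobs", Key: key });
  return getSignedUrl(s3, command, { expiresIn: 3600 });
}
```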
An added benefit is that S3 servers (both from cloud providers and open-source options like MinIO) are very well optimized for handling many concurrent uploads – it's their bread and butter. Not only will they have more stable memory usage, but the S3 upload API also supports resumable (multipart) uploads that could later be implemented in Collect. I bet there are Java/Kotlin SDKs for that.
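To show what resumability could look like, here is a hedged sketch of the server half using S3's multipart upload API: Central starts the upload and presigns one URL per part, so the client can retry or resume individual parts instead of the whole file. Names are placeholders, and the client half in Collect would use a Java/Kotlin SDK:

```typescript
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
} from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

// Start a multipart upload and presign one URL per part. Each part can be
// uploaded and retried independently; the client collects the ETag returned
// for every part so the server can finish with CompleteMultipartUpload.
async function createPartUploadUrls(
  s3: S3Client,
  bucket: string,
  key: string,
  partCount: number
): Promise<{ uploadId: string; partUrls: string[] }> {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );
  const partUrls = await Promise.all(
    Array.from({ length: partCount }, (_, i) =>
      getSignedUrl(
        s3,
        new UploadPartCommand({
          Bucket: bucket,
          Key: key,
          UploadId,
          PartNumber: i + 1, // S3 part numbers are 1-based
        }),
        { expiresIn: 3600 }
      )
    )
  );
  return { uploadId: UploadId!, partUrls };
}
```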
A fun digression
Fun fact: prior to S3 support, the maximum total size of attachments stored by a single instance of ODK Central (assuming default Postgres settings, though tweaking them would not change the fundamental problem) was around 8 TiB (2^32 * ~2000 bytes). Significant performance problems would show up much sooner, though.
The reason is how Postgres handles TOAST values:
- https://www.postgresql.org/message-id/A88C1E63-DB62-43C6-B8EC-E78D07574690%40ebureau.com
- https://www.postgresql.org/message-id/10802.1477521336%40sss.pgh.pa.us
- https://www.postgresql.org/docs/current/storage-toast.html
Rough outline:
- Postgres limits the size of a single row to 1/4 of a page – about 2 KiB.
- If a row is larger, the largest column value is moved to the TOAST table corresponding to the table the data is being inserted into. This is repeated until the row is under 2 KiB.
- When values are entered into the TOAST table, they are chunked so that every row of the TOAST table is under 2 KiB as well.
- The TOAST table uses the `oid` type for `chunk_id`. `oid` is a special data type that draws 4-byte unsigned ints from a global counter. That global counter not only serves all TOAST tables, but many other system tables as well.
- This means that a single TOAST table is limited to 2^32 rows. As each row is one ~2 KiB chunk, the total size that can be stored is around 2^32 * 2 KiB (see the arithmetic below).
- The global counter that provides ints for `oid` will almost certainly wrap around long before that, which means it will start handing out duplicate numbers, and Postgres will have to keep incrementing it until it finds a number that is not already used in the given TOAST table. This takes more and more time as the number space fills up.
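Putting the numbers together (the real TOAST chunk limit is slightly under 2000 bytes, which is where the rough figure above comes from; a clean 2 KiB keeps the arithmetic simple):

$$2^{32}\ \text{chunks} \times 2\ \text{KiB/chunk} = 2^{33}\ \text{KiB} = 2^{43}\ \text{B} = 8\ \text{TiB}$$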
Arguably this is a very poor design choice on Postgres's part. The Postgres docs explicitly state:
> The oid type is currently implemented as an unsigned four-byte integer. Therefore, it is not large enough to provide database-wide uniqueness in large databases, or even in large individual tables.
`oid` is supposed to be used in system tables, and while a TOAST table is technically a system table, its size is also bound to the size of the actual user data.