1. What is the issue? Please be detailed.
When trying to upload forms with attachments larger than 1 GiB, Postgres ends up dropping the connection.
2. What steps can we take to reproduce this issue?
- Set up self hosted ODK Central
- Raise the default nginx body size limit (`client_max_body_size`) from 100m to e.g. 2g
- Upload large files (via ODK Collect; Enketo won't allow you to)
3. What have you tried to fix the issue?
We ended up chunking files so that individual attachments stay under the limit (rough sketch below).
4. Upload any forms or screenshots you can share publicly below.
Any form with a file input will do.
More info: the most likely cause is that a single field in Postgres can store at most 1 GiB of data: https://www.postgresql.org/docs/current/limits.html
Using S3 does not resolve this, because files are still uploaded to the DB first and only pushed to S3 on a schedule afterwards.
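For anyone curious, here is roughly what the chunking looks like (a simplified Node.js/TypeScript sketch, not our actual implementation; all names are illustrative):

```typescript
// Simplified chunking sketch: splits a file into parts below PostgreSQL's
// 1 GiB per-field limit so each part can be attached as its own file field.
// Write backpressure handling is omitted for brevity.
import { createReadStream, createWriteStream } from "node:fs";

const CHUNK_SIZE = 512 * 1024 * 1024; // 512 MiB, comfortably under 1 GiB

async function chunkFile(inputPath: string): Promise<string[]> {
  const parts: string[] = [];
  let index = 0;
  let written = 0;

  const nextPart = () => {
    const name = `${inputPath}.part${index++}`;
    parts.push(name);
    written = 0;
    return createWriteStream(name);
  };

  let out = nextPart();
  for await (const data of createReadStream(inputPath)) {
    let buf = data as Buffer;
    // Roll over to a new part whenever the current one would overflow.
    while (written + buf.length > CHUNK_SIZE) {
      const room = CHUNK_SIZE - written;
      out.write(buf.subarray(0, room));
      out.end();
      buf = buf.subarray(room);
      out = nextPart();
    }
    out.write(buf);
    written += buf.length;
  }
  out.end();
  return parts; // each part becomes a separate attachment in the submission
}
```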
Great analysis!
With the current implementation, I'm not sure this would be easily solvable.
As you say, the file is first uploaded to the DB, then synced to S3.
This is done to maintain compatibility and avoid forcing S3 on users.
Perhaps you could work around this by moving the submission attachment upload outside of the ODK workflow?
- Create a form with a text field `s3_attachment_url`
- Handle the multipart upload elsewhere (if you are writing an app that wraps ODK, this is easy; if you are using ODK Collect, perhaps you could create a question that links to an uploader webpage?)
- Get the URL for the uploaded file
- Insert it into the `s3_attachment_url` form question
Not a perfect workflow by any means, but it could be an option. A rough sketch of the upload step follows.
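Something like this, using the AWS SDK v3 multipart uploader (the bucket, region, key, and returned URL format are placeholders; none of this touches ODK Central itself):

```typescript
import { createReadStream } from "node:fs";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

const s3 = new S3Client({ region: "us-east-1" });

// Streams the file to S3 as a multipart upload, then returns the object URL
// that gets pasted into the form's s3_attachment_url text field.
async function uploadAttachment(path: string, key: string): Promise<string> {
  await new Upload({
    client: s3,
    params: {
      Bucket: "my-attachments-bucket", // placeholder bucket
      Key: key,
      Body: createReadStream(path),
    },
  }).done();
  return `https://my-attachments-bucket.s3.us-east-1.amazonaws.com/${key}`;
}
```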
Also, as another possible future solution: perhaps this could be supplemented by a separate 'external S3 attachments' workflow using Web Forms?
- Add a getPresignedUploadUrl method to ODK Central backend.
- Create a question type that is an upload field in Web Forms.
- When the user uploads a file for the question, generate a pre-signed upload URL on the backend, and allow upload directly to the bucket (see the sketch after this list).
- Insert the relevant entries into the ODK Central database on submission (minus the blob data, with `status=uploaded`)
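A minimal sketch of what the `getPresignedUploadUrl` piece could look like, assuming an S3-compatible store and the AWS SDK v3 presigner (nothing here is an existing Central API; the bucket and expiry are placeholders):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// The client PUTs the file body straight to the returned URL, so the
// attachment never passes through Central's database.
async function getPresignedUploadUrl(key: string): Promise<string> {
  return getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: "central-blobs", Key: key }), // placeholder bucket
    { expiresIn: 15 * 60 }, // URL valid for 15 minutes
  );
}
```

Web Forms would then upload with something like `fetch(url, { method: "PUT", body: file })`, and on submission Central would only record the metadata row with `status=uploaded`.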
(I was going to make a separate thread requesting either (1) a configurable S3 upload schedule or (2) an endpoint to trigger the S3 uploads. However, I think I prefer the approach of uploading directly to the bucket, as it solves many issues at once.)
Thanks!
As the files are coming from an external app we control, we just made the app chunk the files and save each chunk into a separate file field in the form. We're more or less covered for our current use case (apart from ballooning DB backups, since we use WAL archival).
I'm also leaning towards presigned URLs; creating those two issues was mostly about setting the stage for the proposal. Here it is: Proposal: better large attachments handling via presigned upload urls for S3