External storage for blobs in ODK Central

Currently, attachments to submissions and forms are stored as blobs in a database table. It would be great if these could be stored on a network file system or, better still, in a managed storage service such as Google Cloud Storage or Amazon S3.

1. What is the general goal of the feature?
Reduce the database size, which would greatly speed up backup and restore operations.

2. What are some example use cases for this feature?
At the moment, my self-hosted Central database is around 70 GB. The nightly export I have scheduled takes a couple of hours, and a restore after a disaster would probably take a similar amount of time. I only have experience with Google Cloud Storage buckets, but if the blobs were stored there, they would be backed up and automatically versioned by the cloud provider. It would also be possible to access the files by other means, such as authenticated URLs or short-lived public URLs.

3. What can you contribute to making this feature a reality?
Coding (given enough guidance) and testing for the Google Cloud Storage implementation.

Hi @punkch! This feature is on the roadmap, and we believe that moving blobs out of the database would have several benefits. However, we also think that implementing this feature would be a large effort, requiring changes not just to the API server, but also to the frontend and our Docker setup. We're working hard on the upcoming v1.4 release and don't currently have the capacity to provide guidance about this feature or to review a PR. That said, we are interested in adding this at some point, so please feel free to check back about this feature in the future. We'd also be interested to hear from others who would find this feature useful.

Also on the roadmap is a separate (though perhaps related) feature about expanding access to Form Attachments and Submission Attachments. For example, see:


We've started some design work on this, led by @alxndrsn.

Our current thinking is to use the S3 API, which would work with many blob storage solutions, such as MinIO for local deployments or Google Cloud Storage.
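To illustrate why targeting the S3 API keeps deployments interchangeable, here is a minimal sketch in which the only thing that differs between a local MinIO and Google Cloud Storage is the endpoint. The `S3Target` class, the bucket name, and the object key are hypothetical, and the endpoints are assumptions (MinIO's usual default port and the GCS endpoint for its S3-interoperable XML API). This is not how Central is or will be implemented.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class S3Target:
    """Hypothetical settings for one S3-compatible blob store."""

    endpoint: str  # base URL of the S3-compatible service
    bucket: str    # bucket that would hold the attachment blobs

    def object_url(self, key: str) -> str:
        # Path-style addressing: <endpoint>/<bucket>/<key>.
        # quote() keeps "/" so nested keys stay readable.
        return f"{self.endpoint.rstrip('/')}/{self.bucket}/{quote(key)}"


# Assumed endpoints; only this value changes between deployments.
minio = S3Target("http://localhost:9000", "central-blobs")
gcs = S3Target("https://storage.googleapis.com", "central-blobs")

key = "blobs/3f/3fa9c1.jpg"  # hypothetical object key
print(minio.object_url(key))
print(gcs.object_url(key))
```

A real client would also need credentials and request signing, which an S3 client library would normally handle.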

Some preliminary questions for those interested in this (@abhinav_sinha, @ravish_mallya, @Saad, @TobiasMcNulty):

  • What are your expectations around automated migration? Is it a step you could perform and validate yourself if we provide the necessary APIs? If not, would it be OK for this migration to be a blocking part of bringing a server back up (it could take a while)? And/or would you be open to existing blobs staying in the DB and new blobs being written to storage?
  • Would it be acceptable for there to be no path back to storing blobs in the DB? In other words, if you configure external storage, you wouldn't be able to change your mind later and go back to storing blobs in the DB.
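To make the "existing blobs stay in the DB, new blobs go to storage" option concrete, here is a rough sketch. Everything in it is hypothetical: plain dicts stand in for the database table and the bucket, and blobs are assumed to be keyed by their SHA-256 hash. It shows new writes going to external storage, reads falling back to the database, and an optional batch migration that validates each copy before deleting the original.

```python
import hashlib


class HybridBlobStore:
    """Hypothetical store: legacy blobs in the DB, new blobs external."""

    def __init__(self):
        self.db = {}        # stand-in for the blob table: sha256 -> bytes
        self.external = {}  # stand-in for an S3-compatible bucket

    def put(self, content: bytes) -> str:
        # New blobs are only ever written to external storage.
        key = hashlib.sha256(content).hexdigest()
        self.external[key] = content
        return key

    def get(self, key: str) -> bytes:
        # Prefer external storage, fall back to not-yet-migrated DB blobs.
        if key in self.external:
            return self.external[key]
        return self.db[key]

    def migrate_batch(self, limit: int) -> int:
        """Move up to `limit` legacy blobs; return how many were moved."""
        moved = 0
        for key in list(self.db)[:limit]:
            self.external[key] = self.db[key]
            # Validate the copy before removing the original.
            if hashlib.sha256(self.external[key]).hexdigest() != key:
                raise ValueError(f"checksum mismatch for {key}")
            del self.db[key]
            moved += 1
        return moved


store = HybridBlobStore()
legacy = b"old attachment"
legacy_key = hashlib.sha256(legacy).hexdigest()
store.db[legacy_key] = legacy            # blob that predates external storage
new_key = store.put(b"new attachment")   # blob written after the switch

print(store.get(legacy_key) == legacy)   # readable before migration
print(store.migrate_batch(limit=100))    # number of legacy blobs moved
```

Because `migrate_batch` works in small batches, a migration along these lines would not have to block server startup. Note that this sketch only moves data one way, which matches the "no path back" scenario in the second question.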

Hi,
Sorry if this question is a bit off-topic, but I thought it might be related to some extent. I'm wondering whether it would be possible to implement scheduled exports from ODK Central into Google Cloud Storage, perhaps just using cron, and ideally integrated with the ODK Central web GUI. For example, to have (incremental) submission CSV files with the attachments in a subfolder. Ideally, it would also be possible to specify the GCS bucket location and authentication on a per-form basis.

It seems this is possible using the ODK Central API or the OData API, but it requires significant logic to manage the incremental updates (e.g., using date filters, entry status, etc.), whereas ODK Central may already be tracking this in its internal DB. Since this would benefit many ODK users, it may be of interest as a general feature improvement. Based on this thread, it also seems that at least part of this topic is on the ODK roadmap?
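As a rough sketch of the date-filter approach, the snippet below builds an OData URL that asks only for submissions received after a stored watermark. It assumes Central's OData `$filter` on `__system/submissionDate` and the `/v1/projects/{id}/forms/{formId}.svc/Submissions` path; the server URL, project ID, and form ID are placeholders, and the helper names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode


def incremental_odata_url(base_url, project_id, xml_form_id, since):
    """Build an OData URL for submissions received after `since`.

    `since` must be a timezone-aware datetime (the stored watermark).
    """
    # Millisecond-precision UTC timestamp, e.g. 2024-01-01T00:00:00.000Z
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    query = urlencode(
        {"$filter": f"__system/submissionDate gt {stamp}"},
        quote_via=quote,
        safe="$/:",
    )
    form = quote(xml_form_id, safe="")
    return f"{base_url.rstrip('/')}/v1/projects/{project_id}/forms/{form}.svc/Submissions?{query}"


def next_watermark(previous, submission_dates):
    """Return the newest timestamp seen so far, to store for the next run."""
    return max([previous, *submission_dates])


since = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(incremental_odata_url("https://central.example.org", 1, "my_form", since))
```

A cron job could call this on each run, upload the results to a bucket, and then persist `next_watermark(...)`. Filtering on submission date alone would not pick up later edits to existing submissions, which is part of the extra logic mentioned above.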

What do you think about this?

Thank you in advance

@LN, apologies for the late reply. Please find my answers below:
Point 1: I wouldn't mind if existing blobs stayed in the DB and newer blobs were written to S3.
Point 2: I would hope the external storage part is decoupled in such a way that I could change the desired storage for newer submissions with a configuration option.
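To illustrate what I mean in Point 2, here is a minimal sketch in which a single configuration value decides where new blobs are written, while each blob remembers where it lives so older ones stay readable. All names here (`BlobBackend`, `InMemoryBackend`, `BlobRouter`, `write_to`) are hypothetical, and the in-memory backends only stand in for the database and an S3 bucket.

```python
from typing import Dict, Protocol


class BlobBackend(Protocol):
    def write(self, key: str, content: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for a real backend (database table or S3 bucket)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def write(self, key: str, content: bytes) -> None:
        self.objects[key] = content

    def read(self, key: str) -> bytes:
        return self.objects[key]


class BlobRouter:
    """Writes go to the configured backend; reads follow recorded location."""

    def __init__(self, backends: Dict[str, BlobBackend], write_to: str) -> None:
        self.backends = backends
        self.write_to = write_to             # the configuration option
        self.locations: Dict[str, str] = {}  # blob key -> backend name

    def write(self, key: str, content: bytes) -> None:
        self.backends[self.write_to].write(key, content)
        self.locations[key] = self.write_to

    def read(self, key: str) -> bytes:
        return self.backends[self.locations[key]].read(key)


router = BlobRouter({"database": InMemoryBackend(), "s3": InMemoryBackend()}, write_to="database")
router.write("a.jpg", b"first")
router.write_to = "s3"            # configuration changed for newer submissions
router.write("b.jpg", b"second")
print(router.locations)
```

Switching `write_to` back to `"database"` later would work the same way, since nothing is moved and only the destination for new writes changes.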
