Some ideas about submissions/attachments management

Apologies if the subject of this post has already been discussed elsewhere; unfortunately I don’t have the time I wish I had to closely follow ODK developments.

What high-level problem are you trying to solve?

Nowadays, every time a smartphone/tablet takes a picture that is attached to a form, there is a good chance that several MB of data are added to the ODK database backend. Even with a limited number of forms/devices/collaborators, this means the size of the DB quickly escalates, making management of the available disk space problematic (especially, but not only, for those who use VPS servers), not to mention backups and restores.

Any ideas on how ODK could help you solve it?

Before you “stone” me: yes, I know that attachments can be placed in S3-compatible storage, which is of course a good option to have. However, this adds a layer of complexity (and another possible point of failure) that not everyone can handle. Third-party providers are plentiful (AWS, Backblaze, etc.), but my experience is that the cost at the end of the month is often unpredictable, because it is not just a matter of how much storage is used, but also how much data is moved back and forth, the number of API calls, etc. Self-hosting is an option too, of course, but MinIO open source is dead (they now have a “free” tier, but it remains to be seen what it entails), and other projects like garageHQ or RustFS are very green, and it is unclear whether they are compatible with ODK Central. Speaking of the setup, it is still very unclear to me what this passage from the Central docs entails: “The names of objects stored in S3-compatible storage do not stand alone and must be converted to useful filenames and connected to the right forms and/or submissions by Central. For example, object names will look like blob-412-950ababd4c8cf8d11rf5421433b5e3dafx5f6e75”.

Has the possibility of using “normal”/system disk space for the attachments been considered? If it was discarded as an option and/or is not on the roadmap, why?

That said, I think it would be nice if there were improvements to help keep the database size at bay. From what I have understood, the only real option is to delete submissions (that have attachments).

From the UI it is possible to delete submissions, but unless I’m blind, I can’t see a way to delete multiple submissions at a time, only one by one. When you have thousands of submissions to delete, that is impractical. Checkboxes to allow multiple selection, or a way to select by date range, would help a lot.

About the 30-day retention time for the “soft delete”: it has always seemed arbitrary to me. For anyone in desperate need to shrink the database (and reduce the overall amount of disk space used on the server), this is a long wait. Why not let the system manager decide with a configuration parameter? Why not add a button in the UI (with all the flashing warnings that are necessary) to allow a project manager to hard delete previously soft-deleted submissions?

Last but not least: deleting an entire submission with heavy attachments just to reduce the DB size seems a bit drastic. Deleting only the attachment(s) (which in the meantime could have been downloaded by other means, like the UI or the API) would be much better. I was expecting the API method “Clearing a Submission Attachment” to do just that, but I tried it and, even after a manual VACUUM FULL, the database size does not change, so I’m missing what it is really for.

In an old thread I found a reference to what apparently is/was a real “delete submission attachment” API method, but I can’t see it anywhere in the docs, so maybe I misinterpreted the post, or maybe it was removed(?). Either way, this would also help.

Regards

Hi @manghig ,

I will provide some suggestions based on my experience.

If you are not using the max-pixels parameter on your images, you should look into using it, or try setting it to a lower value. It reduces the picture resolution and therefore saves some space on your disk. If, for some reason, this setting does not work, you can also look into using a different default camera on your devices (e.g. Open Camera) that allows you to set maximum sizes or resolutions for your images.
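For reference, here is a sketch of what this looks like in an XLSForm survey sheet (the parameters column is per the XLSForm spec; the value 1024 is just an illustrative choice, not a recommendation):

```
type    name    label          parameters
image   photo   Take a photo   max-pixels=1024
```

With this, Collect scales the captured image so its larger side is at most 1024 pixels before upload, which typically shrinks photos from several MB to a few hundred KB.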

There is no way to bulk delete submissions from the ODK Central UI at this time. I agree with you that it should be a priority on the ODK roadmap, as I also ran into a use case for bulk deleting submissions some years ago. However, if you have the skills to self-host ODK, you should be able to use the API to delete submissions. @yanokwa provided some code here that could be adapted to do this. There is a soft-delete period of 30 days, and I do not know how to work around that.
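For anyone exploring the API route, here is a rough, untested sketch of what such a script could look like, using plain `requests` against the documented Central REST endpoints (Sessions, listing and deleting Submissions). The server URL, credentials, project/form identifiers and the date cutoff are all placeholders you would replace:

```python
from datetime import datetime, timezone

BASE = "https://central.example.org/v1"   # placeholder Central server
PROJECT, FORM = 1, "my_form"              # placeholder project id / xmlFormId

def older_than(submissions, cutoff):
    """Return instanceIds of submissions created before `cutoff`."""
    picked = []
    for sub in submissions:
        created = datetime.fromisoformat(sub["createdAt"].replace("Z", "+00:00"))
        if created < cutoff:
            picked.append(sub["instanceId"])
    return picked

def bulk_delete(token, cutoff):
    # `requests` is imported lazily so the pure helper above stays dependency-free.
    import requests
    headers = {"Authorization": f"Bearer {token}"}
    url = f"{BASE}/projects/{PROJECT}/forms/{FORM}/submissions"
    for iid in older_than(requests.get(url, headers=headers).json(), cutoff):
        # DELETE soft-deletes the submission (it goes to the trash).
        requests.delete(f"{url}/{iid}", headers=headers).raise_for_status()

if __name__ == "__main__":
    import requests
    # POST /v1/sessions returns a bearer token for the API.
    token = requests.post(f"{BASE}/sessions",
                          json={"email": "admin@example.org",
                                "password": "change-me"}).json()["token"]
    bulk_delete(token, datetime(2023, 1, 1, tzinfo=timezone.utc))
```

Deleted submissions still sit in the trash for the 30-day retention period, so this frees database space only after the purge runs (and, as discussed below, after the database reclaims the space on disk).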

I believe the “delete submission attachment via the API” you are referring to is clearing a submission attachment. I have not tried this myself to know whether it deletes data or just the reference to the data.

In general, I think ODK Central wasn’t designed for emergency bulk deletes. Typically, you would want to test and monitor your installation so that you don’t run into surprises with data sizes, and if you do, it expects you to be able to expand your disk to support the larger database. As you mentioned, if local disk storage isn’t a suitable solution, then an S3-compatible storage should work. If all these options are outside your organization’s technical expertise, I would suggest the managed ODK hosting provided by getodk.org, or finding an organization willing to provide managed ODK hosting for your needs.

Best of luck!


Hi @jniles thanks for this suggestion.

Agree.

As a system manager I can handle the API; project/form managers usually can’t.

Also, I’m suggesting to let the administrator decide the soft-delete period. Why is it so long and basically hardcoded? And why shouldn’t there be a way in the UI to flush soft-deleted submissions (accessible only to administrators, of course)?

I’m looking for a clarification (actually two or three) here, because:

the clear attachment method does not make the database size any smaller, even after a “VACUUM FULL”

so… what is the goal of this method? The docs are very unclear to me.

I have found (here in this forum) references to a “delete attachments” method, but it is not in the docs, so I’m not sure whether it is the same as “clear”, or something that was removed, or something that was referenced by mistake.

I don’t agree here. Forms are updated, new forms are added, surveyors ask to take more pictures, and so on. The forms/scenarios that were initially tested may have changed over time. It is not a matter of “emergency bulk deletes” but just of keeping the database at a reasonable size. Having a database that is always growing, unless you delete submissions and do a VACUUM FULL, or add complexity/costs with an S3 endpoint, seems odd to me.

But I can cope with that. I think that allowing attachments alone to be really deleted (without deleting the whole submission), and/or allowing soft-deleted submissions to be flushed manually, is not unreasonable.

I can handle setting up an S3 endpoint, and while it is good to have this option, I think it can be argued that adding complexity (and costs) is not always desirable. As I wrote in my original post: it is more stuff to manage and that can break; with third-party providers, the costs at the end of the month are frequently hard to foresee (because it is not just a matter of storage used); and now that MinIO open source is dead, it is unclear whether Central is compatible with projects like RustFS, which are anyway very new, so it is unclear whether they will stick around.

I don’t think that having an option to store the attachments on the “local” filesystem (which could be a Samba or NFS share), outside the database, is an unreasonable thing to have/add.

cheers!


I self-host and I share some of your pain, but the alternative feels unaffordable to me (or I'm just fundamentally too tight to spend a fair percentage of our turnover on a managed solution!). So I try to work out how to keep things as simple and controllable as possible, even though I am at the far end of my competence (or maybe well beyond it). So far I have kept my server alive and in working order for 3 years... I think I have around 10 projects, about 50 forms, 10 or so entity lists and thousands of submissions, most with attachments. So probably not very big, but enough to worry about.

As suggested by @jniles the 'max-pixels' parameter for images is invaluable, even if you are not worried about disk space.

I was running towards the brick wall of a full disk, so I investigated the delete/clear attachments route as soon as it was viable for me to do so. You will still find it in the docs (it has not been removed); it is also referenced above:

(the summary text looks like it's pointing to the wrong section of the docs as I type this, so you might need to follow the link)

I think I've posted about it in the forum too. I can confirm that this works: it removes the attachment only (soft delete, then purge after 30 days), which can then allow the database size to shrink. The submission remains, so you have an easily accessible 'audit log'. None of this is 'immediate' due to the purge delay, so it probably needs a bit of forethought: when you can see your disk filling up, maybe some 'housekeeping' is good.

I think you know this, but you can use a pyODK script to handle bulk deletes:

What I haven't done is combine the bulk delete submission script with just deleting the attachment, but it seems eminently possible using pyODK's HTTP requests:
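Something along these lines might work. This is only a sketch I haven't run: it assumes pyODK is installed and configured, that its Client exposes the authenticated HTTP session for raw requests (as pyODK's docs describe), and all the ids and the filename are made-up placeholders. The endpoint is the documented "Clearing a Submission Attachment" DELETE route:

```python
from urllib.parse import quote

def attachment_url(project_id, form_id, instance_id, filename):
    """Build the relative path of Central's 'clear submission attachment'
    endpoint: DELETE /v1/projects/{pid}/forms/{fid}/submissions/{iid}
    /attachments/{filename}."""
    return (f"projects/{project_id}/forms/{quote(form_id, safe='')}"
            f"/submissions/{quote(instance_id, safe='')}"
            f"/attachments/{quote(filename, safe='')}")

if __name__ == "__main__":
    # Hypothetical usage; pyODK reads the server URL and credentials
    # from its config file.
    from pyodk.client import Client
    with Client() as client:
        resp = client.session.delete(
            attachment_url(1, "my_form", "uuid:1234", "photo.jpg"))
        resp.raise_for_status()  # attachment cleared (soft-deleted)
```

Looping this over the attachment listing of each submission would give the "delete only the attachments, keep the submission" behaviour discussed above.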

@yanokwa wrote an explanation about the database with a little caveat that you focussed on in your other recent post:

I'm not a database administration expert, but I take that as meaning it frees up space inside the 'tent' of the database, but the 'tent' doesn't automatically collapse inside the warehouse of your disk, so you can't use that space until the size/shape of the tent changes. There are other factors involved in how/whether that happens, so no black-and-white answer is correct. My experience is that the Central database reconfigures its actual size on the disk during its daily work-out, if there is enough space outside the tent to do so.

I have also switched to S3 storage. So far (about 6 months) my provider (not AWS) has charged me a fixed fee for a fixed amount of space, irrespective of bandwidth. This is cheaper than the disk space I was using when holding everything within Central, and it would also be cheaper than storing the attachments on that server in additional block storage (which I'd then need to back up separately, on top of my other backups).

I can't speak to any underlying reasons for what is and isn't available in the UI of Central, but it seems reasonable that an open-source project that allows 'cash-free' access to its systems might have a different approach from a commercial offering that wants to get you 'hooked'. I always expect the 'entry hurdles' to be higher, and there could be some 'here be dragons' moments. So the ODK team might not prioritise things that make administration 'easy', but they are likely to provide tools to keep things safe, secure and logical. Respectfully, perhaps some of your desires are tending towards the commercial service?

Personally I wouldn't want the 30 day purge to be removed / changed - I really like the fact that the ODK team look after us 'idiots' by keeping us away from 'hot surfaces'! Adding complexity so that someone poorly qualified (e.g. me) can break something inadvertently doesn't seem like a good idea.

We've all (probably) been guilty of wanting something for nothing. A 'free tier' is paid for by someone, and it doesn't always stay that way, so making commercial/long-term decisions on that basis might not be good practice. Look how Google Drive capability came and went with ODK as the 'model' changed... I had to back-pedal on that!

Hopefully some of your ideas can filter in to make life easier for us, but I'm always trying to be careful what I wish for. I'm happy to find tools to work with and can live without bells and whistles.


Thanks @seewhy for your reply.

I have fewer projects and forms, and the database size is approaching the 150 GB mark. That means I need at least another 150 GB free on the server to be able to “VACUUM FULL” once submissions/attachments are purged from the trash, if I understand it correctly.

Thanks.

So, if I understand correctly, the “clear attachment” method puts attachments in the same soft-delete/trash status as deleted submissions, right?

Since the UI doesn’t show anything about it (it only shows the number of deleted submissions), I assumed that was not the case for attachments.

Yes, thanks. I prefer to use other scripting languages, but that’s irrelevant here.

Maybe I’m just dumb, but I can’t understand this thing/concept.

Ah cool, can I kindly ask you for some feedback here? Questions/doubts on using S3 compatible storage

Not sure if I understand you correctly. Just to be clear: I’m very familiar with how an open-source project is run and developed. I’m a 15+ year contributor to a well-known open-source project. I have contributed a lot of time (QA), some code, and a substantial amount of money (well north of 20k euros). I know well that asking for features or bug fixes is not enough, and that to get things done it is sometimes necessary to be proactive. I’m not asking anyone to implement/fix anything here; I’m just suggesting a few things that I think would be valuable for the project. If it is possible to have some of these ideas done as part of a commercial service, then I’m all ears.

I have seen this argument in at least one other thread. I’m not saying that the 30-day retention period should be removed or changed; I’m arguing that there should be a way to allow at least the system administrator to get around it.

See above.

cheers!

I don’t have the complete historical context of why that was ruled out, but: safely storing files transactionally is not as straightforward as developers often think. There’s the need to fsync (“commit”) the file contents, but the parent directory inode changes also need to be committed before the file is denoted as stored in the meta-registration in the DB, and the best approach differs by filesystem type and even kernel version, while with the dockerized setup we don’t have influence on either. So one complexity is that doing it right is actually quite a bit of work and worry. Venerable DBs, e.g. PostgreSQL/SQLite, generally do get it right, as they eventually also use files to store the contents. Or are at least sufficiently interested in getting it right.

Another downside is that when using the filesystem there are no referential integrity constraints. In the DB we can’t delete file contents still referenced by some submission, but if we store files in the file system, there’s no such protection. Coupled with the likelihood that with filesystem-stored files, especially when files are going to be stored on network shares, people will be very tempted to start accessing (and potentially accidentally moving, or intentionally deleting) files, this will increase the risk of inadvertent data loss and will probably increase the support load and the risk of people going (heavily dramatized version) “ZOMG ODK IS EATING OUR FILES” (whereas it was actually the fault of their ex-colleague, two months ago, who inadvertently renamed some of the files in the filesystem, noticed something was broken, but kept it hush hoping no one would notice. And no one else knows this happened…). This scenario could be somewhat mitigated by letting Central scan its hypothetical file storage backend index regularly and surfacing any inconsistencies so that they may be acted upon, but that’ll be yet another extra thing.

Using the filesystem directly would also somewhat complicate making and restoring consistent backups. There’ll be the DB dump, and separately the file storage tree. Not impossible, but again, more work. Then again, the current practice of dragging every attachment through the DB (even when it will later be uploaded to S3 storage) makes for a huge DB WAL archive, so anyone continuously archiving WALs with the intent of doing PITR (aka “serious DB backups”) is not going to be happy with attachment-heavy submissions, since they’ll be maintaining a gigantic WAL archive yet still need snapshots of their S3 storage for full disaster recovery.

That said, in some other contexts, using straight filesystem storage to… well… store files would be a no-brainer. The common practice is to use a content-hash-based addressing/naming scheme for the file contents, keep all metadata in the DB, and, best of all, let the application webserver write out the headers but let the frontend webserver serve the actual file contents through sendfile, which instructs the OS to copy from one file descriptor (the open file) to another (the network socket) without involving user space. That would be the efficient-ideal setup, and so much more efficient than the current route:

  1. the DB reads from disk,
  2. copies to a userspace buffer,
  3. the OS of the DB server copies it to the network stack (chopping it up into packets),
  4. the OS instance of the application server reads it into its network stack (reassembling from packets),
  5. the application server reads the DB query results from the network stack and copies them to a buffer,
  6. it asks the OS to put that buffer into the network stack piecemeal,
  7. the frontend webserver reads from the network stack and reassembles the bytestream, and
  8. it again copies that to its OS network stack piecemeal, so that it eventually ends up at the user’s browser.

If making 8 copies and traversing 4 network stacks sounds inefficient, it’s because it is :laughing: But probably the worst effect is that using the DB for files bears a great opportunity cost: files are large, and reading them flushes out the DB’s in-memory buffers, where something far more useful and reusable (say, the index on the sessions table…) could be held. Also, keeping a scarce “smart” DB connection occupied doing the “dumb” work of shoveling bytes (which could, with another architecture, be just a sendfile system call) is really not great for scaling.

I hope this elaboration on pros and cons gives some background on the potential considerations. They are my personal considerations and thoughts; I don’t speak for the whole team, and I don’t have the full historical context. As far as I know, we’re currently not really considering storage backends other than the existing DB or write-through-DB-to-S3. But I do really understand that there are drawbacks of the currently available options that would be solved with straight-to-filesystem storage. The dream would be (could be) that the storage interface becomes modular to the extent that someone like you could easily write/adapt a storage backend plugin that indeed uses the filesystem, and then it’s your party from there onwards.


There's some great discussion here and understanding of the tradeoffs we considered when building the database-based blob storage (as opposed to storing files in the file system) and the 30-day soft-delete trash system.

I've posted a couple other places today about

I also wanted to note @manghig's need for:

  1. Bulk deleting submissions via the Central UI. You're not blind, we have this for entities but not submissions, and the disconnect is unfortunate! We should definitely revisit this.
  2. More clarity in our docs about how sys admins can override the 30-day trash.
  3. More info about clearing submission attachments and how that relates to database storage.

Thanks for bringing up these pain points!


@Nonsequitur I couldn’t have asked for a more complete answer, thanks. Most of the technical details are way over my head, but I now understand that it is less obvious than I imagined.

And this would be perfectly reasonable: whoever self-hosts Central already needs to take care of backing up the server/VM/database/submissions/etc. Adding to that the responsibility of backing up the folder where attachments are stored would not add much more… and it would still be much less overhead/cost than using S3 storage.

cheers!