Some ideas about submissions/attachments management

Apologies if the subject of this post has already been discussed elsewhere; unfortunately I don’t have the time I wish I had to closely follow ODK development.

What high-level problem are you trying to solve?

Nowadays, every time a smartphone/tablet picture is attached to a form, there is a good chance that several MB of data are added to the ODK database backend. Even with a limited number of forms/devices/collaborators, this means that the size of the DB quickly escalates, making management of the available disk space problematic (especially, but not only, for those using VPS servers), not to mention backups and restores.

Any ideas on how ODK could help you solve it?

Before you “stone” me: yes, I know that attachments can be placed in S3-compatible storage, which of course is a good option to have. However, this adds a layer of complexity (and another possible point of failure) that not everyone can handle. Third-party providers are plentiful (AWS, Backblaze, etc.), but my experience is that the cost at the end of the month is often unpredictable, because it is not just a matter of how much storage is used, but also how much data is moved back and forth, the number of API calls, etc. Self-hosting is an option too, of course, but the MinIO open source edition is dead (they now have a “free” tier, but it remains to be seen what that entails) and other projects like GarageHQ or RustFS are very young, and it is unclear whether they are compatible with ODK Central. Speaking of the setup, it is still very unclear to me what “The names of objects stored in S3-compatible storage do not stand alone and must be converted to useful filenames and connected to the right forms and/or submissions by Central. For example, object names will look like blob-412-950ababd4c8cf8d11rf5421433b5e3dafx5f6e75” entails (as seen in the Central docs).

Has the possibility of using “normal”/system disk space for the attachments been considered? If it was discarded as an option and/or if it is not on the roadmap, why?

That said, I think it would be nice if there were improvements to help keep the database size at bay. From what I have understood, the only real option is to delete submissions (that have attachments).

From the UI it is possible to delete submissions, but unless I’m blind there is no way to delete multiple submissions at a time, only one by one. When there are thousands of submissions to delete, that is impractical. Checkboxes to allow multiple selection, or a way to select by date range, would help a lot.

About the 30-day retention time for the “soft delete”: it has always seemed arbitrary to me. For anyone in desperate need of shrinking the database (and reducing the overall amount of disk space used on the server), this is a long wait. Why not let the system manager decide with a configuration parameter? Why not add a button in the UI (with all the flashing warnings that are necessary) to allow a project manager to hard delete previously soft-deleted submissions?

Last but not least: deleting an entire submission with heavy attachments just to reduce the DB size seems a bit drastic. Deleting only the attachment(s) (which in the meantime could have been downloaded by other means, like the UI or the API) would be much better. I expected the API method “Clearing a Submission Attachment” to do just that, but I tried it and, even after a manual VACUUM FULL, the database size does not change, so I don’t understand what it is really for.

In an old thread I found a reference to what apparently is/was a real “delete submission attachment” API method, but I can’t see it anywhere in the docs, so maybe I misinterpreted the post or maybe it was removed(?). Either way, this would also help.

Regards

Hi @manghig,

I will provide some suggestions based on my experience.

If you are not already using the max-pixels parameter on your image questions, you should look into setting it (or setting it to a lower value). It scales the images down and therefore saves space on your disk; in XLSForm it goes in the parameters column of an image question (e.g. max-pixels=1024). If, for some reason, this setting does not work, you can also look into using a different default camera on your devices (e.g. Open Camera) that allows you to set maximum sizes or resolutions for your images.

There is no way to bulk delete submissions from the ODK Central UI at this time. I agree with you that it should be a priority on the ODK roadmap, as I also ran into a use case for bulk deleting submissions some years ago. However, if you have the skills to self-host ODK, you should be able to use the API to delete submissions. @yanokwa provided some code that could be adapted to do this here. There is a soft-delete period of 30 days, and I do not know of a way to work around that.
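
For illustration, here is a minimal sketch of what such a script could look like (this is not @yanokwa’s code, just an outline against the documented endpoints). The server URL, credentials, project ID, form ID and cutoff date are placeholders you would need to adapt, and you should test it on a non-production project first:

```python
# Minimal sketch: soft-delete submissions created before a cutoff date.
# BASE_URL, credentials, PROJECT_ID, FORM_ID and CUTOFF are placeholders.
from datetime import datetime, timezone

import requests

BASE_URL = "https://central.example.org/v1"   # hypothetical server
PROJECT_ID = 1                                # placeholder
FORM_ID = "my_form"                           # placeholder xmlFormId
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Log in with a web user and use the returned session token as Bearer auth.
token = requests.post(
    f"{BASE_URL}/sessions",
    json={"email": "admin@example.org", "password": "secret"},
).json()["token"]
headers = {"Authorization": f"Bearer {token}"}

# List the form's submissions, then soft-delete the old ones. They land in
# the trash and are purged by Central about 30 days later.
subs = requests.get(
    f"{BASE_URL}/projects/{PROJECT_ID}/forms/{FORM_ID}/submissions",
    headers=headers,
).json()

for sub in subs:
    created = datetime.fromisoformat(sub["createdAt"].replace("Z", "+00:00"))
    if created < CUTOFF:
        requests.delete(
            f"{BASE_URL}/projects/{PROJECT_ID}/forms/{FORM_ID}"
            f"/submissions/{sub['instanceId']}",
            headers=headers,
        ).raise_for_status()
        print("soft-deleted", sub["instanceId"])
```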

I believe the “delete submission attachment via the API” you are referring to is Clearing a Submission Attachment. I have not tried it myself, so I cannot say whether it deletes the data or just the reference to it.

In general, I think ODK Central wasn’t designed for emergency bulk deletes. Typically, you would want to test and monitor your installation so that you don’t run into surprises with data sizes, and if you do, the expectation is that you can expand your disk to support the larger database. As you mentioned, if local disk storage isn’t a suitable solution, then S3-compatible storage should work. If all these options are outside your organization’s technical expertise, I would suggest the managed ODK hosting provided by getodk.org, or finding a hosting organization willing to provide managed ODK hosting for your needs.

Best of luck!

Hi @jniles, thanks for this suggestion.

Agree.

As a system manager I can handle the API; project/form managers usually can’t.

Also, I’m suggesting letting the administrator decide about the soft-delete period. Why is it so long and basically hardcoded? And why shouldn’t there be a way in the UI to flush soft-deleted submissions (possibly accessible only to administrators, of course)?

I’m looking for a clarification (actually two or three) here, because:

the clear attachment method does not make the database size any smaller, even after a “VACUUM FULL”

so… what is the goal of this method? The docs are very unclear to me.

I have found (here on this forum) references to a “delete attachments” method, but it is not in the docs, so I’m not sure whether it is the same as “clear”, something that was removed, or something that was referenced by mistake.

I don’t agree here. Forms are updated, new forms are added, surveyors ask to take more pictures, and so on. The forms/scenarios that were initially tested may have changed over time. It is not a matter of “emergency bulk deletes” but simply of keeping the database at a reasonable size. Having a database that is always growing, unless you delete submissions and do a VACUUM FULL or add complexity/costs with an S3 endpoint, seems odd to me.

But I can cope with that. I think that allowing users to really delete only the attachments (without deleting the whole submission) and/or to manually flush soft-deleted submissions is not unreasonable.

I can handle setting up an S3 endpoint, and while it is good to have this option, I think it can be argued that adding complexity (and costs) is not always desirable. As I wrote in my original post: it is more stuff to manage and that can break; with third-party providers the costs at the end of the month are frequently hard to foresee (because it is not just a matter of storage used); and now that MinIO open source is dead, it is unclear whether Central is compatible with projects like RustFS, which are in any case very new, so it is also unclear whether they will stick around.

I don’t think that having an option to store the attachments on the “local” filesystem (which could be a Samba or NFS share), outside the database, is an unreasonable thing to have/add.

cheers!

I self-host and I share some of your pain, but the alternative feels unaffordable to me (or I'm just fundamentally too tight to spend a fair percentage of our turnover on a managed solution!). So I try to work out how to keep things as simple and controllable as possible, even though I am at the far end of my competence (or maybe well beyond it). So far I have kept my server alive and in working order for 3 years... I think I have around 10 projects, about 50 forms, 10 or so entity lists and thousands of submissions, most with attachments. So probably not very big, but enough to worry about.

As suggested by @jniles, the 'max-pixels' parameter for images is invaluable, even if you are not worried about disk space.

I was running towards the brick wall of a 'full disk', so I investigated the delete / clear attachments endpoint as soon as it was viable for me to do so. You will still find it in the docs (it has not been removed), also referenced above:

(the summary text looks like it's pointing to the wrong section of the docs as I type this, so you might need to follow the link)

I think I've posted about it in the forum too. I can confirm that this works - it removes the attachment only (soft delete, then purge after 30 days), which can then allow the database size to shrink. The submission remains, so you have an 'audit log' easily accessible. None of this is 'immediate' due to the purge delay, so it probably needs a bit of forethought - when you can see your disk filling up, maybe some 'housekeeping' is good.

I think you know this, but you can use a pyODK script to handle bulk deletes:

What I haven't done is combine the bulk delete submission script with just deleting the attachments - but it seems to be eminently possible using pyODK's HTTP requests:
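
Something roughly like this ought to do it - a sketch I haven't actually run, where the project and form IDs are placeholders and the usual .pyodk_config.toml is assumed to be in place:

```python
# Rough sketch (untested by me): clear every attachment on a form's
# submissions via pyODK's raw HTTP session. PROJECT_ID and FORM_ID are
# placeholders; auth comes from the usual .pyodk_config.toml.
from urllib.parse import quote

from pyodk.client import Client

PROJECT_ID = 1
FORM_ID = "my_form"

with Client() as client:
    base = f"projects/{PROJECT_ID}/forms/{FORM_ID}/submissions"

    for sub in client.session.get(base).json():
        iid = sub["instanceId"]
        # Expected attachments look like [{"name": ..., "exists": true}, ...]
        for att in client.session.get(f"{base}/{iid}/attachments").json():
            if att["exists"]:
                # The "Clearing a Submission Attachment" endpoint.
                resp = client.session.delete(
                    f"{base}/{iid}/attachments/{quote(att['name'])}"
                )
                resp.raise_for_status()
                print(f"cleared {att['name']} from {iid}")
```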

@yanokwa wrote an explanation about the database with a little caveat that you focussed on in your other recent post:

I'm not a database administration expert, but I take that as meaning it frees up space inside the 'tent' of the database, but the 'tent' doesn't automatically collapse inside the warehouse of your disk, so you can't use that space until the size/shape of the tent changes - and there are other factors involved in how / whether that happens, so no black/white answer is correct. My experience is that the Central database reconfigures its actual size on the disk during its daily work-out, if there is enough space outside the tent to do so.
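
For what it's worth, you can watch the 'tent' from outside with a quick check like this - purely illustrative, the connection details are placeholders, and you'd run it wherever you can actually reach Central's Postgres (e.g. from inside the postgres container or over a tunnel):

```python
# Purely illustrative: peek at where the disk space is going inside
# Central's Postgres. Connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="odk", user="odk", password="secret",  # placeholders
)
with conn, conn.cursor() as cur:
    # Total size of the database as Postgres sees it on disk.
    cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
    print("database size on disk:", cur.fetchone()[0])

    # The largest tables - this shows where the bulk of the space sits.
    cur.execute("""
        SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
        FROM pg_class
        WHERE relkind = 'r'
        ORDER BY pg_total_relation_size(oid) DESC
        LIMIT 5
    """)
    for name, size in cur.fetchall():
        print(f"{name}: {size}")
conn.close()
```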

I have also switched to S3 storage. So far (about 6 months) my provider (not AWS) has charged me a fixed fee for a fixed amount of space, irrespective of bandwidth. This is cheaper than the disk space I was using when holding everything within Central - and it would also be cheaper than storing the attachments on that server in additional block storage (which I'd then need to back up separately, on top of my other backups).

I can't speak to the underlying reasons for what is and isn't available in the Central UI, but it seems reasonable that an open source project that allows 'cash-free' access to its systems might have a different approach to a commercial offering that wants to get you 'hooked'. I always expect the 'entry hurdles' to be higher, and there could be some 'here be dragons' moments. So the ODK team might not prioritise things that make administration 'easy', but they are likely to provide tools to keep things safe, secure and logical. Respectfully, perhaps some of your desires are tending towards the commercial service?

Personally I wouldn't want the 30 day purge to be removed / changed - I really like the fact that the ODK team look after us 'idiots' by keeping us away from 'hot surfaces'! Adding complexity so that someone poorly qualified (e.g. me) can break something inadvertently doesn't seem like a good idea.

We've all (probably) been guilty of wanting something for nothing - 'free tier' is paid for by someone, and it doesn't always stay that way, so making commercial / long-term decisions on that basis might not be good practice. Look how Google Drive capability came and went with ODK, as the 'model' changed... I had to back pedal on that!

Hopefully some of your ideas can filter in to make life easier for us, but I'm always trying to be careful what I wish for. I'm happy to find tools to work with and can live without bells and whistles.

Thanks @seewhy for your reply.

I have fewer projects and forms, but the database size is approaching the 150 GB mark, so I will need at least 150 more free GB on the server to be able to “vacuum full” once submissions/attachments are purged from the trash, if I understand it correctly.

Thanks.

So, if I understand correctly, the “clear attachments” method puts them in the same soft-delete/trash status as deleted submissions, right?

As the UI doesn’t show anything about it (it only shows the number of deleted submissions), I assumed that was not the case for attachments.

Yes, thanks. I prefer to use other scripting languages, but that’s irrelevant here.

Maybe I’m just dumb, but I can’t understand this concept.

Ah cool, can I kindly ask you for some feedback here? Questions/doubts on using S3 compatible storage

Not sure if I understand you correctly. Just to be clear: I’m very familiar with how an open source project is run and developed. I have been a contributor to a well-known open source project for 15+ years, contributing a lot of time (QA), some code, and a substantial amount of money (well north of 20k euros). I know well that asking for features or bug fixes is not enough, and that to get things done it is sometimes necessary to be proactive. I’m not asking anyone to implement/fix anything here, I’m just suggesting a few things that I think would be valuable for the project. If it is possible to have some of these ideas implemented as part of a commercial service, then I’m all ears.

I have seen this argument in at least one other thread. I’m not saying that the 30-day retention period should be removed or changed; I’m arguing that there should be a way to allow at least the system administrator to get around it.

see above.

cheers!