ODK Central - How to delete a submission

Am I right to say that at this time (August 2024) there is still no way to delete submissions from the Central GUI or ODK API? If the answer is "yes", then this is really surprising: I can think of at least three or four good reasons to support it, one of which has been pointed out in this thread: to programmatically recoup space on the server where Central runs. A project/form with a few dozen app users sending data with heavy media (photos/videos) will quickly drain disk space, especially on cloud servers. And why should I want to keep them? The API (hopefully) allows downloading submissions + media filtered by date, so it is very reasonable to throw away submissions older than a specific point in time.

At the very least I would expect an API call/method to delete/remove media files from submissions...

Anyway... since I guess there is still no way to delete submissions as one would expect, what would be the correct procedure for deleting them directly in the DB? I can't really afford to mangle the DB/form/Central installation, and the last time I checked, the data model used by Central was anything but obvious.

Thanks in advance.


Hi @manghig,

The API now supports submission deletion: https://docs.getodk.org/central-api-submission-management/#deleting-a-submission

https://docs.getodk.org/central-install-digital-ocean/#using-s3-compatible-storage might solve your storage space constraints.

As for partial downloads, the API definitely supports pagination as well as filtering by, e.g., date.
Both pyODK (https://getodk.github.io/pyodk/submissions/#pyodk._endpoints.submissions.SubmissionService.get_table) and
ruODK support pagination via top/skip and OData filters where you can, e.g., filter by submission date; see the examples.
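As a sketch of that approach in Python: the filter string below uses Central's OData `__system/submissionDate` field, and the pager is a generic top/skip loop. The form id, cutoff date, and the commented pyODK call are placeholders/assumptions, not tested against a live server.

```python
from datetime import date

def submission_date_filter(cutoff: date) -> str:
    """OData filter string selecting submissions received up to the cutoff.
    __system/submissionDate is Central's server-side receipt timestamp."""
    return f"__system/submissionDate le {cutoff.isoformat()}"

def paged(fetch_page, page_size=100):
    """Iterate over all rows using $top/$skip-style pagination.
    `fetch_page(top, skip)` must return one page as a list; iteration
    stops at the first page shorter than `page_size`."""
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        yield from page
        if len(page) < page_size:
            return
        skip += page_size

# With pyODK it would look roughly like this (form id is a placeholder):
# from pyodk.client import Client
# client = Client()
# old_rows = list(paged(
#     lambda top, skip: client.submissions.get_table(
#         form_id="my_form", top=top, skip=skip,
#         filter=submission_date_filter(date(2024, 8, 1)),
#     )["value"]
# ))
```

Once you have the old rows, their instance IDs are what you would feed to a download (or, now, delete) loop.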

Downloading data via ruODK::odata_submission_get() doesn't take all that long, as it skips already-downloaded attachments.


Great news, thanks!

Yes of course, but that is not an option for many.

Yes, I am aware that the API can do that. My whole rant (and I'm sorry for it), here and in other posts too, is that with Briefcase this was also possible without needing to write "complex" scripts: just parameters on an easy-to-use command-line tool (and I'm not even counting the GUI, which was a blessing for many non-technical people).

Only rust?

regards

Hi @manghig, the ability to delete Submissions from the UI in Central is coming soon in the next release! We know this is a highly requested feature and we don't want to limit it to API users only.


That is great news, thanks!
Would it be possible to select more than one submission to delete?

In general (API or GUI), will deleting submissions recoup disk space when they contain media files? I remember something about PostgreSQL requiring an extra step to reclaim disk space when deleting records with BLOBs...

MAYBE this answers the question, assuming that the delete-submission function works in the same way...

So it might not be an immediate recovery of space, but once the 'trash' is purged, I think it will be recovered...

Any idea what the SQL command looks like? Also, some help on how to run it with cron, given that the database is inside Docker...

Sorry, that's way beyond my competence; it's under-the-bonnet, 'here be dragons' kind of stuff. All I know is that there is a cron job that does the necessary compaction of the database when the media 'expires' in the trash (what I'm referring to as a purge); my guess is that this now takes place as part of deleting the submission (i.e. the submission and its media go to trash). An earlier post in this thread indicated a way of shortening the 'wait time' before the purge, but that has a 'with caution' caveat on it...

That was all I needed to know: that it happens, not how to do it manually. I self-host, so it's better for me to push the envelope than to throw it on the fire.

So are you saying that it is all automatic? No need to set up that cron job manually?

That's my understanding. Have a look at the thread I linked to from @ktuite; it says "a daily cron job for purging deleted forms ... will do its work", and being a little 'retentive' I asked the very same question:

Not sure if this will help, but here is a basic pyODK script that works for deleting specific attachments, so it should be possible to adapt it very slightly for the new delete-submissions endpoint [remove the '/attachments/' + filename part from the end of the URL]... You will need to set up the credentials for pyODK, which are covered in the docs.

This would give you the opportunity to loop through a series of submissions and/or attachments using standard Python constructs. You may be far more skilled with pyODK than I am, so please don't be offended by my offering this:

from pyodk.client import Client

client = Client()  # credentials come from the pyODK config file

project_id = 1             # a number
form_id = "your_form"      # the xmlFormId
instance_id = "uuid:..."
filename = "e.g. epoch_time.jpg"

# Raw API call: clear one attachment from one submission.
# pyODK has no dedicated method for this, so we go through its HTTP session.
response = client.session.delete(
    f"projects/{project_id}/forms/{form_id}"
    f"/submissions/{instance_id}/attachments/{filename}"
)
print(f"{instance_id} / {filename} status: {response.status_code}")

(the projectId and xmlFormId can be found in Central under the 'analyse with OData' link on the form's submissions page)
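For the new delete-submission endpoint, the adaptation really is just dropping the attachments suffix. A sketch of the URL building (the pyODK usage is commented out and untested against a live server; ids are placeholders):

```python
def submission_path(project_id: int, form_id: str, instance_id: str) -> str:
    # Relative Central API path for one submission; sending DELETE to it
    # soft-deletes the submission (it goes to the trash for 30 days).
    return f"projects/{project_id}/forms/{form_id}/submissions/{instance_id}"

# Hypothetical usage via pyODK's underlying HTTP session:
# from pyodk.client import Client
# client = Client()
# response = client.session.delete(submission_path(1, "your_form", "uuid:..."))
# print(response.status_code)
```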

Also, I don't know how to retrieve attachments that are in the trash, so some caution is required...

Good luck with the deletions and disk space recovery.

It should not be so hard as to require any luck... It is really a basic feature... and there is almost no point in deleting submissions with media files if disk space is not recovered automatically...

The 'luck' is in getting everything right, not in how hard it is. If, like me, you have thousands of images to manage and some to delete, you might wish for a little luck that you don't end up on the forum asking for help to undelete/recover something you deleted by mistake, and that you'd be lucky enough for someone to answer...

Whilst I might agree that being able to delete submissions should be a core function, I do understand why the ODK team have been cautious in letting something out into the wild that could have devastating consequences to data integrity, possibly in a mission-critical situation (by which I mean life or death, not 'am I going to get into trouble for deleting a spurious image'). Nothing basic about that.

And I would encourage you to read the posts I linked to before repeating the complaint about automatic disk space recovery; you are turning it into a conspiracy. It does happen automatically, and it was confirmed unequivocally by a member of the ODK team. That's why I linked to it.

I understand your frustration with this, but please remember that this is open source software: my time is not free, but I might choose to give it to you. I could not operate my business as effectively without ODK, but I don't pay cash to use it. The least we can do is to contribute something and support its continued development...


As @ktuite noted, when you delete a form or submission in Central, all of its resources and attachments are soft-deleted (put in trash) for 30 days. During those 30 days, you can restore that data. After the 30 days, the data is purged (trash is emptied) and marked in the database as "dead".

PostgreSQL, the database used by Central, automatically vacuums up "dead" data and makes it available for re-use. When the autovacuum happens and whether or not it returns space to the operating system (as opposed to only the database) is decided by Postgres based on how it's configured and how you are using it.

Postgres doesn't always return disk space to the operating system because doing so would be an expensive operation and so there is an optimization tradeoff that it makes. True recovery (aka full vacuum or shrinking) by Postgres typically requires blocking reads/writes and twice the size of the database because it's essentially a backup and restore. This is one of the reasons why most cloud databases (e.g., RDS) can't shrink a database once it grows.

If you want to force vacuuming, you can do so:

  1. Run docker exec -it central-postgres14-1 psql -U odk -W odk
  2. Enter the default password: odk
  3. Run VACUUM;
    • Reclaims space and makes it available for re-use by the database. Does not always return space to the operating system.
  4. Run VACUUM FULL;
    • Reclaims more space, but takes much longer, locks the database so you can't use it, and requires extra disk space. Returns space to the operating system.

So in summary, if you delete data in Central, it eventually frees up space for re-use inside the database, but it may not free up space for re-use by the operating system.
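For the earlier question about running this from cron with the database inside Docker: psql can take the statement non-interactively via -c, and the password via the PGPASSWORD environment variable. A sketch, assuming the default container name, user, and password from a standard Central install (yours may differ); plain VACUUM is the safer choice for an unattended job, since VACUUM FULL locks the database:

```shell
# One-off, non-interactive vacuum of the odk database inside the container:
docker exec -e PGPASSWORD=odk central-postgres14-1 \
  psql -U odk -d odk -c 'VACUUM;'

# Crontab entry (edit with `crontab -e`) running the same command every
# Sunday at 03:00 and appending the output to a log file:
# 0 3 * * 0 docker exec -e PGPASSWORD=odk central-postgres14-1 psql -U odk -d odk -c 'VACUUM;' >> /var/log/odk-vacuum.log 2>&1
```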


So in summary, if you delete data in Central, it eventually frees up space for re-use inside the database, but it may not free up space for re-use by the operating system

@yanokwa Hi. If I move the attachments to an S3-compatible storage (https://docs.getodk.org/central-install-digital-ocean/#using-s3-compatible-storage), will this trigger freeing disk space on the VPS where Central is currently running (and storing the attachments in the PostgreSQL DB)? Or does VACUUM FULL need to be run manually anyway? Thanks in advance.

@yanokwa follow-up question: would this https://docs.getodk.org/central-api-submission-management/#clearing-a-submission-attachment help free disk space on the server?

follow-up question: would this https://docs.getodk.org/central-api-submission-management/#clearing-a-submission-attachment help free disk space on the server?

Answering myself… this by itself does not free any disk space on the server. Or did I misunderstand what this API call does?

If you want to free up disk space, you need to run vacuum full after deleting data or uploading the blobs to block storage. If you don't have 2x the disk space or can't lock the DB for the full vacuum, I've used and recommend pg_repack as an alternative.

@yanokwa Hi. I cleared attachments with https://docs.getodk.org/central-api-submission-management/#clearing-a-submission-attachment and now I can see on Central that the attachments are missing


I then ran the VACUUM FULL; operation, but after it finished the size of the DB remained unchanged, as seen with

SELECT pg_size_pretty(pg_database_size('odk')) AS size;

before and after clearing attachments with the Central API.

I did another check (before and after) with

odk=# SELECT relname, relkind, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 15;
               relname                | relkind | pg_size_pretty
--------------------------------------+---------+----------------
 blobs                                | r       | 116 GB
 pg_toast_16560                       | t       | 116 GB
 pg_toast_16560_index                 | i       | 1294 MB
 audits                               | r       | 84 MB
 client_audits                        | r       | 74 MB
 submission_defs                      | r       | 39 MB
 client_audits_start_index            | i       | 13 MB
 audits_action_acteeid_loggedat_index | i       | 9800 kB
 audits_acteeid_loggedat_index        | i       | 6992 kB
 submissions                          | r       | 6160 kB
 audits_actorid_action_loggedat_index | i       | 6040 kB
 form_defs                            | r       | 5240 kB
 pg_toast_16915                       | t       | 5024 kB
 form_field_values                    | r       | 4728 kB
 submission_attachments               | r       | 4392 kB
(15 rows)

As far as I have understood, when rows with BLOBs are UPDATEd rather than DELETEd (as I assume the “clear submission attachments” API method does):

a new row version is created

the old row version remains in place

the TOASTed BLOB rows referenced by the old row are still there

Is this the correct interpretation? If so, does this mean that the only way to really shrink the database is to DELETE submissions with attachments and wait for the 30-day purge (other than using S3-compatible storage)?
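For what it's worth, one way to probe this interpretation without any Central-specific schema knowledge is PostgreSQL's standard statistics views, runnable from the same psql session (this is a generic Postgres query, not something ODK-specific):

```sql
-- Live vs. dead row versions in the blobs table. A large n_dead_tup would
-- support the "old row versions are still there" reading, while a large
-- n_live_tup after the VACUUM FULL suggests the blob rows themselves were
-- never deleted, only dereferenced.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'blobs';
```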

thanks in advance