Info about accessing submission attachments in external S3 storage

Hi all,

I'm loving the external S3 storage that was added in 2024.2.0 (announced in "ODK Central v2024.2: Submission deletes via API and S3 media storage").

This post is mostly to provide info for anyone searching for something similar, rather than to ask a question. Hope this is ok :smile:
(I also added this to the development section as it's probably more for devs.)

Preamble

Originally I wanted to post a question requesting that the S3 keys for submission attachments be included somewhere in the submission JSON/CSV.

This was primarily because I am using a public access S3 bucket, so it made sense to simply construct the S3 URL using the key and access the data (e.g. to embed multiple submission photos in a web page from their S3 URLs).

However, I realise the typical use case would involve a private access bucket, so I took another route instead.

Getting the S3 URLs for submission attachments

This is quite simple in hindsight.
The main process is: list attachments --> request attachment --> get pre-signed S3 URL --> do what you want with the URL! (download, display the img, etc).

  1. List the submissions for the form (i.e. get the UUID of the submission you are interested in):

    /v1/projects/{PROJECT_ID}/forms/{FORM_ID}/submissions

    Returns:

    {
        "instanceId": "uuid:e83db2b4-5e82-4e61-bc32-04750e511aff",
        ...
    }
    

    It's also possible to get submission UUIDs via the OData endpoint.

  2. List the attachments for a given submission UUID:

    /v1/projects/{PROJECT_ID}/forms/{FORM_ID}/submissions/{SUBMISSION_UUID}/attachments

    Returns:

    [{"name":"1731676401897.jpg","exists":true}, ...]
    

    The 'name' value here is stored in the name column of the submission_attachments table in the Central database, and is generated to be unique.

    This is the value used to download the attachment in the next step.

  3. Request a pre-signed URL for each attachment:

    /v1/projects/{PROJECT_ID}/forms/{FORM_ID}/submissions/{SUBMISSION_UUID}/attachments/{ATTACHMENT_NAME}

    Returns (example):

    https://YOUR_S3_PROVIDER/BUCKET_NAME/blob-5-35bd9c1c5cbb5fb549b5f2bfa9d1f8a7fad45fc2?
    response-content-disposition=attachment%3B%20filename%3D%221731676401897.jpg%22%3B%20
    filename%2A%3DUTF-8%27%271731676401897.jpg&response-content-type=image%2Fjpeg&X-Amz-Algorithm=AWS4-HMAC-SHA256&
    X-Amz-Credential=fmtm%2F20241115%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241115T160531Z&X-Amz-Expires=60&
    X-Amz-SignedHeaders=host&X-Amz-Signature=3cd31e4303c5be4a2e649500322679952f6940b95432f4abb1bbd4b83916c5e5
    

    Note that Central will seamlessly handle either sending the blob directly from the database, or providing a pre-signed URL for download from the S3 bucket.
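For anyone who wants to script steps 2 and 3, here is a minimal Python sketch using only the standard library. It rests on a few assumptions that are not stated above: that you authenticate with a bearer session token, that Central answers the attachment request with an HTTP redirect whose Location header carries the pre-signed URL, and that a plain 200 response means the blob was served from the database. The helper names (`attachment_path`, `existing_names`, `presigned_url`) are my own, not part of any Central client.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def attachments_path(project_id, form_id, instance_id):
    """Build the attachment listing path from step 2."""
    # The instance ID contains a colon ("uuid:..."), so percent-encode each segment.
    return (
        f"/v1/projects/{project_id}/forms/{quote(str(form_id), safe='')}"
        f"/submissions/{quote(instance_id, safe='')}/attachments"
    )


def attachment_path(project_id, form_id, instance_id, name):
    """Build the single-attachment path from step 3."""
    return f"{attachments_path(project_id, form_id, instance_id)}/{quote(name, safe='')}"


def existing_names(listing):
    """Keep only attachments that were actually uploaded (exists == true)."""
    return [item["name"] for item in listing if item.get("exists")]


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the Location header can be read."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def presigned_url(base_url, token, path):
    """Return the redirect target for an attachment, or None if Central sent the blob itself.

    Assumption: bearer-token auth and a redirect to the pre-signed S3 URL.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(
        base_url + path, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with opener.open(request):
            return None  # 200: the blob came straight from the database
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise
```

Usage would look something like `presigned_url("https://central.example.org", token, attachment_path(1, "my_form", instance_id, name))`, where the server URL, project ID and form ID are placeholders for your own values.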

Like I said above, this seems obvious in hindsight, but I didn't realise Central was capable of providing pre-signed URLs to access the images (originally I thought the only way to access the S3 data was from a submission .zip dump).

Hope this helps someone!

Great news! I have yet to start fiddling with the external S3 storage feature, but it is already looking great from the info in your post.

One thing I notice in your "Returns (example)" URL is the X-Amz-Expires=60 query parameter.

This means that the signed URL will only work for 60 seconds after it's generated, which may fall short in some scenarios (e.g. storing the image URL to display later).

From the AWS Docs:

X-Amz-Expires
Provides the time period, in seconds, for which the generated presigned URL is valid. For example, 86400 (24 hours). This value is an integer. The minimum value you can set is 1, and the maximum is 604800 (seven days). A presigned URL can be valid for a maximum of seven days because the signing key you use in signature calculation is valid for up to seven days.
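To make the expiry concrete, here is a small Python sketch that reads X-Amz-Date and X-Amz-Expires from a pre-signed URL and works out when it stops being valid. The function names are my own, and it assumes the URL uses the SigV4 query parameters shown in the example above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url):
    """Return the UTC time at which a SigV4 pre-signed URL expires."""
    params = parse_qs(urlsplit(url).query)
    # X-Amz-Date is the signing time in ISO 8601 basic format, e.g. 20241115T160531Z.
    signed_at = datetime.strptime(
        params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity period in seconds, counted from the signing time.
    return signed_at + timedelta(seconds=int(params["X-Amz-Expires"][0]))


def is_expired(url, now=None):
    """Check whether the URL has already expired (now defaults to the current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return now >= presigned_expiry(url)
```

For the example URL in the first post (signed at 16:05:31 UTC with a 60 second lifetime), this gives an expiry of 16:06:31 UTC on the same day, so a stored copy of that URL would need to be re-requested from Central before being displayed later.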

It could be a great feature to have this configurable somehow, or even better, implemented as a query parameter of the authenticated endpoint /v1/projects/{PROJECT_ID}/forms/{FORM_ID}/submissions/{SUBMISSION_UUID}/attachments/{ATTACHMENT_NAME}

Very good point!

Looks like this is currently hardcoded at 60s (which is quite strict): https://github.com/getodk/central-backend/blob/afa039773d4b07c5c0cac5afb149ddb9cf5bca13/lib/external/s3.js#L134

It could possibly be a param added to the endpoint you reference --> util.blob.blobResponse --> s3.urlForBlob --> minioClient.presignedGetObject: https://github.com/getodk/central-backend/blob/afa039773d4b07c5c0cac5afb149ddb9cf5bca13/lib/resources/submissions.js#L431

Alternatively it could be an env variable configuration.
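Whichever route is chosen, the value would need validating against the AWS limits quoted earlier. As a language-neutral illustration only (Central's backend is JavaScript, so this is not a patch), here is a Python sketch of that logic. The environment variable name `S3_PRESIGN_EXPIRY_SECONDS` and the function `resolve_expiry` are hypothetical; nothing like them exists in Central today as far as I know.

```python
import os

MIN_EXPIRY_SECONDS = 1
MAX_EXPIRY_SECONDS = 604800  # seven days, the AWS maximum for pre-signed URLs
DEFAULT_EXPIRY_SECONDS = 60  # the value currently hardcoded in Central


def resolve_expiry(requested=None, env=os.environ):
    """Pick the expiry: a per-request value wins, then the env variable, then the default."""
    # S3_PRESIGN_EXPIRY_SECONDS is a made-up name for illustration.
    raw = requested if requested is not None else env.get("S3_PRESIGN_EXPIRY_SECONDS")
    if raw is None:
        return DEFAULT_EXPIRY_SECONDS
    seconds = int(raw)  # raises ValueError for non-numeric input
    if not MIN_EXPIRY_SECONDS <= seconds <= MAX_EXPIRY_SECONDS:
        raise ValueError(
            f"expiry must be between {MIN_EXPIRY_SECONDS} and {MAX_EXPIRY_SECONDS} seconds"
        )
    return seconds
```

Rejecting out-of-range values, rather than silently clamping them, seems the safer choice here, since a caller who asks for 30 days should find out that they will not get it.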

I could easily PR for this. No idea if it's a desirable change for the dev team though :pray:

Yes! But I believe that if your bucket is public, all of the query parameters are superfluous and could be omitted; Central just doesn't have special handling for that case. In particular, the expiration time period is not relevant.
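If that holds for your storage provider, a client could derive a long-lived link by dropping the query string from the URL Central returns. A minimal Python sketch (the function name is my own, and this only makes sense when the bucket really does allow public reads):

```python
from urllib.parse import urlsplit, urlunsplit


def public_object_url(presigned):
    """Strip the signing query parameters, leaving the plain object URL."""
    parts = urlsplit(presigned)
    # Keep scheme, host and object path; drop the query string and any fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

Note that this also drops response-content-disposition and response-content-type, so the object would be served with whatever metadata is stored on it in the bucket rather than with the original filename.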

@punkch are you expecting to use a private bucket?

The docs recommend using a private bucket, and I imagine this is the main use case, no?

I had quite a unique case to use a public bucket & was also just being lazy :laughing:

Ah yes, got it! I didn't read your original post carefully enough to see you were sharing info mostly for the private bucket context. Yes, we do expect that to be the common path.

We'll give some thought to if and how it makes sense to configure signed link expiration and get back to you. The maximum possible time is 7 days.

Awesome! That could be a useful feature. Although, I personally have no pressing need for it right now, so it might be worth seeing if many others would use it.

(saying that, it's quite a minor change)
