Spec proposal: naming media files

Media files taken by users (like images/videos/audios) are saved in ODK Collect with names that represent the timestamp (in milliseconds) when they were gathered (e.g. 1660920606988.jpg). For many users, this is a problem because they can't easily link files named in this way with data they were taken for (see: Naming pictures based on values from the filled form)

We would like to allow form designers to specify what file names should look like. To do that, we want to introduce a new body attribute named filename.

XLSForm

| type | name | label | calculation | filename |
| --- | --- | --- | --- | --- |
| text | firstName | First name | | |
| text | lastName | Last name | | |
| calculate | calcFileName | | concat( ${firstName} , ${lastName}) | |
| image | photo | Please attach your photo | | ${calcFileName} |

XForm

<bind nodeset="/data/firstName" type="string" />
<bind nodeset="/data/lastName" type="string" />
<bind nodeset="/data/calcFileName" type="string" calculate="concat( /data/firstName ,  /data/lastName)"/>
<bind nodeset="/data/photo" type="binary" />
…
<upload mediatype="image/*" filename="/data/calcFileName" ref="/data/photo">
	<label>Please attach your photo</label>
</upload>

Assuming that a user passes John as their first name and Smith as their last name, images will be saved with names like JohnSmith.jpg.

In order to reduce the risk of duplicates, we are thinking about the following options:

  • always add a suffix with timestamp (e.g. JohnSmith_1660920606988.jpg)
  • always add a suffix and add a second attribute that lets users explicitly opt out of the suffix (e.g. filenameNoSuffix="true")
  • do nothing and let users take full responsibility to avoid conflicts (e.g. by adding the timestamp deliberately concat( ${firstName} , ${lastName}, now()))
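
To make the first two options concrete, here is a minimal sketch in Python (not Collect code). The function name and the add_suffix flag are made up for illustration; add_suffix only stands in for the hypothetical filenameNoSuffix attribute.

```python
import time


def build_media_filename(base, extension, add_suffix=True, now_ms=None):
    """Return '<base>_<timestamp>.<extension>', or '<base>.<extension>' on opt-out.

    add_suffix is a stand-in for the hypothetical opt-out attribute; it is
    not an existing ODK setting.
    """
    if not add_suffix:
        return f"{base}.{extension}"
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # milliseconds since the epoch
    return f"{base}_{now_ms}.{extension}"


print(build_media_filename("JohnSmith", "jpg", now_ms=1660920606988))
print(build_media_filename("JohnSmith", "jpg", add_suffix=False))
```

The third option is simply the opt-out branch, with the form designer adding now() to the expression themselves.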

Another decision we need to make is when to evaluate such an expression. The options are:

  • evaluate at image save time. This is the easiest way, but it also means the name could get out of sync with the form. For example, if a user takes a photo with the answers John Smith and then changes the first name from John to Adam, the file name won't be updated to AdamSmith.jpg (it will remain JohnSmith.jpg)
  • evaluate and rename at finalization/submission time. That means once the forms are saved/submitted, we make sure that attached files have proper names.
  • evaluate as dependent values change. This seems to be the best option because file names would always be updated immediately, but it's also the most difficult option to implement for us.
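
A rough sketch of the second option (rename at finalization), written in Python rather than in Collect's own code. Everything here is an assumption for illustration: the attachments mapping, the name_for callable standing in for evaluation of the filename expression, and the absence of any collision handling.

```python
import os
import tempfile


def rename_at_finalization(media_dir, attachments, answers, name_for):
    """Rename each attached file to the name computed from the final answers.

    attachments: dict of question name -> current file name on disk
    name_for:    callable(answers) -> desired base name (hypothetical stand-in
                 for evaluating the filename expression)
    Returns the updated question -> file name mapping.
    Note: no collision handling; two questions with the same computed name
    would clash.
    """
    updated = {}
    for question, current in attachments.items():
        extension = os.path.splitext(current)[1]
        desired = name_for(answers) + extension
        if desired != current:
            os.rename(os.path.join(media_dir, current),
                      os.path.join(media_dir, desired))
        updated[question] = desired
    return updated


with tempfile.TemporaryDirectory() as media_dir:
    # The photo was saved while the first name was still "John".
    open(os.path.join(media_dir, "JohnSmith.jpg"), "wb").close()
    answers = {"firstName": "Adam", "lastName": "Smith"}
    result = rename_at_finalization(
        media_dir,
        {"photo": "JohnSmith.jpg"},
        answers,
        lambda a: a["firstName"] + a["lastName"],
    )
    final_files = sorted(os.listdir(media_dir))
    print(result, final_files)
```

The returned mapping is what would need to be written back into the submission so that the stored filename matches the file on disk.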

I know that a lot of you have been waiting for this feature, so please help us implement it by sharing your thoughts.

I think these are very reasonable options.

Add a uuid() suffix to names

I'd vote to add a uuid() suffix as the first implementation and wait for feedback. If people want that uuid to be optional, then we can add that as an attribute (e.g., allowDuplicateNames=true).

I prefer uuid() instead of now() because it means servers don't have to handle duplicates on export.
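
A small Python illustration of the difference, assuming two photos with the same user-defined prefix are saved in the same millisecond. The helper names are invented for the example.

```python
import uuid


def with_timestamp_suffix(base, now_ms):
    # Same prefix + same millisecond gives the same name.
    return f"{base}_{now_ms}.jpg"


def with_uuid_suffix(base):
    # uuid4() is random, so the clock does not matter.
    return f"{base}_{uuid.uuid4()}.jpg"


same_ms = 1660920606988
ts_names = {with_timestamp_suffix("JohnSmith", same_ms) for _ in range(2)}
uuid_names = {with_uuid_suffix("JohnSmith") for _ in range(2)}
print(len(ts_names), len(uuid_names))
```

The set of timestamp names collapses to a single entry, while the uuid names stay distinct.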

Evaluate name at submission time

One reason users want the file naming feature is to easily link data in the form to the media in a downstream tool or process.

It'd be very unexpected to edit a person's name field (before submission or in Central) and have a totally different name on the picture associated with that person when you export. For that reason, I think the bare minimum is to set the name at submission time.

Updating the name as dependent values change could be neat, but I can't think of a good use-case that requires that. Maybe if we use that name in the UI, it's helpful for debugging?

Anyway, if we come up with a good use-case for dynamic updates, we can always change the implementation and we won't break backwards compatibility.

Other questions

  • Is there a reason why you want to add another column and not make this a parameter? My guess is because there's no easy way to evaluate an expression in the parameter column?
  • How do you handle the case of repeats? I guess you could suggest users to include position() in the name? Or do it for them?
  • Does Android have filename size limits?

Thanks for this thread!

One question about using uuid (or even now()) as a suffix on the name... Would you mean that the name of the image might be: imagename_loooooooong_uuid.jpg (for example)?

Right now, the path to each image, when I download and store on my computer (from kobotoolbox), is a folder with my username, then a folder with another project ID code, then folders within that which are the uuids, then images within each uuid folder (named according to the current methodology).

Right now, it makes for long file path. If we then also included uuid in the image name, this would further extend the path name.

I guess this is also a consideration for user specified filenames also... Would there be a length limit?

I'm not sure if others also ever deal with this, but I often find that storing images from ODK on my computer often results in the error of having a file path that's too long, which is also a result of where I'm storing those exports. But I often move my image folder export to my desktop to access the media/image files due to this. Would this new naming method further exacerbate this?

I don't think this is much of an issue when storing files on a cloud server, but mainly just when I'm dealing with the files on my own system. And also, not a huge issue overall, just thought I'd mention it to clarify!

Thanks again!
Janna

Hi,

This is something that was requested recently by one of our clients, but they wanted to know from the image file what the image was (flower, leaf, or whole plant, as they capture 3 images). We solved this at submission time by replacing the image name with the name of the variable.

I think the problem of renaming the image at collection time is the possibility of having the same file name in the device or as @yanokwa mentioned (even if it's very unexpected) if the file name is generated by another value and such value is edited before submission.

I think that if the problem is linking media with data (e.g., in a media export), then such linking should be made by the platform collecting submissions. For example, a media export from FormShare is a zip file structured in a way that images are inside directories where each directory is the submission ID or the primary key selected by the user (e.g., Farmer ID).

Do we really need this renaming as part of the ODK specification? Or is this something that you can resolve at submission time in Central?

Could we show the value of the calculate (which will update as values change) in the widget and then actually rename the file and write the filename to the submission at finalization time? That way as far as the user sees the filename is consistent with the form even if it's a bit of an illusion. Having thought about it more I think evaluating at image save time is unacceptable for the reasons @yanokwa mentions.

Seems it could be a parameter and it could be enforced to be a reference (e.g. filename=${foo} is ok but filename=foo is not).

I don't think this is an issue unless a user tries to use a value from outside the repeat. One thing we could do is show an error on attempt to save a duplicate filename at finalization time.
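
The duplicate check at finalization could look roughly like this in Python. Comparing names case-insensitively is my own assumption (some file systems treat Photo.jpg and photo.jpg as the same file); it is not something decided in this thread.

```python
from collections import Counter


def find_duplicate_filenames(filenames):
    """Return the names that occur more than once, compared case-insensitively."""
    counts = Counter(name.lower() for name in filenames)
    return sorted(name for name, n in counts.items() if n > 1)


# Two repeat instances that ended up with the same computed name.
names = ["visit_photo.jpg", "Visit_Photo.jpg", "visit_photo_2.jpg"]
duplicates = find_duplicate_filenames(names)
print(duplicates)
```

If the returned list is not empty, the client could block finalization and show the clashing names.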

That's something to discuss with Kobo. Windows has short path limits and that's one of the reasons ODK servers export media to a flat structure. It also makes analysis somewhat easier depending on the tool. It's also why filenames are currently system-determined and use a likely globally unique value that's not too long. This has its own challenges -- massive projects or projects with very specific time constraints are likely to have multiple pictures taken in the same millisecond (and we have seen it). This desire to have a flat export is why we're considering filename uniqueness not just for a single client but within an entire form submission set.

I think it's most helpful to be able to configure this as part of form design and in a portable way. Doing the rename at the client side means the raw submissions have the filename that matches the files themselves. This is in line with one of our design principles: keep data as raw and consistent as possible throughout the system. Additionally, there are many servers that use ODK Collect and Enketo. Making this functionality available in clients makes it accessible to the broadest user base in a consistent, predictable way.

So you think uuid() is safer in this case? I'm not sure... Using now() you can end up with duplicates only if two files are taken at exactly the same time, and the prefix (defined by a user) would also need to be the same. uuid() is very unlikely to return the same value when called at the same time or even during the same day, but in a big project that collects data over a long period of time it might happen, right?

We might end up with an outdated file name displayed in the hierarchy of questions. It would be better to always display proper values, but I'm also not sure if renaming files every time related fields change is a good idea, because that might require performing many operations like that while filling a form. Alternatively, we could always keep the updated name in media questions (for example, to display proper values) but rename the media files only at finalization time. This would basically be the same as what @LN suggested, but I think this could be added later, and the first implementation can stay simple by ignoring updates like that.

It doesn't matter if this is a parameter vs a new column in terms of evaluating. As long as it's converted to a body attribute in xml it will look the same. A better question here would be whether there is any difference in complexity with evaluating bind vs body attributes. I discussed it with @LN and she said that probably not.
So why did I choose a separate column? No big reason. At first I thought that this new attribute would hold the logic itself instead of referring to a separate calculate question, but now it seems easier to use the second option. Additionally, some body attributes have their own columns and others are defined through the single parameters column, so we don't have one approach and it's not clear when we should use the first vs the second option.

I don't think we should do anything special here. Users should use values from those repeats, which should be unique, plus they will have that now()/uuid() suffix. We can just advise them to be careful in the docs.

Yes I think it's 127 chars.

Now that I've dug deeper, I understand that ending up with two equal uuids is almost impossible, plus we will have the prefix too.
Attaching two files at exactly the same time has always seemed impossible to me as well, but now that I think about it, it's probably more likely, especially in big projects where many enumerators are working at the same time. So I agree that uuid might be a better solution here.

The only thing to consider is its length... uuid returns a string that consists of 36 chars. It's quite long, so maybe we could reduce it to 8, for example? Why 8? Because 8 chars drawn from an alphabet of 36 (a-z plus 0-9) would give 36^8 = 2,821,109,907,456 unique ids, which I think is big enough to avoid conflicts.
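
A quick back-of-the-envelope check of those numbers in Python, using the usual birthday-bound approximation. One caveat that I am adding here: a standard uuid string contains only hex digits (0-9, a-f) and hyphens, so simply cutting it to its first 8 characters gives 16^8 values, not 36^8. The figure of 100,000 files is an arbitrary example, not a number from any real project.

```python
import math


def id_space(alphabet_size, length):
    """Number of distinct ids of the given length over the given alphabet."""
    return alphabet_size ** length


def collision_probability(n_files, space):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n_files * (n_files - 1) / (2 * space))


space_36 = id_space(36, 8)   # 8 chars from a-z + 0-9
space_hex = id_space(16, 8)  # first 8 hex chars of a uuid

p_36 = collision_probability(100_000, space_36)
p_hex = collision_probability(100_000, space_hex)
print(space_36, space_hex)
print(p_36, p_hex)
```

These probabilities are for the suffix alone, across all files sharing it as their only distinguishing part; a user-defined prefix reduces the real risk further.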

Hi @Grzesiek2010, this would be pretty nice indeed! At the moment I am using additional processing to store media files in separate directories (one directory by record) to facilitate their review/analysis. I also realise now that the more data collectors you have the higher the likelihood of getting two media created exactly at the same timestamp.

I would actually favour a uuid() prefix with a now() suffix, mostly because my understanding is that the primary need for end-users would be to easily (visually) see what media are associated with a specific record in the ODK database, while ensuring uniqueness. About repeats, you should be able to follow the same strategy as the timestamp cannot possibly be the same.

0a1de6e1-8b7c-4443-9ca1-a7ef557e3991_1662027931.jpg
0a1de6e1-8b7c-4443-9ca1-a7ef557e3991_1662027945.jpg
308e2ddb-571e-43c5-a607-c8048c2259ea_1630495591.jpg

About using a customized prefix based on names or any other identifiers, my own experience is that you always have data entry errors that can lead to duplicates in large data collection (e.g. we experienced that data collectors can scan the same QR code twice...). This may be really time-consuming to correct manually and may quickly lead to ambiguities and conflicts about which individual record you are referring to when using any identifier other than the uuid(). Concatenating any prefix with a timestamp would ensure unique filenames but would not reduce the ambiguity about what record is associated with what media (at least without referring to the database). Also, in my current use case, we re-enroll the same participant several times over the course of our data collection, so I would actually expect duplicated names (but unique uuids) in my databases. In addition, although name combinations have a good likelihood of being unique when using three names, this is much less true when using two names only, which could then lead to pretty lengthy filenames to ensure uniqueness.

I understand the time of evaluation is an issue in the case of filenames concatenated from variables that are modified at a later stage, which for me would be another reason to favour a combination of uuid() and now(), as this will not be modified later. Otherwise I think there is a risk of creating even more confusion for end-users if data entries have to be modified at a later stage (even just to correct typos), and I agree with LN that consistency is essential.

I wish you were right, but actually this is much less unlikely than you would think :sweat_smile: - I have had several cases where participant names were corrected at a later point of the form filling process (also considering names can be a fluid concept in certain contexts).

Thanks @Thalie for sharing your thoughts!

Our aim is to find a good balance between file names that will be unique and relatively short and meaningful. That's why by default I thought about adding a suffix with a few random numbers. There are different projects in terms of size and needs. In case of huge projects we want to allow doing things like you described (uuid + timestamp or anything else) and for very small projects we also want to allow disabling any suffixes but not by default. It would be a decision people responsible for creating forms would need to make.

Yes this seems to be important. If we decided to implement this feature without a mechanism for keeping those file names updated it could do more harm than good in some cases.

Can you perhaps say, in a little more detail, how precisely this (new?) definition for the filename attribute will differ from the XForms spec definition? I understand it'll now be referencing a binding (instead/) - which I would conclude could be an element node containing a (static) string with the desired filename, or the result of a calculation binding (or ?...)

There appears to be some overlap here, but I can't quite determine what it is. :thinking:

Yes that new filename attribute will hold a reference to another node (previous question or calculation most likely because that's what we think users need). It will also be possible to pass a static value.
The one described in the link you provided seems to work with static values only. Apart from that it will look pretty much the same.

Hi All,

Apologies for this if it makes no coherent sense...

Whilst I understand the popular use case and desire for these features, I am concerned that implementation has the potential to compromise data integrity of data downloaded from the server to a local workstation. I'll preface this by saying that some good guidelines on how to use this effectively can probably solve a lot of the issues, so I may be worrying about nothing.

Whilst we will no doubt implement something that only keeps the correct, most recent file on the server (i.e. by keeping the link to the most recent photo), people will get duplicate photos on their local hard drives if they're appending new files to an existing media folder. Lots of these issues are not a thing to worry about if each download is to a fresh folder, but if photos from previous downloads have been used elsewhere (like in google photos or similar) then we have an issue.

Imagine this scenario where X6S9O9 is the hash/uuid attached to any version of all photos taken of me (in a single field on the XLS form)

A photo is taken and the new feature labels it as follows

Cristy_Roberts_X6S9O9.jpg

Then there's an edit because my name was wrong

Kristy_Roberts_X6S9O9.jpg

Downloading the data gives me a local folder with the newer file

media/Kristy_Roberts_X6S9O9.jpg

But then I take a new photo because the first was out of focus, and also do another edit because it is still wrong

Chrissy_Roberts_X6S9O9.jpg

and then download.

My new media folder now looks like this

media/Christy_Roberts_X6S9O9.jpg
media/Kristy_Roberts_X6S9O9.jpg

Then I share the data with my friend. They have no idea which one is the most recent unless they fish around in the EXIF data, so they use the first one, which is out of focus. Assuming that there are also some names that start with D, E, F, G, H, I and J in the data set, the two files won't sort together lexicographically, so the second will likely be missed in any case.

media/Christy_Roberts_X6S9O9.jpg
media/Crusty_Robards_XK39DJ.jpg
media/Dev_Jamme_TW98OW.jpg
media/Dilys_Barnards_P0292K1.jpg
media/Eliot_Dillards_X6S8O9.jpg
media/Edwina_Currey_PL1KD9.jpg
media/Fredwina_Curtley_JSNAM1.jpg
media/Gustave_Rombards_B92KN1.jpg
media/Kirsty_Remmilard_L1K2MA.jpg
media/Kristy_Roberts_X6S9O9.jpg
media/Kristoffer_Mumbarlard_X87J1B.jpg
media/Krusty_Roberds_X6S9O9.jpg
media/Kyllian_Zemenides_XS92L1.jpg

Yes, I had a lot of fun making up those names, but did you spot the other duplicate that I snuck in there or were you too busy looking for Kristy and Chrissy?

So should we then put the hash first? It would then be easier to see multiple files for one person but hard to find the person because they'd no longer be in alphabetical order.

media/X6S9O9_Christy_Roberts.jpg
media/X6S9O9_Kristy_Roberts.jpg
media/X6S9O9_Krusty_Roberds.jpg

If I gave my friend the folder of photos and no data set, the only way they would know that Kristy is the same person as Chrissy is by looking at the hash (which is random, so hard to read), or by having insight into the data. With a more common name than mine, you'd hit the problem of confusing this further with people who have the same name but different hashes.

media/Chrissy_Roberts_2022-08-09_X6S9O9.jpg
media/Kristy_Roberts_2022-05-09_X6S9O9.jpg
media/Kristy_Roberts_2021-05-09_X2LJ1A.jpg
media/Kristy_Roberts_2020-01-14_PI14JX.jpg
media/Krusty_Roberds_2022-05-09_X6S9O9.jpg

If we used timestamps instead of hashes, it would be marginally less problematic with regard to which is the most recent file, but the current timestamp format is epoch time, which is no use to anyone (who wants this kind of simple to use feature) but is useful to the lexicographic view of the computer. If we're going this way, then a YYYY-MM-DD ISO format (also lexicographic but still meaningful to humans) should be an option, as the whole point is to make it easier for the user to understand what the photo is by looking at the filename.

media/Chrissy_Roberts_2022-08-09_X6S9O9.jpg
media/Kristy_Roberts_2022-05-09_X6S9O9.jpg
media/Krusty_Roberds_2022-05-09_X6S9O9.jpg

Which helps with dates, but not with the naming and sorting issues unless this way around

media/X6S9O9_Christy_Roberts_2022-08-09.jpg
media/X6S9O9_Kristy_Roberts_2022-05-09.jpg
media/X6S9O9_Krusty_Roberds_2022-05-09.jpg
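
On the ISO date idea above, this is roughly what turning the current epoch-millisecond value into a YYYY-MM-DD part would look like in Python. Using UTC is an assumption on my part; a real implementation might well prefer the device's local time zone, which can shift the date. The function names are illustrative only.

```python
from datetime import datetime, timezone


def iso_date_from_epoch_ms(epoch_ms):
    """Format a millisecond epoch timestamp as YYYY-MM-DD in UTC."""
    moment = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def dated_filename(base, epoch_ms, short_id, extension="jpg"):
    """Build '<base>_<YYYY-MM-DD>_<short_id>.<extension>'."""
    return f"{base}_{iso_date_from_epoch_ms(epoch_ms)}_{short_id}.{extension}"


# 1660920606988 is the timestamp from the example file name at the top of the thread.
example = dated_filename("Kristy_Roberts", 1660920606988, "X6S9O9")
print(example)
```

Because the date part is zero-padded and ordered year-month-day, files with the same prefix still sort chronologically.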

What if we had two photos from different fields? UUID would be useful to link these, but hashes not so much. How do we make it easy to find all photos from one person? Add more fields?

For me, the only really useful way to use this feature is to organise by multiple fields.
Here, I have some photos that are organised by UK county, town, postcode district, surname, first name, date and hash. That's super useful for getting a really nicely organised folder of photos and I suspect is really the kind of way many people will end up using this (i.e. in household surveys, wildlife monitoring, entity based stuff etc). Having a hierarchical organisation for the name makes it highly searchable.

media/Sussex_Burgess_Hill_RH15_P0011_Wolfeschlegelsteinhausenbergerdorff_Hubert_2022-08-09_X6S9O9.jpg

You can also add in something to differentiate photos from multiple fields like

media/Sussex_Brighton_RH1_P0011_Smith_Harry_Face_2022-01-13_NHA9OK.jpg
media/Sussex_Brighton_RH1_P0011_Smith_Harry_Hand_2022-01-13_ASM8U2.jpg
media/Sussex_Brighton_RH1_P0011_Smith_Harry_House_2022-01-13_NKH4H1.jpg

but the bigger you go, the worse the issue for long file paths. PC users doing this may have to consider that there's a default maximum of (I think) about 260 characters in a path to a file on Windows.

In the above example with the very long surname, we reach close to 100 characters just for the file name, so burying the folder too deep in a file tree could cause problems, though I expect this is not really an issue with Central, where downloads are in a level 1 subdirectory and power users will be able to set the working directory to escape the issue. Briefcase users may find this more problematic as it is very nested.
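
For anyone who wants to check an export location before downloading, a minimal Python sketch follows. The 260 figure is the classic Windows MAX_PATH value, which counts the terminating null character; the function name and the example directories are made up.

```python
WINDOWS_MAX_PATH = 260  # classic Win32 MAX_PATH, includes the terminating null


def exceeds_max_path(export_dir, filename, limit=WINDOWS_MAX_PATH):
    """Return True if '<export_dir>\\<filename>' would not fit in the limit."""
    full_path = export_dir.rstrip("\\/") + "\\" + filename
    return len(full_path) + 1 > limit  # +1 for the terminating null


shallow = exceeds_max_path("C:\\exports", "a.jpg")
deep = exceeds_max_path("C:\\" + "d" * 250, "photo.jpg")
print(shallow, deep)
```

This only checks the length; newer Windows versions can be configured to allow longer paths, so treat it as a conservative warning rather than a hard rule.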

Finally, there's an extension of the concept, where we could specify folders as well as filenames.

In my example I would want to specify the following file tree to organise my photos

media/county/town/postcode_district/household/

so my files would be like this

media/Somerset/Williton/TA4/P0099/Ollerton_Gustav_2022-09-31_LSA2KN.jpg
media/Sussex/Brighton/BN1/P0009/West_Kanye_2022-08-03_LKNM12.jpg
media/Sussex/Brighton/BN1/P0009/West_Lizzo_2022-08-03_67SAK2.jpg
media/Sussex/Brighton/BN2/P0015/Bojang_Muhammed_2021-02-12_MNM23S.jpg
media/Sussex/Brighton/BN3/P0027/West_Lizzo_2022-08-03_67SAK2.jpg
media/Sussex/Burgess_Hill/RH15/P0011/Iqbal_Saaida_2022-08-09_JNA872.jpg
media/Sussex/Burgess_Hill/RH15/P0011/Smith_Bridget_2022-08-09_KJL98S.jpg
media/Sussex/Burgess_Hill/RH15/P0011/Smith_Eliot_2022-08-09_J2JN42.jpg
media/Sussex/Burgess_Hill/RH15/P0011/Smith_John_2022-08-09_X6S9O9.jpg
media/Sussex/Burgess_Hill/RH12/P0012/Lee_Kim_2022-08-10_A8J79D.jpg
media/Sussex/Burgess_Hill/RH12/P0012/Lee_Sang_2022-08-10_LKJLAS.jpg

This is far more useful to me than including all the extra fields in the filename.
Admittedly it means that the integrity of the system only remains until you change the folder names. But that's true of renaming files, which brings me full circle to my original concern about orphaned files.
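
A sketch of how such a tree could be derived from form values, in Python. The field names (county, town and so on) come from my example above, while the helper names and the sanitizing rule are assumptions for illustration only.

```python
import re
from pathlib import PurePosixPath


def sanitize(value):
    """Replace runs of anything other than word characters, dots or hyphens with '_'."""
    return re.sub(r"[^\w.-]+", "_", str(value)).strip("._")


def media_path(record, folder_fields, name_fields, extension="jpg"):
    """Build media/<folder fields...>/<name fields joined by '_'>.<extension>."""
    folders = [sanitize(record[field]) for field in folder_fields]
    name = "_".join(sanitize(record[field]) for field in name_fields)
    return PurePosixPath("media", *folders, f"{name}.{extension}")


record = {
    "county": "Sussex", "town": "Burgess Hill", "district": "RH15",
    "household": "P0011", "surname": "Smith", "first": "John",
    "date": "2022-08-09", "hash": "X6S9O9",
}
path = media_path(
    record,
    ["county", "town", "district", "household"],
    ["surname", "first", "date", "hash"],
)
print(path)
```

Sanitizing each component matters more here than for flat names, because a value containing a slash would otherwise create an unintended extra folder level.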

I really hope that this makes sense and is useful to the team when thinking further about this feature.

Chrissy

Thanks to everyone who has engaged with this conversation.

If I understand correctly, your concerns are around keeping a local mirror of submissions that may have been edited. I would consider both of those to be advanced and mostly relevant to larger scale projects. A lot of projects collect all of their data and then do a single data export when the data is stable. A lot of projects also do without edits from Central.

I think the problem you describe is more general: keeping a local mirror of the data without downloading a complete new copy is hard to get right when there are edits. The same problem exists aside from this feature, I think. If you’re not really careful about how you’re doing your progressive update, you’re likely to end up with either only the old version of an edited submission or both the old and the new. If your integrity checks aren’t complete, you could also be missing some submission attachments or have the wrong version of them. I agree this is a data integrity issue and that most people should use exports or OData connectors to ensure they have the latest data.

I do think this feature could make it more apparent when a data update pipeline is too naive and for example doesn’t clean up obsolete submissions.

The system you described that downloads new submission attachments without cleaning up old ones would lead to extra media files with today’s filenames as well. If I understand correctly, you’re saying that the proposed functionality changes things because folks would be more likely to share just media files without the form data if the filenames were useful. I think that’s true. And indeed if they do progressive local updates and form edits and don’t clean up stale media they would be in a weird state. I’m not convinced yet that this is a reason not to provide the functionality.
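
The cleanup step that such a progressive update would need can be sketched in a few lines of Python. This assumes the pipeline already knows the list of attachment names referenced by the current submissions; how it gets that list is out of scope here, and the function name is hypothetical.

```python
def find_stale_media(local_files, current_attachments):
    """Return local files that no current submission references."""
    return sorted(set(local_files) - set(current_attachments))


# Local folder after two downloads, with an edit in between.
local = [
    "Kristy_Roberts_X6S9O9.jpg",
    "Chrissy_Roberts_X6S9O9.jpg",
    "Dev_Jamme_TW98OW.jpg",
]
# Attachment names referenced by the latest version of each submission.
current = ["Chrissy_Roberts_X6S9O9.jpg", "Dev_Jamme_TW98OW.jpg"]

stale = find_stale_media(local, current)
print(stale)
```

The same check applies with today's timestamp-based names; user-defined names just make the leftovers more visible.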

I think this is the best scenario for incorporating a UUID (however it ends up being incorporated) because it allows for enough variation to eliminate duplicates and also removes unnecessarily long names that are not meaningful. I've seen others do this for short URLs intended to be used in QR codes for scientific names of plants, to make the QR codes easier to read. If a timestamp were concatenated with the UUID, the UUID could be shortened to fewer than 8 characters, if total length is the issue, without producing duplicates.

I'm not sure where this feature ended up and if it is still being developed, but this is definitely something I'm interested in and would like to see happen. Thanks for all your work!

I also agree with this. Just to add on, a combination of media name, collector identifier, and a shortened UUID should be good enough. I would like to see the filename customizable based on user needs.

Has this feature been realized already?