Problems with image filename when downloading images with ruODK

1. What is the issue? Please be detailed.
I am using ruODK and odata_submission_get in R.

All the images appear to get downloaded, but the associated filenames in the datasets are sometimes present and sometimes missing.

I can't figure out why the filename, e.g. "1709121722411.jpg", is sometimes present and other times absent in the column "x_g8_x_g10_x_user_approval_photo" below. The problem comes when renaming the images. I use a loop to rename images based on other collected data using the filename, but it cannot do so when the filename is missing from the dataset even though the downloaded image file exists.

I know the image is downloaded for a given instance ID because when I open the image, its contents show an ID number, which I can then search for in a column of the downloaded dataset to find the corresponding instance ID. I can also convert the downloaded image's filename to a date-time to find the associated row in the dataset.
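
To illustrate the filename-to-date-time conversion mentioned above, here is a minimal sketch. It assumes that the numeric part of the filename is a Unix timestamp in milliseconds, which is an assumption rather than something confirmed in this thread; `filename_to_datetime()` is a hypothetical helper, not part of ruODK.

```r
# Hypothetical helper: turn "1709121722411.jpg" into a date-time.
# Assumption: the filename stem is Unix time in milliseconds.
filename_to_datetime <- function(filename, tz = "UTC") {
  stem <- tools::file_path_sans_ext(basename(filename))
  # Non-numeric filenames become NA instead of raising an error.
  millis <- suppressWarnings(as.numeric(stem))
  as.POSIXct(millis / 1000, origin = "1970-01-01", tz = tz)
}

stamps <- filename_to_datetime(c("1709121722411.jpg", "1726484487534.jpg"))
format(stamps, "%Y-%m-%d %H:%M:%S")
```

The resulting date-times could then be compared against the start or end timestamps of each submission to narrow down candidate rows.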

Any help would be great.

Thank you,
Charlie.

2. What steps can we take to reproduce this issue?
ODK Central v2024.2.1

data <- ruODK::odata_submission_get(table = frm_tbl$url[1], download = TRUE, local_dir = dwn_dir) 
data_sub1 <- ruODK::odata_submission_get(table = frm_tbl$url[2], download = TRUE, local_dir = dwn_dir)
sub_data <- ruODK::odata_submission_get(table = frm_tbl$url[1], download = TRUE, local_dir = dwn_dir, parse = FALSE) %>%
  ruODK::odata_submission_rectangle()
sub_data %>%
  filter(x_user_approval == "active" & is.na(x_g8_x_g10_x_active_consent_image)) %>%
  select(system_attachments_present, system_attachments_expected, meta_instance_id, x_g8_x_g10_x_active_consent_image)

r$> sub_data %>%
      filter(x_user_approval == "active" & is.na(x_g8_x_g10_x_user_approval_photo)) %>%
      select(system_attachments_present, system_attachments_expected, meta_instance_id, x_g8_x_g10_x_user_approval_photo)
filter: removed 8,006 rows (96%), 369 rows remaining
select: dropped 47 variables (x_start, x_end, x_today, x_deviceid, x_date, …)
# A tibble: 369 × 4
   system_attachments_present system_attachments_expected meta_instance_id                          x_g8_x_g10_x_user_approval_photo
                        <int>                       <int> <chr>                                     <chr>                            
 1                          0                           0 uuid:be28d31f-c224-4f90-8a1d-8b233cea8997 NA                               
 2                          1                           1 uuid:8031f4b3-d4ae-4a45-958f-73015faacab3 NA                               
 3                          1                           1 uuid:0b54c64c-ad02-4f4f-83c7-e28ccff323f2 NA                               
 4                          1                           1 uuid:f5dde949-7c51-4685-8501-d74acaa88ccb NA                               
 5                          1                           1 uuid:12612ced-1867-4f57-9006-fb0a703b6abd NA                               
 6                          1                           1 uuid:9441a954-cbf5-4e83-aacf-e3f8c9b07238 NA                               
 7                          1                           1 uuid:37556a60-f186-41a3-8453-b711790c8db9 NA                               
 8                          1                           1 uuid:c3fd6e7d-f948-4a25-aff4-a7b637e68e51 NA                               
 9                          1                           1 uuid:98d03c75-03db-4a54-9e0f-607a3dd1e36d NA                               
10                          1                           1 uuid:bdcbaf11-a5d2-4082-8148-ebe1ff375169 NA



r$> sub_data %>%
      filter(x_user_approval == "active" & !is.na(x_g8_x_g10_x_user_approval_photo)) %>%
      select(system_attachments_present, system_attachments_expected, meta_instance_id, x_g8_x_g10_x_user_approval_photo)
filter: removed 6,644 rows (79%), 1,731 rows remaining
select: dropped 47 variables (x_start, x_end, x_today, x_deviceid, x_date, …)
# A tibble: 1,731 × 4
   system_attachments_present system_attachments_expected meta_instance_id                          x_g8_x_g10_x_user_approval_photo
                        <int>                       <int> <chr>                                     <chr>                            
 1                          2                           2 uuid:142e62aa-f1af-4e27-9c52-3cc618f7f64e 1726484487534.jpg                
 2                          2                           2 uuid:192969fc-c116-43d2-a9b5-5da1d82ee3cc 1726499896667.jpg                
 3                          2                           2 uuid:9a17ad87-073c-4494-9bf5-55a7e0f97add 1726499828637.jpg                
 4                          2                           2 uuid:100af8d7-3fb8-44d6-8279-d0e1a41b0d33 1726496007451.jpg                
 5                          2                           2 uuid:69c0c168-23ed-493c-a1bd-7779563e2de6 1726495871159.jpg                
 6                          2                           2 uuid:a8b72892-875a-4584-8b86-83d964704f36 1726408408018.jpg                
 7                          2                           2 uuid:5cb8c312-0fc2-404a-8688-9a4eff89e621 1726404226906.jpg                
 8                          2                           2 uuid:6046ed21-7eaf-4679-a443-913c9f13229d 1726320700107.jpg                
 9                          2                           2 uuid:b9ba2e67-393c-4f6f-8f2e-dd736b3023b5 1726320618011.jpg                
10                          2                           2 uuid:720c5246-94bf-4941-ad6e-4e8619a904cc 1726320556408.jpg
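
The two outputs above suggest a way to list the affected submissions: rows where Central reports more attachments than the table references by filename. The sketch below uses a made-up toy table in place of `sub_data`; the column names `photo_1` and `photo_2` are placeholders for the real image columns.

```r
# Toy stand-in for the table returned by odata_submission_get(); values are made up.
sub_toy <- data.frame(
  meta_instance_id = c("uuid:a", "uuid:b", "uuid:c"),
  system_attachments_present = c(0L, 1L, 2L),
  photo_1 = c(NA, NA, "1726484487534.jpg"),
  photo_2 = c(NA, NA, "1726484487999.jpg"),
  stringsAsFactors = FALSE
)
image_cols <- c("photo_1", "photo_2")  # placeholder image column names

# Count the filenames each row actually references.
sub_toy$n_referenced <- rowSums(!is.na(sub_toy[image_cols]))

# Rows where Central reports more attachments than the table references.
unexplained <- sub_toy[sub_toy$system_attachments_present > sub_toy$n_referenced, ]
unexplained$meta_instance_id
```

Rows with zero attachments and no filename (like the first toy row) are skipped, since there is nothing missing for them.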

sub_data_1 <- sub_data %>%
  filter(meta_instance_id == "uuid:a6e1a0b0-6744-469c-8d9f-890c6879275f")


r$> sub_data_1 <- sub_data_1 %>%
      ruODK::handle_ru_attachments(
        form_schema = data_form_schema,
        local_dir = dwn_dir
      )
Warning message:
There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `x_g2_x_g3_x_g5_x_hh_consent_image = (structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Caused by warning in `httr::RETRY("GET", src, httr::authenticate(un, pw), httr::write_disk(pth, overwrite = TRUE), times = retries, terminate_on = c(404)) %>%
    httr::warn_for_status(task = glue::glue("download media attachment {fn}.\n", "Troubleshooting tips:\n",
      "* Does the file resource {fn} exist? Run in a Terminal:\n", "  curl -ipu {un} {src} | cat\n",
      "* Is {fn} an expected attachment of this submission? Run:\n", "  curl -ipu {un} {stringr::str_replace(src, fn, \"\")}\n", ))`:
! Not Found (HTTP 404). Failed to download media attachment NA.
Troubleshooting tips:
* Does the file resource NA exist? Run in a Terminal:
  curl -ipu xxxx
* Is NA an expected attachment of this submission? Run:
  curl -ipu xxxx
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

r$> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidylog_1.1.0   lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.4     tidyr_1.3.1    
 [9] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.4     jsonlite_1.8.8   crayon_1.5.2     compiler_4.3.1   Rcpp_1.0.11      tidyselect_1.2.1 assertthat_0.2.1
 [8] snakecase_0.11.1 scales_1.3.0     semver_0.2.0     R6_2.5.1         generics_0.1.3   curl_5.2.0       ruODK_1.5.0     
[15] janitor_2.2.0    munsell_0.5.0    pillar_1.9.0     tzdb_0.4.0       rlang_1.1.2      utf8_1.2.4       stringi_1.8.3   
[22] fs_1.6.3         timechange_0.2.0 cli_3.6.2        withr_2.5.2      magrittr_2.0.3   grid_4.3.1       hms_1.1.3       
[29] clisymbols_1.2.0 lifecycle_1.0.4  vctrs_0.6.5      glue_1.6.2       fansi_1.0.6      colorspace_2.1-0 httr_1.4.7      
[36] tools_4.3.1      pkgconfig_2.0.3

Hi @CharlieKeyes,

thanks for the great bug report! I'm taking a look now.

Could you find out whether the image filenames are missing

  • on ODK Central - you can hack together the URL as https://{CENTRAL_URL}/#/projects/{PROJECT_ID}/forms/{FORM_ID}/submissions/uuid={INSTANCE_ID}
  • in the raw XML from odata_submission_get(parse=FALSE) - use listviewer::jsonedit(submission_unparsed) and filter to Submission ID
  • using the REST endpoint submission_get

Thanks!

Thank you Florian for the quick response. Indeed it appears that the image filenames are missing.
I tried with an instance ID with the image file name and without, see below.

With image

sl_with_image <- ruODK::submission_list() %>%
  filter(instance_id == "uuid:142e62aa-f1af-4e27-9c52-3cc618f7f64e")

sub <- ruODK::get_one_submission(sl_with_image$instance_id[[1]])
listviewer::jsonedit(sub)

Without image


sl_with_no_image <- ruODK::submission_list() %>%
  filter(instance_id == "uuid:be28d31f-c224-4f90-8a1d-8b233cea8997")

sub2 <- ruODK::get_one_submission(sl_with_no_image$instance_id[[1]]) 
listviewer::jsonedit(sub2)

Hi Charlie,
Have you tried checking whether any attachments are attached to a problematic instance (i.e. one with no filename entry in the table)? You could use the following commands:

x1 <- ruODK::submission_list() %>% filter(instance_id == "{problem_uuid}")
ruODK::get_one_submission_att_list(x1$instance_id)

hope that helps,
S

Hi,
Yes, that helps. For a given UUID with a missing filename it returns the following, but it still doesn't tell me which column each JPG belongs to.

I suspect this could be due to a problem with having the image capture nested within groups in the form?

Charlie.

subs <- ruODK::submission_list()
sub_1 <- subs %>%
  filter(instance_id =="uuid:e85e6297-db79-4d1f-a15d-484a5b3ff61e")
al <- ruODK::get_one_submission_att_list(sub_1$instance_id)
r$> al
# A tibble: 3 × 2
  name              exists
1 1684483925610.jpg TRUE  
2 1684484005742.jpg TRUE  
3 1684484060166.jpg TRUE
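
One way to narrow this down is to compare the attachment list against the filenames the table row does reference. This is a minimal sketch with toy data: `att_list` mirrors the output above, while `row_toy` and its column names (`x_photo_a`, `x_photo_b`) are hypothetical stand-ins for the matching row of the submissions table.

```r
# Attachment list as returned above (copied by hand for the sketch).
att_list <- data.frame(
  name = c("1684483925610.jpg", "1684484005742.jpg", "1684484060166.jpg"),
  exists = TRUE,
  stringsAsFactors = FALSE
)

# Hypothetical table row for the same submission; column names are placeholders.
row_toy <- data.frame(
  meta_instance_id = "uuid:e85e6297-db79-4d1f-a15d-484a5b3ff61e",
  x_photo_a = "1684483925610.jpg",
  x_photo_b = NA_character_,
  stringsAsFactors = FALSE
)
image_cols <- c("x_photo_a", "x_photo_b")

# Filenames the table knows about for this submission.
referenced <- unlist(row_toy[image_cols], use.names = FALSE)
referenced <- referenced[!is.na(referenced)]

# Attachments that exist on the server but are not referenced by any column.
orphaned <- setdiff(att_list$name, referenced)
orphaned
```

This only identifies which files lack a column; it cannot say which column they belong to, which is the open question here.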

Hi @CharlieKeyes,

We're one step further! And I have more questions for you - bear with my Monday brain :slight_smile:

I can see that instance ID uuid:142e62aa-f1af-4e27-9c52-3cc618f7f64e has a filename under x_samples_consent_image. Did that file download for you OK?

I can also see that instance ID uuid:be28d31f-c224-4f90-8a1d-8b233cea8997 does not have an image associated. I suspect that the enumerator did not take a photo when creating that submission. Could that be the case? ruODK or ODK Central might have omitted the empty image field (something worth investigating with known data - I'll take a look when I have more time). Is that missing filename here the unexpected behaviour?

You're saying the image filenames are missing but I'm unclear from where:

  • In ODK Central? Then there was no image captured in that submission.
  • In the data returned by ruODK::odata_submission_get()? Then I'd expect ruODK would also fail to download that image, as it uses the image filename to download the actual image file in a separate step.
  • In the downloaded folder? So there's a filename in the data from ruODK::odata_submission_get() but ruODK did not download an image?

Or is the question how to link already downloaded images to the instance ID and column in the data they're from?
In this case I'd approach the issue from the data side, not from the file side: For each image column, iterate through the rows, and when you find a non-empty filename you know that this file will exist in the download folder you specified in odata_submission_get(). With the knowledge of the form field and filename (from the data) and the download folder path you can then construct any system command, e.g. to move or to rename the file. In this approach you could skip any empty image filenames where enumerators have not submitted an image.
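
The data-side approach described above could be sketched roughly as follows. This is an illustration under assumptions, not ruODK functionality: `build_rename_plan()` is a hypothetical helper, `patient_id` and `x_consent_image` are placeholder column names, and the demo runs against a temporary folder rather than a real download directory.

```r
# Hypothetical helper: for each image column, pair every non-empty filename
# with a new name built from an identifier column and the field name.
build_rename_plan <- function(data, image_cols, id_col, dir) {
  plans <- lapply(image_cols, function(col) {
    keep <- !is.na(data[[col]]) & nzchar(data[[col]])
    if (!any(keep)) return(NULL)  # nothing captured in this column
    old <- data[[col]][keep]
    new <- paste0(data[[id_col]][keep], "_", col, ".", tools::file_ext(old))
    data.frame(from = file.path(dir, old), to = file.path(dir, new),
               stringsAsFactors = FALSE)
  })
  plans <- Filter(Negate(is.null), plans)
  if (length(plans) == 0) {
    return(data.frame(from = character(), to = character(),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, plans)
}

# Demo with toy data in a temporary folder.
dwn_dir <- file.path(tempdir(), "ruodk_rename_demo")
dir.create(dwn_dir, showWarnings = FALSE)
toy <- data.frame(
  patient_id = c("P001", "P002"),
  x_consent_image = c("1726484487534.jpg", NA),  # second row: no photo taken
  stringsAsFactors = FALSE
)
invisible(file.create(file.path(dwn_dir, "1726484487534.jpg")))

plan <- build_rename_plan(toy, "x_consent_image", "patient_id", dwn_dir)

# Only rename files that are actually present in the folder.
present <- file.exists(plan$from)
renamed <- file.rename(plan$from[present], plan$to[present])
```

Building the plan as a table first makes it possible to inspect the old and new names before any file is touched.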

Thank you Florian.

  1. Yes, images for 484a5b3ff61e downloaded OK and I could see the image name under the appropriate column in the table.

  2. It's correct that 8b233cea8997 doesn't have an image associated with it. I gave the wrong example sorry! I'm a bit in the weeds with all this.

Perhaps I can explain the process more clearly.
I use ruODK::odata_submission_get() to download all the images and return the form's associated tables.

I then use some code to check which downloaded image is associated with which row in the table, so that I can batch rename them to something more identifiable for me, e.g. patient ID and date. But there are some downloaded images that have no table rows associated with them. :face_with_raised_eyebrow: :face_with_raised_eyebrow:

With these images with no associated table entry, I can open the image file and the image has information about which data row it belongs to (ID numbers, names, etc in the images). When I check the ID number in the image against the rows in the table, the values under the expected "_image" variables are missing. So, for some but not all downloaded images, I'm unable to identify which image is associated with which row.

My last attempt to solve this was to dust off ODK Briefcase and check the offending XML submissions.

I get the same problem, but interestingly the XML submission file has the image filenames embedded (as well as the images in the XML submission folder) but the associated dataset is missing the row with the filenames. :face_with_monocle:

I've attached the XML submission and both the Odata and ODK briefcase table with the blank values.

From checking the XML submission and the table it appears to be either due to the "image" variables nested within groups that don't make it to the associated table. But why it works for some and not other is strange. Perhaps it is due to the form's schema?

There are about 9000 images associated with this form and 30 GB sitting on the server. Next I'll try to download directly through ODK Central.

uuid:e85e6297-db79-4d1f-a15d-484a5b3ff61e:
submission.xml (1.9 KB)
odata table.csv (1.7 KB)
odk_Briefcase_table.csv (1.3 KB)
frm_scan(2).xlsx (27.8 KB)

That helps, thanks!

More questions. This is fascinating!

the image has information about which data row it belongs to (ID numbers, names, etc in the images)

Is that embedded in the image metadata or is that something the photo itself shows, as in the enumerator takes a picture of some patient data?

But there are some downloaded images that have no table rows associated with them.

So you start with an empty download folder, download all submissions (including attachments) and at the end you find images downloaded where their filename is nowhere in the data table?

the XML submission file has the image filenames embedded (as well as the images in the XML submission folder) but the associated dataset is missing the row with the filenames

When you say associated dataset, do you mean dataset in the sense of ODK Central EntityList or table returned by odata_submission_get()?

From checking the XML submission and the table it appears to be either due to the "image" variables nested within groups that don't make it to the associated table. But why it works for some and not other is strange. Perhaps it is due to the form's schema?

I will have to spend some time looking at the attached files (thanks for these - extremely helpful!). In groups where image filenames are missing, do you see other fields populated or is the entire subgroup empty/missing?

Thank you Florian, I managed to figure out the problem, it's due to the multiple versions of the form on Central.
I downloaded all the submissions and their attachments from ODK Central's website and checked the box "Include fields not in the published Form". This downloaded all the data tables, and the filenames appeared under all the corresponding variable names.

I think this is because form_schema() only returns the most recent version of the form, GitHub issue here.
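
As a rough sketch of how the full export can fill the gaps: the columns that exist only in the Central export (fields from older form versions) can be joined back onto the OData table by instance ID. Both data frames below are made-up toy data, and the column names (including `x_old_group_consent_image`) are assumptions; a real Central CSV export names its columns differently and would need renaming first.

```r
# Toy OData table: the filename for uuid:b is missing.
odata_tbl <- data.frame(
  meta_instance_id = c("uuid:a", "uuid:b"),
  x_consent_image = c("1726484487534.jpg", NA),
  stringsAsFactors = FALSE
)

# Toy Central export with "Include fields not in the published Form" ticked:
# it carries an extra column from an older form version.
central_csv <- data.frame(
  meta_instance_id = c("uuid:a", "uuid:b"),
  x_consent_image = c("1726484487534.jpg", NA),
  x_old_group_consent_image = c(NA, "1684483925610.jpg"),
  stringsAsFactors = FALSE
)

# Columns that only the full export knows about.
extra_cols <- setdiff(names(central_csv), names(odata_tbl))

# Join them onto the OData table by instance ID.
merged <- merge(
  odata_tbl,
  central_csv[c("meta_instance_id", extra_cols)],
  by = "meta_instance_id",
  all.x = TRUE
)

# Take the current field where present, else fall back to the old field.
merged$consent_image_any <- ifelse(
  is.na(merged$x_consent_image),
  merged$x_old_group_consent_image,
  merged$x_consent_image
)
merged$consent_image_any
```

Which old column corresponds to which current field has to be decided by hand from the form's version history.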

If it helps I've answered your questions below.

  1. The photo itself shows the data, as in the enumerator takes a picture of the data (which is also entered on the form)

  2. Yes I start with an empty download folder, download all the images and their filenames are not in the data tables.

  3. table returned by odata_submission_get()

Best,
Charlie

Perhaps partially solved then!

Great find and great job figuring out the problem!

I will add form versions to form_schema() (this will have to percolate through all functions using form_schema(), like odata_submission_get()) within the next few days.

Thank you. I also discovered some issues with downloading the data from the ODK Central server. In the data tables the columns of data were misaligned, some appearing under columns without column headers. I was able to work out which was which from the contents of the columns.

This was because there were 5 versions of the form, and variables had been moved in and out of groups within repeats, which malformed the data tables. But it was still salvageable.

thank you!

This could be interesting for the core team - pinging @Sadiq_Khoja

I will have to add ruODK unit tests with a form that has several versions and moves the same question in and out of groups across these versions.

thanks @Florian_May for tagging me.

Hi @CharlieKeyes,

Regarding misaligned columns, would you be able to share a minimal reproducible example? I tried moving fields around in various Form versions, but the CSV downloaded from Central always showed data under the correct columns.

Do you have special characters / line breaks / unicode in your data? It is quite possible that Excel is unable to open/parse the CSV file correctly.

I would also like to add that we strongly discourage deleting, renaming, or moving fields across groups/repeats; that's why we show a warning when a new Form version that does that is uploaded:

image

You can read more about it here and alternative ways to handle cases when you want to delete/move fields.

Thanks,

Sadiq

I'm cross-linking https://github.com/ropensci/ruODK/issues/161 for reference.

From the ruODK side I'll investigate whether I can support download of older form_versions (see https://github.com/ropensci/ruODK/issues/129) and include fields not published in the current form version (like https://docs.getodk.org/central-submissions/#export-options).

@CharlieKeyes early preview for you:

ruODK now implements CSV export with deleted fields.

Edit: This is now released as v1.5.1 which you can install via

remotes::install_github(
  "ropensci/ruODK",
  dependencies = TRUE,
  upgrade = "always",
  build_vignettes = FALSE
)

If you want to give this one a try, let us know whether it retrieves the missing values!

Thank you Florian,
I will give this a go later tonight. It's a large file, so it will take some time.
Charlie
