Standardize on ISO8601 for date and time exports in Briefcase, Aggregate, Central

yanokwa · September 13, 2018, 4:14am

What is the general goal of the feature?
As a data manager, I would like to process dates and times that use standard formats.

What are some example use cases for this feature?
As described at Date format on export different on different machines (with identical setups), depending on the locale and version of Java you have, Briefcase will output date times in a different format. This makes it hard to write and manage downstream tools that work with the data.

What can you contribute to making this feature a reality?
Spec and implementation guidance. Also docs.

Risks

This will break a lot of downstream scripts so we need to have a longer beta and clear release notes.
Newer versions of Java change the date formats we're using, meaning that doing nothing will also get us into trouble.
Might not be possible to easily change formats with Aggregate or Central's publishers. That is, perhaps FusionTables or Tableau, might not love the new date formats. In that case, we can start with just CSV exports.

Checklist
Here are the steps that we'll need to take to complete this.
Understand the potential date/time data types the XForms supports
Understand what formats Briefcase and Aggregate can parse
Understand what formats Central can parse
Understand what date/time formats Briefcase exports
Understand what date/time formats Aggregate exports and publishes
Understand what date/time formats Central exports and publishes
Update Briefcase to export ISO8601
Update Aggregate to export/publish ISO8601
Update Central to export/publish ISO8601
Update documentation
Release betas with deprecation warnings
Release final

yanokwa · September 13, 2018, 4:26am

@ggalmazor We've had an issue at https://github.com/opendatakit/briefcase/issues/539 for a while, but I wanted to file a more formal feature request so we can be sure we aren't missing anything. Is there anything that I'm missing in my checklist?

ggalmazor · September 13, 2018, 6:44am

Awesome! ᕕ( ᐛ )ᕗ

Thanks for putting the checklist together! It helps to understand the reach and scope of this change.

We can add to the risks section that newer versions of Java change the date formats we're using, meaning that doing nothing will also get us into trouble 100% sure (not very urgent, though, I guess)
I think we need to add to the checklists a pair of new items to understand what formats Briefcase and Aggregate can parse, since they have to interact with third-party compatible suites.

yanokwa · September 13, 2018, 6:08pm

@LN Any other risks you can think of that we should document? If not, then, perhaps @ggalmazor can do a quick technical spike to explore the spec, Briefcase, and Aggregate to see if there are any gotchas.

ggalmazor · September 27, 2018, 7:36am

I've compiled information about how we deal with dates and times in Central (thanks, @issa!), Briefcase, Aggregate, and Collect to have a better understanding of the scope of this change and all the changes we'll have to coordinate.

You can check it out here

aurdipas · October 26, 2020, 9:34am

Hi all,

It would be great if this issue will be addressed at some point.
Probably the recommendation from @ggalmazor "Regardless of the selected output format, or the API that has to be used, temporal data can always be considered a literal string of characters." would resolve the issue.

I'm working on a cohort project running in Peru and Laos and when it comes to analysis we normally use briefcase 1.17.4 to pull the data and then export to csv and ODK Meta to import that in Stata.
The Peru team is using the latest ODK Aggregate version on Digital Ocean and when they export the data to csv they get for today and start in the following format

And then ODK Meta doesn't recognise the format and it breaks (they needed to modify the do file to make the process work)
When we upload the same data to ODK Central on a Ubuntu server (locale Switzerland) and then I download directly the data as csv then I get the following for SubmissionDate and start:

And this format (Java Time yyyy-MM-dd'T'HH:mm:ss), not exact the same for the 2 fields again mess up with ODK Meta.

I tried to download the data from the Central from 2 different laptops in Switzerland via briefcase... one with openJDK 15 installed and one with the Oracle Java 1.8.
The first one has no problems and I can use ODK Meta smoothly without changes.
The second one export the data in German format (e.g. 13 Mär, 2020) and this format again is not recognised from ODK Meta.

Xiphware · October 26, 2020, 8:00pm

FYI, there is something on the ODK roadmap for this: https://github.com/getodk/roadmap/projects/1#card-10881844

LN · October 26, 2020, 9:19pm

It's helpful to get a concrete problematic scenario, @aurdipas, and we'll have to see who can work on this when. @ggalmazor had started looking into this but is now no longer part of the core team. We miss you, @ggalmazor!

@ggalmazor I had a vague memory that you had started writing code around this but I'm not seeing it now. Perhaps I'm just remembering wrong and the document was where we'd stopped?

Matthew_White · October 26, 2020, 10:02pm

I believe that Central exports SubmissionDate as ISO 8601, and for any date/time/dateTime field, passes through whatever value was sent from the client without modification. In theory, there might be differences between ISO 8601 and the ODK XForms specification for date or time values. However, in practice, it seems that Collect uses ISO 8601, which means that those values should also be ISO 8601 in the Central export.

However, one thing that's interesting about your example is that while SubmissionDate and the dateTime field tab_start are both ISO 8601, they are in different time zones. Specifically, SubmissionDate is UTC, while tab_start is the time zone set by Collect.

Are you planning to use SubmissionDate and tab_start within Stata? If so, could you describe how you plan to use the data? Or is the main issue that the time zone differences are resulting in an odkmeta error?

Also, are all the values of tab_start in the same time zone? Or is the data for Peru and Laos collected together and downloaded in a single export?

aurdipas · October 27, 2020, 9:34am

Hi @Matthew_White,

mybe I'm wrong but I find strange the difference between the format of SubmissionDate and start(end)

While the first has Z at the end and the otehrs +01:00.

They all treated as datetime from ODKMeta have different format but in anycase when we run the do file we have the message that the variables "start.end,SubmissionDate" can not be converted usin the Mask defined in the di file.
I I had to remove the T and Z from the SubmissionDate to make the do file work
generate double temp'=fcn'(regexr(regexr(`var', "T", " "), "Z", ""), "``mask''")

The interesting thing is that if I use Briefcase with Central then exported file (through briefcase) has the format expected by ODK Meta.

We use these 3 variables to understand A when the data was submitted and start-end for quality reasons.

Matthew_White · October 27, 2020, 10:49pm

Could you say more about what these quality checks look like? We're always interested in learning more about how dateTime fields are used. Do you look at the amount of time between start and end (the duration of the interview)?

It sounds like it'd be most useful for your analysis for all dateTime values to be converted to a single time zone. I suspect that that will be the case for most analysis. From what I can tell, @Florian_May's ruODK also converts dateTime values to a single time zone.

The last time I checked, Stata doesn't provide functionality around time zones: it won't parse the time zone and doesn't provide a way to convert between time zones or do related arithmetic. (I think the same is true of Excel.) However, if you know that all dateTime values other than SubmissionDate are in the same time zone, you could use a Stata mask that discards the T and the time zone:

YMD#hms#

You could then convert SubmissionDate to the desired time zone by adding/subtracting a constant. (You could do that either in the odkmeta do-file or in whatever do-file comes next.)

The Central CSV .zip download exports values in the way I mentioned above, passing through whatever value was sent from the client without modification. However, Briefcase formats values from the client, whether Briefcase pulls from Central or from Aggregate.

LN · October 27, 2020, 11:13pm

To be explicit regarding this specific case, the format is consistent. Z means UTC: https://en.wikipedia.org/wiki/ISO_8601#Coordinated_Universal_Time_(UTC)

aurdipas · October 28, 2020, 8:29am

we use start date to check when the interview started and then through calculations we generate the endtime of the interview. We use the metadata "end" to know the date the last time the form was saved/edited.
Especially during piloting or some specific studies we record start and end of a section but in these are calculated values checking previous answered questions and using the once() operator.

I'll try out I'm not an expert in stata and I found that parsing function on a stata forum.

Florian_May · October 29, 2020, 4:04am

For reference, ruODK parses dates and datetimes from a range of possible formats into an R datetime object showing the local timezone as follows:

github.com

ropensci/ruODK/blob/main/R/handle_ru_datetimes.R

#' Parse datetimes of submission data according to a form schema.
#'
#' `r lifecycle::badge("stable")`
#'
#' @details For a given tibble of submissions, parse all columns which are
#' marked in the form schema as type "date" or "dateTime" using a set of
#' lubridate orders and a given timezone.
#' @param data Submissions rectangled into a tibble. E.g. the output of
#'   ```
#'   ruODK::odata_submission_get(parse = FALSE) %>%
#'   ruODK::odata_submission_rectangle()
#'   ```
#' @param form_schema The `form_schema` for the submissions.
#'   E.g. the output of `ruODK::form_schema()`.
#' @template param-orders
#' @template param-tz
#' @template param-verbose
#' @return The submissions tibble with all date/dateTime columns mutated as
#'   lubridate datetimes.
#' @export

This file has been truncated. show original

github.com

ropensci/ruODK/blob/main/R/isodt_to_local.R

#' Parse an ISO8601 datetime string to a timezone aware datetime.
#'
#' `r lifecycle::badge("stable")`
#'
#' This function is used internally by \code{ruODK} to parse ISO timestamps
#' to timezone-aware local times.
#'
#' @param datetime_string (character) An ISO8601 datetime string as produced by
#'   XForms exported from ODK Central.
#' @param orders (vector of character) Orders of datetime elements for
#'   lubridate.
#'   Default: \code{c("YmdHMS", "YmdHMSz", "Ymd HMS", "Ymd HMSz")}.
#' @template param-tz
#' @return A lubridate PosixCT datetime in the given timezone.
#' @family utilities
#' @keywords internal
isodt_to_local <- function(datetime_string,
                           orders = c("YmdHMS", "YmdHMSz"),
                           tz = get_default_tz()) {
  datetime_string %>%

This file has been truncated. show original