Internship project: ODK Briefcase - New export types

So if we are flattening the repeats do we put everything in the same ‘main’ CSV file under independent columns or is there a better way of doing this?

It can depend on how that CSV is going to be consumed.

If users want to work with it on excel, an independent column for each repeated field would work. Say that field A is a repeat group with one instance with field B and a nested repeat group C with two instances with fields D and E. Then, the CSV would have columns:

  • A_1.B
  • A_1.C_1.D
  • A_1.C_1.E
  • A_1.C_2.D
  • A_1.C_2.E

The total number of columns would be dynamicly based on the maximum amount repeated answers and nesting levels

If the CSV is to be imported on a database with JSON support like PostgreSQL, encoding the repeats into JSON in one column would work.

Since some of the possible new export formats are privative, I wanted to open a discussion about the options we have - especially to verify the work we are going to do here. Some questions we need to answer are:

  • Do people contributing to this new feature have to have access to SPSS / Stata? How do we achieve this?
  • How are we going to automatically verify that the output format is correct?
  • Do we need extra specific QA manual tests for this?

@yanokwa, @ln, @Matthew_White, @Ronald_Munjoma, @mayank8318, any ideas? Does ODK have licenses we can use for this?

1 Like

Regarding flattening repeats :
There are two approaches I could come up with-

  1. Using 1 line per submission and adding new columns for each combination of repeat instances and columns. The problem with this approach is mentioned by @ggalmazor Internship project: ODK Briefcase - New export types - #10 by ggalmazor
    This can significantly increase the no of columns in the CSV(as the complexity of nesting increases) but the advantage is that each observation is just one row.

  2. Using different rows to represent each repeat group instance.
    Screenshot%20from%202018-03-05%2016-28-16

Here, the person Mayank added 2 instances of the repeat group.
This approach is not exactly "flattening" the results but serves the purpose of fitting everything on a single csv without increasing the number of columns.

@ggalmazor @yanokwa @LN @Matthew_White @Ronald_Munjoma Please let me know what you think of the two ideas and which one is better suited for our users.

@ggalmazor ODK does not have licenses to STATA or SPSS. A short-term solution is for students to use a trial license which lasts 30 days.

While that is ongoing, we can put a call out to the community to see if we can have get access to a few community member's licenses for a few months. Perhaps @Matthew_White can help here on the Stata front?

As far as automated testing, https://bitbucket.org/fomcl/savreaderwriter has some test data that they use for SPSS. Perhaps we can leverage that in some way?

@ggalmazor @mayank8318 Multiple rows are what I would expect. I prefer that approach because downstream tools often have a max number of columns.

So if you have data about a head of household and her children's names, ages, and favorite colors, you'd have an output like this.

2 Likes

I'm not sure this is the best approach, and it requires some manual effort, but for each odkmeta test, I've followed these steps:

  • Import a form's submission data by using odkmeta to generate a Stata do-file and running the do-file.
  • Manually check the resulting Stata dataset to make sure that it has been imported as expected. Follow these steps to do so.
  • If the dataset has been imported as expected, save it in the repository. For the test, run odkmeta to generate the do-file, then run the do-file, then compare the resulting dataset to the manually checked one that is saved in the repository. They should be the same.

I actually don't have access to a Stata license at the moment, but I think you're right that someone in the community must. I also think it's possible that Stata would donate a license if we contacted them.

1 Like

Just thinking out loud, what would the resulting file look like if there are multiple, unrelated repeat groups? For example, a farmer who has multiple children and multiple plots of land. If we ask questions about the farmer (name, age), their children (name, age, favorite color), and their plots (size, crop type) such that there is a repeat group for the children and a separate repeat group for the plots, what would each row of the file represent?

1 Like

That is a very good question. I couldn't come up with a good solution for this case. If we display all possible combinations (that would be no of repeat_group_instances in RG1 * no of repeat_group_instances in RG2). This doesn't seem to be a good solution but I cannot think of anything else. One website(http://www.novixys.com/SwiftXML/flatten-xml.html) that I came across gives a similar solution, though it doesn't talk about this scenario. What would you suggest? How does this work in Google Fusion Tables? Maybe someone who has used it might be able to shed some light on this?

I think your first approach (one row per submission) is preferable in many cases for data analysis in Excel or a statistical package: analysts often tend to prefer wide data sets. It's actually pretty rare that I've run into the limit on the number of columns in Excel or Stata. However, like Yaw said, it does happen, and it's probably more common in other tools. Maybe in that case, the user could select a subset of fields or repeat groups to export?

But then that approach is not applicable for all cases, right? Suppose there exists a repeat group RG1 and there is one such form submission which has X no. of instances of this RG1. Here X is the minimum no of instances required to max out the column restriction. So, that approach of fitting everything on a single row won't work here. What do we do in this case? What I want to say is that this is not a universal solution and so is this the most user friendly way of doing it?

Ah, I see what you mean. I think ideally, we would know the maximum number of instances of each repeat group at the time the user is selecting the subset of fields. That way, we could show them the expected number of columns based on their selection. If that's not possible, another option might be to display the number of columns at the end of the export. If the number is too high, they can retry the export, selecting a smaller number of fields.

I think one high-level question is: who are the intended users of this functionality? If some of the primary users of this functionality will be those working in Excel and statistical packages, and the output of the first approach tends to best facilitate their data analysis, my personal view is that we should do what we can to provide it for them. From what I can tell from a quick glance, it seems that most servers in the ecosystem that support flattening take this first approach.

However, it's certainly possible that there is a group of users who would prefer the output of the first approach, yet because of the limit on the number of columns, would find the first approach infeasible, even with field selection. Or it's possible that there is a group of users who, even without the limit, would actually prefer the output of the second approach and find that it best facilitates their analysis. It'd be interesting to learn more about such use cases. As part of that, it'd be interesting to hear thoughts on how to address the question of unrelated repeat groups.

2 Likes

I'm loving this discussion!

I think the intended users are folks who work in Excel and other statistical packages are the primary users and so I defer to @Matthew_White's judgment since he has more experience with those users. And it also seems like other servers use this approach so it can't be too bad :smile:

@mayank8318 Excel has 16,384 max columns. For a first pass, we can just not allow more than that in the flattened CSV and see what feedback users have. It seems unlikely that someone would complain.

1 Like

@yanokwa Seems fair then. I will make changes in the proposal accordingly. Tackling real life problems is so much fun :slight_smile:

3 Likes

@Matthew_White @yanokwa Is this right?
There is a form with fields Farmer(name, age), Child[Repeat](name, age , favourite colour), Plots[Repeat](size, crop type).
Two farmers FamX(3 children, 2 plots) and FamY(2 children, 2 plots) fill the form. Their flattened CSV will look like this -


2 Likes

Hi @ggalmazor!
Is this project still up for GSoC 2019?
Thanks!

Welcome to the ODK forum, @bhavyejain! We're glad you're here. When you get a chance, please introduce yourself here. I'd also encourage you to add a real picture as your avatar because it helps build community!

Unfortunately, I won't be participating on GSoC this year. Unless some other mentor shows up, there won't be any Briefcase project this time.

@ggalmazor I'll make sure to update my profile today itself and introduce myself!

Well this was a sad news. the project would have been a great learning experience for me!
If a new mentor shows up, where can we get updated?

1 Like

@ggalmazor the ideas page of ODK still shows GSoC - 2018. Will it be changed for GSoC 2019 or the same continues? Just in case new projects/ideas come ahead.

I think we're working on a different document for the 2019 edition. I'll see if I can add a link here as well

1 Like

@bhavyejain See GSoC and Outreachy 2019: What is the plan?