Understanding the "Select Multiple" CSV output

I am very interested to understand why the "Select Multiple", as made with
ODK Build stores the results of the data as it does (space separated).
Never seen space separated before. It seems very natural that a variable
with say four options, each of which is checked/not checked, should be
expanded into four separate boolean variables. The output on the CSV from
aggregate is horrible to deal with. I am guessing this was discussed at
some point, but I don't understand the tradeoffs. It seems to me if you
have a variable "shower", and you ask someone to check the days of the week
they take a shower, that instead of getting a return value of "1 3 5 6"
(each day of the week's value being 1 to 7), it makes more sense to create
7 variables, "shower[0], shower[1]... ", etc. with values true or false.
This would at least make the csv output usable. Thanks for helping me to
understand this choice.
PS. Let the record reflect that this is not necessarily my shower schedule
; )

2 Likes

It's a OpenRosa/JavaRosa decision so you should ask that list. Since ODK
tools are built on top of the JavaRosa core and follow the OpenRosa
standard ODK tools simply follow the specification.

Quoting Mitch on a previous thread:
"If you have a multi-select with 30 choices, would you want a spreadsheet
to be generated with 30 column headings for this one multi-select, and a
"X" under the columns with matching values? What if there are 50
choices? What if the number of choices is unlimited (this is possible
with itemsets drawing their choices from an external database)? At some
point, it makes more sense to produce a spreadsheet with just "N" columns
for a given multi-select (1st choice, 2nd choice, ...), and show up to N
choices for that multiselect (but which N do you show?)."

Waylon

··· On Tue, Sep 4, 2012 at 1:37 PM, Curtis Broderick wrote:

I am very interested to understand why the "Select Multiple", as made with
ODK Build stores the results of the data as it does (space separated).
Never seen space separated before. It seems very natural that a variable
with say four options, each of which is checked/not checked, should be
expanded into four separate boolean variables. The output on the CSV from
aggregate is horrible to deal with. I am guessing this was discussed at
some point, but I don't understand the tradeoffs. It seems to me if you
have a variable "shower", and you ask someone to check the days of the week
they take a shower, that instead of getting a return value of "1 3 5 6"
(each day of the week's value being 1 to 7), it makes more sense to create
7 variables, "shower[0], shower[1]... ", etc. with values true or false.
This would at least make the csv output usable. Thanks for helping me to
understand this choice.
PS. Let the record reflect that this is not necessarily my shower schedule
; )

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

I know this is an old thread, but thanks for you response Gaetano - I was looking for a quick, fairly user-proof fix. That fits the bill, I think most users will be able to get to use this with a couple examples. I post here so perhaps others looking for an easy answer might find it more easily.

As for the broader question of one vs many columns... In a way it doesn't matter THAT much - you can achieve the same results either ways. This being said, it seems to me it is easier to get rid of empty columns than split values cramp in a single cell. For this I would lean towards multiple columns. Also, a holy principle of database design is atomic distribution of data. The multiple values in a cell does break that principle, for the DB purists at least....

2 Likes

By the way, you can easily solve the subset problem by including spaces around the values:

=IFERROR(FIND("1",$AP2),0)

will not work if you values goes from 0 to 10 or more, but

=IFERROR(FIND(" 1 ",$AP2),0) & =IFERROR(FIND(" 11 ",$AP2),0)

will because find takes the spaces into account.

Thanks for the quick reply Waylon. I really did search the forum for this
subject, honest. I am not sure I understand Mitch's response. He suggests a
separate csv for the one 40 choice variable aside from the main form's csv?
That sounds pretty reasonable. I know computer science people like the
"need to take into account all situations", but the nature of a multichoice
is by definition a finite set, so unlimited seems a bit far fetched. As for
40 for 50 possible responses, that sounds pretty crazy. The selections are
captured for the goal of being analysed, no? If someone thinks it is that
important to have 50 possible check responses to a given question, OK,
maybe "Which countries have you visited?" (with 50 possible check boxes)
then it does not seem unreasonable to me that he/she ends up with 50
columns. How can you ask the average person to use Excel to parse space
separated data to give you a simple count? OK, this still does not change
javarosa.
Curtis

··· 2012/9/4 W. Brunette

It's a OpenRosa/JavaRosa decision so you should ask that list. Since ODK
tools are built on top of the JavaRosa core and follow the OpenRosa
standard ODK tools simply follow the specification.

Quoting Mitch on a previous thread:
"If you have a multi-select with 30 choices, would you want a spreadsheet
to be generated with 30 column headings for this one multi-select, and a
"X" under the columns with matching values? What if there are 50
choices? What if the number of choices is unlimited (this is possible
with itemsets drawing their choices from an external database)? At some
point, it makes more sense to produce a spreadsheet with just "N" columns
for a given multi-select (1st choice, 2nd choice, ...), and show up to N
choices for that multiselect (but which N do you show?)."

Waylon

On Tue, Sep 4, 2012 at 1:37 PM, Curtis Broderick < c.broderick@epiconcept.fr> wrote:

I am very interested to understand why the "Select Multiple", as made
with ODK Build stores the results of the data as it does (space separated).
Never seen space separated before. It seems very natural that a variable
with say four options, each of which is checked/not checked, should be
expanded into four separate boolean variables. The output on the CSV from
aggregate is horrible to deal with. I am guessing this was discussed at
some point, but I don't understand the tradeoffs. It seems to me if you
have a variable "shower", and you ask someone to check the days of the week
they take a shower, that instead of getting a return value of "1 3 5 6"
(each day of the week's value being 1 to 7), it makes more sense to create
7 variables, "shower[0], shower[1]... ", etc. with values true or false.
This would at least make the csv output usable. Thanks for helping me to
understand this choice.
PS. Let the record reflect that this is not necessarily my shower
schedule ; )

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

--

Curtis BRODERICK
Head of training and user support
Tél : + 33(0)1.53.02.40.61 06.15.33.57.37

---------------------------EpiConcept---------------------------
Systèmes d'Information en Santé
Health Information System
47, rue de Charenton - 75012 PARIS
Tel : +33 (0)1 53 02 40 60 Fax : +33 (0)1 53 02 40 62
-----------------------www.epiconcept.fr------------------------

Ce message et toutes les pièces jointes sont établis à l'intention
exclusive de ses destinataires et sont confidentiels.
Si vous recevez ce message par erreur, merci de le détruire et d'en
avertir l'expéditeur.

The information contained in the Email and any attachments is
confidential and intended solely and for the attention and use of the
named address(es). It may not be disclosed to any other person without
the express authority of EpiConcept, or the intended recipient, or
both.
If you are not the intended recipient, you must not disclose, copy,
distribute or retain this message or any part of it.

Hey Curtis,

An auto-complete widget (which builds on selects) with a list of
animals or diseases could easily take hundreds of options. Anyway, the
reason it's done this way is because this is the XForms spec and I
doubt the JavaROSA community is going to change it.

A better option is to change how Aggregate generates the CSV (file a
bug or submit a patch), or write a script to change that output to
something that is easier to analyze. I think the Kobo
(http://koboproject.org) team has done the latter so might be a good
place to start looking.

Yaw

··· On Tue, Sep 4, 2012 at 2:25 PM, Curtis Broderick wrote: > Thanks for the quick reply Waylon. I really did search the forum for this > subject, honest. I am not sure I understand Mitch's response. He suggests a > separate csv for the one 40 choice variable aside from the main form's csv? > That sounds pretty reasonable. I know computer science people like the "need > to take into account all situations", but the nature of a multichoice is by > definition a finite set, so unlimited seems a bit far fetched. As for 40 for > 50 possible responses, that sounds pretty crazy. The selections are captured > for the goal of being analysed, no? If someone thinks it is that important > to have 50 possible check responses to a given question, OK, maybe "Which > countries have you visited?" (with 50 possible check boxes) then it does not > seem unreasonable to me that he/she ends up with 50 columns. How can you ask > the average person to use Excel to parse space separated data to give you a > simple count? OK, this still does not change javarosa. > Curtis > > 2012/9/4 W. Brunette > >> It's a OpenRosa/JavaRosa decision so you should ask that list. Since ODK >> tools are built on top of the JavaRosa core and follow the OpenRosa standard >> ODK tools simply follow the specification. >> >> Quoting Mitch on a previous thread: >> "If you have a multi-select with 30 choices, would you want a spreadsheet >> to be generated with 30 column headings for this one multi-select, and a "X" >> under the columns with matching values? What if there are 50 choices? >> What if the number of choices is unlimited (this is possible with itemsets >> drawing their choices from an external database)? At some point, it makes >> more sense to produce a spreadsheet with just "N" columns for a given >> multi-select (1st choice, 2nd choice, ...), and show up to N choices for >> that multiselect (but which N do you show?)." >> >> Waylon >> >> >> On Tue, Sep 4, 2012 at 1:37 PM, Curtis Broderick wrote: >>> >>> I am very interested to understand why the "Select Multiple", as made >>> with ODK Build stores the results of the data as it does (space separated). >>> Never seen space separated before. It seems very natural that a variable >>> with say four options, each of which is checked/not checked, should be >>> expanded into four separate boolean variables. The output on the CSV from >>> aggregate is horrible to deal with. I am guessing this was discussed at some >>> point, but I don't understand the tradeoffs. It seems to me if you have a >>> variable "shower", and you ask someone to check the days of the week they >>> take a shower, that instead of getting a return value of "1 3 5 6" (each day >>> of the week's value being 1 to 7), it makes more sense to create 7 >>> variables, "shower[0], shower[1]... ", etc. with values true or false. This >>> would at least make the csv output usable. Thanks for helping me to >>> understand this choice. >>> PS. Let the record reflect that this is not necessarily my shower >>> schedule ; ) >>> >>> -- >>> Post: opendatakit@googlegroups.com >>> Unsubscribe: opendatakit+unsubscribe@googlegroups.com >>> Options: http://groups.google.com/group/opendatakit?hl=en >> >> >> -- >> Post: opendatakit@googlegroups.com >> Unsubscribe: opendatakit+unsubscribe@googlegroups.com >> Options: http://groups.google.com/group/opendatakit?hl=en > > > > > -- > > Curtis BRODERICK > Head of training and user support > > Tél : + 33(0)1.53.02.40.61 06.15.33.57.37 > > ---------------------------EpiConcept--------------------------- > Systèmes d'Information en Santé > Health Information System > > 47, rue de Charenton - 75012 PARIS > Tel : +33 (0)1 53 02 40 60 Fax : +33 (0)1 53 02 40 62 > -----------------------www.epiconcept.fr------------------------ > > > Ce message et toutes les pièces jointes sont établis à l'intention exclusive > de ses destinataires et sont confidentiels. > Si vous recevez ce message par erreur, merci de le détruire et d'en avertir > l'expéditeur. > > > The information contained in the Email and any attachments is confidential > and intended solely and for the attention and use of the > named address(es). It may not be disclosed to any other person without the > express authority of EpiConcept, or the intended recipient, or both. > > If you are not the intended recipient, you must not disclose, copy, > distribute or retain this message or any part of it. > > > > -- > Post: opendatakit@googlegroups.com > Unsubscribe: opendatakit+unsubscribe@googlegroups.com > Options: http://groups.google.com/group/opendatakit?hl=en

Or you can add the following to your Excel file:

=IFERROR(FIND("mutli-select-value", A1), 0)
Put this is in a parallel column to your multi-select. The A1 would point
to the cell in the same row under the multi-select column. The formula
above will evaluate to a 0 or 1 depending on whether the multi-select-value
appears in that cell - you need the IFERROR because FIND returns an error
code if the string is not found. You should also make sure none of your
multi-select values are substrings of each other for this to work properly.
It should work fine with your examples of numbers.

You can then count the number of occurrences at the bottom of the column
using "COUNT(B1:Bn)" to sum up all the ones.

Given the number of ways different people would want to do this, it is
better to customize in Excel or wherever you take you .csv file. Aggregate
was designed to be minimal functionality.

Gaetano

··· On Tue, Sep 4, 2012 at 2:37 PM, Yaw Anokwa wrote:

Hey Curtis,

An auto-complete widget (which builds on selects) with a list of
animals or diseases could easily take hundreds of options. Anyway, the
reason it's done this way is because this is the XForms spec and I
doubt the JavaROSA community is going to change it.

A better option is to change how Aggregate generates the CSV (file a
bug or submit a patch), or write a script to change that output to
something that is easier to analyze. I think the Kobo
(http://koboproject.org) team has done the latter so might be a good
place to start looking.

Yaw

On Tue, Sep 4, 2012 at 2:25 PM, Curtis Broderick c.broderick@epiconcept.fr wrote:

Thanks for the quick reply Waylon. I really did search the forum for this
subject, honest. I am not sure I understand Mitch's response. He
suggests a
separate csv for the one 40 choice variable aside from the main form's
csv?
That sounds pretty reasonable. I know computer science people like the
"need
to take into account all situations", but the nature of a multichoice is
by
definition a finite set, so unlimited seems a bit far fetched. As for 40
for
50 possible responses, that sounds pretty crazy. The selections are
captured
for the goal of being analysed, no? If someone thinks it is that
important
to have 50 possible check responses to a given question, OK, maybe "Which
countries have you visited?" (with 50 possible check boxes) then it does
not
seem unreasonable to me that he/she ends up with 50 columns. How can you
ask
the average person to use Excel to parse space separated data to give
you a
simple count? OK, this still does not change javarosa.
Curtis

2012/9/4 W. Brunette wbrunette@gmail.com

It's a OpenRosa/JavaRosa decision so you should ask that list. Since ODK
tools are built on top of the JavaRosa core and follow the OpenRosa
standard
ODK tools simply follow the specification.

Quoting Mitch on a previous thread:
"If you have a multi-select with 30 choices, would you want a
spreadsheet
to be generated with 30 column headings for this one multi-select, and
a "X"
under the columns with matching values? What if there are 50 choices?
What if the number of choices is unlimited (this is possible with
itemsets
drawing their choices from an external database)? At some point, it
makes
more sense to produce a spreadsheet with just "N" columns for a given
multi-select (1st choice, 2nd choice, ...), and show up to N choices for
that multiselect (but which N do you show?)."

Waylon

On Tue, Sep 4, 2012 at 1:37 PM, Curtis Broderick c.broderick@epiconcept.fr wrote:

I am very interested to understand why the "Select Multiple", as made
with ODK Build stores the results of the data as it does (space
separated).
Never seen space separated before. It seems very natural that a
variable
with say four options, each of which is checked/not checked, should be
expanded into four separate boolean variables. The output on the CSV
from
aggregate is horrible to deal with. I am guessing this was discussed
at some
point, but I don't understand the tradeoffs. It seems to me if you
have a
variable "shower", and you ask someone to check the days of the week
they
take a shower, that instead of getting a return value of "1 3 5 6"
(each day
of the week's value being 1 to 7), it makes more sense to create 7
variables, "shower[0], shower[1]... ", etc. with values true or false.
This
would at least make the csv output usable. Thanks for helping me to
understand this choice.
PS. Let the record reflect that this is not necessarily my shower
schedule ; )

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

--

Curtis BRODERICK
Head of training and user support

Tél : + 33(0)1.53.02.40.61 06.15.33.57.37

---------------------------EpiConcept---------------------------
Systèmes d'Information en Santé
Health Information System

           47, rue de Charenton - 75012 PARIS
     Tel : +33 (0)1 53 02 40 60  Fax : +33 (0)1 53 02 40 62

-----------------------www.epiconcept.fr------------------------

Ce message et toutes les pièces jointes sont établis à l'intention
exclusive
de ses destinataires et sont confidentiels.
Si vous recevez ce message par erreur, merci de le détruire et d'en
avertir
l'expéditeur.

The information contained in the Email and any attachments is
confidential
and intended solely and for the attention and use of the
named address(es). It may not be disclosed to any other person without
the
express authority of EpiConcept, or the intended recipient, or both.

If you are not the intended recipient, you must not disclose, copy,
distribute or retain this message or any part of it.

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

--
Post: opendatakit@googlegroups.com
Unsubscribe: opendatakit+unsubscribe@googlegroups.com
Options: http://groups.google.com/group/opendatakit?hl=en

I spent a little bit of time searching the forum but apologies if I didn't find the right post and this has been answered more recently. The notes from this thread still apply? (i.e that Aggregate exports just the one column of data for select_multiple questions and any analysis of the count of each answer needs to be done with some sort of FIND formula post-export)

Yes, the select_multiple column when exported has all the answers separated by spaces. If you are using Stata, odkmeta provides a nice way to split these variables.

I think it'd be nice to have this splitting as an option in ODK Briefcase one of these days, but there are higher priority export projects going on right now.

1 Like

I know this is an old thread but it seems to be getting some updates every now and then.

I'm assuming the "Select Multiple" CSV output is still the same as I have seen in my testing. Wanting to split the columns in the output CSV to several, and having multiple text values in the field, I used the formula suggested by Gaetano and added an IF condition. The initial formula would return the start number of the second and subsequent words, which would not return 1, so here is my modified formula:

=IFERROR(IF((FIND("commerce", B2))>0,1), 0)

Here, I'm looking for the value "commerce" in B2. If found, and if the word is not the first one, the additional IF would return 1, otherwise, it would still be 0. Any comments?

Hi, @techGus!

Thanks for rescuing this thread! We added an option to split select multiple values into separate columns while exporting forms to CSV in Briefcase v1.12, and we've just released Briefcase v1.13, which includes the option to the UI.

You can check it out here: ODK Briefcase v1.13

2 Likes

Thanks @ggamazor for the update. I had downloaded Briefcase 1.13 but haven't looked at the new settings. Glad to have an Excel fiddling step removed!

2 Likes