This refers to this post which asked if there was a way to sample without replacement in ODK.
The basic requirement was to select a random sample of n entities from a list p long, without replacement.
For example, if you had these items in the list
A, B, C, D, E, F
and you wanted a random sample of 3 items without replacement, then an acceptable sample would be
A, D, E
or
E, F, B
but not
A, E, A
because there A has been put back in the list after being sampled the first time, and has then been drawn again.
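To make the difference concrete, here is a minimal sketch in R (the language used for the CSV script later in this post). The `items` vector is just the example list from above, and the seed is only there so the output is repeatable.

```r
# Example list from the post
items <- c("A", "B", "C", "D", "E", "F")

set.seed(1)  # only so the example is repeatable

# Without replacement: each item can be drawn at most once
without <- sample(items, size = 3, replace = FALSE)

# With replacement: the same item may be drawn more than once
with_repl <- sample(items, size = 3, replace = TRUE)

print(without)
print(with_repl)

# A sample drawn without replacement never contains duplicates
stopifnot(!any(duplicated(without)))
```

The `with_repl` draw may or may not contain a repeat on any given run; the point is only that nothing prevents it.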
In the previous post there was a suggestion that this might become an obvious feature of ODK, but it is unclear whether it has been finished. In the meantime, I had a look at a quasi-random solution that uses only the random() and pulldata() functions that are already built in to ODK. This is pretty sketchy and I would like to see if anyone has a more elegant way to achieve the same end.
First we need to make an external CSV file that contains an array of random sequences of numbers, each sampled from a larger set without replacement. In the example below I've sampled 10000 sequences, each with 10 numbers between 1 and 50 (inclusive), but this could also be sampling of strings or logicals. The larger the number of sequences, the more random the system becomes, but 10000 should be good enough for most real world purposes such as clinical trial randomisations.
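To put the "quasi-random" caveat in numbers, this rough R sketch counts how many ordered sequences of 10 distinct numbers from 1 to 50 exist and compares that with the 10000 rows stored in the file. The variable names are mine, for illustration only.

```r
pool_size   <- 50     # numbers 1 to 50
sample_size <- 10     # numbers per sequence
n_sequences <- 10000  # rows in the CSV

# Ordered samples without replacement: 50 * 49 * ... * 41
possible <- prod((pool_size - sample_size + 1):pool_size)

# Fraction of all possible sequences that the file can ever return
coverage <- n_sequences / possible

print(possible)
print(coverage)
```

So the form can only ever return a very small fraction of the possible sequences, although which stored sequence it returns is still picked at random.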
The following code (in R) will generate a CSV file with the name_key structure required to allow these sequences to be pulled in to ODK using the pulldata() command.
# make a data frame with 10,000 rows
a <- as.data.frame(1:10000)
# change the header to name_key (the lookup key used by pulldata)
names(a) <- "name_key"
# create columns to house random samples (R names these V2 to V11)
a[, 2:11] <- NA
# populate columns with randomly sampled data (here 10 columns, each with a number between 1 and 50 without replacement)
for (i in 1:nrow(a)) {
  a[i, 2:11] <- sample(x = 1:50, size = 10, replace = FALSE)
}
# save a CSV file
write.csv(a, file = "randomer.csv", row.names = FALSE)
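Before attaching the file to a form it may be worth checking it. The sketch below is one way to do that, under my own assumptions: `check_sequences` is a hypothetical helper (not part of R or ODK), and a 100-row stand-in written to a temporary file is used in place of the real randomer.csv so the example runs on its own.

```r
# Hypothetical helper: TRUE if every row holds distinct values within 1:pool_max
check_sequences <- function(df, pool_max = 50) {
  values <- as.matrix(df[, -1])  # drop the name_key column
  no_repeats <- apply(values, 1, function(r) !any(duplicated(r)))
  in_range <- all(values >= 1 & values <= pool_max)
  all(no_repeats) && in_range
}

# Small stand-in for randomer.csv, built the same way as above (100 rows)
a <- as.data.frame(1:100)
names(a) <- "name_key"
a[, 2:11] <- NA
for (i in 1:nrow(a)) {
  a[i, 2:11] <- sample(x = 1:50, size = 10, replace = FALSE)
}

# Round-trip through a temporary CSV, then check what was actually written
tmp <- tempfile(fileext = ".csv")
write.csv(a, file = tmp, row.names = FALSE)
b <- read.csv(tmp)

print(names(b))            # column headers that pulldata() will need
print(check_sequences(b))
```

To check the real file, replace `tmp` with "randomer.csv" in the `read.csv()` call.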
Then we need an XLSForm design that makes use of this file.
The rnd variable simply generates a random integer from 1 to 10000.
The pulldata commands on the subsequent lines then use the random number from rnd to access the matching line in the CSV file.
Adding more lines here (I called them randomperson1, randomperson2 and so on) would extend the length of the random sample you get. In the example you could go up to ten, but there's no limit on this if the CSV has more columns.
Convert this XLS to XML and load it to Aggregate with the CSV file attached and it should work.
type | name | label | calculation |
---|---|---|---|
calculate | rnd | | once(int(10000*random())+1) |
note | note_rnd | The random number is ${rnd} | |
calculate | randomperson1 | | pulldata('randomer', 'V2', 'name_key', ${rnd}) |
calculate | randomperson2 | | pulldata('randomer', 'V3', 'name_key', ${rnd}) |
calculate | randomperson3 | | pulldata('randomer', 'V4', 'name_key', ${rnd}) |
calculate | randomperson4 | | pulldata('randomer', 'V5', 'name_key', ${rnd}) |
calculate | randomperson5 | | pulldata('randomer', 'V6', 'name_key', ${rnd}) |
note | note_1 | The first person is ${randomperson1} | |
note | note_2 | The second person is ${randomperson2} | |
note | note_3 | The third person is ${randomperson3} | |
note | note_4 | The fourth person is ${randomperson4} | |
note | note_5 | The fifth person is ${randomperson5} | |
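If it helps to see the logic outside ODK, the form's behaviour can be imitated in R. This is only a sketch under my own assumptions: `pulldata_sim` is a hypothetical stand-in I wrote to mimic the lookup, not an ODK function, and a 20-row table replaces the real 10000-row file.

```r
# Stand-in for randomer.csv (20 rows here instead of 10000)
n_rows <- 20
a <- as.data.frame(1:n_rows)
names(a) <- "name_key"
a[, 2:11] <- NA
for (i in 1:n_rows) {
  a[i, 2:11] <- sample(x = 1:50, size = 10, replace = FALSE)
}

# Hypothetical stand-in for pulldata('randomer', column, 'name_key', key)
pulldata_sim <- function(df, column, key) {
  df[df$name_key == key, column]
}

# Same idea as int(10000*random())+1, scaled to n_rows: an integer from 1 to n_rows
rnd <- floor(n_rows * runif(1)) + 1

# Same idea as randomperson1 to randomperson5: columns V2 to V6 of the chosen row
people <- sapply(paste0("V", 2:6), function(col) pulldata_sim(a, col, rnd))

print(rnd)
print(people)
```

Because all five values come from the same row, and each row was sampled without replacement, the five values are always distinct.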
sample_no_replacement.xml (3.0 KB)
sample_no_replacement.xlsx (10.0 KB)
randomer.csv (323.0 KB)