Random sampling in ODK Collect

Xiphware · September 9, 2023, 10:21pm

Random sampling is an established statistical method to roughly assess a large population when sampling every single individual therein is infeasible, by instead assessing a much smaller unbiased random subset, such that if many such samples were drawn the average sample would accurately represent the overall population. From Wikipedia:

"In statistics, a simple random sample (or SRS) is a subset of individuals (a sample) chosen from a larger set (a population) in which a subset of individuals are chosen randomly, all with the same probability. It is a process of selecting a sample in a random way."

A perhaps more pragmatic description of what is - and more importantly what isn't! - a random sample, and one which alludes to why it's a bit tricky to accomplish in ODK Collect, has been given elsewhere in the Forum by @chrissyhroberts:

The basic requirement was to select a random sample of n entities from a list p long, without replacement.

i.e. if you had these items in the list

A, B, C, D, E, F

and you wanted a random sample of 3 items without replacement. then an acceptable sample would be

A, D, E
or
E, F, B

but not

A, E, A in which there is replacement of A in the list after A is sampled the first time.

Practically speaking, generating a single random number between, say, 1 and 100 is easy. This number can then be used to pick one random sample out of your dataset, eg #42. However, if you want a sample size of 2, 3 or more, simply generating further random numbers like this won't work, because there is no guarantee you won't regenerate 42 a second time! This is why non-singular random sampling is a trickier problem - you must guarantee no duplicates.
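To make the duplicate problem concrete, here is a minimal Python sketch (not ODK code; the function names and the fixed seed are just assumptions for illustration). Independent draws can repeat a number, whereas shuffling the indices once and taking the first few cannot:

```python
import random

def draw_with_replacement(n, k, rng):
    """k independent draws from 1..n; the same number may come up twice."""
    return [rng.randint(1, n) for _ in range(k)]

def draw_without_replacement(n, k, rng):
    """Shuffle 1..n once and keep the first k; duplicates are impossible."""
    indices = list(range(1, n + 1))
    rng.shuffle(indices)
    return indices[:k]

rng = random.Random(2023)  # fixed seed only so the sketch is repeatable
sample = draw_without_replacement(100, 3, rng)
```

The shuffle-then-take-the-first-k idea is the one the rest of this article builds on.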

This article describes a new technique to extract an arbitrarily sized random sample (with no duplicates) from an internal instance dataset, exactly the same sort of form dataset that you might use to populate select_one or select_multiple questions. This approach evolved directly from my previous post here about randomizing questions, and utilizes much of the same logic.

1. Ensure your dataset has a contiguous index

I'm using a generic dataset which I copied from here. The only important thing is that your dataset must have a field containing a contiguous index for each sample, from 1..N where N is the total size of the dataset. For simplicity, I'm using the regular XLSForm choices name field for this purpose.

2. Randomly shuffle some numbers

As with my randomizing questions solution, the next step is to apply the Secret Sauce® to randomly shuffle your dataset's index into an 'array' for later use:

once(join(' ', randomize(instance('dataset')/root/item[true()]/name)))

See Randomizing the order of questions for a more in-depth explanation of how this works.

After this, I now have a randomized non-repeating list of numbers from 1 to N; eg "22 42 3 15 10 ..."
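For anyone who finds the XPath hard to read, a rough Python analogue of the calculation might look like this. It is only a sketch of the behaviour as I understand it: randomize_and_join and once are hypothetical helper names, and the once helper simply mimics "keep the stored value if there already is one".

```python
import random

def randomize_and_join(names, rng):
    """Shuffle the index values and join them into one space-separated string."""
    shuffled = list(names)
    rng.shuffle(shuffled)
    return " ".join(str(name) for name in shuffled)

def once(current_value, compute):
    """Only compute a value if nothing has been stored yet."""
    return current_value if current_value else compute()

names = range(1, 101)  # stands in for the 'name' column, 1..N
rng = random.Random()

# First evaluation: the field is empty, so the shuffle runs.
randomized = once("", lambda: randomize_and_join(names, rng))
# Later evaluations: the stored string is kept, so the order never changes.
randomized_again = once(randomized, lambda: randomize_and_join(names, rng))
```

Without the once-style guard, the list would be reshuffled on every recalculation and the sample would keep changing under you.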

3. Pick your sample size

Next determine what your sample size is going to be. It should be at least 1 but no more than the total number of samples in your dataset. To determine the latter, you can use an XPath trick to count() the internal instance dataset size.

count(instance('dataset')/root/item)

Normally count() is used to count the number of iterations of a repeat group, but in general it can be used to count the number of elements in an arbitrary nodeset, which internal datasets are. So in our case, count() gives us the total number of items in our dataset.

How you determine the sample size is entirely up to you; it can be a fixed value for your specific application, or it can be determined by other data acquired by your form, or it can simply be user specified, as shown here.
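As a sketch of the bounds described above, in Python rather than form logic (the dataset here is a made-up stand-in and is_valid_sample_size is a hypothetical name; in a real form you would more likely express this as a constraint on the question):

```python
# Stand-in for the internal dataset: one dict per choice item.
dataset = [{"name": str(i)} for i in range(1, 101)]

# Equivalent in spirit to count(instance('dataset')/root/item).
dataset_size = len(dataset)

def is_valid_sample_size(requested, dataset_size):
    """A sample needs at least 1 item and cannot exceed the dataset size."""
    return 1 <= requested <= dataset_size
```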

4. Extract a random sample set using a repeat group

The next step is to pull each of the random samples out of the dataset, according to the now randomized indices calculated above. For each repeat iteration, we get the index of the next (random) sample of interest by looking up the corresponding position in the randomized array:

This calculation probably needs some explaining:

selected-at(${randomized}, position(..)-1)

position(..) is basically the current repeat iteration we are in: 1, 2, 3...
selected-at(${randomized},n) is used to lookup the n-th element of our randomized 'array' of numbers. It's important to note that the selected-at() function is zero-indexed, so the first element of the array (ie the first randomized number) is at index 0, the next at index 1, and so on. This is why we subtract 1 from position(..).

So during each iteration, we get the index (ie name) of a different (and non-repeating!) item in our dataset. If our randomized list of numbers is "22 42 3 15 10 ..." then in the first iteration we get index 22, the next iteration we get index 42, and so on. In this way, each iteration targets a different item in our dataset. We further limit the total number of repeat iterations by setting its repeat_count to the sample size. So we basically keep pulling the next randomized index out of our array till we reach the desired sample size.
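The repeat logic can be mimicked in a few lines of Python. This is a sketch only: selected_at is a hypothetical stand-in for the XPath function, and returning an empty string for an out-of-range index is an assumption of the sketch rather than something I have verified.

```python
def selected_at(space_separated, index):
    """Zero-indexed lookup into a space-separated string."""
    parts = space_separated.split()
    # Assumption: out-of-range lookups yield an empty string.
    return parts[index] if 0 <= index < len(parts) else ""

randomized = "22 42 3 15 10"
sample_size = 3  # plays the role of repeat_count

# position runs 1, 2, 3... like position(..) inside the repeat,
# so we subtract 1 to get the zero-based array index.
picked = [selected_at(randomized, position - 1)
          for position in range(1, sample_size + 1)]
```

With the example list above, the three iterations pick 22, 42 and 3, in that order.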

Once we have the index of a specific (random) sample, we can lookup whatever desired field data we want about it, with calculations of the form:

instance('dataset')/root/item[name=${index}]/label

This is actually no different from how you would normally reference values in a dataset, as already described in some detail here: https://docs.getodk.org/form-datasets/#referencing-values-in-datasets. Basically, replace 'label' above with whichever column of your dataset you want the value of for this specific sample (ie the one matching [name=${index}]). For example, 'Website'.
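In Python terms the lookup is just "find the item whose name matches, then read one of its fields". The rows below are invented placeholders, not entries from the attached dataset, and lookup is a hypothetical helper name:

```python
# Invented placeholder rows; a real dataset would come from the choices sheet.
dataset = [
    {"name": "3", "label": "Org C", "Website": "https://example.org/c"},
    {"name": "22", "label": "Org V", "Website": "https://example.org/v"},
    {"name": "42", "label": "Org AP", "Website": "https://example.org/ap"},
]

def lookup(dataset, index, field):
    """Mimics instance('dataset')/root/item[name=index]/field."""
    for item in dataset:
        if item["name"] == index:
            return item[field]
    return ""  # no matching item

label = lookup(dataset, "22", "label")
website = lookup(dataset, "22", "Website")
```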

5. [Optional] Show the random sample

At the end of the form you could choose to display a select_multiple question with the entire dataset. This is purely for display purposes, to show all the samples and indicate which have been randomly selected. Consequently, this question should be set read-only so the user can't actually change the prescribed random selection. If you don't wish to display this in your form, just set the select_multiple's relevant=NO to hide it. But don't delete this question!

Result

in Enketo:

in Collect:

Here is the final form, with my dataset of 100 entries included.

randomsample.xlsx (23.5 KB)

Have a play around with your own dataset and let me know what you think!

Postscript

The get_size and show_sample field-list groups in my form are not terribly important; I'm just using them to display things nicely on the same screen in Collect.

As indicated, this random sampling form only works with internal instance datasets; the form as provided above unfortunately won't work against an external instance dataset (e.g. an accompanying csv attachment). But watch this space...
[UPDATE 2023-09-17: Please see a subsequent post with an updated solution that supports external instance datasets. -Gareth]

wroos · September 10, 2023, 12:10pm

Thanks @Xiphware !
One question, please.

Why keep it with relevant=NO? As far as I know, the value will then be deleted on save/re-open.

Might be great, to add a link to your second posting, how to randomise questions? Randomizing the order of questions

Xiphware · September 10, 2023, 8:22pm

Having this (multi) select question present in the XLSForm definition is necessary in order to ensure pyxform generates the necessary internal instance nodeset in the resulting XForm definition that eventually gets run by Collect/Enketo. Without it, ODK Validate throws a bunch of errors about not finding the nodeset; because the included dataset won't otherwise be overtly referenced by any select question's choices, presumably pyxform doesn't bother adding a nodeset for it into the form [@LN correct?].

Basically, the question has to be there to keep pyxform and Validate happy; you could probably manually remove the select_multi's associated control and binding entirely from the final XForm definition and it might still run ok, but it's simpler to just hide it in the XLSForm IMHO.

Might be great, to add a link to your second posting, how to randomise questions?

I already had.

wroos · September 10, 2023, 10:05pm

Oh, sorry, I found the link now.

LN · September 11, 2023, 4:11am

Yes, that's the current behavior. We've speculated it could be this way because sometimes teams use a standard choice sheet but only reference some of the lists in any given form. Because some lists used like this are likely to be very large (e.g. all named locations in a country), it can be a helpful optimization.

That said, I don't know how common what I described above really is and @Lindsay_Stevens_Au and I have gone back and forth on whether we should maintain this behavior. It might be a nice companion to always generating secondary instances for selects to also always include all choice lists as secondary instances. We could also keep track of all instance names referenced by instance function calls and make sure those are included too.

Xiphware · September 11, 2023, 4:46am

No worries, it's only when weirdos like me start mucking about that it becomes a problem.

LN · September 11, 2023, 5:00am

There are lots of useful applications of lookup tables and we even have some examples in the docs that use this awkward trick of having a non-relevant select to force inclusion of a secondary instance. I'm now leaning towards always including all choice lists as secondary instances.

Lindsay_Stevens_Au · September 11, 2023, 5:12am

^ ticket for choice list inclusion: https://github.com/XLSForm/pyxform/issues/647

Xiphware · September 17, 2023, 3:04am

To follow up on my original posting, it turns out accommodating external datasets didn't take too much effort! I also made a couple of improvements to the original (ie internal) random sampling form.

First, I removed the need to have a contiguous index field - 1,2,3...N - for every item in your dataset. Although this remains a strict requirement for the related randomizing question order form, it's not actually required for this random sampling workflow; instead, you just need a field in your dataset containing a unique identifier for each element [Important: this id cannot contain spaces!]. The name of this id field isn't terribly important, but I've kept it as the regular choices 'name'. You can change the name of this field, to perhaps something that maps directly to your particular dataset, but you will then have to make the corresponding changes to the form, and you may get Validate errors being thrown when pyxform can't find the name and label fields that it is expecting for choice lists. So I'd probably recommend changing your dataset id field to 'name' until you are comfortable with exactly what this form is doing and how.
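The no-spaces restriction presumably follows from how the shuffled ids are stored: they are joined into a single space-separated string and later picked apart again by position. A small Python sketch with made-up ids shows what goes wrong if an id itself contains a space:

```python
# Made-up ids; the first one contains a space.
ids_with_space = ["acme corp", "globex"]
joined = " ".join(ids_with_space)

# Splitting on spaces can no longer tell where one id ends.
tokens = joined.split()   # three tokens instead of two ids

# Ids without spaces survive the round trip intact.
safe_ids = ["acme_corp", "globex"]
safe_tokens = " ".join(safe_ids).split()
```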

Second, I added in an optional filter on the overall (external) dataset, in case you might want to restrict the sub-sampling to a particular subset of your data. In the example shown here, I've restricted the random sample to only organizations with over 1000 employees. This requires adding an appropriate choice-list filter in a couple places; specifically

once(join(' ', randomize(instance('dataset')/root/item[Employees>=1000]/name)))

and

count(instance('dataset')/root/item[Employees>=1000])

If you don't need to filter and simply want to take a random sample from your entire dataset, just replace "Employees>=1000" with "true()" in both places.
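The reason the filter has to appear in both places is that the shuffled list and the maximum sample size must be derived from the same subset. A minimal Python sketch of the idea, using invented rows rather than the attached dataset (matches is a hypothetical name):

```python
import random

# Invented rows for illustration only.
dataset = [
    {"name": "a1", "Employees": 250},
    {"name": "b2", "Employees": 1000},
    {"name": "c3", "Employees": 5400},
    {"name": "d4", "Employees": 80},
]

def matches(item):
    """The filter; return True unconditionally to sample the whole dataset."""
    return item["Employees"] >= 1000

# Same predicate feeds both the shuffle and the count.
eligible = [item["name"] for item in dataset if matches(item)]
max_sample_size = len(eligible)

rng = random.Random()
rng.shuffle(eligible)
randomized = " ".join(eligible)
```

If the count used the unfiltered dataset instead, a user could ask for more samples than the shuffled list actually contains.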

Most everything else is much the same as the internal dataset random sampling form, other than obviously using a select_multiple_from_file question at the end instead of the original select_multiple.

Random sampling with external dataset form

The new random sampling form - for external datasets - is below, along with the sample dataset that I used (basically, the same data as previously, but now in an actual external csv file rather than copied into the choices sheet).

randomsample_external.xlsx (10.3 KB)

dataset.csv (13.2 KB)