Creating Fake ODK Data

Dear ODK-ers,

1. What is the problem? Be very detailed.
I often find myself creating ODK forms in XLSForm, and then going through the forms over and over again, trying to create an initial fake dataset so I can build out an analysis dashboard before an assessment goes live. But I get annoyed creating a fake dataset manually, especially when I do it over and over again. This gets even more tedious when working with a very long form/questionnaire. I'm wondering if there's a solution that people use to quickly create a fake dataset with 100 or more responses.

2. What app or server are you using and on what device and operating system? Include version numbers.

I typically build the forms in XLSForm, host them on a KoboToolbox server, and collect data using ODK Collect.

3. What have you tried to fix the problem?

I've Googled "how to create fake datasets on ODK". I found the following:
The "ODK Test Data Generator tool" thread (post #6 by kayr), but this tool doesn't seem to be live anymore. On that thread, @Yaw also pointed to a thread about stress-testing the server, although I couldn't quite follow that.

I also saw a GitHub project, KoboSync, which has something of a random data generator: https://github.com/kobotoolbox/kobosync. However, I couldn't quite understand how to make it work from the existing documentation.

Perhaps one of my challenges is that I'm not very confident going into the CLI (command line) and trying things without really knowing how to do it. If anyone knows how to use one of the above two solutions, great! Or if there's another solution...

4. What steps can we take to reproduce the problem?

If it would be helpful, I would be glad to put up a simple example XLSForm that I'm working with to get the example going.

5. Anything else we should know or have? If you have a test form or screenshots or logs, attach below.

If this is something that someone has already figured out and knows how to do, I would love the help. If it's too big a problem, then just let me know.

Thanks!
Janna

This is something I developed a while ago. I am not sure if the ODK libraries that I used still work.

If you do not mind, you can create a user account for me on your server and I can try to verify whether the data generator still works with the latest forms.

If it does then I could send you instructions on how to run it.

That being said the CLI application is pretty easy to work with since it has a Wizard-like interface once you get used to it.

I do know some things do not work and I never got around to fixing them, like pulldata() or any other new features that may have been implemented since.

What I used to do was to remove such advanced functions... generate the data, then put back the advanced functions.

Regards,
Ronald


@janna I usually hack together a Python script to do this. For example, here is a post-covid-submissions.py script that I used to generate realistic submissions for the demo of the WHO COVID-19 Contact Tracing Form.

It'd be awesome if someone could build a friendly web UI to generate fake data. @kayr's CLI tool would be a great place to start.


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import uuid
import random
import requests

# template is from a minimal submission
template = "<?xml version='1.0' ?><data id=\"covid-19_A0\" version=\"2020032802\" xmlns:ev=\"http://www.w3.org/2001/xml-events\" xmlns:h=\"http://www.w3.org/1999/xhtml\" xmlns:jr=\"http://openrosa.org/javarosa\" xmlns:odk=\"http://www.opendatakit.org/xforms\" xmlns:orx=\"http://openrosa.org/xforms\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><device_id>%s</device_id><start_time>%s</start_time><end_time>%s</end_time><case><case_id>%s</case_id><status>%s</status></case><data_collector><dc_name>%s</dc_name><dc_institution>%s</dc_institution></data_collector><case_info><sex>%s</sex></case_info><case_id_dob><age_years>%s</age_years><age_months>%s</age_months></case_id_dob><case_status>%s</case_status><symptoms_1><fever>%s</fever><sore_throat>%s</sore_throat><runny_nose>%s</runny_nose><cough>%s</cough></symptoms_1><symptoms_2><shortness_of_breath>%s</shortness_of_breath><vomiting>%s</vomiting><nausea>%s</nausea><diarrhoea>%s</diarrhoea></symptoms_2><meta><instanceID>%s</instanceID></meta></data>"

# bearer token is taken from the app user url
post_headers = {'Authorization': 'Bearer ABC123', 'Content-Type':'application/xml'}

yes_no = ["yes", "no"]

for i in range(0, 100):
   
    # generate fake data
    device_id = "%015d" % random.randint(1, 999999999999999)
    # pick a start time, then end 1 to 59 minutes later so end_time is never before start_time
    start_s = random.randint(0, 23 * 3600 - 1)
    end_s = start_s + random.randint(60, 3599)
    start_time = "2020-03-29T%02d:%02d:%02d.000-07:00" % (start_s // 3600, start_s % 3600 // 60, start_s % 60)
    end_time = "2020-03-29T%02d:%02d:%02d.000-07:00" % (end_s // 3600, end_s % 3600 // 60, end_s % 60)
    case_id = "%05d" % random.randint(1, 99999)
    status = random.choice(["alive", "dead"])
    dc_name = random.choice(["Alexander", "Alice", "Ayesha", "Benjamin", "Charlotte", "Do Yoon", "Emilia", "Emily", "Emma", "Francesco", "Gabriel", "Ha Yoon", "Hiroshi", "Hugo", "Jakob", "James", "Jose", "Junior", "Li", "Liam", "Louise", "Lucia", "Maria", "Mohammed", "Muhammed", "Noah", "Nozomi", "Oliver", "Olivia", "Precious", "Saanvi", "Sofia", "Sofie", "Tamar", "Wei", "William"])
    dc_institution = random.choice(["WHO", "CDC", "MOH", "Hospital", "Red Cross", "Clinic"])
    sex = random.choice(["male", "female"])
    age_years = random.randint(0, 99)
    age_months = random.randint(0, 11)
    case_status = random.choice(["suspected", "probable", "confirmed"])
    instanceid = uuid.uuid4()

    # insert data into template; each of the eight symptom fields gets a random yes/no
    symptoms = tuple(random.choice(yes_no) for _ in range(8))
    instance = template % ((device_id, start_time, end_time, case_id, status, dc_name, dc_institution, sex, age_years, age_months, case_status) + symptoms + (instanceid,))

    # post to server
    result = requests.post("https://demo.example.com/v1/projects/1/forms/covid-19_A0/submissions", data=instance, headers=post_headers)
    # stop at the first rejected submission instead of silently continuing
    result.raise_for_status()

@yanokwa Thanks for this solution. This looks great! I am hoping to do something similar with ODK Central. I was wondering how you would deal with looped (repeat) questions in these types of submissions? :slight_smile:
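
On looped questions: in a submission's XML, each iteration of a repeat appears as another sibling element with the repeat's name, so a script can emit one element per iteration. Here is a minimal sketch of that idea. The form id `my_form`, the repeat name `contact`, and its fields `contact_name` and `contact_age` are all hypothetical and would need to match your own form.

```python
import random
import uuid
import xml.etree.ElementTree as ET


def build_submission(form_id, version, n_contacts):
    """Build submission XML with one <contact> element per repeat iteration."""
    root = ET.Element("data", {"id": form_id, "version": version})
    ET.SubElement(root, "case_id").text = "%05d" % random.randint(1, 99999)

    # A repeat is serialized as repeated sibling elements with the same name.
    for _ in range(n_contacts):
        contact = ET.SubElement(root, "contact")  # hypothetical repeat name
        ET.SubElement(contact, "contact_name").text = random.choice(["Alice", "Liam", "Wei"])
        ET.SubElement(contact, "contact_age").text = str(random.randint(0, 99))

    meta = ET.SubElement(root, "meta")
    ET.SubElement(meta, "instanceID").text = "uuid:" + str(uuid.uuid4())
    return ET.tostring(root, encoding="unicode")


xml_text = build_submission("my_form", "1", n_contacts=3)
print(xml_text)
```

The resulting string could be posted the same way as `instance` in the script above. Varying `n_contacts` per submission (for example with `random.randint`) gives a more realistic spread of repeat counts.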

For those looking for an answer to my previous question: I identified the XML of a previous submission using the ODK Central API. The XML of that specific submission is quite long, so I will not share it here, but I recommend this as a way of identifying the XML format of your specific dataset :slight_smile:
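
Once you have the XML of one real submission, listing its leaf elements shows which fields a template needs and where any repeats are. Below is a rough sketch of that step. The `SAMPLE` string is a made-up stand-in for a downloaded submission, and `leaf_paths` is a helper name I invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the XML of a previously downloaded submission
SAMPLE = """<data id="my_form" version="1">
  <case><case_id>00042</case_id><status>alive</status></case>
  <contact><contact_name>Alice</contact_name></contact>
  <contact><contact_name>Wei</contact_name></contact>
  <meta><instanceID>uuid:1234</instanceID></meta>
</data>"""


def leaf_paths(element, prefix=""):
    """Return (path, text) for every element that has no child elements."""
    path = prefix + "/" + element.tag
    children = list(element)
    if not children:
        return [(path, element.text)]
    found = []
    for child in children:
        found.extend(leaf_paths(child, path))
    return found


fields = leaf_paths(ET.fromstring(SAMPLE))
for path, value in fields:
    print(path, "=", value)
```

A path that shows up more than once (here `/data/contact/contact_name`) points to a repeat group, which is the part a fake-data script has to generate in a loop.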


Hi @janna, @yanokwa, @kayr,

Sorry for all of the updates! I have been building on this a little more for a project I am working on (mostly in R). Generation of data is only a small part of the project, but I thought it might be useful to share the functions I have worked on in case they are of any use to anyone!

I have shared functions to generate mock data using only the original survey XLSForm, along with an example of how I use these functions to submit fake data to an ODK Central project.
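
To give a rough idea of the approach for anyone following along in Python (this is not the R code itself, just a simplified sketch): read the question types from the form's survey sheet and pick a random value per type. The `survey` and `choices` data below are hypothetical stand-ins for rows read from an XLSForm, and the sketch ignores relevance, constraints, and repeats.

```python
import random

# Hypothetical rows, as they might be read from the "survey" and "choices" sheets
survey = [
    {"type": "text", "name": "dc_name"},
    {"type": "integer", "name": "age_years"},
    {"type": "select_one yes_no", "name": "fever"},
    {"type": "select_multiple symptoms", "name": "other_symptoms"},
]
choices = {
    "yes_no": ["yes", "no"],
    "symptoms": ["cough", "nausea", "vomiting"],
}


def fake_value(question_type, choices):
    """Return a random value for a handful of XLSForm question types."""
    parts = question_type.split()
    kind = parts[0]
    if kind == "integer":
        return str(random.randint(0, 99))
    if kind == "select_one":
        return random.choice(choices[parts[1]])
    if kind == "select_multiple":
        options = choices[parts[1]]
        picked = random.sample(options, random.randint(1, len(options)))
        return " ".join(picked)  # select_multiple answers are space-separated
    # fallback for text and any type this sketch does not handle
    return "text-%d" % random.randint(1, 999)


def fake_record(survey, choices):
    """Return one fake response as a dict of question name to value."""
    return {row["name"]: fake_value(row["type"], choices) for row in survey}


records = [fake_record(survey, choices) for _ in range(100)]
print(records[0])
```

Each record could then be written into a submission XML template and posted, as in the Python script earlier in this thread.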

Feel free to fork and to chop and change as you need. Just thought it might be helpful to share!
