R script to download from Central Server and create SPSS Syntax file for importing data into SPSS

Here is a script in R using super useful packages "ruODK" and "repvisforODK" to create an SPSS syntax file to import all VARIABLE LABELS and VALUE LABELS
This also breaks select-multiple variables into variables named by options and the values are 1 for TRUE and 0for FALSE

Please ignore the data_d.xlsx probably it will be empty for most. We create some calculated variables with "d_" for our projects. This is for that.
The data_L.xlsx contains labelled values and variables.

SPSS syntax file is spss_syntax.sps
SPSS data file is spss_data.xlsx

1st import the data excel in SPSS then run the syntax to get all labels!

I have not tested the script in several environments. However, it worked for me on several projects.
Request feedback especially if you find a bug.

odk-to-spss-script.txt (8.3 KB)

5 Likes

Hi @Debjit_Biswas1 and many thanks for sharing your R script, this can be indeed very handy for doing further analyses with SPSS.

Just a small comment: I think ruODK should be sufficient here as it is the R package that allows to retrieve data from ODK. repvisforODK has a complementary purpose and is relying on ruODK to generate configurable visualisation/reporting templates.

From a quick look at your code, it seems to me that you are only using the function setup_ruODK from the repvisforODK package - this function is actually just a simple wrap of the ru_setup function from the ruODK package and meant to be used with one single parameter, the remaining parameters being fed by the defaults. It is a convenience function but does not bring any additional functionality compared to ruODK. In fact, I would probably define this function slightly differently with only one parameter and the default sets in the function body so that it does not have exactly the same behavior as ru_setup when you set all the parameters manually (as you did in your script) and I think such a variant of the ru_setup function would probably be better placed in the ruODK package rather than in repvisforODK - apologies for the confusion. If you don't use any other function from repvisforODK, then I would rather advise that you remove the dependency to this package. If you want to generate and customise automated HTML reports, then you are obviously more than welcome to keep it and encouraged to use it :slight_smile:

1 Like

Hi @Thalie! Thanks for your feedback. The only reason I used repvisforODK is to identify the single-select and select-multiple questions in the schema.

singlec <- names(single_choice_question_pie(df = data_r, df_schema_ext = schema, choice_col = 'choices_english', label_col = 'label_english'))
multic <- names(multiple_choice_question_bar(df = data_r, df_schema_ext = schema, choice_col = 'choices_english', label_col = 'label_english'))

Is there any method in the ruODK package to identify them? Please suggest.
I could probably try to identify them using the data or the XLSForm, but was in a hurry and used the repvisforODK to achieve this.

Ok noted, I had missed those ones. I think it is fine for your own code, but not advisable for users in general as repvisforODK should not overlap with ruODK functionalities.

Also from what I remember when discussing with @lucidviews (who created the repvisforODK package), I understood that it is not possible to fully discriminate between single and multiple questions only from the extended schema so that he had to use the content of the dataset to identify those columns which only had single entries. This means that if a multiple select question has only received single entries at the time of the export, this question will be classified as a single choice question. This is obviously more likely to happen at the start of the data collection, but not totally impossible.
Now looking at the doc of ruODK more in details, it may be actually more robust to rely on ruODK::form_xml to retrieve the XML representation of the form as the XMLForm relies on select (multiple selection) and select1 (single selection) elements (if I am not mistaken).

@Florian_May:

  • would you see the generation of codebooks for statistical analysis tools (SAS, SPSS, STATA) as part of the scope for ruODK? On my side, I am using Python to format XLSForms that are stored locally, but retrieving the structure of the form from the server as @Debjit_Biswas1 is showcasing for SPSS to generate these codebooks is definitely a much more robust/desirable approach.
    (obviously I am asking this while having absolutely no idea where to find the time to support this as my evenings and weekends are already pretty busy, but just feel it may be a nice add-on :sweat_smile:)
3 Likes

Hi @Thalie and @Debjit_Biswas1,

thanks for the kind words on ruODK!

Re SPSS / STATA / SAS / Primer integration in ruODK

In my original thinking, integration of any other statistical packages was out of scope for ruODK for the plain reason that if you've made it that far to access ODK Central data in R, then all statistical analyses, visualisations, and even general purpose programming are available straight away. So, once the client can formulate their analytical needs, the analyst can implement that cost-free and reproducibly in R.

ruODK is already a fairly heavyweight package due to the spatial and data vis dependencies.
To keep the package scope from further proliferation, I would suggest that integration with other statistical packages could be provided as separate packages using ruODK to access data, then generate the starting point to import the data and configure the statistical packages.

I would be the wrong developer for this, as I don't have need for or access to any of these packages. If however anyone would like to take this on, I'd be happy to advise and provide feedback.

Re multiple selects

ODK Central offers the option to export multiple selects into separate columns. I might need to add this option to ruODK.

In my own forms, I like to submit a full record with all options selected and clearly marked as "training/test record" to make sure that all fields have at least one value present. This way, Central will export the data with all fields present, even if most or all submissions leave some fields empty.

Hope this helps!

1 Like

odk_central_to_spss_R.txt (11.0 KB)
Hello @Florian_May
Thank you for your views. I quite agree with the fact that if I have ODK central data in R then analysing the same in R would be a natural choice. As a matter of fact, I started by doing the same.

However, there are two facts that kept me working towards the SPSS bridge:

  1. We have a team of analysts trained on SPSS. They had to hand-code the SPSS syntax with ODK Central data. This was time-consuming
  2. Also, a few of our clients would prefer SPSS data output over plain CSV

Therefore this SPSS bridge.

Hello @Thalie
I have worked on the script further and removed the repvisforODK package as suggested. Instead, I have included the XLS form as input. Improved the script further with multiple select options on the Variable Labels. I will continue to work on it as I am using this as our daily driver for SPSS conversion and will continue to update if there is any change in the script.
Thanks for your suggestions!

I have added 3 exports in the script:

  1. Export similar to ODK Central export with multi-select questions broken into option columns. (Not dependent on data, as in ODK Central. I am reading the XLS Form for the options list) data_out.xlsx
  2. SPSS importable XLS and the Syntax file (a pair or two files) spss_data.xlsx spss_syntax.sps
  3. Another XLSX output where Variable and Value labels are in the data in place of values data_out_L.xlsx

The updated R Script is attached. Hope this helps the community

2 Likes

Thanks for the context and for sharing the script!

Short of ODK Central providing direct outputs for the major players like SPSS et al., I can see as way forward:

  • ruODK should support the new submission export parameters like splitting select multiple, tracked here
  • ruODK::form_schema should provide as much useful information as possible to reduce workload on e.g. the SPSS export script, e.g. which columns are select multiple
  • Would you consider publishing the script on GitHub?

Thank you @Florian_May

indeed ruODK is a great package! Once it supports the select multiple splitting, it would be more useful. I will continue to track the progress on it.
ruODK::form_schema does it support the identification of select_multiple and select_one questions? Maybe I missed it. Will check the manual more carefully. I could not identify the difference between them from the ruODK retrieved schema.
Will soon publish it to GitHub and update it here. Thanks for the suggestion!

@Debjit_Biswas1 indeed it does, but I need to update the function reference.

ruODK::form_schema returns the column selectMultiple (boolean) which is TRUE if a field of type "select" is a select multiple and NA if it is a select one.
I'm testing this against ODK Central v 1.4.2, make sure your instance is updated too.

Thanks, @Florian_May will try this method and update.

Hi @Florian_May
On ODK Central 1.4.2 I am getting the selectMultiple boolean but in type, I am getting "string" in place of "select".


My central server version information is:
versions:
cd751d789e35e5cb1469cf651d6ba7275903bc03 (v1.4.2-4-gcd751d7)
461c1037997f9fb88075b75612b83cea297ced42 client (v1.4.0)
367a7f572d3ede411cea16d0fa1decb98e280e92 server (v1.4.2)

What ruODK version are you using?
I've just pushed v1.3.10 with updated docs on form_schema's selectMultiple. Probably a good idea to install the latest ruODK!
The rOpenSci r-universe build will be updated in a couple hours, so run remotes::install_github() to get the latest version.

Thanks @Florian_May
Was using v1.3.9

Hm, this string type is a mystery.
Can you share the form definition here?
My forms are XForms, are yours XLSForms or XForms?

ruODK 1.3.10 is now building on r-universe which, once finished, should update the docs.

Hi Florian, if ruODK::form_schema and ruODK::form_schema_ext have the same behaviour, I do not think you can discriminate between a single select and a multiple select from XLSForms.
This is exactly the reason why Lucas had to use both the form structure and the dataset content to guess which is which (and I had also double checked that this information was missing). I was not aware of the difference of behaviour when uploading XForms on ODK Central. This is the reason why I was suggesting having a look more at ruODK::form_xml (but I have not tested it yet).

About the use of SPSS / SAS, I think this is very specific to clinical trials (@Debjit_Biswas1 to confirm he is working in a similar setting), where we are working with statisticians who may not be proficient in R and use other statistical software instead (although on my side our marvelous main statistician is indeed a R-adept :star_struck:). Also statisticians will almost never directly access the datasets from ODK Central while data collection is active, or only at well identified time points (interim analyses, final analysis), after the datasets have been cleaned and locked, this to limit the risk of a false positive conclusion or a false negative conclusion when testing the hypotheses defined by the clinical trial protocol/analysis plan, so this is a very different mindset from other types of analysis (with a specific workflow), but we need ruODK for all the cleaning/data preparation stages (+ to run all other non-statistical fancy analyses we want :grinning:).

1 Like

Thanks @Debjit_Biswas1, I will test your code on my own forms to see how it performs and investigate whether ruODK::form_xml could be used in this case (the XML structure probably less straightforward to parse, especially if the form has a complex group structure, but this should be feasible). Keep you posted in 1-2 weeks.

Interesting to hear about clinical trial analysis, thanks for the explanations Thalie!

form_schema used to parse the XML output, now parses the much easier to read JSON output.
Worth investigating is the difference in form schema types.

Hi @Florian_May!
Updated to v1.3.10 of ruODK
However, the select questions still remain string.


All variable names starting with _sone_ are select_one and _smul_ are select_multiple questions. Attached is the XLSForm for your reference.
tasty-qre.xlsx (142.4 KB)
In any case, if we have selectMultiple boolean we can probably deal with the select thing by running a loop for all variables finding choices in the choices_english column of the schema. This will let us have distinct lists of select_one and select_multiple variables within R. But having the select type from ruODK would be really useful.

Here is the GitHub link for the SPSS syntax generator script: https://github.com/debjit-biswas/ODKCentralToSPSS

@Thalie I work for the Market Research industry as a data collection platform provider, Q're author and data analysis expert in India. We have a small team of Q're authors and statisticians (mostly trained in SPSS). Indeed our statisticians are not R-adept as you predicted :slight_smile: However, since the SPSS script, they are able to create the SPSS syntax on R all by themselves.