Not able to get KEY and PARENT_KEY using PyODK

Chimwemwe · July 9, 2024, 8:57am

1. What is the issue? Please be detailed.

Here's a revised and more concise version of your message:

Hello,

I have a Python script that I use to extract data for my workflow. I've noticed that the columns KEY and PARENT_KEY are missing in the output after running the script. I tried tracing if these columns were being pulled from the data and if any subsequent lines of code were removing them, but that doesn't seem to be the case. When I print the raw data returned from Submissions = client.submissions.get_table(form_id=form_id, table_name=table_name), these columns are missing.

Is there a way to ensure KEY and/or PARENT_KEY are included in the data? Below is the relevant part of my code:

for index, row in project_repeat_data.iterrows():
form_id = row['form_id']
table_name = row['table_name']
form_status = row['status'] # Assuming the status column name is 'status'

try:
    # Check if the form's status is "active"
    if form_status.lower() == 'active':
        # Retrieve data and process
        Submissions = client.submissions.get_table(form_id=form_id, table_name=table_name)
        print(f"Raw data for form_id {form_id}: {Submissions['value']}")
        obs_df = pd.json_normalize(data=Submissions['value'], sep='/')

Joseph_Muganga · July 9, 2024, 10:09am

Hi @Chimwemwe

Based on my observation, pyODK does not return "KEY" and "PARENT_KEY" through the API when returning the json file, as when downloading the data as CSV on ODK Central.

Instead, it provides "__id" as a unique identifier for each observation in the base part of the form, and "__Submissions-id" as a unique identifier for each observation within repeat groups, if they exist.

You can utilize "__id" as a replacement for "KEY" and "__Submissions-id" as a replacement for "PARENT_KEY" in your data processing."

This should clarify how to handle identifiers in your workflow using pyODK.

Chimwemwe · July 9, 2024, 11:41am

Thanks, @Joseph_Muganga; unfortunately, this doesn't address my problem. I've always used the columns instanceID from the parent table and __Submissions-id from the repeat table, but when you alter a record from central, the __Submissions-id changes, removing the link that was previously there. The only columns that may be used to link the two tables after updating/editing the data are KEY and PARENT_KEY.

rfvieira · July 11, 2024, 1:15am

Hi @Chimwemwe, sorry for bringing something new but I'm using Pandas to manipulate Central data in Python without any problems even after many edits. The code is something like this:

import pandas as pd

# main form id on Central
FORM_ID = 'MyFormID'

# you have to stablish a connection to Central using pyODK and your credentials as shown in the tutorial.

# getting submissions list for the main form
subm_list_master = client.submissions.get_table(form_id=FORM_ID, filter="__system/reviewState ne 'rejected'")['value']

# the submission list is a list of dictionaries so we can pop system and meta data like this
[d.pop('__system') for d in subm_list_master]
[d.pop('meta') for d in subm_list_master]

# pandas dataframe for the main form data
df_master = pd.DataFrame(pd.json_normalize(subm_list_master))

# submission list for the detail data, usually a table created by a repeat structure
subm_list_detail = client.submissions.get_table(form_id=FORM_ID, table_name='Submissions.MyRepeat')['value']

# pandas dataframe for the detail (repeat) data
df_detail = pd.DataFrame(pd.json_normalize(subm_list_detail))

# now we can merge both dataframes using Pandas pd.merge()
df_all_data = pd.merge(left=df_master, right=df_detail, left_on='__id', right_on='__Submissions-id', how='inner')

The KEY and PARENT_KEY follows @Joseph_Muganga explanation.