Pagination in API (with PyODK or Odata)

Hi all,

I have a large dataset that I want to retrieve with an API call. At first I did my calls with PyODK. However, my table has grown considerably. Recently I get ‘out of memory’ errors when calling for the data, probably because with PyODK, I call the entire table at once…. So I need pagination (in combination with a carbage cleaning process). There are several ways to do this (using the ‘skip’ and ‘top’ parameters. But I also read another option (more robust) is using the @odata.nextLinkodata.nextLink that is generated when you use the ‘top’ parameter).

I now tried to build an Python object to do the call with pagination, but failed so far. I got stuck with either using the ‘skip/top’ parameter@odata.nextLinkor using the @odata.nextLink. AI did not give me a good solution yet….:-). Now I tried the @odata.nextLink but I run against limitation using PyODK (not supporting an url…). Anyway, are there any experiences how I can best solve my pagination problem? I suppose I just need to use the Odata methods…? I am happy with any clarification that could send me in the right direction.

PS: My dataset is 22k rows with a large number of columns…

Thanks for the advise!

Hi, perhaps it would be suitable to use the filter parameter to fetch smaller chunks of data in each request - for example if the purpose of the script is to poll for data added or updated in the past day; or if all the data is needed every time the script runs then request it in a loop (day by day). The pyODK docstring for filter is:

Filter responses to those matching the query. Only certain fields
are available to reference (submitterId, createdAt, updatedAt, reviewState).
The operators lt, le, eq, neq, ge, gt, not, and, and or are supported, and the
built-in functions now, year, month, day, hour, minute, second.

The ODK docs AI suggested the following (I didn’t test it) to get submissions created or updated for a specific date (in UTC timezone) but it could also perhaps be simplified using one of the mentioned functions (e.g. day()) instead of providing a timestamp:

from pyodk.client import Client

client = Client().open()

day = "2024-01-15T00:00:00.000Z"
next_day = "2024-01-16T00:00:00.000Z"

f = (
    f"((__system/submissionDate ge {day} and __system/submissionDate lt {next_day}) "
    f"or (__system/updatedAt ge {day} and __system/updatedAt lt {next_day}))"
)

subs = client.submissions.get_table(
    form_id="your_form_id",
    project_id=1,
    filter=f,
)
rows = subs["value"]

The Central API docs have some more details and links about using the filter parameter: https://docs.getodk.org/central-api-odata-endpoints/#data-document

Edit: it may also help with memory usage to only request the submission fields that the script needs to perform its task - this can be done with the select parameter, which expects a comma-separated list of field names. This parameter is documented as:

If provided, will return only the selected fields.
Example: __id, age, name, meta/instanceID

Hi Edmond,

When you say PyODK is not supporting a URL, what do you mean?

If you need to use an endpoint in the Central API that is not explicitly supported in PyODK, you can use the raw .get() and .post() HTTP methods. For these methods, you should only pass the part of the URL that comes after “v1/”. So when using @odata.nextLink you’ll need to strip out the first part: http://www.yourdomain.com/v1/. PyODK already has this domain information as part of initializing the client.

Hi @ktuite and @Lindsay_Stevens_Au ,

Thanks for the tips! Very useful. In the meantime I managed to solve the issue by using the Odata get() using this in a Python class:

"""Class to get ODK submissions in pages"""
class ODKCentralClient:
def init(self, base_url, default_project_id, table_name, username, password, page_size):
"""
Initialize the client with ODK Central credentials and settings.

    :param base_url: Base URL of the ODK Central server (e.g. https://your-odk-server)
    :param default_project_id: ID of the project to access
    :param username: Username for Basic Auth
    :param password: Password for Basic Auth
    :param page_size: Number of submissions per page (default 200)
    """
    self.base_url = base_url.rstrip('/')
    self.default_project_id = default_project_id
    self.auth = HTTPBasicAuth(username, password)
    self.page_size = page_size
    self.table_name = table_name

def _build_endpoint(self, form_id):
    """
    Build the OData submissions endpoint URL with page size limit.

    :param form_id: The form ID (not form name)
    :return: Full URL string
    """
    return (f"{self.base_url}/v1/projects/{self.default_project_id}/forms/"
            f"{form_id}.svc/{self.table_name}?$top={self.page_size}")

def get_all_submissions(self, form_id):
    """
    Fetch all submissions for a form, handling pagination.

    :param form_id: The form ID to fetch submissions from

    :return: List of all submissions (each submission is a dict)
    """
    endpoint = self._build_endpoint(form_id)
    all_submissions = []
    page_number = 1

    while endpoint:
        response = requests.get(endpoint, auth=self.auth)
        if response.status_code == 200:
            data = response.json()
            current_page_submissions = data.get('value', [])

            print(f"Page {page_number} has {len(current_page_submissions)} submissions.")
            all_submissions.extend(current_page_submissions)

            endpoint = data.get('@odata.nextLink')
            if endpoint and not endpoint.startswith('http'):
                endpoint = f"{self.base_url}{endpoint}"
            page_number += 1 if endpoint else 0
        else:
            raise Exception(f"Failed to fetch data. Status code: {response.status_code}, Response: {response.text}")

    print(f"Total submissions fetched: {len(all_submissions)}")
    return all_submissions

This Python class seems to work well. I don’t get the memory message anymore.

The only issue I now have is with some specific endpoints, for example an endpoint with this table name:

“Submissions.group_existing_site.group_add_photos.repeat_additional_photos”

I get this authenticate error message:

“Traceback (most recent call last):

File "/Users/name/Documents/Python_scripts/GetODK/download_monitoring_data_v4.py", line 324, in 

json_additional_photos = client.get_all_submissions(form_id)

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/Users/name/Documents/Python_scripts/GetODK/download_monitoring_data_v4.py", line 150, in get_all_submissions

raise Exception(f"Failed to fetch data. Status code: {response.status_code}, Response: {response.text}")

Exception: Failed to fetch data. Status code: 401, Response: {"message":"Could not authenticate with the provided credentials.","code":401.2}”

It is also strange that this table name does not create any problem using PyODK in data requests…

For other endpoints, there is no problem (also with repeats) but for this specific one there is, only with get()… Very strange.

Solved. It was an error in my script loosing the credential settings at later calls… Anyway, this pagination Python class works.

2 Likes