Interest in using async methods for pyodk

Context

A while ago @rsavoye made osm-fieldwork, which includes some utils for calling the ODK Central API (OdkCentral.py).

We use this in the Field Mapping Tasking Manager, but now that pyodk exists, it would be nice to combine efforts and contribute there instead.

Issue with sync methods

An issue we have been facing with osm-fieldwork is that making multiple Central API calls can be quite slow if not done concurrently.

Our use case is creating tens to hundreds of forms via the API, uploading attached media, and creating app users for the forms, which I understand may be atypical usage.

Possible solution with async

I believe that making the calls async would improve performance significantly, as requests can be made concurrently (async does not run code in parallel, but the time spent waiting on network responses overlaps, so in practice the requests proceed concurrently).
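To illustrate the difference, here is a minimal sketch using only the standard library. `fake_request` is a hypothetical stand-in for a Central API call (it only sleeps), so the timings show how the waits overlap, not real Central performance.

```python
import asyncio
import time


async def fake_request(i: int, delay: float = 0.05) -> int:
    """Hypothetical stand-in for one Central API call; it only sleeps."""
    await asyncio.sleep(delay)
    return i


async def run_sequential(n: int) -> list[int]:
    # Each call is awaited before the next one starts, so the waits add up.
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n: int) -> list[int]:
    # All calls are scheduled on one event loop, so the waits overlap.
    # gather returns results in the order the awaitables were passed in.
    return list(await asyncio.gather(*(fake_request(i) for i in range(n))))


def timed(coro) -> tuple[list[int], float]:
    # Run a coroutine to completion and report how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


seq_result, seq_time = timed(run_sequential(10))
con_result, con_time = timed(run_concurrent(10))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions run on a single thread; the concurrent one is faster only because the sleeps (standing in for network waits) happen at the same time.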

As I would be adding this to osm-fieldwork anyway, I was thinking now is a good time to start contributing to pyodk instead.

Request

Would there be any interest in me adding async support to pyodk?
Does anyone in the community or dev team (@Lindsay_Stevens_Au) think this is valuable for the project?

Of course this would mean replacing requests with something like aiohttp or httpx, and would significantly change the usage of pyodk, requiring await to be used on each method call.

There may be a way to design this so that sync usage can be preserved, if preferred. Perhaps via a different import but sharing mostly the same logic underneath.
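As a rough sketch of that idea, a sync class could delegate to the async implementation. The class and method names here (`AsyncClient`, `SyncClient`, `list_projects`) are hypothetical and are not pyodk's actual API; the HTTP call is replaced by a placeholder.

```python
import asyncio


class AsyncClient:
    """Hypothetical async client; not pyodk's actual API."""

    async def list_projects(self) -> list[dict]:
        await asyncio.sleep(0)  # placeholder for a real HTTP request
        return [{"id": 1, "name": "demo"}]


class SyncClient:
    """Hypothetical sync facade that reuses the async logic."""

    def __init__(self) -> None:
        self._inner = AsyncClient()

    def list_projects(self) -> list[dict]:
        # asyncio.run starts an event loop, runs the coroutine, then closes
        # the loop. It raises RuntimeError if called from code that is
        # already running inside an event loop.
        return asyncio.run(self._inner.list_projects())


projects = SyncClient().list_projects()
print(projects)
```

One trade-off of this design is that the sync facade cannot be called from inside an async framework; code running there would use the async class directly.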

Related Info

We would be likely to add Entity support to osm-fieldwork in the very near future too, so perhaps we could help with https://github.com/getodk/pyodk/issues/62 too.

@Ivangayton @Niraj_Adhikari

Could you please elaborate on / quantify:

  1. what you mean by "quite slow" (e.g. uploading 100 submissions takes 2 minutes, 10 minutes, 2 hours, ...; attachment sizes 10 KB, 10 MB, 10 GB, ...).
  2. the use case for doing high volume updates in minimal time (why is it time sensitive, why is there so much data to transfer, etc)?

Have you:

  1. tried using a thread pool, where each job has its own client/session object? A pool sized up to the available threads on the host might deliver the desired throughput boost.
  2. investigated why the requests are taking longer than desired? That is, Central and its resources (CPU, RAM, network, disk) have to be able to keep up with the requests, so sending more requests may not be faster.
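For reference, the thread pool suggestion could look something like the sketch below. `FakeSession` and `upload_form` are hypothetical placeholders for a real client/session and a real upload call, and `max_workers=8` is an arbitrary value chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Hypothetical stand-in for a per-job client/session object."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # do not suppress exceptions

    def upload(self, form_id: str) -> str:
        return f"uploaded {form_id}"


def upload_form(form_id: str) -> str:
    # Each job creates its own session, so no session is shared across threads.
    with FakeSession() as session:
        return session.upload(form_id)


form_ids = [f"form_{i}" for i in range(20)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # Executor.map returns results in the same order as the inputs.
    upload_results = list(pool.map(upload_form, form_ids))

print(upload_results[:3])
```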

Thanks for the response!

You asked some very good questions that helped to debug what my specific issue was (the main bottleneck was outside of the ODK Central calls).

  1. I have attached some outputs from a profiler, comparing the response time between sync and async usage of the Central API, see below.
  2. The use case is the creation of multiple forms or Entities via the Central API, initiated from a Web API (our tool, the Field Mapping Tasking Manager). We subdivide an area into mappable chunks, then create a form for each task area (hundreds of task areas). However, this may no longer be a bottleneck, because we are implementing Entities instead. Once there is an endpoint to bulk upload Entities, we shouldn't have much of an issue here.

For interest, the profiler outputs are below:

[Profiler screenshots: 1, 15, 100, and 1000 requests, each sync and async; 1000 requests with asyncio gather]

As you can see, there isn't a huge difference between sync and async usage here, both being reasonably performant (a 3s difference for 1000 requests: 10s --> 7s).

However, if batch calls are required, then usage via asyncio.gather is significantly faster, shaving the time down even further (10s --> 3.5s).
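Since the server still has to keep up with whatever is sent, a batch like this could also be capped so that only a limited number of requests are in flight at once. Below is a minimal sketch of that with `asyncio.Semaphore`; `fake_request` is a hypothetical stand-in for a Central API call, and the limit of 10 is an arbitrary assumption rather than a measured value.

```python
import asyncio


async def fake_request(i: int) -> int:
    """Hypothetical stand-in for one Central API call; it only sleeps."""
    await asyncio.sleep(0.01)
    return i


async def bounded_gather(n: int, limit: int) -> tuple[list[int], int]:
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def guarded(i: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:  # at most `limit` requests run at once
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(i)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(i) for i in range(n)))
    return list(results), peak


batch_results, peak_in_flight = asyncio.run(bounded_gather(n=100, limit=10))
print(len(batch_results), peak_in_flight)
```

The `peak` counter is only there to show that the cap holds; real code would just need the semaphore around the request.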

I will probably update our ODK Central API wrapper (part of osm-fieldwork) to be async anyway.

If this is something that is desired by pyodk, it could be contributed.

Otherwise, happy to mark this as resolved :smile:

Code

If anyone is interested, here is the code I hacked together quickly.

FastAPI endpoint:

@router.get("/test-odk")
async def test_odk(
    project_id: int,
    db=Depends(database.get_db),
):
    from osm_fieldwork.OdkCentral import OdkProject

    from app.projects import project_deps

    odk_creds = await project_deps.get_odk_credentials(db, project_id)

    async with OdkProject(
        url=odk_creds.odk_central_url,
        user=odk_creds.odk_central_user,
        passwd=odk_creds.odk_central_password,
    ) as odk_central:
        projects = await odk_central.listProjects()
        first_project = projects[0].get("id")

        # Inefficient: each request is awaited before the next one starts
        for index in range(1000):
            print(index)
            details = await odk_central.getFullDetails(first_project)

        # Efficient, asyncio.gather approach (needs `from asyncio import gather`)
        # details_tasks = [odk_central.getFullDetails(first_project) for _ in range(1000)]
        # details = await gather(*details_tasks)
    return details

Example aiohttp usage (in methods):

        async with self.session.get(url, headers=headers, ssl=self.verify) as response:
            return await response.json()

The main requirement for this post was bulk uploading Entities.

I imagine that bulk creating forms, and other edge cases we may use, are not particularly useful for pyodk / the community (so async methods are not really necessary, although they may still be useful for anyone using pyodk within an async web framework).

Looks like the requirement to bulk upload Entities will be covered by new API endpoints in Central 2024.01.

Requiring batch API calls for Entity creation is a temporary workaround, so I am closing / resolving this thread :+1:

For anyone that needs this right now, I have working code for bulk Entity uploads here: https://github.com/hotosm/osm-fieldwork/blob/5dd14cd505d4821051740a2d463c4662e18a873a/osm_fieldwork/OdkCentralAsync.py#L430
