Interest in using async methods for pyodk

spwoodcock · March 24, 2024, 2:21am

Context

A while ago @rsavoye made osm-fieldwork, which includes some utils for calling the ODK Central API (OdkCentral.py).

We use this in the Field Mapping Tasking Manager, but now that pyodk exists, it would be nice to combine efforts and contribute there instead.

Issue with sync methods

An issue we have been facing with osm-fieldwork is that making multiple Central API calls can be quite slow if not done concurrently.

Our use case is creating tens to hundreds of forms via the API, uploading attached media, and creating app users for the forms, which I understand may be atypical usage.

Possible solution with async

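I believe that making the calls async would improve performance significantly, as requests can be made concurrently. (Strictly speaking, async code interleaves tasks on a single thread rather than running them in parallel, but for I/O-bound requests the waiting time overlaps, so in practice the effect is the same.)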
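To illustrate the point, here is a minimal sketch comparing sequential awaits with asyncio.gather. Note that `fake_request` is a hypothetical stand-in that only sleeps; it is not a pyodk or Central call, and real timings will depend on the server.

```python
import asyncio
import time


async def fake_request(i: int, delay: float = 0.05) -> int:
    """Hypothetical stand-in for one HTTP call; it only sleeps."""
    await asyncio.sleep(delay)
    return i


async def run_sequential(n: int) -> list[int]:
    # Each await finishes before the next request starts.
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n: int) -> list[int]:
    # All coroutines are scheduled together, so their waits overlap
    # on a single thread.
    return list(await asyncio.gather(*(fake_request(i) for i in range(n))))


def timed(coro_fn, n: int) -> tuple[list[int], float]:
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = timed(run_sequential, 20)
    _, con_time = timed(run_concurrent, 20)
    print(f"sequential: {seq_time:.2f}s, gather: {con_time:.2f}s")
```

Both functions return the same results in the same order; only the total waiting time differs.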

As I would be adding this to osm-fieldwork anyway, I was thinking now is a good time to start contributing to pyodk instead.

Request

Would there be any interest in me adding async support to pyodk?
Does anyone in the community or dev team (@Lindsay_Stevens_Au) think this is valuable for the project?

Of course this would mean replacing requests with something like aiohttp or httpx, and would significantly change the usage of pyodk, requiring await to be used on each method call.

There may be a way to design this so that sync usage can be preserved, if preferred. Perhaps via a different import but sharing mostly the same logic underneath.
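One possible shape for this, as a rough sketch only: keep the logic in an async class and expose a thin sync wrapper around it. The class and method names here (`AsyncClient`, `SyncClient`, `get_project`) are hypothetical and are not pyodk's actual API; the HTTP call is replaced by a placeholder.

```python
import asyncio


class AsyncClient:
    """Hypothetical async client; names are illustrative, not pyodk's API."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    async def get_project(self, project_id: int) -> dict:
        # A real implementation would await an HTTP call here
        # (for example via aiohttp or httpx).
        await asyncio.sleep(0)
        return {
            "id": project_id,
            "url": f"{self.base_url}/v1/projects/{project_id}",
        }


class SyncClient:
    """Sync wrapper: same logic underneath, no `await` needed by the caller."""

    def __init__(self, base_url: str) -> None:
        self._inner = AsyncClient(base_url)

    def get_project(self, project_id: int) -> dict:
        # asyncio.run starts a fresh event loop, so this must not be called
        # from code that is already running inside an event loop.
        return asyncio.run(self._inner.get_project(project_id))
```

Existing users would import the sync class and see no change, while async users would import the async one. The trade-off is that every sync call pays for starting an event loop.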

Related Info

We would be likely to add Entity support to osm-fieldwork in the very near future too, so perhaps we could help with https://github.com/getodk/pyodk/issues/62 too.

@Ivangayton @Niraj_Adhikari

Lindsay_Stevens_Au · March 26, 2024, 10:40am

Could you please elaborate on / quantify:

what you mean by "quite slow" (e.g. uploading 100 submissions takes 2 minutes, 10 minutes, 2 hours, ...; attachment sizes 10kb, 10mb, 10gb, ...).
the use case for doing high volume updates in minimal time (why is it time sensitive, why is there so much data to transfer, etc)?

Have you:

tried using a thread pool, where each job has its own client/session object? A pool sized up to the available threads on the host might deliver the desired throughput boost.
investigated why the requests are taking longer than desired? That is, Central and its resources (CPU, RAM, network, disk) have to be able to keep up with the requests, so sending more requests may not be faster.
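As a rough sketch of the thread pool suggestion: the example below gives each worker thread its own session via `threading.local`, which is a slight variation on one session per job. `FakeSession` is a hypothetical stand-in for a real client/session object and makes no network calls.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Hypothetical stand-in for a client/session object."""

    def get(self, path: str) -> str:
        return f"200 {path}"


_local = threading.local()


def get_session() -> FakeSession:
    # Lazily create one session per worker thread, so a session is never
    # shared between threads.
    if not hasattr(_local, "session"):
        _local.session = FakeSession()
    return _local.session


def fetch(path: str) -> str:
    return get_session().get(path)


def fetch_all(paths: list[str], max_workers: int = 8) -> list[str]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(fetch, paths))
```

Whether this helps depends on the second question: if Central itself is the bottleneck, more workers will not reduce the total time.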

spwoodcock · April 2, 2024, 2:11pm

Thanks for the response!

You asked some very good questions that helped to debug what my specific issue was (the main bottleneck was outside of the ODK Central calls).

I have attached some outputs from a profiler, comparing the response time between sync and async usage of the Central API, see below.
The use case is the creation of multiple forms or Entities via the Central API, initiated from a Web API (our tool, the Field Mapping Tasking Manager). We subdivide an area into mappable chunks, then create a form for each task area (hundreds of task areas). However, this may no longer be a bottleneck, because we are implementing Entities instead. Once there is an endpoint to bulk upload Entities, we shouldn't have much of an issue here.

For interest, the profiler outputs are below:

[Profiler output images, in order: 1 request (sync, async); 15 requests (sync, async); 100 requests (sync, async); 1000 requests (sync, async, asyncio gather)]

As you can see, there isn't a huge difference between sync and async usage here, with both being reasonably performant (a 3s difference for 1000 requests: 10s --> 7s).

However, if batch calls are required, then usage via asyncio.gather is significantly faster, shaving the time down even further (10s --> 3.5s).
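If the server struggles with many simultaneous requests, one option is to cap how many are in flight at once. The sketch below wraps asyncio.gather with a semaphore; `gather_bounded` and `fake_call` are hypothetical names for illustration, and the limit of 5 is arbitrary rather than a measured value.

```python
import asyncio


async def gather_bounded(coro_fns, limit: int = 10) -> list:
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(fn):
        async with semaphore:
            return await fn()

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run_one(fn) for fn in coro_fns))


async def demo(n: int = 50, limit: int = 5) -> tuple[list, int]:
    """Return the results plus the peak number of simultaneous calls seen."""
    in_flight = 0
    peak = 0

    async def fake_call(i: int) -> int:
        # Hypothetical stand-in for one API request; it only sleeps.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i

    jobs = [lambda i=i: fake_call(i) for i in range(n)]
    results = await gather_bounded(jobs, limit)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(f"{len(results)} calls completed, peak concurrency {peak}")
```

Passing coroutine functions rather than already-created coroutines means no call starts until it holds the semaphore.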

I will probably update our ODK Central API wrapper (part of osm-fieldwork) to be async anyway.

If this is something that is desired by pyodk, it could be contributed.

Otherwise, I'm happy to mark this as resolved.

Code

If anyone is interested, here is the code I hacked together quickly.

FastAPI endpoint:

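from fastapi import Depends

# `router` (an APIRouter) and `database` are defined elsewhere in the app.


@router.get("/test-odk")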
async def test_odk(
    project_id: int,
    db=Depends(database.get_db),
):
    from osm_fieldwork.OdkCentral import OdkProject

    from app.projects import project_deps

    odk_creds = await project_deps.get_odk_credentials(db, project_id)

    async with OdkProject(
        url=odk_creds.odk_central_url,
        user=odk_creds.odk_central_user,
        passwd=odk_creds.odk_central_password,
    ) as odk_central:
        projects = await odk_central.listProjects()
        first_project = projects[0].get("id")

        # Inefficient: each request is awaited before the next one starts
        for index in range(1000):
            print(index)
            details = await odk_central.getFullDetails(first_project)

        # Efficient, asyncio.gather approach (needs `from asyncio import gather`)
        # details_tasks = [odk_central.getFullDetails(first_project) for _ in range(1000)]
        # details = await gather(*details_tasks)
    return details

Example aiohttp usage (in methods):

        async with self.session.get(url, headers=headers, ssl=self.verify) as response:
            return await response.json()

spwoodcock · April 22, 2024, 12:20pm

The main requirement for this post was bulk uploading Entities.

I imagine that bulk creating forms, and other edge cases we may use, are not particularly useful for pyodk / the community (so async methods are not really necessary, although they may still be useful for anyone using pyodk within an async web framework).

Looks like the requirement to bulk upload Entities will be covered by new API endpoints in Central 2024.01:

github.com/getodk/central-backend

Bulk entity creation via API

getodk:master ← getodk:ktuite/bulk_append · opened 11:17PM - 11 Jan 24 UTC by ktuite · +936 -40

Closes https://github.com/getodk/central/issues/573

Uses POSTing to existing …`/v1/projects/1/datasets/people/entities` endpoint so it can handle
* single entities with just `{uuid, label, data}`
* multiple entities with `{entities: [{}, {}], source: {name, size}}`

For the source object,
* `source` is required
* `name` within source is required (not sure if it should be but it will help with displaying on the front end...)
* `size` (meant to represent file size)

I was thinking about how to handle the bulk SQL insert and decided it was easiest to think/build two separate queries, one for the `entities` table and another for the `entity_defs` table. Elsewhere the code, we have complex queries for inserting a single entity's data into both tables at once, but it didn't seem reasonable to try to do that with multiple entities.

#### What has been done to verify that this works as intended?
Tests, trying it out.

#### Why is this the best possible solution? Were any other approaches considered?

#### How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

#### Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.
Documentation is included in this PR!!

#### Before submitting this PR, please make sure you have:
- [x] run `make test-full` and confirmed all checks still pass OR confirm CircleCI build passes
- [x] verified that any code from external sources are properly credited in comments or that everything is internally sourced
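Going only by the PR description above, a bulk request body could be assembled roughly as follows. This is a sketch under assumptions: `build_bulk_payload` is a hypothetical helper, property values are converted to strings, `size` is left out on the assumption that it is optional, and the released API may differ from the description.

```python
import json
import uuid


def build_bulk_payload(rows: list[dict], label_field: str, source_name: str) -> dict:
    """Build a request body in the shape outlined in the PR description."""
    entities = [
        {
            "uuid": str(uuid.uuid4()),
            "label": str(row[label_field]),
            # Property values are sent as strings in this sketch.
            "data": {key: str(value) for key, value in row.items()},
        }
        for row in rows
    ]
    # `source.name` is described as required; `size` is omitted here.
    return {"entities": entities, "source": {"name": source_name}}


if __name__ == "__main__":
    rows = [
        {"name": "Task 1", "status": 0},
        {"name": "Task 2", "status": 0},
    ]
    print(json.dumps(build_bulk_payload(rows, "name", "tasks.csv"), indent=2))
```

The resulting dictionary would be sent as the JSON body of a single POST, replacing one request per Entity.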

Requiring batch API calls for Entity creation is a temporary workaround, so I am closing / resolving this thread.

For anyone that needs this right now, I have working code for bulk Entity uploads here: https://github.com/hotosm/osm-fieldwork/blob/5dd14cd505d4821051740a2d463c4662e18a873a/osm_fieldwork/OdkCentralAsync.py#L430