pyODK merge - updates and conflicts

1. What is the issue? Please be detailed.

I am trying to use pyODK to create new entities and manage updates of existing entities. I can make things work in reasonably simple scenarios, but before stepping towards release into the wild, I'd like to be sure that I haven't missed any critical 'features'.

In particular I am trying to make sure I understand how MERGE works before getting myself (and my data) into trouble. It seems to be a more 'all-encompassing' way of managing entities than UPDATE, but UPDATE appears to have the checks and balances of FORCE and BASE_VERSION... maybe UPDATE is 'safer' / more useful in complex scenarios?

If I supply a CSV with a list of entities to update using MERGE, will pyODK always overwrite the version on Central? I've looked through the source code (I know!) and can't quite follow whether there are any 'checks' on version or conflict status.

And that might affect working with offline entities - not so much of a problem if I have 'full control' over the entity list, but if I'm trying to update via pyODK and an enumerator tries to update via Collect (possibly offline or maybe since I last checked the server), how do I manage potential conflicts?

The docs are a little 'scant' in this area and I can't find any working examples of MERGE except where it is used to create_many... I'm also finding the language in the docs a little obtuse (or maybe I am too obtuse to understand the concepts).

"source_keys If provided, process only these keys in data. " - does this mean ignore for example 'label' or uuid / __id?

and how does that integrate / conflict with

match_keys: Dictionary keys common to source and target used to match rows. Defaults to ("label",). If a custom source_label_key is provided, specify that key as "label", because it is translated to "label" for matching. (That's a complicated sentence which I think cross-refers to other settings, but without examples I find it hard to follow.)

And clarifications for the terminally dim would be much appreciated :slight_smile:

The force and base_version options are relevant in contexts where your pyodk script has a local representation of Entities that could potentially get outdated and that should not be used as the primary source of truth. This could happen if you download an Entity List at one point in time and then try to use it as the basis for updates at a significantly later point in time.

In more dynamic contexts, there might be updates coming in at all moments from different Collect users, for example. That means that even if you download an Entity, make a change to it and immediately upload it again, there's a risk that the server's version would have changed. This would represent a conflicting update. If that's possible, it's highly recommended to specify the base version when making an update. That way, if someone else makes an update to an Entity while your script is running, your update will be rejected and you can decide what to do.
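
For concreteness, that pattern might look roughly like this (a minimal sketch; the Entity List name, uuid, and property are illustrative, and I believe a rejected update surfaces as a PyODKError, but double-check against the docs):

```python
from pyodk.client import Client
from pyodk.errors import PyODKError

ENTITY_LIST = "trails"                                # illustrative name
ENTITY_UUID = "00000000-0000-0000-0000-000000000000"  # the Entity's id on Central

with Client() as client:
    try:
        client.entities.update(
            ENTITY_UUID,
            entity_list_name=ENTITY_LIST,
            data={"status": "closed"},
            base_version=3,  # the version this edit was based on
        )
    except PyODKError:
        # Central rejected the update: the Entity is no longer at version 3
        # because someone else updated it in the meantime. Re-fetch it,
        # reconcile, and decide whether to retry.
        ...
```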

merge is appropriate when you have some Entity data source outside of Central that's intended to be the source of truth or you're making an update that definitely won't overlap with other users'.

For example, maybe you have an external process for registering participants (or trails) and whatever workflow you have in ODK is done in phases with batches of participants/trails. You could use merge periodically to set up the new batch. When you upload a data source without any overlap with the existing Entities in the system, all existing Entities will be deleted and all new ones will be added.
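
A sketch of that periodic batch setup, with illustrative file, column, and list names; note that I'm passing delete_not_matched=True explicitly to get the delete-and-replace behaviour described above:

```python
import csv
from pyodk.client import Client

# This month's batch, exported from the external registration system.
with open("participants_batch.csv") as f:
    batch = list(csv.DictReader(f))  # columns: participant_id, label, phone

with Client() as client:
    client.entities.merge(
        batch,
        entity_list_name="participants",
        match_keys=["participant_id"],
        delete_not_matched=True,  # remove Entities absent from the new batch
    )
```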

Another example could be that you want to add a new property that comes from an external data source. If your data has a natural id (something like a participant id or a trail id), you could use that as the sole value in match_keys. You could also specify the new property as the sole value in source_keys so only the values for that property would be updated. In this case, you don't have to worry about conflicting updates because you're updating a property that wasn't set before.
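
In code, that scenario might look something like this (trail_id and elevation_m are made-up names):

```python
from pyodk.client import Client

# One new property from an external source, keyed by the natural id.
new_data = [
    {"trail_id": "T-001", "elevation_m": "812"},
    {"trail_id": "T-002", "elevation_m": "1204"},
]

with Client() as client:
    client.entities.merge(
        new_data,
        entity_list_name="trails",
        match_keys=["trail_id"],      # natural id, not Central's uuid
        source_keys=["elevation_m"],  # only this property is updated
    )
```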

Or maybe you really do want to update all properties from an external data source. That would be fine to do without worrying about base version as long as you can either guarantee that other work that would result in Entity updates is paused or that your external data source contains the most accurate/true data.

If you think conflicting updates like that are likely and both processes can update the same properties, I would recommend using update with base_version specified for any updates you want to make. You can still use merge with update_matched set to False to delete Entities not in your external data source.
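
A sketch of that combination, again with illustrative names; I'm assuming delete_not_matched=True is also needed to trigger the deletions, so verify against the docs:

```python
from pyodk.client import Client

# Rows still present in the external source of truth.
external_rows = [
    {"trail_id": "T-001", "label": "North Ridge"},
    {"trail_id": "T-002", "label": "Lake Loop"},
]

with Client() as client:
    # Leave matched Entities untouched (update them individually with
    # update() and base_version); only delete the ones that are gone.
    client.entities.merge(
        external_rows,
        entity_list_name="trails",
        match_keys=["trail_id"],
        update_matched=False,
        delete_not_matched=True,
    )
```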

Yes, exactly, you can specify only the columns to update. Those not specified will be "forwarded" from the existing version on the server.

Let's say you have some CSV with a column header of TrailName. Maybe that's the column you want to use as your Entities' labels. You could do that by specifying TrailName as source_label_key. In that case, you may also want to explicitly specify source_keys to exclude TrailName since it's already being remapped to label.

If you want to use the trail name as the key to match between your local data and remote Entities, you would specify that as label because of the remapping.
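
Putting that together in a sketch (TrailName and length_km are illustrative column names):

```python
import csv
from pyodk.client import Client

# trails.csv has columns: TrailName, length_km
with open("trails.csv") as f:
    rows = list(csv.DictReader(f))

with Client() as client:
    client.entities.merge(
        rows,
        entity_list_name="trails",
        source_label_key="TrailName",  # TrailName becomes the Entities' label
        source_keys=["length_km"],     # excludes TrailName, already remapped
        match_keys=["label"],          # match on the remapped trail name
    )
```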

Hopefully those explanations help; do let me know if you have further questions. I agree it would be nice to have something like a Jupyter notebook with some examples. If you end up producing a resource as you're experimenting with these concepts, please share!

Hi @LN
Thanks for your comprehensive response - that's really helpful. Your explanation of the scenarios mainly matches what I imagined, so it's really useful to have that confirmed / clarified. And you even went with examples that 'float my boat'!

I am struggling with the merge concept because I am failing to see how I specify the eid / uuid / __id of existing entities in order to add properties / update them - i.e. how do I tell Central which entity to look for? Should I just be using the label? In update I can specify the uuid, but that option isn't clearly available in merge (hence my wondering whether source_keys ignores uuid / __id). And maybe merge has the flaw (for me) of not working with versions? I'm going in circles a bit.

Apologies for being so slow at grasping this aspect. I'm going to have to disappear into the mountains for a few days so I'll look at it again when I get back.

I am pulling together some 'recipes' for different scenarios as part of the QuODK plugin for QGIS, so that we can export vector layers as entities (csv) and pass them to pyODK to process - I think this is a more pragmatic approach than a bespoke process that then doesn't fit people's needs. I just need to get them right so they don't cause problems for folk!

Hi @seewhy, I am also writing some recipes to run pyodk and manage entities from PostgreSQL. I would be interested in yours to round out the scenarios I have imagined. At the moment I can create one or many entities, and delete entities by uuid :wink:

That's great. We are all 'chefs' now! I'm claiming this as my style of cooking and coding: https://en.wikipedia.org/wiki/Swedish_Chef. I was going to link to a video, but I think that might extend beyond the 'rules' of the forum. During my research for this post (!) I relived some childhood memories of watching the Muppets, so it took longer than planned :slight_smile:


The idea with merge is that if there's an external system involved as a source of truth, there's typically a "natural id" that would be managed as its own column and used to match rows for updating (trail_id, feature_id, etc). If you do want to use the Central-provided ID, fantastic. I'm 99% sure that naming the column that holds that ID __id and specifying __id in match_keys will have the desired effect. If you beat me to double-checking, please let me know what you find and we can explicitly document that.

I'm 99% sure that if your source data includes __id it will be included in the merge unless you exclude it from source_keys. If it's included, it must be specified in match_keys because it's the one column that may only contain unique values. Again, it would be good to double-check this.
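
If that holds, matching on Central's own ID would look something like this (untested, as noted above; the uuid is a placeholder):

```python
from pyodk.client import Client

# Rows keyed by the Central-provided Entity ID.
rows = [
    {"__id": "00000000-0000-0000-0000-000000000000", "status": "verified"},
]

with Client() as client:
    client.entities.merge(
        rows,
        entity_list_name="trails",
        match_keys=["__id"],  # match on Central's own id, not a natural id
    )
```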