Multiple entity behaviour - difference observed between ODK Central and ODK Collect

Thalie · March 23, 2026, 7:44am

1. What is the issue? Please be detailed.

While preparing a demo of a data collection flow for clinical trials using ODK, I encountered unexpected behaviour: none of the entities I created are being updated in ODK Central v2025.4.4. This is a workflow I have been eager to test for quite some time, and since it makes use of multiple entities I wanted to make some progress ahead of the upcoming Entity Office Hours.

Interestingly, one of these entities does appear to be updated correctly in ODK Collect v2026.1.2, but the other does not. I have reviewed the form design for obvious errors but have not yet identified what the issue could be (it is totally possible I am missing something absolutely obvious). For reference, I have another form using two entities (creation of a new entity + update of an existing entity) that works without any issues.

2. What steps can we take to reproduce this issue?

Screen and enrol a new participant using Form 01
Open Form 02 (sealed envelope)

First submission of Form 02

Second submission of Form 02

The randomisation ID is shifted to the next available randomisation slot as would have been expected and the list of already allocated randomisation slots is also displayed as I would expect.

ODK Central

Participants

randomisation_list

ODK Collect after deletion/reinstallation of the ODK project (reflects entities on ODK Central as expected)

3. What have you tried to fix the issue?

I have reviewed the ODK documentation and explicitly displayed values in the form to verify the calculations. However, I have yet to identify the root cause of the issue. It is totally possible I am missing something absolutely obvious. Fresh perspectives would be greatly appreciated!

4. Upload any forms or screenshots you can share publicly below.

01_screening_form.xlsx (690.5 KB)
02_randomisation.xlsx (657.1 KB)

LN · March 23, 2026, 5:00pm

Do you have a fake facility list you could share with me so I could replicate your starting state? All other lists start out blank, right?

Thalie · March 23, 2026, 5:23pm

Both facilities and randomisation_list at starting state. Randomisation sequence in trials is generally pre-generated as a sequence by independent statisticians.
The only thing is that you need to replace the value of fid with the facility entity ID to make the link before uploading the randomisation_list (sorry, this is very ugly, but I was still running some tests and always feel this tension between user-defined keys and auto-generated unique IDs)

facilities.csv (131 Bytes)

randomisation_list.csv (660 Bytes)

LN · March 23, 2026, 5:24pm

Oh yes, me too. We're thinking about this problem a lot!

Thanks for that starting state, will let you know what I find.

LN · March 24, 2026, 4:49am

This turned out to be quite relevant! I believe the main issue is that on the entities sheet for the randomization form, you had a reference for the participants list's natural id, not the system id. Currently, in order to update an Entity you need to specify a reference to its system ID on the entities sheet.

I added a calculation to your survey sheet to look up the system ID based on the natural ID which is stored in the label column: instance('participants')/root/item[label=${pid}]/name. I'm now seeing consistent updates from the randomization form with 02_randomisation.xlsx (661.0 KB)

Note that the expression above assumes that labels are unique which is not currently enforced. Showing the latest assigned id in the enrollment form definitely does help but it's not a guarantee. You could choose to error when there's a duplicate since there's definitely a problem in that case or you could add [position()=1] like you did with another query to pick the first match.
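To make the duplicate-label caveat concrete, here is a minimal Python sketch. It is not ODK code: it only mimics the shape of the lookup using the limited XPath support in Python's `xml.etree.ElementTree`. The participant labels and UUIDs are made up, and `lookup_strict` and `lookup_first` are hypothetical helper names standing in for "error on duplicates" and "add [position()=1]".

```python
import xml.etree.ElementTree as ET

# Hypothetical snapshot of the participants list as a form would see it:
# <name> holds the system ID, <label> holds the natural ID.
PARTICIPANTS_XML = """
<root>
  <item><name>8f1c2a9e-0000-4000-8000-000000000001</name><label>F01-001</label></item>
  <item><name>8f1c2a9e-0000-4000-8000-000000000002</name><label>F01-002</label></item>
  <item><name>8f1c2a9e-0000-4000-8000-000000000003</name><label>F01-002</label></item>
</root>
""".strip()


def system_ids_for_label(root, pid):
    """All system IDs whose label equals pid (same shape as item[label=${pid}]/name)."""
    # Assumes pid contains no single quote, which would break the predicate.
    return [el.text for el in root.findall(f"item[label='{pid}']/name")]


def lookup_strict(root, pid):
    """Return the single matching system ID; raise if there are zero or several."""
    matches = system_ids_for_label(root, pid)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {pid!r}, got {len(matches)}")
    return matches[0]


def lookup_first(root, pid):
    """Return the first match, similar to adding [position()=1]; None if no match."""
    matches = system_ids_for_label(root, pid)
    return matches[0] if matches else None


root = ET.fromstring(PARTICIPANTS_XML)
unique_id = lookup_strict(root, "F01-001")           # exactly one match
first_of_duplicates = lookup_first(root, "F01-002")  # silently picks the first of two
```

Here the label F01-002 appears twice, so the strict lookup raises while the first-match lookup quietly returns one of the two IDs. Which of those you want depends on whether a duplicate should stop the enumerator or not.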

Central and Collect indeed have slightly different behavior when there are multiple Entity declarations and only a subset of them are invalid. Central does no Entity processing in that case but Collect does process declarations that are correct. So you saw updates to randomisation_list fail in Central but succeed in Collect. I'll make sure to call this out as I continue improving the docs and I've put some raw notes below.
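As a rough illustration of that difference, here is a toy Python model. It is a sketch under simplifying assumptions, not how either application is implemented: "valid" is reduced to "the reference matches a known system ID", and `Declaration`, `process_all_or_nothing` and `process_each` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Declaration:
    list_name: str
    entity_ref: str  # the value the form supplied as the Entity ID


def is_valid(decl, known_ids):
    # Simplification: a declaration is valid when its reference is a known system ID.
    return decl.entity_ref in known_ids.get(decl.list_name, set())


def process_all_or_nothing(decls, known_ids):
    """Behaviour described for Central: one invalid declaration blocks all of them."""
    if all(is_valid(d, known_ids) for d in decls):
        return [d.list_name for d in decls]
    return []


def process_each(decls, known_ids):
    """Behaviour described for Collect: every valid declaration is applied."""
    return [d.list_name for d in decls if is_valid(d, known_ids)]


# Made-up IDs: the participants declaration carries a natural ID by mistake.
known_ids = {
    "participants": {"uuid-p1"},
    "randomisation_list": {"uuid-r1"},
}
submission = [
    Declaration("participants", "F01-001"),        # natural ID, so invalid
    Declaration("randomisation_list", "uuid-r1"),  # system ID, so valid
]

central_applied = process_all_or_nothing(submission, known_ids)
collect_applied = process_each(submission, known_ids)
print(central_applied, collect_applied)
```

In this model the randomisation_list update is applied only by the per-declaration strategy, which matches what you observed.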

Details around implications of the discrepancy

- Collect may create or update Entities that Central will not
- If an Entity is created in Collect but skipped in Central, it will be deleted from Collect next time the form is updated (because of the integrity URL check)
  - Then if it does eventually exist in Central, Collect will detect that it has an offline Entity that's the same version as an online Entity, will replace its local representation with the remote one, and mark it as online (test)
  - Collect could make offline updates to an Entity that Central hasn't created yet. Entity processing would fail on Central for those update submissions because they reference a submission that doesn't exist. A Central user could edit the submissions and save to re-run Entity processing. They would be responsible for applying the updates in order or resolving conflicts if they don't
- If an Entity that exists in both Collect and Central is updated offline in Collect, and the update is skipped by Central, Collect will keep its local version until the Entity version on Central reaches the version in Collect; updates may be applied out of order in Central, Central will warn about that

Pros:

- Collect user is able to make progress offline on the Entities/updates that did succeed

Cons:

- Collect user has no indication that some Entity processing failed and so may, for example, create the same Entity again. That could be hard to trace in Central
- Partial failure is confusing

I made one other unrelated change to your randomization form. You have an AGE_CATS list that represents age categories and that you need to look up values from. There are two things that I changed:

First, I removed the non-relevant select. This used to be necessary for the list to be included but we now include all lists specified in the XLSForm.
Second, you used jr:choice-name to look up the label. This works but jr:choice-name is really intended to look up labels that could be translated. In this case, we know that they're not, so we can use instance('AGE_CATS')/root/item[name=${age}]/label. You can look values up in internal lists exactly the same way you do in external lists. That means you could also use additional columns beyond label and look values up from those.
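For the idea of looking up additional columns, here is a small Python sketch of the same kind of lookup. It is only an analogy built on `xml.etree.ElementTree`; the category values and the `min_age` column are invented for illustration and are not taken from your form.

```python
import xml.etree.ElementTree as ET

# Hypothetical AGE_CATS list with one extra column (min_age) beyond name and label.
AGE_CATS_XML = """
<root>
  <item><name>1</name><label>Under 5 years</label><min_age>0</min_age></item>
  <item><name>2</name><label>5 to 17 years</label><min_age>5</min_age></item>
  <item><name>3</name><label>18 years and over</label><min_age>18</min_age></item>
</root>
""".strip()


def lookup(root, name, column):
    """Value of `column` for the item whose name matches, or None if absent."""
    el = root.find(f"item[name='{name}']/{column}")
    return el.text if el is not None else None


age_cats = ET.fromstring(AGE_CATS_XML)
label = lookup(age_cats, "2", "label")      # same shape as item[name=${age}]/label
min_age = lookup(age_cats, "2", "min_age")  # an extra column is read the same way
```

The only thing that changes between the two lookups is the final step of the path.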

Neither of these last changes is related to the issues you experienced but they're a good opportunity to share some ideas that could be useful in future form design.

Thalie · March 24, 2026, 8:49am

Ooooh I see indeed! Thank you so much for the thorough analysis and clear explanation - this is extremely helpful!

I was not expecting this dependency, but noted and good to know. I will also aim to be more systematic in testing multiple entities, probably testing one entity at a time before using multiple entities in the same form, to avoid drawing conclusions about both at once.

It is actually true that I often find myself mixing up auto-generated IDs and manually defined keys when designing forms with entities. This is one of my most frequent observations during testing, and it might be a bottleneck for others as well. This is not necessarily something ODK needs to address urgently (but you may have ideas about this as you mentioned it), but I believe it does add complexity to the design process.

Thanks also for the new best practices, which I will definitely adopt (it is quite easy to keep designing around previously encountered limitations). Seeing poor or ill-suited design choices and their consequences is also helpful for improving.

Btw still wanted to say, multiple entities are amazing!!! (the sealed envelope principle I have been testing addresses a current gap as randomisation processes usually need internet connectivity to guarantee the process, so it is exciting to see that ODK could fill that gap beautifully and allow fine control on complex data management processes)

LN · March 24, 2026, 6:18pm

One more small change I recommend that I'd like to highlight! In the Entities quick start guide we mention putting the save_to column to the right of required for Entities forms. That makes it easier to scan and is a convention I find really helpful.

Agreed this is really important and hard to get right. We've been thinking about ways to automate testing, including some of the ideas @mwaka brought up in "Automatically generate multiple dummy submissions".

One particular thing I find hard to model and test is what happens when multiple users are working offline in parallel.

In the workflow you've built, I imagine you only have one enumerator per facility, right? I believe that's a requirement for data integrity with the forms as you've defined them, specifically because of the nb property for the facilities list. When a new participant is registered, the form adds one to nb and saves it back to nb. So a first submission might set nb to 1, a second to 2, third to 3, etc. That should work fine when there's a single enumerator per facility because Collect makes submissions in order of creation and Central tries to process them in order as well (though that is more complicated with multiple Entity updates).

However, if multiple enumerators are working offline, they might update nb to the same value and throw off the count. An alternative would be to compute the number of registered participants per facility with something like count(instance('participants')/root/item[fid=${fid}]). The disadvantage is that it's going to get slower to compute as the number of participants gets large but if you're expecting fewer than ~60k it should not be noticeable.
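Here is a small Python simulation of that scenario. It is a sketch, not a model of Central's actual processing: it assumes updates are applied in arrival order with the last value of nb winning, and the facility and participant IDs are made up.

```python
def register_offline(fid, nb_at_last_sync, new_pids):
    """Submissions one enumerator creates offline, each saving nb + 1 back to nb."""
    submissions = []
    nb = nb_at_last_sync
    for pid in new_pids:
        nb += 1
        submissions.append({"pid": pid, "fid": fid, "nb": nb})
    return submissions


# Two enumerators at the same (hypothetical) facility both last synced when nb was 0.
from_a = register_offline("F01", 0, ["F01-A1", "F01-A2"])
from_b = register_offline("F01", 0, ["F01-B1"])

# Assumption: updates are applied in arrival order and the last value of nb wins.
participants = []
stored_nb = 0
for sub in from_a + from_b:
    participants.append({"pid": sub["pid"], "fid": sub["fid"]})
    stored_nb = sub["nb"]

# Same idea as count(instance('participants')/root/item[fid=${fid}])
computed_count = sum(1 for p in participants if p["fid"] == "F01")

print(stored_nb, computed_count)  # prints "1 3"
```

Three participants end up registered, but the stored nb is 1 because the second enumerator started counting from the same stale value. The computed count does not depend on the order in which submissions arrive.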

In general, storing aggregate values -- counts, sums or concatenations -- is going to be risky when there's the possibility of multiple individuals working in parallel. Central will detect updates that come out of order but the project manager is responsible for noticing that conflict and addressing it appropriately. This gets harder and harder as time passes because errors will compound.

100% and this is a general data modeling challenge. For example, we have parts of ODK in which we use database row ids (system id) to link values and others where we use form_id, instanceID and other natural ids. There are tradeoffs between the two.

Our goal is to eventually display linked Entities much more richly in Central and to improve performance of queries across linked lists (e.g. the count expression I suggested above). Currently we think that the guaranteed uniqueness of system ids and the fact that they're required and known gives us advantages for these goals. But we also recognize that this often means users end up with two ID schemes which brings its own confusion and downsides. More on these themes soon and we're generally very interested in any observations or dreams you have.

Hurray!!