Entity-based data collection

LN · June 27, 2022, 10:18pm

1. What is the general goal of the feature?
Better support workflows that are centered around an entity which is visited one or more times. See the data collector workflows documentation for examples of how users address this need currently.

2. What are some example use cases for this feature?
There is a dedicated topic that links to use cases and related conversations over time. Thanks to everyone who has been involved in these conversations.

We will be adding the following concepts to Central and then over time exposing them more explicitly in Collect:

Entity: a person, place or thing that Forms can be about. Created, updated and archived by one or more Forms.
Entity Properties: values representing the current state of an Entity. Defined by Forms and populated by Submissions.
Dataset: a collection of Entities of the same type.

The implementation will start on the server side and will eventually lead to changes to Collect. A rough mockup of the data collector experience we will be working towards:

Mockup of Collect user interface re-oriented towards entities. This is an example for a guinea worm eradication project that needs to track dogs (vectors of guinea worms), households (may be responsible for dogs), water sources (which need to be treated against guinea worms), etc.

The first screen shows different Datasets (households, dogs, water sources) as well as Forms not connected to Entities (Monthly Abate Report, Security Update). The second screen shows an entity listing for the dogs Dataset. The third screen shows a specific dog Entity, its Entity Properties, and the Forms that can be filled out about it.

Entities:

Have at minimum a unique identifier
- See X3GJIA02 above which might be an identifier scanned from a barcode
May optionally have other properties
- See status, primary caretaker, etc for X3GJIA02 above
Expose all their properties to forms that are filled out about them for optional referencing
- For example, the “Wellness check” form above might use the primary caretaker name as part of question text
May get their properties set or updated by forms
- For example, the “Register rumor” form above might change the dog’s status
Expose all their properties to clients for use in list/map/summary views
- For example, some properties of each dog are shown in the entity list above
Have a flat property list (no groups, no repeats)
- Properties represent the state of a single entity at the current point in time
May optionally be created from a form definition
- For example, the + “floating action button” at the bottom right of the form listing screen launches an entity creation form
May optionally be archived by one or more form definitions
- For example, a “Register death” form for dogs might archive a dog, making it unavailable to field devices (but still visible on the server)
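To make the rules above concrete, here is a minimal Python sketch of the model. It is only an illustration and not how Central implements anything: the class and method names are hypothetical, and the status values are made up for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """A person, place or thing with a unique ID and a flat property list."""
    entity_id: str
    properties: Dict[str, str] = field(default_factory=dict)  # no groups, no repeats
    archived: bool = False


class Dataset:
    """A collection of Entities of the same type, e.g. "dogs"."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entities: Dict[str, Entity] = {}

    def create(self, properties: Dict[str, str], entity_id: Optional[str] = None) -> Entity:
        # Use the supplied identifier (e.g. a scanned barcode) or generate one.
        entity = Entity(entity_id or str(uuid.uuid4()), dict(properties))
        if entity.entity_id in self._entities:
            raise ValueError(f"duplicate entity ID: {entity.entity_id}")
        self._entities[entity.entity_id] = entity
        return entity

    def update(self, entity_id: str, changes: Dict[str, str]) -> None:
        # Properties only hold current state, so new values overwrite old ones.
        self._entities[entity_id].properties.update(changes)

    def archive(self, entity_id: str) -> None:
        self._entities[entity_id].archived = True

    def visible_on_devices(self) -> List[Entity]:
        # Archived entities are no longer sent to field devices.
        return [e for e in self._entities.values() if not e.archived]

    def visible_on_server(self) -> List[Entity]:
        # The server still keeps archived entities.
        return list(self._entities.values())


dogs = Dataset("dogs")
dogs.create({"status": "healthy"}, entity_id="X3GJIA02")   # entity creation form
dogs.update("X3GJIA02", {"status": "rumor registered"})    # "Register rumor" form
dogs.archive("X3GJIA02")                                   # "Register death" form
```

After the last line, the dog is absent from `visible_on_devices()` but still present in `visible_on_server()`.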

This model provides a layer on top of existing ODK concepts and functionality.

Initially, Datasets in Central will behave like server-managed CSV external datasets and there will be no changes to Collect or Enketo. We expect that the next release of Central (ETA: Fall 2022) will include:

Experimental form specification for declaring that Submissions of a Form create Entities in a particular Dataset (XML-only, XLSForm will come after)
Entity creation on Submission approval
Ability to attach a Central-managed Dataset instead of an uploaded CSV file to a form that declares a CSV external dataset

Initially these workflows will require a server round trip: a submission will have to be processed from Central and Collect/Enketo will need to get a CSV update. Eventually, Entity creation and update will happen entirely offline.
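As a rough illustration of that round trip, the Python sketch below turns approved submissions into entity rows and renders the Dataset as the CSV a client would download. The `review_state` key, the `save_to` mapping and the `name`/`label` column layout are assumptions made for the example, not the actual Central API or file format.

```python
import csv
import io
import uuid
from typing import Dict, List, Optional


def entity_from_submission(submission: Dict[str, str],
                           save_to: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Build an entity row from a submission, but only once it is approved."""
    if submission.get("review_state") != "approved":
        return None
    row = {"name": str(uuid.uuid4()), "label": submission["label"]}
    # Copy each mapped form field into the matching entity property.
    for form_field, prop in save_to.items():
        row[prop] = submission.get(form_field, "")
    return row


def dataset_to_csv(rows: List[Dict[str, str]], property_names: List[str]) -> str:
    """Render entity rows as CSV text, one row per entity."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "label", *property_names],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


submissions = [
    {"review_state": "approved", "label": "Tree A", "circ": "15"},
    {"review_state": "received", "label": "Tree B", "circ": "9"},  # not approved yet
]
mapping = {"circ": "circumference"}
rows = [r for r in (entity_from_submission(s, mapping) for s in submissions) if r]
csv_text = dataset_to_csv(rows, ["circumference"])
```

Only "Tree A" ends up in `csv_text`; "Tree B" would appear after its submission is approved and the client fetches the CSV again.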

Q: This is amazing! When can I use it?
We aim to have the initial Central functionality outlined above released by the end of fall 2022. We will conduct field tests and focus groups with that alpha functionality before publishing a formal spec. This initial functionality will only enable Dataset and Entity creation (not update or archive) and only on submission approval, so it will be most useful for projects that have a registration step and a single follow-up. We will then progressively layer on functionality to support more complex and more dynamic workflows.

Q: This is terrible! ODK serves my needs today. Is my life about to get more complicated?
No. Our goal is to make these new features almost entirely invisible to users who are well-served by ODK today.

Q: Is this case management?
Case management involves coordinating services to bring some kind of case to closure. Examples of cases would be “a person with HIV”, “a pregnant woman”, “a refugee needing housing”. We expect that the tools we have outlined will eventually support many complex case management needs. We use the term “entity” because it is more generic and easier to reason about in industries where “case management” is not typically used (e.g. forestry).

Q: How was this model designed and what alternatives were considered?
The entity-based data collection working group worked through many decision points to get to this model. Some key ones:

Forms and Entities are distinct concepts. One notable alternative would be to only have the Form concept and to provide queriable access to all submissions. The downside is that getting the most recent state would often require complex queries across submissions from multiple forms. Entities store the most recent state, which is often enough to drive workflows.
Form Submissions continue to be the way data goes from a client to a server. There are alternatives like immediately syncing on a per-field basis. Keeping the submission model means more existing mental models and code can be reused and older clients will work with most of the system.
Entity creation, update, archive are configured in form definitions. An alternative would be to let servers manage that process. Adding to the form definition has two major benefits: it lets form designers think about all aspects of a form field at once and it ensures greater portability between compatible systems.

Q: What can I do to help?
Share your use cases, provide feedback on proposals, try out new releases.

Please also consider using ODK Cloud for your next data collection project. It’s what pays for the development of new functionality.

callawaywilson · August 15, 2022, 2:08pm

@LN, this feature looks great! Thanks to everyone who's making this a reality.

I had one question / potential case for consideration. I collaborate on some projects that would use this feature to generate entities for very large projects. For example, we might visit hundreds of thousands of households over the course of a project that spans multiple years. It would make a lot of sense for us to use the Entity functionality to register households as we refer to them when recording individuals and potential follow-ups.

Is there any consideration being given to large data volume handling?

I could imagine that larger entity volumes might cause slowness / errors in synchronizing and on-device processing / searching depending on how data are stored, indexed, and transmitted. It may not be a concern with the ODK solution, but I have seen issues with working with XML-based data on-device once it gets even into the thousands of entries.

Some other solutions I've seen include:

Scoping data somehow, per-user or user groups for example.
Completing or expiring data so that it is removed from devices (which I know approaches case management).

LN · August 30, 2022, 4:22pm

Yes, this is something we think about a lot.

Our initial releases will rely on existing select from file functionality. We are confident this works smoothly into the 10s of thousands of entities on modern devices. We have some short-term work planned to further improve performance.

We intend to enable user scoping within the next few releases. Initially this will likely look like only the original creator of an entity being able to manage it. Eventually this will become configurable.

We also expect that we will expand the form spec so that a form submission can archive an entity.

This high-level list of areas of development should give you a feel for how we're approaching this. Feel free to comment inline. As always, we will introduce new functionality iteratively and gather feedback at each stage.

alios82 · September 2, 2022, 11:53pm

Thanks for this.
I currently use an asynchronous pipeline that I wrote to achieve this.
Having it as a standard feature opens many doors.

Syed_Muhammad_Qadeer · September 3, 2022, 8:06pm

Hi @LN, it is amazing to know that this long-awaited feature is finally coming. Looking forward to seeing the updates and to trying to apply this example to our use case, where we want to run a longitudinal study on farmers.
Salute to you and the team.

Syed_Muhammad_Qadeer · November 2, 2022, 1:49pm

Hi @alios82, would you mind sharing more details on what you tried?
Thanks a lot in advance.

alios82 · November 5, 2022, 11:07am

Each form represents a data model entity, e.g. person, business, job, location.

Each entity gets a UUID generated for each submission.

Upon submission, server-side scripts ETL the data from submissions and load it onto the form as a CSV media file, and the form version is updated using APIs from Central and G Apps.

Collect can then download the new media and search for the UUID in the fresh data.

Syed_Muhammad_Qadeer · January 15, 2023, 2:43am

Hi @LN
Hope you are well.
I have started using the new feature and it's working great.
A few questions:

Is it possible to view / edit entity datasets in Central?
For instance, a registration form creates an entity. The submission gets approved and the entity becomes part of the dataset. Later on, for whatever reason, the registration form is updated. Will that change be reflected in the dataset as well? Will that edit create a new row or replace the existing one?
Related to performance: if the entities dataset contains 1 million values, what potential effect might that have on the client using that list? Will the dataset get downloaded to the end user's device?

Thanks a million for this amazing feature.

LN · January 17, 2023, 6:56pm

Not currently but we know those are important to add soon.

Right now entities can only be created, not updated through any means at all. The model we are currently planning on is that submission edits do NOT affect entities. In the future, submissions will be able to create, update or archive entities. Once a submission has performed that action, it is essentially "unlinked" from the entity it acted on.

Edits are complex no matter what but we think the model described above is the easiest to understand. We consider submissions to represent encounters and entities to represent current state. It's possible to find out that an encounter had errors and to edit that encounter but in the meantime, the entity may have gone through other transformations so it may no longer be appropriate to update. For example, consider this scenario:

A tree gets registered. The registration form submission has a diameter of 15cm, that's what gets written to the entity.
The tree gets measured again. A follow-up form submission has a diameter of 20cm, that now gets written to the entity (this can't be done today -- it's an update)
The registration form submission gets edited because another data source shows that the original diameter was actually 17cm.

I think we have four options:

1. Apply the edit to the current state of the entity. This would mean changing the 20cm to 17cm. That clearly doesn't make sense: the edit was to a previous point in time.
2. Apply the edit to historical state of the entity. This would probably get the history closer to reality but it's potentially extremely confusing. What if that 15cm was used in intermediary analysis? What if it was read by other forms? If we just magically change it to a 17, information is lost.
3. Don't apply the edit to the entity. This leaves it up to the person making edits to consider whether the new information should apply to the corresponding entity or not.
4. Apply the edit to the entity only if the edited submission was the latest to affect the entity. This is a hybrid of 1 and 3. It allows some submission edits to affect entities. It may be hard to build a mental model around.
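To compare these, here is a small Python sketch of the first, third and fourth options applied to the tree scenario. The class, function and policy names are hypothetical and this is not how Central handles edits. The second option is left out because it would need a stored history, which this sketch does not model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EntityState:
    properties: Dict[str, str] = field(default_factory=dict)
    last_writer: Optional[str] = None  # ID of the submission that last wrote to the entity

    def apply_submission(self, submission_id: str, values: Dict[str, str]) -> None:
        self.properties.update(values)
        self.last_writer = submission_id


def apply_edit(entity: EntityState, submission_id: str,
               edited_values: Dict[str, str], policy: str) -> bool:
    """Apply a submission edit under one policy; return True if the entity changed."""
    if policy == "current":      # first option: always overwrite current state
        entity.properties.update(edited_values)
        return True
    if policy == "ignore":       # third option: edits never touch the entity
        return False
    if policy == "if_latest":    # fourth option: only if this submission wrote last
        if entity.last_writer == submission_id:
            entity.properties.update(edited_values)
            return True
        return False
    raise ValueError(f"unknown policy: {policy}")


def run_scenario(policy: str) -> str:
    tree = EntityState()
    tree.apply_submission("registration", {"diameter_cm": "15"})
    tree.apply_submission("follow_up", {"diameter_cm": "20"})
    # The registration submission is later corrected to 17cm.
    apply_edit(tree, "registration", {"diameter_cm": "17"}, policy)
    return tree.properties["diameter_cm"]
```

In this scenario `run_scenario("current")` leaves the tree at 17, while `"ignore"` and `"if_latest"` both leave it at 20. The last two only differ when the edited submission was the most recent one to write to the entity.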

Please keep in mind that this is preview functionality and not yet ready for usage at scale. You can read more in the docs where we recently tried to make the limitations more clear, including related to performance. We are going to be working on things like progressive updates to clients eventually.

ahblake · February 15, 2023, 4:11am

I've been thinking about entities recently in the context of my use and have had a look at the preview. It's a super exciting feature I can see many applications for.

For an entity, could/would the dataset contain duplicates for a given entity ID, in essence recording the history of that entity (e.g. tree circumference at a number of historical visits, weight at different ages, etc.)? The most recent entity could be pulled by other forms based on timestamp of creation or a 'current' type flag, and a dataset could be pulled with the history for one or more fields. Or is this the intent of archiving an entity, so that you have one live current state and many archived (but still able to be consumed by a form or queried externally) states?

~~The datasets are CSVs currently, is there any way an entity could include media of any sort, eg a drawing/photo?~~ add files exists as 0.09. Before it's added, could the dataset have an image field that would trigger Central to add those files as attachments?

Creating multiple entities from a single submission (also discussed here) isn't on the functionality sheet yet, but seems to me to be quite important. eg for a family, add details for each member via a repeat, and add all members to the dataset.

LN · February 20, 2023, 10:13pm

Currently entity IDs are system-assigned UUIDs. They're guaranteed to be unique but what you're likely talking about are "natural" IDs from your domain (e.g. trees' barcodes, people's health ID numbers, license plate numbers, etc). You can use the same techniques I know you've used with last-saved to try to ensure local uniqueness while offline. You can also attach the same dataset you're populating and use a constraint if there's an attempt to re-introduce the same natural ID. These are good measures but they're not guarantees. At some point we do intend to give the system awareness of natural IDs.

So yes, you can absolutely have duplicates in a column that you consider to contain IDs. That ID column could represent the system-assigned entity IDs from another dataset.
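Since the system does not enforce uniqueness of natural IDs, one option is to check for collisions after the fact. A minimal Python sketch, assuming the dataset has been exported as a list of rows and that the natural ID lives in a hypothetical `barcode` column:

```python
from collections import Counter
from typing import Dict, List


def duplicate_natural_ids(rows: List[Dict[str, str]], id_column: str) -> List[str]:
    """Return the natural IDs that appear on more than one entity row."""
    counts = Counter(row[id_column] for row in rows if row.get(id_column))
    return sorted(value for value, n in counts.items() if n > 1)


# "name" stands in for the system-assigned ID; the values are shortened for readability.
trees = [
    {"name": "uuid-1", "barcode": "T-001"},
    {"name": "uuid-2", "barcode": "T-002"},
    {"name": "uuid-3", "barcode": "T-001"},  # same tree registered twice
]
duplicates = duplicate_natural_ids(trees, "barcode")
```

Here `duplicates` contains only `"T-001"`, which someone would then need to resolve by hand.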

Yes, exactly. We've deliberately used the extremely generic term "entity" so you can slice and dice them according to what is most convenient to your domain. We hope that at some point the default usage pattern will be to use entities to mostly match real-world concrete entities (a tree, a bus, a road, a person) because that's easiest to reason about.

That's not feasible without updates and archive. You can currently drive workflows with updates and archives by using more abstract entities like a visit, a measurement, an update, an archival etc. You could have a very generic interactions dataset that includes an interaction_type property or you could have separate datasets for updates and measurements, etc. There will be tradeoffs around performance and readability of queries. Over time we will mitigate some of those and it will become more and more about taste. This is similar to decisions that need to be made around using a form with repeats vs. a form that gets filled many times.

As I mentioned above, we expect the default usage pattern to be that entities map to real-world concrete entities. We expect those to have the most minimal set of properties to drive workflows. We expect the bulk of detailed analysis to continue to be done over submissions. Entities would drive live dashboards.

To tie this to the trees example in the docs, we expect it to be simplest and most performant to have a trees dataset with a few properties, including latest_circumference. A Tree Measurement form might use that latest_circumference property in a constraint and also update it with the latest valid value. Then if the tree no longer needs to be measured, it could be archived. Where the measurement history is needed, submissions to Tree Measurement could be analyzed.
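In Python terms, the logic of that Tree Measurement form might look roughly like the sketch below. The constraint used here (a new measurement may not be smaller than the previous one) is only an assumed example of what a form designer might choose, and the function name is hypothetical.

```python
from typing import Dict


def record_measurement(tree: Dict[str, str], new_circumference: float) -> bool:
    """Check a new value against latest_circumference, then update the entity."""
    latest = float(tree["latest_circumference"])
    if new_circumference < latest:
        # Constraint fails: the entity keeps its current state.
        return False
    tree["latest_circumference"] = str(new_circumference)
    return True


tree = {"latest_circumference": "15.0"}
accepted = record_measurement(tree, 20.0)   # valid, so the entity is updated
rejected = record_measurement(tree, 12.0)   # fails the constraint, entity unchanged
```

The entity only ever holds the latest valid value; the full measurement history stays in the submissions.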

Something like that will likely be our first step but that still represents a fair amount of work.

Right, that's true. I guess we've considered it more of a specification addition but it's true that there'll be some Central work. I've added it as 0.11, thanks!

dr_michaelmarks · February 21, 2023, 2:58pm

Firstly this is super exciting so thank you @LN

Some initial thoughts having started to test this at LSHTM with @chrissyhroberts - if my descriptions aren't clear please just say.

Entity dataset entries should be date & time stamped plus have something about who created them (public link vs app or web user ID).

I guess this could be done by creating these fields as metadata within the XLSForm and saving them as entity properties but I wonder if this can/should just be done by default? (Currently, for example, a submission might have these metadata fields but the entity it creates might not if you didn't specify save_to for those fields.)

New properties for existing entities. Sometimes you use a follow-up form to collect new data about an existing property that was already defined (a new recording of tree height for example). But sometimes you use a new form to add a new property to that entity (leaf colour say). Currently it seems like this creates a new row in the dataset with all the existing ('original') properties blank but the new property for that entity completed. This has two linked issues:
a) Unique entity IDs. My understanding is each entity has a unique ID. When you then use a separate form to create a new property of that entity, the row in the dataset has the correct entity label but a different unique ID. This can be seen in the attached CSV.

Two people, Mary and Jimmy, have been created. Each has a unique ID.
Mary has had her hair entered twice. This creates two rows but the unique IDs of these don't match Mary's original unique ID. This would mean downstream matching would rely on the label variable I guess, but this might not be unique.
In addition, if I now want to reference her hair colour in a further file, it's not directly linked to all the other entity properties in the dataset, which seems to create difficulties referencing it at the same time as the other 'original' entity properties, i.e. if I pull in from the 'people' dataset the hair property is disconnected from the others, which makes pulling it in tricky.

If you want to do cascading selects then you need to make sure that any 'parent level' properties are saved in descendant forms to do matching on.
i.e.
I have a household form with properties district and head of house name.
I have a person registration form. I first select a house from a houses dataset and then create some info about that person.
I have a person follow-up form. I want to a) select a house and then b) select from only the people that live in that house.
I can do this IF I remember to save the house-level properties I will match on into the person-level form, but not otherwise. The way I am doing this currently is:
a) Select the house.
b) Calculate the district and head of household.
c) Select from the person list with a choice filter that the saved District and Head of Household values in the person dataset match those I just pulled from the Houses dataset.

So I wonder if descendants could somehow have the parent form's unique ID appended to them.
i.e. I create a house form; it will have unique ID abcdefgh.
When I then create people within that house they should all have a variable house which matches abcdefgh.
Then I can do future matching based on this.
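A rough Python sketch of what I mean, just to illustrate the matching logic (the function names and the `name`/`house` keys are made up for the example, this is not form syntax):

```python
from typing import Dict, List


def register_person(house: Dict[str, str], person: Dict[str, str]) -> Dict[str, str]:
    """Return a person row that carries the unique ID of its parent house."""
    return {**person, "house": house["name"]}


def people_in_house(people: List[Dict[str, str]], house_id: str) -> List[Dict[str, str]]:
    """Keep only the people whose saved house ID matches the selected house."""
    return [p for p in people if p.get("house") == house_id]


house_a = {"name": "abcdefgh", "label": "House A"}
house_b = {"name": "ijklmnop", "label": "House B"}
people = [
    register_person(house_a, {"label": "Mary"}),
    register_person(house_a, {"label": "Jimmy"}),
    register_person(house_b, {"label": "Someone else"}),
]
residents = people_in_house(people, "abcdefgh")
```

With the house ID saved on each person, the follow-up form only needs one value to match on instead of district plus head of household.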

We will continue to explore and share feedback!

Syed_Muhammad_Qadeer · March 2, 2023, 7:15pm

Dear @LN
Thank you very much for your super detailed answers. We highly appreciate your support.
A next question:
Is it possible to populate a choice list based on a column from an entities dataset other than the Label column?

For example, I have a dataset (locations) containing Province, District and Tehsil names being populated via a form. I am using Province as the Label.

In another form, I call that dataset (locations) and create a drop-down to select Province, where it's being fetched from the dataset labels.
In the next choice list, I want to show the Districts list based on the previous selection, from the same dataset (locations).

Is there any way to do that?
Thank you for your guidance.