Collect: Coming soon, Offline Entities!

Entities represent the people, places and things that are important to your workflow. They can be created or updated by form submissions, directly from Central or via Central’s API. Entities enable longitudinal data collection, form linking, case management, and more.

How Entities work today

Currently, Entity processing happens on the server only. This means that when a user fills out a form that creates or updates an Entity, that same user doesn’t see the new or updated Entity until they have sent the form submission to the server and then downloaded form updates to their device.

For people working in an environment with fast Internet connectivity, the updates will appear when forms automatically update (typically 15 minutes) or near real-time when manually refreshing. If they’re working offline, however, they may not see Entity changes that they have made themselves for some time. For example, if I register a participant while offline, I won’t be able to pick that participant from a list in another form until I submit the registration and update the form(s) that use the participant list.

What offline Entities will enable

Making it possible to create or update entities immediately while offline will support many workflows that currently require either connectivity or a workaround such as a paper tally sheet. Here’s what will be possible, even if Collect is entirely offline:

  • a bed net distributor's map can update to only show households that still need nets
  • a lab technician can be warned if they scan a blood sample vial’s barcode twice
  • a volunteer can register a person in need and immediately document any distributed aid
  • an agronomist can fill out forms only about the diseased crops found in last week's survey
  • a vaccinator can see their daily progress towards their households visited goal

Multiple updates to the same entity

Entity creation and updates will happen at form finalization time. Multiple updates to the same entity will be possible while offline and the order of submission finalization will determine the order that the updates are applied. Collect will make a best effort to send submissions in the order that they were created. If they do arrive out of order on Central, they will be marked as conflicts and can be resolved there.

If two offline devices modify the same entity without first updating from the server, that will result in a conflict shown in Central where it can be addressed. We do not plan to show conflicts to data collectors because it can be difficult to resolve conflicts without project-wide context. Instead, we recommend separating out work to reduce the likelihood of two data collectors making updates to the same entity. You can do that today with tools like choice filters on properties. We are starting to explore how Central will make it possible to segment Entities so that each data collector receives only the ones that are relevant to them (see this thread for an example and to share your needs).

What to expect next (and a rough timeline)

We have started work on offline entities and will share a prototype that you can try in a few weeks. We will then refine the implementation to address situations such as when an Entity is updated or deleted on Central and then also updated by an offline Collect install.

We will also work to increase performance of both entity lists and other attached CSVs. Our goal is to make it possible to use lists of hundreds of thousands of entities with no noticeable slowdowns. Some queries will be slower than others, and we will document this in detail.

We are currently hoping to release offline entity support in Collect v2024.3 in late 2024.

Let us know if you have any questions about this functionality!

We have made good progress on offline Entities! Read on for some detail on what to expect as a user, some limitations we're currently planning for the first release, and an update on timeline. If you are also interested in the form specification, please see this thread.

Please do let us know if you have any questions or feedback, particularly about some of the limitations we are planning for the first release. We will use the notes in this post for writing user documentation so any clarifying questions you have will be very helpful.

There is a lot of detail here! Don't worry, you don't need to understand all of it to make great use of Entities. We want to make sure this information is available for troubleshooting and so that community members can have input if they would like.

Timeline

Our tentative timeline is also linked from the form specification thread.

First, we will release a Collect beta and share access to one of our test servers to try it out with. Later this month, we will release Central with offline Entities off by default so that users can try offline Entities with their own forms and so that we can spend time on quality assurance leading up to the full Collect release. About a month after that, a Central release will turn on offline Entities by default.

Old versions of Collect will unfortunately silently fail when attempting to download forms with the new spec version. Our goal is to give enough time for most users to upgrade Collect before we turn offline Entities on by default in Central.

How Central will work

Central's primary goal for offline Entities is to capture enough information about Submissions that affect Entities so that it could reconstruct the full history including offline branches from different clients.

However, Central will not expose all of this to users. Instead, it will continue to follow the "last write wins" approach with conflict detection. Our goal is to provide just enough information to users when Submissions are known to have come from offline branches so that issues are easier to identify and fix.

When there are multiple updates made offline by the same user, Central will mark these as offline updates.

It will also detect conflicts between multiple offline branches from different clients.

Additionally, Central will have new behavior to handle out-of-order submissions. In the ideal case, when multiple updates have been made from Collect while offline, Collect sends those in the order in which they were created. Central can then process them in the order it receives them and match the intended history.

In some cases, Submissions may be sent out of order. This can happen if Collect is configured to allow the user to manually submit or if a Submission fails. In the upcoming release, Central will detect a form Submission that specifies an Entity update that's out of order and wait to apply it. A Submission that has been held will be processed immediately once the missing earlier Submission(s) are received.

If a Submission is held for more than 5 days, then it will be applied as an update even if there are missing Submissions that should have come before it. In that case, the Submission is said to have been “force-processed” and will be marked as a conflict that you can resolve.

If an update is received for an Entity but its create is still missing after 5 days, the update will be force-processed as a create. If the update did not specify a label, the label will be auto-generated (because every Entity must have a label). If the create does finally arrive after the update has been force-processed as a create, that original create will be processed as an update. Central's goal is to try to use all Entity data that’s submitted even if it arrives late or out of order.
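
The hold-and-release behavior described above can be sketched roughly as follows. This is a minimal illustration, not Central's implementation: the `Update` and `Entity` classes, the `base_version` field (the Entity version an update was made against), and all function names are assumptions made up for this example. It only covers out-of-order updates from a single offline branch and does not model the missing-create case.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# From the post: a held Submission is force-processed after 5 days.
HOLD_LIMIT = timedelta(days=5)


@dataclass
class Update:
    base_version: int      # hypothetical: Entity version this update was based on
    data: dict
    received_at: datetime


@dataclass
class Entity:
    version: int = 1
    data: dict = field(default_factory=dict)
    conflict: bool = False
    held: list = field(default_factory=list)


def _apply(entity, update):
    # "Last write wins": always apply, but flag a conflict when the update
    # was not based on the version the server currently has.
    if update.base_version != entity.version:
        entity.conflict = True
    entity.data.update(update.data)
    entity.version = max(entity.version, update.base_version) + 1


def _release_held(entity):
    # Apply held updates whose predecessors have now been processed.
    progressed = True
    while progressed:
        progressed = False
        for held in sorted(entity.held, key=lambda u: u.base_version):
            if held.base_version <= entity.version:
                entity.held.remove(held)
                _apply(entity, held)
                progressed = True
                break


def receive(entity, update):
    if update.base_version > entity.version:
        # An earlier update is missing: hold this one for now.
        entity.held.append(update)
        return
    _apply(entity, update)
    _release_held(entity)


def force_process_expired(entity, now):
    # Apply anything held longer than the limit; _apply marks it as a conflict.
    for held in sorted(entity.held, key=lambda u: u.base_version):
        if now - held.received_at > HOLD_LIMIT:
            entity.held.remove(held)
            _apply(entity, held)
    _release_held(entity)
```

In this sketch, calling `receive` with the second offline update before the first leaves the Entity untouched until the first arrives, at which point both are applied in the intended order with no conflict. If the first never arrives, a periodic call to `force_process_expired` applies the held update and flags the Entity as conflicted.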

How Collect will work

Collect will keep a database-backed representation of each Entity List. When a form instance that creates or updates an Entity is finalized, Collect will apply the change to its local Entity List representation (if the Entity spec version in the form definition is v2024.1.0 or newer, otherwise it will leave the Entity List unchanged).
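
As a rough sketch of that version gate (not Collect's actual code; the function names and the dict-based Entity List here are made up for illustration), the spec version can be compared as a tuple of integers so that, for example, 2023.10.0 is correctly treated as older than 2024.1.0:

```python
# Minimum Entity spec version that enables offline changes, per the post.
MIN_OFFLINE_SPEC = (2024, 1, 0)


def parse_spec_version(text):
    # "v2024.1.0" or "2024.1.0" -> (2024, 1, 0)
    return tuple(int(part) for part in text.lstrip("v").split("."))


def apply_on_finalize(entity_list, spec_version, entity_id, properties):
    """Apply a create/update to the local Entity List when a form is finalized.

    entity_list is a plain dict of entity_id -> properties (an assumption
    standing in for the database-backed representation).
    Returns True if the change was applied, False if the list was left unchanged.
    """
    if parse_spec_version(spec_version) < MIN_OFFLINE_SPEC:
        return False
    entity_list.setdefault(entity_id, {}).update(properties)
    return True
```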

Sending filled forms and receiving form updates will continue to be completely independent from each other. Eventually, we may combine the two into a single synchronization operation but for this initial release, the Collect user experience will remain unchanged. We strongly recommend using automatic submission and form updates when using Entities to keep the server and client data as closely aligned as possible.

Collect will process Entity List updates in the background at form update time. For every Entity in the list, Collect will compare the server version to its local version. If the server version is the same or greater, it will take the update from the server. In some cases, this will mean temporarily replacing newer user data with older data that came from the server. Eventually, the user's data will be submitted, it will be processed by the server, and the combined Entity version will be received by Collect.

Collect will always use the highest Entity version between its local representation and the server, regardless of conflict status. Conflict status will only be shown to server users who have more context about what's happening across their full project.

Collect will keep track of whether the Entity version it has was created locally or came from the server. If it came from the server and Collect gets an update without that Entity, it will delete the Entity locally.
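
The merge rules in the last few paragraphs can be sketched like this. It is only an illustration under assumptions: `LocalEntity`, `merge_server_list`, and the dict-based lists are hypothetical names for this example, not Collect's data model.

```python
from dataclasses import dataclass


@dataclass
class LocalEntity:
    version: int
    data: dict
    from_server: bool  # False when this version was created or updated on the device


def merge_server_list(local, server):
    """Merge a freshly downloaded Entity List into the local one.

    local:  dict of entity_id -> LocalEntity
    server: dict of entity_id -> (version, data)
    Returns the new local dict.
    """
    merged = {}
    for entity_id, (version, data) in server.items():
        current = local.get(entity_id)
        if current is None or version >= current.version:
            # Same or greater server version: take the server's copy.
            merged[entity_id] = LocalEntity(version, data, from_server=True)
        else:
            # Local version is higher: keep it until the server catches up.
            merged[entity_id] = current
    for entity_id, current in local.items():
        if entity_id not in server and not current.from_server:
            # Local-only change that the server may not have seen yet: keep it.
            merged[entity_id] = current
        # Entities that came from the server and are now absent are dropped,
        # which is how a server-side deletion reaches the device.
    return merged
```

Note that an Entity whose current version was made on the device is kept even when the server list no longer contains it, which lines up with the deletion limitation described under "Planned limitations" below.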

To avoid corrupt data while filling out a form, if a form is updating while a user tries to open it, the user will need to wait until the update is done before the form opens. If an automatic form update attempts to run while that form is being filled, the update will be rescheduled for later.

The database-backed representation of Entity Lists will allow faster lookups with less memory. In this initial release, only some expressions will be optimized in this way. More will be optimized over time.

Planned limitations

In order to release as soon as possible and start gathering real world experience with the system, we are planning to leave a few limitations in the initial release. Some we will definitely address over time but others we may leave if we don't hear that they are blockers for users. If any of these seem critical for your use case or if you have questions about other scenarios, please ask below.

Clients download full Entity List with each update

Entity Lists continue to be served as CSVs with all Entities included. This helps maintain data integrity between client and server but leads to a lot of data having to be shared from server to client, even though the CSVs are zipped. It also means that the server has to do a lot of work. Eventually, clients will be able to request only the Entities that have changed.

Rejecting Entity-creating submissions on the server will not reject them on the client

In this first release, Entities will always be created offline, even if submission approval is required on the server side. This means that some submissions which have been rejected by the server may have created Entities on the device. This is something we will address better in the future. For now, we recommend keeping Central's default behavior of automatic creation of Entities without submission approval.

This limitation may not be a problem for you if:

  1. Entity creation and update are done by different people
  2. You intend to hold submissions so you can make edits and then always approve them

Entity created locally and immediately deleted from server

In this system, Entity create/update on the client, Entity create/update on the server and Entity updates from server to client are completely independent from each other and can happen in any order. With our current implementation, this means that there are cases in which Entities can be created on the client, deleted on the server, and that deletion is never synchronized to the client. We have a proposed approach to address this described in this thread but it will not be in the first release.

CSV form attachments with the same name as an Entity List will be conflated with the Entity List

If a project contains a form that reads from a CSV with a name that matches an Entity List's name, that form will read from the Entity List in Collect instead of reading from the attached CSV. We expect that this is very rare but will try to address it better soon.

Multiple clients that submit interleaved offline branches

If multiple clients each make several edits while offline and then submit at the same time, there's a risk that their updates will be interleaved. It will be hard to understand what happened from the Central representation and conflicts may be marked as soft when they are in fact more serious.

Next steps

Thanks to all of you who have contributed to making this complex functionality a reality!

Although this post contains a lot of details about special cases, we believe that overall the system will behave like most users expect. If that's not the case, please let us know. We are looking forward to your feedback on the Collect beta and the Central release after that.

Currently a form can either link to an entity list or upload a CSV and use the attachment if preferred (eg entity list content is bad / want to only supply a subset / need to supply a sorted version).

I'm probably misunderstanding, but does this statement mean that unlinking will be overridden in Collect (but not Enketo?) and the project entity list will be used regardless?

Yes, that's what we're considering leaving in as an initial limitation. Of all the ones we've mentioned, we consider it the most risky/unacceptable. We wanted to lay out these limitations now to get user feedback and decide whether we need to prioritize any of them before release.

Are these reasons that you have overridden the Entity List link yourself in published projects?

I have a question about offline entities. A bit soon as what I have in mind is not yet possible...
When an entity is updated in a form, is the new state of the entity available in the form before its finalization?

I plan to use entities to manage places to be visited and to use a "visited" property to hide entities from the map as soon as they've been visited.

No, the updates are applied at finalization time. That means that during a form filling session, Entity data does not change. You can think of offline Entities as giving the same experience as what you get today if you always pull server updates immediately after each time you submit a form.

I think you may be thinking ahead to when you can update multiple Entities in a repeat, right? In that case, you'll need to combine Entity data and current form state to hide visited Entities but it should be possible to do what you have in mind. We'll be sure to share examples.

This is not exactly what you're asking about but I did want to note that if you later view or edit the submission either from Collect or from the server, any calculations that involve Entities will use the latest version of those Entities. That's desirable in some cases like if a typo got fixed or new information got added. In many other cases, it's more desirable to get the same information that the data collector originally did. This can be achieved with dynamic defaults, the once function or the if function depending on the context. In general, consider whether submissions will need to be viewed or edited after initial submission and if so, make sure to account for that in your calculations.

Thanks @LN. Yes, I'm looking for a workaround for a not selected() choice_filter, which can get slow with a lot of previously selected items.
But I'll share our form and feedback in a dedicated thread.

It was more of a theoretical question! If I had linked a select to an entity list and had a need to override it to allow using an uploaded CSV after this change was made, I think the answer is to modify the form select's filename to unlink it (eg from EL_myitems.csv to CSV_myitems.csv). This could still update the entity list if desired as long as the UUID/name etc matched, but would allow a custom content/sort for that form.

I've been slowly rolling out entities, but only for smaller infrequently changing datasets that aren't as affected by list order, and the first list population was with a Z-A sorted CSV upload so the choice list shown is then A-Z sorted.

I have been mapping out how to roll them out for other CSV replacements, and in some cases it's easy (a task list that has items added and removed, with label updates to include emojis (like @spwoodcock, I am a big fan of dynamically updating labels with definition/entity updates to include emojis for status/rank/etc indication)), but in other cases, I have limitations that have stopped me from using them yet. eg

  • a need to sort one set, that updates daily, by area to force draw order on a map due to overlaps, no solution known yet
  • a need to update two sets from the same submission, would need to separately update the second or both via API
  • a need to update one set >1 times from inside a repeat
  • a need to associate media with set items so the form must also include these filenames to allow upload in Central requiring form updates with new/changed set items

Ahh, I see. That won't help, unfortunately. I think what will be ideal is when we can show the map first and then have form entry go from that so you don't need to rely on repeats. I'll be interested to see your forms and learn more about your workflow so we can see whether there are other options.

Yes, that would work, but if you already had a form rolled out and all of a sudden the behavior changed it could be problematic. We'll see whether we can address this one.

Thank you for the summary! We're tracking all of those. I think it's likely that multiple updates from the same submission will be coming relatively soon. The others are likely further out but we'll be sure to update as we know more.