Enable Case Management/Preloading

adam.butler · June 20, 2018, 12:29pm

@aurdipas and I have just been emailing about some of the details of the process, especially de-duping entities, and I figured it would be good to continue the discussion here and get some more ideas and contributions.

A quick summary of the slides:

"case management" is taken to mean (a) defining entities and then (b) making multiple temporally distinct reports on those entities
this is implemented using two forms, Form A which stores responses as entities, and Form B which requires that an entity is selected from a list before it can be filled; the response to Form B includes the UUID of the relevant entity

@aurdipas had two good questions:

How do you transfer the already existing entity to the device?
How can you avoid the same entity being captured again on a second device (duplication)?

These are the answers that I gave, but I'd love to hear people's thoughts:

I think we would use the kind of CSV preloading that is already available for options. It would probably also make sense to extend this so that it uses the same mechanism as the recent form update notifications, so that there is a reasonable guarantee that devices have the complete entity list.
The auto-updates would go some way to resolving the duplication issue, but are obviously not a complete solution. It would probably make sense to build some duplicate detection and resolution into ODK Central. Ideally, it would only be possible to do data collection on entities that have come from Central, so that they would always have to go through this de-duping, but this is obviously not acceptable if I want to register a patient and then make a case report on them in a totally offline setting. I could see a possible solution using a kind of tombstone for de-duped entities, so that a process might look like this:

while offline, I register patient dd6c32a4 using Form A
dd6c32a4 is now marked as "pending" on my device, which means I can submit case reports against it, but it's not on Central
I then do a case report on dd6c32a4 using Form B
when eventually online, I submit both to Central
it turns out that patient dd6c32a4 is an exact duplicate of an existing patient, 19f44a40, who already has case reports
(more details about how exactly de-duping works here)
my case report is switched to refer to the existing patient, 19f44a40
patient dd6c32a4 is replaced in Central with a tombstone that refers to 19f44a40
all incoming case reports for dd6c32a4 will be switched to refer to 19f44a40
once my device has updated its entity list, I will no longer be able to make a case report against dd6c32a4
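
To make the tombstone idea a bit more concrete, here is a rough Python sketch of the server-side bookkeeping for the steps above. Everything in it (`Entity`, `EntityStore`, `merged_into`, the in-memory dicts) is made up for illustration and is not an existing ODK Central API; it also assumes the decision that two entities are duplicates has already been made elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    uuid: str
    # Hypothetical field: when set, this record is a tombstone pointing
    # at the entity it was merged into.
    merged_into: Optional[str] = None


class EntityStore:
    """In-memory stand-in for the server-side entity and report tables."""

    def __init__(self):
        self.entities = {}
        self.reports = []

    def register(self, uuid):
        self.entities[uuid] = Entity(uuid)

    def resolve(self, uuid):
        # Follow tombstones until we reach a live entity.
        while self.entities[uuid].merged_into is not None:
            uuid = self.entities[uuid].merged_into
        return uuid

    def submit_report(self, entity_uuid, payload):
        # Incoming reports against a tombstone are switched to the survivor.
        report = {"entity": self.resolve(entity_uuid), "payload": payload}
        self.reports.append(report)
        return report

    def merge(self, duplicate, canonical):
        if self.entities[duplicate].merged_into is not None:
            raise ValueError(f"{duplicate} is already a tombstone")
        canonical = self.resolve(canonical)
        if duplicate == canonical:
            raise ValueError("cannot merge an entity into itself")
        # Replace the duplicate with a tombstone ...
        self.entities[duplicate].merged_into = canonical
        # ... and re-point the reports that were already filed against it.
        for report in self.reports:
            if report["entity"] == duplicate:
                report["entity"] = canonical

    def selectable(self):
        # The entity list a device would receive: tombstones are left out.
        return sorted(u for u, e in self.entities.items() if e.merged_into is None)


store = EntityStore()
store.register("19f44a40")                       # existing patient
store.register("dd6c32a4")                       # registered offline
store.submit_report("dd6c32a4", "case report 1")
store.merge("dd6c32a4", "19f44a40")              # found to be a duplicate
late = store.submit_report("dd6c32a4", "case report 2")
```

After the merge, both case reports refer to 19f44a40, and `store.selectable()` no longer offers dd6c32a4, which corresponds to the last step in the list.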

For the specifics of the de-duping process, I would probably use a combination of approaches. First you need to find possible matches, probably using an n-gram algorithm (or possibly Levenshtein distances) on identifying fields such as name, village, etc. This is then combined with matches on other fields (e.g. date of birth or geopoint) to calculate a similarity score. You can then work out threshold values and say something like "if it's over 95%, just merge them automatically" and "if it's over 80%, flag them as probable dupes", and provide a simple interface that displays the flagged pairs with yes/no buttons. I've done something like this for de-duping patient lists in DRC and it worked pretty well.
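
As a rough illustration of the scoring, here is a minimal Python sketch that uses character trigrams with Jaccard overlap on the text fields and an exact match on date of birth. The field names, the weights, and the choice of Jaccard are assumptions made for the example, not what was used in DRC; only the 95% and 80% thresholds come from the description above.

```python
def ngrams(text, n=3):
    # Pad with spaces so that short strings still produce n-grams.
    padded = " " * (n - 1) + text.lower().strip() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    # Jaccard overlap of the two n-gram sets, between 0.0 and 1.0.
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Illustrative weights only; they would need tuning on real data.
WEIGHTS = {"name": 0.5, "village": 0.2, "dob": 0.3}


def similarity_score(rec_a, rec_b):
    # Fuzzy match on identifying text fields, exact match on date of birth.
    score = WEIGHTS["name"] * ngram_similarity(rec_a["name"], rec_b["name"])
    score += WEIGHTS["village"] * ngram_similarity(rec_a["village"], rec_b["village"])
    score += WEIGHTS["dob"] * (1.0 if rec_a["dob"] == rec_b["dob"] else 0.0)
    return score


def classify(score, auto_merge=0.95, probable=0.80):
    if score > auto_merge:
        return "auto_merge"
    if score > probable:
        return "probable_duplicate"  # goes to the yes/no review screen
    return "distinct"


a = {"name": "Marie Kabila", "village": "Kisangani", "dob": "1990-04-12"}
b = {"name": "Marie Kabila", "village": "Kisangani", "dob": "1990-04-12"}
c = {"name": "John Smith", "village": "Goma", "dob": "1975-01-30"}

print(classify(similarity_score(a, b)))
print(classify(similarity_score(a, c)))
```

Comparing every pair of records is quadratic, so on a large entity list you would probably want to narrow down the candidates first (for example, only compare records from the same village) before scoring them.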

Another thing that @aurdipas suggested is that you could check through a list of entities before registering a new one with Form A, to make sure that the person/village/tree you're about to register doesn't already exist in the database, which is a good idea.

Any thoughts?