See [the forum table of contents](https://forum.getodk.org/t/odk-ecosystem-longi…tudinal-data-collection-table-of-contents/22234)
- https://github.com/getodk/central/issues/298 adds Datasets of Entities generated from form Submissions and attached to follow-up forms using the existing CSV mechanism.
<details>
<summary>2018 strawman proposal from @admbtlr</summary>
## User Stories
*As a health worker, I want to be able to collect a medical record every time a patient visits my health facility, so that I can keep track of the patient's progress over time*
*As a census taker, I want to visit a village every year and record population data*
*As a vaccine delivery driver, I want to keep track of the quantities of vaccines that I deliver to cold storage facilities during my weekly deliveries*
*As a regional vaccine administrator, I want to download CSV files that show the quantities of vaccine that have been delivered to all the cold storage facilities in my region over the last six months*
## Proposed Implementation
For the sake of this explanation, I'm going to use the following terminology:
- **Entity** refers to the thing about which data is collected. The kind of thing -- the "entity type" -- will depend on the use case. So in the above user stories, the entity types would be "patient", "village", "cold storage facility", "cold storage facility"
- **Record** refers to one round of data collection. So in the above user stories, a record would be:
1. the details of a patient's visit to a health facility
2. an annual set of population data for a village
3. the quantities delivered to a cold storage facility in a given week
4. again, the quantities delivered to a cold storage facility in a given week
The simplest solution is probably to have two separate forms, one to collect the details of an entity ("the Entity Form") and one to collect the details of each visit ("the Record Form"). A Record must have one (and only one) Entity associated with it. An Entity can have multiple Records associated with it.
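
As a rough illustration of that one-to-many relationship, here is a minimal Python sketch. The class names, field names and the `attach` helper are all hypothetical and only meant to make the constraint concrete; nothing here is an existing ODK data model.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    record_id: str
    entity_id: str  # a Record points at exactly one Entity


@dataclass
class Entity:
    entity_id: str
    entity_type: str  # e.g. "patient", "village"
    records: list = field(default_factory=list)  # an Entity may have many Records


def attach(record, entities):
    """Link a Record to its Entity; raises KeyError if the Entity is unknown."""
    entity = entities[record.entity_id]
    entity.records.append(record)
    return entity
```

A Record whose `entity_id` does not match a known Entity is rejected here, which is the "one (and only one)" rule in its simplest form.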
### The Entity Form
Forms for creating entities must have a certain field (or fields) marked as an "identifying field". This would be for example a patient's name and DOB, or a village name and region, or a cold storage facility name and ID number. These identifying fields can then be used as labels in the CSV file that the Record Form uses to enable a data collector to choose the linked Entity. Entity Forms can also have fields marked as "filter fields". These will be used to reduce the number of options shown in the list of Entities (see *Getting Entity lists onto devices* below).
### The Record Form
Forms for creating records must have one attribute called `entity_type_id`; this attribute can only contain the UUID of an Entity Form. They must also have one field called `entity_id`. This field should be of type `select_one_external` (see *Getting Entity lists onto devices* below).
### Getting Entity lists onto devices
The first question in a Record Form should be a selection of the associated Entity. This question should be of type `select_one_external`. The values will then be loaded into the form from an external CSV file that is downloaded from the server. The CSV file should have the following format:
```
list_name,name,label,<filter_field_1>,<filter_field_2>,...
entities,<instanceID>,<identifying field value>,<filter field 1 value>,<filter field 2 value>,...
entities,<instanceID>,<identifying field value>,<filter field 1 value>,<filter field 2 value>,...
...
```
[More](http://xlsform.org/#external) on external CSV files in X(LS)Forms.
These CSV files should be generated automatically by ODK Central, and updated every time a new Entity Form is submitted. It should then be possible to use the automatic form update functionality to keep the CSV file up to date. _[Question: if a media file is updated - in this case the CSV - does that count as an updated form? or would ODK Central have to automatically make a new version of the form each time it updates the CSV file?]_
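
To make the generation step concrete, here is a minimal Python sketch of how Entity Form submissions could be rendered into the CSV layout above. It is not how ODK Central does or would do it: the function name, the dict-shaped submissions, the `instanceID` key and the choice to join identifying fields with a space are all assumptions for illustration.

```python
import csv
import io


def build_entities_csv(submissions, identifying_fields, filter_fields):
    """Render Entity Form submissions as a select_one_external CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header row: fixed columns first, then one column per filter field.
    writer.writerow(["list_name", "name", "label", *filter_fields])
    for sub in submissions:
        # The label is built from the identifying fields (assumed: space-joined).
        label = " ".join(str(sub[f]) for f in identifying_fields)
        filters = [sub.get(f, "") for f in filter_fields]
        writer.writerow(["entities", sub["instanceID"], label, *filters])
    return out.getvalue()
```

For example, calling it with `identifying_fields=["name", "dob"]` and `filter_fields=["region"]` yields one `entities` row per submission, with the region available for filtering in the Record Form.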
### Local Entities
A common use case is to create an Entity and then immediately create a Record for that Entity. In an offline scenario, this is not possible with the spec so far. It is therefore necessary to add a mechanism for adding Entities locally, within ODK Collect. Every time the Entity Form is completed, the data should be written to a local CSV file (or a local database?). There should then be a mechanism whereby the local CSV file is merged with the downloaded CSV file whenever the Record Form is opened.
It might make sense to clean up the local CSV file every time a new CSV file is downloaded from the server, but it's questionable whether this will be necessary (one reason: if an Entity is deleted on the server, it will still be in the local CSV and the merge will make it available in the form).
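
The merge could look something like the following Python sketch. It assumes both files use the column layout described above, keys rows on the `name` column (the instance ID), and lets the downloaded row win when both files contain the same Entity. The function name and the conflict rule are assumptions, and the sketch does not address the deleted-on-server case mentioned above.

```python
import csv
import io


def merge_entity_csvs(downloaded_text, local_text):
    """Merge two entity CSVs keyed on `name`; downloaded rows win on conflict."""
    merged = {}
    fieldnames = []
    # Downloaded rows are read first, so local rows only fill in the gaps.
    for text in (downloaded_text, local_text):
        reader = csv.DictReader(io.StringIO(text))
        for col in reader.fieldnames or []:
            if col not in fieldnames:
                fieldnames.append(col)
        for row in reader:
            merged.setdefault(row["name"], row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for row in merged.values():
        writer.writerow(row)
    return out.getvalue()
```

Keying on the instance ID means a locally created Entity simply disappears from the "local only" set once the server copy arrives in the downloaded file.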
## Required Changes
### XForm Spec
- addition of concept of an Entity Form and a Record Form (not sure if this is totally necessary, but ODK Central will need to recognise an Entity Form so that it can do the automatic generation of CSV files)
- addition of identifying fields and filter fields
### ODK Central
- automatic generation of CSV files from Entity instances
- automatic form update after generation of CSV file (is this necessary?)
- a UI to enable display of Records by Entity
### ODK Collect
- ability to store a local Entity Instances CSV file and merge it with a downloaded CSV
## Additional Notes
### De-duplication of Entities
It would make sense to build some duplicate detection and resolution into ODK Central. Ideally, it would only be possible to do data collection on entities that have come from Central, so that they will always have to go through this de-duping, but this is obviously not acceptable if I want to register a patient and then make a case report on them in a totally offline setting. I could see a possible solution using a kind of tombstone for de-duped entities, so that a process might look like this:
- while offline, I register patient `dd6c32a4` using Form A
- `dd6c32a4` is now marked as "pending" on my device, which means I can submit case reports against it, but it's not on Central
- I then do a case report on `dd6c32a4` using Form B
- when eventually online, I submit both to Central
- it turns out that patient `dd6c32a4` is an exact duplicate of an existing patient, `19f44a40`, who already has case reports
- (more details about how exactly de-duping works here)
- my case report is switched to refer to the existing patient, `19f44a40`
- patient `dd6c32a4` is replaced in Central with a tombstone that refers to `19f44a40`
- all incoming case reports for `dd6c32a4` will be switched to refer to `19f44a40`
- once my device has updated its entity list, I will no longer be able to make a case report against `dd6c32a4`
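
The tombstone redirection in the steps above can be sketched in Python as follows. This assumes tombstones are stored as a simple mapping from a retired Entity ID to the ID that replaced it; the function names and the dict-shaped case reports are hypothetical.

```python
def resolve_entity_id(entity_id, tombstones):
    """Follow tombstone links until a live Entity ID is reached."""
    seen = set()
    while entity_id in tombstones:
        if entity_id in seen:
            # Guard against a malformed chain that loops back on itself.
            raise ValueError(f"tombstone cycle at {entity_id}")
        seen.add(entity_id)
        entity_id = tombstones[entity_id]
    return entity_id


def retarget_reports(reports, tombstones):
    """Return copies of the reports, each pointing at the surviving Entity."""
    return [
        {**report, "entity_id": resolve_entity_id(report["entity_id"], tombstones)}
        for report in reports
    ]
```

With `tombstones = {"dd6c32a4": "19f44a40"}`, any incoming case report for `dd6c32a4` is rewritten to refer to `19f44a40`, and following the chain also covers the case where a merged Entity is itself merged again later.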
For the specifics of the de-duping process, I would probably use a combination of approaches. First you need to find possible matches, probably using a trigram algorithm (or possibly Levenshtein distance) on identifying fields such as name, village, etc. There's a really good trigram module for Postgres (`pg_trgm`). This is then combined with matches on other fields (e.g. date of birth or geopoint) to calculate a similarity score. You can then choose thresholds and say something like "if it's over 95%, just merge them automatically" and "if it's over 80%, flag them as probable dupes", and provide a simple interface that displays the flagged pairs with yes/no buttons. I've done something like this for de-duping patient lists in DRC and it worked pretty well.
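
As a rough sketch of the scoring idea, the Python below computes a trigram (Jaccard) similarity on names, combines it with an exact date-of-birth match, and applies the two thresholds. It is a simplified pure-Python stand-in, not the Postgres module's exact behaviour. The 0.7/0.3 weights, the `name` and `dob` keys and the function names are assumptions chosen purely for illustration.

```python
def trigrams(text):
    """Set of 3-character windows over the lowercased, space-padded text."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def similarity_score(a, b, name_weight=0.7, dob_weight=0.3):
    """Weighted mix of fuzzy name similarity and exact DOB match (assumed weights)."""
    name_sim = trigram_similarity(a["name"], b["name"])
    dob_sim = 1.0 if a["dob"] == b["dob"] else 0.0
    return name_weight * name_sim + dob_weight * dob_sim


def classify(score, auto_merge=0.95, flag=0.80):
    """Map a score onto the three outcomes described above."""
    if score > auto_merge:
        return "merge"
    if score > flag:
        return "review"
    return "distinct"
```

Pairs classified as `"review"` would be the ones shown in the yes/no interface; in practice the weights and thresholds would need tuning against real data.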
</details>