Dataset and Entities creation via API

TomJ · August 22, 2023, 2:34pm

1. What is the issue? Please be detailed.
Trying out entities as a route to replacing our multiple-times-a-day definition updates.
As we have a catalogue of data from historical forms (and other sources), we want to create datasets and populate them with entities through the API. Trying to do this manually via a form - even using a csv dataset as a source - would be too slow to genuinely entertain as an option.
What I would like to do is:
- Call the API to create each dataset required
- Assign properties to each dataset
- Loop over my data, using the API to create an entity for each record

The issues are:

Issue No.1: Cannot create datasets via API
As I understand it from the documentation section "Specify the Dataset the Form should save Entities to", datasets can only be created from within a form definition.
In essence, this requires that I build a form and publish it, purely to instantiate the dataset in the project.

Issue No.2: Cannot add properties to an existing dataset via API
Assuming I have a dataset in the project, there is no method to add additional properties to it except through a form definition. A simple form with the minimum required to instantiate a dataset, where the dataset name could be updated and a new definition uploaded for each dataset, would solve Issue No.1. It is not useful in practice, though, because all the properties need to be defined in the definition as well, which requires a fully expanded form unique to each dataset.

Issue No.3: Creating an entity requires that the "UUID" be passed with the request; this cannot be known before creation.
I was missing something here... (poor coding).
Collect will accept new entities as long as the UUID closely resembles one created by a submission, i.e. "uuid: '1' " & "uuid: '5e1e671c-1234-abcd-6129869730bd' " are not valid.
So now my question is: what is the convention, so that I can programmatically produce acceptable UUID values?

2. What steps can we take to reproduce this issue?
Issue No.1 & No.2: these appear to be missing methods in the API.
Issue No.3: attempt to add an entity with a UUID whose construction departs incrementally from that of an existing one.

3. What have you tried to fix the issue?
Issue No.1 & No.2: I see no avenue to resolve these via the RESTful API as it currently exists. (I would love to be wrong!)
Issue No.3: I have attempted changing the UUID value digit by digit until the creation is rejected by the server. Without an understanding of the convention used to build the UUID string, this seems like an unreliable way to establish a known range to use in production.

If there is no pathway forward with the RESTful API, has anyone in the community attempted to interact with the OpenRosa endpoints directly?

4. Upload any forms or screenshots you can share publicly below.

ktuite · August 22, 2023, 10:45pm

Hi @TomJ,

Welcome! When you have a moment, please introduce yourself!

To answer your first two questions, you're correct that there is currently no way to create a dataset or add properties via the API. It must be done by building, updating, and publishing forms. I agree that this is not ideal, and it is not the first time it has come up. I will think about how easy it would be for us to add this missing functionality to the API.

To answer your third question, yes, you do need to programmatically generate the UUID yourself when creating entities via the API. Central is expecting a v4 UUID with no "uuid" prefix. In Python, I've used the uuid library and uuid.uuid4().
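As a rough illustration of what "a v4 UUID with no prefix" means in practice, here is a minimal sketch that checks a candidate string with Python's standard uuid module. The helper name is_v4_uuid is hypothetical, and the check reflects what the Python library considers a v4 UUID; the exact validation Central performs is an assumption and may differ in detail.

```python
import uuid


def is_v4_uuid(value: str) -> bool:
    """Return True if value is a plain, hyphenated v4 UUID string."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        # Wrong length or non-hex characters, e.g. "1"
        return False
    # uuid.UUID also tolerates prefixes such as "uuid:" and braces, so
    # compare against the canonical form to reject those as well.
    return parsed.version == 4 and str(parsed) == value.lower()


# A freshly generated value should always pass the check.
new_id = str(uuid.uuid4())
print(new_id, is_v4_uuid(new_id))
```

Both rejected examples from the original post fail this check, because neither has the 32 hex digits a UUID requires.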

Here's an example payload I've used to create an entity:

{
  "uuid":"3efd0449-3dd0-449f-9897-dd445a0befd5",
  "label":"My Entity",
  "data":{
    "some_property":"foo"
  }
}
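
For the "loop over my data" step, payloads of this shape could be built from existing records along the following lines. This is only a sketch: build_entity_payloads and label_key are hypothetical names, and taking the label from one field while sending the remaining fields as properties is an assumption about the source data, not something Central requires.

```python
import uuid


def build_entity_payloads(records, label_key):
    """Build one entity payload per record, each with a fresh v4 UUID."""
    payloads = []
    for record in records:
        payloads.append({
            "uuid": str(uuid.uuid4()),
            # Assumption: one field of the record serves as the label.
            "label": str(record[label_key]),
            # Every other field is sent as an entity property.
            "data": {k: v for k, v in record.items() if k != label_key},
        })
    return payloads


records = [
    {"name": "My Entity", "some_property": "foo"},
    {"name": "Another Entity", "some_property": "bar"},
]
for payload in build_entity_payloads(records, label_key="name"):
    print(payload)
```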

TomJ · August 23, 2023, 1:28am

Thanks @ktuite for the clarification.
I have implemented a rough-and-dirty version of your suggestion above and it all runs fine.

    # Assumes `import uuid` at module level.
    def create_entity(self, odk_api, project_id: int, dataset_name: str, label: str, data):
        # Collect the UUIDs of all entities already in the dataset
        existing_entities = odk_api.get_entities_metadata(str(project_id), dataset_name)
        used_uuids = {entity["uuid"] for entity in existing_entities}

        # Generate a new v4 UUID, regenerating until it is not already in use
        new_uuid = str(uuid.uuid4())
        while new_uuid in used_uuids:
            new_uuid = str(uuid.uuid4())

        # Create the entity
        request_body = {
            "uuid": new_uuid,
            "label": label,
            "data": data
        }
        return odk_api.post_create_entity(str(project_id), dataset_name, request_body)

Less of a question and more of an observation:
This approach (it's the same for creating a submission, IIRC) seems to present a potential risk of a UUID clash. I know the namespace is huge and the risk is low, but there is still room to push a request with a UUID that was taken between checking the existing entities and sending the new one. Large datasets/submission counts, async functions, and multiple origins (my app/someone else's app/multiple Collect instances/etc.) could all get a request into the server in the time it takes me to check, generate, and send a request.
I am sure there is a back-end reason for not generating these UUIDs on the server side; it just seems odd.

ktuite · August 23, 2023, 7:26pm

The probability of a UUID collision, while not zero, is extremely low, which is part of why we went with this approach.

Another reason we put the responsibility of generating the UUID on the client is to support offline entities in Collect in the future, where things can happen with entities without requiring a round trip to the Central server first. We could have had a mixed approach of taking the client-provided UUID if it existed or making a new one server-side, but we did not, I think because it didn't really solve the possible conflict problem and it was more complicated. Not as relevant to you primarily using the API, but I just wanted to mention it!

As for collisions, I kind of want to say... don't worry about it? Alternatively, instead of pre-fetching all the UUIDs and checking up front that there is no collision, you can rely on Central's backend: the database checks for UUID collisions, and it does so faster than a Python check. You could generate a UUID + entity data, try submitting it to Central, and handle the error if it fails for any reason. It will return a 409 Conflict error if the new UUID is not unique. Your code above could go in that error handler, so that it runs only when there is an error rather than every time you submit an entity.
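
A minimal sketch of that submit-first approach might look like the following. The post_entity callable is a hypothetical stand-in for whatever client function sends the request and returns the HTTP status code; it is not part of any real library. The sketch simply regenerates the UUID on a 409 rather than pre-fetching existing entities.

```python
import uuid
from typing import Callable


def create_entity_with_retry(
    post_entity: Callable[[dict], int],  # hypothetical: sends payload, returns HTTP status
    label: str,
    data: dict,
    max_attempts: int = 3,
) -> str:
    """Submit an entity, retrying with a fresh UUID only on 409 Conflict."""
    for _ in range(max_attempts):
        payload = {"uuid": str(uuid.uuid4()), "label": label, "data": data}
        status = post_entity(payload)
        if status == 409:
            # UUID already in use on the server: try again with a new one.
            continue
        if 200 <= status < 300:
            return payload["uuid"]
        # Any other failure is not a UUID problem, so do not retry blindly.
        raise RuntimeError(f"Entity creation failed with HTTP status {status}")
    raise RuntimeError(f"Still conflicting after {max_attempts} attempts")
```

With this shape, the common case costs a single request per entity, and the collision handling only runs when the server actually reports a conflict.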