Collect will need to stop using IMEI as deviceID and making simSerial and subscriberID available

LN · January 6, 2020, 5:55pm

Google recently introduced stricter target API level requirements for applications published to the Play Store. ODK Collect currently targets Android 9 but as soon as Android 11 is released, it will need to target Android 10. This will likely be in fall 2020.

This has two major implications: Collect will no longer be able to use files in /sdcard/odk/ and it will no longer be able to read device-unique identifiers such as the International Mobile Equipment Identity which is used as deviceId (sent with every form submission and form list request). Because these may have impact across the ecosystem, I'll start a thread for each change describing a proposed approach and soliciting feedback. See Collect will need to stop using /sdcard/odk for files for the thread about the file storage change.

The ODK XForms specification for deviceId describes it as "Unique identifier of device." No unique device identifier will be available to Collect anymore. As an alternative, I propose that we use FirebaseInstanceId which uniquely identifies each app instance. This is recommended by the Android best practices. An alternative would be for Collect to generate and store its own UUID but I don't see any benefit to doing that given that Firebase is already in use for crash and usage reporting. I propose we prefix it with firebaseid: so that the source of the id is easily identifiable.

Both Aggregate and Central currently ignore deviceId as part of their primary functions (it does get logged which can be useful for troubleshooting). That said, deviceId can be requested as part of a form definition. There are certainly advanced users relying on those identifiers. Are there specific users who should be added to this conversation? @Batkinson comes to mind.

Does anyone know of any fork or ODK-compatible component that makes use of these identifiers? @tomsmyth (NEMO), @Ukang_a_Dickson (Ona), @jnm (Kobo)? Anyone else we should ask?

Although I don't think this will be very disruptive, I think we should delay making the change as long as we can so that it can be communicated. This means that we would make it in late summer of 2020 when we will be required to target API 29. In the meantime, we can use a community update to let as many system and form designers know as possible that the change is coming. Is there anything else we should do to alert users? We could do something like show a dialog to enumerators using a form that requests deviceId but I'd rather avoid that if we can because it would be confusing for enumerators.

EDIT: As @Grzesiek2010 has pointed out, additionally, the simSerial and subscriberID metadata properties will no longer be available.

yanokwa · January 6, 2020, 9:02pm

Agreed that we should make this switch as late as possible. Agree that putting it in a community announcement along with other big changes (e.g., scoped storage) seems reasonable. We can also put a note on social media warning folks. I agree that a dialog in Collect is not necessary or desirable.

I had initially thought that maybe we want to generate our own ID and not have the Firebase dependency, but Firebase and Android are pretty tightly coupled as is. And Google will likely do a much better job of keeping that ID stable than we will.

One minor tweak to your suggestion is to drop the id from the prefix and just make it firebase.

tomsmyth · January 6, 2020, 9:14pm

Thanks @LN for being on top of this!

We do log deviceID but only for debugging. We just drop it in a string column and leave it at that. So this shouldn't affect us negatively.

seadowg · January 7, 2020, 10:08am

This might not be a problem but my GDPR senses are tingling a little bit at this. Using a Firebase ID makes a submission potentially linkable to our analytics (the ID is in the submission and in Firebase somewhere) which if the submission includes username/user audit logging (or any PII) makes our analytics more sensitive. It's very unlikely that these two datasets would ever be combined but it does feel a little leaky from a privacy perspective.

I'd still vote for our own ID generation as then we don't have to think about that ID living in a third party database.

Grzesiek2010 · January 7, 2020, 11:19am

The topic is about the device id but just wanted to point out that the list of affected methods https://developer.android.com/about/versions/10/privacy/changes#non-resettable-device-ids contains getSimSerialNumber() and getSubscriberId() which we also use in ODK Collect.

LN · January 7, 2020, 6:14pm

That was what I was thinking. But I can see your point. I don't feel strongly about it one way or the other.

That's a good point, @Grzesiek2010. I've never actually seen these in use so I forgot about them. But indeed, https://opendatakit.github.io/xforms-spec/#metadata does describe simSerial and subscriberID metadata properties. I propose that we deprecate these in the ODK XForms spec. @martijnr, has Enketo been able to access these?

I'll amend my original post to add this.

yanokwa · January 7, 2020, 8:13pm

This is a great point. Generating our own IDs is easy (I think) so we should bias to making the data impossible to link to PII.

@LN One thing you should make clearer in the original post is that these new IDs are not stable. That is, both the Firebase and self-generated IDs are reset when you re-install the app or clear app data.

martijnr · January 7, 2020, 8:44pm

No, Enketo never had access to these. So fine with me to deprecate simSerial and subscriberId.

For deviceId Enketo is using a device-generated id prefixed with the domain (see https://enke.to/::YYyl).

LN · January 8, 2020, 5:28pm

@martijnr in Enketo, the ID is stored in some kind of browser-based storage, right? So it is per-browser and can be user-reset? I'm wondering whether we may also want to change the spec-level language around what deviceId is.

martijnr · January 8, 2020, 6:36pm

Yes, it is stored as a cookie per browser, so advanced users can reset it, or get a different id by switching to another browser. Following the spec "by device" (instead of by device && browser) would have been nice but we didn't want to go into user fingerprinting because of privacy concerns. So I see it as a best effort a web client can make.

I'm wondering whether we may also want to change the spec-level language around what deviceId is.

Such as unique for an 'app on a device' or something like that?

LN · January 8, 2020, 9:25pm

Yes, exactly. As you say, using an immutable device identifier is not great from a privacy standpoint which is why Android is no longer allowing it. Perhaps our spec should also explicitly mention that it should be user-resettable and kept as application state.
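
To make "user-resettable and kept as application state" concrete, here is a minimal Python sketch of the idea. It is not Collect or Enketo code: the `InstallIdStore` class, the JSON file standing in for app-private storage, and the `device_id` key are all assumptions made for illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path


class InstallIdStore:
    """Keeps a generated ID in application state (here: a JSON file).

    Hypothetical helper for illustration; a real app would use its own
    preferences or storage mechanism instead of a plain file.
    """

    def __init__(self, state_file: Path):
        self.state_file = state_file

    def get(self) -> str:
        """Return the stored ID, generating and saving one on first use."""
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())["device_id"]
        new_id = str(uuid.uuid4())
        self.state_file.write_text(json.dumps({"device_id": new_id}))
        return new_id

    def reset(self) -> str:
        """User-initiated reset: discard the old ID and issue a new one."""
        self.state_file.unlink(missing_ok=True)
        return self.get()


# Usage: the temporary directory plays the role of the app's data directory.
with tempfile.TemporaryDirectory() as app_data:
    store = InstallIdStore(Path(app_data) / "identity.json")
    first = store.get()
    same_again = store.get()   # unchanged while the app state is kept
    after_reset = store.reset()  # a reset, like clearing app data, gives a new ID
```

Because the ID lives only in application state, deleting that state (a reinstall or a "clear data" action) has the same effect as calling `reset()`.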

martijnr · January 8, 2020, 9:51pm

Yes, that sounds good to me, and Enketo could add an easier way to reset (if demand arises).

jpringle · January 16, 2020, 12:43am

We use deviceID in our survey operations at PMA, and if we lost it, we would need to make some changes. Here are two scenarios:

1. We want to know who is responsible for what submissions. The deviceID in a submission lets us know who submitted that form.
2. We also use deviceID to subset certain choice lists so that, for example, Enumerator A gets choices 1-10 and Enumerator B gets choices 11-20.

We don't want our Enumerators to self-identify in some way.

Before we start a new survey, we record the IMEI numbers from the work phones. We keep track of which enumerator has which phone and, transitively, which IMEI numbers. Therefore in scenario (1) above, we can easily link a deviceID in the form to the enumerator. For scenario (2), we use deviceID in an XLSForm choice_filter as form logic.
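
As a rough illustration of the two scenarios, the Python sketch below emulates the lookups outside of any form engine. The roster contents, the `deviceid` field name, and the helper functions are hypothetical; in practice scenario (2) runs inside the form through the choice_filter rather than in a script.

```python
from typing import Dict, List, Optional

# Hypothetical roster recorded before the survey: device identifier -> enumerator.
ROSTER: Dict[str, str] = {
    "device-aaa": "Enumerator A",
    "device-bbb": "Enumerator B",
}

# Hypothetical choice subsets per enumerator (scenario 2).
CHOICES: Dict[str, List[int]] = {
    "Enumerator A": list(range(1, 11)),   # choices 1-10
    "Enumerator B": list(range(11, 21)),  # choices 11-20
}


def enumerator_for(submission: dict) -> Optional[str]:
    """Scenario 1: link a submission to an enumerator via its device identifier."""
    return ROSTER.get(submission.get("deviceid"))


def choices_for(device_id: str) -> List[int]:
    """Scenario 2: return the choice subset assigned to the device's enumerator."""
    enumerator = ROSTER.get(device_id)
    return CHOICES.get(enumerator, [])
```

The same roster approach works with any identifier, as long as it can be read off each phone before the survey starts.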

Can we easily find out what unique identifier (firebase vs. Collect UUID) belongs to which phone? Could we access that unique identifier and use it for form logic?

yanokwa · January 16, 2020, 5:28pm

We can add the new deviceId in the User and device identity section of the general prefs to make it easier to track down and my expectation is that you'll be able to use it in the form logic.

The caveat is that the new deviceId is not stable. If a user uninstalls and reinstalls Collect, they'll get a new one.

The way the Play Store is heading suggests that if you want some stable ID for enumerators, enumerator-specific server side login is the best way forward.

jpringle · January 16, 2020, 6:58pm

What you have described would work for us!

LN · January 16, 2020, 10:45pm

Thanks, @jpringle! We will generate and display this identifier starting in the next Collect release so that folks can start preparing for the change.

Since we have agreed to generate our own IDs, we need to decide what those will look like. I see two options:

a random UUID encoded to base 64. That would make it a 22-character string that would look like CSG+GQxwSGGxxQAAdyLbtA or ieAP5DzHSXihTgAAC7f17g. Possible concerns are that they could very well contain bad words in any language and that they are long to type in.
a 16-character random alphanumeric string to match Enketo's behavior. Collisions will be more likely but still improbable. Similarly, they could contain bad words.
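
For comparison, the two options can be sketched in Python as follows. This is an illustration only, not Collect's implementation: the function names are made up, the 62-symbol alphabet (upper case, lower case, digits) is an assumption about what "alphanumeric" means here, and the collision estimate uses the standard birthday-bound approximation.

```python
import base64
import math
import secrets
import string
import uuid


def base64_uuid_id() -> str:
    """Option 1: a random (version 4) UUID, base64-encoded without padding."""
    raw = uuid.uuid4().bytes  # 16 bytes, 122 of the 128 bits are random
    # Base64 of 16 bytes is 24 characters ending in "=="; dropping the
    # padding leaves the 22 characters mentioned above.
    return base64.b64encode(raw).decode("ascii").rstrip("=")


# Assumed alphabet: 26 upper-case + 26 lower-case letters + 10 digits.
ALPHANUMERIC = string.ascii_letters + string.digits


def alphanumeric_id(length: int = 16) -> str:
    """Option 2: a random alphanumeric string of the given length."""
    return "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))


def collision_probability(id_count: int, random_bits: float) -> float:
    """Approximate chance of at least one collision among id_count random IDs."""
    exponent = id_count * (id_count - 1) / (2.0 * 2.0 ** random_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


UUID_BITS = 122.0                                      # random bits in a v4 UUID
ALNUM_BITS = 16 * math.log2(len(ALPHANUMERIC))         # about 95 bits

p_uuid = collision_probability(1_000_000, UUID_BITS)
p_alnum = collision_probability(1_000_000, ALNUM_BITS)
```

Under these assumptions, `p_alnum` is larger than `p_uuid` but both remain tiny even for a million installs, which matches the "more likely but still improbable" wording above.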

For reference, IMEIs are 15 digits long (and the way they are assigned guarantees uniqueness).

@martijnr, what is the purpose of the domain prefix? Is it for preventing collisions? Is it common for submissions to the same form to come in from different Enketo installs?

@jpringle does 15 vs 22 characters make a big difference to you? Would being able to copy the value to the clipboard of the device be helpful?

martijnr · January 20, 2020, 5:48pm

It's to make a distinction between apps on the same device (Collect, Enketo). I think this may have been written in one of the older spec documents that was used as a basis for ours.

LN · January 20, 2020, 10:15pm

Interesting! It seems that to create a roster of device ids someone would need to look them up separately for each app anyway so there wouldn't be a chance of confusion or conflict. That said, should Collect continue the pattern in some way? I suggested a uuid: prefix but I suppose we could do a collect: prefix to make it clear which app is generating the id.

jpringle · January 22, 2020, 12:49am

The length doesn't matter. We would prefer that the identifier be unique, and we would even be fine using base 16 encoding. Ability to auto-copy the number to the clipboard would be useful. But now that I think of it, we would probably have our enumerators submit a form with this ID as a field to help us assemble the dataset I described earlier.

Batkinson · January 30, 2020, 8:37pm

Hi @LN. Sorry for such a late response. Our systems do capture the device id during form submission so we can track down any funny business by being able to identify the source of the form submissions. However, we likely will not need this moving forward, since we are employing device-level authentication (which matched the typical usage pattern used up until now). We can get better information without these, so this likely will be phased out of our systems.