Proposal: add `centroid` and `base64Binary-to-string` functions

LN · February 2, 2023, 9:14pm

I want to draw attention to https://github.com/getodk/xforms-spec/pull/301 which proposes two additions to the ODK XForms functions. It additionally clarifies that for convenience, new functions are not namespaced even if they don't come from W3C XForms.

centroid was suggested in the thread "Add centroid function for geoshapes" by @MartinFroglife

base64Binary-to-string was suggested in the thread "Base64 decode support" by @tobiasmcnulty

@eyelidlessness any foreseen issues with Enketo implementations?

@Xiphware you are always a careful reviewer of all things spec-related so I'm interested in your thoughts. I know you're not going to love the non-namespacing. I don't love it either but I think it's helpful to codify what has actually happened in the past and is practical for our users.

eyelidlessness · February 2, 2023, 9:39pm

I think both of these should be straightforward

Xiphware · February 2, 2023, 11:50pm

Couple quick comments:

centroid() will need to [gracefully] handle being fed an arbitrary string, presumably containing a valid geoshape. But we'll have to explicitly define what it'll do if it's not valid; in particular if somebody inadvertently feeds it a geotrace (i.e. the last point isn't the same as the first...)
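To make the validity question concrete, here is a minimal Python sketch of the check described above. It is not taken from the spec PR or from any ODK implementation. It assumes a geoshape is a semicolon-separated list of points whose first two space-separated fields are latitude and longitude, uses a simple planar centroid purely for illustration, and returns `None` for invalid input only as a placeholder, since the actual failure behavior is exactly what still needs to be defined. The names `parse_geoshape` and `centroid` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def parse_geoshape(value: str) -> Optional[List[Point]]:
    """Return the ring's points, or None if value is not a closed geoshape."""
    points: List[Point] = []
    for chunk in value.split(";"):
        fields = chunk.split()
        if not fields:
            continue  # tolerate empty chunks such as a trailing semicolon
        if len(fields) < 2:
            return None
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            return None  # arbitrary, non-numeric text
    # A closed ring needs at least three corners plus the repeated first point.
    # A geotrace fails here because its last point differs from its first.
    if len(points) < 4 or points[0] != points[-1]:
        return None
    return points


def centroid(value: str) -> Optional[Point]:
    """Planar (shoelace) centroid of a geoshape; None for invalid input."""
    ring = parse_geoshape(value)
    if ring is None:
        return None
    area2 = 0.0  # twice the signed area
    sum_lon = sum_lat = 0.0
    for (lat0, lon0), (lat1, lon1) in zip(ring, ring[1:]):
        cross = lon0 * lat1 - lon1 * lat0
        area2 += cross
        sum_lon += (lon0 + lon1) * cross
        sum_lat += (lat0 + lat1) * cross
    if area2 == 0:
        return None  # degenerate shape with no area
    return (sum_lat / (3 * area2), sum_lon / (3 * area2))
```

With this sketch, the closed square `"0 0 0 0; 0 2 0 0; 2 2 0 0; 2 0 0 0; 0 0 0 0"` yields `(1.0, 1.0)`, while the same points without the repeated first point (a geotrace) and arbitrary text both yield `None`.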

base64Binary-to-string seems a bit verbose; any reason not to simply make it base64-decode? Also, it's worth noting that, again, this could be fed an arbitrary string, so we'll have to explicitly define its behavior when it fails.

If we're adding base64 decode, we might consider adding base64 encode while we're at it, to effectively provide roundtrip base64 support.

LN · February 3, 2023, 1:22am

Thanks for the feedback and questions! Some background I put in the PR and should have put here too:

centroid is based on http://expath.org/spec/geo#d2e981 but with ODK types.

base64Binary-to-string is based on https://www.saxonica.com/html/documentation10/functions/saxon/base64Binary-to-string.html for its simplicity.

Additional candidates:

http://expath.org/spec/binary#decode-string has more complexity than we need
https://synapse.apache.org/userguide/xpath.html#base64_decode is similar but the Saxonica function better matches our naming conventions

None of those sources discuss errors. Error states also aren't specified in XPath 1.0, which is the basis of what our tools support. The XPath 3 spec has some discussion of errors, and it leaves some latitude to implementations. Our implementations tend to verify inputs and either throw an error or return a default value, but they're not very consistent about it. I think both new functions could throw exceptions when given invalid values. What do you think? Do you think we should start documenting that in the spec?

I like the idea of exactly matching an existing spec's signature (Saxon's). But I don't feel strongly about it and am happy to do base64-decode!

As for base64 encode, my preference would be to introduce it when there's a clear use case, but again, I don't feel strongly about it!

yanokwa · February 3, 2023, 11:30pm

I prefer the elegance of base64-decode().

If it's easy to add encode, let's do it. If it's not, skip it.

LN · February 10, 2023, 11:21pm

@TobiasMcNulty commented on the spec PR and also prefers a shorter name. Let's go with base64-decode as suggested by @Xiphware.

How about using UTF-8 as the encoding and not making that user-configurable? If there's a need for configurability, we can always add a parameter later. Note that US-ASCII is a subset of UTF-8.

For invalid input, I'm currently thinking that producing blank output is the best option. The problem with throwing an exception is that it could lead to Validate errors or exceptions at the wrong time if the input needs to be built up. @Xiphware you'll remember this kind of case with dates. My sense is that it will be pretty clear for data collectors that things have gone wrong if they end up in a situation where they expect to see a value and it's blank.
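As an illustration of the behavior proposed here (fixed UTF-8 encoding, blank output for invalid input), a minimal Python sketch follows. It is not the code from the pull request mentioned below; the function name `base64_decode` and the choice of strict base64 validation are assumptions made for the example.

```python
import base64
import binascii


def base64_decode(value: str) -> str:
    """Decode base64 text as UTF-8, returning "" instead of raising."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(value, validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, ValueError):
        # Covers malformed base64 and bytes that are not valid UTF-8
        return ""


print(base64_decode("aGVsbG8="))     # hello
print(base64_decode("not base64!"))  # prints an empty line
```

Because US-ASCII is a subset of UTF-8, ASCII payloads decode unchanged. Bytes that are valid base64 but not valid UTF-8 (for example `"/w=="`) also produce blank output in this sketch.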

I have a pull request with tests available if you want to see what all that looks like for base64-decode.

Xiphware · February 13, 2023, 4:23pm

Agree. No reason to add additional parameters (now) when 99% of the time users won't need anything but the default.

For invalid input, I'm currently thinking that producing blank output is the best option.

Seems reasonable. My main concern was that there is a reliable means by which to tell the user their data is 'bad'; in this case, if the original 'base64' source string is not empty, but the base64-decode() result is, that means their original data is bad [vs the original base64 string just hasn't been set yet...]. E.g. this could be written in a constraint expression.
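The logic of that check can be sketched in Python. This is only an illustration of the condition, not form syntax: `has_bad_base64` is a hypothetical helper, and `_decode` stands in for base64-decode() under the assumption that it uses UTF-8 and returns blank on failure.

```python
import base64
import binascii


def _decode(value: str) -> str:
    """Stand-in for base64-decode(): UTF-8 text, or "" on any failure."""
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except (binascii.Error, ValueError):
        return ""


def has_bad_base64(source: str) -> bool:
    """True when a value was supplied but could not be decoded."""
    # An empty source just means the value has not been set yet.
    return source != "" and _decode(source) == ""
```

A constraint built on the same idea would pass when the source is still empty or decodes to something, and fail only when a non-empty source decodes to blank.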