string-length() and string comparison count multiple blanks/spaces as one

1. What is the issue? Please be detailed.
The string-length() function seems to count any sequence of multiple blanks/spaces as only 1 character. So a string like " abc def " (3 leading, 3 middle, 3 trailing blanks) is counted as length = 8 instead of the actual 15 characters.

This seems to be a more general issue with how multiple spaces in strings are treated. A comparison also seems to treat multiple spaces as one; for example, " a b cd " = " a b cd " >> true. The same reduction of multiple spaces seems to happen when the XForm is generated, which might explain the strange results from the functions.

2. What steps can we take to reproduce this issue?
See example and screenshots below.
StringLength01.xlsx (10.6 KB)

For comparison
StringCompareSpaces01.xlsx (10.1 KB)

Generated XForm (download from XLSForm Online)
The original string " a b cd " loses the multiple spaces and becomes: " a b cd " (i.e. " a b cd ")

3. What have you tried to fix the issue?
Tried the different examples, looked at the XML specification, searched the forum.
Compared with the normalize-space() function. Viewed the generated XForm.
I could not find any documentation for this behaviour (which is also different from normalization).

4. Upload any forms or screenshots you can share publicly below.

The XLSForm

XLSForm Online

ODK Collect

Deployment was done with KoboToolbox. Maybe it should be tested with Central too?


So, to be clear, it looks like the associated strings in the generated XForm itself appear to have already been 'normalized' (i.e. multiple adjacent spaces collapsed into 1), correct? That is, the XPath string-length() function itself isn't doing anything wrong; rather, the conversion from XLSForm to XForm (i.e. pyxform) is losing them...
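To illustrate why collapsed strings would give these results, here is a minimal Python sketch. The `collapse_spaces` helper is hypothetical and only stands in for whatever cleaning happens during conversion; it is not pyxform's actual code.

```python
import re


def collapse_spaces(text: str) -> str:
    """Hypothetical stand-in for the cleaning step: shrink each run of spaces to one."""
    return re.sub(r" {2,}", " ", text)


original = "a   b    cd"  # 3 spaces, then 4 spaces
cleaned = collapse_spaces(original)

# The length function is correct; it just sees the already-collapsed string.
print(len(original), len(cleaned))

# Strings that differ only in the number of spaces compare equal after cleaning.
print(collapse_spaces("a  b") == collapse_spaces("a b"))
```

If the conversion behaves roughly like this, any string function applied afterwards (length, comparison, regex) would only ever see the collapsed text.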

Yes, thanks!

That is what I finally saw too. So the issue touches any string function, incl. regex. Do you think this conversion (pyxform) is ok? And is this documented anywhere?

I have now also updated the post above so that the multiple blanks are saved here.

I'm not aware that it's a documented 'feature' of pyxform (and no, it doesn't seem 'ok' to me...). I'm not intimately familiar with the pyxform codebase, but a quick look does reveal a clean_text_values() function which, if applied inappropriately, could result in this behavior.

Hopefully one of the pyxform experts can chime in here.


Whom should I address here, please? @LN? @yanokwa?

P.S.: We could even live with the 'feature', if it is explained in the XLSForm (and Enketo) documentation.

The pyxform experts are aware of this behavior and appreciate you bringing this to our attention. How did you discover this? Why is it a problem, and how urgent of a problem is it?


We could even live with the 'feature', if it is explained in the XLSForm (and Enketo) documentation, please.
I found it by chance, first with string-length, working on another thread here.

Meanwhile, we have seen several string treatments that are not well documented, e.g. stripping of leading, trailing and duplicate spaces in text UI widgets, and normalization in labels, incl. notes. (Also, normalization in forum threads here.) In these cases it seems that normalization is done, but the XForm cleaning in expressions reported above does not seem to fully trim; it only reduces each run to one blank, including leading and trailing ones.

These existing treatments might even have advantages, but please make them public/documented. For now, a workaround might be: when working with strings, first use normalize-space(), so you get what you expect and what is documented.
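For readers unsure what full normalization does compared with the partial cleaning described above, here is a small Python sketch of the XPath 1.0 normalize-space() rule (strip leading and trailing whitespace, collapse inner runs to one space). The function name `normalize_space` is just an illustrative Python helper, not part of any ODK or pyxform API.

```python
import re


def normalize_space(text: str) -> str:
    """Emulate XPath 1.0 normalize-space(): trim both ends, collapse inner whitespace runs."""
    # XPath whitespace is space, tab, carriage return and line feed.
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(" ")


print(repr(normalize_space("   a  b   cd   ")))
```

Unlike the cleaning reported in this thread, which seems to leave one leading and one trailing blank, this removes them completely, so the result is predictable regardless of what the conversion did.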

Examples


The original string stays with all spaces visible when you save and reopen the form, but is trimmed when you $-reference it and on submission (see below).

XLSForm


TextCleaning01.xlsx (13.1 KB)

Data

By default, cell values have whitespace stripped and collapsed. To turn that off, add a settings sheet to your workbook. In the header (first row) add a cell with the value "clean_text_values". For the setting value (2nd row), add a cell with the value "no" (or false). E.g.

| clean_text_values |
| no                |

Turning off the clean_text_values setting should preserve whitespace, except in header names, which need to be processed for parsing. I say "should" because there is no explicit test coverage for this setting right now, but it seems to work. There is just one relevant test, which checks that settings values have whitespace stripped by default, so I can't be certain of the exact effects or scope of this setting.

This setting seems to have been part of pyxform since May 2012 (commit). I don't know if it was ever documented. I suppose that most of the time, whitespace before or after content, or a run of more than one whitespace character within content, is a mistake (particularly easy to make when copy/pasting) that is tedious to find and can easily be corrected during form processing, so it's not worth interrupting a user to tell them to fix it. In some cases, like multi-choice, whitespace is not allowed in choice names, though this is prevented by an error shown to the user. Also, as alluded to in the above commit, there may be some issues with spaces in relation to downstream XML processing (or possibly HTML, since extra spaces won't be rendered in browsers).

As for use cases, I'm not clear on why you would want to turn this setting off. Why calculate the length of a static string if we already know " abc def " (2 spaces in the middle, one at either end) is 10 characters long? If it's interfering with regex, there is the "\s" symbol for matching whitespace characters. If it's for formatting content like labels and hints, I don't know if that actually works meaningfully on Collect or Enketo (considering responsive layout) - maybe for indenting multi-line text?
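As a rough illustration of the "\s" suggestion, here is a Python sketch. It assumes the form's regex engine treats `\s` the same way Python's `re` module does for ordinary spaces; the `matches` helper and the pattern are made up for the example.

```python
import re

# \s+ matches one or more whitespace characters, so the same pattern
# accepts a single space as well as a longer run.
PATTERN = re.compile(r"abc\s+def")


def matches(text: str) -> bool:
    """Return True if the whole text is 'abc', some whitespace, then 'def'."""
    return PATTERN.fullmatch(text) is not None


for candidate in ["abc def", "abc   def", "abcdef"]:
    print(repr(candidate), matches(candidate))
```

Written this way, the pattern does not depend on how many spaces survive the conversion.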

@wroos Can you please describe your use case for preserving whitespace in your XLSForm?

@ln If you would say this setting is officially supported, could you please add it to the "Additional Columns" sheet for settings at: https://xlsform.org/en/ref-table/ (I don't have edit access)? In that case we should also improve test coverage for it. Although if hardly anyone needs to disable this setting, then maybe it should be removed (similar to pyxform/617).
