.CSV file Lao characters not showing after export from ODK Briefcase

Hello to all,

Here in Laos, our enumerators use Lao characters. We used some in the XML form, no problem.
After successfully uploading our XML form to Aggregate, we successfully used this form for our surveys on Android devices (the Lao characters show well on the devices). By the way, we used HUAWEI - Mediapad T1 7.0 (8GB) tablets.
I then used ODK Briefcase v1.11.2 to Pull from the devices to my PC (Windows 10), then Export (into .csv files) to my PC again. The problem is that this last step gives me .csv files with "??????" instead of Lao characters.
Note that I get several .csv files because there are loops, which is why I need to use Briefcase and not Aggregate directly.
Note also that I tried to Send finalized forms to Aggregate directly, and 1) the characters show correctly on Aggregate server, 2) they also show well on the .csv file I exported from Aggregate. But as I said I have several loops, so I believe I must use Briefcase.
Note again that after pulling from Android Devices with Briefcase, I get submission.xml files in the ODK Briefcase Storage folder, so I tried opening them with Excel, and the Lao characters showed correctly.

I have looked at several workarounds on the ODK forum and the internet. I bet it is a problem with UTF-8 encoding, or something like that (unfortunately I am not a developer or IT specialist at all).
So I have tried to:

  1. open the .csv file with Notepad++ (as suggested in Search csv with UTF - #5 by Mitch_S) and encode it in UTF-8; it didn't work. Saving it afterwards as UTF-8 without BOM didn't work either.
  2. open the .csv file with Sublime Text (as suggested in CSV Support File - Not showing special characters on Android devices), but whichever encoding I choose (UTF-8, UTF-16 BE, UTF-16 LE), it doesn't show any Lao characters.
  3. open the .csv file with Open Office (as suggested in Devanagri Script (Hindi/Nepali) not rendering correctly - #4 by James_Dailey) instead of Excel, but again no success.

Here is a screenshot of what I get:

I am uploading my .csv file here, so if someone can help that would be great.

IDP ICS SURVEY 2018.csv (85.3 KB)

Let me know if you need more details.

Thank you in advance for your help.

Martin

Hi @Organic_Idp! Thanks for being so detailed in your post. It really helps figure out where the problem is.

This definitely sounds like Briefcase isn't exporting Unicode characters properly. Could you attach two or three of the submission.xml files with Lao characters to this post so @ggalmazor can try to reproduce?
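
For readers wondering how text can end up as "??????": a common cause is that the text gets encoded into a legacy character set that has no Lao letters, and every character that cannot be represented is replaced by a question mark. The sketch below shows that effect in Python, not in Briefcase's own Java code. The sample word and the choice of the Thai code page cp874 are assumptions made for illustration, not a diagnosis of this particular export.

```python
# Illustration only: this is not Briefcase code. The sample text and the
# cp874 (Thai) code page are assumptions chosen to show the effect.
lao_text = "ສະບາຍດີ"  # sample Lao word

# UTF-8 can represent every Lao character, so the round trip is lossless.
utf8_bytes = lao_text.encode("utf-8")
assert utf8_bytes.decode("utf-8") == lao_text

# A legacy code page without Lao letters cannot represent them. With
# errors="replace", each such character becomes a literal "?" byte.
legacy_bytes = lao_text.encode("cp874", errors="replace")
print(legacy_bytes)  # one "?" per Lao character

# The "?" bytes are plain ASCII, so reopening the file with a different
# encoding cannot bring the original characters back.
recovered = legacy_bytes.decode("utf-8")
print(recovered == lao_text)
```

If this is what happened, it would also explain why switching encodings in Notepad++, Sublime Text or Open Office made no difference: the Lao characters would already be gone from the file.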

Hi Yaw,

Thanks for your reply.
Here below are some of the submission.xml files pulled with Briefcase.
submission.xml (8.8 KB)
submission.xml (10.7 KB)
submission.xml (13.1 KB)

For information, here is what one of these submission.xml files looks like when opened with Excel (we can see the Lao characters):

They actually come with 1 media file each (1 picture), but I don't think this changes anything.

Thanks a lot for such a detailed report, @Organic_Idp!

I will take on this issue right away.

Hi, @Organic_Idp! Could you provide the blank form, please?

Hi @ggalmazor, thanks for helping!

Here is our Form (XML and XLS formats):
idp_organic_ics_survey_180618.xls (166.5 KB)
idp_organic_ics_survey_180618.xml (454.4 KB)

Hi, @Organic_Idp!

Here's what I've done:

  • I've uploaded the blank form to my testing Aggregate instance at aggregate-test-2.appspot.com (running 1.6.0)
  • I've manually uploaded the three submissions to Aggregate
  • I've exported the submissions with Aggregate and checked that the CSV file shows Lao characters correctly (checked that on Linux shell, Linux LibreOffice, Windows Notepad and Google Drive Spreadsheets by importing the CSV file)



  • I've pulled the form using Briefcase v1.11.2, exported it and checked that the CSV files show Lao characters correctly as well (same programs)



If you would agree to give me access to your Aggregate server, I could check that there is no problem on that end.

Also, could you run java -version and copy here the output? Knowing the exact version of Java you're using for Briefcase could give us a hint too.

Hi @ggalmazor !

I have run java -version (I guess you meant in the command prompt), and here is what I get:
C:\Users\Admin>java -version
java version "1.8.0_172"
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
Java HotSpot(TM) Client VM (build 25.172-b11, mixed mode, sharing)

To check the Aggregate server, yes you can (at https://idporganicicssurvey.appspot.com). Let me know if you need the username and password, and how I can send them to you privately.

But as I mentioned before, I already tried going through Aggregate with some submissions, like you did, and I also noticed that it worked (we get the Lao characters).
The issue is that I have 6 loops in my form, as you have seen, and therefore I cannot get the whole data (7 csv files: 1 main file plus 6 loops) from Aggregate, based on what I understand and what I have tried.
So I believe I must use Briefcase to pull directly from my Android devices (that is what I did last year, when we first used ODK, and it worked well), then Briefcase again to export from the "ODK Briefcase storage" (basically the XML submission files) to csv files. I think the problem occurs at this export step, because if I open the submission.xml files (with Excel for example) before they are exported, the Lao characters are visible.
Not sure if I am clear...

Anyway, I will leave it to you; I don't know much about all these things. Thanks again for your much appreciated help!

Oh, I think I understand your issue better :slight_smile:

So, the difference between what I've done and what you've done is that I've used Briefcase to export the form after pulling it, which outputs some CSV files. The screenshots I've attached in my previous comment were from opening those exported CSV files, which show Lao chars correctly.

Could you try to export the forms using Briefcase and check the results too, please?

I guess opening the submission XML files is not very reliable. Would those CSV files exported by Briefcase work for you?

Well, I think I am not clear...

In fact, the first thing you might want to confirm for me (because I am not sure I am right) is that in my case (with several loops in my form) I cannot really use the Aggregate server to export my submissions and get all my data, because the loops will not be exported as separate csv files, and instead I will get only one csv file like this


with hyperlinks in place of the loops (although it does show the Lao characters correctly...).
I can however use Briefcase for pulling/exporting.
That is what I did last year (with a very similar form and survey): I used Aggregate only to upload my form to the server (so that the Android devices could download the form), but after that I completely "bypassed" it, following this: https://docs.opendatakit.org/briefcase-using/#pulling-forms-from-collect then this: https://docs.opendatakit.org/briefcase-using/#export-forms-to-csv. At that time I had no problem with Lao characters. Note that the Briefcase version was an older one.

Let me explain step by step what I do:

  1. I upload my XML form to the Aggregate server
  2. our enumerators download the form on their Android devices using ODK Collect. They carry out their survey (all Lao characters show OK)
  3. I take each Android device and copy the zipped odk files to my computer with a cable
  4. I use Briefcase to pull the odk files from my computer (in the latest version of Briefcase there are only 3 choices for "Pull from": Aggregate server, Collect directory and Form definition. When I say I pull from my computer, I mean I pull from "Collect directory")
  5. I use Briefcase to export into csv files.

When I follow this process, the csv files have the Lao characters problem.

When I follow your process, I also get the character problem:
3) manually upload submissions to Aggregate,
4) use Briefcase to pull ("Pull from: Aggregate server"),
5) use Briefcase to export into csv.
Then the Lao characters again don't show properly in the csv (using Notepad++ for example)...

That is why I think the problem comes from Briefcase (since when I export from Aggregate there are no problems...except that I don't get my csv files for each loop, therefore it is not an option for me).

Thanks for the clarifications, @Organic_Idp!

Yes, forms with repeat groups can't be fully exported in Aggregate. You need to use Briefcase in that case.

Well, the problem is that I haven't been able to reproduce the problem in Ubuntu 18.04 or Windows 10 following those exact steps. That tells us there is something extra that's affecting the process. Maybe it's the regional settings, the Windows version, the fonts you're using...

Some things we could try now:

  • Could you try to do the same process on another computer?
  • Could you write the details of your Windows setup (version, regional settings, etc.)?
  • Can you send me one of the CSV files you have created when exporting with Briefcase, before opening them with any program at all (Notepad++ or otherwise)?
  • I could try to follow the first process you've described ("Pull from Collect") if you send me a zipped odk folder. My email is ggalmazor@gmail.com

In the meantime, I have also tried pulling the form directly from your server with Briefcase and I've been able to export it to CSV and the Lao chars are showing correctly too. I think we can rule out your Aggregate server as the cause of the problem.
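
As a side note on checking a freshly exported CSV without opening it in an editor: one option is to look at the raw bytes and see whether they are valid UTF-8 containing Lao letters, or only question marks. The sketch below assumes Python is available. The function name and the messages it returns are made up for illustration and are not part of any ODK tool.

```python
def describe_csv_bytes(data: bytes) -> str:
    """Classify raw CSV bytes without opening them in an editor."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    # Lao letters live in the Unicode block U+0E80..U+0EFF.
    if any("\u0e80" <= ch <= "\u0eff" for ch in text):
        return "UTF-8 with Lao characters"
    if "??" in text:
        return "valid UTF-8, no Lao characters, contains runs of '?'"
    return "valid UTF-8, no Lao characters"


# Example with in-memory bytes. For a real file, read it in binary mode:
# describe_csv_bytes(open(path, "rb").read()), where path is your CSV.
print(describe_csv_bytes("name,ສະບາຍດີ\n".encode("utf-8")))
print(describe_csv_bytes(b"name,???????\n"))
```

A file that only reports runs of '?' would suggest the characters were lost when the file was written, rather than when it was displayed.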

Hi @ggalmazor,

  • Could you try to do the same process on another computer?
    Yes, I have tried with my colleague's computer, running Windows 7 Ultimate, SP 3, 64-bit. Same problem.
    His regional and language settings are:
    Format: Lao (Lao PDR)
    Location: Thailand (for information, Thai language is quite similar to Lao, so Lao people like my colleague can usually read it)
    Administration>Language for Non-Unicode programs: Thai (Thailand)
    Keyboards and Languages>General>Default input Language: English (US), then Thai, then Lao (they have to switch often when typing different languages)

  • Could you write the details of your Windows setup (version, regional settings, etc.)?
    My computer runs on Windows 10 Home Single Language, 64-bit.
    My regional and language settings are:
    Country or region: UK
    Languages>Windows display language: English (UK) (plus the list of preferred languages: English US, French, Thai, Lao)
    Additional date, time and regional settings:
    Format: English (UK)
    Location: UK
    Administrative>Language for Non-Unicode programs: Thai (Thailand). Here, if I click on the "Change system locale..." button, the dialog again shows the current system locale as Thai, and there is a ticked box: "Beta: Use Unicode UTF-8 for worldwide language support".

Since the Lao language is quite specific and needs special fonts (Phetsarath OT, Saysettha OT and others), virtually all Lao computers (including mine and my colleague's) have the free language software Lao script (https://laoscript.net/) installed. It makes switching between keyboard languages quick and easy.

  • Can you send me one of the CSV files you have created when exporting with Briefcase, before opening them with any program at all (Notepad++ or otherwise)?

Here is the "main" csv file (without the loop csv files, which I don't think are needed), pulled from the Aggregate server with Briefcase, then exported (fresh, unopened):
IDP ICS SURVEY 2018.csv (85.3 KB)
Here is the "main" csv file (without the loop csv files, which I don't think are needed), pulled from the Collect directory (the unzipped folder I sent you by email) with Briefcase, then exported (fresh, unopened):
IDP ICS SURVEY 2018.csv (84.2 KB)

By the way, as I open Briefcase often these days and it sometimes fails to start, I have also seen new log files created next to the .jar file over the past few days. Here they are:
briefcase.logs.zip (18.6 KB)

Also, since I didn't have this Lao character problem last year, when I used an older version of Briefcase, I am wondering: is it possible for me to download an older version (this 1.11.2 is still very new), just to see if the problem arises there as well?

  • I could try to follow the first process you've described ("Pull from Collect") if you send me a zipped odk folder. My email is ggalmazor@gmail.com

File sent.

Good luck and thanks again for your help!

Here are some older versions

Thanks a lot @yanokwa !!
I installed Briefcase v1.4.10, "pulled data from: Custom path to ODK Directory", then exported into a csv file:
IDP ICS SURVEY 2018.csv (77.8 KB)

Then I opened it with Notepad++... and it worked, the Lao characters show well!!! :sunny:
With Excel, though, it comes out differently again when I just double-click the file:

But this is just the known encoding issue with Excel, so I imported the file into Excel with Data>From Text file, changed the encoding to UTF-8, and all was good again:

So, as it seems to be an issue with the newer version of Briefcase, for the time being (we are in a hurry, right in the middle of a survey actually) I think I will use this older version to complete our current work.
But if you guys need us to test again the newer version after modifying some bits and parts, just let us know!
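
On the Excel point: Excel generally recognises a UTF-8 file opened by double-click when the file starts with a byte order mark (BOM). As a sketch, and assuming the exported CSV is already valid UTF-8, a BOM-prefixed copy could be written with Python's "utf-8-sig" codec. The function name and file names below are hypothetical.

```python
import os
import tempfile


def add_utf8_bom(src_path: str, dst_path: str) -> None:
    """Copy a UTF-8 text file, prefixing the copy with a UTF-8 BOM."""
    # Reading with "utf-8-sig" drops a BOM if one is already present;
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # Writing with "utf-8-sig" puts the BOM (EF BB BF) before the text.
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "export.csv")
    dst = os.path.join(tmp, "export_bom.csv")
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.write("village,ສະບາຍດີ\r\n")
    add_utf8_bom(src, dst)
    with open(dst, "rb") as f:
        result = f.read()
    print(result[:3])  # b'\xef\xbb\xbf'
```

This only changes how the file announces its encoding; it cannot repair a file in which the characters were already replaced by question marks.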

A big thanks to @ggalmazor and to @yanokwa !

Hi, @Organic_Idp!

I'm glad you solved your most pressing issue. I'll continue to work on this because I'm guessing that other users could have the same problem. I can confirm that the csv files you generate with Briefcase 1.11.2 have wrong chars on my computer too.

@Organic_Idp, I've prepared a JAR with some small changes in the way files are read and written by Briefcase. Instead of relying on default character sets, this JAR enforces UTF-8 in all file operations involved: briefcase-utf8.jar

Could you try to export the form with this JAR and see if the results still have wrong chars, please?
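
For readers unfamiliar with the idea of enforcing UTF-8 instead of relying on default character sets, here is a small sketch of the same principle in Python. It is not the Briefcase change itself (Briefcase is written in Java); the helper names and the file name are hypothetical.

```python
import locale
import os
import tempfile

# The platform default depends on the OS and its regional settings; on many
# Windows setups it is a legacy code page rather than UTF-8.
print("platform default encoding:", locale.getpreferredencoding(False))


def write_csv_text(path: str, text: str) -> None:
    # Naming the encoding removes the dependency on the platform default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


def read_csv_text(path: str) -> str:
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.csv")  # hypothetical file name
    write_csv_text(path, "village,ສະບາຍດີ\n")
    round_trip = read_csv_text(path)
    print(round_trip == "village,ສະບາຍດີ\n")
```

Code that omits the encoding behaves differently from one machine to another, which would fit the fact that the problem could not be reproduced on every computer.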

Hi @ggalmazor, I have tried exporting with your JAR and it worked for me: the Lao characters are now showing well. Thanks

Thanks for testing this, @Organic_Idp!

We have queued the change for the next release.

Hi, @Organic_Idp!

We have just published a new Briefcase release (v1.11.3) that comes with the UTF-8 encoding improvements. You can get it here: https://github.com/opendatakit/briefcase/releases/tag/v1.11.3

Hi - I have basically the same problem as described above but with Khmer script (Cambodia). I am using v1.13.1 - so I am assuming the fix that was put into v1.11.3 will still be in place? However, in addition to the problem described above we are using encryption keys (for GDPR compliance) - so I don't know if this is what is causing the problem.

But to clarify, I have a number of forms which have an encryption key, and which have repeats in them. Therefore I need to use briefcase both to access the loops (as in the problem described in Laos) but also to decrypt the data. The forms have some free text input which is using Khmer script. The script looks fine on the tablet, but when I export it via briefcase I get script like this ស្រោមដៃស្ទេរីល

Any ideas? thanks so much
Helen
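
A note on the Khmer sample above: its pattern (repeated "áž" and "áŸ" pairs) looks like what one gets when UTF-8 bytes are read as Windows-1252. That is an assumption based on the sample alone, not a confirmed diagnosis of Briefcase v1.13.1 or of the decryption step. The Python sketch below reproduces the pattern for a short Khmer word and then reverses it.

```python
# Assumption: the garbling is UTF-8 bytes decoded as Windows-1252 (cp1252).
# Sample Khmer word, written with escapes so the code points are explicit.
original = "\u179f\u17d2\u179a\u17c4\u1798\u178a\u17c3"

# Reproduce the garbling: encode as UTF-8, then misread the bytes as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# Reverse it: map the characters back to their cp1252 bytes and decode
# those bytes as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

Unlike the "??????" case earlier in the thread, this kind of garbling usually keeps the underlying bytes, so it can often be undone. It is not always reversible, though: a few byte values have no Windows-1252 character in Python's cp1252 codec, and text containing them cannot be repaired this way.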