Removing trailing commas in the Central CSV

yanokwa · October 12, 2021, 4:43am

A user sent in this email, and I wanted to document it here in case others are having this issue.

I am pulling ODK data into a postgres database using the API, and have a minor issue.

I am hitting the API at v1/projects/[ID]/forms/[ID]/submissions.csv.zip endpoint, unzipping the files, then running postgres COPY to import into a DB.

The issue I am seeing is that the final column of the exported CSV is the 'Edits' column, which is usually null, and the trailing comma is left off by default, i.e. the DeviceID value is last with no comma following. COPY in postgres hates this and won't load because of missing data.

If I tell it to ignore the last column, then it gets crabby if there is an edit, because there is extra data. I've solved the issue by processing the file a bit, so it's not killing me, but was unsure if this is your desired behavior.

Exporting directly from Central in the submissions tab has the same result. I don't know that it's "wrong" necessarily, but the inconsistent number of fields per row makes things a little trickier.

CSV is a popular format, but it's very under-specified. csvkit is my go-to "validator" and it complains about the trailing spaces, so I'd bias toward fixing it. Any objections?
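
For anyone who wants to check whether their own export is affected, here is a minimal sketch (not csvkit, just Python's standard csv module) that counts how many rows have each number of fields. The function name field_count_histogram and the sample rows are made up for illustration; only the 'DeviceID' and 'Edits' column names come from the email above.

```python
import csv
import io
from collections import Counter


def field_count_histogram(csv_text):
    """Map number-of-fields -> number of rows (the header row is counted too)."""
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(len(row) for row in reader)


# Hypothetical sample: the first data row has no edits and no trailing comma.
sample = (
    "SubmissionDate,DeviceID,Edits\n"
    "2021-10-12,abc\n"
    "2021-10-12,def,2\n"
)

histogram = field_count_histogram(sample)
if len(histogram) > 1:
    print("Inconsistent field counts:", dict(histogram))
```

If the histogram has more than one key, the file has rows of different widths and a strict loader is likely to reject it.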

Leland · October 12, 2021, 4:03pm

Thanks Yaw - To add clarity, the issue is that COPY, as a relatively dumb function, expects the same number of columns per row in the CSV file, and the lack of a trailing comma for any unedited row means that the column count varies across the file. Obviously, this use case is specific to me, but it does mean that any ODK Central export isn't suitable as a direct feed into Postgres for data warehousing or other purposes without preprocessing, so there's some general applicability I'd think!
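
As an illustration of the kind of preprocessing meant here, the following is a rough sketch that pads short rows with empty fields until every row is as wide as the header. It is one possible approach and not necessarily the exact processing described in the email; pad_rows and the sample data are hypothetical names and values.

```python
import csv
import io


def pad_rows(src, dst):
    """Copy CSV from src to dst, padding short rows to the header's width."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    width = len(header)
    writer.writerow(header)
    for row in reader:
        if len(row) > width:
            # More fields than the header is a different problem; fail loudly.
            raise ValueError(f"row has {len(row)} fields, expected at most {width}")
        writer.writerow(row + [""] * (width - len(row)))


# Hypothetical sample with one unedited row that lacks the trailing comma.
sample = "SubmissionDate,DeviceID,Edits\n2021-10-12,abc\n2021-10-12,def,2\n"
out = io.StringIO()
pad_rows(io.StringIO(sample), out)
print(out.getvalue())
```

The padded fields are written as unquoted empty strings, which COPY in CSV mode reads as NULL by default, so an integer 'Edits' column should still load.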

FWIW, my solution is also csvkit. I ran csvcut and selected a subset of columns that didn't include 'Edits', so I sidestepped the problem rather than solving it. I'm not sure how csvcut would handle it if I wanted to start including edit counts, but if csvkit also complains, then it may indeed be worth fixing.
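
For readers without csvkit installed, a similar column-dropping step can be sketched with the standard library. This is not how csvcut is implemented, just a stand-in with the same effect for this case; drop_column and the sample data are hypothetical.

```python
import csv
import io


def drop_column(src, dst, name):
    """Copy CSV from src to dst without the column called `name`."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    idx = header.index(name)  # raises ValueError if the column is absent
    writer.writerow(header[:idx] + header[idx + 1:])
    for row in reader:
        # Slicing tolerates rows that are too short to contain the column.
        writer.writerow(row[:idx] + row[idx + 1:])


# Hypothetical sample: 'Edits' is the last column and is missing on one row.
sample = "SubmissionDate,DeviceID,Edits\n2021-10-12,abc\n2021-10-12,def,2\n"
out = io.StringIO()
drop_column(io.StringIO(sample), out, "Edits")
print(out.getvalue())
```

This only evens out the row widths because 'Edits' is the final column. If the ragged column sat in the middle of the file, short rows would stay short and would need padding first.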