Central2pg: a set of PostgreSQL functions to get data from Central

Very nice work @mathieubossaert, but I'm still confused about how to apply it on my own server. Where do I put the SQL script? I use Docker to serve ODK Central.

1 Like

Hi @ridho and welcome to the ODK community. When you get a chance, please introduce yourself here.
Thank you for your comment :blush:

These functions are used to pull ODK data into an external database, so you have to run the script in that destination database.

I understand your confusion, because Central uses its own database, which is not exposed to the world and does not store the data in a relational way.

Oh, sorry about that. So, if I want to retrieve data from the ODK Central PostgreSQL database, I just need to run this one script, central2pg.sql?

This will create the functions in the schema you want (the default is odk_central).
Once they are installed, you will need to call the odk_central_to_pg() function as shown earlier in this thread.
Please report any bug or comment on GitHub. I already found and fixed some issues, but external use will be important to find and fix more :wink:
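
For example, from a psql session connected to the destination database (adapt the path to wherever you saved the script):

\i /path/to/central2pg.sql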

Hi @mathieubossaert, thank you very much for putting your effort into this script. It looks wonderful. Currently, I rely heavily on the ruODK package and the Central API (with PHP), but they are not flexible enough for my needs. I gave your script, central2pg.sql, a try. It created the schema. But when I call the odk_central.odk_central_to_pg function, it gives me this error:

ERROR:  structure of query does not match function result type
DETAIL:  Returned type unknown does not match expected type text in column 1.
CONTEXT:  PL/pgSQL function odk_central.get_form_tables_list_from_central(text,text,text,integer,text) line 11 at RETURN QUERY
SQL statement "SELECT odk_central.get_submission_from_central(...
...
)
SQL state: 42804

I consulted other forums to address this column type compatibility error, but I am not very familiar with SQL; I know just the basics. Could you please help?

Hi @shashpal!
First of all, welcome to the ODK community. When you get a chance, please introduce yourself to the community here.
And thank you for your interest in this work!

Central2pg is not an official ODK tool; maybe we should discuss this bug and solve it on GitHub?
Anyway, did you download the files from the master branch? The release is old and not up to date.
Could you share your form with me, here or in a private message, so I can try it with central2pg and find the problem?
It may be that your form uses characters I do not use and did not escape in the functions, or something else.
A colleague of mine used it for the first time on Friday and succeeded.
Thanks again for your feedback!

1 Like

Dear @mathieubossaert, thanks for your time. I used the files from the master branch. Sure, I can share the form details in a private message.

1 Like

Thanks to @mathieubossaert for his kind support. I was using Postgres server v9.6 on my host, which was too old and didn't have the JSON functions used in the script. I upgraded to PG v14, and I can confirm that the script works. :tada: :confetti_ball: @mathieubossaert confirmed that the script works on PG v12, too.

3 Likes

Last week I gave a short demo to my French colleagues, and I've just added some subtitles so I can share it here :slight_smile:

6 Likes

Thanks Mathieu for sharing these functions. I'm using them with pg_cron to automatically pull ODK Central data into a PostgreSQL database every night. I've connected that database to Metabase to produce up-to-date dashboards of field activities.
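
For anyone who wants to replicate this, my nightly job looks roughly like the following (credentials, form details, and schedule are placeholders; cron.schedule comes from the pg_cron extension):

SELECT cron.schedule(
	'central2pg_nightly',                -- job name
	'0 2 * * *',                         -- every night at 02:00
	$$SELECT odk_central.odk_central_to_pg(
		'me@mydomain.org', 'PassW0rd', 'my_central_server.org',
		2, 'my_form_about_birds', 'odk_data',
		'point_auto,point,ligne,polygone')$$
);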

I've also connected that database to a map created with the QGIS2WEB plugin, so we have an up-to-date, detailed map of field activities.

I explored several ways of handling data from ODK Central, and this is the simplest. Once the data is in a PostgreSQL database, the options for analysis and sharing are limitless.

I've made a post about my application here.

2 Likes

Saturday homework!

central2pg now uses a bearer token to connect to Central :slight_smile:

Testers are welcome to find and file issues on GitHub :wink:

2 Likes

On April 5, 2022 I committed a change that generates random names for indexes, in order to avoid index name collisions.
Yesterday I found a silent bug in that version of the code: it created a new index on the table every time you called the odk_central_to_pg() function. The consequence is that it fills the hard drive with a lot of unnecessary, redundant indexes.
I am really sorry for the inconvenience.
Tonight I committed a correction. The functions now test whether a unique index already exists on the table before creating a new one.
You will also find SQL code to generate the commands that delete the unnecessary indexes.
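
The idea behind the cleanup is roughly this (a sketch, not the exact code from the repository; it assumes your data lives in the odk_data schema):

-- generate DROP INDEX commands for duplicate indexes: same table,
-- same definition once the index name itself is ignored
SELECT format('DROP INDEX %I.%I;', schemaname, indexname)
FROM (
	SELECT schemaname, tablename, indexname,
	       row_number() OVER (
	           PARTITION BY schemaname, tablename,
	                        replace(indexdef, indexname, '')
	           ORDER BY indexname
	       ) AS rn
	FROM pg_indexes
	WHERE schemaname = 'odk_data'
) dup
WHERE rn > 1;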

@GuilhemD @Theron_Morgan-Brown1 @shashpal @ridho @famador @c_roy @alios82

4 Likes

Apologies if I'm missing something, but can anyone point me in the direction of where to start with this, what to Google, etc.? I reckon I can install PostgreSQL and curl, but I don't know how to run the script, and I don't understand whether it runs automatically, on a schedule, etc.

I've tried and failed so many times to report directly from the API, so this is the next stop.

Thanks!

Hi @mrrodge

These functions are useful if you want to get data from ODK Central into your own PostgreSQL database.
Once curl is installed on your database host, run the central2pg.sql script to create the functions in your database.
Then you can run a query like this to get the data from Central into your database:

SELECT odk_central.odk_central_to_pg(
	'me@mydomain.org',                  -- user
	'PassW0rd',                         -- password
	'my_central_server.org',            -- central FQDN
	2,                                  -- the project id
	'my_form_about_birds',              -- form ID
	'odk_data',                         -- schema where to create tables and store data
	'point_auto,point,ligne,polygone'	-- columns to ignore when converting JSON to database attributes (GeoJSON fields of geo widgets)
);

To make this automatic, you'll need a scheduled job that runs this query on your database at whatever frequency you want.

For files, this is the query to run:

SELECT odk_central.get_file_from_central(
	email text,             -- the login (email address) of a user who can get submissions
	password text,          -- their password
	central_domain text,    -- ODK Central FQDN, e.g. central.mydomain.org
	project_id integer,     -- the id of the project, e.g. 4
	form_id text,           -- the name of the form, e.g. Sicen
	submission_id text,     -- the submission id
	image text,             -- the image name mentioned in the submission, e.g. 1611941389030.jpg
	destination text,       -- where you want curl to store the file (path to a writable directory)
	output text             -- filename, with extension
);
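
For example, with placeholder values (the submission id and image name come from the data already pulled by odk_central_to_pg; all values below are made up):

SELECT odk_central.get_file_from_central(
	'me@mydomain.org',
	'PassW0rd',
	'my_central_server.org',
	2,
	'my_form_about_birds',
	'uuid:0c2b93b2-1e10-4a9e-94c4-8a87fe81d316',
	'1611941389030.jpg',
	'/var/www/odk_files',
	'1611941389030.jpg'
);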

Please note this project is not part of the official ODK tools. We can discuss it here if you want:

1 Like

Thanks for the quick reply - really appreciate it!

I'm new to all this, so I'm trying to get my head around how it all works. Basically, I have Central collecting a lot of photos, and I'd like to create paginated PDF reports for my users to access. I've hit the problem of not being able to pass the bearer token in the HTTP header in every reporting tool I've tried, so I can't embed the images in my paginated reports.

This is the logical next step - accessing a cloned database, but I've no database experience.

Am I right in thinking the below?

  • Your script adds functions to PostgreSQL that allow me to query. So if I run a query, that triggers the connection to the API?

  • How do I run the actual query? From my reporting platform, such as Power BI or Redash? If that's the case, is the data pulled from the API and put into the database at the time of running it? And if so, to avoid duplicating data (a copy in the database and a copy in Central), is it worth purging the whole database, say, daily, so that only the latest queried data is stored and data from older, no-longer-needed queries isn't kept?

I'm doing a lot of guesswork here so I've no idea if I'm right or not!

Thanks again.

This is what we do. Central lives its own life and we pull data from it in another database, dedicated to reporting and GIS.

The script will create the database tables for each form (one for the submissions and one for each repeat group).
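
Once a pull has run, your reporting tool can query those tables directly, for example (the table name here is hypothetical; real names depend on your form):

SELECT * FROM odk_data.my_form_about_birds LIMIT 10;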

Since you pass your email and password as parameters in the call, the script will ask Central for a valid token and will use it.

In the scenario we are discussing now, you will have a database dedicated to your reports. The functions are run from this "reporting" database. You can run the SQL manually whenever you want, or create a scheduled task that runs it.
And your BI tool will connect to this database.
I don't know the amount of data you expect, so it is hard to advise on a strategy. For now, central2pg gets all the data since the form was created; that said, it only adds new data to the "reporting" database.

In the next few days, I'll introduce another set of functions that uses pyODK within the database and can filter data by, for example, submission date.

But if your problem is "only" to get pictures out of Central, I'm sure a small Python wrapper using pyODK could serve you the photos directly from Central, instead of building my whole database-centric workflow.
But I will be glad to help if needed.

1 Like

Yup - that, at the moment, is my only problem! Any pointers for where to start with a wrapper like you suggest? I've no experience with Python either.

Many thanks for your efforts - it's greatly appreciated.

Looking at the get_file_from_central function, it seems that files are saved to the filesystem, but I'm guessing I would need to host them somehow to get them into my reports.

Yes. In our case, the folder is used by and visible to both Apache and PostgreSQL (user postgres), but it could be anywhere else your reporting tool can access.

OK thanks - getting the idea now and am almost there. How do you use the get_file_from_central command? It looks like it will only download one image, whereas I want to download all the images for the form submissions I got with odk_central_to_pg.

Also, what does it do with the file names? And what's prise_image?

Thanks again!

Let's continue the discussion with direct messages or over Github :slight_smile: