Central2pg : PostgreSQL set of functions to get data from Central

Hi to all,
here is a first shared version of some recent hard work, which allows us to automate data retrieval from Central into a PostgreSQL/PostGIS database.
The functions dynamically create PostgreSQL tables from Central's JSON and keep GeoJSON fields as GeoJSON.

Functions are here : https://framagit.org/mathieubossaert/central2pg

Feel free to comment, criticize and improve the work. Upcoming features in Central will make it easier to improve.
I also have a few questions about some details.

I have pasted the README file below.

////////////////////////////////////
////////////////////////////////////

Central2PG

PostgreSQL functions to retrieve data from ODK Central's OData API into a PostgreSQL database

pl/pgsql functions for retrieving data from ODK Central into a PostgreSQL database

These functions rely on PostgreSQL's "COPY FROM PROGRAM" capability. The called program is curl, so curl needs to be installed on your database server.
Certificate checks are for the moment bypassed with curl's -k (insecure) option, since we know the server curl is calling.
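
As an illustration, the core of the pattern looks roughly like this (credentials, URL and staging table are placeholders; the actual functions build the command dynamically and handle escaping more carefully). The CSV QUOTE/DELIMITER trick with control characters loads each line of curl's output verbatim:

CREATE TEMP TABLE central_json_from_central (value jsonb);

COPY central_json_from_central (value)
FROM PROGRAM 'curl -k -u "me@mydomain.org:PassW0rd" "https://central.myserver.org/v1/projects/4/forms/Sicen.svc/Submissions"'
WITH (FORMAT csv, QUOTE E'\x01', DELIMITER E'\x02');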

How to use it - Example

Complete update process from Central

SELECT odk_central.get_submission_from_central(
	user_name,
	pass_word,
	central_fqdn,
	project,
	form,
	tablename,
	'__system%%2FsubmissionDate',
	'gt',
	'2020-10-01',
	'odk_central',
	concat('form_',lower(form),'_',lower(split_part(tablename,'.',cardinality(regexp_split_to_array(tablename,'\.')))))
)
FROM odk_central.get_form_tables_list_from_central('my_email@address.org','my_passw0rd','central.myserver.org',4,'Sicen');

SELECT odk_central.feed_data_tables_from_central('odk_central',concat('form_',lower(form),'_',lower(split_part(tablename,'.',cardinality(regexp_split_to_array(tablename,'\.'))))))
FROM odk_central.get_form_tables_list_from_central('my_email@address.org','my_passw0rd','central.myserver.org',4,'Sicen');

/* 
	This is a view built upon the generated data tables for our particular needs
*/
REFRESH MATERIALIZED VIEW odk_central.donnees_formulaire_sicen;

/* 	
	here we get attachments 
*/
SELECT outils.get_file_from_central_api(
	submission_id,
	prise_image,
	'my_email@address.org',
	'my_passw0rd',
	'https://central.myserver.org/v1/projects/4/forms/Sicen/Submissions',
	'/home/postgres/medias_odk',
	lower(concat(unaccent(replace(user_name,' ','_')),'_',prise_image))
) FROM odk_central.donnees_formulaire_sicen
WHERE prise_image IS NOT NULL;

/* 
	And here we pull data from the materialized view into our internal tool's table,
	to show ODK Collect data within our internal web tool and also in QGIS or Redash.
	This function just performs an
	INSERT INTO table(data_id, col1, col2...)
	SELECT col_a, col_b, ...
	FROM odk_central.donnees_formulaire_sicen
	LEFT JOIN data_already_there USING (data_id)
	WHERE data_already_there.data_id IS NULL -- to insert only new data
*/
SELECT odk_central.formulaire_sicen_alimente_saisie_observation_especes();

Functions are created in a schema named "odk_central". Adapt it to your needs.

"Main" functions

get_form_tables_list_from_central.sql

description

Returns the list of "tables" composing a form: the "core" one plus one for each repeat group.

parameters :

email text				-- the login (email address) of a user who can get submissions
password text			-- their password
central_domain text 	-- ODK Central FQDN: central.mydomain.org
project_id integer		-- the id of the project, e.g. 4
form_id text			-- the name of the form, e.g. Sicen

returning :

TABLE(user_name text, pass_word text, central_fqdn text, project integer, form text, tablename text)

get_submission_from_central.sql

description

Gets JSON data from Central and feeds a temporary table with the generic name central_json_from_central.
Once the temp table is created and filled, PG checks whether the destination (permanent) table exists. If not, PG creates it with a single json column named "value".
PG likewise checks whether a unique index on __id exists. This index is used to ignore submissions already inserted in the table, using an "ON CONFLICT xxx DO NOTHING".
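
A minimal sketch of that idempotent-insert pattern (table and index names here are hypothetical; the real function builds them dynamically):

CREATE TABLE IF NOT EXISTS odk_central.form_sicen_submissions (value jsonb);

CREATE UNIQUE INDEX IF NOT EXISTS form_sicen_submissions_uidx
	ON odk_central.form_sicen_submissions ((value ->> '__id'));

-- explode the OData answer ({"value": [...]}) into one row per submission,
-- skipping the ones already present
INSERT INTO odk_central.form_sicen_submissions (value)
SELECT jsonb_array_elements(value -> 'value')
FROM central_json_from_central
ON CONFLICT ((value ->> '__id')) DO NOTHING;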

parameters :

email text						-- the login (email address) of a user who can get submissions
password text					-- their password
central_domain text 			-- ODK Central FQDN: central.mydomain.org
project_id integer				-- the id of the project, e.g. 4
form_id text					-- the name of the form, e.g. Sicen
form_table_name text			-- the table of the form to get values from (one of those returned by the get_form_tables_list_from_central() function)
column_to_filter text			-- the column (__system/submitterId or __system/submissionDate) on which you want to apply a filter (only works on the Submissions table)
filter text						-- the filter to apply (gt = greater than, lt = lower than)
filter_value text				-- the value to compare the column with
destination_schema_name text 	-- the name of the schema where the permanent table is created
destination_table_name text		-- the name of this table

returning :

void

feed_data_tables_from_central.sql

description

Creates and feeds one data table (one text column per question) from the plain JSON submissions stored by get_submission_from_central(), using the "shadow" pivot functions described below.

parameters :

schema_name text	-- the schema of the table containing the plain JSON submissions from the get_submission_from_central() function call
table_name text	-- the table containing the plain JSON submissions from the get_submission_from_central() function call

returning :

void

get_file_from_central_api.sql

description

Downloads each media file mentioned in the submissions.

parameters :

email text				-- the login (email address) of a user who can get submissions
password text			-- their password
central_domain text 	-- ODK Central FQDN: central.mydomain.org
project_id integer		-- the id of the project, e.g. 4
form_id text			-- the name of the form, e.g. Sicen
submission_id text		-- the id of the submission
image text				-- the image name mentioned in the submission, e.g. 1611941389030.jpg
destination text		-- where you want curl to store the file (path to a directory)
output text				-- filename with extension

returning :

void

"Shadow" functions needed

dynamic_pivot.sql

description

-> adapted from https://postgresql.verite.pro/blog/2018/06/19/crosstab-pivot.html (thanks again)
Creates a pivot table dynamically, without manually specifying the row structure.
Returns a cursor used by both following functions to create a table and feed it.

parameters :

central_query text 	-- the query defining the data
headers_query text		-- the query defining the columns
INOUT cname refcursor	-- the name of the cursor

returning :

refcursor

create_table_from_refcursor.sql

description

-> inspired by https://stackoverflow.com/questions/50837548/insert-into-fetch-all-from-cant-be-compiled/52889381#52889381
Creates a table matching the cursor structure (attribute types and names).

parameters :

_table_name text 		-- the name of the table to create
_ref refcursor			-- the name of the refcursor to get data from

returning :

void

insert_into_from_refcursor.sql

description

-> adapted from https://stackoverflow.com/questions/50837548/insert-into-fetch-all-from-cant-be-compiled/52889381#52889381
Feeds the table with data.

parameters :

_table_name text 		-- the name of the table to feed
_ref refcursor			-- the name of the refcursor to get data from

returning :

void
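
To picture how the three shadow functions fit together, here is a rough, hypothetical call sequence based only on the signatures above (queries and table names are placeholders; the real orchestration lives inside feed_data_tables_from_central(), which also manages the cursor state):

DO $$
DECLARE
	cur refcursor := 'pivot_cursor';
BEGIN
	-- build a cursor whose columns are defined by the headers query
	SELECT odk_central.dynamic_pivot(
		'SELECT submission_id, question, answer FROM my_staging ORDER BY 1',	-- data: one (row, column, value) triple per line
		'SELECT DISTINCT question FROM my_staging ORDER BY 1',					-- headers: one line per output column
		cur) INTO cur;
	-- create the destination table from the cursor structure
	PERFORM odk_central.create_table_from_refcursor('my_data_table', cur);
	-- then feed it
	PERFORM odk_central.insert_into_from_refcursor('my_data_table', cur);
END $$;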

Some news from this work :wink:

I moved the function set to github:

Feel free to improve it! I am not a pl/pgsql guru and I am sure it could be cleaner / safer... I would be glad to discuss this work, and how to improve it, with anybody.

The filtering capability on submission_date has been removed, as it only applies to the Submissions table, not to its "daughters".
So we are in an "all data download" scenario for now, and I hope we will be able to improve it later to reduce bandwidth and resource consumption.

A top-level function was added to simplify the procedure to a single function call, retrieving all data from a given form of a given Central instance into the PostgreSQL schema you want:

select odk_central.odk_central_to_pg(
  'myUserName', 
  'PassW0rd', 
  'my_central_server.org',
  project_id, 
  'form_id', 
  'destination_pg_schema', 
  'geo_question,to_ignore');
select odk_central.odk_central_to_pg(
  'me@mydomain.org', 
  'PassW0rd', 
  'my_central_server.org', 
  2, 
  'my_form_about_birds', 
  'odk_data', 
  'point_auto,point,ligne,polygone');

This simple query will automatically :

  • ask Central (at my_central_server.org) for the table list of the form "my_form_about_birds"
  • get data for each table composing the form
  • create those tables (one text attribute per form question) in the schema "odk_data" of my database, except for questions listed in the last parameter (geo widgets columns)
  • feed those tables with the retrieved data

And at next call :

  • check for new form questions / table attributes
  • create them if needed
  • insert only new data

With my knowledge of the form, I now only have to query the tables to do whatever I need with the data, and to choose the best frequency for the cron task which calls the function, to serve data as fresh as possible in our internal tools.


Hello @mathieubossaert, this is amazing. I will test it with my forms in ODK Central.
Greetings.


Thank you for your enthusiasm @famador :slight_smile:
Feel free to report any bug or suggestion on the github page !


Very nice work @mathieubossaert, but I am still confused about how to apply it on my own server. Where do I put the SQL script? I use Docker to serve ODK Central.


Hi @ridho and welcome to the ODK community. When you get a chance please introduce yourself here.
Thank you for your comment :blush:

These functions are useful to get ODK data into an external database, so you have to run them in this destination database.

I understand your confusion, because Central uses its own database, which is not exposed to the world and does not store the data in a relational way.

Oh, sorry about that. So, if I want to retrieve data from the ODK Central PostgreSQL database, I just need to run this single script, central2pg.sql?

This will create the functions in the schema you want (default is odk_central).
Once installed, you will need to call the odk_central_to_pg() function as shown earlier in this thread.
Please report any bug or comment on github. I already found and fixed issues, but external use will be important to find and fix more :wink:

Hi @mathieubossaert, thank you very much for putting your effort into this script. It looks wonderful. Currently, I rely heavily on the ruODK package and the Central API (with PHP), but they are not flexible enough for my needs. I gave your script, central2pg.sql, a try. It created the schema. But when I try the odk_central.odk_central_to_pg function, it gives me this error:

ERROR:  structure of query does not match function result type
DETAIL:  Returned type unknown does not match expected type text in column 1.
CONTEXT:  PL/pgSQL function odk_central.get_form_tables_list_from_central(text,text,text,integer,text) line 11 at RETURN QUERY
SQL statement "SELECT odk_central.get_submission_from_central(...
...
)
SQL state: 42804

I consulted other forums to address this column type compatibility error, but I am not very familiar with SQL. I know just the basics. Could you please help?

Hi @shashpal!
First of all, welcome to the ODK community. When you get a chance, please introduce yourself here to the community.
And thank you for your interest in this work!

Central2PG is not an official ODK tool; maybe we should discuss this bug and solve it on github?
Anyway, did you download the files from the master branch? The release is old and not up to date.
Could you share your form with me, here or in a private message, so I can try it with central2pg and find the problem?
It may be that your form uses characters I did not escape in the functions? Or something else?
A colleague of mine used it for the first time on Friday and succeeded.
Thanks again for your feedback!


Dear @mathieubossaert, thanks for your time. I used the files from the master branch. Sure, I can share the form details in a private message.


Thanks to @mathieubossaert for his kind support. I was using PostgreSQL server v9.6 on my host, which was too old and didn't have the json functions used in the script. I upgraded to PG v14, and I can confirm that the script works. :tada: :confetti_ball: @mathieubossaert confirmed that the script works on PG v12, too.
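
For anyone landing on this thread with the same SQLSTATE 42804: on PostgreSQL 9.6, a string literal inside RETURN QUERY keeps the pseudo-type "unknown", while PostgreSQL 10 started coercing it to text automatically. A minimal illustration of the error class (not the actual central2pg code):

CREATE OR REPLACE FUNCTION unknown_type_demo()
RETURNS TABLE(name text) LANGUAGE plpgsql AS $$
BEGIN
	RETURN QUERY SELECT 'abc';          -- fails on 9.6: returned type unknown vs expected text
	-- RETURN QUERY SELECT 'abc'::text; -- explicit cast works on every version
END $$;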


Last week I did a short demo for my French colleagues, and I just added some subtitles to share it here :slight_smile:


Thanks Mathieu for sharing these functions. I'm using them with pg_cron to automatically pull ODK Central data every night into a PostgreSQL database. I've connected that database to Metabase to produce up-to-date dashboards of field activities.
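
For reference, the nightly pull can be scheduled directly in the database; a sketch with pg_cron (job name, credentials and form details are made up):

SELECT cron.schedule(
	'central2pg-nightly',
	'0 2 * * *',  -- every night at 02:00
	$$SELECT odk_central.odk_central_to_pg(
		'me@mydomain.org', 'PassW0rd', 'my_central_server.org',
		2, 'my_form_about_birds', 'odk_data',
		'point_auto,point,ligne,polygone')$$);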

I've also connected that database to a map created with the QGIS2WEB plugin so we have an up to date, detailed map of field activities.

I explored several ways of handling data from ODK Central, and this is the simplest. Once the data is in a PostgreSQL database, the options for analysis and sharing are limitless.

I've made a post on my application here.


Saturday homework!

central2pg now uses bearer token to connect to Central :slight_smile:

Testers are welcome to find and file issues on github :wink:


On April 5, 2022 I committed a change that generates random names for indexes in order to avoid index name collisions.
Yesterday I found a silent bug in that version of the code: it generated a new index on the table every time the odk_central_to_pg() function was called. As a consequence, that bug fills the hard drive with a lot of unnecessary, redundant indexes.
I am really sorry for the inconvenience.
Tonight I committed a correction. Now the functions test whether a unique index already exists on the table before creating a new one.
You will also find SQL code to generate the commands to delete the unnecessary indexes.
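
The check can be pictured like this (a sketch with hypothetical schema and table names, not the exact committed code): the unique index is created only when pg_index shows none on the table.

DO $$
BEGIN
	IF NOT EXISTS (
		SELECT 1
		FROM pg_index i
		JOIN pg_class t ON t.oid = i.indrelid
		JOIN pg_namespace n ON n.oid = t.relnamespace
		WHERE n.nspname = 'odk_data'
		  AND t.relname = 'form_my_form_about_birds_submissions'
		  AND i.indisunique
	) THEN
		CREATE UNIQUE INDEX ON odk_data.form_my_form_about_birds_submissions ((value ->> '__id'));
	END IF;
END $$;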

@GuilhemD @Theron_Morgan-Brown1 @shashpal @ridho @famador @c_roy @alios82


Apologies if I'm missing something, but can anyone point me in the direction of where to start with this, what to Google, etc.? I reckon I can install PostgreSQL and curl, but I don't know how to run the script, and I don't understand whether it runs automatically, on a schedule, etc.

I've tried and failed so many times to report directly from the API, so this is the next stop.

Thanks!

Hi @mrrodge

These functions are useful if you want to get data from ODK Central into your own PostgreSQL database.
Once curl is installed on your database host, run the script central2pg.sql to create the functions in your database.
Then you can run a query like this to get the data from Central into your database:

SELECT odk_central.odk_central_to_pg(
	'me@mydomain.org',                  -- user
	'PassW0rd',                         -- password
	'my_central_server.org',            -- central FQDN
	2,                                  -- the project id
	'my_form_about_birds',              -- form ID
	'odk_data',                         -- schema where to create tables and store data
	'point_auto,point,ligne,polygone'   -- columns to ignore in json transformation to database attributes (geojson fields of GeoWidgets)
);

To make this automatic, you'll need a scheduled job that runs this query at the frequency you want on your database.

For files, this is the query to run :

SELECT odk_central.get_file_from_central(
		email text,             -- the login (email address) of a user who can get submissions
		password text,          -- their password
		central_domain text,    -- ODK Central FQDN: central.mydomain.org
		project_id integer,     -- the id of the project, e.g. 4
		form_id text,           -- the name of the form, e.g. Sicen
		submission_id text,     -- the submission id
		image text,             -- the image name mentioned in the submission, e.g. 1611941389030.jpg
		destination text,       -- where you want curl to store the file (path to a writable directory)
		output text             -- filename with extension
);

Please note this project is not part of ODK tools. We may discuss here if you want :


Thanks for the quick reply - really appreciate it!

I'm new to all this, so I am trying to get my head around how it all works. Basically, I have Central collecting a lot of photos, and I'd like to create paginated PDF reports for my users to access. I've hit the problem of not being able to pass the bearer token in the HTTP header in every reporting tool I've tried, so I can't embed the images in my paginated reports.

This is the logical next step: accessing a cloned database. But I have no database experience.

Am I right in thinking the below?

  • Your script adds functions to PostgreSQL that allow me to query. So if I run a query, this prompts the connection to the API?

  • How do I run the actual query? From my reporting platform, such as PowerBI or Redash? If so, is the data pulled from the API and put into the database at the time of running it? And to avoid duplicating data (a copy in the database and a copy in Central), is it worth purging the whole database daily, so that only the latest queried data is stored and data from older, no-longer-needed queries isn't kept?

I'm doing a lot of guesswork here so I've no idea if I'm right or not!

Thanks again.

This is what we do. Central lives its own life, and we pull data from it into another database dedicated to reporting and GIS.

The script will create the database tables for each form (one for the submissions and one for each repeat group).

Since you pass your email and password as parameters in the call, the script will ask Central for a valid token and use it.

In the scenario we are discussing, you will have a database dedicated to your reports. The scripts will run inside this "reporting" database. You can run the SQL script manually whenever you want, or create a scheduled task that runs it.
And your BI tool will connect to this database.
I don't know the amount of data you expect, so it is hard to suggest a strategy. But for now, central2pg gets all the data since the form was created. That said, it only adds new data to the "reporting" database.

In the next few days, I'll introduce another set of functions that uses pyODK within the database and can filter data by, for example, submission date.

But if your problem is "only" to get pictures from Central, I'm sure a small Python wrapper using pyODK could serve you the photos directly from Central instead of building my whole database-centric workflow.
But I will be glad to help if needed.
