Central2pg : PostgreSQL set of functions to get data from Central

On April 5, 2022 I committed a change that generates random names for indexes, in order to avoid index name collisions.
Yesterday I found a silent bug in that version of the code: it created a new index on the table every time the odk_central_to_pg() function was called. As a consequence, that bug fills the hard drive with a lot of unnecessary, redundant indexes.
I am really sorry for the inconvenience.
Tonight I committed a fix. The functions now check whether a unique index already exists on the table before creating a new one.
You will also find SQL code to generate the commands to delete the unnecessary indexes.
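
For reference, here is a minimal sketch of such a cleanup query (not the exact script from the repository): it groups the entries of pg_index by their definition and emits a DROP INDEX command for every duplicate beyond the first one.

SELECT format('DROP INDEX %s;', idx) AS drop_command
FROM (
	SELECT indexrelid::regclass AS idx,
	       row_number() OVER (
	           PARTITION BY indrelid, indkey::text, indclass::text,
	                        coalesce(indexprs::text, ''), coalesce(indpred::text, '')
	           ORDER BY indexrelid   -- keep the oldest index of each group
	       ) AS duplicate_rank
	FROM pg_index
) AS duplicates
WHERE duplicate_rank > 1;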

@GuilhemD @Theron_Morgan-Brown1 @shashpal @ridho @famador @c_roy @alios82

4 Likes

Apologies if I'm missing something - but can anyone point me in the direction of where to start with this, what to Google, etc.? I reckon I can install PostgreSQL and install curl, but I don't know how to run the script, and I don't understand whether it runs automatically, on a schedule, etc.

I've tried and failed so many times to report directly from the API so this is the next stop.

Thanks!

Hi @mrrodge

These functions are useful if you want to get data from ODK Central into your own PostgreSQL database.
Once curl is installed on your database host, you can run the script central2pg.sql to create the functions in your database.
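
For example, with the psql command-line tool (the database name is just a placeholder):

psql -d my_reporting_db -f central2pg.sql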
Then you can run a query like this to get the data from Central into your database:

SELECT odk_central.odk_central_to_pg(
	'me@mydomain.org',                  -- user
	'PassW0rd',                         -- password
	'my_central_server.org',            -- central FQDN
	2,                                  -- the project id
	'my_form_about_birds',              -- form ID
	'odk_data',                         -- schema where to create tables and store data
	'point_auto,point,ligne,polygone'	-- columns to ignore in the json transformation to database attributes (geojson fields of GeoWidgets)
);

To make this automatic, you'll need a scheduled job that runs this query, at whatever frequency you want, on your database.
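
For example, with the pg_cron extension (assuming it is installed; a plain cron job running psql works just as well), an hourly pull could be scheduled like this:

SELECT cron.schedule(
	'pull_from_central',	-- job name
	'0 * * * *',		-- every hour, on the hour
	$$SELECT odk_central.odk_central_to_pg('me@mydomain.org', 'PassW0rd', 'my_central_server.org', 2, 'my_form_about_birds', 'odk_data', 'point_auto,point,ligne,polygone')$$
);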

For files, this is the query to run:

SELECT odk_central.get_file_from_central(
		email text,             -- the login (email address) of a user who can get submissions
		password text,          -- their password
		central_domain text,    -- ODK Central FQDN: central.mydomain.org
		project_id integer,     -- the id of the project, e.g. 4
		form_id text,           -- the name of the form, e.g. Sicen
		submission_id text,     -- the submission id
		image text,             -- the image name mentioned in the submission, e.g. 1611941389030.jpg
		destination text,       -- where you want curl to store the file (path to a writable directory)
		output text             -- filename, with extension
);
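
For instance, a call could look like this (all values below are made-up placeholders):

SELECT odk_central.get_file_from_central(
		'me@mydomain.org',
		'PassW0rd',
		'my_central_server.org',
		2,
		'my_form_about_birds',
		'uuid:a5a29d81-6466-4370-af29-2b0feb1b6a9b',
		'1611941389030.jpg',
		'/var/lib/postgresql/central_files',
		'1611941389030.jpg'
);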

Please note this project is not part of the ODK tools. We can discuss it here if you want.

1 Like

Thanks for the quick reply - really appreciate it!

I'm new to all this so am trying to get my head around how it all works. Basically I have Central collecting a lot of photos and I'd like to create paginated PDF reports for my users to access. I've hit the problem of not being able to pass the bearer token in the HTTP header in every reporting tool I've tried, so I can't embed the images in my paginated reports.

This is the logical next step - accessing a cloned database, but I've no database experience.

Am I right in thinking the below?

  • Your script adds functions to PostgreSQL that allow me to query. So if I run a query, this prompts the connection to the API?

  • How do I run the actual query? From my reporting platform, such as Power BI or Redash? If that's the case, is the data pulled from the API and put into the database at the time of running it? And if so, to stop duplicating data (a copy in the database and a copy in Central), is it worth purging the whole database, say, daily, so that only the latest queried data is retrieved/stored and data from older, maybe no-longer-needed queries isn't kept?

I'm doing a lot of guesswork here so I've no idea if I'm right or not!

Thanks again.

This is what we do. Central lives its own life and we pull data from it into another database, dedicated to reporting and GIS.

The script will create the database tables for each form (one for the submissions and one for each repeat group).

Since you pass your email and password as parameters in the call, the script will ask Central for a valid token and will use it.

In the scenario we are discussing, you will have a database dedicated to your reports. The functions are run from this "reporting" database. You can run the SQL manually whenever you want, or create a scheduled task that runs it.
And your BI tool will connect to this database.
I don't know the amount of data you expect, so it is hard to advise on a strategy. But for now, central2pg gets all the data since the form was created. That said, it only adds new data to the "reporting" database.

In the next few days, I'll introduce another set of functions that uses pyODK within the database and can filter data by, for example, submission date.

But if your problem is "only" to get pictures out of Central, I'm sure a small Python wrapper using pyODK could serve you the photos directly from Central, instead of building my whole database-centric workflow.
But I will be glad to help if needed.
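
To give a concrete flavour of pyODK used from within the database (the direction mentioned above), here is a hypothetical PL/Python sketch. Assumptions: the plpython3u extension and the pyodk package are available on the database host, and pyODK's config file holds the server URL and credentials. This is not the actual pl-pyodk code.

CREATE OR REPLACE FUNCTION odk_central.save_attachment(
	project_id integer, form_id text, submission_id text,
	file_name text, destination text)
RETURNS void AS $$
	from pyodk.client import Client
	with Client() as client:
		# endpoints not wrapped by pyODK can be reached through its session
		response = client.session.get(
			f"projects/{project_id}/forms/{form_id}"
			f"/submissions/{submission_id}/attachments/{file_name}")
		with open(f"{destination}/{file_name}", "wb") as out:
			out.write(response.content)
$$ LANGUAGE plpython3u;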

1 Like

Yup - that, at the moment, is my only problem! Any pointers for where to start with a wrapper like you suggest? I've no experience with Python either.

Many thanks for your efforts - it's greatly appreciated.

Looking at the get_file_from_central function, it suggests that files are saved to the filesystem but I'm guessing I would need to host them somehow to get them into my reports.

Yes, in our case the folder is used by and visible to both Apache and PostgreSQL (the postgres user), but it could be anywhere else your reporting tool can access.

OK thanks - getting the idea now and am almost there. How do you use the get_file_from_central command? It looks like that will only download one image, whereas I want to download all the images for the form submissions I got with odk_central_to_pg.

Also what does it do with the file names? What's prise_image?

Thanks again!

Let's continue the discussion with direct messages or over GitHub :slight_smile:

We managed to get it working. The functions were not ready to accept space characters in the form_id.
https://forum.cen-occitanie.org/t/suivi-bourreau-des-arbres/963/6?u=mathieu
I will also improve the example given on the GitHub README page.
Thanks @mrrodge, maybe we can clean up the showcase and drop the messages above?

People who use central2pg may want to take a look at pl-pyodk.
It allows the use of filters when pulling data from Central.

With :confetti_ball: Central 2023.4 :confetti_ball: we can now filter data from repeat tables using a top-level (Submissions) filter such as the submission date.
central2pg can make use of it right now, and lets you get only the data that was submitted since a given timestamp or date :date:
For example, data from our main form are pulled every hour to get fresh data; without filtering, we needed to pull all the data created since the form was published (90,000 of them)!

Now we can filter subtables on submission date and pull only the 37 observations created since the last known submission date :slight_smile:

As a consequence we can greatly increase the frequency of pulling data from Central, and be very close to a real-time sync between the field and our databases or desktop tools.
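
For the curious, this relies on Central's OData filter support; such a pull boils down to a request of roughly this shape (project, form and date are placeholders):

GET /v1/projects/2/forms/my_form_about_birds.svc/Submissions?$filter=__system/submissionDate ge 2023-12-01T00:00:00.000Z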

Thanks a lot :pray: to the ODK Team!

5 Likes

Hi @mathieubossaert, I want to get data from ODK Central into my own PostgreSQL database. When I run the function odk_central.odk_central_to_pg() I get the following error:

NOTICE: table "central_json_from_central" does not exist, skipping
NOTICE: table "central_token" does not exist, skipping

ERROR: query string argument of EXECUTE is null
CONTEXT: PL/pgSQL function odk_central.get_form_tables_list_from_central(text,text,text,integer,text), line 11 at EXECUTE
SQL statement "SELECT odk_central.get_submission_from_central(
user_name,
pass_word,
central_FQDN,
project,
form,
tablename,
'odk_central',
lower(trim(regexp_replace(left(concat('form_',form,'',split_part(tablename,'.',cardinality(regexp_split_to_array(tablename,'.')))),58), '[^a-zA-Z\d]', '', 'g'),''))
)
FROM odk_central.get_form_tables_list_from_central('@gmail.com','S','*****',3,'support');"
PL/pgSQL function odk_central.odk_central_to_pg(text,text,text,integer,text,text,text), line 3 at EXECUTE

ERROR: query string argument of EXECUTE is null
SQL state: 22004

Here is how the function is called:
SELECT odk_central.odk_central_to_pg(
'*****@gmail.com', -- user
'*******', -- password
'f.com', -- central FQDN
3, -- the project id
'support', -- form ID
'odk_central', -- schema where to create tables and store data
'' -- columns to ignore in the json transformation to database attributes (geojson fields of GeoWidgets)
);

What could be the problem?
Thank you :pray:

Hi @steeve_kevin ,

let's talk in private or over GitHub if you're OK with investigating and fixing your problem.
My guess is that the curl function returns an error, so PG does not have anything to transform.

And we'll be back here to explain the problem and help other users.

Hello, and thanks for your hard work on central2pg.

I've installed it and it works great.

Is there a way to launch the script with a "signal" from ODK Central, like a new submission? The cron jobs are great, but they use resources for nothing when there is nothing new.

Thanks a lot for your feedback!
And welcome to the forum. Please don't hesitate to introduce yourself on the dedicated thread :slightly_smiling_face:
I am close to solving @steeve_kevin's issue, due to curl subtleties in a Windows environment...

I don't know exactly how much memory the cron daemon uses, but I think it is really light on our server. That may depend on everyone's context.
Since we can filter subtables, it is lighter than ever. If only one submission occurred since the last pull, only one submission is downloaded.

Probably... PostgreSQL can notify events and listen to notifications, but I am not sure that would help to trigger central2pg.
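
For the record, this is what the mechanism looks like on the PostgreSQL side (channel name and payload are made up); the missing piece is something that would raise the notification when a submission arrives:

-- a client or background worker subscribes to a channel...
LISTEN central_submission;
-- ...and some other session would have to emit the event
NOTIFY central_submission, 'my_form_about_birds';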

Maybe pl-pyodk in a web context and a tool like https://github.com/crunchydata/pg_eventserv would be better.

But it would need some modifications to Central.

thank you for your reply.

I will fill in the thread soon.

My need is to display each submission as soon as it is created. I can't run a cron job for every project every second.

My solution was to create a trigger on the submission table. I then call the odk_central_to_pg function with the form_id that fired the trigger. Regards
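
That approach could be sketched like this (everything here is hypothetical: Central's internal schema is not a stable API, so the table and column names, like the credentials, are placeholders to adapt):

CREATE OR REPLACE FUNCTION odk_central.pull_on_submission()
RETURNS trigger AS $$
BEGIN
	-- pull the data of the form the new submission belongs to
	PERFORM odk_central.odk_central_to_pg(
		'me@mydomain.org', 'PassW0rd', 'my_central_server.org',
		NEW.project_id, NEW.form_id, 'odk_data', '');
	RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER pull_after_submission
AFTER INSERT ON submissions
FOR EACH ROW EXECUTE FUNCTION odk_central.pull_on_submission();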

@steeve_kevin's problem is solved; it was due to curl usage on Windows and the need to pass the POST body as a file with -d @data.json.
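
In other words, instead of passing the JSON inline, the credentials go into a file and the session request becomes something like this (URL and file name are placeholders):

curl -X POST https://my_central_server.org/v1/sessions -H "Content-Type: application/json" -d @data.json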

Thank you everyone, and especially @mathieubossaert for your availability.

1 Like

Hello Mathieu,

It's OK with the trigger to push data when a user submits.

The next step is to push images to a specific folder, from a trigger too, and I have a problem with transaction timing: I can select all the information needed for get_attachment_from_central, but when the function launches, it's too early for the database and the function crashes...
Anyway, I have one question:

- Don't you have the problem that it's the postgres user who saves the data with pl-pyodk (regarding the access rights on the image files)?

Do you know how I could save the image as another user with pl-pyodk?

bye