ODK Central clarifications required

Hi,

I just moved from Aggregate to ODK Central. Although it is quite an improvement, I am struggling to figure out a few basic things.

  • How to view the details of an individual submission? I see the SUBMISSIONS table, but it lists only the first few variables of the form, nothing else. DOWNLOAD ALL and then viewing the export is not a good way to browse a single submission's data. I think each submission should be clickable so that we could see its details by clicking on it. Am I missing something?
  • How to see the images with any submission?
  • Where did the google maps layout go? It was a very good way of quickly viewing the location coordinates.

Perhaps someone with more knowledge could help me with these.

Thanks,
Saad

Hi @Saad!

In the next release (v1.1), we're planning to add the ability to select which columns are shown in the table. Right now only the first 10 columns are shown.

In the submissions table, a download link is shown for each image (for columns that are shown in the table). Images are also included in the .zip download.

We don't plan to add data visualization to Central, including maps. See this topic for more about the Central roadmap:

Can you say more about the context in which you do this? Are you spot-checking data? Looking up a specific value? Something else? How do you do this currently in Aggregate -- do you use filters? Why not do it as part of your analysis pipeline?

See here for the structure of each URL. Again, it might help to know what you want to do with the images. Perhaps the expanded submission table will give you what you need or maybe there's something else that would be helpful.
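To make the URL structure concrete, here is a small sketch of how a per-attachment download URL can be assembled. The path shape follows the ODK Central REST API for submission attachments, but the host, project id, form id, instance id, and filename below are all made-up placeholders; check the API docs for the exact path on your version.

```python
# Sketch: building the download URL for one submission attachment (e.g. a photo).
# All concrete values here are hypothetical placeholders.
from urllib.parse import quote

def attachment_url(base, project_id, form_id, instance_id, filename):
    """Return the URL for a single submission attachment."""
    return (f"{base}/v1/projects/{project_id}/forms/{quote(form_id)}"
            f"/submissions/{quote(instance_id)}/attachments/{quote(filename)}")

url = attachment_url("https://central.example.org", 1, "household_survey",
                     "uuid:85cb9aff-005e-4edd-9739-dc9c1a829c44", "photo1.jpg")
print(url)
```

With a session token or Basic auth header, the same URL can be fetched programmatically to pull images into whatever viewer or report you are building.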

To be clear (and there's some about this in the post @Matthew_White linked to), the rationale is that general-purpose visualization and analysis tools are huge endeavors. There are many great ones that exist already so over time we want to make it easier to connect to those. Virtually any tool you might want to use will allow you to import a CSV exported from Central. Currently the best live-updating options are through Excel or PowerBI. While the visualizations in Aggregate could be very helpful if they were exactly what you needed, they are hard to customize. Our hope is that by making it relatively straightforward to get a live-updating dataset into a tool that most people have (Excel on Windows -- this is not supported yet on macOS), the barriers to having custom dashboards will be lower.

Note also that the OData feed is JSON and there are lots of systems that can ingest it. For example, see Automating Data Delivery using the OData Endpoint in ODK Central for a description of how to use Kettle for data integration.
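As a minimal illustration of how easy the OData feed is to consume, the sketch below flattens an OData-style JSON response into CSV rows. The payload is hand-made in the shape the feed returns (a `value` array of submission objects); real field names depend on your form.

```python
# Sketch: flattening an OData-style JSON response into CSV.
# The payload is a hypothetical example, not real Central output.
import csv, io, json

payload = json.loads("""{
  "value": [
    {"__id": "uuid:1", "name": "Ada", "age": 34},
    {"__id": "uuid:2", "name": "Grace", "age": 41}
  ]
}""")

rows = payload["value"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Any system that can make an HTTP request and parse JSON can do the same against the live feed, which is what makes tools like Kettle, Excel, and PowerBI straightforward to connect.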

Hi @Matthew_White, @LN,

Thank you very much for your analysis. It helped me a lot in understanding Central.

Let me try to explain my understanding. This primarily addresses @LN's queries.

In Aggregate, we have a dashboard which, in a minimal format, is absolutely workable. It shows all data, it shows all images, it shows location maps, and many other things, right from the Aggregate interface. So, from a client's point of view, as a consultant, if I install Aggregate for them, it is enough for them to view all their data, in tabular and map format at minimum (let's exclude visualizations for now). In other words, they submit the data from the phones and view the data, images, and locations right there from one single Aggregate interface, and they are excited about it. This, in itself, is a complete end-to-end data collection ecosystem that I always intend to introduce to them to gain their attention and excitement. Later on, I always offer them more by building custom dashboards on top of the Aggregate data, giving them additional value-added features. But this stage comes quite a bit later, when they have already become experts at handling the basic system and start feeling a real need for custom features. So, in other words, with a data table, image viewer, and map view, the Aggregate dashboard has everything (in a minimal, although non-optimal, way) that excites a new client to go for ODK "without adding/installing any external tool or dashboard".

Now, I started using Central about a week ago. I was happy to see some of the features which were on my wishlist (great!). BUT (and it's a big one!), I am really surprised to see basic features of Aggregate missing from Central: things like viewing the whole submission, viewing images, and location pins too. The problem I see now is that I cannot pass this installation of Central on to the client as a whole end-to-end system without installing an additional dashboard/tool at the next hop! Excel, PowerBI, and Google Data Studio are always workable, but only at the second layer. The first layer (i.e. ODK Central) should have a complete set of tools to view all parts of the data in its web-based dashboard. Without these tools, Central seriously looks handicapped as a dashboard, and it is not a complete solution until I pair it with yet another dashboard or external tool!

Please note that I am not negating the great work already done on Central. It has great features, even better than I imagined in my 8+ years of using Aggregate. However, I feel that the small things excluded from Central as part of optimization include some very key data-viewing features, which are far too essential to hand off to external, second-hop tools and dashboards; they should be part of Central's packaging. Third-party systems can always do anything, but we should not exclude basic features from Central entirely, since that seems to undermine the power of ODK as a complete data collection ecosystem.

I hope my point of view is much clearer now. Happy to discuss more if needed.

Regards,
Saad

Hey,

These are very good insights.

I will be trying out Central soon. I tried to install Docker on CentOS 8 recently but broke a few things.

The setup should continue soon. One question I had: does Central work with ODK Briefcase?

Paul

We’ve tried to be clear about what we see as the bounds of Central in What's coming in Central over the next few years. I think it would be helpful to have some of this captured more prominently in the documentation and will make sure it gets added.

Central’s future direction can absolutely be influenced and we appreciate all feedback. But ultimately, it’s concrete financial or software development contributions that will have the biggest impact, especially where big changes in approach are desired. We fund the development of ODK tools through a mix of contract work and direct feature sponsorship. If more organizations that use or even sell ODK tools made investments in the platform, we could certainly do more than we can today.

Our primary focus on Central has been to provide rich form and user management features to support secure data collection at scale even on modest server hardware. A big priority has been to handle large volumes of incoming submissions quickly without risk of data corruption.

Our goal for data analysis has been to help users connect to systems that are specifically intended for analyzing data. We do this through fast CSV exports and the OData feed. You say "Central seriously looks handicapped as a dashboard" and indeed, it does not intend to be a dashboard.

One of the early decisions we made was to store incoming form submissions as XML blobs rather than splitting them into database tables as Aggregate does. This is a decision that we did not make lightly. It has helped our small team make quicker progress and has ensured speed and stability as submissions come in. We learned from Aggregate and other systems that splitting records is a big source of code complexity, bugs, and performance bottlenecks.

The tradeoff is that this limits the analysis that can be done directly on Central -- the data is not organized for any kind of fast operations across the dataset. Additional implications are that directly connecting to the Central database for analysis is not practical and that we don't provide a performant API for open-ended data querying. As I said previously, what we learned from Aggregate is that most people need to rely on external tools for analysis anyway. The OData feed makes all of the above possible with live-updating data.

What it sounds like you have been doing with Aggregate is using it as an entity repository. That is, the data collection you’re involved in is more about building registers of entities than about producing an analysis artifact. You want to be able to look up specific entities either geographically or filtered by some criteria. This is a completely valid use case, and what we’ve done for folks with that need is set up an Excel or PowerBI project with live-updating views on the data. Another great option that requires a little bit of R knowledge would be to provide a Shiny app. It sounds like that may not be practical for you and if Aggregate continues to do what you need it to, then you may not need to switch!

One area we could certainly improve is in having more explicit guides on how to set up common kinds of analysis or querying pipelines. There are some good examples shared in the Showcase but they aren’t incorporated in the documentation. The development team aims to provide complete documentation but we have focused more on software development than writing detailed guides because documentation is an area where community members could participate. There’s generally a lot of opportunity for community members to have high impact in analysis (shoutout to @Florian_May and ruODK).

As we explore more managed workflows for entity-based data collection, it is possible that we will introduce an entity concept that is more richly queryable. However, this is unlikely to become our immediate focus because it is an entirely new area of work. Additionally, what may look like simple functionality can be complex or computationally expensive to do on large datasets. We’d like to first strengthen what can be done with web-based forms (e.g. submission edits), improve and enrich the user and permissions model, and make sure the features we already have are polished.

It does, though we've worked hard to make the CSV export fast so that an external tool is less necessary. Pulling from Central works completely as expected. Pushing to Central currently doesn't work when there are submissions corresponding to different form versions. We intend to release a fix for this soon.

I fully understand this, but the implementation is a bit short-sighted. It would have cost nothing to store the XML in a PostgreSQL xml column instead of as text. Filtering on un-indexed items is probably not extremely fast, but it is really easy to implement.
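To illustrate the kind of filtering a native XML column would enable, here is a sketch using Python's standard library on made-up submission documents; PostgreSQL's xpath() function over an xml column would do the equivalent server-side. The element names are hypothetical.

```python
# Sketch: XPath-style filtering over stored submission XML, demonstrated
# in-memory. The documents and field names below are invented examples.
import xml.etree.ElementTree as ET

submissions = [
    "<data><patient_id>P-17</patient_id><visit_date>2020-11-02</visit_date></data>",
    "<data><patient_id>P-42</patient_id><visit_date>2020-11-03</visit_date></data>",
]

# Find the submission for one patient without splitting the XML into columns.
matches = [xml for xml in submissions
           if ET.fromstring(xml).findtext("patient_id") == "P-42"]
print(len(matches))  # 1
```

The point is that the query logic is trivial; the open question, discussed below, is how fast it is without indexes once the submission count grows.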

I always use Shiny as a frontend, because I am fluent in R, but handling ruODK updates via a local warehouse is currently made rather complex by the limited filtering abilities.

Dieter's reply made me daydream (in R of course):
A shiny app that reads a regularly refreshed data dump from ODK Central (made via ruODK) would be awesome.
Thinking GH Actions running a drake plan to download from ODK Central, saving all dataframes as a named list to RData, and pushing the RData to a directory readable by the Shiny app.
Substitute downloading from ODK Central with downloading from your custom data warehouse (where QA and value add happens) if you wish.

That Shiny app, as a well-documented template that's easy enough to extend and customise, should silence some of the whinging about ODK Central's missing dashboards.

It's only slightly off from what I try to do -- and what I have partially implemented, without the Shiny part. I try to avoid GH Actions and the like, because they need additional installation. OK, my use of MQTT to signal PostgreSQL inserts violates this principle, but I have MQTT available anyway for other messaging. The REST API is fine and nicely documented -- BTW, the documentation of the REST API was the main reason I switched from KoBo, which has a good API (v2) with almost non-existent docs.

The problem is "saving all dataframes" -- I think you are still thinking in terms of scientific data reporting in a batch, and for that use case the existing ruODK is perfectly fine. I need an incremental solution: when you have 100k patient data sets, you do not want to download them all every time. I do this by keeping a skip counter, but that could become less robust once you allow deletion of records.
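The skip-counter approach can be sketched as follows. This uses OData-style $skip/$top paging against a fake in-memory "server" so that only the paging logic is demonstrated; the dataset and page size are invented, and (as noted above) the scheme assumes records are never deleted.

```python
# Sketch: incremental pull with a persisted skip counter, against a fake
# in-memory server. All values here are hypothetical.
ALL_RECORDS = [{"__id": f"uuid:{i}"} for i in range(250)]  # stand-in dataset

def fetch(skip, top):
    """Pretend OData call: return one page of records ($skip/$top semantics)."""
    return ALL_RECORDS[skip:skip + top]

def pull_new(skip_counter, page_size=100):
    """Download everything after skip_counter; return records and new counter."""
    new = []
    while True:
        page = fetch(skip_counter + len(new), page_size)
        new.extend(page)
        if len(page) < page_size:        # short page => no more data
            return new, skip_counter + len(new)

first, counter = pull_new(0)        # initial full pull
later, counter = pull_new(counter)  # later run: nothing new yet
print(len(first), len(later))       # 250 0
```

A "get all since last update" endpoint keyed on upload date, as mentioned below, would make this both simpler and robust against deletions.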

Nevertheless, the best solution would be to skip the local SQLite warehouse altogether and allow querying of the database -- but the decision not to use the XML features of PostgreSQL is unfortunate.

Suggestion: in case the team is willing to make the XML column of type xml, I will look further into how to implement an XPath query API. It would void a lot of the work I have done, but that's OK if we get a more robust solution.

Storing XML natively is a discussion that could be taken up by the TAB and core team.

4 posts were split to a new topic: Forms with external file not appearing in Briefcase export tab

You are absolutely right that the change in column type is easy to make. There are some implications that we must consider, though. Like you say, performance is going to be problematic. I'd expect most filtering to be prohibitively slow when the submission count is in the 10k range with a moderately large form (say 150 questions) on a low-tier server (we use a 1 GB RAM, 1 vCPU DigitalOcean droplet as our benchmark). We do try to be very clear about system performance and I'd be very interested in learning more about your findings as you experiment with this approach.

The other major consideration is that the database is intended to be an implementation detail and not part of the public API. Though Central currently stores submissions as raw XML, that may not always be the case. If we were to expose something like a formal API for XPath querying, then we'd be responsible for maintaining it in the future or communicating its deprecation. It may not be a big deal in and of itself but every feature added no matter how small adds to the long term maintenance of the system. And as much as I personally appreciate XPath and enjoy using it, I've learned that I am in the minority, particularly in our user community. This means we would be adding it for a small audience. That's not to say we shouldn't consider it but hopefully these notes have provided some context for some of the consequences to consider.

Don't give up (it sounds like your peers were not amused)!

I checked XML indexing in PostgreSQL: it is possible to define an index on an XPath, but the XPath is fixed, so you have to build one index for each query; you cannot build an index on the fly. However, having a limited number of indexes (e.g. a combined one on patient ID and date) would be perfectly OK.

  • Create an API call that passes an XPath to create the index.
  • Creation returns an id for the index.
  • Query by index id plus a query parameter. Return the full record; that keeps it simpler.

I checked this, and it is ultrafast and, at least according to the manual, scales well for larger numbers. I also checked on a 2 GB / 1 CPU server.
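For the curious, the DDL such a "create index" API call might emit is a PostgreSQL expression index over xpath(). The sketch below just generates the statement; the table and column names ("submissions", "xml") are placeholders, since Central's actual schema differs and is an internal detail.

```python
# Sketch: generating the DDL for a PostgreSQL expression index over an XPath.
# Table/column names and the index-naming scheme are hypothetical.
def xpath_index_sql(index_id, xpath):
    # xpath() returns xml[]; casting the first element to text makes it indexable.
    return (f"CREATE INDEX submission_xpath_{index_id} ON submissions "
            f"((((xpath('{xpath}', xml))[1])::text));")

print(xpath_index_sql(1, "/data/patient_id/text()"))
```

Each distinct query path would need its own index of this form, which is why a small, fixed set of indexed paths (patient ID, date) is the realistic design.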

Your argument about the 10k range, however, can be turned against you: my problem is that a simple solution would require reading everything from the database, and that is not very fast when you are looking for one patient among 10k. I have to maintain an SQL warehouse at the moment, which is nasty because there is currently no "get all since last update". The latter was announced, and it should be much easier to implement because the upload date is a simple field in a separate table.
