pyODK: Using cursors to efficiently pull new data only

chrislrobert · December 1, 2022, 2:56pm

Thanks for releasing pyODK to facilitate Python-based workflows, for the helpful webinar, etc. I'd like to integrate ODK support into the Python surveydata package, and have a quick question about the best approach to cursors: if I want to pull only new or updated submissions for a form, would the best approach be to use SubmissionService.get_table()'s filter parameter with createdAt and/or updatedAt? Are there any examples out there I might look at? Other approaches you'd recommend?

This surveydata package is new and evolving, but the idea is to provide some abstraction between survey platforms and storage systems, and to allow for efficient synchronization across the two. Any advice on the best way to integrate ODK would be welcome, but I think it should be pretty easy to start with cursor-based data sync.

Thanks very much,

Chris

LN · December 5, 2022, 5:06am

Yes, that’s likely your best bet. There is one issue to be aware of: at the moment, if you save a timestamp and use it in a subsequent query with gt (greater than), the results will sometimes include duplicates, which makes it look like Central is doing a greater-than-or-equal comparison. That’s because the database stores timestamps at a higher precision than OData uses. This probably isn’t an issue if you’re dealing with submission updates, but I did want to mention it because it’s surprising. We’ll likely have this patched by the end of January.
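For anyone following along, here is a rough sketch of what the filter-based approach could look like. It assumes the OData system fields are named `__system/submissionDate` and `__system/updatedAt` (the question above calls them createdAt and updatedAt), that Central accepts `or` in a filter, and that `get_table()` returns the OData JSON as a dict with a `value` list. The helper names `build_cursor_filter` and `fetch_since` are made up for illustration and are not part of pyODK.

```python
from datetime import datetime, timezone


def build_cursor_filter(cursor, op="gt"):
    """Build an OData filter for rows created or updated relative to cursor.

    Assumption: the fields are __system/submissionDate and __system/updatedAt.
    """
    if op not in ("gt", "ge"):
        raise ValueError("op must be 'gt' or 'ge'")
    if cursor.tzinfo is None:
        raise ValueError("cursor must be timezone-aware")
    # Millisecond precision in UTC, e.g. 2022-12-01T14:56:00.000Z
    ts = cursor.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return f"__system/submissionDate {op} {ts} or __system/updatedAt {op} {ts}"


def fetch_since(client, form_id, project_id, cursor, op="gt"):
    """Return the submission rows matching the cursor filter.

    `client` is expected to be a pyodk Client (or anything with the same
    submissions.get_table method).
    """
    table = client.submissions.get_table(
        form_id=form_id,
        project_id=project_id,
        filter=build_cursor_filter(cursor, op),
    )
    return table["value"]
```

With a configured pyODK client this would be called as something like `fetch_since(client, "my_form", 1, last_cursor)`, where the form ID, project ID, and stored cursor are placeholders.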

If you’d prefer to have data in a format other than JSON, that would be helpful to know. The CSV endpoint uses the same filtering, but it would have to be accessed using pyODK’s raw HTTP verb methods. Central doesn’t currently provide filtering on the submission list endpoint, but you could use the OData response to get submission IDs and then request the raw XML for each submission if you wanted.
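As a rough illustration of the two non-JSON routes, the sketch below assumes the CSV export lives at `projects/{id}/forms/{formId}/submissions.csv` and takes a `$filter` query parameter, that the raw XML for one submission lives at `.../submissions/{instanceId}.xml`, and that `client.get()` accepts a path relative to the API root and returns a requests-style response. The function names are hypothetical, so check the paths against the Central API docs before relying on them.

```python
from urllib.parse import quote


def csv_export(client, project_id, form_id, odata_filter):
    """Fetch the filtered CSV export as text (endpoint path is an assumption)."""
    path = f"projects/{project_id}/forms/{quote(form_id, safe='')}/submissions.csv"
    response = client.get(path, params={"$filter": odata_filter})
    response.raise_for_status()
    return response.text


def submission_xml(client, project_id, form_id, instance_id):
    """Fetch the raw XML of one submission (endpoint path is an assumption)."""
    path = (
        f"projects/{project_id}/forms/{quote(form_id, safe='')}"
        f"/submissions/{quote(instance_id, safe='')}.xml"
    )
    response = client.get(path)
    response.raise_for_status()
    return response.text
```

The instance IDs passed to `submission_xml` would come from the `__id` field of each row in the filtered OData response.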

chrislrobert · December 6, 2022, 12:50pm

Okay, great, thanks! And I appreciate the warning re: potential duplicates, but I plan to use greater-or-equal with explicit duplicate handling anyway, out of an abundance of caution (in case the previous request happened to end with one but not all submissions with matching timestamps). JSON format also works well for the storage systems and pulling into Pandas, etc.
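A minimal sketch of the greater-or-equal approach with explicit duplicate handling might look like the following. It assumes each OData row carries an `__id` instance ID and a `__system` object with `submissionDate` and a possibly null `updatedAt`, and it uses a plain dict as a stand-in for the real storage system. The names `merge_rows` and `advance_cursor` are illustrative only.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2022-12-01T14:56:00.123Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def row_timestamp(row):
    """Latest known change time for a row: updatedAt if set, else submissionDate."""
    system = row["__system"]
    return parse_ts(system.get("updatedAt") or system["submissionDate"])


def merge_rows(store, rows):
    """Upsert rows into store keyed by __id; return how many were new or changed."""
    changed = 0
    for row in rows:
        key = row["__id"]
        if store.get(key) != row:
            store[key] = row
            changed += 1
    return changed


def advance_cursor(cursor, rows):
    """Return the newest timestamp seen so far, never moving backwards."""
    return max([cursor] + [row_timestamp(row) for row in rows])
```

Because the next request uses ge against the stored cursor, rows sharing the cursor's timestamp come back again, and `merge_rows` skips the ones already stored unchanged.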

Thanks again,

Chris