Looping through three levels of data from attached CSVs is slow in web form

Guenther_Fink · December 21, 2023, 1:54pm

Hi all, I have a large data set (50k) that contains names of individuals nested in households, which are nested in villages. I can cascade select the household, but would then like to automatically loop through all household members. I managed to make this work using LN's code when I put all members in the Choices set - this does however not work with 50k rows. Is there a way to do a loop through a subset called in from a .csv file?

Thanks for your help!

Günther

Guenther_Fink · December 21, 2023, 2:00pm

p.s: The closest link I could find is

the main loop code
count(instance('shops')/root/item[shopper=${shopper}])
does not appear to work however when the list of shoppers is called in from an external CSV file

aurdipas · December 29, 2023, 12:03pm

@Grzesiek2010 any idea?

Grzesiek2010 · December 29, 2023, 1:36pm

Hi @Guenther_Fink

Can I see the structure of that csv file? You can attach just a part of it or prepare a sample one with the same structure if you can't share your data.
It sounds like it would be easier to have separate files for those different levels (villages, households etc.).

It would be also good to see how you achieve that to understand what you want to have with external csv files (if you need exactly the same behavior).

Guenther_Fink · December 30, 2023, 4:22pm

@Grzesiek2010 Yes, of course. I am attaching a mock data set here: the structure is very simple - it is just a name of people linked to a household ID. I can look up all people matching a given householdID and then do select-multiple. I also managed to do a count of matching IDs, but that seems to slow down the form quite a bit. Breaking down the file into subfiles could definitely work - is that the best way?

persons.csv (1.1 MB)

Grzesiek2010 · January 3, 2024, 1:54pm

Ok and the households are in a separate file with ids of their villages that are in another separate file?

Guenther_Fink · January 3, 2024, 3:56pm

dr.csv (1.0 KB)
form1.xlsx (630.2 KB)
form2.xlsx (630.3 KB)
households.csv (336.7 KB)
persons.csv (1.1 MB)

Guenther_Fink · January 3, 2024, 4:00pm

Yes, exactly. There are actually 3 levels here - villages (choices - there are few), census district (dr), and within each DR, there are households with members. I created two forms that both work: form1 used the select_multiple_from_file command and runs very smoothly, but requires all members to be manually selected (which is okay, but a bit annoying). form2 tries to apply the code created by LN, essentially loading the members through a "silent" question, and then looping through the list. This code does what I really want, but somehow is incredibly slow. Any ideas on how to make this equally fast as the first form?

Thanks for your help!

Grzesiek2010 · January 4, 2024, 6:41pm

I think this sample form should help you solve your problem. I've created a form with two external csv files (one for continents and one for countries). It works in the way that if you select Europe in the first question, there will be a repeat group looping through the European countries. Otherwise, if you select North America the loop will go through countries from North America.
I think this is exactly what you have described but simpler just with two levels, not three but it shows how to deal with such cases.
countries.xlsx (5.9 KB)
countries.csv (155 Bytes)
continents.csv (85 Bytes)

Guenther_Fink · January 5, 2024, 6:28am

Thanks so much for your help. This logic is what I used in form2.xlsx that I posted above. It works but is incredibly slow with the 50k list, while the standard select_multiple option I used in form1 is very fast. Any idea why this is the case?

LN · January 10, 2024, 11:48pm

Could you quantify this a little bit? I imagine it'd be something like 3 seconds * number of people in household + 1? The base number of seconds will vary depending on your device but I think it will grow proportionally to household members, does that seem right?

I believe that what's going on is that in instance('persons')/root/item[hhid=${hhid}][position()=current()/../mem_pos]/label Collect currently doesn't do any caching for expressions with multiple predicates so hhid=${hhid} requires iterating over the entire persons list for each household member and an additional time to get the count of people in the household.

@seadowg I'm a little bit surprised by this -- I thought Collect did do caching for first predicates. Am I remembering wrong?

Your form1 is a very clever way of capturing all members of the household exactly once. I think you might be able to achieve the same thing with a calculate that does join(" ", instance('persons')/root/item[hhid=${hhid}]). That would remove the user intervention which I agree is not ideal.

seadowg · January 11, 2024, 9:37am

That's certainly intended: https://github.com/getodk/javarosa/blob/f489a3eb02f802d783e8aec5709a09a21ae61895/src/test/java/org/javarosa/core/model/PredicateCachingTest.java#L171.

We'd need to do some profiling on the form to see exactly what's slow here. It could be that there's some edge case we're missing here or that something else unexpected is slowing us down.

Guenther_Fink · January 11, 2024, 12:23pm

Thank you very much @LN. 3 seconds x number of people in the household sounds about right. I just played with the code, and it seems like both the household member lookup and the household size count

count(instance('persons')/root/item[hhid=${hhid}])
slow down the form quite a bit. The join function is a bit faster it seems, but the form still gets stuck for about 30 seconds on the first question (select a village), which does not happen when I just use the select_multiple option in form 1. If I want to use the join function, then I would have to loop through the elements in the list created by join() next, correct?

Grzesiek2010 · January 12, 2024, 12:04am

I'm a little bit surprised by this -- I thought Collect did do caching for first predicates. Am I remembering wrong?

That might by it!

@Guenther_Fink What version of ODK Collect are you using? I've just checked and it works slow on v2022.4 but on v2023.3 everything seems to be fine. Maybe you have one of the older versions?

LN · January 12, 2024, 1:03am

Thank you, @Grzesiek2010, I just came to ask about Collect version as well! I tried form2 with the provided attachments on a Moto G5 running Android 7 and it feels fast as well.

Guenther_Fink · January 12, 2024, 7:09am

We are using ODK central - @aurdipas: are we on v4?

Grzesiek2010 · January 12, 2024, 8:16am

I'm asking about the mobile app version (it's displayed in the main menu at the bottom).

Guenther_Fink · January 12, 2024, 8:48am

Okay, sorry. I have been testing the forms directly via the browser on ODK central - let me also try them on the tablet and report back. Thanks!

Guenther_Fink · January 14, 2024, 1:15pm

I just updated ODK collect and re-loaded all forms - they work fine now, so thanks a million and sorry for the many questions. I guess the browser version does not pre-load the data in the same way - I guess I really should do all future testing directly on the tablet rather than using the browser-based testing on ODK central. In any case: thanks a million for all your help and input - really appreciate it!

LN · January 16, 2024, 4:53pm

We've been putting work into aligning web forms and the mobile app but there are still some major differences, including indeed some optimization work around attached datasets that hasn't made it to web forms yet. I agree that it's confusing and I'm glad that things are working well on the tablets you'll use for data collection. Don't hesitate to ask if you have further questions!