ODK Collect crashing on Tecno SAS6 after only one or two interviews

So our problem is not urgent for this field exercise, but if you have any insight from the ANR reports, we will appreciate hearing what you learn. Thank you!

Really glad to hear that the issue is no longer critical, @Dalerhoda. Still, we should figure out what's going on. Thanks for sharing your forms and we'll let you know when we know more.

I spent some time analyzing the problem you have run into. I tried to reproduce the issue but to no avail, just as I expected given that you said it has appeared on <10% of your devices.
However, I think that I know what the cause is... My general conclusion is that your forms might be too complex for the devices you have been using. Below are more details:

  1. Tecno SA6S is a budget device and has just 1GB of RAM, which is very little given that it runs Android 7 (for example, some devices I have with Android 4 or 5 have more).
  2. In your form you use pretty complex calculations (maybe not complex in terms of difficulty, but they are long, which makes them complex; I mean those used in the calculation and relevant columns).

So using forms with complex calculations on a not very powerful device might lead to such problems.

I can recommend:

  • please review your calculations and try to simplify them; you can split them into a few smaller calculations. Here we had a similar problem with a complex form and such a trick helped.
  • you can periodically reboot your devices like every morning to free up some resources
  • you can ask your interviewers not to use those devices for other purposes, I mean not to play games and not to install apps that aren't required (for the same reason as above)

Unfortunately, it's not something that we could easily fix on our side. We have been improving the performance and there is probably still a lot to do, but it's an ongoing process and it will never be perfect.

@Grzesiek2010 and @LN: Thank you very much for your attention. Thanks especially for the time you spent looking at ANR logs and trying to reproduce the problem.

I don't want to sound ungrateful...because I'm very grateful...but I do want to press into this theory that complex calculations could be the problem. If those were the culprit, I would have expected to see a fairly constant volume of failures across our seven days of data collection. Our teams visited over 8,000 homes per day for a week and the complexity of the interviews...the path thru the ODK form...should have been similarly complex across days. That is to say that the teams should have encountered many hundreds of respondents in the target audience, who took the longest possible path thru the interview and encountered the most complex calculations. So it seems odd to me that 10% of the phones would fail on day 1 and then fewer than a dozen per day...and often fewer than 5 per day would fail on the other days if this is a matter of calculation complexity. The ODK form was getting a thorough workout on about 300 phones per day, day after day, and didn't cause consistent problems. If the problem were with the calculations, wouldn't we expect to see widespread problems day after day?

Second, is there helpful guidance somewhere on recommended device specs when planning to do this kind of work? Of course the usual advice is to buy the best hardware you can afford, but is there any advice more useful than that? Overall, the Tecno SA6s devices served us well in this field effort. I would love to be able to plan and say "If we purchase XXX device with YYY specs (RAM, Android version, ODK version, etc), it should comfortably be able to collect data using interview form ZZZ from NNN respondents without needing to reboot or upload the forms." Is there a straightforward way to make such a confident statement to project planners and the procurement team?

Third...I take your point about making the calculations simple and we will strive to do that for future projects. Is there a utility that shows how many resources are used with different versions of the logic? It would be satisfying to load one form and see what gets used and then to load the simpler form and see the savings with the same survey responses. How might we do that? Ideally I would like to be able to amend the statement above and say "If we use the form in its most straightforward incarnation with the logic expressed in natural but somewhat complicated form, we can collect data from NNN respondents without rebooting and if we devote resources to simplify the form, we expect to see ___ operational benefit." (Not crashing? More interviews before reboot needed? Other?)

In this project, I seem to have gotten off lucky. The hardware recommendations and purchases were made before I got involved. My team developed a form that instantiated the questionnaire. We didn't see any problems during testing, although we also did not simulate an entire day of data collection. We won't make that mistake again. On field day 1 it looked like we had substantial problems, but then once the phones were re-re-updated, everything went fairly well and we got the data we hoped to collect.

I didn't particularly deserve to be this lucky...and I would rather not rely on good luck next time, so I'll appreciate pointers to resources to help plan and to decide how many resources to devote to simplifying the logic in the interview forms.

Thank you,
-Dale

If you were way outside the RAM needed by Collect to process your form, you're right that it would just fail systematically. And if you were comfortably within needed RAM, you'd never have a problem. But since you're mostly within the need but flirting with the edge, my guess is that you saw failures when something else was happening on the device outside of your control like another app updating or some operating system task running. Alternately it could be that some enumerators just got to more households for some reason and I'd expect them to have more problems as described below.

No but that's an interesting idea. The thing that is resource-intensive is relationships between fields. So if field B is computed using field A's value, that relationship means that field A's value changing has ripple effects. This gets magnified if you have long chains of relationships. Those relationships are represented in memory. The most effective change you can make in form design is capturing expressions that are identical in calculates and reusing them so that fewer relationships need to be represented.
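To make the point about reuse concrete, here is a toy model in Python. It is only a sketch and not a description of how Collect or JavaRosa actually stores these relationships: it simply counts one relationship for each distinct field that an expression references. The field names and expressions are hypothetical.

```python
import re

def count_relationships(expressions):
    """Count field-to-field references: one per distinct ${name} in each expression."""
    total = 0
    for expression in expressions.values():
        total += len(set(re.findall(r"\$\{(\w+)\}", expression)))
    return total

# Hypothetical form: the same eligibility test is written out in three relevants.
repeated = {
    "card_seen": "${age_months} >= 12 and ${consent} = 'yes' and ${resident} = 'yes'",
    "dose_count": "${age_months} >= 12 and ${consent} = 'yes' and ${resident} = 'yes'",
    "dose_date": "${age_months} >= 12 and ${consent} = 'yes' and ${resident} = 'yes'",
}

# Same logic with the shared test captured once in a calculate and then reused.
reused = {
    "eligible": "if(${age_months} >= 12 and ${consent} = 'yes' and ${resident} = 'yes', 1, 0)",
    "card_seen": "${eligible} = 1",
    "dose_count": "${eligible} = 1",
    "dose_date": "${eligible} = 1",
}

print(count_relationships(repeated))  # 9
print(count_relationships(reused))    # 6
```

In this model the saving grows with the number of questions that share the same test, which is the pattern to look for when reviewing the calculation and relevant columns.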

There is some strangeness around how these relationships are represented when repeats are involved and that takes up further memory. @ggalmazor is actually currently exploring this part of the implementation to at least have it better documented but also to see whether there are improvements that could be made.

In your case, I'm guessing that you ran into issues because of the number of repeats that specific enumerators added and that if they had been assigned fewer households or worked half days or something, you would not have seen any issues. Did you notice adding repeats taking progressively longer? Or saving the form taking progressively longer as more repeats were added? Was that disruptive?

I agree that this should be better documented and have filed https://github.com/opendatakit/docs/issues/1149.

I wish! I know it sounds simple but because of the broad range of things that can be done in form design, the different Android versions available and the huge amount of variation in how devices are set up, I don't think we can really provide such specific and confident guidance. We'll know more after you answer some of my questions above but I'm fairly confident in your case that it's the combination of the relationship between fields and the number of repeats added that caused problems.

Thank you, again, @LN. I’m grateful for your thoughtful reply.

We’re cleaning the final dataset now and will know soon how many repeats we encountered. I did not hear reports of super-long wait times with additional repeats. Anecdotally we heard that some phones were failing very early in the day, but we were scrambling to replace the phones and put the frozen ones back into service and I didn’t have a good process for capturing the timing and conditions of failure. I was also based in the operations center and only hearing the details third-hand. Next time we’ll have some SOPs for documenting when they fail and what was observed just before.

The repeat theory also doesn’t feel right to me, although it could be right. Although we visited ~50k households in a week, we were in a part of Kenya where we shouldn’t have taxed the repeat portion of our instrument. We use repeats for multiple families in a dwelling and multiple children under age 2 in a household, but there shouldn’t have been very much of that where we were working. And it should’ve been spread about evenly over our days of data collection. But maybe you’re right about a conjunction of RAM demands when there was a pending update and a quirky set of logic.

We do have dreams of repeating the work in a more urban setting where we would likely see more multi-family dwellings and would therefore lean more heavily on the repeat portion of the form. So we’ll need to pressure test that very thoroughly maybe before and after making some changes to the form.

I’m super curious about how to say something useful when deciding what hardware to procure and deciding how much more time to spend atomizing the logic. I don’t have a pressing need now, but it seems like this would be a phase of the project that everyone faces sooner or later.

For the Kenya project, whoever was advising the acquisition of Tecno SA6s told the procurement people they could get three years of use out of those phones. I don’t have any idea how they justified that statement. And I don’t yet feel qualified to say confidently what sorts of forms (and repeat / logic structure and workload in terms of stored forms per day) would be safe to field on those phones and what would not. I don’t like that uncertainty.

New question: Is the pressure on the RAM affected by whether the data collector phones have SIM cards? In our recent trial, they did not. We didn’t think they needed them and we wanted to minimize distractions. In a future trial I might lean more on Open Map Kit and want a moving map on the interviewer’s phone so they can tag an OSM rooftop rather than rely on geopoint coordinates. I’m afraid that the mapping…either via SIM card or via stored map tiles will chew up even more RAM resources…isn’t that right? So although it might provide a justification for a SIM card, it might also bring additional RAM-hog processes into the mix. We can try things on individual phones: map from SIM; map from tiles; don’t map at all, but would appreciate guidance if there’s some that’s ready-made or adaptable.

Another new question: Based on what happened recently, we’ll be sure to wake up the phones and give them any updates they need before the next field trial. After doing a set of updates, is there a way to tell the phones to ignore updates that become available during the period of field work? I would like them to settle down and simply run the software onboard, without consuming energy on mid-project updates. Is that possible? Recommended? Or are the phones a bit at the mercy of several asynchronous and unpredictable update calendars…and likely to devote some resources to detecting the possibility and trying to install those updates when I want them to devote all available resources to the task at hand?

Thank you for your patient responses, and warm holiday regards!

-Dale

@ggalmazor and I are picking back up some analysis of your forms. From the first post, it sounded like child_vaccination_VOL_tool_v12 was the form giving issues but follow-up posts made me less certain. Could you please confirm? What's the relationship between the VOL and SUP variants?

If you have a rough estimated range, that'd be helpful (e.g. 20-40 households, 2-5 children per form).

Some services that couldn't run without connectivity might be able to run so there could be a slight difference but it shouldn't be much.

Please note that changes to Android will mean that the Play Store version of Collect will no longer be compatible with OpenMapKit by August. See "Collect will need to stop using /sdcard/odk for files" for more. If OMK is critical for you, please make sure you communicate that with the Open Map Kit team. It sounds like the most important functionality for you is being able to select features, is that right? Do you need to natively produce OSM files? We'd like to bring at least some of the OMK functionality into Collect to make it more readily usable.

The RAM needs likely wouldn't be at the same time so it might not make a difference. Like you say, experimenting is going to be your best bet.

Given that you were able to complete your data collection with so few problems and that your form is quite small, I'm not convinced that there's a systematic RAM issue. I think what you're describing about making sure that the phones have received their updates and have sat for a while before fielding them sounds like a great idea. I don't believe there is a way to stop system work at a certain time but if you let the devices sit online for some hours, confirm that they are idle and then take them offline, I don't think there's much they could be doing.


Hi @LN...thank you again for your attention on this thread. I apologize for not responding in February.

The number of repeats in our November work was negligible. Nearly all the interviews were in single-family dwellings and only a very small portion of HHs had more than one eligible child.

That said, we anticipate future work in an urban setting where we could legitimately enter triple-nested loops, with households inside multi-family dwellings, and more than one eligible child inside some of those households. We're thinking ahead to that possibility and wonder if it would be worthwhile to do some structured testing to see whether our current phones have sufficient RAM.

In a related train of thought, we're thinking even farther down the road to spec'ing out phones for a possible future purchase in another country, and trying to think about how to confidently set the RAM requirement and/or think about redesigning forms that use RAM inefficiently.

I'd love to hear whether others do the sort of testing I describe below and whether you have insights or suggestions concerning this draft plan.

  1. Make sure the phone has any and all recent updates.
  2. Clear off all forms except the stress-test form (one form per dwelling; possibility of numerous households (HH) nested within dwellings; possibility of numerous children nested within HH)
  3. Reboot the phone
  4. In Excel on another device, note the amount of phone RAM currently used and amount currently free.
  5. In Excel, note which of the several scenarios below you are running (a-h):
  6. Enter data from:
    a) 1 dwelling; 40 HH; 1 child each; or
    b) 4 dwellings; 10 HH each; 1 child each; or
    c) 10 dwellings; 4 HH each; 1 child each; or
    d) 40 dwellings; 1 HH each; 1 child each; or

    e)-h) the same as a)-d), respectively, but with 2 children per HH.
  7. Close out the last form.
  8. Record the RAM available
  9. Upload the data; delete from device; go to #2 above.

This would give us a sense of how much RAM might be used in a "day's work" when visiting 40 HH and collecting data on 40-80 kids, in batches of 1 or 2 per HH. We could see if the RAM usage is an appreciable dent in the resources, and of course note if ODK crashes. We would also see if the different numbers of loops inside loops (40x1, 4x10, 10x4, 1x40) seem to make a difference.
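As a cross-check on the scenario list, the eight cases can be enumerated in a few lines of Python. This is only a planning sketch; the class and function names are made up for illustration, and it confirms that every scenario covers 40 HH with either 40 or 80 children.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    label: str
    dwellings: int
    households_per_dwelling: int
    children_per_household: int

    @property
    def households(self):
        return self.dwellings * self.households_per_dwelling

    @property
    def children(self):
        return self.households * self.children_per_household

def build_scenarios():
    """Scenarios a-d use 1 child per HH; e-h repeat the same shapes with 2."""
    shapes = [(1, 40), (4, 10), (10, 4), (40, 1)]  # (dwellings, HH per dwelling)
    labels = "abcdefgh"
    scenarios = []
    for children in (1, 2):
        for dwellings, households in shapes:
            label = labels[len(scenarios)]
            scenarios.append(Scenario(label, dwellings, households, children))
    return scenarios

for s in build_scenarios():
    print(f"{s.label}) {s.dwellings} dwellings x {s.households_per_dwelling} HH "
          f"x {s.children_per_household} child(ren): "
          f"{s.households} HH, {s.children} children")
```

Printing the list before a test day gives the team a checklist, and the same labels can be typed into the spreadsheet next to each RAM reading.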

(We could also drop Excel and add questions whereby we enter the scenario and RAM numbers into ODK. I don't suppose there is a way for ODK to query the phone and auto-magically record the amount of used and available RAM??? If no, is that worth filing an issue to request it??)

We could use those numbers to speculate about how many buildings/HHs/kids might be enough to crash the phone and decide whether to try to accomplish that by hand on several project phones.

(Is there general guidance on how low does free RAM need to go before there can be glitches or crashes?)

And if there is plenty of RAM available after each of these tests, we could stop thinking about re-organizing the form and guess that last November's crashes were not RAM-related and likely due to phones needing to be updated after a long time powered off.

Of course I'd like to hear back that there's some sort of easier clever way to characterize a form's likely RAM gobbling profile. If yes, I'm all ears. Or to hear that there's been an upgrade that makes nested loops less costly now than in the past.

I welcome any feedback on these ideas.

Respectfully and with gratitude in advance,
-Dale

We added a smoke test and a benchmark around the child_vaccination_VOL_tool_v12 form to see whether we could identify systematic issues. We were able to find some small performance improvements to make without fundamentally changing how repeats are implemented but we remain convinced that the behavior you experienced was due to activity outside of Collect.

Collect v1.28 has just been released and includes slight performance enhancements and code simplification related to the form. You've probably already tried the betas but as a reminder, we encourage everyone with repeats or complex form logic to carefully try out v1.28.

One thing I want to flag as definitely expensive to do is having calculations that act across instances of repeats. For example, count and sum calculations can really slow down a form. If you can avoid those, then you at least shouldn't have any significant slowing down as new repeat instances are added.
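A rough way to picture why cross-repeat calculations get expensive is the toy model below. It is a sketch under a stated assumption, not a measurement of Collect: it assumes that each newly added repeat instance forces a sum or count to revisit every instance that exists so far, while a form without such calculations only evaluates the new instance.

```python
def work_with_cross_repeat_sum(instances):
    """Assumed model: adding an instance triggers a re-sum over all instances so far."""
    visited = 0
    for current_count in range(1, instances + 1):
        visited += current_count  # the sum touches every existing instance
    return visited

def work_without_cross_repeat_sum(instances):
    """Assumed model: each added instance only evaluates its own calculations."""
    return instances

for n in (10, 40, 1000):
    print(n, work_with_cross_repeat_sum(n), work_without_cross_repeat_sum(n))
```

Under this assumption the work grows roughly with the square of the number of instances (55 visits for 10 instances, 820 for 40, 500500 for 1000), which is why slowdowns would show up as repeats accumulate rather than at the start of a form.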

I think that ideally the answer to these nested scenarios would be a dedicated way of representing related entities as the threads about longitudinal data collection describe. This is a ways off but I do hope it's where we'll be going.

Yes, definitely. I think your experiment sounds good to get a ballpark idea of how things go. I'd start with a really improbably big scenario first to see if it's even worth doing more experimentation (like 1000 dwellings). I'd also urge you to add a new dwelling instance as part of your experimental protocol to trigger form re-computation. If you don't have any calculations across repeat instances (sum, count), that should be virtually instantaneous.

@ggalmazor & @LN,

Thank you again for your attention to this thread in days of yore. I'm mentioning my colleague, @cclary here, too, so she can find the thread easily. Picking up this discussion after some time away as our client (American Red Cross) wants to do another field test...smaller in scope, but focused on an urban area with likely numerous households per dwelling. Perhaps low-SES multi-flat dwellings.

First, @LN, you mentioned in Sept 2020 that your smoke test & benchmark identified some small edits that would improve performance without changing the way we repeat. I see the links to the code, but not to concrete conclusions or recommendations. If there's a way to capture or communicate those, we'll be grateful.

Our client is the American Red Cross, so @danbjoseph is having a look at our form, as well. The child_vaccination_VOL_tool_v12 form is still the relevant form. (Dropbox link: https://www.dropbox.com/s/81i3jrqdejvaamb/child_vaccination_VOL_tool_v12.xls?dl=0)

We did conduct a variation on the stress test I outlined earlier in this thread and encountered numerous problems, but it was not apparent that they were with RAM. I think they were mostly due to phones trying to update after long dormancy and due to issue #1 below. I can post notes from the team that did that work if anyone is interested.

Today I'm curious about several things, listed here starting with the highest priority:

  1. If the user selects 'single family dwelling', they can tap the 'save' icon whenever they wish and Collect behaves appropriately: pauses momentarily to save the form in progress. But, in Collect v1.30.1, if the user enters the path thru the 'multi-family dwelling' then any time the user clicks the 'save' icon, Collect hangs and we have to exit the app and then go to admin settings -> reset application -> clear the form load cache. This is obviously a non-starter. Dan is looking into whether a different way of handling the indexing of the repeats might help there. We see some forum posts where @LN has provided feedback on this sort of thing.
    The only reason we have that 'multi-family dwelling' path thru the questionnaire, with the (possibly inefficient) repeats, is to associate the SAME GPS location with all of the families...all of the, say, flats in the same apartment building. If someone has an elegant recommendation for another way to unambiguously associate a single GPS collection with numerous subsequent single-form interviews, I'm eagerly open to switching to a design with a single form per HH.

  2. We want to add some photo data collection to the form. Maybe three photos per child, where children are nested within households within dwellings. We are currently working with a phone that @Grzesiek2010 describes as a 'budget device' with only 1GB of RAM and running Android 7. (It's the Tecno SA6S.) So the client wants to know whether we must upgrade the device now, or no? This is another variation on my earlier pre-procurement questions: Is there rule-of-thumb guidance on what specifications are needed in terms of RAM to collect & store photos? We're talking about having each interviewer visit maybe 50 HH/day, a portion of which would have eligible kids and a portion of those would have the document we want to photograph. I'd like to be able to comfortably store the survey responses and say ~100 photos per phone before uploading the data. (No SIM cards so we do a nightly upload when the phone reaches a hotspot or wifi.) (I see the capability to specify the longest dimension of the photo in pixels. That's quite nice. We'll have to do some experimenting to see what resolution we need to read the documents we want to photograph.) Let's say we need 1000 pixels on the long axis for the sake of discussion.

This motivates several questions:

  1. Let's ignore the multi-family dwelling nesting of HH and children for a moment. If these were simple single-form interviews in each HH, would I have any hope of storing 100 photos plus survey responses, plus run Collect & background processes, etc., on the Tecno SA6S? (Maybe 200 if someone forgets to upload forms at the end of day??) And if no...not 100, how many? What if we only took photos for every third eligible interview? Might that work?

  2. If that all feels too iffy, then what are the recommendations for tech specs for the device we should procure? The client seems to be open only to phones...not tablets...at this time.

  3. Now what if we add the frustratingly unquantifiable but anecdotally common assertion that nested forms chew up extra RAM? Do we have any hope of adding photos to our current form in the multi-family dwelling branch? We can run a stress test. That seems to be the best way to get a handle on these things, yes?

  4. Speaking of stress tests, my earlier stress test protocol outline, and the protocol our team used in their test, was to record the phone's available RAM before and after entering a bunch of forms, but did NOT include a check WHILE entering data. A tangentially associated consultant mentioned recently that the RAM demands of nested forms might be much higher DURING data collection...before the form is closed...while all the responses and maybe photos are in RAM and maybe not saved yet...than AFTER the form is closed. Do you think that's right? Any hard evidence of that? If yes, then the stress test protocol should be amended to say 'pop out of Collect' between every 5th or 10th household and record available RAM at that point. Does that sound like a good idea?
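For the mid-collection readings raised in item 4, one possible approach is sketched below. It assumes the test phone is connected by USB to a laptop with Android's adb tool and that the output of `adb shell cat /proc/meminfo` is saved as text while the form is still open; whether that works on the Tecno SA6S has not been verified here. The sample text is illustrative only and was not captured from a real device.

```python
import re

def parse_meminfo(text):
    """Return {field: kilobytes} for each 'Name: value kB' line of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s*kB", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values

# Illustrative sample only; these numbers are made up for the example.
SAMPLE = """MemTotal:         941172 kB
MemFree:           41236 kB
MemAvailable:     312480 kB
Buffers:            9120 kB
"""

reading = parse_meminfo(SAMPLE)
available_fraction = reading["MemAvailable"] / reading["MemTotal"]
print(f"{reading['MemAvailable']} kB available "
      f"({available_fraction:.0%} of {reading['MemTotal']} kB)")
```

Saving one reading per household, tagged with the scenario label and the number of repeats entered so far, would show whether available memory drops while a form is open. The `MemAvailable` field is only present if the device's kernel reports it.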

As always, I'm very grateful to those who read these long posts and respond.

Respectfully submitted,
-Dale

The dropbox link in the post wasn't downloading anything for me, not sure if it's an issue with the link or on my side. Can you edit the post and attach the file directly?

[1.] For the issue with the hang after clicking save, I thought it was solved by adding in a check to make sure the choice is not blank by changing the repeat_count formula from:

if(${building_type} = 'single',1,if(${flatcount} = 0 or indexed-repeat(${finalflat},${household},${flatcount}) != 'yes' , ${flatcount}+1,${flatcount}))

to:

if(${building_type} = 'single',1,if (${flatcount} = 0 or (${household}[position()=${flatcount}]/finalflat != '' and ${household}[position()=${flatcount}]/finalflat != 'yes'), ${flatcount} + 1, ${flatcount}))
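To see where the two formulas differ, here is a small Python model of each. It is a sketch, not XPath: it assumes an unanswered finalflat compares as an empty string, and it represents the answers as a plain list with one entry per household instance. The function names are hypothetical.

```python
def _last_answer(flatcount, finalflat_answers):
    """Answer to finalflat in instance number flatcount, or '' if absent or blank."""
    if 0 < flatcount <= len(finalflat_answers):
        return finalflat_answers[flatcount - 1]
    return ""

def repeat_count_original(building_type, flatcount, finalflat_answers):
    """Model of the original formula: grow unless the last answer is 'yes'."""
    if building_type == "single":
        return 1
    last = _last_answer(flatcount, finalflat_answers)
    if flatcount == 0 or last != "yes":
        return flatcount + 1
    return flatcount

def repeat_count_revised(building_type, flatcount, finalflat_answers):
    """Model of the revised formula: also hold steady while the last answer is blank."""
    if building_type == "single":
        return 1
    last = _last_answer(flatcount, finalflat_answers)
    if flatcount == 0 or (last != "" and last != "yes"):
        return flatcount + 1
    return flatcount

cases = [
    ("multi", 2, ["no", "no"]),   # last flat answered 'no'
    ("multi", 2, ["no", "yes"]),  # last flat answered 'yes'
    ("multi", 2, ["no", ""]),     # last flat not answered yet
]
for case in cases:
    print(case, repeat_count_original(*case), repeat_count_revised(*case))
```

The two versions agree whenever the last finalflat has an answer. They differ only when it is still blank: the original asks for one more instance, while the revised version keeps the current count.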

[2.] When images are part of a form it's my understanding that it's the phone's storage that matters. I don't think additional photo questions in a survey would require noticeably more processing power, just more storage space per submission stored on the device. Might need to consider more closely the use of settings such as "Delete after send."
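A back-of-the-envelope storage estimate can be sketched as follows. The photo and submission counts come from this thread (about 100 photos and 50 HH per day, doubled if a nightly upload is missed), but the average sizes are placeholder assumptions that should be replaced with sizes measured from real test photos at the chosen max-pixel setting.

```python
def storage_needed_mb(photos, avg_photo_kb, submissions, avg_submission_kb):
    """Estimate on-device storage for unsent data, in megabytes."""
    total_kb = photos * avg_photo_kb + submissions * avg_submission_kb
    return total_kb / 1024

# Placeholder assumptions, not measurements: replace with values from test photos.
AVG_PHOTO_KB = 300
AVG_SUBMISSION_KB = 20

one_day = storage_needed_mb(100, AVG_PHOTO_KB, 50, AVG_SUBMISSION_KB)
two_days = storage_needed_mb(200, AVG_PHOTO_KB, 100, AVG_SUBMISSION_KB)
print(f"one day: {one_day:.1f} MB, two days without upload: {two_days:.1f} MB")
```

Comparing the result against the free space the phone reports after installing Collect and the forms indicates how much headroom there is, and how much "Delete after send" matters.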

Thank you for your attention, @danbjoseph.

I am sorry the dropbox link didn't work for you. Here's the XLSFORM as an attachment without your logic change suggestion. We'll try the recommended logic change next week.

-Dale

child_vaccination_VOL_tool_v12.xls (68 KB)

Here's a link to the updated form. We implemented the suggestion made above by @danbjoseph and relayed from an earlier post by @LN and it appears to have cleared up the problem of the save button hanging ODK Collect. Thank you, Dan and Hélène!

This leaves us with our questions about whether the Tecno SA6S is up to the job of collecting data with current forms PLUS photos. (And I should have mentioned earlier my client's policy on this project to NOT put SIM cards in the interviewer phones, so they need to hold at least one day's data with onboard resources.) We will try some stress testing in the absence of parameter-based advice.

And more generally, are there recommendable non-nested (or less-nested) examples of how to associate a single GPS lat/lon pair with numerous households when collecting data in an indoor multi-family dwelling? (Especially keeping in mind the no-SIM card constraint).

Thank you,
-Dale

child_vaccination_VOL_tool_v13.xls (68 KB)

I was talking about changes we made to the JavaRosa/Collect implementation. We did not specifically look for changes to make in the form design but I think if there had been anything majorly problematic we would have noticed it.

I agree with all of this. The images are not maintained in RAM after being taken so they should have minimal impact on performance if any at all. I do want to emphasize what @danbjoseph said -- storage can become an issue and bandwidth can as well. You should be warned on conversion from XLSForm to use max-pixel if you don't already and that's important.

If all of your data collectors are sending images at the same time, you should expect that the submission throughput will be low and that there may be errors sending. You will likely have to stagger submissions and you probably want to run some tests with the number of submissions with images you expect to send simultaneously on the connection you will be using.
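One simple way to stagger submissions is to assign devices to upload batches with a fixed gap between them, as in the sketch below. The device names, batch size, and gap are hypothetical; suitable values would come from the connection tests described above.

```python
def stagger_schedule(device_ids, batch_size, minutes_between_batches):
    """Map each device to a start offset in minutes so only batch_size upload together."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        device: (index // batch_size) * minutes_between_batches
        for index, device in enumerate(device_ids)
    }

# Hypothetical example: 10 phones, 3 uploading at a time, 15 minutes apart.
devices = [f"phone-{number:03d}" for number in range(1, 11)]
schedule = stagger_schedule(devices, batch_size=3, minutes_between_batches=15)
for device, offset in schedule.items():
    print(f"{device}: start upload {offset} minutes after the first batch")
```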

If you have an example of a system that is arbitrarily programmable and tries to make assertions about likely performance of a user-produced artifact, that would be helpful to look at. Given a form definition, it would be possible (though not simple) to quantify the amount of memory used per repeat instance. That's only part of the battle, though, because as you've seen first hand, Android and other applications on the device can be doing all kinds of things and the decisions about what Android chooses to keep in memory and not are hard to predict.

I'm still not convinced that Collect's RAM usage is the primary problem. Android 7.0 specifically has a lot of bugs and so it might just be that you were hitting some before some additional reboots. You could try searching for articles about the most common Android 7.0 issues to see if any of those match what you've experienced and whether there are suggestions for approaching them.

Collect does save a snapshot of collected data on each screen change. One thing you could do is to train data collectors on staying calm if things freeze up, rebooting their device, and when they open up Collect and tap on the form they want to fill out, using the hierarchy view to navigate to the question they were on before the reboot.

Memory is reclaimed once a form is closed so that does not capture the point of highest memory usage, unfortunately.

I'm not sure that you're going to get actionable information this way. What I might suggest is focusing on identifying some repeatable processes that allow data collectors to proceed with minimal interruption. That might be things like clearing any other running processes, killing Collect and restarting it, rebooting as I described above.

I think another major question you need to answer is whether there really is an issue after devices have been updated, kept online for a little while and then rebooted. Your previous experience suggests that there might not be.

You could have some questions at the beginning of a flat household form that ask for building information and that use a default to pull the last saved value for each. That way a data collector can simply swipe through if all the information is the same as for the last record they saved. You could get fancy and calculate a hash of those building values to use as a building identifier (or even just concatenate them all together) when you analyze your data. I want to emphasize again that I am not convinced that there really is a problem with form design, though. I think it's worth doing some trials with the new form updates and the latest version of Collect and see what you find.
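On the analysis side, the building identifier could be derived along these lines. This is a minimal sketch with hypothetical field names; it assumes the building-level answers arrive as strings in each exported submission and that carried-forward values are identical apart from case and stray spaces.

```python
import hashlib

BUILDING_FIELDS = ("building_name", "street", "building_lat", "building_lon")

def building_id(record, fields=BUILDING_FIELDS):
    """Derive a short, stable identifier from one submission's building answers."""
    normalized = "|".join(str(record.get(name, "")).strip().lower() for name in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

# Hypothetical submissions: two flats in one building, one flat in another.
submissions = [
    {"building_name": "Block A", "street": "Main Rd",
     "building_lat": "-1.2921", "building_lon": "36.8219", "flat": "1"},
    {"building_name": "block a ", "street": "Main Rd",
     "building_lat": "-1.2921", "building_lon": "36.8219", "flat": "2"},
    {"building_name": "Block B", "street": "Main Rd",
     "building_lat": "-1.2930", "building_lon": "36.8225", "flat": "1"},
]
ids = [building_id(submission) for submission in submissions]
print(ids)
```

Records that share all building values get the same identifier. If a data collector re-captures the location instead of swiping past the default, the coordinates will differ and the records will not group together, so the choice of fields to include matters.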

Deliciously understated. Touché.

Thank you again for your attention. I understand everything you've said and it makes sense.

I very much like the idea of a default based on the last saved value. I think that could simplify our form a bit.

Best,
-Dale

:blush: To be clear, I am genuinely interested in examples. I'd like for us to provide more guidance but it feels like everything needs a bunch of caveats. Seeing how others have framed similar information would certainly help.

Agreed that defaults based on last saved value can be helpful in a lot of places!

Looking forward to hearing what you find in your next round of testing.

I didn't doubt your sincerity for a moment and I appreciate the tactful manner in which you worded your response. We'll report back on how things go. Hope to do some fieldwork in June.

I'll be appreciative if you have a suggestion for my question over in XLS form- Repeat select_one question until "A" is selected - #24 by Dalerhoda. I'm sure you're pulled in a thousand directions and I think there's no hurry if we switch to the default / last-saved-value model and strip out the repeat in our current form (so use one form per HH) but it's still a point on which I'm curious: whether the repeat counter can be decremented elegantly if the user swipes back and indicates that they're finished looping after initially saying they needed one more ride on the merry-go-round. Thanks in advance for your thoughts when able.

@LN: We're going to try this last saved value approach. We've been hosting our data on Kobo for its turnkey simplicity, but they haven't implemented the last saved value support yet. Before I go to the trouble of figuring out how to partner with someone who can set up ODK Central for me, I want to confirm with you that the question types we want to pull forward will work. (The last-saved documentation is very succinct, which I hope means that it works well and across all question types!) Specifically, does the last saved feature work with geopoints? Thank you.

Yes. It will work with any field type that has a literal value. I believe that means the only limitation is that binary values won't work as expected because the files themselves won't appear in Collect.
