ODK Collect crashing on Tecno SAS6 after only one or two interviews

Q1, 2, 3: Problem/ versions / tried-so-far:

I'm supporting a field test in Kenya running ODK Collect v.1.24.1 and Android 7.0 on TECNO SAS6 hardware with 8GB of storage and 1GB of memory onboard. The data is being stored on https://kobo.humanitarianresponse.info/#/forms. Most of the interviewer phones do not have SIM cards, and are meant to upload data via hotspot 2-3 times per day. We hoped that each would do ~20-40 completed interviews per day, although the pilot is meant partly to sort whether that's realistic.

We have had numerous phones crash while running ODK Collect. I haven't been present when it happens, but I think today I can probably get my hands on a phone that has recently crashed. If there is something helpful I can pull out of the phone to help narrow down the cause, I'll be grateful to learn how to do that.

The project has an extended set of IT support staff, but at the moment we don't have a cohesive approach for this problem and different staff are trying different approaches and solutions and it's not obvious to me that any are effective. Some report that even after resetting the phone to factory settings and re-installing ODK and the forms, it crashes again quite soon.

Some of the IT team believe the problem might be related to the phones needing a Google update. Others believe the problem may have to do with the list of apps that we disabled or deleted to make the phones "less fun" for the field data collectors. I have a list of apps that were disabled/deleted. I've pasted a link to files that describe that below.

I read online today and understand that the problem can be due to using up all the RAM - either by storing too many completed forms or by using logic that is too complicated and not modular. Some of the crashes happened even during training before they had many forms at all...I don't think it's a problem (yet) of too many stored forms. I've pasted a dropbox link to a folder holding the forms. Is there a straightforward or objective tool or way to analyze whether the logic is too complicated? I'm sure that if we had realized it was important, we could have made the calculations and relevance statements more succinct or modular, but at this time that is not an area where we can easily experiment as we have the forms in the field on 200+ phones. We CAN replace the forms on all phones if needed, but I would prefer not to do that more than once and I'd like to have a high degree of confidence that it will work before assembling the crew to do that.

Today is our first day in the field and I'll know in a few hours how many successful interviews we had and I'll have an informal report of the proportion of phones with a problem, but based on training yesterday and the day before it was on the order of 10% or more of the phones.

I've instituted a procedure to track which phones fail, what we try, and whether they fail again, but we lack good ideas other than running Google updates.

4. What steps can we take to reproduce the problem?

I don't know what to say here. I've been entering data on my project phone this morning and I think it's safe to say I've entered more data than anyone could have during the training...and I haven't crashed mine yet. I'll keep trying.

5. Anything else we should know or have? If you have a test form or screenshots or logs, attach below.

How do I access a log to share that?

Here's a link to a dropbox folder that holds two sub-folders:

"Apps removed" - holds two excel files...showing how the apps were dropped for phones with and without SIM cards.

"Forms" holds five .xls forms. I have permission to share the forms. The signin forms are almost trivial. The child_vaccination forms are the main point of this survey and are somewhat complex. The missed_vx_follow_up is also somewhat complex; we won't try that one in the field until late in the week, so if we should simplify the logic there, we probably have time. I believe that the failures are happening mostly in: child_vaccination_VOL_tool_v12.xls

Arrgh. I see that I inadvertently posted this in community instead of support. Apologies for that oversight. And thank you in advance for any input!

Hi @Dalerhoda

I'm sorry to hear that you are encountering crashes. We have done a lot over the last 1-2 years to make the app crash-free, and v1.24.1 in particular seems to be the most reliable release to date according to our reports.
In those reports we have info about devices, android versions etc and I can filter them. Unfortunately I can't find Tecno SAS6. If the app crashed on that device and the device had internet connection (when the app crashed or later it doesn't matter) a report should have been sent.
You said that you have many devices, are they all the same (Tecno SAS6)?
Finding reports from your devices would be the easiest way to help you because I wouldn't need to ask you when it happens, what you did, etc. When I google it (Tecno SAS6) I don't even get many results (this topic is the second one). Maybe it has a different name?

Thank you, @Grzesiek2010.

I made a mistake: It is the Tecno SA6S (not SAS6).

And I should have mentioned that our data collectors' phones do not have SIM cards, so they are offline until they meet up with a supervisor and upload forms via hotspot.

I am now holding a stack of failed phones and the word 'crash' is not quite right...'hang' would be a better description. I'm not sure what happened in the moment before it entered the 'hang' state, but here is what I see:

  1. The screen is completely black, with only the Android status bar at top. (Screenshot 1 in the dropbox linked below.)

  2. I press the square 'overview' button at bottom left several times and the phone comes back to life...I can not proceed in the Collect App, but I can close it and start again. If I do that, and load my simple sign-in form, I can proceed as normal. If I try to 'fill blank form' and select the more complicated form, then it says 'Loading Form'. And then the screen goes BLACK again.

Note that many phones ran the complicated form over-and-over again without difficulty. The problem appeared in about 25 phones out of the 300 we had in the field yesterday.

  3. I press the overview button several more times and get a dialog box: 'ODK Collect isn't responding: Close App or Wait'. If I wait, nothing happens. If I close App, I can go back to the loop above, but cannot proceed to fill the main 'child_vaccination_VOL_tool' form.

  4. Some of these phones show a message on the title page saying 'system wants to do a Google security patch' and a second message about a Google Play Services 'account action required'. We've been pulling these phones back to the Ops Center, doing the updates, and when they don't ask for updates, we've been un/re-installing the Collect app and the forms. I don't have good data yet concerning whether the fixes are working. The IT guys have the impression that phones that fail and are fixed have been failing again, but we didn't start to keep careful track of this until this morning.

Because we are not crashing, per se, we are probably not sending helpful log files to Google Play. Is there a way for me to pull out a log file that might indicate why the program is 'not responding' and why it hangs while loading the form?

Thank you again for any pointers,
-Dale

Ok so seems like it's not a crash but ANR (Application Not Responding) error.
Fortunately we collect data about ANRs as well and here I can find some devices Tecno SA6S.
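
For devices that stay offline and so may never send an ANR report, one option is to pull diagnostics over USB. The following is a minimal sketch, not an official ODK procedure. It assumes `adb` is installed on a laptop, USB debugging is enabled on the phone, and that the Play Store build of Collect uses the package name `org.odk.collect.android`. The function names and output file names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: package name of the Play Store build of ODK Collect.
COLLECT_PACKAGE = "org.odk.collect.android"


def build_adb_commands(serial, out_dir):
    """Return (output_file_or_None, argv) pairs for one device.

    When output_file is None, the adb command writes its own file.
    """
    out = Path(out_dir)
    base = ["adb", "-s", serial]
    return [
        # Dump the current log buffer and exit (-d), with timestamps.
        (out / "logcat.txt", base + ["logcat", "-d", "-v", "time"]),
        # Memory usage summary for the Collect process, if it is running.
        (out / "meminfo.txt", base + ["shell", "dumpsys", "meminfo", COLLECT_PACKAGE]),
        # Full bug report archive; on Android 7+ this is written as a zip.
        (None, base + ["bugreport", str(out / "bugreport.zip")]),
    ]


def collect_diagnostics(serial, out_dir):
    """Run each command and save its text output next to the bug report."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for target, argv in build_adb_commands(serial, out_dir):
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        if target is not None:
            target.write_text(result.stdout)
```

Calling `collect_diagnostics("SERIAL", "phone_017")` with a serial number taken from `adb devices` would leave three files in the `phone_017` folder. It is probably best to do this before rebooting the phone, since a reboot usually discards the in-memory log buffer.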

To be sure the reports come from your devices please answer my questions:

  1. You said the problem occurred in 25/300 phones, are all of them the same Tecno SA6S?
  2. Do they send finalized forms directly from those devices or maybe you pull them using ODK Briefcase? I'm asking because I need to know if those devices are connected to the internet sometimes and they have a chance to send those reports at all.

  1. Yes...all Tecno SA6S...our training began on Nov 13 and most of the problems came on Nov 14 and 15.
  2. They send the forms directly from the phones when connected to a hot-spot once or twice a day. If they had been online, you would see probably three dozen ANR reports from our team over the past week.

Our IT team added updates and reset the problem phones and since Saturday we have had only 2-7 phones 'freeze' per day. We were very worried when we lost almost 10% on the first day, but since then, the problem has been manageable.

When a phone fails, we swap in a spare in the field and bring the ANR phone back to HQ where we exit the app and upload the forms collected before the problem. Our team installs updates and resets to factory settings and re-installs ODK & the forms and puts the phone back in the pool of spare phones.

So our problem is not urgent for this field exercise, but if you have any insight from the ANR reports, we will appreciate hearing what you learn. Thank you!

Really glad to hear that the issue is no longer critical, @Dalerhoda. Still, we should figure out what's going on. Thanks for sharing your forms and we'll let you know when we know more.

I spent some time analyzing the problem you have run into. I tried to reproduce the issue but to no avail, just as I expected given that you said it appeared in <10% of your devices.
However, I think that I know what the cause is... My general conclusion is that your forms might be too complex for the devices you have been using. Below are more details:

  1. Tecno SA6S is a budget device with just 1GB of RAM. That is very little given that it runs Android 7 (for example, some devices I have with Android 4 or 5 have more).
  2. In your form you use pretty complex calculations (maybe not complex in terms of difficulty, but they are long, which makes them complex), namely those in the calculation and relevant columns.

So using forms with complex calculations on a not very powerful device might lead to such problems.

I can recommend:

  • please review your calculations and try to simplify them; you can split them into a few smaller calculations. Here we had a similar problem with a complex form and such a trick helped.
  • you can periodically reboot your devices, for example every morning, to free up some resources
  • you can ask your interviewers not to use those devices for other purposes, I mean not to play games and not to install apps that aren't required (for the same reason as above)
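
To decide which calculations to review first, a rough ranking by size can help. This is only a sketch of one possible heuristic, not an ODK tool: it assumes the XLSForm survey sheet has already been loaded as a list of dictionaries keyed by column name (`name`, `calculation`, `relevant`), and it treats expression length and the number of `${...}` references as a proxy for complexity. The function name and the sample field names are made up for illustration.

```python
import re

# Matches XLSForm field references such as ${age_months}.
REF = re.compile(r"\$\{([^}]+)\}")


def expression_stats(rows, columns=("calculation", "relevant")):
    """Return one record per non-empty expression, longest first."""
    stats = []
    for row in rows:
        for col in columns:
            expr = (row.get(col) or "").strip()
            if not expr:
                continue
            stats.append({
                "name": row.get("name", ""),
                "column": col,
                "length": len(expr),                    # characters in the expression
                "references": len(REF.findall(expr)),   # ${...} references, repeats included
            })
    return sorted(stats, key=lambda s: s["length"], reverse=True)


# Hypothetical rows, only to show the input shape.
sample_rows = [
    {"name": "age_months", "calculation": "int((today() - ${dob}) div 30.4)"},
    {"name": "eligible",
     "relevant": "${age_months} >= 12 and ${age_months} < 24 and ${consent} = 'yes'"},
    {"name": "note1"},
]

for record in expression_stats(sample_rows):
    print(record["name"], record["column"], record["length"], record["references"])
```

The expressions at the top of the list are the natural candidates for splitting into a few smaller calculations.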

Unfortunately, it's not something that we can easily fix on our side. We have been improving performance and there is probably still a lot to do, but it's an ongoing process and it will never be perfect.

@Grzesiek2010 and @LN: Thank you very much for your attention. Thanks especially for the time you spent looking at ANR logs and trying to reproduce the problem.

I don't want to sound ungrateful...because I'm very grateful...but I do want to press into this theory that complex calculations could be the problem. If those were the culprit, I would have expected to see a fairly constant volume of failures across our seven days of data collection. Our teams visited over 8,000 homes per day for a week and the complexity of the interviews...the path thru the ODK form...should have been similarly complex across days. That is to say that the teams should have encountered many hundreds of respondents in the target audience, who took the longest possible path thru the interview and encountered the most complex calculations. So it seems odd to me that 10% of the phones would fail on day 1 and then fewer than a dozen per day...and often fewer than 5 per day would fail on the other days if this is a matter of calculation complexity. The ODK form was getting a thorough workout on about 300 phones per day, day after day, and didn't cause consistent problems. If the problem were with the calculations, wouldn't we expect to see widespread problems day after day?

Second, is there helpful guidance somewhere on recommended device specs when planning to do this kind of work? Of course the usual advice is to buy the best hardware you can afford, but is there any advice more useful than that? Overall, the Tecno SA6s devices served us well in this field effort. I would love to be able to plan and say "If we purchase XXX device with YYY specs (RAM, Android version, ODK version, etc), it should comfortably be able to collect data using interview form ZZZ from NNN respondents without needing to reboot or upload the forms." Is there a straightforward way to make such a confident statement to project planners and the procurement team?

Third...I take your point about making the calculations simple and we will strive to do that for future projects. Is there a utility that shows how many resources are used with different versions of the logic? It would be satisfying to load one form and see what gets used and then to load the simpler form and see the savings with the same survey responses. How might we do that? Ideally I would like to be able to amend the statement above and say "If we use the form in its most straightforward incarnation with the logic expressed in natural but somewhat complicated form, we can collect data from NNN respondents without rebooting and if we devote resources to simplify the form, we expect to see ___ operational benefit." (Not crashing? More interviews before reboot needed? Other?)

In this project, I seem to have gotten off lucky. The hardware recommendations and purchases were made before I got involved. My team developed a form that instantiated the questionnaire. We didn't see any problems during testing, although we also did not simulate an entire day of data collection. We won't make that mistake again. On field day 1 it looked like we had substantial problems, but then once the phones were re-re-updated, everything went fairly well and we got the data we hoped to collect.

I didn't particularly deserve to be this lucky...and I would rather not rely on good luck next time, so I'll appreciate pointers to resources to help plan and to decide how many resources to devote to simplifying the logic in the interview forms.

Thank you,
-Dale

If you were way outside the RAM needed by Collect to process your form, you're right that it would just fail systematically. And if you were comfortably within needed RAM, you'd never have a problem. But since you're mostly within the need but flirting with the edge, my guess is that you saw failures when something else was happening on the device outside of your control like another app updating or some operating system task running. Alternately it could be that some enumerators just got to more households for some reason and I'd expect them to have more problems as described below.

No but that's an interesting idea. The thing that is resource-intensive is relationships between fields. So if field B is computed using field A's value, that relationship means that field A's value changing has ripple effects. This gets magnified if you have long chains of relationships. Those relationships are represented in memory. The most effective change you can make in form design is capturing expressions that are identical in calculates and reusing them so that fewer relationships need to be represented.
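
The two ideas in that paragraph, long chains of relationships and identical expressions that could be captured once and reused, can be checked roughly from the form definition. The sketch below is an illustration under assumptions, not how Collect represents relationships internally: it takes the survey rows as a list of dictionaries, treats each `${...}` reference in `calculation` or `relevant` as one relationship, and all function names are hypothetical.

```python
import re
from collections import defaultdict

REF = re.compile(r"\$\{([^}]+)\}")
COLUMNS = ("calculation", "relevant")


def dependencies(rows):
    """Map each field name to the set of fields its expressions reference."""
    deps = {}
    for row in rows:
        refs = set()
        for col in COLUMNS:
            refs.update(REF.findall(row.get(col) or ""))
        deps[row["name"]] = refs
    return deps


def chain_depth(deps):
    """Length of the longest reference chain below each field (0 = no references)."""
    depth = {}

    def visit(name, stack):
        if name in depth:
            return depth[name]
        if name in stack:  # guard against accidental cycles
            return 0
        stack.add(name)
        longest = 0
        for ref in deps.get(name, ()):
            longest = max(longest, 1 + visit(ref, stack))
        stack.discard(name)
        depth[name] = longest
        return longest

    for name in deps:
        visit(name, set())
    return depth


def repeated_expressions(rows, min_length=20):
    """Find identical expressions (ignoring whitespace differences) used by several fields."""
    seen = defaultdict(list)
    for row in rows:
        for col in COLUMNS:
            expr = " ".join((row.get(col) or "").split())
            if len(expr) >= min_length:
                seen[expr].append(row["name"])
    return {expr: names for expr, names in seen.items() if len(names) > 1}
```

An expression reported by `repeated_expressions` is a candidate for moving into a single calculate field that the other fields then reference by name, and fields with a large `chain_depth` are where a change in one answer ripples furthest.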

There is some strangeness around how these relationships are represented when repeats are involved and that takes up further memory. @ggalmazor is actually currently exploring this part of the implementation to at least have it better documented but also to see whether there are improvements that could be made.

In your case, I'm guessing that you ran into issues because of the number of repeats that specific enumerators added and that if they had been assigned fewer households or worked half days or something, you would not have seen any issues. Did you notice adding repeats taking progressively longer? Or saving the form taking progressively longer as more repeats were added? Was that disruptive?

I agree that this should be better documented and have filed https://github.com/opendatakit/docs/issues/1149.

I wish! I know it sounds simple but because of the broad range of things that can be done in form design, the different Android versions available and the huge amount of variation in how devices are set up, I don't think we can really provide such specific and confident guidance. We'll know more after you answer some of my questions above but I'm fairly confident in your case that it's the combination of the relationship between fields and the number of repeats added that caused problems.

Thank you, again, @LN. I’m grateful for your thoughtful reply.

We’re cleaning the final dataset now and will know soon how many repeats we encountered. I did not hear reports of super-long wait times with additional repeats. Anecdotally we heard that some phones were failing very early in the day, but we were scrambling to replace the phones and put the frozen ones back into service and I didn’t have a good process for capturing the timing and conditions of failure. I was also based in the operations center and only hearing the details third-hand. Next time we’ll have some SOPs for documenting when they fail and what was observed just before.

The repeat theory also doesn’t feel right to me, although it could be right. Although we visited ~50k households in a week, we were in a part of Kenya where we shouldn’t have taxed the repeat portion of our instrument. We use repeats for multiple families in a dwelling and multiple children under age 2 in a household, but there shouldn’t have been very much of that where we were working. And it should’ve been spread about evenly over our days of data collection. But maybe you’re right about a conjunction of RAM demands when there was a pending update and a quirky set of logic.

We do have dreams of repeating the work in a more urban setting where we would likely see more multi-family dwellings and would therefore lean more heavily on the repeat portion of the form. So we’ll need to pressure test that very thoroughly, maybe before and after making some changes to the form.

I’m super curious about how to say something useful when deciding what hardware to procure and deciding how much more time to spend atomizing the logic. I don’t have a pressing need now, but it seems like this would be a phase of the project that everyone faces sooner or later.

For the Kenya project, whoever was advising the acquisition of Tecno SA6S told the procurement people they could get three years of use out of those phones. I don’t have any idea how they justified that statement. And I don’t yet feel qualified to say confidently what sorts of forms (and repeat / logic structure and workload in terms of stored forms per day) would be safe to field on those phones and what would not. I don’t like that uncertainty.

New question: Is the pressure on the RAM affected by whether the data collector phones have SIM cards? In our recent trial, they did not. We didn’t think they needed them and we wanted to minimize distractions. In a future trial I might lean more on Open Map Kit and want a moving map on the interviewer’s phone so they can tag an OSM rooftop rather than rely on geopoint coordinates. I’m afraid that the mapping…either via SIM card or via stored map tiles will chew up even more RAM resources…isn’t that right? So although it might provide a justification for a SIM card, it might also bring additional RAM-hog processes into the mix. We can try things on individual phones: map from SIM; map from tiles; don’t map at all, but would appreciate guidance if there’s some that’s ready-made or adaptable.

Another new question: Based on what happened recently, we’ll be sure to wake up the phones and give them any updates they need before the next field trial. After doing a set of updates, is there a way to tell the phones to ignore updates that become available during the period of field work? I would like them to settle down and simply run the software onboard, without consuming energy on mid-project updates. Is that possible? Recommended? Or are the phones a bit at the mercy of several asynchronous and unpredictable update calendars…and likely to devote some resources to detecting the possibility and trying to install those updates when I want them to devote all available resources to the task at hand?

Thank you for your patient responses, and warm holiday regards!

-Dale

@ggalmazor and I are picking back up some analysis of your forms. From the first post, it sounded like child_vaccination_VOL_tool_v12 was the form giving issues but follow-up posts made me less certain. Could you please confirm? What's the relationship between the VOL and SUP variants?

If you have a rough estimated range, that'd be helpful (e.g. 20-40 households, 2-5 children per form).

Some services that couldn't run without connectivity might be able to run so there could be a slight difference but it shouldn't be much.

Please note that changes to Android will mean that the Play Store version of Collect will no longer be compatible with OpenMapKit by August. See "Collect will need to stop using /sdcard/odk for files" for more. If OMK is critical for you, please make sure you communicate that with the Open Map Kit team. It sounds like the most important functionality for you is being able to select features, is that right? Do you need to natively produce OSM files? We'd like to bring at least some of the OMK functionality into Collect to make it more readily usable.

The RAM needs likely wouldn't be at the same time so it might not make a difference. Like you say, experimenting is going to be your best bet.

Given that you were able to complete your data collection with so few problems and that your form is quite small, I'm not convinced that there's a systematic RAM issue. I think what you're describing about making sure that the phones have received their updates and have sat for a while before fielding them sounds like a great idea. I don't believe there is a way to stop system work at a certain time but if you let the devices sit online for some hours, confirm that they are idle and then take them offline, I don't think there's much they could be doing.


Hi @LN...thank you again for your attention on this thread. I apologize for not responding in February.

The number of repeats in our November work was negligible. Nearly all the interviews were in single-family dwellings and only a very small portion of HHs had more than one eligible child.

That said, we anticipate future work in an urban setting where we could legitimately enter triple-nested loops, with households inside multi-family dwellings, and more than one eligible child inside some of those households. We're thinking ahead to that possibility and wonder if it would be worthwhile to do some structured testing to see whether our current phones have sufficient RAM.

In a related train of thought, we're thinking even farther down the road to spec'ing out phones for a possible future purchase in another country, and trying to think about how to confidently set the RAM requirement and/or think about redesigning forms that use RAM inefficiently.

I'd love to hear whether others do the sort of testing I describe below and whether you have insights or suggestions concerning this draft plan.

  1. Make sure the phone has any and all recent updates.
  2. Clear off all forms except the stress-test form (one form per dwelling; possibility of numerous households (HH) nested within dwellings; possibility of numerous children nested within HH)
  3. Reboot the phone
  4. In Excel on another device, note the amount of phone RAM currently used and amount currently free.
  5. In Excel, note which of the several scenarios below you are running (a-h):
  6. Enter data from:
    a) 1 dwelling; 40 HH; 1 child each; or
    b) 4 dwellings; 10 HH each; 1 child each; or
    c) 10 dwellings; 4 HH each; 1 child each; or
    d) 40 dwellings; 1 HH each; 1 child each; or

    e), f), g), or h) corresponding to a), b), c), or d) with 2 children per HH.
  7. Close out the last form.
  8. Record the RAM available.
  9. Upload the data; delete from device; go to #2 above.

This would give us a sense of how much RAM might be used in a "day's work" when visiting 40 HH and collecting data on 40-80 kids, in batches of 1 or 2 per HH. We could see if the RAM usage is an appreciable dent in the resources, and of course note if ODK crashes. We would also see if the different numbers of loops inside loops 40x1, 4x10, 10x4, 1x40 seems to make a difference.
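
The eight scenarios can be enumerated programmatically so that every test run is logged against the same labels and totals. This is a small sketch of the plan above; the class and function names are hypothetical, and the numbers come straight from scenarios a-h.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    label: str
    dwellings: int
    households_per_dwelling: int
    children_per_household: int

    @property
    def households(self):
        # Total households entered in one run of this scenario.
        return self.dwellings * self.households_per_dwelling

    @property
    def children(self):
        # Total children entered in one run of this scenario.
        return self.households * self.children_per_household


def build_scenarios():
    """Scenarios a-d use 1 child per HH; e-h repeat them with 2 children per HH."""
    shapes = [(1, 40), (4, 10), (10, 4), (40, 1)]  # (dwellings, HH per dwelling)
    labels = "abcdefgh"
    scenarios = []
    for children in (1, 2):
        for dwellings, households in shapes:
            label = labels[len(scenarios)]
            scenarios.append(Scenario(label, dwellings, households, children))
    return scenarios


for s in build_scenarios():
    print(f"{s.label}) {s.dwellings} dwellings x {s.households_per_dwelling} HH "
          f"x {s.children_per_household} child(ren): "
          f"{s.households} HH, {s.children} children")
```

Every scenario totals 40 households, with 40 children in a-d and 80 in e-h, so any difference in RAM use between runs would come from how the repeats are nested rather than from the amount of data entered.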

(We could also drop Excel and add questions whereby we enter the scenario and RAM numbers into ODK. I don't suppose there is a way for ODK to query the phone and auto-magically record the amount of used and available RAM??? If no, is that worth filing an issue to request it??)
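
One way to capture the RAM numbers from outside Collect, sketched under assumptions: with USB debugging enabled, `adb shell cat /proc/meminfo` prints the kernel's memory counters in kB, and the text can be parsed on a laptop. The function names are hypothetical, the sample text contains made-up numbers, and the `MemAvailable` line may be missing on older kernels, so the sketch falls back to `MemFree`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def free_ram_summary(text):
    """Summarize total and available RAM in MB and as a percentage."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    # Fall back to MemFree if the kernel does not report MemAvailable.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return {
        "total_mb": total // 1024,
        "available_mb": available // 1024,
        "available_pct": round(100 * available / total, 1),
    }


# Made-up sample text, only to show the expected input format.
sample = """MemTotal:        1024000 kB
MemFree:           51200 kB
MemAvailable:     204800 kB
"""

print(free_ram_summary(sample))
```

Recording this summary before step 6 and again at step 8 would give the before-and-after figures for each scenario without typing them into Excel by hand.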

We could use those numbers to speculate about how many buildings/HHs/kids might be enough to crash the phone and decide whether to try to accomplish that by hand on several project phones.

(Is there general guidance on how low does free RAM need to go before there can be glitches or crashes?)

And if there is plenty of RAM available after each of these tests, we could stop thinking about reorganizing the form and guess that last November's crashes were not RAM-related and likely due to phones needing to be updated after a long time powered off.

Of course I'd like to hear back that there's some sort of easier clever way to characterize a form's likely RAM gobbling profile. If yes, I'm all ears. Or to hear that there's been an upgrade that makes nested loops less costly now than in the past.

I welcome any feedback on these ideas.

Respectfully and with gratitude in advance,
-Dale

We added a smoke test and a benchmark around the child_vaccination_VOL_tool_v12 form to see whether we could identify systematic issues. We were able to find some small performance improvements to make without fundamentally changing how repeats are implemented but we remain convinced that the behavior you experienced was due to activity outside of Collect.

Collect v1.28 has just been released and includes slight performance enhancements and code simplification related to the form. You've probably already tried the betas but as a reminder, we encourage everyone with repeats or complex form logic to carefully try out v1.28.

One thing I want to flag as definitely expensive to do is having calculations that act across instances of repeats. For example, count and sum calculations can really slow down a form. If you can avoid those, then you at least shouldn't have any significant slowing down as new repeat instances are added.
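
To find where a form uses these functions, the expressions can be scanned for calls to `count(` and `sum(`. This is a rough text-matching sketch, not a real XPath parser: it assumes the survey rows are available as a list of dictionaries, the function name is hypothetical, and a match only marks a place worth reviewing, since it cannot tell whether the argument is actually a repeat.

```python
import re


def aggregate_calls(rows, functions=("count", "sum")):
    """List (field name, column, functions found) for expressions calling the given functions."""
    names = "|".join(re.escape(f) for f in functions)
    # \b and the required "(" keep count-selected(...) from matching count(...).
    pattern = re.compile(r"\b(" + names + r")\s*\(")
    flagged = []
    for row in rows:
        for col in ("calculation", "relevant", "constraint"):
            expr = row.get(col) or ""
            found = sorted(set(pattern.findall(expr)))
            if found:
                flagged.append((row.get("name", ""), col, found))
    return flagged


# Hypothetical rows, only to show the input shape.
sample_rows = [
    {"name": "n_children", "calculation": "count(${child_repeat})"},
    {"name": "total_doses", "calculation": "sum(${doses}) + 0"},
    {"name": "n_selected", "calculation": "count-selected(${vaccines})"},
    {"name": "follow_up", "relevant": "${consent} = 'yes'"},
]

for name, column, found in aggregate_calls(sample_rows):
    print(name, column, found)
```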

I think that ideally the answer to these nested scenarios would be a dedicated way of representing related entities as the threads about longitudinal data collection describe. This is a ways off but I do hope it's where we'll be going.

Yes, definitely. I think your experiment sounds good to get a ballpark idea of how things go. I'd start with a really improbably big scenario first to see if it's even worth doing more experimentation (like 1000 dwellings). I'd also urge you to add a new dwelling instance as part of your experimental protocol to trigger form re-computation. If you don't have any calculations across repeat instances (sum, count), that should be virtually instantaneous.