Discussion Series: How have you raised the bar on data quality?

All,

I have a hypothesis:

When it comes to collecting high-quality data, ODK users have a tremendous
wealth of experience.

As a community, it would be extremely valuable to share some of that
hard-won experience, some of that wisdom. So while I have some thoughts of
my own, I'd like to open this week's discussion with a question for you:

How is it that you have raised the bar on data quality?

In other words, what have you done to improve the quality of the data you
collect? And what can we all learn from your experience?

And finally:

How might we, in the ODK community, raise the bar even further in the
future?

Please reply to the group and share some of your thoughts. It can be your
public service for the week.

Thanks,

Chris

---
Christopher Robert
Dobility, Inc. (SurveyCTO)
http://www.surveycto.com/
http://blog.surveycto.com/

In my experience, the most overlooked aspect of data quality is training on
content. Focusing on the tablets too early in the training, before (or at
the same time as) the actual survey questions, is an error I've observed on
multiple occasions. In my opinion, quality control falls into four primary
components.

  1. Survey design: Appropriate questions and responses, flow, skip
    logic, and constraints.
  2. Training: A comprehensive training for enumerators AND supervisors
    to ensure a common understanding of questions, responses, and concepts.
    This should include accompanying documents such as enumerator and
    supervisor guides, as well as cluster control sheets. Supervisors should
    take part in all of the enumerator training, and have additional
    supervisor training as well.
  3. Supervisor oversight: This includes supervisors observing portions
    of at least one enumerator's interviews daily, ensuring questions are
    being asked the same way across enumerators in their team(s). Spot (or
    "back") checks -- with the supervisor returning to a completed respondent
    to ask 2-3 key questions and later comparing the responses they obtain to
    those obtained by the enumerator. Daily debrief sessions (15-30 minutes)
    with enumerator teams. Weekly debrief sessions with supervisors from the
    various enumerator teams. This ensures all issues are shared across the
    teams, and that issues are dealt with, and documented, in a consistent
    way.
  4. Remote data review: Strong communication between the survey
    administrator/reviewer and the data collection team during the first 1/3
    and last 1/3 of data collection (the data review can be with the teams in
    the field, or it can be done remotely). This includes checks such as
    length of interview, and comparing each enumerator's mean responses on
    key questions (questions which can be used as a "skip" for larger
    sections) against the mean across all enumerators, to identify potential
    outliers. It also includes basic hypothesis testing on continuous
    variables to ensure they fall within the expected range (a small sketch
    of these checks follows after this list). I find that in the first 1/3 of
    data collection enumerators make honest mistakes, and if those are not
    caught and identified, the mistakes continue (and become worse). In
    particular, the field test and the first three days of data collection
    are crucially important for ensuring quality data. Equally, I have found
    that data quality also dips in the final 1/3, specifically the final
    week, of data collection as enumerators get tired and "see the light at
    the end of the tunnel". Carefully reviewing data, and providing reports
    to field teams, is a good tool to help mitigate this.
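
To make point 4 a little more concrete, here is a minimal sketch of the
kind of automated review I mean, assuming a CSV export of submissions with
hypothetical column names (enumerator_id, duration_minutes, owns_livestock,
household_size). It is a starting point rather than a complete QC system.

```python
# Minimal per-enumerator review sketch; column names are hypothetical and
# depend on how your server or export tool names the fields.
import pandas as pd

df = pd.read_csv("submissions.csv")

# 1. Flag unusually short or long interviews (outside 1.5 * IQR).
q1, q3 = df["duration_minutes"].quantile([0.25, 0.75])
iqr = q3 - q1
too_fast = df[df["duration_minutes"] < q1 - 1.5 * iqr]
too_slow = df[df["duration_minutes"] > q3 + 1.5 * iqr]
print("Suspiciously short interviews:")
print(too_fast[["enumerator_id", "duration_minutes"]])

# 2. Compare each enumerator's mean on a key "skip" question (coded 0/1)
#    with the mean across all enumerators.
overall_mean = df["owns_livestock"].mean()
by_enum = df.groupby("enumerator_id")["owns_livestock"].agg(["mean", "count"])
by_enum["diff_from_overall"] = by_enum["mean"] - overall_mean
print(by_enum.sort_values("diff_from_overall"))

# 3. Basic range check on a continuous variable (expected 1-30 here).
out_of_range = df[~df["household_size"].between(1, 30)]
print("Out-of-range household sizes:")
print(out_of_range[["enumerator_id", "household_size"]])
```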

~lb


Hi ODK folks, I would like to contribute to the discussion with my views on
enhancing data quality. Of course, constraints, relevance conditions, and
calculations are all crucial to this. Here are my experiences regarding the
first question:

  • Introducing time-stamps at the end of each section. This can be done by
    adding a time question with the read-only property and an informative
    label such as "You are at the end of section 1, please swipe forward to
    section 2...". This way we can record the relative time each enumerator
    spends going through each section, which adds one more variable to
    analyze in order to shorten a particular section or reinforce the
    training for a particular enumerator (a small analysis sketch follows
    after these tips).

  • Adding a single-select question (Q2) with the following choices:
    "Does not know", "Refused to answer", "Skipped in error", "Does not
    apply", which will only appear if the previous question (Q1) was not
    answered or was skipped in error, by adding one of the following
    relevant conditions:

For integer questions I use the following relevant expression: not(${Q1} > 0
or ${Q1} = 0)

For select_one questions (with choice values 1, 2, and 3) I use:
not(${Q1} = '1') and not(${Q1} = '2') and not(${Q1} = '3')

For text questions I use: string-length(${Q1}) = 0

  • Using regular expressions in the constraint column for text questions
    as much as I can, e.g.

      - for email addresses:
        regex(., '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}')
      - for a letters-and-numbers combination:
        regex(., '[A-Z]{3}[0-9]{5}$')
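
To illustrate how those end-of-section time-stamps can be used afterwards,
here is a minimal sketch in Python; the column names are hypothetical and
depend on how you name the time questions and export the data:

```python
# Minimal sketch: relative time per section from end-of-section time-stamps.
# Hypothetical columns: enumerator, start_time, end_section1, end_section2.
import pandas as pd

df = pd.read_csv(
    "survey_export.csv",
    parse_dates=["start_time", "end_section1", "end_section2"],
)

# Minutes spent in each section.
df["section1_minutes"] = (df["end_section1"] - df["start_time"]).dt.total_seconds() / 60
df["section2_minutes"] = (df["end_section2"] - df["end_section1"]).dt.total_seconds() / 60

# Average section time per enumerator: a long average may point to a section
# worth shortening, or to an enumerator who needs reinforced training.
print(df.groupby("enumerator")[["section1_minutes", "section2_minutes"]].mean().round(1))
```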

Regarding the second question, I really think the ODK community could put
together, via a Google Doc or similar, a sheet compiling the different
approaches used for enhancing data quality. Don't you think so?

Best regards

Jorge


--

Jorge Durand Zurdo

joduzu@gmail.com | skype user: joduzu



Thanks, Jorge and Lloyd, for these contributions. As I suspected, there is
much wisdom here!

Some kind of Google Doc or wiki on data quality would be a great idea. If
this discussion generates a good start in terms of raw material, perhaps we
could pull something together as a community.

Lloyd, you're quite right to emphasize the importance of training. This is
an area that we've tended to neglect in terms of our advice to users, and
that's something that should probably change.

Speaking of advice that we give to users -- and in the spirit of sharing
raw materials that might be more broadly useful to the community -- I'm
attaching two related help topics from our online help.

These topics attempt to summarize some key considerations for our users who
want to collect high-quality data. A lot of the features discussed are
general to ODK and so apply equally to all ODK users, but some are
SurveyCTO-specific (like "speed limits"). (If there's demand, we could try
to share some of the SurveyCTO-specific features with the broader
community. The tricky thing is that most of our QC improvements required
hacks to JavaRosa code that many might consider quite ugly and even
objectionable from an open-standards point of view. An example is our new
duration() function for better tracking survey time, which required a
number of tricky hacks.)

Lloyd, you'll see that we barely even mention training, and we could more
heavily emphasize the other human/management aspects of the process. We
might focus too much on the form-design and process-workflow aspects.

What else are we missing? We know that there's a lot out there and want to
learn where some of the gaps are -- and where some of the new opportunities
might be.

Thanks again,

Chris

SurveyCTO Online Help Excerpts - Quality Data.pdf (146 KB)


Okay, surely there are more than three of us with lessons and thoughts
we're willing to share on data quality. Another few questions, then, in
case others can be tempted to share:

  • What do you do to improve the quality of your data that you're
    surprised others don't also seem to do? (Perhaps even something small
    that you think is obvious? Do share!)

And:

  • How do you corroborate the quality of your data? (We've seen more
    projects require photos and audio to corroborate form responses, but
    then how do you actually use that corroborating evidence? Do you check
    every submission, or just a random sub-sample? Do you do it in a field
    office, use Mechanical Turk, or have some other distributed system? A
    small sampling sketch follows below.)
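
For anyone looking for a starting point on the random sub-sample approach,
here is a minimal sketch; the column names are hypothetical, the 10% rate is
just an example, and a fixed seed keeps the draw reproducible.

```python
# Minimal sketch: draw a reproducible random sub-sample of submissions per
# enumerator for media review or back-checks. Column names are hypothetical.
import pandas as pd

df = pd.read_csv("submissions.csv")

def sample_for_review(group: pd.DataFrame) -> pd.DataFrame:
    # Review 10% of each enumerator's submissions, with a minimum of 2.
    n = max(2, int(round(0.10 * len(group))))
    return group.sample(n=min(n, len(group)), random_state=42)

to_review = df.groupby("enumerator_id", group_keys=False).apply(sample_for_review)
to_review[["submission_id", "enumerator_id"]].to_csv("media_review_list.csv", index=False)
```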

Please help make this thread a truly valuable resource for the community.

Thanks,

Chris

---
Christopher Robert
Dobility, Inc. (SurveyCTO)
http://www.surveycto.com/
http://blog.surveycto.com/
