Qualitative research part 1: A primer
I was recently asked by a client whether the concept testing we were about to carry out would deliver valid results.
Now that’s a perfectly reasonable question for a client to ask, and I happily explained how qualitative user testing, despite the low number of participants, identifies not only where the problems are with a design but also why they are problems. I talked about the methods we use and the significance of the findings in relation to the cohort size. So everyone was happy, but what about the client’s colleagues who didn’t see the research take place, or their boss who signed off the budget?
That got me thinking about the research methods that we build on and apply so we can trust our qualitative research, and how we communicate this trust to the wider circle of end users who will read and act on our research.
Foundations of trust
The foundations of trust in qualitative research are reliability and validity. These are what your colleagues and your boss might be interested in when you re-present a cxpartners report containing analysis and recommendations to them once all the testing is over.
Reliability
Reliability represents the quality of the measurement used to capture the data during concept testing, in-depth interviews (IDIs) or fieldwork. Essentially: are the measurements consistent and repeatable throughout the research?
We want to know that the data is not random, that there is a pattern. In particular, two methods that can be used to evaluate reliability through replication in concept testing, IDIs and fieldwork are:
Inter-observer reliability – where two observers measure, mark or record data during the research and the two sets of data are compared for consistency and agreement. This consistency and agreement can be measured and given a score, e.g. if 100 tests were given and 85 of the tests resulted in the same data, this would give 85% agreement; crude but effective.
Split-half reliability – where all the questions in a topic guide that measure the same concept are randomly divided into two sets, or two sets of questions that probe the same things are generated. The question sets are asked of the participants and scores are kept for each set of questions. The reliability estimate is the correlation (Pearson r, at its most basic) between the two sets of scores.
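Both scores described above are simple to calculate. The following is a minimal Python sketch rather than a description of any particular tool: the function names, the pass/fail observations and the split-half scores are all hypothetical and exist only to illustrate the arithmetic.

```python
from math import sqrt


def percent_agreement(observer_a, observer_b):
    """Percentage of observations on which two observers recorded the same thing."""
    if len(observer_a) != len(observer_b) or not observer_a:
        raise ValueError("both observers need the same, non-zero number of observations")
    matches = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100 * matches / len(observer_a)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two scores each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of scores never varies")
    return cov / sqrt(var_x * var_y)


# Inter-observer reliability: 100 tests, the observers agree on 85 of them.
# (Hypothetical pass/fail codes; any comparable coding scheme would do.)
observer_a = ["pass"] * 100
observer_b = ["pass"] * 85 + ["fail"] * 15
print(percent_agreement(observer_a, observer_b))  # 85.0

# Split-half reliability: each participant's score on each half of the
# question set (hypothetical scores for five participants).
half_one = [4, 3, 5, 2, 4]
half_two = [5, 3, 4, 2, 4]
print(round(pearson_r(half_one, half_two), 2))
```

A correlation close to 1 suggests that the two halves measure the same thing consistently; a value near 0 suggests that they do not.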
Validity
There are numerous kinds of validity that need to be accounted for in academic social research, so I’ll briefly touch on two that have the most immediate impact on usability and user experience research.
Construct validity
The one that no one asks about.
Construct validity is the confidence we have that the measures we’re using are actually measuring what we think they are.
Thinking about construct validity is important when developing and writing user test plans and IDI topic guides. For example, it would be no good asking someone how they felt about using a tool or buying a widget when what we really wanted to know was how usable the tool was or how easy it was to buy the widget.
External validity
The one that we get asked about most of all.
External validity is the confidence we have that our findings apply beyond the people we tested. For example, when participants in a usability test do not know how to find or buy a widget, can we confidently say that no one else in the wider population will be able to find or buy the widget either?
Visualising reliability and validity
Figure: Visualising unreliable measurement and invalid results.
A good and frequently used visualisation of reliability and validity is that of a set of scales. If we used a set of scales to weigh a 12 stone (76 kg) person five times and each result was different, e.g. 5 stone, 7 stone, 13 stone, 15 stone and 10 stone, then the result is neither reliable nor valid.
If we weighed the same person again five times on a different set of scales and each time the result was the same, e.g. 10 stone, 10 stone, 10 stone, 10 stone, and 10 stone, then the result is reliable but not valid.
Figure: Visualising reliable measurements and invalid results.
Only when we weighed the person five times and each result was the same and accurate would the result be reliable and valid.
Figure: Visualising reliable measurement and valid results.
In this example we obviously know the weight of the person before we begin. When we test websites we don’t know the answer before we begin. We know we can check for reliability as we’ve outlined above, but how do we know that it’s valid?
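The scales illustration can be sketched numerically, which is only possible here because the true weight is known in advance. This is a rough Python sketch under stated assumptions: the `classify` function and the half-stone tolerance are hypothetical choices made for illustration, not a standard test of reliability or validity.

```python
from statistics import mean


def classify(readings, true_value, tolerance=0.5):
    """Return (reliable, valid) for a set of repeated readings.

    Assumptions for this sketch: readings count as reliable when they all
    fall within `tolerance` of each other, and as valid when they are
    reliable and their average is within `tolerance` of the true value.
    """
    reliable = max(readings) - min(readings) <= tolerance
    valid = reliable and abs(mean(readings) - true_value) <= tolerance
    return reliable, valid


TRUE_WEIGHT = 12  # stone

print(classify([5, 7, 13, 15, 10], TRUE_WEIGHT))   # (False, False)
print(classify([10, 10, 10, 10, 10], TRUE_WEIGHT))  # (True, False)
print(classify([12, 12, 12, 12, 12], TRUE_WEIGHT))  # (True, True)
```

The first check needs only the readings themselves, whereas the second cannot be run without `TRUE_WEIGHT`, which is exactly the value we lack when testing a website.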
Next time: How we can ensure validity and reliability in our research
In part two we’ll look at the methods available to us during research that allow us to maintain reliability and validity. We’ll also look at the threats to reliability and validity and how we deal with them.
About the author
Walt has spent the last 11 years working with the web. He has a background in design and production and enjoys ethnographic research. He’s renovating his house at the moment, so he’s doing a lot of D.I.Y at the weekends! Email Walt, or call +44 (0)117 946 3930
How much construct validity is there when participants are being paid to do things before an audience, while being video-ed and in an unfamiliar environment?
Hi Dave,
Thanks for the comment. In answer to your question, ‘How much construct validity is there…?’: this is a big subject, and this will only be a brief look at some of the major concerns.
Truthfully, we can’t know how much construct validity there is in a typical usability test. We do recognise that the setting is unnatural, and we respond by presenting a comfortable, friendly environment and behaving in ways that make the participants’ experience as neutral and as comfortable as possible.
We understand that an audience of impartial observers would affect results, so we provide separate viewing facilities where observers can talk and discuss the research as it happens. Our test suite and observation room are physically separated (by another room), rather than just by a one-way mirror as is typical in market research facilities. We also understand that observation by the facilitator would affect the outcome of research, so our facilitators are trained to keep their spoken language and body language neutral during the research so as to eliminate as much influence as possible.
Incentivising participants could be a problem when researching usability if we were asking participants whether they liked something or not. However, we ask neutral, open questions to encourage participants to talk about how they use a system or perform a task, so payments are less of an issue. Because we understand that payments may affect participants, we try to diminish the effect by paying a reasonable amount of money so that people feel valued, and by paying prior to the research so that participants can set aside concerns about being paid.
I agree that participants are aware that they are being video-ed; we tell them that they’ll be recorded and ask them if they are happy with the arrangement prior to a test. Anecdotally, I’d argue that participants’ concerns about being recorded quickly diminish as they become more engaged with the scenarios they are asked to imagine themselves in. Additionally, we understand that much of a participant’s reactivity toward the environment and experience will be around modifying personal characteristics. We are not researching a participant’s personal characteristics, we’re researching usability, so their reactivity matters less.
Regarding the physical environment the participant finds themselves in, we try to create as neutral and unimposing an environment as possible. We like to give participants time to make themselves comfortable, we offer them a drink, and we talk to them about the research and what they can expect during the session, to eliminate surprises and concerns about being tested.
Well, that begins to answer your question, but it still doesn’t explain why we don’t know how much construct validity there is. Because cxpartners are user experience practitioners rather than academic researchers, commercial constraints of time and budget rarely allow us to measure construct validity accurately. Accurate measurement of construct validity would require academic research methods, where time is less of a constraint and the research can be explored more fully. Practitioners cannot and do not claim to find all the problems with a system, but they can find the problems that, if rectified, make a measurable difference to the success of a commercial website, system or piece of software.
I’ll point you to this paper by Gitte Lindgaard – Notions of thoroughness, efficiency, and validity: Are they valid in HCI practice? – http://bit.ly/4xqOUr
I’d be very interested to know what you think.
Walt
I am very eager to read part 2, particularly as it relates to answering external validity. Will it be coming soon?
Looking forward to part 2, Walt!