A few days ago, a good friend of mine brought a paper to my attention that claims you need hundreds of participants to identify most of the usability issues in a modern website or application. Now, we all know the Nielsen Norman Group's (NN) stance on this (5, except when it's not), so I was immediately intrigued.
The paper, Comparative Usability Evaluation by R. Molich, M. Ede, K. Kaasgaard and B. Karyukin, dates from 2004. Though old, its points are no less valid (NN's original advice of 5 participants is even older). I strongly recommend reading it, as it is only 10 pages, but it can't hurt to recap anyway:
- Nine independent teams were tasked with investigating the usability of Hotmail. They all investigated the same features and mostly used the same method (structured, task-based observations using think-aloud), though their exact tasks and number of respondents differed. In total, 310 unique usability issues were found across all teams, with a shocking 75% of these reported by only one team (no overlap), and only one issue reported by 7 of the 9 teams.
- The mean number of respondents per team was 6.6, which means that according to NN, each team should have found at least ~85% of the usability issues during their tests, which would have resulted in a far larger overlap. So what's going on here? Has NN been wrong for over two decades? Is this study (and by extension some really good research teams of the time) flawed? I think the answer is neither: each tells a different side of the story.
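Where does that ~85% expectation come from? It is Nielsen and Landauer's problem-discovery model: the expected share of issues found after n participants is 1 − (1 − L)^n, where L is the chance that a single participant surfaces any given issue (about 31% in their data). The 0.31 constant and the formula are theirs; the sketch below is just the arithmetic, for illustration.

```python
# Nielsen & Landauer's problem-discovery model: the expected share of all
# usability issues found after n participants is 1 - (1 - L)^n, with
# L ~= 0.31 in their data. The constant and the model are theirs; this
# sketch only works out the numbers.

def proportion_found(n: float, l: float = 0.31) -> float:
    """Expected share of usability issues found after n participants."""
    return 1 - (1 - l) ** n

for n in (1, 5, 7, 15):
    print(f"{n:>2} participants -> ~{proportion_found(n):.0%} of issues")

# Roughly: 5 participants -> ~84%, 7 -> ~93%. That is where the
# "each team should have found ~85%" expectation above comes from.
```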
Control your variables…
As researchers, we try to control as many variables as possible during our tests to produce the most reliable data. This includes using the same tasks, the same hardware and software, the same researcher(s) observing and taking notes, and so on. In such a setting, the only variable that changes is the respondents themselves, and it is for scenarios like this that NN's best-bang-for-buck golden rule of 5 participants was tested and created.
What Molich et al. show us is that when we change more variables than just the respondents, we start to obtain far more varied results. In an academic study this would be anathema, but I'd argue that for UX research it is exactly what we need. Not only are lab settings and artificial tasks a poor representation of real, varied use, but our qualitative methods are also heavily influenced by the person executing them: every researcher has their own style, perspective and special flavour of researcher bias.
Change variables and start to obtain more varied results
What I'm getting at is that both studies are right. For any given usability test, you should only need 5 respondents if you keep everything else the same. But you shouldn't keep everything the same. Instead, try changing the tasks and running the test past 5 more respondents. Pass the baton to a colleague and run it by 5 more. Go to a different setting for 5 more, and so on. I expect you'll find far more usability issues this way, and you will learn more about your users in the process, as the sketch below suggests.
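To make that intuition concrete, here is a toy simulation built on entirely made-up assumptions: 100 hypothetical issues, each detectable only under some of 4 test conditions (a particular task set, researcher or setting), and a 31% chance per participant of surfacing an issue that is detectable in the current condition. None of these numbers come from the paper; the point is only the mechanism.

```python
import random

random.seed(42)

# Toy model (all numbers invented for illustration): each of 100 issues is
# only *detectable* under a random subset of 4 possible test conditions
# (task set, researcher, setting, ...). Within a condition where an issue
# is detectable, each participant surfaces it with probability 0.31.
N_ISSUES, N_CONDITIONS, P_DETECT = 100, 4, 0.31
detectable_in = [
    {c for c in range(N_CONDITIONS) if random.random() < 0.5} or {0}
    for _ in range(N_ISSUES)
]

def run_batch(condition: int, n_participants: int = 5) -> set[int]:
    """Issues found by one batch of participants under one condition."""
    found = set()
    for _ in range(n_participants):
        for issue, conditions in enumerate(detectable_in):
            if condition in conditions and random.random() < P_DETECT:
                found.add(issue)
    return found

# 20 participants under a single fixed condition...
same = run_batch(0, n_participants=20)
# ...versus four batches of 5, each under a different condition.
varied = set().union(*(run_batch(c, 5) for c in range(N_CONDITIONS)))

print(f"fixed condition  : {len(same)} issues found")
print(f"varied conditions: {len(varied)} issues found")
```

On a typical run the varied batches turn up noticeably more distinct issues with the same total of 20 participants, simply because no single condition makes every issue visible.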
I hope this inspired some critical thought about our methods and way of working, and about what we can do as a community to bring the quality of our research to a higher level. I'm excited to try out the above suggestions on my next set of usability tests. I'd love to hear your thoughts and comments on the matter. Has anyone explored a similar style of usability testing? Let me know!