Usability Testing

What is Usability Testing?

Usability testing is watching real users attempt to complete tasks with your product. It reveals where users struggle, get confused, or fail—insights you can't get from analytics alone.

The Power of Watching Users

Jakob Nielsen's Rule: Testing with just 5 users typically uncovers about 85% of usability problems

Why It Works: Most issues are encountered by multiple users, so each additional participant increasingly re-discovers problems you have already seen. After about 5 tests, the returns diminish sharply.
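
Nielsen and Landauer modeled this with the formula found(n) = N(1 − (1 − L)^n), where L is the probability that a single participant surfaces a given problem (roughly 0.31 averaged across their studies; your product's rate will differ). A quick sketch of how the rule falls out:

```python
# Share of usability problems found by n participants, per Nielsen & Landauer.
# L = 0.31 is their reported average per-user detection rate; treat it as
# an assumption, since real products vary.
L = 0.31

for n in range(1, 9):
    found = 1 - (1 - L) ** n
    print(f"{n} users -> {found:.0%} of problems found")

# 5 users -> 84%, the basis of the "5 users, ~85%" rule
```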

Types of Usability Testing

Moderated vs Unmoderated

Moderated Testing

What: Facilitator guides user through tasks

Pros: Can ask follow-up questions, dig deeper, observe body language

Cons: Time-intensive, requires scheduling, facilitator bias

Example: In-person session where you watch user navigate your app while asking "What are you thinking?"

Unmoderated Testing

What: Users complete tasks independently, recorded

Pros: Fast, scalable, natural behavior, cheaper

Cons: Can't ask follow-ups, technical issues, less context

Example: UserTesting.com, where users record themselves completing tasks and you review the videos later

Remote vs In-Person

Example: Zoom's Remote Testing Advantage

Pre-COVID: Most usability testing was in-person

2020 Shift: Forced remote testing adoption

Discovery: Remote testing had unexpected benefits:

  • Users in natural environment (home/office)
  • Easier to recruit diverse participants
  • Lower cost (no lab rental)
  • Easier to record and share

Result: Many companies now prefer remote testing

Planning a Usability Test

Test Plan Components

  1. Goals: What do you want to learn?
  2. Participants: Who represents your users?
  3. Tasks: What should users try to do?
  4. Scenarios: Context for each task
  5. Success Metrics: How do you measure usability?
  6. Questions: What to ask after tasks
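
Before recruiting anyone, it helps to write the plan down in one reviewable artifact. As a sketch, the six components above map naturally onto simple structured data; every value here is hypothetical:

```python
# A hypothetical test plan mirroring the six components above.
test_plan = {
    "goals": ["Learn where users stall while sending money to a friend"],
    "participants": "6 people who used a payments app in the last month",
    "tasks": [
        {
            "scenario": "You owe a friend $40 for last night's dinner.",
            "task": "Pay them back using the app.",
            "success": "User reaches the payment-confirmation screen.",
        }
    ],
    "metrics": ["task success rate", "time on task", "error count"],
    "post_task_questions": [
        "How difficult was that, from 1 (very easy) to 5 (very hard)?",
    ],
}
```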

Writing Good Tasks

  • Realistic: Based on actual user goals
  • Specific: Clear end state
  • No UI Hints: Don't say "click the button"
  • Scenario-Based: Give context and motivation

Bad: "Click on the search icon and search for shoes"

Good: "You need running shoes for a marathon. Find a pair that fits your budget of $100."

Example: Airbnb's Booking Flow Test

Goal: Identify friction in booking process

Participants: 8 users who booked Airbnb in last 6 months

Task: "You're planning a weekend trip to San Francisco for 2 people. Find and book a place to stay."

Metrics: Time to complete, errors made, completion rate

Discovery: Users confused by cleaning fee appearing late in process

Fix: Showed total price upfront, completion rate increased 12%

Recruiting Participants

Recruitment Methods

  • User Research Panels: Pre-recruited pool of users
  • Recruitment Services: UserTesting, Respondent.io
  • Social Media: Post in relevant communities
  • Customer Lists: Email existing users
  • Intercepts: Recruit from your website/app
  • Guerrilla Testing: Coffee shops, public spaces

Screener Questions

Filter for the right participants:

  • Demographics (age, location, occupation)
  • Behavior (frequency of use, experience level)
  • Attitudes (preferences, pain points)
  • Technology (devices, platforms)

Example: "How often do you order food delivery?" (Daily/Weekly/Monthly/Rarely/Never) → Filter for Weekly+ users

Recruitment Mistakes

  • Testing Coworkers: They know too much
  • Friends and Family: Too polite, biased
  • Professional Testers: Not representative
  • Wrong Segment: Testing power users when designing for beginners

Facilitating Sessions

Facilitator's Role

  • Set Expectations: Explain you're testing the product, not them
  • Think Aloud: Ask users to verbalize their thoughts
  • Don't Help: Let them struggle (that's the data)
  • Stay Neutral: Don't react to successes or failures
  • Probe Gently: "What are you thinking?" not "Why did you do that?"

Session Structure (60 minutes)

  1. Introduction (5 min): Build rapport, explain process
  2. Background (5 min): Learn about participant
  3. Tasks (40 min): Observe task completion
  4. Debrief (10 min): Overall impressions, questions

Example: Slack's Onboarding Test

Setup: New users, never used Slack

Task: "Your team wants to use Slack. Set up a workspace and invite a teammate."

Observation: 6 out of 8 users confused by "workspace" terminology

Facilitator Note: Didn't explain "workspace," let users struggle to understand

Fix: Changed language to "team" and added explainer, completion rate went from 40% to 85%

What NOT to Say

  • "That's not how you're supposed to do it" (judgmental)
  • "Try clicking there" (leading)
  • "Most people find this easy" (pressure)
  • "Let me show you" (defeats purpose)

Analyzing Results

What to Look For

  • Task Success Rate: Did they complete it?
  • Time on Task: How long did it take?
  • Error Rate: How many mistakes?
  • Paths Taken: Expected vs actual route
  • Verbal Feedback: Confusion, frustration, delight
  • Body Language: Hesitation, confidence

Severity Rating

  • Critical: Prevents task completion (fix immediately)
  • Serious: Causes significant delay or frustration (fix soon)
  • Minor: Small annoyance (fix if time permits)
  • Cosmetic: Doesn't affect usability (backlog)

Example: Instagram's Photo Upload

Test Finding: Users tapped "Next" multiple times thinking it wasn't working

Root Cause: No loading indicator while processing photo

Severity: Serious (caused frustration, duplicate uploads)

Fix: Added progress spinner and "Processing..." text

Result: Support tickets about upload issues dropped 60%

Affinity Mapping Findings

Process:

  1. Write each observation on sticky note
  2. Group similar issues together
  3. Identify patterns across users
  4. Prioritize by frequency and severity

Example: 5 users struggled with search → High priority. 1 user wanted dark mode → Lower priority.
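
One way to make step 4 concrete is to score each group of sticky notes by how many users hit the issue times a severity weight. The weights below are an illustrative choice, not a standard:

```python
# Hypothetical grouped findings; priority = users affected * severity weight.
SEVERITY_WEIGHT = {"critical": 4, "serious": 3, "minor": 2, "cosmetic": 1}

findings = [
    {"issue": "Search hard to find", "users_affected": 5, "severity": "serious"},
    {"issue": "Wants dark mode",     "users_affected": 1, "severity": "cosmetic"},
    {"issue": "Checkout crash",      "users_affected": 2, "severity": "critical"},
]

def priority(f):
    return f["users_affected"] * SEVERITY_WEIGHT[f["severity"]]

for f in sorted(findings, key=priority, reverse=True):
    print(f"{priority(f):>2}  {f['issue']}")
# 15  Search hard to find
#  8  Checkout crash
#  1  Wants dark mode
```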

Metrics & Benchmarking

Quantitative Usability Metrics

  • Task Success Rate: % who completed task
  • Time on Task: Average seconds to complete
  • Error Rate: Mistakes per task
  • Clicks to Complete: Number of interactions
  • SUS Score: System Usability Scale (0-100)
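
The first four metrics fall straight out of per-session records; a minimal sketch over hypothetical sessions for one task:

```python
# Hypothetical results for one task across three sessions.
sessions = [
    {"completed": True,  "seconds": 72,  "errors": 1, "clicks": 9},
    {"completed": True,  "seconds": 95,  "errors": 0, "clicks": 7},
    {"completed": False, "seconds": 180, "errors": 4, "clicks": 15},
]

n = len(sessions)
print(f"Task success rate: {sum(s['completed'] for s in sessions) / n:.0%}")
print(f"Avg time on task:  {sum(s['seconds'] for s in sessions) / n:.0f}s")
print(f"Avg errors:        {sum(s['errors'] for s in sessions) / n:.1f}")
print(f"Avg clicks:        {sum(s['clicks'] for s in sessions) / n:.1f}")
```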

Example: Amazon's One-Click Patent

Baseline Test: Standard checkout flow

  • Average time: 90 seconds
  • Completion rate: 60%
  • Average clicks: 8

One-Click Test:

  • Average time: 3 seconds
  • Completion rate: 95%
  • Average clicks: 1

Impact: 30x faster, a 58% relative lift in completion (60% to 95%), and worth patenting

System Usability Scale (SUS)

What: 10-question survey, scored 0-100

Benchmark: 68 is average, 80+ is excellent

Use: Compare versions, track over time, benchmark against competitors
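
Scoring is standardized: odd-numbered (positively worded) items contribute their response minus 1, even-numbered (negatively worded) items contribute 5 minus the response, and the sum is multiplied by 2.5 to land on 0-100:

```python
# Standard SUS scoring for ten 1-5 Likert responses.
def sus_score(responses):
    assert len(responses) == 10
    total = sum(
        (r - 1) if i % 2 == 1 else (5 - r)
        for i, r in enumerate(responses, start=1)
    )
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # 85.0
```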

Example: Redesign increased SUS from 62 to 78, validating improvements

A/B Testing vs Usability Testing

When to Use Each

Usability Testing

  • Best For: Understanding WHY users struggle
  • Sample Size: 5-8 users
  • Data Type: Qualitative insights
  • Speed: Days to weeks

A/B Testing

  • Best For: Measuring WHICH version performs better
  • Sample Size: Thousands of users
  • Data Type: Quantitative metrics
  • Speed: Weeks to months
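
The sample-size gap is not arbitrary: detecting a small lift in a conversion rate takes many users. A rough two-proportion power calculation (standard z-values for a two-sided 5% significance level and 80% power; the 10% baseline and one-point lift are hypothetical):

```python
# Approximate users needed per variant to detect a lift from p1 to p2.
Z_ALPHA = 1.96  # two-sided significance level of 0.05
Z_BETA = 0.84   # statistical power of 0.80

def n_per_variant(p1, p2):
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((Z_ALPHA + Z_BETA) ** 2) * variance / (p1 - p2) ** 2

# A 10% -> 11% lift in click-through needs ~14,700 users per variant.
print(round(n_per_variant(0.10, 0.11)))  # 14731
```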

Example: Netflix's Artwork Testing

Usability Testing: Showed users different title artwork, asked which caught their attention

Insight: Users preferred images with faces, emotional expressions

A/B Testing: Tested face-focused vs scene-focused artwork with millions of users

Result: Face-focused artwork increased click-through 30%

Lesson: Usability testing generated hypothesis, A/B testing validated at scale

Guerrilla Testing

What is Guerrilla Testing?

Quick, informal testing with people in public places. It is a fast, cheap way to get early feedback.

Where: Coffee shops, libraries, parks, malls

Time: 5-10 minutes per person

Incentive: Coffee, gift card, or just goodwill

Example: Starbucks Mobile Order Testing

Method: Approached customers in Starbucks, asked to try mobile ordering prototype

Setup: Laptop with clickable prototype

Task: "Order your usual drink"

Time: 5 minutes per person, tested 20 people in 2 hours

Discovery: Customization options were buried, users couldn't find them

Cost: $100 in gift cards vs $5,000 for formal lab testing

Guerrilla Testing Tips

  • Be Respectful: Ask permission, accept rejection gracefully
  • Keep It Short: 5-10 minutes max
  • Have Clear Task: One specific thing to test
  • Take Notes: Record or write down observations
  • Test in Context: Coffee shop for coffee app, gym for fitness app

Accessibility Testing

Testing with Assistive Technologies

  • Screen Readers: VoiceOver (iOS), TalkBack (Android), NVDA (Windows)
  • Keyboard Only: Navigate without mouse
  • Voice Control: Voice commands only
  • Screen Magnification: Zoom to 200%+
  • Color Blindness: Simulate different types
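
Some of these checks can be partly automated. Low-vision and color-blindness audits, for instance, usually include the WCAG contrast-ratio check, whose formula comes straight from the WCAG 2.x spec (the sample colors below are arbitrary):

```python
# WCAG 2.x contrast ratio between two sRGB colors given as 0-255 channels.
def relative_luminance(rgb):
    def linearize(c):
        c /= 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

ratio = contrast_ratio((119, 119, 119), (255, 255, 255))  # gray text on white
print(f"{ratio:.2f}:1, passes AA for body text: {ratio >= 4.5}")  # 4.48:1, False
```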

Example: Apple's Accessibility Testing

Practice: Every iOS feature tested with VoiceOver before shipping

Discovery: Many "obvious" interactions don't work for blind users

Example: Swipe gestures need audio feedback, buttons need descriptive labels

Impact: iOS became most accessible mobile OS, expanded market to millions of users with disabilities

Continuous Testing

Building a Testing Cadence

  • Weekly: Quick guerrilla tests on new features
  • Bi-weekly: Moderated sessions on in-progress work
  • Monthly: Comprehensive testing of major features
  • Quarterly: Benchmark testing, SUS scores

Example: GOV.UK's Testing Culture

Mandate: Every team must test with users every 2 weeks

Infrastructure:

  • Dedicated user research lab
  • Panel of 10,000 citizens for recruitment
  • Research ops team handles logistics
  • Designers conduct their own tests

Result: Usability improved dramatically, became model for government digital services worldwide

Usability Testing at Scale (Staff/Director Level)

Building Research Infrastructure

  • Research Ops: Team dedicated to recruiting, scheduling, logistics
  • Testing Labs: Dedicated spaces with recording equipment
  • User Panels: Pre-recruited participants for quick studies
  • Tools & Platforms: UserTesting, Lookback, Maze for scale
  • Democratization: Train all designers to conduct tests

Example: Microsoft's Usability Testing Program

Scale: 1,000+ usability tests per year across all products

Infrastructure:

  • 15 dedicated usability labs worldwide
  • Research ops team of 30 people
  • 100,000-person participant panel
  • Automated recruitment and scheduling
  • Centralized repository of all findings

Impact: Every product team can test weekly, usability issues caught early, customer satisfaction increased 40%

📅 Evolution of Usability Testing

Pre-2000: Lab Testing Only

Example: Nielsen Norman Group usability labs

  • Expensive lab setups ($100K+)
  • Think-aloud protocol established
  • 5-10 participants per study
  • Weeks to schedule and conduct
  • VHS recordings and manual analysis

2000–2023: Remote & Scalable

Example: UserTesting.com, Maze, Lookback

  • Remote unmoderated testing
  • Recruit globally in hours
  • Automated metrics and heatmaps
  • Continuous testing in agile sprints
  • Screen recordings with analytics

2023+: AI-Powered Analysis

Example: AI identifies usability issues automatically

  • AI watches sessions, flags problems
  • Synthetic users for rapid testing
  • Real-time accessibility scanning
  • Predictive usability scores
  • Automated test script generation

Fun Fact

The "think-aloud" protocol used in usability testing was actually borrowed from psychology research in the 1980s! Researchers studying problem-solving asked people to verbalize their thoughts. Jakob Nielsen adapted it for software testing. Interestingly, studies show that thinking aloud can actually CHANGE how people interact with interfaces—they're more careful and analytical than normal users. This is called the "observer effect"!

⚠️ When Theory Meets Reality: The Contradiction

Theory Says: Always test with real users before launching

Reality: Gmail launched as an invite-only beta and stayed in "beta" for 5 years; millions used it without formal usability testing.

Example: Gmail's Perpetual Beta

  • Google launched Gmail in 2004 with minimal testing
  • Kept it in beta until 2009
  • Users found bugs and usability issues in production
  • Used real usage data instead of lab testing
  • Became the world's most popular email service

Lesson: Sometimes launching to real users IS the usability test. Beta programs and gradual rollouts can replace traditional testing. The key is having good analytics and fast iteration cycles.

📚 Resources & Further Reading

Books

  • Krug, Steve. Rocket Surgery Made Easy: The Do-It-Yourself Guide to Finding and Fixing Usability Problems. New Riders, 2009.
  • Rubin, Jeffrey, and Dana Chisnell. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. 2nd ed., Wiley, 2008.
  • Dumas, Joseph S., and Janice C. Redish. A Practical Guide to Usability Testing. Revised ed., Intellect Books, 1999.

Articles & Papers

Tools

  • UserTesting - Remote usability testing platform
  • Maze - Rapid testing and research
  • Lookback - Live user interviews
  • Hotjar - Session recordings and heatmaps