 *************************************************************
 *                                                           *
 *           TASK-CENTERED USER INTERFACE DESIGN             *
 *                 A Practical Introduction                  *
 *                                                           *
 *             by Clayton Lewis and John Rieman              *
 *                                                           *
 *             Copyright 1993, 1994: Please see the          *
 *         "shareware notice" at the front of the book.      *
 *                                                           *
 *************************************************************

                              v.1

                           Chapter 5

                 Testing The Design With Users

You can't really tell how good or bad your interface is going to be without getting people to use it. So as your design matures, but before the whole system gets set in concrete, you need to do some user testing. This means having real people try to do things with the system and observing what happens. To do this you need people, some tasks for them to perform, and some version of the system for them to work with. Let's consider these necessities in order.

---------------------------
5.1 Choosing Users to Test
---------------------------

The point of testing is to anticipate what will happen when real users start using your system. So the best test users will be people who are representative of the people you expect to have as users. If the real users are supposed to be doctors, get doctors as test users. If you don't, you can be badly misled about crucial things like the right vocabulary to use for the actions your system supports. Yes, we know it isn't easy to get doctors, as we noted when we talked about getting input from users early in design. But that doesn't mean it isn't important to do. And, as we asked before, if you can't get any doctors to be test users, why do you think you'll get them as real users?

If it's hard to find really appropriate test users you may want to do some testing with people who represent some approximation to what you really want, like medical students instead of doctors, say, or maybe even premeds, or college-educated adults. This may help you get out some of the big problems (the ones you overlooked in your cognitive walkthrough because you knew too much about your design and assumed some things were obvious that aren't). But you have to be careful not to let the reactions and comments of people who aren't really the users you are targeting drive your design. Do as much testing with the right kind of test users as you can.

============================================================
HyperTopic: Ethical Concerns in Working with Test Users
------------------------------------------------------------

Serving as a test user can be very distressing, and you have definite responsibilities to protect the people you work with from distress. We have heard of test users who left the test in tears, and of a person in a psychological study of problem solving who was taken away in an ambulance, under sedation, after failing to solve what appeared to be simple logic puzzles. This is no joke.

Another issue, which you also have to take seriously, is embarrassment. Someone might well feel bad if a video of them fumbling with your system were shown to someone who knew them, or even if just numerical measures of a less-than-stellar performance were linked with their name.

The first line of defense against these kinds of problems is voluntary, informed consent. This means you avoid any pressure to participate in your test, and you make sure people are fully informed about what you are going to do if they do participate.
You also make clear to test users that they are free to stop participating at any time, and you avoid putting any pressure on them to continue, even though it may be a big pain for you if they quit. You don't ask them for a reason: if they want to stop, you stop.

Be very careful about getting friends, co-workers, or (especially) subordinates to participate in tests. Will these people really feel free to decline, if they want to? If such people are genuinely eager to participate, fine, but don't press the matter if they hesitate, even (or especially) if they give no reason.

During the test, monitor the attitude of your test users carefully. You will have stressed that it is your system, not the user, that is being tested, but they may still get upset with themselves if things don't go well. Watch for any sign of this, remind them that they aren't the focus of the test, and stop the test if they continue to be distressed. We are opposed to any deception in test procedures, but we make an exception in this case: an "equipment failure" is a good excuse to end a test without the test user feeling that it is his or her reaction that is to blame.

Plan carefully how you are going to deal with privacy issues. The best approach is to avoid collecting information that could be used to identify someone. We make it a practice not to include users' faces in videos we make, for example, and we don't record users' names with their data (just assign user numbers and use those for identification). If you will collect material that could be identified, such as an audio recording of comments in the test user's voice, explain clearly up front whether there are any conditions in which anyone but you will have access to this material. Let the user tell you if he or she has any objection, and abide by what they say.

A final note: taking these matters seriously may be more than just a matter of doing the right thing. If you are working in an organization that receives federal research funds you are obligated to comply with formal rules and regulations that govern the conduct of tests, including getting approval from a review committee for any study that involves human participants.

[end hypertopic]--------------------------------------------

--------------------------------
5.2 Selecting Tasks for Testing
--------------------------------

In your test you'll be giving the test users some things to try to do, and you'll be keeping track of whether they can do them. Just as good test users should be typical of real users, so test tasks should reflect what you think real tasks are going to be like. If you've been following our advice you already have some suitable tasks: the tasks you developed early on to drive your task-centered design.

You may find you have to modify these tasks somewhat for use in testing. They may take too long, or they may assume particular background knowledge that a random test user won't have. So you may want to simplify them. But be careful in doing this! Try to avoid any changes that make the tasks easier or that bend the tasks in the direction of what your design supports best.

============================================================
Example: Test Users and Tasks for the Traffic Modelling System
------------------------------------------------------------

We didn't have to modify our tasks for the traffic modelling system, because we used the same people as test users that we had worked with to develop our target tasks. So they had all the background that was needed.
If we had not had access to these same folks we would have needed to prepare briefing materials for the test users so that they would know where Canyon and Arapahoe are and something about how they fit into the surrounding street grid.

[end example]-----------------------------------------------

If you base your test tasks on the tasks you developed for task-centered design you'll avoid a common problem: choosing test tasks that are too fragmented. Traditional requirements lists naturally give rise to suites of test tasks that test the various requirements separately. Remember the phone-in bank system we discussed in Chapter 2? It had been thoroughly tested, but only with tests that involved single services, like checking a balance or transferring funds, not with combinations of services like checking a balance and then transferring funds contingent on the results of the check. There were big problems in doing these combinations.

---------------------------------------------
5.3 Providing a System for Test Users to Use
---------------------------------------------

The key to testing early in the development process, when it is still possible to make changes to the design without incurring big costs, is using mockups in the test. These are versions of the system that do not implement the whole design, either in what the interface looks like or what the system does, but do show some of the key features to users. Mockups blur into PROTOTYPES, with the distinction that a mockup is rougher and cheaper and a prototype is more finished and more expensive.

The simplest mockups are just pictures of screens as they would appear at various stages of a user interaction. These can be drawn on paper or they can be, with a bit more work, created on the computer using a tool like HyperCard for the Mac or a similar system for Windows. A test is done by showing users the first screen and asking them what they would do to accomplish a task you have assigned. They describe their action, and you make the next screen appear, either by rummaging around in a pile of pictures on paper and holding up the right one, or by getting the computer to show the appropriate next screen.

This crude procedure can get you a lot of useful feedback from users. Can they understand what's on the screens, or are they baffled? Is the sequence of screens well-suited to the task, as you thought it would be when you did your cognitive walkthrough, or did you miss something?

To make a simple mockup like this you have to decide what screens you are going to provide. Start by drawing the screens users would see if they did everything the best way. Then decide whether you also want to "support" some alternative paths, and how much you want to investigate error paths. Usually it won't be practical for you to provide a screen for every possible user action, right or wrong, but you should have reasonable coverage of the main lines you expect users to follow.

During testing, if users stay on the lines you expected, you just show them the screens they would see. What if they deviate, and make a move that leads to a screen you don't have? First, you record what they wanted to do: that's valuable data about a discrepancy between what you expected and what they want to do, which is why you are doing the test. Then you can tell them what they would see, and let them try to continue, or you can tell them to make another choice.
You won't see as much as you would if you had the complete system for them to work with, but you will see whether the main lines of your design are sound.

============================================================
HyperTopic: Some Details on Mockups
------------------------------------------------------------

What if a task involves user input that is their free choice, like a name to use for a file? You can't anticipate what the users will type and put it on your mockup screens. But you can let them make their choice and then say something like, "You called your file 'eggplant'; in this mockup we used 'broccoli'. Let's pretend you chose 'broccoli' and carry on."

What if it is a feature of your design that the system should help the user recover from errors? If you just provide screens for correct paths you won't get any data on this. If you can anticipate some common errors, all you have to do is include the error recovery paths for these when you make up your screens. If errors are very diverse and hard to predict you may need to make up some screens on the fly, during the test, so as to be able to show test users what they would see. This blends into the "Wizard of Oz" approach we describe later.

[end hypertopic]--------------------------------------------

Some systems have to interact too closely with the user to be well approximated by a simple mockup. For example, a drawing program has to respond to lots of little user actions, and while you might get information from a simple mockup about whether users can figure out some aspects of the interface, like how to select a drawing tool from a palette of icons, you won't be able to test how well drawing itself is going to work. You need to make more of the system work to test what you want to test. The thing to do here is to get the drawing functionality up early so you can do a more realistic test. You would not wait for the system to be completed, because you want test results early. So you would aim for a prototype that has the drawing functionality in place but does not have other aspects of the system finished off.

In some cases you can avoid implementing stuff early by faking the implementation. This is the WIZARD OF OZ method: you get a person to emulate unimplemented functions and generate the feedback users should see. John Gould at IBM did this very effectively to test design alternatives for a speech transcription system for which the speech recognition component was not yet ready. He built a prototype system in which test users' speech was piped to a fast typist, and the typist's output was routed back to the test users' screen. This idea can be adapted to many situations in which the system you are testing needs to respond to unpredictable user input, though not to interactions as dynamic as drawing.

If you are led to develop more and more elaborate approximations to the real system for testing purposes you need to think about controlling costs. Simple mockups are cheap, but prototypes that really work, or even Wizard of Oz setups, take substantial implementation effort. Some of this effort can be saved if the prototype turns out to be just part of the real system. As we will discuss further when we talk about implementation, this is often possible. A system like Visual Basic or HyperCard allows an interface to be mocked up with minimal functionality but then hooked up to functional modules as they become available.
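To make the mockup idea concrete, here is a minimal sketch -- not from the original text, written in Python with made-up screen and action names -- of how a computer-based screen-flow mockup might be driven during a test session. The "logic" is just a lookup table of prepared screens, and any action you haven't prepared for is recorded as data rather than answered.

    # A sketch of a screen-flow mockup driver. Screen names, actions,
    # and the structure are hypothetical, for illustration only.
    SCREENS = {
        "main_menu":      {"choose_reports": "report_list",
                           "choose_settings": "settings"},
        "report_list":    {"open_traffic_report": "traffic_report"},
        "settings":       {},          # dead ends in this mockup
        "traffic_report": {},
    }

    def run_mockup(start="main_menu"):
        deviations = []    # moves you had no screen for: valuable data
        screen = start
        while True:
            action = input("[" + screen + "] What would you do? (or 'quit') ")
            if action == "quit":
                return deviations
            next_screen = SCREENS.get(screen, {}).get(action)
            if next_screen is None:
                deviations.append((screen, action))
                print("No screen prepared for that; please make another choice.")
            else:
                screen = next_screen

    print("Unanticipated actions:", run_mockup())

The appeal of a table-driven structure like this is that it can grow: entries that at first point only to static screens can later be hooked up to real functionality, in the spirit of the Visual Basic and HyperCard approach just mentioned.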
So don't plan for throwaway prototypes: try instead to use an implementation scheme that allows early versions of the real interface to be used in testing.

----------------------------------
5.4 Deciding What Data to Collect
----------------------------------

Now that we have people, tasks, and a system, we have to figure out what information to gather. It's useful to distinguish PROCESS DATA from BOTTOM-LINE data. Process data are observations of what the test users are doing and thinking as they work through the tasks. These observations tell us what is happening step-by-step, and, we hope, suggest WHY it is happening. Bottom-line data give us a summary of WHAT happened: how long did users take, were they successful, how many errors did they make?

It may seem that bottom-line data are what you want. If you think of testing as telling you how good your interface is, it seems that how long users are taking on the tasks, and how successful they are, is just what you want to know. We argue that process data are actually the data to focus on first. There's a role for bottom-line data, as we discuss in connection with usability measurement below. But as a designer you will mostly be concerned with process data.

To see why, consider the following not-so-hypothetical comparison. Suppose you have designed an interface for a situation in which you figure users should be able to complete a particular test task in about a half-hour. You do a test in which you focus on bottom-line data. You find that none of your test users was able to get the job done in less than an hour. You know you are in trouble, but what are you going to do about it? Now suppose instead you got detailed records of what the users actually did. You see that their whole approach to the task was mistaken, because they didn't use the frammis reduction operations presented on the third screen. Now you know where your redesign effort needs to go.

We can extend this example to make a further point about the information you need as a designer. You know people weren't using frammis reduction, but do you know why? It could be that they understood perfectly well the importance of frammis reduction, but they didn't understand the screen on which these operations were presented. Or it could be that the frammis reduction screen was crystal clear but they didn't think frammis reduction was relevant. Depending on what you decide here, you either need to fix up the frammis reduction screen, because it isn't clear, or you have a problem somewhere else. But you can't decide just from knowing that people didn't use frammis reduction.

To get the why information you really want, you need to know what users are thinking, not just what they are doing. That's the focus of the thinking-aloud method, the first testing technique we'll discuss.

------------------------------
5.5 The Thinking Aloud Method
------------------------------

The basic idea of thinking aloud is very simple. You ask your users to perform a test task, but you also ask them to talk to you while they work on it. Ask them to tell you what they are thinking: what they are trying to do, questions that arise as they work, things they read. You can make a recording of their comments or you can just take notes. You'll do this in such a way that you can tell what they were doing and where their comments fit into the sequence. You'll find the comments are a rich lode of information.
In the frammis reduction case, with just a little luck, you might get one of two kinds of comments: "I know I want to do frammis reduction now, but I don't see any way to do it from here. I'll try another approach," or "Why is it telling me about frammis reduction here? That's not what I'm trying to do." So you find out something about WHY frammis reduction wasn't getting done, and whether the frammis reduction screen is the locus of the problem.

============================================================
Example: Vocabulary Problems
------------------------------------------------------------

One of Clayton's favorite examples of a thinking-aloud datum is from a test of an administrative workstation for law offices. This was a carefully designed, entirely menu-driven system intended for use by people without previous computing experience. The word "parameter" was used extensively in the documentation and in system messages to refer to a quantity that could take on various user-assigned values. Test users read this word as "perimeter", a telling sign that the designers had stepped outside the vocabulary that was meaningful to their users.

Finding this kind of problem is a great feature of thinking aloud. It's hard as a designer to tell whether a word that is perfectly meaningful and familiar to you will be meaningful to someone else. And it's hard to detect problems in this area just from watching the mistakes people make.

[end example]-----------------------------------------------

You can use the thinking-aloud method with a prototype or a rough mock-up, for a single task or a suite of tasks. The method is simple, but there are some points about it that repay some thought. Here are some suggestions on various aspects of the procedure. This material is adapted from Lewis, C. "Using the thinking-aloud method in cognitive interface design," IBM Research Report RC 9265, Yorktown Heights, NY, 1982.

5.5.1 Instructions

The basic instructions can be very simple: "Tell me what you are thinking about as you work." People can respond easily to this, especially if you suggest a few categories of thoughts as examples: things they find confusing, decisions they are making, and the like.

There are some other points you should add. Tell the user that you are not interested in their secret thoughts but only in what they are thinking about their task. Make clear that it is the system, not the user, that is being tested, so that if they have trouble it's the system's problem, not theirs. You will also want to explain what kind of recording you will make, and how test users' privacy will be protected (see the discussion of ethics in testing earlier in this chapter).

5.5.2 The Role of the Observer

Even if you don't need to be available to operate a mockup, you should plan to stay with the user during the test. You'll do two things: prompt the user to keep up the flow of comments, and provide help when necessary. But you'll need to work out a policy for prompting and helping that avoids distorting the results you get.

It's very easy to shape the comments users will give you, and what they do in the test, by asking questions and making suggestions. If someone has missed the significance of some interface feature a word from you may focus their attention right on it. Also, research shows that people will make up an answer to any question you ask, whether or not they have any basis for the answer.
You are better off, therefore, collecting the comments people offer spontaneously than prodding them to tell you about things you are interested in. But saying nothing after the initial instructions usually won't work. Most people won't give you a good flow of comments without being pushed a bit. So say things that encourage them to talk, but that do not direct what they should say. Good choices are "Tell me what you are thinking" or "Keep talking". Bad choices would be "What do you think those prompts about frammis mean?" or "Why did you do that?"

On helping, keep in mind that a very little help can make a huge difference in a test, and you can seriously mislead yourself about how well your interface works by just dropping in a few suggestions here and there. Try to work out in advance when you will permit yourself to help. One criterion is: help only when you won't get any more useful information if you don't, because the test user will quit or cannot possibly continue the task. If you do help, be sure to record when you helped and what you said.

A consequence of this policy is that you have to explain to your test users that you want them to tell you the questions that arise as they work, but that you won't answer them. This seems odd at first but becomes natural after a bit.

5.5.3 Recording

There are plain and fancy approaches here. It is quite practical to record observations only by taking notes on a pad of paper: you write down in order what the user does and says, in summary form. But you'll find that it takes some experience to do this fast enough to keep up in real time, and that you won't be able to do it for the first few test users you see on a given system and task. This is just because you need a general idea of where things are going in order to keep up.

A step up in technology is to make a video record of what is happening on the screen, with a lapel mike on the user to pick up the comments. A further step is to instrument the system to pick up a trace of user actions, and arrange for this record to be synchronized in some way with an audio record of the comments. The advantage of this approach is that it gives you a machine-readable record of user actions that can be easier to summarize and access than video.

A good approach to start with is to combine a video record with written notes. You may find that you are able to dispense with the video, or you may find that you really want a fancier record. You can adapt your approach accordingly. But if you don't have a video setup don't let that keep you from trying the method.
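If you do instrument the system for a trace like this, the instrumentation can be very simple. Here is a minimal sketch, not from the original text, in Python; the file name and event fields are made up, and, in line with the privacy advice earlier in the chapter, it records a user number rather than a name. The timestamps are what let you line the trace up with an audio or video record of the session.

    import json
    import time

    TRACE_PATH = "session_trace.jsonl"     # hypothetical trace file

    def log_event(user_number, action, detail=""):
        """Append one timestamped user action to the trace file."""
        event = {
            "t": time.time(),              # seconds since the epoch
            "user": user_number,           # a number, not a name
            "action": action,
            "detail": detail,
        }
        with open(TRACE_PATH, "a") as f:
            f.write(json.dumps(event) + "\n")

    # Calls like these would be sprinkled through the interface code:
    log_event(3, "menu_select", "Reports")
    log_event(3, "button_press", "Run model")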
5.5.4 Summarizing the Data

The point of the test is to get information that can guide the design. To do this you will want to make a list of all difficulties users encountered. Include references back to the original data so you can look at the specifics if questions arise. Also try to judge why each difficulty occurred, if the data permit a guess about that.

5.5.5 Using the Results

Now you want to consider what changes you need to make to your design based on data from the tests. Look at your data from two points of view. First, what do the data tell you about how you THOUGHT the interface would work? Are the results consistent with your cognitive walkthrough or are they telling you that you are missing something? For example, did test users take the approaches you expected, or were they working a different way? Try to update your analysis of the tasks and how the system should support them based on what you see in the data. Then use this improved analysis to rethink your design to make it better support what users are doing.

Second, look at all of the errors and difficulties you saw. For each one make a judgement of how important it is and how difficult it would be to fix. Factors to consider in judging importance are the costs of the problem to users (in time, aggravation, and possible wrong results) and what proportion of users you can expect to have similar trouble. Difficulty of fixes will depend on how sweeping the changes required by the fix are: changing the wording of a prompt will be easy, changing the organization of options in a menu structure will be a bit harder, and so on. Now decide to fix all the important problems, and all the easy ones.

============================================================
HyperTopic: The Two-Strings Problem and Selecting Panty Hose
------------------------------------------------------------

Thinking aloud is widely used in the computer industry nowadays, and you can be confident you'll get useful results if you use it. But it's important to understand that test users can't tell you everything you might like to know, and that some of what they will tell you is bogus. Psychologists have done some interesting studies that make these points.

Maier had people try to solve the problem of tying together two strings that hung down from the ceiling too far apart to be grabbed at the same time (Maier, N.R.F. "Reasoning in humans: II. The solution of a problem and its appearance in consciousness." Journal of Comparative Psychology, 12 (1931), pp. 181-194). One solution is to tie some kind of weight to one of the strings, set it swinging, grab the other string, and then wait for the swinging string to come close enough to reach. It's a hard problem, and few people come up with this or any other solution. Sometimes, when people were working, Maier would "accidentally" brush against one of the strings and set it in motion. The data showed that when he did this people were much more likely to find the solution. The point of interest for us is, what did these people say when Maier asked them how they solved the problem? They did NOT say, "When you brushed against the string, that gave me the idea of making the string swing and solving the problem that way," even though Maier knew that's what really happened. So they could not and did not tell him what feature of the situation really helped them solve the problem.

Nisbett and Wilson set up a market survey table outside a big shopping center and asked people to say which of three pairs of panty hose they preferred, and why (Nisbett, R.E., and Wilson, T.D. "Telling more than we can know: Verbal reports on mental processes." Psychological Review, 84 (1977), pp. 231-259). Most people picked the rightmost pair of the three, giving the kinds of reasons you'd expect: "I think this pair is sheerer" or "I think this pair is better made." The trick is that the three pairs of panty hose were IDENTICAL. Nisbett and Wilson knew that given a choice among three closely-matched alternatives there is a bias to pick the last one, and that that bias was the real basis for people's choices. But (of course) nobody SAID that's why they chose the pair they chose. It's not just that people couldn't report their real reasons: when asked, they made up reasons that seemed plausible but were wrong.

What do these studies say about the thinking-aloud data you collect? You won't always hear why people did what they did, or didn't do what they didn't do.
Some portion of what you do hear will be wrong. And you're especially taking a risk if you ask people specific questions: they'll give you some kind of an answer, but it may have nothing to do with the facts. Don't treat the comments you get as some kind of gospel. Instead, use them as input to your own judgment processes.

[end hypertopic]--------------------------------------------

------------------------------------
5.6 Measuring Bottom-Line Usability
------------------------------------

We argue that you usually want process data, not bottom-line data, but there are some situations in which bottom-line numbers are useful. You may have a definite requirement that people be able to complete a task in a certain amount of time, or you may want to compare two design alternatives on the basis of how quickly people can work or how many errors they commit. The basic idea in these cases is that you will have people perform test tasks, you will measure how long they take, and you will count their errors.

Your first thought may be to combine this with a thinking-aloud test: in addition to collecting comments you'd collect these other data as well. Unfortunately this doesn't work as well as one would wish. The thinking-aloud process can affect how quickly and accurately people work. It's pretty easy to see how thinking aloud could slow people down, but it has also been shown that sometimes it can speed people up, apparently by making them think more carefully about what they are doing, and hence helping them choose better ways to do things. So if you are serious about finding out how long people will take to do things with your design, or how many problems they will encounter, you really need to do a separate test.

Getting the bottom-line numbers won't be too hard. You can use a stopwatch for timings, or you can instrument your system to record when people start and stop work. Counting errors, and gauging success on tasks, is a bit trickier, because you have to decide what is an error and what counts as successful completion of a task. But you won't have much trouble here either, as long as you accept that you can't come up with perfect criteria for these things, and use your common sense.

5.6.1 Analyzing the Bottom-Line Numbers

When you've got your numbers you'll run into some hard problems. The trouble is that the numbers you get from different test users will be different. How do you combine these numbers to get a reliable picture of what's happening? Suppose users need to be able to perform some task with your system in 30 minutes or less. You run six test users and get the following times:

    20 min
    15 min
    40 min
    90 min
    10 min
     5 min

Are these results encouraging or not? If you take the average of these numbers you get 30 minutes, which looks fine. If you take the MEDIAN, that is, the middle score, you get something between 15 and 20 minutes, which looks even better. Can you be confident that the typical user will meet your 30-minute target? The answer is no. The numbers you have are so variable, that is, they differ so much among themselves, that you really can't tell much about what will be "typical" times in the long run. Statistical analysis, which is the method for extracting facts from a background of variation, indicates that the "typical" times for this system might very well be anywhere from about 5 minutes to about 55 minutes. Note that this is a range for the "typical" value, not the range of possible times for individual users.
That is, it is perfectly plausible given the test data that if we measured lots and lots of users the average time might be as low as 5 minutes, which would be wonderful, but it could also be as high as 55 minutes, which is terrible.

There are two things contributing to our uncertainty in interpreting these test results. One is the small number of test users. It's pretty intuitive that the more test users we measure the better an estimate we can make of typical times. Second, as already mentioned, these test results are very variable: there are some small numbers but also some big numbers in the group. If all six measurements had come in right at (say) 25 minutes, we could be pretty confident that our typical times would be right around there. As things are, we have to worry that if we look at more users we might get a lot more 90-minute times, or a lot more 5-minute times. It's the job of statistical analysis to juggle these factors -- the number of people we test and how variable or consistent the results are -- and give us an estimate of what we can conclude from the data. This is a big topic, and we won't try to do more than give you some basic methods and a little intuition here.

Here's a cookbook procedure for getting an idea of the range of typical values that are consistent with your test data.

1. Add up the numbers. Call this result "sum of x". In our example this is 180.

2. Divide by n, the number of numbers. The quotient is the average, or mean, of the measurements. In our example this is 30.

3. Add up the squares of the numbers. Call this result "sum of squares". In our example this is 10450.

4. Square the sum of x and divide by n. Call this "foo". In our example this is 5400.

5. Subtract foo from sum of squares and divide by n-1. In our example this is 1010.

6. Take the square root. The result is the "standard deviation" of the sample. It is a measure of how variable the numbers are. In our example this is 31.78, or about 32.

7. Divide the standard deviation by the square root of n. This is the "standard error of the mean" and is a measure of how much variation you can expect in the typical value. In our example this is 12.97, or about 13.

8. It is plausible that the typical value is as small as the mean minus two times the standard error of the mean, or as large as the mean plus two times the standard error of the mean. In our example this range is from 30-(2*13) to 30+(2*13), or about 5 to 55. (The "*" stands for multiplication.)

What does "plausible" mean here? It means that if the real typical value is outside this range, you were very unlucky in getting the sample that you did. More specifically, if the true typical value were outside this range you would only expect to get a sample like the one you got 5 percent of the time or less.

Experience shows that usability test data are quite variable, which means that you need a lot of data to get good estimates of typical values. If you pore over the above procedure enough you may see that if you run four times as many test users you can narrow your range of estimates by a factor of two: the breadth of the range of estimates depends on the square root of the number of test users. That means a lot of test users to get a narrow range, if your data are as variable as they often are.

What this means is that you can anticipate trouble if you are trying to manage your project using these test data. Do the test results show we are on target, or do we need to pour on more resources? It's hard to say.
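If you would rather not push calculator buttons, the cookbook procedure above is easy to script. Here is a minimal sketch, not from the original text, in Python; run on the example times it reproduces the numbers worked out above (mean 30, standard deviation about 32, standard error about 13, and a plausible range of roughly 5 to 55 minutes before rounding).

    import math

    def plausible_range(times):
        """Follow the cookbook steps: mean, standard deviation,
        standard error, and the mean plus or minus two standard errors."""
        n = len(times)
        mean = sum(times) / n                                 # steps 1-2
        sum_of_squares = sum(t * t for t in times)            # step 3
        foo = sum(times) ** 2 / n                             # step 4
        sd = math.sqrt((sum_of_squares - foo) / (n - 1))      # steps 5-6
        se = sd / math.sqrt(n)                                # step 7
        return mean, sd, se, (mean - 2 * se, mean + 2 * se)   # step 8

    # The six task times (in minutes) from the example above.
    print(plausible_range([20, 15, 40, 90, 10, 5]))
    # -> mean 30.0, sd about 31.8, se about 13.0, range about (4, 56)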
One approach to this management problem is to get people to agree to try to manage on the basis of the numbers in the sample themselves, without trying to use statistics to figure out how uncertain they are. This is a kind of blind compromise: on the average, the typical value is equally likely to be bigger than the mean of your sample, or smaller. But if the stakes are high, and you really need to know where you stand, you'll need to do a lot of testing. You'll also want to do an analysis that takes into account the cost to you of being wrong about the typical value, and by how much, so you can decide how big a test is really reasonable.

============================================================
HyperTopic: Measuring User Preference
------------------------------------------------------------

Another bottom-line measure you may want is how much users like or dislike your system. You can certainly ask test users to give you a rating after they finish work, say by asking them to indicate how much they liked the system on a scale of 1 to 10, or by choosing among the statements, "This is one of the best user interfaces I've worked with", "This interface is better than average", "This interface is of average quality", "This interface is poorer than average", or "This is one of the worst interfaces I have worked with." But you can't be very sure what your data will mean.

It is very hard for people to give a detached measure of what they think about your interface in the context of a test. The novelty of the interface, a desire not to hurt your feelings (or the opposite), or the fact that they haven't used your interface for their own work can all distort the ratings they give you. A further complication is that different people can arrive at the same rating for very different reasons: one person really focuses on response time, say, while another is concerned about the clarity of the prompts. Even with all this uncertainty, though, it's fair to say that if lots of test users give you a low rating you are in trouble. If lots give you a high rating that's fine as far as it goes, but you can't rest easy.

[end hypertopic]--------------------------------------------

5.6.2 Comparing Two Design Alternatives

If you are using bottom-line measurements to compare two design alternatives, the same considerations apply as for a single design, and then some. Your ability to draw a firm conclusion will depend on how variable your numbers are, as well as how many test users you use. But then you need some way to compare the numbers you get for one design with the numbers from the other.

The simplest approach to use is called a BETWEEN-GROUPS EXPERIMENT. You use two groups of test users, one of which uses version A of the system and the other version B. What you want to know is whether the typical value for version A is likely to differ from the typical value for version B, and by how much. Here's a cookbook procedure for this.

1. Using parts of the cookbook method above, compute the means for the two groups separately. Also compute their standard deviations. Call the results ma, mb, sa, and sb. You'll also need to have na and nb, the number of test users in each group (usually you'll try to make these the same, but they don't have to be).

2. Combine sa and sb to get an estimate of how variable the whole scene is, by computing

   s = sqrt( ( (na-1)*(sa**2) + (nb-1)*(sb**2) ) / (na + nb - 2) )

   ("*" represents multiplication; "sa**2" means "sa squared").

3. Compute a combined standard error:

   se = s * sqrt(1/na + 1/nb)

4. Your range of typical values for the difference between version A and version B is now

   ma - mb plus-or-minus 2*se
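The same kind of script works for the between-groups comparison. Here is a minimal sketch, not from the original text, in Python; the task times are invented numbers, included only to show the calculation.

    import math

    def sample_sd(times):
        """Standard deviation as in the single-group procedure (n-1)."""
        n = len(times)
        mean = sum(times) / n
        return math.sqrt(sum((t - mean) ** 2 for t in times) / (n - 1))

    def compare_groups(a_times, b_times):
        """Plausible range for the difference between versions A and B."""
        na, nb = len(a_times), len(b_times)
        ma, mb = sum(a_times) / na, sum(b_times) / nb
        sa, sb = sample_sd(a_times), sample_sd(b_times)
        s = math.sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2))
        se = s * math.sqrt(1 / na + 1 / nb)
        diff = ma - mb
        return diff, (diff - 2 * se, diff + 2 * se)

    # Invented task times (minutes) for two versions of a design.
    print(compare_groups([20, 15, 40, 90, 10, 5], [25, 30, 20, 35, 45, 15]))

If the range you get includes zero, the test hasn't shown a reliable difference between the two versions.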
Another approach you might consider is a WITHIN-GROUPS EXPERIMENT. Here you use only one group of test users and you get each of them to use both versions of the system. This brings with it some headaches. You obviously can't use the same tasks for the two versions, since doing a task the second time would be different from doing it the first time, and you have to worry about who uses which system first, because there might be some advantage or disadvantage to whichever system someone tries first. There are ways around these problems, but they aren't simple. They work best for very simple tasks about which there isn't much to learn. You might want to use this approach if you were comparing two low-level interaction techniques, for example. You can learn more about the within-groups approach from any standard text on experimental psychology (check your local college library or bookstore).

============================================================
HyperTopic: Don't Push Your Statistics Too Far
------------------------------------------------------------

The cookbook procedures we've described are broadly useful, but they rest on some technical assumptions for complete validity. Both procedures assume that your numbers are drawn from what is called a "normal distribution," and the comparison procedure assumes that the data from both groups come from distributions with the same standard deviation. But experience has shown that these methods work reasonably well even if these assumptions are violated to some extent. You'll do OK if you use them for broad guidance in interpreting your data, but don't get drawn into arguments about exact values.

As we said before, statistics is a big topic. We aren't saying you need an MS in statistics to be an interface designer, but it's a fascinating subject and you won't regret knowing more about it than we've discussed here. If you want to be an interface researcher, rather than a working stiff, then that MS really would come in handy.

[end hypertopic]--------------------------------------------

--------------------------------------------
5.7 Details of Setting Up a Usability Study
--------------------------------------------

The description of user testing we've given up to this point should be all the background you need during the early phases of a task-centered design project. When you're actually ready to evaluate a version of the design with users, you'll have to consider some of the finer details of setting up and running the tests. This section, which you may want to skip on the first reading of the chapter, will help with many of those details.

5.7.1 Choosing the Order of Test Tasks

Usually you want test users to do more than one task. This means they have to do them in some order. Should everyone do them in the same order, or should you scramble them, or what? Our advice is to choose one sensible order, starting with some simpler things and working up to some more complex ones, and stay with that. This means that the tasks that come later will have the benefit of practice on the earlier ones, or some penalty from test users getting tired, so you can't compare the difficulty of tasks using the results of a test like this. But usually that is not what you are trying to do.

5.7.2 Training Test Users

Should test users hit your system cold or should you teach them about it first?
The answer to this depends on what you are trying to find out, and on the way your system will be used in real life. If real users will hit your system cold, as is often the case, you should have them do this in the test. If you really believe users will be pre-trained, then train them before your test. If possible, use the real training materials to do this: you may as well learn something about how well or poorly they work as part of your study.

5.7.3 The Pilot Study

You should always do a pilot study as part of any usability test. A pilot study is like a dress rehearsal for a play: you run through everything you plan to do in the real study and see what needs to be fixed. Do this twice, once with colleagues, to get out the biggest bugs, and then with some real test users. You'll find out whether your policies on giving assistance are really workable and whether your instructions are adequately clear. A pilot study will help you keep down variability in a bottom-line study, but it will head off trouble in a thinking-aloud study too. Don't try to do without one!

5.7.4 What If Someone Doesn't Complete a Task?

If you are collecting bottom-line numbers, one problem you will very probably encounter is that not everybody completes their assigned task or tasks within the available time, or without help from you. What do you do? There is no complete remedy for the problem. A reasonable approach is to assign some very large time, and some very large number of errors, as the "results" for these people. Then take the results of your analysis with an even bigger grain of salt than usual.

5.7.5 Keeping Variability Down

As we've seen, your ability to make good estimates based on bottom-line test results depends on the results not being too variable. There are things you can do to help, though these may also make your test less realistic and hence a less good guide to what will happen with your design in real life.

Differences among test users are one source of variable results: if test users differ a lot in how much they know about the task or about the system, you can expect their time and error scores to be quite different. You can try to recruit test users with more similar backgrounds, and you can try to brief test users to bring them close to some common level of preparation for their tasks.

Differences in procedure, that is, in how you actually conduct the test, will also add to variability. If you help some test users more than others, for example, you are asking for trouble. This reinforces the need to make careful plans about what kind of assistance you will provide. Finally, if people don't understand what they are doing your variability will increase. Make your instructions to test users and your task descriptions as clear as you can.

5.7.6 Debriefing Test Users

We've stressed that it's unwise to ask specific questions during a thinking-aloud test, and during a bottom-line study you wouldn't be asking questions anyway. But what about asking questions in a debriefing session after test users have finished their tasks? There's no reason not to do this, but don't expect too much. People often don't remember very much about problems they've faced, even after a short time. Clayton remembers vividly watching a test user battle with a text processing system for hours, and then asking afterwards what problems they had encountered. "That wasn't too bad, I don't remember any particular problems," was the answer.
He interviewed a real user who had come within one day of quitting a good job because of failure to master a new system; even so, they were unable to remember any specific problem they had had. Part of what is happening appears to be that if you work through a problem and eventually solve it, even with considerable difficulty, you remember the solution but not the problem. There's an analogy here to those hidden picture puzzles you see on kids' menus at restaurants: there are pictures of three rabbits hidden in this picture, can you find them? When you first look at the picture you can't see them. After you find them, you can't help seeing them. In somewhat the same way, once you figure out how something works it can be hard to see why it was ever confusing.

Something that might help you get more info out of questioning at the end of a test is having the test session on video, so you can show the test user the particular part of the task you want to ask about. But even if you do this, don't expect too much: the user may not have any better guess than you have about what they were doing.

Another form of debriefing that is less problematic is asking for comments on specific features of the interface. People may offer suggestions or have reactions, positive or negative, that might not otherwise be reflected in your data. This will work better if you can take the user back through the various screens they've seen during the test.