 *************************************************************
 *                                                           *
 *           TASK-CENTERED USER INTERFACE DESIGN             *
 *                 A Practical Introduction                  *
 *                                                           *
 *             by Clayton Lewis and John Rieman              *
 *                                                           *
 *             Copyright 1993, 1994: Please see the          *
 *         "shareware notice" at the front of the book.      *
 *                                                           *
 *************************************************************

                              v.1

                           Chapter 5

                 Testing The Design With Users

You can't really tell how good or bad your interface is going to be without getting people to use it. So as your design matures, but before the whole system gets set in concrete, you need to do some user testing. This means having real people try to do things with the system and observing what happens. To do this you need people, some tasks for them to perform, and some version of the system for them to work with. Let's consider these necessities in order.

---------------------------
5.1 Choosing Users to Test
---------------------------

The point of testing is to anticipate what will happen when real users start using your system. So the best test users will be people who are representative of the people you expect to have as users. If the real users are supposed to be doctors, get doctors as test users. If you don't, you can be badly misled about crucial things like the right vocabulary to use for the actions your system supports. Yes, we know it isn't easy to get doctors, as we noted when we talked about getting input from users early in design. But that doesn't mean it isn't important to do. And, as we asked before, if you can't get any doctors to be test users, why do you think you'll get them as real users?

If it's hard to find really appropriate test users you may want to do some testing with people who represent some approximation to what you really want, like medical students instead of doctors, say, or maybe even premeds, or college-educated adults. This may help you get out some of the big problems (the ones you overlooked in your cognitive walkthrough because you knew too much about your design and assumed some things were obvious that aren't). But you have to be careful not to let the reactions and comments of people who aren't really the users you are targeting drive your design. Do as much testing with the right kind of test users as you can.

============================================================
HyperTopic: Ethical Concerns in Working with Test Users
------------------------------------------------------------

Serving as a test user can be very distressing, and you have definite responsibilities to protect the people you work with from distress. We have heard of test users who left the test in tears, and of a person in a psychological study of problem solving who was taken away in an ambulance, under sedation, after failing to solve what appeared to be simple logic puzzles. This is no joke.

Another issue, which you also have to take seriously, is embarrassment. Someone might well feel bad if a video of them fumbling with your system were shown to someone who knew them, or even if just numerical measures of a less-than-stellar performance were linked with their name.

The first line of defense against these kinds of problems is voluntary, informed consent. This means you avoid any pressure to participate in your test, and you make sure people are fully informed about what you are going to do if they do participate.
You also make clear to test users that they are free to stop participating at any time, and you avoid putting any pressure on them to continue, even though it may be a big pain for you if they quit. You don't ask them for a reason: if they want to stop, you stop.

Be very careful about getting friends, co-workers, or (especially) subordinates to participate in tests. Will these people really feel free to decline, if they want to? If such people are genuinely eager to participate, fine, but don't press the matter if they hesitate, even (or especially) if they give no reason.

During the test, monitor the attitude of your test users carefully. You will have stressed that it is your system, not the user, that is being tested, but they may still get upset with themselves if things don't go well. Watch for any sign of this, remind them that they aren't the focus of the test, and stop the test if they continue to be distressed. We are opposed to any deception in test procedures, but we make an exception in this case: an "equipment failure" is a good excuse to end a test without the test user feeling that it is his or her reaction that is to blame.

Plan carefully how you are going to deal with privacy issues. The best approach is to avoid collecting information that could be used to identify someone. We make it a practice not to include users' faces in videos we make, for example, and we don't record users' names with their data (just assign user numbers and use those for identification). If you will collect material that could be identified, such as an audio recording of comments in the test user's voice, explain clearly up front whether there are any conditions in which anyone but you will have access to this material. Let the user tell you if he or she has any objection, and abide by what they say.

A final note: taking these matters seriously may be more than just a matter of doing the right thing. If you are working in an organization that receives federal research funds you are obligated to comply with formal rules and regulations that govern the conduct of tests, including getting approval from a review committee for any study that involves human participants.

[end hypertopic]--------------------------------------------

--------------------------------
5.2 Selecting Tasks for Testing
--------------------------------

In your test you'll be giving the test users some things to try to do, and you'll be keeping track of whether they can do them. Just as good test users should be typical of real users, so test tasks should reflect what you think real tasks are going to be like. If you've been following our advice you already have some suitable tasks: the tasks you developed early on to drive your task-centered design.

You may find you have to modify these tasks somewhat for use in testing. They may take too long, or they may assume particular background knowledge that a random test user won't have. So you may want to simplify them. But be careful in doing this! Try to avoid any changes that make the tasks easier or that bend the tasks in the direction of what your design supports best.

============================================================
Example: Test Users and Tasks for the Traffic Modelling System
------------------------------------------------------------

We didn't have to modify our tasks for the traffic modelling system, because we used the same people as test users that we had worked with to develop our target tasks. So they had all the background that was needed.
If we had not had access to these same folks we would have needed to prepare briefing materials for the test users so that they would know where Canyon and Arapahoe are and something about how they fit into the surrounding street grid.

[end example]-----------------------------------------------

If you base your test tasks on the tasks you developed for task-centered design you'll avoid a common problem: choosing test tasks that are too fragmented. Traditional requirements lists naturally give rise to suites of test tasks that test the various requirements separately. Remember the phone-in bank system we discussed in Chapter 2? It had been thoroughly tested, but only with tests that involved single services, like checking a balance or transferring funds, not with combinations of services like checking a balance and then transferring funds contingent on the results of the check. There were big problems in doing these combinations.

---------------------------------------------
5.3 Providing a System for Test Users to Use
---------------------------------------------

The key to testing early in the development process, when it is still possible to make changes to the design without incurring big costs, is using mockups in the test. These are versions of the system that do not implement the whole design, either in what the interface looks like or what the system does, but do show some of the key features to users. Mockups blur into PROTOTYPES, with the distinction that a mockup is rougher and cheaper and a prototype is more finished and more expensive.

The simplest mockups are just pictures of screens as they would appear at various stages of a user interaction. These can be drawn on paper or they can be, with a bit more work, created on the computer using a tool like HyperCard for the Mac or a similar system for Windows. A test is done by showing users the first screen and asking them what they would do to accomplish a task you have assigned. They describe their action, and you make the next screen appear, either by rummaging around in a pile of pictures on paper and holding up the right one, or by getting the computer to show the appropriate next screen.

This crude procedure can get you a lot of useful feedback from users. Can they understand what's on the screens, or are they baffled? Is the sequence of screens well-suited to the task, as you thought it would be when you did your cognitive walkthrough, or did you miss something?

To make a simple mockup like this you have to decide what screens you are going to provide. Start by drawing the screens users would see if they did everything the best way. Then decide whether you also want to "support" some alternative paths, and how much you want to investigate error paths. Usually it won't be practical for you to provide a screen for every possible user action, right or wrong, but you should have reasonable coverage of the main lines you expect users to follow.

During testing, if users stay on the lines you expected, you just show them the screens they would see. What if they deviate, and make a move that leads to a screen you don't have? First, you record what they wanted to do: that's valuable data about a discrepancy between what you expected and what they want to do, which is why you are doing the test. Then you can tell them what they would see, and let them try to continue, or you can tell them to make another choice.
You won't see as much as you would if you had the complete system for them to work with, but you will see whether the main lines of your design are sound.

============================================================
HyperTopic: Some Details on Mockups
------------------------------------------------------------

What if a task involves user input that is their free choice, like a name to use for a file? You can't anticipate what the users will type and put it on your mockup screens. But you can let them make their choice and then say something like, "You called your file 'eggplant'; in this mockup we used 'broccoli'. Let's pretend you chose 'broccoli' and carry on."

What if it is a feature of your design that the system should help the user recover from errors? If you just provide screens for correct paths you won't get any data on this. If you can anticipate some common errors, all you have to do is include the error recovery paths for these when you make up your screens. If errors are very diverse and hard to predict you may need to make up some screens on the fly, during the test, so as to be able to show test users what they would see. This blends into the "Wizard of Oz" approach we describe later.

[end hypertopic]--------------------------------------------

Some systems have to interact too closely with the user to be well approximated by a simple mockup. For example, a drawing program has to respond to lots of little user actions, and while you might get information from a simple mockup about whether users can figure out some aspects of the interface, like how to select a drawing tool from a palette of icons, you won't be able to test how well drawing itself is going to work. You need to make more of the system work to test what you want to test. The thing to do here is to get the drawing functionality up early so you can do a more realistic test. You would not wait for the system to be completed, because you want test results early. So you would aim for a prototype that has the drawing functionality in place but does not have other aspects of the system finished off.

In some cases you can avoid implementing stuff early by faking the implementation. This is the WIZARD OF OZ method: you get a person to emulate unimplemented functions and generate the feedback users should see. John Gould at IBM did this very effectively to test design alternatives for a speech transcription system for which the speech recognition component was not yet ready. He built a prototype system in which test users' speech was piped to a fast typist, and the typist's output was routed back to the test users' screen. This idea can be adapted to many situations in which the system you are testing needs to respond to unpredictable user input, though not to interactions as dynamic as drawing.

If you are led to develop more and more elaborate approximations to the real system for testing purposes you need to think about controlling costs. Simple mockups are cheap, but prototypes that really work, or even Wizard of Oz setups, take substantial implementation effort. Some of this effort can be saved if the prototype turns out to be just part of the real system. As we will discuss further when we talk about implementation, this is often possible. A system like Visual Basic or HyperCard allows an interface to be mocked up with minimal functionality but then hooked up to functional modules as they become available.
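To make the mockup idea concrete, here is a minimal sketch -- not from the original text, written in Python with made-up screen and action names -- of how a computer-based screen-flow mockup might be driven during a test session. The "logic" is just a lookup table of prepared screens, and any action you haven't prepared for is recorded as data rather than answered.

    # A sketch of a screen-flow mockup driver. Screen names, actions,
    # and the structure are hypothetical, for illustration only.
    SCREENS = {
        "main_menu":      {"choose_reports": "report_list",
                           "choose_settings": "settings"},
        "report_list":    {"open_traffic_report": "traffic_report"},
        "settings":       {},          # dead ends in this mockup
        "traffic_report": {},
    }

    def run_mockup(start="main_menu"):
        deviations = []    # moves you had no screen for: valuable data
        screen = start
        while True:
            action = input("[" + screen + "] What would you do? (or 'quit') ")
            if action == "quit":
                return deviations
            next_screen = SCREENS.get(screen, {}).get(action)
            if next_screen is None:
                deviations.append((screen, action))
                print("No screen prepared for that; please make another choice.")
            else:
                screen = next_screen

    print("Unanticipated actions:", run_mockup())

The appeal of a table-driven structure like this is that it can grow: entries that at first point only to static screens can later be hooked up to real functionality, in the spirit of the Visual Basic and HyperCard approach just mentioned.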
So don't plan for throwaway prototypes: try instead to use an implementation scheme that allows early versions of the real interface to be used in testing.

----------------------------------
5.4 Deciding What Data to Collect
----------------------------------

Now that we have people, tasks, and a system, we have to figure out what information to gather. It's useful to distinguish PROCESS DATA from BOTTOM-LINE data. Process data are observations of what the test users are doing and thinking as they work through the tasks. These observations tell us what is happening step-by-step, and, we hope, suggest WHY it is happening. Bottom-line data give us a summary of WHAT happened: how long did users take, were they successful, how many errors did they make?

It may seem that bottom-line data are what you want. If you think of testing as telling you how good your interface is, it seems that how long users are taking on the tasks, and how successful they are, is just what you want to know. We argue that process data are actually the data to focus on first. There's a role for bottom-line data, as we discuss in connection with usability measurement below. But as a designer you will mostly be concerned with process data.

To see why, consider the following not-so-hypothetical comparison. Suppose you have designed an interface for a situation in which you figure users should be able to complete a particular test task in about a half-hour. You do a test in which you focus on bottom-line data. You find that none of your test users was able to get the job done in less than an hour. You know you are in trouble, but what are you going to do about it? Now suppose instead you got detailed records of what the users actually did. You see that their whole approach to the task was mistaken, because they didn't use the frammis reduction operations presented on the third screen. Now you know where your redesign effort needs to go.

We can extend this example to make a further point about the information you need as a designer. You know people weren't using frammis reduction, but do you know why? It could be that they understood perfectly well the importance of frammis reduction, but they didn't understand the screen on which these operations were presented. Or it could be that the frammis reduction screen was crystal clear but they didn't think frammis reduction was relevant. Depending on what you decide here, you either need to fix up the frammis reduction screen, because it isn't clear, or you have a problem somewhere else. But you can't decide just from knowing that people didn't use frammis reduction.

To get the why information you really want, you need to know what users are thinking, not just what they are doing. That's the focus of the thinking-aloud method, the first testing technique we'll discuss.

------------------------------
5.5 The Thinking Aloud Method
------------------------------

The basic idea of thinking aloud is very simple. You ask your users to perform a test task, but you also ask them to talk to you while they work on it. Ask them to tell you what they are thinking: what they are trying to do, questions that arise as they work, things they read. You can make a recording of their comments or you can just take notes. You'll do this in such a way that you can tell what they were doing and where their comments fit into the sequence. You'll find the comments are a rich lode of information.
In the frammis reduction case, with just a little luck, you might get one of two kinds of comments: "I know I want to do frammis reduction now, but I don't see any way to do it from here. I'll try another approach," or "Why is it telling me about frammis reduction here? That's not what I'm trying to do." So you find out something about WHY frammis reduction wasn't getting done, and whether the frammis reduction screen is the locus of the problem.

============================================================
Example: Vocabulary Problems
------------------------------------------------------------

One of Clayton's favorite examples of a thinking-aloud datum is from a test of an administrative workstation for law offices. This was a carefully designed, entirely menu-driven system intended for use by people without previous computing experience. The word "parameter" was used extensively in the documentation and in system messages to refer to a quantity that could take on various user-assigned values. Test users read this word as "perimeter", a telling sign that the designers had stepped outside the vocabulary that was meaningful to their users.

Finding this kind of problem is a great feature of thinking aloud. It's hard as a designer to tell whether a word that is perfectly meaningful and familiar to you will be meaningful to someone else. And it's hard to detect problems in this area just from watching the mistakes people make.

[end example]-----------------------------------------------

You can use the thinking-aloud method with a prototype or a rough mock-up, for a single task or a suite of tasks. The method is simple, but there are some points about it that repay some thought. Here are some suggestions on various aspects of the procedure. This material is adapted from Lewis, C. "Using the thinking-aloud method in cognitive interface design," IBM Research Report RC 9265, Yorktown Heights, NY, 1982.

5.5.1 Instructions

The basic instructions can be very simple: "Tell me what you are thinking about as you work." People can respond easily to this, especially if you suggest a few categories of thoughts as examples: things they find confusing, decisions they are making, and the like.

There are some other points you should add. Tell the user that you are not interested in their secret thoughts but only in what they are thinking about their task. Make clear that it is the system, not the user, that is being tested, so that if they have trouble it's the system's problem, not theirs. You will also want to explain what kind of recording you will make, and how test users' privacy will be protected (see the discussion of ethics in testing earlier in this chapter).

5.5.2 The Role of the Observer

Even if you don't need to be available to operate a mockup, you should plan to stay with the user during the test. You'll do two things: prompt the user to keep up the flow of comments, and provide help when necessary. But you'll need to work out a policy for prompting and helping that avoids distorting the results you get.

It's very easy to shape the comments users will give you, and what they do in the test, by asking questions and making suggestions. If someone has missed the significance of some interface feature a word from you may focus their attention right on it. Also, research shows that people will make up an answer to any question you ask, whether or not they have any basis for the answer.
You are better off, therefore, collecting the comments people offer spontaneously than prodding them to tell you about things you are interested in. But saying nothing after the initial instructions usually won't work. Most people won't give you a good flow of comments without being pushed a bit. So say things that encourage them to talk, but that do not direct what they should say. Good choices are "Tell me what you are thinking" or "Keep talking". Bad choices would be "What do you think those prompts about frammis mean?" or "Why did you do that?"

On helping, keep in mind that a very little help can make a huge difference in a test, and you can seriously mislead yourself about how well your interface works by just dropping in a few suggestions here and there. Try to work out in advance when you will permit yourself to help. One criterion is: help only when you won't get any more useful information if you don't, because the test user will quit or cannot possibly continue the task. If you do help, be sure to record when you helped and what you said.

A consequence of this policy is that you have to explain to your test users that you want them to tell you the questions that arise as they work, but that you won't answer them. This seems odd at first but becomes natural after a bit.

5.5.3 Recording

There are plain and fancy approaches here. It is quite practical to record observations only by taking notes on a pad of paper: you write down in order what the user does and says, in summary form. But you'll find that it takes some experience to do this fast enough to keep up in real time, and that you won't be able to do it for the first few test users you see on a given system and task. This is just because you need a general idea of where things are going in order to keep up.

A step up in technology is to make a video record of what is happening on the screen, with a lapel mike on the user to pick up the comments. A further step is to instrument the system to pick up a trace of user actions, and arrange for this record to be synchronized in some way with an audio record of the comments. The advantage of this approach is that it gives you a machine-readable record of user actions that can be easier to summarize and access than video.

A good approach to start with is to combine a video record with written notes. You may find that you are able to dispense with the video, or you may find that you really want a fancier record. You can adapt your approach accordingly. But if you don't have a video setup don't let that keep you from trying the method.
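If you do instrument the system for a trace like this, the instrumentation can be very simple. Here is a minimal sketch, not from the original text, in Python; the file name and event fields are made up, and, in line with the privacy advice earlier in the chapter, it records a user number rather than a name. The timestamps are what let you line the trace up with an audio or video record of the session.

    import json
    import time

    TRACE_PATH = "session_trace.jsonl"     # hypothetical trace file

    def log_event(user_number, action, detail=""):
        """Append one timestamped user action to the trace file."""
        event = {
            "t": time.time(),              # seconds since the epoch
            "user": user_number,           # a number, not a name
            "action": action,
            "detail": detail,
        }
        with open(TRACE_PATH, "a") as f:
            f.write(json.dumps(event) + "\n")

    # Calls like these would be sprinkled through the interface code:
    log_event(3, "menu_select", "Reports")
    log_event(3, "button_press", "Run model")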
5.5.4 Summarizing the Data

The point of the test is to get information that can guide the design. To do this you will want to make a list of all difficulties users encountered. Include references back to the original data so you can look at the specifics if questions arise. Also try to judge why each difficulty occurred, if the data permit a guess about that.

5.5.5 Using the Results

Now you want to consider what changes you need to make to your design based on data from the tests. Look at your data from two points of view. First, what do the data tell you about how you THOUGHT the interface would work? Are the results consistent with your cognitive walkthrough or are they telling you that you are missing something? For example, did test users take the approaches you expected, or were they working a different way? Try to update your analysis of the tasks and how the system should support them based on what you see in the data. Then use this improved analysis to rethink your design to make it better support what users are doing.

Second, look at all of the errors and difficulties you saw. For each one make a judgement of how important it is and how difficult it would be to fix. Factors to consider in judging importance are the costs of the problem to users (in time, aggravation, and possible wrong results) and what proportion of users you can expect to have similar trouble. Difficulty of fixes will depend on how sweeping the changes required by the fix are: changing the wording of a prompt will be easy, changing the organization of options in a menu structure will be a bit harder, and so on. Now decide to fix all the important problems, and all the easy ones.

============================================================
HyperTopic: The Two-Strings Problem and Selecting Panty Hose
------------------------------------------------------------

Thinking aloud is widely used in the computer industry nowadays, and you can be confident you'll get useful results if you use it. But it's important to understand that test users can't tell you everything you might like to know, and that some of what they will tell you is bogus. Psychologists have done some interesting studies that make these points.

Maier had people try to solve the problem of tying together two strings that hung down from the ceiling too far apart to be grabbed at the same time (Maier, N.R.F. "Reasoning in humans: II. The solution of a problem and its appearance in consciousness." Journal of Comparative Psychology, 12 (1931), pp. 181-194). One solution is to tie some kind of weight to one of the strings, set it swinging, grab the other string, and then wait for the swinging string to come close enough to reach. It's a hard problem, and few people come up with this or any other solution. Sometimes, when people were working, Maier would "accidentally" brush against one of the strings and set it in motion. The data showed that when he did this people were much more likely to find the solution. The point of interest for us is, what did these people say when Maier asked them how they solved the problem? They did NOT say, "When you brushed against the string, that gave me the idea of making the string swing and solving the problem that way," even though Maier knew that's what really happened. So they could not and did not tell him what feature of the situation really helped them solve the problem.

Nisbett and Wilson set up a market survey table outside a big shopping center and asked people to say which of three pairs of panty hose they preferred, and why (Nisbett, R.E., and Wilson, T.D. "Telling more than we can know: Verbal reports on mental processes." Psychological Review, 84 (1977), pp. 231-259). Most people picked the rightmost pair of the three, giving the kinds of reasons you'd expect: "I think this pair is sheerer" or "I think this pair is better made." The trick is that the three pairs of panty hose were IDENTICAL. Nisbett and Wilson knew that given a choice among three closely-matched alternatives there is a bias to pick the last one, and that that bias was the real basis for people's choices. But (of course) nobody SAID that's why they chose the pair they chose. It's not just that people couldn't report their real reasons: when asked, they made up reasons that seemed plausible but were wrong.

What do these studies say about the thinking-aloud data you collect? You won't always hear why people did what they did, or didn't do what they didn't do.
Some portion of what you do hear will be wrong. And you're especially taking a risk if you ask people specific questions: they'll give you some kind of an answer, but it may have nothing to do with the facts. Don't treat the comments you get as some kind of gospel. Instead, use them as input to your own judgment processes.

[end hypertopic]--------------------------------------------

------------------------------------
5.6 Measuring Bottom-Line Usability
------------------------------------

We argue that you usually want process data, not bottom-line data, but there are some situations in which bottom-line numbers are useful. You may have a definite requirement that people be able to complete a task in a certain amount of time, or you may want to compare two design alternatives on the basis of how quickly people can work or how many errors they commit. The basic idea in these cases is that you will have people perform test tasks, you will measure how long they take, and you will count their errors.

Your first thought may be to combine this with a thinking-aloud test: in addition to collecting comments you'd collect these other data as well. Unfortunately this doesn't work as well as one would wish. The thinking-aloud process can affect how quickly and accurately people work. It's pretty easy to see how thinking aloud could slow people down, but it has also been shown that sometimes it can speed people up, apparently by making them think more carefully about what they are doing, and hence helping them choose better ways to do things. So if you are serious about finding out how long people will take to do things with your design, or how many problems they will encounter, you really need to do a separate test.

Getting the bottom-line numbers won't be too hard. You can use a stopwatch for timings, or you can instrument your system to record when people start and stop work. Counting errors, and gauging success on tasks, is a bit trickier, because you have to decide what is an error and what counts as successful completion of a task. But you won't have much trouble here either, as long as you accept that you can't come up with perfect criteria for these things, and use your common sense.

5.6.1 Analyzing the Bottom-Line Numbers

When you've got your numbers you'll run into some hard problems. The trouble is that the numbers you get from different test users will be different. How do you combine these numbers to get a reliable picture of what's happening? Suppose users need to be able to perform some task with your system in 30 minutes or less. You run six test users and get the following times:

    20 min
    15 min
    40 min
    90 min
    10 min
     5 min

Are these results encouraging or not? If you take the average of these numbers you get 30 minutes, which looks fine. If you take the MEDIAN, that is, the middle score, you get something between 15 and 20 minutes, which looks even better. Can you be confident that the typical user will meet your 30-minute target? The answer is no. The numbers you have are so variable, that is, they differ so much among themselves, that you really can't tell much about what will be "typical" times in the long run. Statistical analysis, which is the method for extracting facts from a background of variation, indicates that the "typical" times for this system might very well be anywhere from about 5 minutes to about 55 minutes. Note that this is a range for the "typical" value, not the range of possible times for individual users.
That is, it is perfectly plausible given the test data that if we measured lots and lots of users the average time might be as low as 5 minutes, which would be wonderful, but it could also be as high as 55 minutes, which is terrible.

There are two things contributing to our uncertainty in interpreting these test results. One is the small number of test users. It's pretty intuitive that the more test users we measure the better an estimate we can make of typical times. Second, as already mentioned, these test results are very variable: there are some small numbers but also some big numbers in the group. If all six measurements had come in right at (say) 25 minutes, we could be pretty confident that our typical times would be right around there. As things are, we have to worry that if we look at more users we might get a lot more 90-minute times, or a lot more 5-minute times. It's the job of statistical analysis to juggle these factors -- the number of people we test and how variable or consistent the results are -- and give us an estimate of what we can conclude from the data. This is a big topic, and we won't try to do more than give you some basic methods and a little intuition here.

Here's a cookbook procedure for getting an idea of the range of typical values that are consistent with your test data.

1. Add up the numbers. Call this result "sum of x". In our example this is 180.

2. Divide by n, the number of numbers. The quotient is the average, or mean, of the measurements. In our example this is 30.

3. Add up the squares of the numbers. Call this result "sum of squares". In our example this is 10450.

4. Square the sum of x and divide by n. Call this "foo". In our example this is 5400.

5. Subtract foo from sum of squares and divide by n-1. In our example this is 1010.

6. Take the square root. The result is the "standard deviation" of the sample. It is a measure of how variable the numbers are. In our example this is 31.78, or about 32.

7. Divide the standard deviation by the square root of n. This is the "standard error of the mean" and is a measure of how much variation you can expect in the typical value. In our example this is 12.97, or about 13.

8. It is plausible that the typical value is as small as the mean minus two times the standard error of the mean, or as large as the mean plus two times the standard error of the mean. In our example this range is from 30-(2*13) to 30+(2*13), or about 5 to 55. (The "*" stands for multiplication.)

What does "plausible" mean here? It means that if the real typical value is outside this range, you were very unlucky in getting the sample that you did. More specifically, if the true typical value were outside this range you would only expect to get a sample like the one you got 5 percent of the time or less.

Experience shows that usability test data are quite variable, which means that you need a lot of data to get good estimates of typical values. If you pore over the above procedure enough you may see that if you run four times as many test users you can narrow your range of estimates by a factor of two: the breadth of the range of estimates depends on the square root of the number of test users. That means a lot of test users to get a narrow range, if your data are as variable as they often are.

What this means is that you can anticipate trouble if you are trying to manage your project using these test data. Do the test results show we are on target, or do we need to pour on more resources? It's hard to say.
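If you would rather not push calculator buttons, the cookbook procedure above is easy to script. Here is a minimal sketch, not from the original text, in Python; run on the example times it reproduces the numbers worked out above (mean 30, standard deviation about 32, standard error about 13, and a plausible range of roughly 5 to 55 minutes before rounding).

    import math

    def plausible_range(times):
        """Follow the cookbook steps: mean, standard deviation,
        standard error, and the mean plus or minus two standard errors."""
        n = len(times)
        mean = sum(times) / n                                 # steps 1-2
        sum_of_squares = sum(t * t for t in times)            # step 3
        foo = sum(times) ** 2 / n                             # step 4
        sd = math.sqrt((sum_of_squares - foo) / (n - 1))      # steps 5-6
        se = sd / math.sqrt(n)                                # step 7
        return mean, sd, se, (mean - 2 * se, mean + 2 * se)   # step 8

    # The six task times (in minutes) from the example above.
    print(plausible_range([20, 15, 40, 90, 10, 5]))
    # -> mean 30.0, sd about 31.8, se about 13.0, range about (4, 56)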
One approach to this management problem is to get people to agree to try to manage on the basis of the numbers in the sample themselves, without trying to use statistics to figure out how uncertain they are. This is a kind of blind compromise: on the average, the typical value is equally likely to be bigger than the mean of your sample, or smaller. But if the stakes are high, and you really need to know where you stand, you'll need to do a lot of testing. You'll also want to do an analysis that takes into account the cost to you of being wrong about the typical value, and by how much, so you can decide how big a test is really reasonable.

============================================================
HyperTopic: Measuring User Preference
------------------------------------------------------------

Another bottom-line measure you may want is how much users like or dislike your system. You can certainly ask test users to give you a rating after they finish work, say by asking them to indicate how much they liked the system on a scale of 1 to 10, or by choosing among the statements, "This is one of the best user interfaces I've worked with", "This interface is better than average", "This interface is of average quality", "This interface is poorer than average", or "This is one of the worst interfaces I have worked with." But you can't be very sure what your data will mean.

It is very hard for people to give a detached measure of what they think about your interface in the context of a test. The novelty of the interface, a desire not to hurt your feelings (or the opposite), or the fact that they haven't used your interface for their own work can all distort the ratings they give you. A further complication is that different people can arrive at the same rating for very different reasons: one person really focuses on response time, say, while another is concerned about the clarity of the prompts. Even with all this uncertainty, though, it's fair to say that if lots of test users give you a low rating you are in trouble. If lots give you a high rating that's fine as far as it goes, but you can't rest easy.

[end hypertopic]--------------------------------------------

5.6.2 Comparing Two Design Alternatives

If you are using bottom-line measurements to compare two design alternatives, the same considerations apply as for a single design, and then some. Your ability to draw a firm conclusion will depend on how variable your numbers are, as well as how many test users you use. But then you need some way to compare the numbers you get for one design with the numbers from the other.

The simplest approach to use is called a BETWEEN-GROUPS EXPERIMENT. You use two groups of test users, one of which uses version A of the system and the other version B. What you want to know is whether the typical value for version A is likely to differ from the typical value for version B, and by how much. Here's a cookbook procedure for this.

1. Using parts of the cookbook method above, compute the means for the two groups separately. Also compute their standard deviations. Call the results ma, mb, sa, and sb. You'll also need to have na and nb, the number of test users in each group (usually you'll try to make these the same, but they don't have to be).

2. Combine sa and sb to get an estimate of how variable the whole scene is, by computing

   s = sqrt( ( (na-1)*(sa**2) + (nb-1)*(sb**2) ) / (na + nb - 2) )

   ("*" represents multiplication; "sa**2" means "sa squared").

3. Compute a combined standard error:

   se = s * sqrt(1/na + 1/nb)

4. Your range of typical values for the difference between version A and version B is now

   ma - mb plus-or-minus 2*se
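The same kind of script works for the between-groups comparison. Here is a minimal sketch, not from the original text, in Python; the task times are invented numbers, included only to show the calculation.

    import math

    def sample_sd(times):
        """Standard deviation as in the single-group procedure (n-1)."""
        n = len(times)
        mean = sum(times) / n
        return math.sqrt(sum((t - mean) ** 2 for t in times) / (n - 1))

    def compare_groups(a_times, b_times):
        """Plausible range for the difference between versions A and B."""
        na, nb = len(a_times), len(b_times)
        ma, mb = sum(a_times) / na, sum(b_times) / nb
        sa, sb = sample_sd(a_times), sample_sd(b_times)
        s = math.sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2))
        se = s * math.sqrt(1 / na + 1 / nb)
        diff = ma - mb
        return diff, (diff - 2 * se, diff + 2 * se)

    # Invented task times (minutes) for two versions of a design.
    print(compare_groups([20, 15, 40, 90, 10, 5], [25, 30, 20, 35, 45, 15]))

If the range you get includes zero, the test hasn't shown a reliable difference between the two versions.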
Another approach you might consider is a WITHIN-GROUPS EXPERIMENT. Here you use only one group of test users and you get each of them to use both versions of the system. This brings with it some headaches. You obviously can't use the same tasks for the two versions, since doing a task the second time would be different from doing it the first time, and you have to worry about who uses which system first, because there might be some advantage or disadvantage to whichever system someone tries first. There are ways around these problems, but they aren't simple. They work best for very simple tasks about which there isn't much to learn. You might want to use this approach if you were comparing two low-level interaction techniques, for example. You can learn more about the within-groups approach from any standard text on experimental psychology (check your local college library or bookstore).

============================================================
HyperTopic: Don't Push Your Statistics Too Far
------------------------------------------------------------

The cookbook procedures we've described are broadly useful, but they rest on some technical assumptions for complete validity. Both procedures assume that your numbers are drawn from what is called a "normal distribution," and the comparison procedure assumes that the data from both groups come from distributions with the same standard deviation. But experience has shown that these methods work reasonably well even if these assumptions are violated to some extent. You'll do OK if you use them for broad guidance in interpreting your data, but don't get drawn into arguments about exact values.

As we said before, statistics is a big topic. We aren't saying you need an MS in statistics to be an interface designer, but it's a fascinating subject and you won't regret knowing more about it than we've discussed here. If you want to be an interface researcher, rather than a working stiff, then that MS really would come in handy.

[end hypertopic]--------------------------------------------

--------------------------------------------
5.7 Details of Setting Up a Usability Study
--------------------------------------------

The description of user testing we've given up to this point should be all the background you need during the early phases of a task-centered design project. When you're actually ready to evaluate a version of the design with users, you'll have to consider some of the finer details of setting up and running the tests. This section, which you may want to skip on the first reading of the chapter, will help with many of those details.

5.7.1 Choosing the Order of Test Tasks

Usually you want test users to do more than one task. This means they have to do them in some order. Should everyone do them in the same order, or should you scramble them, or what? Our advice is to choose one sensible order, starting with some simpler things and working up to some more complex ones, and stay with that. This means that the tasks that come later will have the benefit of practice on the earlier ones, or some penalty from test users getting tired, so you can't compare the difficulty of tasks using the results of a test like this. But usually that is not what you are trying to do.

5.7.2 Training Test Users

Should test users hit your system cold or should you teach them about it first?
The answer to this depends on what you are trying to find out, and on the way your system will be used in real life. If real users will hit your system cold, as is often the case, you should have them do this in the test. If you really believe users will be pre-trained, then train them before your test. If possible, use the real training materials to do this: you may as well learn something about how well or poorly they work as part of your study.

5.7.3 The Pilot Study

You should always do a pilot study as part of any usability test. A pilot study is like a dress rehearsal for a play: you run through everything you plan to do in the real study and see what needs to be fixed. Do this twice, once with colleagues, to get out the biggest bugs, and then with some real test users. You'll find out whether your policies on giving assistance are really workable and whether your instructions are adequately clear. A pilot study will help you keep down variability in a bottom-line study, but it will head off trouble in a thinking-aloud study too. Don't try to do without one!

5.7.4 What If Someone Doesn't Complete a Task?

If you are collecting bottom-line numbers, one problem you will very probably encounter is that not everybody completes their assigned task or tasks within the available time, or without help from you. What do you do? There is no complete remedy for the problem. A reasonable approach is to assign some very large time, and some very large number of errors, as the "results" for these people. Then take the results of your analysis with an even bigger grain of salt than usual.

5.7.5 Keeping Variability Down

As we've seen, your ability to make good estimates based on bottom-line test results depends on the results not being too variable. There are things you can do to help, though these may also make your test less realistic and hence a less good guide to what will happen with your design in real life.

Differences among test users are one source of variable results: if test users differ a lot in how much they know about the task or about the system, you can expect their time and error scores to be quite different. You can try to recruit test users with more similar backgrounds, and you can try to brief test users to bring them close to some common level of preparation for their tasks.

Differences in procedure, that is, in how you actually conduct the test, will also add to variability. If you help some test users more than others, for example, you are asking for trouble. This reinforces the need to make careful plans about what kind of assistance you will provide. Finally, if people don't understand what they are doing your variability will increase. Make your instructions to test users and your task descriptions as clear as you can.

5.7.6 Debriefing Test Users

We've stressed that it's unwise to ask specific questions during a thinking-aloud test, and during a bottom-line study you wouldn't be asking questions anyway. But what about asking questions in a debriefing session after test users have finished their tasks? There's no reason not to do this, but don't expect too much. People often don't remember very much about problems they've faced, even after a short time. Clayton remembers vividly watching a test user battle with a text processing system for hours, and then asking afterwards what problems they had encountered. "That wasn't too bad, I don't remember any particular problems," was the answer.
He interviewed a real user who had come within one day of quitting a good job because of failure to master a new system; even so, they were unable to remember any specific problem they had had. Part of what is happening appears to be that if you work through a problem and eventually solve it, even with considerable difficulty, you remember the solution but not the problem. There's an analogy here to those hidden picture puzzles you see on kids' menus at restaurants: there are pictures of three rabbits hidden in this picture, can you find them? When you first look at the picture you can't see them. After you find them, you can't help seeing them. In somewhat the same way, once you figure out how something works it can be hard to see why it was ever confusing.

Something that might help you get more info out of questioning at the end of a test is having the test session on video, so you can show the test user the particular part of the task you want to ask about. But even if you do this, don't expect too much: the user may not have any better guess than you have about what they were doing.

Another form of debriefing that is less problematic is asking for comments on specific features of the interface. People may offer suggestions or have reactions, positive or negative, that might not otherwise be reflected in your data. This will work better if you can take the user back through the various screens they've seen during the test.