In this lecture, we're going to talk about trying out your interface with people, and doing so in a way that lets you improve your designs based on what you learn. One of the most common things people ask when running studies is, "Do you like my interface?" It's a really natural thing to ask, because on some level it's what we all want to know. But it's problematic on a whole lot of levels. For one, it's not very specific, and so sometimes people try to make it better. They'll improve it by asking something like, "How much do you like my interface on a one-to-five scale?" or "This is a useful interface: agree or disagree on a one-to-five scale?" And this adds some patina of scientific rigor, but really it's the same thing: you are asking somebody, "Do you like my interface?" And people are nice, so they're going to say, "Sure, I like your interface." This is the please-the-experimenter bias, and it can be especially strong when there are social, cultural, or power differences between the experimenter and the people trying out your interface. For example, Indrani Medhi and colleagues showed this effect in India, where it was exacerbated when the experimenter was white.

Now, you should not take this to mean that you shouldn't have your developers try stuff out with users. Being the person who is both the developer and the person trying stuff out is incredibly valuable. One example of this that I like a lot: Mike Krieger, one of the Instagram founders, is also a former master's student and TA of mine. When he left Stanford and joined Silicon Valley, every Friday afternoon he would bring people into his office and have them try out whatever they were working on that week. That way they got regular feedback each week, and the people building those systems got to see real people trying them out. This can be nails-on-a-chalkboard painful, but you'll also learn a ton.

So how do we get beyond "Do you like my interface?" The basic strategy we're going to talk about today is using specific measures and concrete questions to deliver meaningful results. One of the problems with "Do you like my interface?" is: compared to what? I think one of the reasons people say "Yeah, sure" is that there's no comparison point. So when you're measuring the effectiveness of your interface, even informally, it's really valuable to have some kind of comparison. It's also important to think about the yardstick: what constitutes good in this arena? What measures are you going to use?

One way to start is by asking a base-rate question, like: what fraction of people click on the first link in a search results page? Or what fraction of students come to class? Once we start to measure correlations, things get even more interesting. Is there a relationship between the time of day a class is offered and how many students attend it? Or is there a relationship between the position of a search result and its click-through rate? For both students and click-through, there can be multiple explanations. For example, if fewer students attend early-morning classes, is that a function of when students want to show up? Or is it a function of when good professors want to teach?
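To make those descriptive questions concrete, here's a minimal Python sketch, on invented click-log data, of computing a base rate (what fraction of clicks go to the first link) and a correlation (between result position and click-through rate). None of the names or numbers come from the lecture; they're purely illustrative.

```python
# Minimal sketch on invented data: a base-rate question and a correlation question.
import numpy as np

# Hypothetical click log: one row per observed click, recording the rank
# of the result that was clicked (1 = first link on the page).
rng = np.random.default_rng(0)
position_clicked = rng.choice([1, 2, 3, 4, 5], size=1000,
                              p=[0.55, 0.20, 0.12, 0.08, 0.05])

# Base-rate question: what fraction of clicks go to the first link?
first_link_rate = np.mean(position_clicked == 1)
print(f"Fraction of clicks on the first link: {first_link_rate:.2f}")

# Correlation question: is there a relationship between position and click-through rate?
positions = np.arange(1, 6)
ctr_by_position = np.array([np.mean(position_clicked == p) for p in positions])
r = np.corrcoef(positions, ctr_by_position)[0, 1]
print(f"Correlation between position and click-through rate: {r:.2f}")
```

A correlation like this describes a relationship; as the next part of the lecture discusses, it doesn't by itself tell you why the relationship exists.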
With the click-through example, there are also two kinds of explanations. If lower-placed links yield fewer clicks, is that because those links are intrinsically poorer quality? Or is it because people just click on the first link and don't bother getting to the second one, even if it might be better? To identify placement as playing a causal role, you need to isolate it as a variable by, say, randomizing the order of search results.

As we start to talk about these experiments, let's introduce a few terms that will help us. The multiple conditions that we try, the thing that we're manipulating, for example the time of a class or the location of a particular link on a search results page: these manipulations are independent variables, because they're independent of what the user does; they're in the control of the experimenter. Then we measure what the user does, and those measures are called dependent variables, because they depend on what the user does. Common measures in HCI include task completion time: how long does it take somebody to complete a task, for example find something you want to buy, create a new account, or order an item? Accuracy: how many mistakes did people make, and were those fatal errors, or things they were able to quickly recover from? Recall: how much does a person remember afterward, or after periods of non-use? And emotional response: how does the person feel about the task being completed? Were they confident? Were they stressed? Would they recommend the system to a friend? So your independent variables are the things you manipulate, and your dependent variables are the things you measure.

How reliable is your experiment? If you ran it again, would you see the same results? That's the internal validity of an experiment. To have a precise experiment, you need to remove confounding factors, and it's important to study enough people that the result is unlikely to have arisen by chance. But you may be able to run the same study over and over, get the same results, and still have it not matter in some real-world sense. External validity is the generalizability of your results: does this apply only to eighteen-year-olds in a college classroom, or does it apply to everybody in the entire world?

Let's bring this back to HCI and talk about one of the problems you're likely to face as a designer. One of the things we commonly want to ask is: is my cool new approach better than the industry standard? After all, that's why you're making the new thing. One of the challenges with this, especially early in the design process, is that you may have something that's very much in its prototype stages, while the industry standard is likely to benefit from years and years of refinement. At the same time, it may be stuck with years and years of cruft, which may or may not be intrinsic to its approach. So if you compare your cool new tool to some industry standard, there are two things varying: one is the fidelity of the implementation, and the other, of course, is the approach. Consequently, when you get the results, you can't know whether to attribute them to fidelity, to the approach, or to some combination of the two. So we're going to talk about ways of teasing apart those different causal factors.
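Before moving on, here's a minimal sketch of that randomization idea in code. The user-behavior model and all numbers are invented for illustration; the point is only to show the independent variable (randomized placement) and the dependent variable (clicks).

```python
# Minimal sketch with an invented behavior model:
# the independent variable is the randomized placement of each link,
# the dependent variable is whether a given position gets clicked.
import random

random.seed(42)

links = ["A", "B", "C"]                                # three results of unknown intrinsic quality
click_prob_by_position = {1: 0.5, 2: 0.25, 3: 0.1}     # assumed user behavior, not real data
clicks = {1: 0, 2: 0, 3: 0}
shown = {1: 0, 2: 0, 3: 0}

for _ in range(10_000):
    order = links[:]
    random.shuffle(order)                              # the manipulation: randomize placement
    for position, link in enumerate(order, start=1):
        shown[position] += 1
        if random.random() < click_prob_by_position[position]:
            clicks[position] += 1                      # the measure: a click at this position

for position in (1, 2, 3):
    print(f"Position {position}: click-through rate {clicks[position] / shown[position]:.2f}")

# Because placement was randomized, systematic differences across positions
# reflect placement itself rather than the quality of whichever link sat there.
```

In a real deployment you would randomize the order served to actual users and log their clicks, but the logic of the inference is the same: randomization breaks the link between placement and quality, so placement can be identified as the cause.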
Now, one thing I should say right off the bat is that there are times when it may be more or less relevant whether you have a good handle on what the causal factors are. For example, if you're trying to decide between two different digital cameras, at the end of the day maybe all you care about is image quality, or usability, or some other factor, and exactly what makes that image quality better or worse may be less relevant to you. If you don't have control over the variables, then identifying cause may not be what you're after. But when you're a designer, you do have control over the variables, and that's when it's really important to ascertain cause.

Here's an example: a study that came out right when the iPhone did, done by a research firm called User Centric. I'm going to read from a news article about it: "Research firm User Centric has released a study that tries to gauge how effective the iPhone's unusual on-screen keyboard is. The goal is certainly a noble one, but I cannot say the survey's approach results in data that makes much sense. User Centric brought in 20 owners of other phones. Half had QWERTY keyboards, half had ordinary phones with numeric keypads. None were familiar with the iPhone. The research involved having the test subjects enter six sample text messages with the phones that they already had, and six with the iPhone. The end result was that the iPhone newbies took twice as long to enter text with an iPhone as they did with their own phones, and made lots more typos."

So let's critique this study and talk about its benefits and drawbacks. Here is the web page directly from User Centric. What's the manipulation in this study? It's the input style. How about the measure? That's words per minute. And there's absolutely value in measuring the initial usability of the iPhone, for several reasons: if you're introducing a new technology, it's beneficial if people can get up to speed pretty quickly. However, it's important to realize that this comparison is intrinsically unfair, because the users of the previous cell phones were experts at their input modality, while the people using the iPhone were novices at it. It seems quite likely that the iPhone users, once they become actual users, are going to get better over time. So if you're not used to something the first time you try it, that may not be a deal killer. And it's certainly not an apples-to-apples comparison.

Another thing we don't get out of this article is whether the difference is significant. We read that each person typed six messages in each of two conditions, their own device and the iPhone (or vice versa), and that people typing with the iPhone were half as fast as when they typed with the mini-QWERTY device they were accustomed to; we'll sketch below what checking the significance of a difference like that could look like. So while this may tell us something about the initial usability of the iPhone, in terms of long-term usability I don't think we get that much out of it.

If you weren't satisfied by that initial data, you're in good company: neither were the authors of the study. They went back a month later and ran another study, where they brought in 40 new people to the lab who were either iPhone users, QWERTY users, or nine-key users.
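As a hedged illustration only, here's how one might check the significance of a within-subjects difference like that in Python. The raw numbers from the User Centric study aren't available here, so the data below are invented; the paired t-test is appropriate because each participant typed in both conditions.

```python
# Hypothetical sketch: is a within-subjects speed difference significant?
# All numbers are invented; they are not the study's actual data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants = 20

# Words per minute for each participant on their own phone vs. on the iPhone.
own_phone_wpm = rng.normal(loc=30, scale=5, size=n_participants)
iphone_wpm = rng.normal(loc=15, scale=5, size=n_participants)

# Paired t-test, because the same people typed in both conditions (within subjects).
t_stat, p_value = stats.ttest_rel(own_phone_wpm, iphone_wpm)
print(f"mean own phone: {own_phone_wpm.mean():.1f} wpm, "
      f"mean iPhone: {iphone_wpm.mean():.1f} wpm")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```

Even a significant result here would only speak to first-use performance; it says nothing about how fast people become once they've practiced, which is exactly the gap the follow-up study tried to close.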
That second study is more of an apples-to-apples comparison, in that they're testing people who are relative experts in these three different modalities. After about a month on the iPhone, you're probably starting to asymptote in terms of your performance; performance certainly keeps improving past a month, but a month starts to be a reasonable amount of experience. What they found was that iPhone users and QWERTY users were about the same in terms of speed, and that the numeric-keypad users were much slower. So once again our manipulation is input style, and we measure speed; this time, we also measure error rate. What we see is that iPhone users and QWERTY users are essentially the same speed, but the iPhone users make many more errors.

Now, one thing I should point out about this study is that each of the different devices was used by a different group of people. It was done this way so that each device was used by somebody who was comfortable and experienced with it, which removes the worry about newbies working on these devices. However, especially in 2007, there may have been significant differences in who those people were: maybe early adopters were drawn to the 2007 iPhone, maybe business users were particularly drawn to the QWERTY devices, and maybe people who had better things to do with their time than send email on their telephone were using the nine-key devices. So while this comparison is better than the previous one, the potential for variation between the user populations is still problematic if you'd like to claim that the difference reflects intrinsic properties of the device; it may, at least in part, have to do with the users.

So what are some strategies for fairer comparison? To brainstorm a couple of options: one thing you can do is insert your approach into a production setting. This may seem like a lot of work, and sometimes it is, but in the age of the web it's a lot easier than it used to be, and it's possible even if you don't have access to the servers of the service you're comparing against. You can use things like a proxy server or client-side scripting to put your own technique in and get an apples-to-apples comparison. A second strategy, for neutralizing the environment difference between a production version and your new approach, is to make a version of the production thing in the same style as your new approach; that makes them equivalent in terms of implementation fidelity. A third strategy, one that's used commonly in research, is to scale things down so you're looking at just a piece of the system at a particular point in time. That way you don't have to implement a whole big giant thing; you can focus on one small piece and have that comparison be fair. And the fourth strategy is that, when expertise is relevant, train people up: give them the practice they need so that they start to hit that asymptote in performance, and you get a better read than you would from newbies.

So now, to close out this lecture: if somebody asks you "Is interface X better than interface Y?", you know we're off to a good start, because we have a comparison. However, you also know to be worried: what does "better" mean? Often, in a complex system, you're going to have several measures, and that's totally cool.
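For instance, the second User Centric study had two measures, speed and errors, collected between subjects (a different group of people on each device). Here's a hedged Python sketch of how data of that shape might be analyzed; the group sizes and all numbers are invented and are not from the study.

```python
# Hypothetical sketch: a between-subjects comparison of three groups
# (iPhone, QWERTY, nine-key), each made up of different people.
# All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Dependent variable 1: typing speed (words per minute) per participant.
iphone_wpm = rng.normal(loc=30, scale=5, size=20)
qwerty_wpm = rng.normal(loc=30, scale=5, size=20)
ninekey_wpm = rng.normal(loc=18, scale=5, size=20)

# One-way ANOVA: is there any difference in speed across the three groups?
f_stat, p_value = stats.f_oneway(iphone_wpm, qwerty_wpm, ninekey_wpm)
print(f"speed: F = {f_stat:.2f}, p = {p_value:.4f}")

# Dependent variable 2: errors per message, again invented.
iphone_errors = rng.poisson(lam=2.0, size=20)
qwerty_errors = rng.poisson(lam=0.8, size=20)
print(f"mean errors per message: iPhone {iphone_errors.mean():.1f}, "
      f"QWERTY {qwerty_errors.mean():.1f}")
```

Even with a clean analysis like this, the population concern from the lecture still applies: because different kinds of people chose each device, group differences may reflect the users as much as the devices.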
There's a lot of value in being explicit about what it is that you mean by "better." What are you trying to accomplish? What are you trying to improve? And if anybody ever tells you that their interface is always better, don't believe them, because nearly all of the time the answer is going to be "it depends." The interesting question is: what does it depend on? Most interfaces are good for some things and not for others. For example, if you have a tablet computer where all of the screen is devoted to the display, that's going to be great for reading, web browsing, looking at pictures, that kind of activity. Not so good if you want to type a novel.

So here we've introduced controlled comparison as a way of finding the smoking gun, as a way of inferring cause. When you have only two conditions, we're going to talk about that as a minimal-pairs design. As a practicing designer, the reason to care about what's causal is that it gives you the material to make a better decision going forward. A lot of studies violate this constraint, and that gets dangerous, because it prevents you from being able to make sound decisions. I hope that the tools we've talked about today and in the next several lectures will help you become a wise skeptic, like our friend in this xkcd comic. I'll see you next time.