WEBVTT

00:00:00.000 --> 00:00:30.860
When we want to say whether something's good or not, it's not so obvious.
And this unit is all about evaluation.
Ah, well, it's a lovely day here in Tiree.
I'm looking out the window again.
But how do we know it's a lovely day?
Well, I won't turn the camera around to show you,
because I'll probably never get it pointing back again.
But I can tell you the Sun's shining;
there's a blue sky.

00:00:30.860 --> 00:01:00.020
I could go and measure the temperature. It's probably not that warm
because it's not early in the year.
But there's a number of metrics or measures I could use.
Or perhaps I should go out and talk to people
and see if there's people sitting out and saying how lovely it is
or they're all huddled inside.
Now, for me, this sunny day seems like a good day.
But last week it was the Tiree Wave Classic, and there were people windsurfing.
The best day for them was not a sunny day.

00:01:00.020 --> 00:01:31.420
It was actually quite a dull day, quite a cold day.
But it was the day with the best wind.
They didn't care about the Sun; they cared about the wind.
So, if I'd asked them, I might have got a very different answer
than if I'd asked a different visitor to the island
or if you'd asked me about it.
Evaluation is absolutely crucial to knowing whether something is right.
But, you know, the methods themselves are important – important to do.
But they tend to be a bit boring to talk about, to be honest,
because you end up with long lists of things to check.

00:01:31.420 --> 00:02:00.760
When you're looking at an actual system, though, it becomes more interesting again.
But it's not so interesting to talk about.
What I want to do is talk more about the broader issues
about *how* you choose *what kind* of evaluation to do
and some of the issues that surround it.
And it *can* be almost a conflict between people within HCI.
It's between those who are more quantitative.
So, when I was talking about the sunny day, I could go and measure the temperature.
I could measure the wind speed if I was a surfer

00:02:00.760 --> 00:02:33.360
– a whole lot of numbers about it –
as opposed to those who want to take a more qualitative approach.
So, instead of measuring the temperature, those are the people who'd want to talk to people to find out
more about what it *means* to be a good day.
And we could do the same for an interface. I can look at a phone and say,
"How long did it take me to make a phone call?"
Or I could ask somebody whether they're happy with it:
how does the phone make them feel?
– different kinds of questions to ask.
Also, you might ask those questions – and you can ask them in both a qualitative and a quantitative way – in a controlled setting.

00:02:33.360 --> 00:03:03.400
You might take somebody into a room, give them perhaps a new interface to play with.
So, you might take the computer, give them a set of tasks to do
and see how long they take to do it.
*Or* you might go out and watch people in their real lives using some piece of
– it might be existing software; it might be new software, or just actually observing how they do things.
There's a bit of overlap here – I should have mentioned at the beginning –
between evaluation techniques and empirical studies.
And you might do empirical studies very, very early on,

00:03:03.400 --> 00:03:31.420
and they share a lot of features with evaluation.
They're much more likely to be in-the-wild studies.
And there are advantages to each.
In a laboratory situation when you've brought people in,
you can control what they're doing;
you can guide them in particular ways.
However, that tends to make it more
– shall we say – robust, in that you know what's going on,
but less about the real situation.
In the real world, it's what people often call "ecologically valid";
it's about what they *really* are up to.

00:03:31.420 --> 00:04:02.420
But I said it's much less controlled, harder to measure – all sorts of things.
Very often, it's rarer to find quantitative methods in the wild.
But you can find both. You can go out and perhaps take some measure of people outside.
You might go out on a sunny day and see how many people are smiling.
Count the number of smiling people each day and use that as your measure
– a very quantitative measure that's in the wild.
More often, you might in the wild just go and ask people – it's a more qualitative thing.

00:04:02.420 --> 00:04:34.080
Similarly, in the lab, you might do a quantitative thing – some sort of measurement –
or you might ask something more qualitative – more open-ended.
Also, you might do away with the users entirely.
So, you might have users there doing it,
or you might actually use what's called an *expert evaluation* method
or an analytic method of evaluation.
By having a structured set of questions, somebody who's got a bit of expertise, a bit of knowledge,

00:04:34.080 --> 00:05:05.160
can often make a very good estimate of whether something is really likely to work or not.
So, you can have that sort of expert-based or analytic-based evaluation method,
or you can have something where you get real users in.
Most people I think would say that in the end you do want to see some real users there;
you can't do it all by expert methods.
But often the expert methods are cheaper and quicker to do early on in the design process.
So, usually both are needed, and in fact that's the general message I think I'd like to give you about this.

00:05:05.160 --> 00:05:30.400
That, in general, it's the *combination* of different kinds of methods which tend to be most powerful.
So, sometimes at different stages: you might do expert evaluation or analytic evaluation early,
more with real users later.
Although probably you'll want to see some users at all stages.
In particular, quantitative and qualitative methods are often seen as very, very different,
and people tend to focus on one or the other.

00:05:30.400 --> 00:06:03.380
Personally, I find they fit together.
Quantitative methods tend to tell me whether something happens and how common it is to happen
– whether it's something I expect to see in practice commonly.
Qualitative methods – the ones which are more about asking people open-ended questions –
both tell me new things I hadn't thought about before
and give me the "Why?" answers, if I'm trying to understand *why* it is I'm seeing a phenomenon.
So, the quantitative things – the measurements – say, "Yeah, there's something happening. People are finding this feature difficult."

00:06:03.380 --> 00:06:30.540
The qualitative thing helps me understand *what it is about it that is difficult* and helps me to solve it.
So, I find they give you *complementary* things – they work together.
The other thing you have to think about when choosing methods is what's appropriate for the particular situation.
And these things don't always work.
Sometimes, you can't do an in-the-wild experiment.
If it's about, for instance, systems for people in outer space,
you're going to have to do it in a laboratory.

00:06:30.540 --> 00:07:05.280
You're not going to go up there and experiment
while people are flying around the planet.
So, sometimes you can't do one thing or the other – it doesn't make sense.
Similarly, with users – if you're designing something for chief executives of Fortune 100 companies,
you're not going to get 20 of them in a room and do a user study with them.
That's not practical.
So, you have to understand what's practical, what's reasonable
and choose your methods accordingly.
Key to all of this is understanding the purpose of your experimentation.

00:07:05.280 --> 00:07:30.110
Why are you doing the evaluation in the first place?
What do you want to get out of it?
And there are usually said to be two main kinds of user evaluation.
The first of them is what's called *formative evaluation*.
And that's about "How can I make something better?".
So, you've designed an interface and you're partway through that iterative process,
and you're thinking:
"Okay, how do I make it better? How do I find out what's wrong with it?"

00:07:30.110 --> 00:08:04.160
In fact, people often focus on what's wrong.
Making it better is sometimes a more helpful way to think about it.
But very often people look for usability faults or flaws.
Maybe you should be looking for *usability opportunities*.
But whichever way, your aim is about making this thing better that you have in front of you.
So, that's about improving the design.
The other kind of evaluation you might do is towards the end of that process:
*summative evaluation*, which asks "Is it good enough? Does it meet some criteria?".
Perhaps somebody's giving you something that says:
"I've got to put this into the company, and everybody has got to be able to use this

00:08:04.160 --> 00:08:30.960
within ten minutes; otherwise, it's no good."
So, you have some sort of criteria you're trying to reach.
So, that's more about contractual or sales obligations, and it's an endpoint thing.
The two of these will often use very similar methods.
You might measure people's performance, do a whole range of things.
But in the first of them – the formative one – your aim is about improving things.
It's about unpacking what's wrong to make it better.

00:08:30.960 --> 00:09:02.680
In the second, your aim is about finding out whether you've done it well enough.
Sometimes, people use this to try and *prove* that they've done it well enough.
So, there's an interesting tension that goes on there.
However, those two are important.
But there's a third, which is often missed, which is:
In practice people *are* doing it, but they often don't realize that's what they're doing.
There isn't a good name for this one.
I sometimes call it "explorative", "investigative", "exploratory".

00:09:02.680 --> 00:09:35.540
And this is about when you want to understand something.
So, I might be giving somebody a new mobile interface to use
because that's the interface I'm going to deliver and I want to make it better.
But I might give them the interface to use because I want to understand how they would use something like it.
So, say it's a life-logging application – it's about health monitoring.
You know, "How well are you feeling today?" and stuff like that.
I might be more interested in finding out how that would go into their lives,

00:09:35.540 --> 00:10:05.840
how it would fit with their lives, how it would make sense to them
– the kinds of things they would want to log.
Later on, then, I might throw away completely what I've designed.
So, it wasn't an early design; it was more an exploratory thing – a thing to find out.
Now, you'll certainly do that from an academic research point of view
if you're doing a Ph.D. or if you're doing research in HCI.
But it's also true early in a design process.
Your aim is more to understand the situation than it is to make something that's going to get better

00:10:05.840 --> 00:10:33.580
or to say it's good enough.
It's very easy to confuse these goals.
That's why I'm telling you about them – because your goal, what you're really after, might be investigative
but you might frame your experiment as if it's summative – after a "good enough" answer:
"Yes, it was good enough." That doesn't tell you anything.
So, if I had this health application and I found that people enjoyed using it,

00:10:33.580 --> 00:11:02.260
what does that tell me? What have I learned?
So, if you know *what* you're trying to address, you can then tune your evaluation for that.
And when does this process end?
Evaluation could go on forever, especially if you think about these iterative processes.
That's not quite true of the summative one – there you *do* get to the end.
But in these formative evaluations, when do you get to the end of that?
Now, there would have been a time when you'd say, "Well, it's when we deliver the product;

00:11:02.260 --> 00:11:30.400
when it goes into shrink-wrap and we put it on shelves."
Nowadays, you may have heard the term "perpetual beta":
the idea that with web applications you're constantly putting them up there,
tweaking them, making them better, experimenting effectively, often with real users.
So, in some sense, real use is the ultimate evaluation.
Because of that, actually as you design one of the things you might want to think about is
how you are going to get that information from use

00:11:30.400 --> 00:12:03.760
in order to help you design.
In fact, last week at the Wave Classic – the surfer event –
I've been involved in designing a local history application for the island that I'm on.
And we were able, just in time, to get a version of this out for the Wave Classic.
I know, from the number of downloads and the feed accesses in the logs,
that some people were using the application.
But I don't know whether they used any of the history things or they just used some of the other facilities on it,

00:12:03.760 --> 00:12:26.920
because I was a bit last-minute; I didn't get a chance to get the logging in.
So, I'm getting real use,
but I wasn't getting information to help improve the future version.
So, certainly for future prototypes, we will actually have this in.
But when you design, you can actually think about how you're going to gather information from real use to help you improve things.