﻿WEBVTT - https://subtitletools.com

00:00:03.060 --> 00:00:06.680
When we want to say whether something's good or not, it's not so obvious.

00:00:06.680 --> 00:00:10.300
And this unit is all about evaluation.

00:00:16.000 --> 00:00:18.320
Ah, well, it's a lovely day here in Tiree.

00:00:18.320 --> 00:00:20.380
I'm looking out the window again.

00:00:20.380 --> 00:00:22.980
But how do we know it's a lovely day?

00:00:22.980 --> 00:00:25.320
Well, I won't turn the camera around to show you,

00:00:25.320 --> 00:00:27.220
because I'll probably never get it pointing back again.

00:00:27.220 --> 00:00:29.140
But I can tell you the Sun's shining;

00:00:29.140 --> 00:00:30.860
there's a blue sky.

00:00:30.860 --> 00:00:33.260
I could go and measure the temperature. It's probably not that warm

00:00:33.260 --> 00:00:35.100
because it's not early in the year.

00:00:35.440 --> 00:00:38.700
But there are a number of metrics or measures I could use.

00:00:38.940 --> 00:00:41.040
Or perhaps I should go out and talk to people

00:00:41.040 --> 00:00:44.520
and see if there are people sitting out and saying how lovely it is

00:00:44.520 --> 00:00:46.600
or they're all huddled inside.

00:00:47.480 --> 00:00:51.000
Now, for me, this sunny day seems like a good day.

00:00:51.000 --> 00:00:56.700
But last week it was the Tiree Wave Classic, and there were people windsurfing.

00:00:57.280 --> 00:01:00.020
The best day for them was not a sunny day.

00:01:00.020 --> 00:01:03.100
It was actually quite a dull day, quite a cold day.

00:01:03.100 --> 00:01:05.000
But it was the day with the best wind.

00:01:05.000 --> 00:01:07.520
They didn't care about the Sun; they cared about the wind.

00:01:07.520 --> 00:01:10.280
So, if I'd asked them, I might have got a very different answer

00:01:10.280 --> 00:01:13.240
than if I'd asked a different visitor to the island

00:01:13.240 --> 00:01:15.980
or if you'd asked me about it.

00:01:15.980 --> 00:01:20.280
Evaluation is absolutely crucial to knowing whether something is right.

00:01:20.560 --> 00:01:25.500
But, you know, the methods of it are important – they are important to do.

00:01:25.500 --> 00:01:28.680
But they tend to be a bit boring to talk about, to be honest,

00:01:28.680 --> 00:01:31.420
because you end up with long lists of things to check.

00:01:31.420 --> 00:01:35.660
When you're looking at an actual system, though, it becomes more interesting again.

00:01:35.660 --> 00:01:37.640
But it's not so interesting to talk about.

00:01:37.640 --> 00:01:41.200
What I want to do is talk more about the broader issues

00:01:41.200 --> 00:01:44.680
about *how* you choose *what kind* of evaluation to do

00:01:44.680 --> 00:01:47.160
and some of the issues that surround it.

00:01:49.000 --> 00:01:53.440
And it *can* be almost a conflict between people within HCI.

00:01:53.440 --> 00:01:55.600
It's between those who are more quantitative.

00:01:55.600 --> 00:01:58.780
So, when I was talking about the sunny day, I could go and measure the temperature.

00:01:58.780 --> 00:02:00.760
I could measure the wind speed if I was a surfer

00:02:00.760 --> 00:02:03.000
– a whole lot of numbers about it –

00:02:03.000 --> 00:02:06.720
as opposed to those who want to take a more qualitative approach.

00:02:06.720 --> 00:02:11.900
So, instead of measuring the temperature, those are the people who'd want to talk to people to find out

00:02:11.900 --> 00:02:15.120
more about what it *means* to be a good day.

00:02:15.120 --> 00:02:17.580
And we could do the same for an interface. I can look at a phone and say,

00:02:17.580 --> 00:02:19.960
"How long did it take me to make a phone call?"

00:02:19.960 --> 00:02:22.240
Or I could ask somebody whether they're happy with it:

00:02:22.240 --> 00:02:24.700
how does the phone make them feel?

00:02:24.700 --> 00:02:27.500
– different kinds of questions to ask.

00:02:27.800 --> 00:02:33.360
Also, you might ask those questions – and you can ask them in both a qualitative and a quantitative way – in a sealed setting.

00:02:33.360 --> 00:02:37.420
You might take somebody into a room, give them
perhaps a new interface to play with.

00:02:37.420 --> 00:02:40.820
So, you might take the computer, give them a set of tasks to do

00:02:40.820 --> 00:02:42.900
and see how long they take to do it.

00:02:42.900 --> 00:02:48.400
*Or* you might go out and watch people in their real lives using some piece of

00:02:48.400 --> 00:02:53.620
– it might be existing software; it might be new software, or just actually observing how they do things.

00:02:54.280 --> 00:02:56.280
There's a bit of overlap here – I should have mentioned at the beginning –

00:02:56.280 --> 00:02:59.660
between evaluation techniques and empirical studies.

00:03:00.580 --> 00:03:03.400
And you might do empirical studies very, very early on,

00:03:03.400 --> 00:03:06.700
and they share a lot of features with evaluation.

00:03:06.700 --> 00:03:09.080
They're much more likely to be in-the-wild studies.

00:03:09.080 --> 00:03:10.631
And there are advantages to each.

00:03:10.631 --> 00:03:13.180
In a laboratory situation when you've brought people in,

00:03:13.180 --> 00:03:14.950
you can control what they're doing;

00:03:14.950 --> 00:03:17.840
you can guide them in particular ways.

00:03:17.840 --> 00:03:21.560
However, that tends to make it both more

00:03:21.560 --> 00:03:23.880
– shall we say – robust, in that you know what's going on,

00:03:23.880 --> 00:03:26.180
but less about the real situation.

00:03:26.180 --> 00:03:29.160
In the real world, it's what people often call "ecologically valid";

00:03:29.160 --> 00:03:31.420
it's about what they *really* are up to.

00:03:31.700 --> 00:03:35.700
But, that said, it's much less controlled, harder to measure – all sorts of things.

00:03:36.360 --> 00:03:42.320
It's rare, or at least rarer, to find more quantitative studies in the wild.

00:03:42.320 --> 00:03:47.720
But you can find both. You can both go out and perhaps do a measure of people outside.

00:03:47.720 --> 00:03:52.180
You might go out on a sunny day and see how many people are smiling.

00:03:52.180 --> 00:03:55.640
Count the number of smiling people each day and use that as your measure

00:03:55.640 --> 00:03:58.020
– a very quantitative measure that's in the wild.

00:03:58.020 --> 00:04:02.420
More often, you might in the wild just go and ask people – it's a more qualitative thing.

00:04:04.060 --> 00:04:08.060
Similarly, in the lab, you might do a quantitative thing – some sort of measurement –

00:04:08.060 --> 00:04:12.880
or you might ask something more qualitative – more open-ended.

00:04:16.100 --> 00:04:21.140
Also, you might do away with the users entirely.

00:04:21.160 --> 00:04:22.940
So, you might have users there doing it,

00:04:22.940 --> 00:04:27.040
or you might actually use what's called an *expert evaluation* method

00:04:27.040 --> 00:04:29.080
or an analytic method of evaluation.

00:04:29.080 --> 00:04:34.080
By having a structured set of questions, somebody who's got a bit of expertise, a bit of knowledge,

00:04:34.080 --> 00:04:39.620
can often have a very good estimate of whether something is really likely to work or not.

00:04:39.620 --> 00:04:43.560
So, you can have that sort of expert-based or
analytic-based evaluation method,

00:04:43.560 --> 00:04:46.840
or you can have something where you get real users in.

00:04:47.360 --> 00:04:51.200
Most people I think would say that in the end you do want to see some real users there;

00:04:51.200 --> 00:04:53.880
you can't do it all by expert methods.

00:04:53.880 --> 00:04:59.180
But often the expert methods are cheaper and quicker to do early on in the design process.

00:05:00.060 --> 00:05:05.160
So, usually both are needed, and in fact that's the general message I think I'd like to give you about this.

00:05:05.160 --> 00:05:11.040
That, in general, it's the *combination* of different kinds of methods which tends to be most powerful.

00:05:11.360 --> 00:05:16.240
So, sometimes at different stages: you might do expert evaluation or analytic evaluation early,

00:05:16.240 --> 00:05:18.240
more with real users later.

00:05:18.240 --> 00:05:21.440
Although probably you'll want to see some users at all stages.

00:05:22.000 --> 00:05:27.400
Particularly, quantitative and qualitative methods, which are often seen as very, very different,

00:05:27.400 --> 00:05:30.400
and people will tend to focus on one or the other.

00:05:30.400 --> 00:05:33.500
Personally, I find they fit together.

00:05:33.500 --> 00:05:39.640
Quantitative methods tend to tell me whether something happens and how common it is to happen

00:05:39.640 --> 00:05:43.660
– whether it's something I expect to see in practice commonly.

00:05:43.660 --> 00:05:48.280
Qualitative methods – the ones which are more about asking people open-ended questions –

00:05:48.280 --> 00:05:52.060
tend to both tell me new things I didn't think about before,

00:05:52.060 --> 00:05:57.020
but also give me the "Why?" answers, if I'm trying to understand *why* it is I'm seeing a phenomenon.

00:05:57.020 --> 00:06:03.380
So, the quantitative things – the measurements – say, "Yeah, there's something happening. People are finding this feature difficult."

00:06:03.880 --> 00:06:08.880
The qualitative thing helps me understand *what it is about it that is difficult* and helps me to solve it.

00:06:08.880 --> 00:06:13.320
So, I find they give you *complementary* things – they work together.

00:06:14.140 --> 00:06:20.000
The other thing you have to think about when choosing methods is what's appropriate for the particular situation.

00:06:20.000 --> 00:06:22.080
And these things don't always work.

00:06:22.080 --> 00:06:25.280
Sometimes, you can't do an in-the-wild experiment.

00:06:25.480 --> 00:06:28.420
If it's about, for instance, systems for people in outer space,

00:06:28.420 --> 00:06:30.540
you're going to have to do it in a laboratory.

00:06:30.540 --> 00:06:33.940
You're not going to go up there and experiment

00:06:33.940 --> 00:06:35.880
while people are flying around the planet.

00:06:35.880 --> 00:06:39.560
So, sometimes you can't do one thing or the other – it doesn't make sense.

00:06:39.560 --> 00:06:46.360
Similarly, with users – if you're designing something for chief executives of Fortune 100 companies,

00:06:46.360 --> 00:06:50.500
you're not going to get 20 of them in a room and do a user study with them.

00:06:50.500 --> 00:06:51.960
That's not practical.

00:06:51.960 --> 00:06:54.580
So, you have to understand what's practical, what's reasonable

00:06:54.580 --> 00:06:57.460
and choose your methods accordingly.

00:07:00.120 --> 00:07:05.280
Key to all of this is understanding the purpose of your experimentation.

00:07:05.280 --> 00:07:07.580
Why are you doing the evaluation in the first place?

00:07:07.580 --> 00:07:09.880
What do you want to get out of it?

00:07:09.880 --> 00:07:14.480
And there are usually said to be two main kinds of user evaluation.

00:07:14.480 --> 00:07:17.260
The first of them is what's called *formative evaluation*.

00:07:17.260 --> 00:07:19.840
And that's about "How can I make something better?".

00:07:19.840 --> 00:07:23.840
So, you've designed an interface and you're partway through. This is in the iterative process.

00:07:23.840 --> 00:07:26.680
You're in that iterative process, and you're thinking:

00:07:26.680 --> 00:07:30.110
"Okay, how do I make it better? How do I find out what's wrong with it?"

00:07:30.110 --> 00:07:32.380
In fact, people often focus on what's wrong.

00:07:32.380 --> 00:07:35.920
Making it better is sometimes a better way to think about it.

00:07:35.920 --> 00:07:39.360
But very often people look for usability faults or flaws.

00:07:39.360 --> 00:07:42.080
Maybe you should be looking for *usability opportunities*.

00:07:42.080 --> 00:07:46.440
But whichever way, your aim is about making this thing better that you have in front of you.

00:07:46.440 --> 00:07:49.200
So, that's about improving the design.

00:07:49.200 --> 00:07:52.560
The other kind of evaluation you might do is towards the end of that process,

00:07:52.560 --> 00:07:55.690
which is "Is it good enough? Does it meet some criteria?".

00:07:55.690 --> 00:07:59.270
Perhaps somebody's given you something that says:

00:07:59.270 --> 00:08:04.160
"I've got to put this into the company, and everybody has got to be able to use this

00:08:04.160 --> 00:08:07.080
within ten minutes; otherwise, it's no good."

00:08:07.080 --> 00:08:10.080
So, you have some sort of criteria you're trying to reach.

00:08:10.080 --> 00:08:15.440
So, that's more about contractual or sales obligations, and it's an endpoint thing.

00:08:16.960 --> 00:08:20.500
The two of these will often use very similar methods.

00:08:20.500 --> 00:08:23.780
You might measure people's performances, do a whole range of things.

00:08:23.780 --> 00:08:27.620
But in the first of them – the formative one – your aim is about improving things.

00:08:27.620 --> 00:08:30.960
It's about unpacking what's wrong to make it better.

00:08:30.960 --> 00:08:35.600
In the second, your aim is about finding out
whether you've done it well enough.

00:08:35.600 --> 00:08:39.340
Sometimes, people use this to try and *prove* that they've done it well enough.

00:08:39.340 --> 00:08:42.680
So, there's an interesting tension that goes on there.

00:08:43.179 --> 00:08:46.600
However, those two are important.

00:08:46.600 --> 00:08:49.060
But there's a third, which is often missed, which is:

00:08:49.060 --> 00:08:55.440
In practice people *are* doing things, but
often forget and don't realize what they're doing.

00:08:55.440 --> 00:08:58.120
There isn't a good name for this one.

00:08:58.120 --> 00:09:02.680
I sometimes call it "explorative", "investigative", "exploratory".

00:09:02.680 --> 00:09:05.820
And this is about when you want to understand something.

00:09:05.820 --> 00:09:12.160
So, I might be giving somebody a new mobile interface to use

00:09:12.160 --> 00:09:18.340
because that's the interface I'm going to deliver and I want to make it better.

00:09:18.340 --> 00:09:23.440
But I might give them the interface to use because I want to understand how they would use something like it.

00:09:23.440 --> 00:09:26.660
So, say it's a life-logging application – it's about health monitoring.

00:09:26.660 --> 00:09:29.800
You know, "How well are you feeling today?" and stuff like that.

00:09:30.420 --> 00:09:35.540
I might be more interested in finding out how that would go into their lives,

00:09:35.540 --> 00:09:38.240
how it would fit with their lives, how it would make sense to them

00:09:38.240 --> 00:09:41.320
– the kinds of things they would want to log.

00:09:41.320 --> 00:09:45.300
Later on, then, I might throw away
completely what I've designed.

00:09:45.300 --> 00:09:49.860
So, it wasn't an early design; it was more an exploratory thing – a thing to find out.

00:09:50.360 --> 00:09:53.320
Now, you'll certainly do that from an academic research point of view

00:09:53.320 --> 00:09:56.060
if you're doing a Ph.D. or if you're doing research in HCI.

00:09:56.066 --> 00:09:59.800
But it's also true early in a design process.

00:09:59.800 --> 00:10:05.840
Your aim is more to understand the situation than it is to make something that's going to get better

00:10:05.840 --> 00:10:07.960
or to say it's good enough.

00:10:08.380 --> 00:10:12.560
It's very easy to confuse these goals.

00:10:16.680 --> 00:10:22.520
That's why I'm telling you about them – because your goal, what you're really after, might be investigative

00:10:22.520 --> 00:10:26.058
but you might address your experiment as if it's summative – a "good enough" answer:

00:10:26.058 --> 00:10:28.840
"Yes, it was good enough."
That doesn't tell you anything.

00:10:28.840 --> 00:10:33.580
So, if I had this health application and I found that people enjoyed using it,

00:10:33.580 --> 00:10:36.200
what does that tell me? What have I learned?

00:10:36.200 --> 00:10:43.240
So, if you know *what* you're trying to
address, you can then tune your evaluation for that.

00:10:45.900 --> 00:10:48.440
And when does this process end?

00:10:48.440 --> 00:10:53.120
Evaluation could go on forever, especially if you think about these iterative processes.

00:10:53.120 --> 00:10:55.820
And that's not quite true – the summative one is when you *do* get to the end.

00:10:55.820 --> 00:10:58.720
But in these formative evaluations, when do you get to the end of that?

00:10:58.720 --> 00:11:02.260
Now, there would have been a time when you'd say, "Well, it's when we deliver the product;

00:11:02.260 --> 00:11:05.300
when it goes into shrink-wrap and we put it on shelves."

00:11:05.300 --> 00:11:07.700
Nowadays, you may have heard the term "perpetual beta":

00:11:07.700 --> 00:11:11.580
the idea that with web applications you're constantly putting them up there,

00:11:11.580 --> 00:11:17.220
tweaking them, making them better, experimenting effectively, often with real users.

00:11:17.860 --> 00:11:22.220
So, in some sense, real use is the ultimate evaluation.

00:11:22.220 --> 00:11:26.640
Because of that, as you design, one of the things you might want to think about is

00:11:26.640 --> 00:11:30.400
how you are going to get that information from use

00:11:30.400 --> 00:11:32.540
in order to help you design.

00:11:32.540 --> 00:11:36.700
In fact, last week at the Wave Classic – the surfer event –

00:11:36.700 --> 00:11:43.060
I've been involved in designing a local history application for the island that I'm on.

00:11:43.060 --> 00:11:48.760
And we were able, just in time, to get a version of this out for the Wave Classic.

00:11:48.760 --> 00:11:55.200
I know, because of the number of downloads and access to feeds from logs,

00:11:55.200 --> 00:11:58.600
that some people were using the application.

00:11:58.600 --> 00:12:03.760
But I don't know whether they used any of the history things or they just used some of the other facilities on it,

00:12:03.760 --> 00:12:07.760
because I was a bit last-minute; I didn't get a chance to get the logging in.

00:12:07.760 --> 00:12:09.720
So, I was getting real use,

00:12:09.720 --> 00:12:13.880
but I wasn't getting information to help improve the future one.

00:12:13.880 --> 00:12:18.000
So, certainly for future prototypes, we will actually have this in.

00:12:18.000 --> 00:12:26.920
But when you design, you can actually think about how you're going to gather information from real use to help you improve things.
