﻿WEBVTT

00:00:02.720 --> 00:00:09.120
One of the main issues that we do need to focus on is this question of the term *statistical significance*.

00:00:14.320 --> 00:00:20.154
Basically, all that means is that if we have two different groups and we've shown them different designs

00:00:20.154 --> 00:00:25.787
or they're in two different categories – these people converted; these people didn't convert,

00:00:25.787 --> 00:00:29.783
or these people saw design A; these people saw design B –

00:00:29.783 --> 00:00:34.890
and we get *different counts* as a result, are those counts statistically significant?

00:00:34.890 --> 00:00:40.399
If we were to do this all over again with a different set of participants, would we see similar results?

00:00:40.399 --> 00:00:45.001
Now, the example of people who converted and people who didn't convert – that

00:00:45.001 --> 00:00:50.081
doesn't usually need statistical significance or separate testing

00:00:50.081 --> 00:00:54.381
because you'd be lucky to get more than about 5% of visitors converting – a split that lopsided speaks for itself.

00:00:54.381 --> 00:00:57.049
But take design A versus design B.

00:00:57.049 --> 00:01:01.348
We might see two different figures for conversion from *that* lot.

00:01:01.348 --> 00:01:05.068
And we'd want to know, well, is that meaningful?

00:01:05.068 --> 00:01:10.562
Was that a really successful design or are we just barking up the wrong tree in statistical terms?

00:01:10.562 --> 00:01:15.478
So, the whole question is whether we would get these kinds of results again or

00:01:15.478 --> 00:01:18.542
whether these results are the product of chance.

00:01:18.542 --> 00:01:21.539
So, we have to understand a bit about statistics

00:01:21.539 --> 00:01:24.400
in order to be able to know how to test this

00:01:24.400 --> 00:01:29.261
and to know what the results actually mean in terms of significance.

00:01:29.261 --> 00:01:31.281
Here's an example.

00:01:32.800 --> 00:01:38.438
So, this actually is taken straight from an example on Optimal Workshop's Chalkmark.

00:01:38.438 --> 00:01:43.391
So, Chalkmark is Optimal Workshop's version of first-click testing.

00:01:43.391 --> 00:01:49.714
And that little wireframey thing right in the middle of the screen is the thing being tested.

00:01:49.714 --> 00:01:55.038
You don't need much in terms of visual design in order to be able to do first-click testing,

00:01:55.038 --> 00:01:59.448
which this is an example of. And in this particular case – this is using their own figures –

00:01:59.448 --> 00:02:03.522
they had 60 people who clicked in what was considered the right place.

00:02:03.522 --> 00:02:10.631
So, in these early-design tests, you're often allowed to state what is considered to be success

00:02:10.631 --> 00:02:14.000
and what isn't – which is great because it means you can actually

00:02:14.000 --> 00:02:18.453
get some overall validation of what it is you're trying to do.

00:02:18.453 --> 00:02:22.692
We had, I think, a little bit more than 100 participants.

00:02:22.692 --> 00:02:26.240
We had 60 people click in the right place, and we had 42

00:02:26.240 --> 00:02:29.312
people click in the wrong place.

00:02:29.312 --> 00:02:32.927
So, those are just shown on the slide as success and failure.

00:02:32.927 --> 00:02:36.291
And we had one person skip, who we're going to ignore.

00:02:36.291 --> 00:02:39.035
Now, 60 is bigger than 42,

00:02:39.035 --> 00:02:43.504
but is it very much bigger? They're both numbers that are quite close to the middle.

00:02:43.504 --> 00:02:48.462
And if you look at the pie chart, it's clearly in favor of the successful side.

00:02:48.462 --> 00:02:53.492
There's more green than red, but it isn't exactly an overwhelming result.

00:02:53.492 --> 00:02:57.181
So, we need to run a test of statistical significance.

00:02:57.181 --> 00:02:59.567
Now, this is what's called *categorical data*.

00:02:59.567 --> 00:03:03.365
We have two counts:
one of 60 and one of 42.

00:03:03.365 --> 00:03:08.754
And, for that, we have a very well-understood and popular test called *chi square*.

00:03:08.754 --> 00:03:11.920
And we can do this very simple test even

00:03:11.920 --> 00:03:18.455
with just Excel, or with any of the many
websites that offer to run chi-square tests for you.

00:03:18.455 --> 00:03:22.968
And all we have to do is put in our counts: what we were expecting and what we got.

00:03:22.968 --> 00:03:27.409
Now, in terms of what we were expecting if it was *random* selection,

00:03:27.409 --> 00:03:31.831
if we simply flipped a coin for every participant taking part in this,

00:03:31.831 --> 00:03:34.614
we would expect a 50/50 ratio.

00:03:34.614 --> 00:03:37.606
So, that would be 102 total.

00:03:37.606 --> 00:03:40.204
So, 51 get it right

00:03:40.204 --> 00:03:42.433
and 51 get it wrong.

00:03:42.433 --> 00:03:45.897
But we've got the actual figures of 60 and 42.
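
NOTE
For reference, the arithmetic behind the test: the chi-square statistic for
these counts is
chi^2 = (60 - 51)^2 / 51 + (42 - 51)^2 / 51 = 81/51 + 81/51 ≈ 3.18,
which, with one degree of freedom, corresponds to the roughly 7% chance
quoted in the next cue.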

00:03:45.897 --> 00:03:51.310
And it turns out that there is a 7% chance of a split at least this uneven occurring *randomly*.
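
NOTE
A minimal sketch of running this test in code, using Python's scipy library
(the counts 60 and 42 are from the example above; 51/51 is the 50/50
expectation for the 102 participants):
  from scipy.stats import chisquare
  # Observed counts: 60 successes, 42 failures (the one skip excluded).
  # Expected counts under purely random clicking: an even 51/51 split.
  result = chisquare(f_obs=[60, 42], f_exp=[51, 51])
  print(result.statistic)  # ~3.18
  print(result.pvalue)     # ~0.075, i.e. the ~7% chance quoted above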

00:03:51.310 --> 00:03:55.404
And that really isn't good enough.
In user experience, as in a lot

00:03:55.404 --> 00:03:59.476
of social research, we tend to use a figure of 95%,

00:03:59.476 --> 00:04:03.104
leaving a 5% chance of random occurrence.

00:04:03.104 --> 00:04:05.532
Here, we're only 93% certain

00:04:05.532 --> 00:04:08.033
that this is not random.
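
NOTE
The decision rule just described, as a minimal sketch (0.05 is the
conventional threshold implied by the 95% figure; the p-value is the one
from the chi-square test above):
  alpha = 0.05      # 95% confidence level used in UX and social research
  p_value = 0.0747  # ~7% chance of this split arising randomly
  print(p_value < alpha)  # False: not statistically significant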

00:04:08.033 --> 00:04:11.870
And so, we would actually say this is not statistically significant.

00:04:11.870 --> 00:04:17.202
And that's the kind of thing that we need to do with a very large proportion of the results that we're looking at

00:04:17.202 --> 00:04:20.064
in all kinds of tools, if not all of them.

00:04:20.064 --> 00:04:23.933
In some cases – you know – if you've got "90% of users did this",

00:04:23.933 --> 00:04:28.018
then you probably can get away without running a separate significance test.
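
NOTE
To illustrate why a result as lopsided as "90% of users did this" rarely
needs a separate test, a hypothetical 90/10 split of 100 participants,
reusing the scipy call from the earlier note:
  from scipy.stats import chisquare
  # With no f_exp given, chisquare assumes an even split (50/50 here).
  print(chisquare(f_obs=[90, 10]).pvalue)  # ~1e-15, overwhelmingly significant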

00:04:28.018 --> 00:04:32.030
But that's the kind of thing that we need to do, and we'll be talking about how to do these things.

00:04:32.030 --> 00:04:36.398
So, what conclusions can we reach with this particular result?

00:04:36.398 --> 00:04:38.422
Is the wireframe a poor design?

00:04:38.422 --> 00:04:42.719
Well, probably – people really aren't doing terribly well;

00:04:42.719 --> 00:04:48.473
we're not doing significantly better than random – than just flipping a coin.

00:04:48.640 --> 00:04:53.651
So, there isn't a large enough difference between the people who got it right and the people who got it wrong.

00:04:53.651 --> 00:04:58.056
And one of the things that we need to 
be concerned about is *how engaged users were*.

00:04:58.056 --> 00:05:00.643
Were they actually following the instructions?

00:05:00.643 --> 00:05:04.816
Or were they just clicking blindly in order to be able to get through to the end of the study

00:05:04.816 --> 00:05:11.048
so they could claim compensation – some kind of remuneration for having taken part in the study,

00:05:11.048 --> 00:05:17.146
which I'm afraid is something that happens – not all that frequently, but it's certainly a known hazard.

00:05:17.146 --> 00:05:21.762
And you could expect in any particular study that maybe 5–10% of your participants

00:05:21.762 --> 00:05:25.450
will not actually be paying attention to what they're doing.

00:05:25.450 --> 00:05:31.116
So, we need to make sure that we're looking at clean data and that we don't have other sources of noise.

00:05:31.116 --> 00:05:35.521
And, of course, one of the sources of noise is just going to be our choice of terminology.

00:05:35.521 --> 00:05:39.000
And if we're using words that users *aren't* quite certain about,

00:05:39.000 --> 00:05:42.400
then, yes, we might expect half of them to get it wrong.
