﻿WEBVTT

00:00:07.840 --> 00:00:10.874
Correlation is actually quite an interesting topic,

00:00:10.874 --> 00:00:16.954
perhaps not directly relevant to user experience in many cases, but it certainly does have its uses.

00:00:16.954 --> 00:00:23.249
And it's where we're trying to decide whether there is any kind of relationship between two different variables.

00:00:23.249 --> 00:00:29.440
So, for example, in user experience we might be tempted to find 
out whether the longer that somebody spends

00:00:29.440 --> 00:00:34.880
on a particular page the more they spend at 
checkout or the more they donate to our cause  

00:00:34.880 --> 00:00:38.797
or whether there's an inverse relationship, which 
is where the opposite would happen,

00:00:38.797 --> 00:00:44.560
or if there's no relationship at all; so, by way of real-world example, if we look at the connection between

00:00:44.560 --> 00:00:48.896
people's heights and weights, they *are* correlated,

00:00:48.896 --> 00:00:56.856
but there's no real direct relationship. People who are taller also tend to be heavier; it's just the way the world works.

00:00:56.856 --> 00:01:02.129
But that doesn't define the weight, by any means – you can't predict somebody's weight exactly from their height.

00:01:02.129 --> 00:01:08.023
This is called a scattergram – it's literally every 
point plotted on both dimensions.

00:01:08.023 --> 00:01:13.246
On the right-hand side here, we have heights and weights, and every dot represents a person.

00:01:13.246 --> 00:01:17.386
It tends to show that as people get taller their weights go up.

00:01:17.386 --> 00:01:23.434
So, that is by way of a correlation.
It's also called a *scatter plot* as I've named it in the slides.

00:01:23.434 --> 00:01:26.952
But the word "scatter" is the important one.

00:01:26.952 --> 00:01:32.932
And it's actually a very good way of looking at your data and trying to find out if there's anything interesting going on there.

00:01:33.000 --> 00:01:39.240
We're going to talk initially about a very common – or perhaps *the* most common – correlation test.

00:01:39.240 --> 00:01:45.615
And it's called Pearson's Correlation 
Coefficient, and sometimes it's just referred to as "Pearson's r";

00:01:45.615 --> 00:01:50.333
"r" is the name of the variable that's 
always used for the correlation coefficient.

00:01:50.333 --> 00:01:54.991
And the way that it works is that r expresses the strength of the relationship

00:01:54.991 --> 00:01:59.288
from -1, which would be perfectly inversely correlated

00:01:59.288 --> 00:02:05.181
– that *would* mean that as the height went up the weight would go down in a very direct way

00:02:05.181 --> 00:02:14.640
– which, as you can see, is not really at all the case – or up to one, +1, which is where the weight goes up, the height follows directly and vice versa.

00:02:14.640 --> 00:02:18.908
And, of course, in the middle of all that is zero,
where there is no correlation. And if there is no correlation,

00:02:18.908 --> 00:02:24.699
you'd expect to see that scatter plot 
or scattergram having points all over the white space,

00:02:24.699 --> 00:02:27.268
not just towards the middle.
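
NOTE
A minimal sketch of how r could be computed, assuming plain Python and
height/weight figures invented purely for illustration (they are not the data
on the slide). It shows a perfectly rising line giving +1, a perfectly falling
line giving -1, and scattered but related points landing somewhere in between.
```python
import math
def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return cov / math.sqrt(var_x * var_y)
# Hypothetical heights (cm) and weights (kg), invented for this example
heights = [150, 160, 165, 170, 180, 190]
weights = [52, 58, 70, 66, 80, 85]
r_up = pearson_r(heights, [2 * h + 1 for h in heights])       # straight line, rising
r_down = pearson_r(heights, [-3 * h + 500 for h in heights])  # straight line, falling
r_people = pearson_r(heights, weights)                        # related, but scattered
print(round(r_up, 3), round(r_down, 3), round(r_people, 3))
```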

00:02:27.268 --> 00:02:30.291
The nature of the relationships is as follows.

00:02:30.291 --> 00:02:36.119
Now – certainly with Pearson's r, we're talking about *linear* relationships.

00:02:36.119 --> 00:02:41.011
So, if it's not linear, then Pearson's won't think it's correlated.

00:02:41.011 --> 00:02:44.931
But we can see the sorts of things that we might get as results from Pearson's.

00:02:44.931 --> 00:02:49.089
So, top-left-hand corner is a strong positive correlation.

00:02:49.089 --> 00:02:54.955
A bit to the right is a weak positive correlation where the points aren't quite following the straight line that we've drawn.

00:02:54.955 --> 00:02:59.930
The third on the right is a strong negative correlation,
where the line is going down.

00:02:59.930 --> 00:03:07.577
So, as one variable increased along the 
x-axis, the other decreased down the y-axis.

00:03:07.577 --> 00:03:12.964
Similarly, weak negative correlation – the points aren't quite as tightly packed around the line.

00:03:12.964 --> 00:03:20.448
And then, moderate, and finally no correlation, which is what I mentioned as being r equals or approximately equal to zero.

00:03:20.448 --> 00:03:24.885
Now, there is a kind of rule of thumb with correlation coefficients.

00:03:24.885 --> 00:03:32.416
And it's shown here; this is taken from "Statistics For Dummies". It is actually, in spite of its title, quite a good book on statistics.

00:03:32.416 --> 00:03:35.572
But this particular page is taken from their website.

00:03:35.572 --> 00:03:38.600
And you can see it runs from -1 to +1, as I said.

00:03:38.600 --> 00:03:43.121
And around about plus or minus 0.7 is a strong relationship;

00:03:43.121 --> 00:03:46.240
0.5 moderate, and 0.3 weak.

00:03:46.240 --> 00:03:51.558
So, anything between 0.3 and 0 is more or less not very correlated.

00:03:51.558 --> 00:03:58.206
And anything between 0.7 and 1, whether it's positive or negative, is very strongly correlated.
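
NOTE
The rule of thumb above could be sketched as a small helper. Treating 0.3,
0.5 and 0.7 as hard cut-offs is an assumption made for this example; the
function name describe_strength is hypothetical, and the thresholds are
guidelines rather than sharp boundaries.
```python
def describe_strength(r):
    """Label a correlation coefficient using the 0.3 / 0.5 / 0.7 rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        return "little or no correlation"
    # The sign only tells us the direction of the relationship
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"
for r in (0.92, 0.55, -0.35, 0.1, -0.85):
    print(r, describe_strength(r))
```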

00:04:00.456 --> 00:04:06.151
Let's have a look at correlation in user experience.
Charts are from the Google Merchandise store.

00:04:06.151 --> 00:04:08.916
The top one is to do with the length of sessions,

00:04:08.916 --> 00:04:15.244
and the bottom one is to do with conversion value – the amount of money that people have spent at checkout.

00:04:15.244 --> 00:04:22.095
And while we don't need these to be normally 
distributed for the correlation tests to work,

00:04:22.095 --> 00:04:27.519
if you've got two 
extremely different-looking distributions

00:04:27.519 --> 00:04:32.393
– maybe one's normal and one's not, or maybe they've got glitches in different places –

00:04:32.393 --> 00:04:37.237
the fact that they're different distributions means that they're not nearly as likely to be correlated.

00:04:37.237 --> 00:04:43.278
So, looking at a particular instance, the conversion value – the very first column – is labeled zero.

00:04:43.278 --> 00:04:49.190
That's somebody's checked out with presumably very little; actually, it's not quite zero, is it? It's a little bit above zero:

00:04:49.190 --> 00:04:56.106
a few pounds or dollars. There are not many of those, but if we look at a similar location in the session lengths,

00:04:56.106 --> 00:05:00.922
the zero- to ten-second session 
length is the most common by a long way.

00:05:00.922 --> 00:05:05.600
And, of course, we know the reason for that, and the reason 
for that is that lots of people come to a website  

00:05:05.600 --> 00:05:10.982
and decide it isn't really where they wanted to be;
so, they've left quite quickly.

00:05:10.982 --> 00:05:15.554
And it needn't necessarily be a problem.
We'd actually have to try this.

00:05:15.554 --> 00:05:20.793
But the point is that we're interested in the correlation between session length and conversion value.

00:05:20.793 --> 00:05:24.442
And there is no conversion value for people who have not gone through the checkout.

00:05:24.442 --> 00:05:29.357
So, we almost certainly are going to be throwing out that 0- to 10-second column, anyhow.

00:05:29.357 --> 00:05:32.945
So, we might still find some correlation there.
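
NOTE
A rough sketch of that filtering step, assuming each session is a record with
a duration and an optional conversion value. The field names and every figure
below are invented for illustration and are not taken from the Google
Merchandise store charts.
```python
import math
# Hypothetical sessions: "value" is None when the visitor never reached checkout
sessions = [
    {"seconds": 4, "value": None},
    {"seconds": 8, "value": None},
    {"seconds": 95, "value": 12.0},
    {"seconds": 180, "value": 25.5},
    {"seconds": 240, "value": 19.0},
    {"seconds": 420, "value": 48.0},
    {"seconds": 9, "value": None},
    {"seconds": 600, "value": 61.0},
]
# Keep only the sessions that have a conversion value to correlate against
purchases = [s for s in sessions if s["value"] is not None]
lengths = [s["seconds"] for s in purchases]
values = [s["value"] for s in purchases]
def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
r = pearson_r(lengths, values)
print(f"{len(purchases)} of {len(sessions)} sessions kept, r = {r:.2f}")
```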

00:05:32.945 --> 00:05:38.541
Note that there are many more than just one correlation test.

00:05:38.541 --> 00:05:41.763
Pearson's r is just the most common, most used.

00:05:42.000 --> 00:05:45.543
Pearson's r is very much a linear correlation test, as I've mentioned;

00:05:45.543 --> 00:05:52.480
so, if the relationship between your two variables is not linear, then it's not going to find much in the way of a correlation.

00:05:52.480 --> 00:06:01.541
Another very common and popular test is called Spearman's Rank Correlation Coefficient, and that is *not* sensitive to linearity.

00:06:01.541 --> 00:06:07.154
It doesn't have to be a linear relationship. So, this scatter plot shows a very strong correlation result:

00:06:07.154 --> 00:06:10.166
0.92, which is getting very close to 1.

00:06:10.166 --> 00:06:15.675
But it's certainly not a linear relationship. You can see it's curved; curvilinear is the proper term; and so,

00:06:15.675 --> 00:06:19.280
that might actually be more relevant 
if you're not sure exactly what kind of  

00:06:19.280 --> 00:06:23.120
relationship you're going to have; or if you look at a scattergram and you can see that it's  

00:06:23.120 --> 00:06:27.895
certainly not a linear relationship although there 
is some kind of correlation going on there.
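
NOTE
To illustrate the difference, here is a minimal sketch that computes the
Spearman coefficient as Pearson's r applied to ranks, with tied values sharing
an average rank. The helper names and the exponential data are assumptions
made for this example; they are not the data behind the 0.92 scatter plot.
```python
import math
def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks
def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))
# Hypothetical curved relationship: y always rises with x, but not in a straight line
xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]
r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```
Because the ranks of x and y line up exactly here, the rank-based coefficient
reaches its maximum while Pearson's r is pulled down by the curvature.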
