WEBVTT
00:00:00.000 --> 00:00:34.880
Correlation is actually quite an interesting topic,
perhaps not directly relevant to user experience in many cases, but it certainly does have its uses.
And it's where we're trying to decide whether there is any kind of relationship between two different variables.
So, for example, in user experience we might be tempted to find out whether the longer that somebody spends
on a particular page the more they spend at checkout or the more they donate to our cause
00:00:34.880 --> 00:01:02.129
or whether there's an inverse relationship, which is where the opposite would happen,
or if there's no relationship at all; so, by way of real-world example, if we look at the connection between
people's heights and weights, they *are* correlated,
but it's not an exact relationship. People who are taller also tend to be heavier; it's just the way the world works.
But height doesn't fix the weight, by any means – you can't predict somebody's weight exactly from their height.
00:01:02.129 --> 00:01:32.932
This chart is called a scattergram – it's literally every point plotted on both dimensions.
On the right-hand side here, we have heights and weights, and every dot represents a person.
It tends to show that as people get taller their weights go up.
So, that is an example of a correlation. It's also called a *scatter plot*, as I've named it in the slides.
But the word "scatter" is the important one.
And it's actually a very good way of looking at your data and trying to find out if there's anything interesting going on there.
00:01:32.932 --> 00:02:05.181
We're going to talk initially about a very common – or perhaps *the* most common – correlation test.
And it's called Pearson's Correlation Coefficient, and sometimes it's just referred to as "Pearson's r";
"r" is the name of the variable that's always used for the correlation coefficient.
And the way that it works is that r expresses the strength of the relationship
from -1, which would be perfectly inversely correlated
– that *would* mean that as the height went up the weight would go down in a very direct way
00:02:05.181 --> 00:02:30.291
– which, as you can see, is not at all the case – up to +1, which is where, as the weight goes up, the height follows directly and vice versa.
And, of course, in the middle of all that is zero, where there is no correlation. And if there is no correlation,
you'd expect to see that scatter plot or scattergram having points all over the white space,
not just towards the middle.
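As a rough illustration of the range just described, here is a minimal sketch of Pearson's r written out in plain Python. The function name `pearson_r` and the height and weight figures are made up for the example; they are not taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each pair's deviations from the two means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_deviation / math.sqrt(spread_x * spread_y)

# Invented, perfectly linear values (not real measurements).
heights_cm = [150, 160, 170, 180, 190]
rising = [50, 60, 70, 80, 90]    # goes up in step with height
falling = [90, 80, 70, 60, 50]   # goes down in step with height

print(round(pearson_r(heights_cm, rising), 3))   # prints 1.0
print(round(pearson_r(heights_cm, falling), 3))  # prints -1.0
```

Real data rarely sits exactly on a line, so in practice r lands somewhere between these two extremes.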
These are the kinds of relationships we might see.
00:02:30.291 --> 00:03:07.577
Now – certainly with Pearson's r, we're talking about *linear* relationships.
So, if the relationship isn't linear, then Pearson's may not report much of a correlation.
But we can see the sorts of things that we might get as results from Pearson's.
So, top-left-hand corner is a strong positive correlation.
A bit to the right is a weak positive correlation where the points aren't quite following the straight line that we've drawn.
The third on the right is a strong negative correlation, where the line is going down.
So, as one variable increases along the x-axis, the other decreases down the y-axis.
00:03:07.577 --> 00:03:32.416
Similarly, weak negative correlation – the points aren't quite as tightly packed around the line.
And then, moderate, and finally no correlation, which is what I mentioned as r being equal to, or approximately equal to, zero.
Now, there is a kind of rule of thumb with correlation coefficients.
And it's shown here; this is taken from "Statistics For Dummies". It is actually, in spite of its title, quite a good book on statistics.
00:03:32.416 --> 00:04:06.151
But this particular page is taken from their website.
And you can see it runs from -1 to +1, as I said.
And around about plus or minus 0.7 is a strong relationship;
0.5 moderate, and 0.3 weak.
So, anything between 0.3 and 0 is more or less not very correlated.
And anything between 0.7 and 1, whether it's positive or negative, is very strongly correlated.
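The rule of thumb above could be turned into a small helper along these lines. This is only a sketch: the function name `describe_strength` is hypothetical, and treating 0.7, 0.5 and 0.3 as hard cut-offs is an assumption made for the example, since the rule itself is only approximate.

```python
def describe_strength(r):
    """Label a correlation coefficient using the 0.7 / 0.5 / 0.3 rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        # Assumed band: below 0.3 is treated as more or less uncorrelated.
        return "little or no linear relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (0.92, -0.55, 0.3, 0.1):
    print(value, "->", describe_strength(value))
```

The bands are a convenience for talking about results, not a statistical test in themselves.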
Let's have a look at correlation in user experience. Charts are from the Google Merchandise store.
00:04:06.151 --> 00:04:32.393
The top one is to do with the length of sessions,
and the bottom one is to do with conversion value – the amount of money that people have spent at checkout.
And while we don't need these to be normally distributed for the correlation tests to work,
if you've got two extremely different-looking distributions
– maybe one's normal and one's not, or maybe they've got glitches in different places –
00:04:32.393 --> 00:05:00.922
the fact that they're different distributions means that they're not nearly as likely to be correlated.
So, looking at a particular instance, the conversion value – the very first column – is labeled zero.
That's somebody who's checked out with presumably very little; actually, it's not quite zero, is it? It's a little bit above zero:
a few pounds or dollars. There are not many of those, but if we look at a similar location in the session lengths,
the zero- to ten-second session length is the most common by a long way.
00:05:00.922 --> 00:05:32.945
And, of course, we know the reason for that, and the reason for that is that lots of people come to a website
and decide it isn't really where they wanted to be; so, they've left quite quickly.
And it needn't necessarily be a problem. We'd actually have to try this.
But the point is that we're interested in the correlation between session length and conversion value.
And there is no conversion value for people who have not gone through the checkout.
So, we almost certainly are going to be throwing out that 0- to 10-second column, anyhow.
So, we might still find some correlation there.
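The idea of throwing out sessions that never reached checkout before testing for correlation might look something like this. It is a sketch under stated assumptions: the record layout (session length in seconds, conversion value or `None`) and every figure in `sessions` are invented for illustration and are not Google Merchandise Store data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)

# Hypothetical records: (session length in seconds, conversion value or None).
# None means the visitor never went through the checkout.
sessions = [
    (5, None),
    (8, None),
    (45, None),
    (120, 18.0),
    (240, 25.0),
    (310, 22.0),
    (480, 40.0),
    (600, 38.0),
]

# Keep only sessions that have a conversion value to correlate against.
converted = [(secs, value) for secs, value in sessions if value is not None]
lengths = [secs for secs, _ in converted]
values = [value for _, value in converted]

r = pearson_r(lengths, values)
print(f"{len(converted)} of {len(sessions)} sessions kept, r = {r:.2f}")
```

Filtering first means the very short bounce sessions no longer dominate the data being compared.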
00:05:32.945 --> 00:06:01.541
Note that there are many more than just one correlation test.
Pearson's r is just the most common, most used.
Pearson's r is very much a linear correlation test, as I've mentioned;
so, if the relationship between your two variables is not linear, then it's not going to find much in the way of a correlation.
Another very common and popular test is Spearman's rank correlation coefficient, and that is *not* sensitive to linearity.
00:06:01.541 --> 00:06:27.895
It doesn't have to be a linear relationship. So, this scatter plot shows a very strong correlation result:
0.92, which is getting very close to 1.
But it's certainly not a linear relationship. You can see it's curved; *curvilinear* is the proper term; and so,
that might actually be more relevant if you're not sure exactly what kind of
relationship you're going to have; or if you look at a scattergram and you can see that it's
certainly not a linear relationship although there is some kind of correlation going on there.
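To make the contrast concrete, here is a minimal sketch that computes Spearman's coefficient as Pearson's r applied to ranks, which is the standard way of defining it. The helper names `ranks` and `spearman_rho` and the curved sample data (y = x cubed) are assumptions for the example; the data is not the 0.92 scatter plot from the slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)

def ranks(values):
    """Rank values from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r computed on the ranks of the data."""
    return pearson_r(ranks(xs), ranks(ys))

# Invented curved (but always increasing) relationship: y = x cubed.
xs = list(range(1, 11))
ys = [x ** 3 for x in xs]

print(f"Pearson:  {pearson_r(xs, ys):.3f}")
print(f"Spearman: {spearman_rho(xs, ys):.3f}")
```

Because the ranks of the two variables match exactly here, Spearman's reports a perfect correlation, while Pearson's comes out lower because the points do not lie on a straight line.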