Before we start talking about algorithms for clustering, we're going to need some performance metrics. Let me start with the good news, which is that if we have a classification, then that implies a clustering. A clustering is just a division of the samples into subsets, and we can choose the subsets to be the class memberships of the classification, say by colors: all the blues end up in one cluster, all the reds end up in another cluster, and so on. So that does give us some test sets we can work with. The bad news is that there are no perfect measurements in this business; they all come with warning labels.

In order to make them easier to talk about, I'm going to define some terminology that you won't see in most books. If I have a pair of samples, I'm going to call them buddies if they're in the same cluster, and strangers if they're in different clusters.

Our first clustering metric is the Rand index. The Rand index is used for comparing two clusterings, so if we do have a reference or ground truth available, we can compare to it. Let me write out the formula first and then explain it:

RI = (b + s) / (n choose 2).

In this formula, b is the number of pairs that are buddies in both clusterings, and s is the number of pairs that are strangers in both. The denominator is what's known as a binomial coefficient. If you haven't seen the notation before, it's read as "n choose two": the number of unique ways of choosing two objects out of a set of n of them. In general, if n and k are nonnegative integers, then n choose k is given by n! / (k! (n - k)!), a formula from combinatorics. It's especially easy when k equals two: n choose 2 = n(n - 1)/2.

Now the Rand index has some nice advantages. It lies between 0 and 1, where 0 means the two clusterings disagree about all the pairs and 1 means they agree about all the pairs. It's symmetric in the clusterings, so I don't have to designate one of them as the reference. I don't need to find a correspondence between the clusterings; in other words, I don't have to figure out whether this cluster over here represents poodles or Labradors compared to the reference classification. In fact, I don't even need to have the same number of clusters in the two clusterings. And it's easy to interpret: one way of interpreting it is as the accuracy of a binary classifier, where "positive" means a pair is in the same cluster and "negative" means a pair is in different clusters. In the formula above, b would be the number of true positives and s the number of true negatives.

For example, let's say I have five samples and I'm given a classification indicated by the colors. Now a clustering algorithm comes along and puts them into three different clusters, and we want to evaluate the Rand index comparing these two clusterings. To do that, I'm going to make a table that lets me look at every pair of points; we only need to work on the upper triangle of the table. At each spot I'll put a capital B if the pair are buddies in both clusterings, a capital S if they are strangers in both clusterings, and an N if they are neither, in other words, a disagreement. For X1 and X2: they have the same color, so they're buddies in that clustering, and they are in the same cluster as drawn here, so that's a B. X1 and X3 have different colors and are in different clusters, so they are strangers in both: S. Same with X1 and X4: different colors, different clusters. X1 and X5 have the same color but are in different clusters, so there's no agreement about whether they're buddies or strangers, and we put an N there. Moving on: X2 and X3, different colors, different clusters, S. X2 and X4, different colors, different clusters, S. X2 and X5, same color, different clusters, so N. Next row: X3 and X4, same color, same cluster, B. X3 and X5, different colors, different clusters, strangers in both. Finally, X4 and X5, different colors, different clusters, strangers in both. For the Rand index I add up all of the B's and S's: there are two B's and six S's, and there are 10 pairs total, so the Rand index is 0.8.
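As a quick check on that worked example, here is a small sketch in Python. The class labels and cluster labels are reconstructed from the walkthrough above (two colors, and the clusters {x1, x2}, {x3, x4}, {x5}); the rand_index helper function is just for illustration, and the result is compared against scikit-learn's rand_score.

```python
from itertools import combinations
from sklearn.metrics import rand_score

# Labels reconstructed from the worked example above
classes = [0, 0, 1, 1, 0]     # the two "colors"
clusters = [0, 0, 1, 1, 2]    # three clusters: {x1, x2}, {x3, x4}, {x5}

def rand_index(u, v):
    """Fraction of pairs that are buddies in both labelings or strangers in both."""
    agree, total = 0, 0
    for i, j in combinations(range(len(u)), 2):
        buddies_u = u[i] == u[j]
        buddies_v = v[i] == v[j]
        agree += buddies_u == buddies_v   # a B or an S counts as agreement
        total += 1
    return agree / total

print(rand_index(classes, clusters))   # 0.8, counting pairs by hand
print(rand_score(classes, clusters))   # 0.8, the same value from scikit-learn
```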
For all of its virtues, the Rand index does have one big problem. Even if you have a nice classification that you want to use as a reference, and then you choose a clustering completely at random, the score might still come out pretty close to one. In other words, it makes things look similar that we wouldn't necessarily think of as similar, because even a random assignment will, on average, do a pretty good job. The adjusted Rand index tries to take that into account. Here's how it's computed:

ARI = (RI - E[RI]) / (max RI - E[RI]).

Here RI is the Rand index, and E[RI] is the expected value: the average of the Rand index over all possible clusterings. That sounds like a nightmare, but you can actually compute it exactly using combinatorics. The max RI term is the maximum possible Rand index over all clusterings compared to the reference, and that too can be worked out exactly. The formulas are long and there's not much you'll learn from looking at them, but what this does is recalibrate the Rand index so that 0 means no better than random and 1 means complete agreement. It is possible to get negative values; you can do worse than random.

The adjusted Rand index is pretty good if you do have a reference to work from. There are alternatives to it, and they all have their caveats and their advantages, but the adjusted Rand index is pretty much as good as any of them most of the time.
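To see why the adjustment matters, here is a hypothetical sketch comparing a reference labeling to a completely random cluster assignment. The sample sizes and number of clusters here are made up for the demonstration; rand_score and adjusted_rand_score are the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)

# A made-up reference labeling: 300 samples in 10 equal classes
reference = np.repeat(np.arange(10), 30)

# A clustering chosen completely at random, also with 10 clusters
random_clustering = rng.integers(0, 10, size=reference.size)

print("Rand index:         ", rand_score(reference, random_clustering))
print("Adjusted Rand index:", adjusted_rand_score(reference, random_clustering))
# Typically the Rand index comes out around 0.8 here even though the
# clustering is random, while the adjusted Rand index is close to 0.
```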
The problem comes up when you don't have a real reference, when you don't really have a ground truth. And that happens a lot: if I'm trying to do a clustering, I probably don't already have classifications for these things; otherwise I would just do a classification. So we need some intrinsic measurements as well, and these are harder to come by. I'll just talk about one of them: silhouette values, or silhouette scores. The words value, score, and coefficient aren't used consistently in different places, so you sometimes have to dig underneath to see exactly what they're referring to.

The idea is that we have our sample points, and for each sample x_i we define b̄_i as the mean distance between x_i and all of its buddies, that is, all the other samples in the same cluster. We also have r̄_i, the mean distance between x_i and the strangers in the nearest cluster: I can compute the mean distance between x_i and each cluster of strangers and then take the smallest of those means; that's r̄_i. Finally, the silhouette value is

s_i = (r̄_i - b̄_i) / max(b̄_i, r̄_i).

Just by the definition of that denominator, the score is between -1 and 1. A value near -1 means x_i doesn't look much like a member of the cluster it's supposed to be in; it's actually closer to some cluster of strangers. A value near 1 is great: it means x_i would rather hang out with its buddies than with any group of strangers. And of course it can be anything in between.

Now, that number is assigned to each and every sample, and then there are different things that can be done with it. You can take an average over each cluster, or an average over all the samples. As usual, you have the usual caveats about taking averages: if the distributions aren't symmetric, averages might not be the best idea, and you can take medians or do whatever you think is appropriate.

For example, let's say these are two clusters of one-dimensional sample points, -2 and 0 in one cluster and 3 and 7 in the other, so those are my x1, x2, x3, and x4. Again I'm going to make a table. What's the average distance between -2 and all of its buddies? Well, there's only one buddy, so I'm just taking the average of the single number 2. There's only one other cluster to consider, so it must be the nearest one. What's the average distance between -2 and those two strangers? The distance from -2 to 3 is 5, and the distance from -2 to 7 is 9, so you add those and divide by 2 to get 7. Then s_i is r̄_i minus b̄_i divided by the maximum of the two, which here is (7 - 2)/7 = 5/7.

For 0, the average distance to its buddy is again 2, and the average distance to the members of the other cluster is the average of 3 and 7, which is 5. So the silhouette value is (5 - 2)/5 = 3/5.

For 3, the average distance to its buddy is 4, and the average distance to the other cluster is the average of 5 and 3, which is also 4, so the silhouette value is 0.

And finally for 7, the average distance to its buddy is 4, and the average over the strangers is the average of 9 and 7, which is 8, so I get (8 - 4)/8, which is 1/2. You can see the scores are a little bit lower for the inner points, because they feel a little more attraction to the other cluster, whereas 7 is much happier being with 3 than with the other two.
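As a quick check on that arithmetic, here is a small sketch using scikit-learn's silhouette_samples on the same four one-dimensional points, with the cluster labels implied by the example above.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# The four 1-D points from the worked example, as a column of features
X = np.array([[-2.0], [0.0], [3.0], [7.0]])
labels = np.array([0, 0, 1, 1])   # clusters {-2, 0} and {3, 7}

print(silhouette_samples(X, labels))
# Expected: [0.71428571  0.6  0.  0.5], i.e. 5/7, 3/5, 0, and 1/2
```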
To try out these clustering metrics, I'm going to start with a toy dataset that has just two features and a given classification, chosen to correspond to blobs in two-dimensional space. The classes of these points are pretty natural, except maybe for one or two of them. To compute the silhouette values, I'm going to import silhouette_samples and call it with the feature matrix, because I need the positions of the points to compute these values, and with the classification, because I need a clustering to use in order to compute the silhouettes. I assign the result to another column in my data frame. If we plot again, but now using the silhouette score as the size of each dot, you can see that most of these dots have a nice, healthy silhouette score, but there are some small ones, and even some negative silhouette scores among those small points. To take another look, forgetting about how they're distributed in space, I can do a plot by class showing the distribution of the silhouette values for each class. For class 0, that's the one at the upper left, all the scores are good, and most of them are very good. For class 1, most of the scores are at least 0.25 or so, but there are some outliers that stretch down even into negative values. And for class 2, maybe there's a slightly more extreme outlier, but still not a lot of them. We can go ahead and compute the silhouette averages grouping by class if we like, but again, to what extent do these averages represent such skewed distributions? They're definitely affected by the outliers; the median values are closer to 0.6.

Now let's make a different assignment to clusters. Suppose we just assign each point according to the quadrant of the plane that it sits in, so the axes split the plane into four quadrants. When I compute the silhouette samples now, I use the quadrants as the labels for the clustering. We have a fourth cluster, because the points lie in all four quadrants, and there are some more points that have small silhouette values. If I group them and look at the averages, they're really not looking better than they were before; in fact, I've split one of the original clusters into two, and instead of 0.53 those values got worse. In any case, now that I have two different clusterings, I can compute the adjusted Rand index between them; it's adjusted_rand_score in the scikit-learn naming. What I need to do is give it the two clusterings, and it doesn't matter in which order: one coming from the classes and the other coming from the quadrants. They agree at a pretty high level.

Next we're going to look at a famous dataset on handwritten digits. What we've got here is 64 feature columns, each representing an average grayscale intensity value, and then a classification as to what digit was written by the person who originally created the sample. What I'm going to do is separate this into features and labels. Rather than work with all ten digits, I want to keep only the digits 4, 5, and 6. Then I'll see how many I have of each label: about an equal number of fours, fives, and sixes. In essence, these rows actually represent pictures, and we can plot them; I just have to reshape the 64 columns into an eight-by-eight picture whose values become intensities. These, for instance, are some of the sixes. We're going to be testing different clustering methods on these, but first of all we want to know what should be expected: do the classifications have anything to do with clusters? Would clusters reproduce them pretty well? So we look at the silhouette values on all the samples, using the classification as the clustering. I'm going to reinterpret the classification as a categorical variable, so that when I take the mean by digit, you can see that the mean for handwritten fours is not so great, fives are a little better, and sixes a little better still. And again, if we want to look beyond just the averages, here are the distributions. They don't look quite so bad: there are some long outlier tails that go down low, but there are also a healthy number that are positive and well away from zero.
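Here is a rough sketch of that last workflow in Python. I'm assuming scikit-learn's built-in load_digits can stand in for the handwritten-digit data used here (the actual data source in the lecture may differ), and the data-frame column names are just for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_samples

# Load the 8x8 handwritten digits (64 grayscale features per sample)
digits = load_digits()
X, y = digits.data, digits.target

# Keep only the digits 4, 5, and 6
keep = np.isin(y, [4, 5, 6])
X, y = X[keep], y[keep]

# Use the classification itself as the clustering and score every sample
df = pd.DataFrame({"digit": pd.Categorical(y)})
df["silhouette"] = silhouette_samples(X, y)

# Mean and median silhouette value for each digit
print(df.groupby("digit", observed=True)["silhouette"].agg(["mean", "median"]))
```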
Clustering performance
From Tobin Driscoll, April 15, 2022