Our last clustering algorithm is called DBSCAN, which stands for density-based spatial clustering of applications with noise. It has some interesting features, and we need a few definitions to understand how it works.

First of all, there are two hyperparameters: epsilon, which is just a positive number, and N_min, which is an integer. We define the neighborhood of a point to be all points, including the point itself, at distance epsilon or less. Sometimes it's called the epsilon-neighborhood of the point: just everything within a little ball. Next we have the idea of a core point, which is a point whose epsilon-neighborhood contains at least N_min points. The intent of this requirement is that if you're at the edge of a bunch of points, you may not be considered part of the core of the cluster. For example, say these are our points, N_min equals 3, and this circle has radius epsilon. If I center the circle at a point in the middle of a bunch of other points, it's definitely going to have a neighborhood with at least three members. But as I get out toward the edges, these points may not qualify as core points, and certainly not these loners. Let's say I've colored in all the core points. I haven't checked them all rigorously, so there may be a few mistakes in there, but for the sake of argument, say the green ones are all core points.

Now we start building a cluster. I take a core point at random, say this one, and give it a cluster assignment. Then I look at its neighborhood and add to the cluster all the points in it; in this case, that's four additional points. For each of those points that is itself a core point, I play the same game: I look at its neighborhood and add all of its members. Looking at this point, for example, I'd want to include this one in the cluster, and I'll include this one as well. But that one is not a core point, so I don't take its neighbors into the cluster; I only add the neighbors of core points that I've already added. So if I add this core point here, I'm going to get this core point at the bottom, and when I add that one's neighborhood, we get this one, and so on. I keep doing that until I run out of new points to add to the cluster. Then that cluster is finished, and I start a new one by picking a core point that I haven't already assigned somewhere else. Say it's this one. I play the same game: I add its neighbors, then the neighbors of those points that are also core, and I keep going until I run out of new points to add.

Now, the points that never get selected for any cluster are called noise points. That's an interesting feature of DBSCAN: it identifies points that it does not assign to any cluster. The circled points that do belong to clusters but are not themselves core points are called border points. The way the algorithm is set up, core points will always belong to the same cluster no matter what order you do the assignments in, but border points might shift from one cluster to another depending on the ordering. For example, if I had started my cluster assignments over here with this one, then I would have pulled in this point, and then this border point. So border points might go either way in some cases, depending on which core points you select first when creating clusters.
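To make these definitions concrete, here's a minimal sketch of the epsilon-neighborhood and core-point tests in Python. The helper names are mine, not from the lecture, and I'm assuming the points are stored as the rows of a NumPy array.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def neighborhoods(X, eps):
    """For each point (row of X), the indices of all points,
    including itself, at distance eps or less."""
    D = pairwise_distances(X)
    return [np.where(row <= eps)[0] for row in D]

def core_mask(X, eps, n_min):
    """Boolean mask: True where the eps-neighborhood has at least
    n_min members, i.e., where the point is a core point."""
    return np.array([len(nb) >= n_min for nb in neighborhoods(X, eps)])
```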
Looks like I missed one; let me put a point here. Now let's go through this smaller example more deliberately. In this case I'll say N_min is 4, and epsilon is a length that's almost two units on the grid. That means that from this point, the four points adjacent to it are close enough to be part of the neighborhood, as are the four diagonal corners, but nothing else could be in that point's neighborhood. So, for instance, returning to this point: it has a neighborhood of five, so it is a core point. This one has a neighborhood of only three, and this one also three. This one has a neighborhood of 1, 2, 3, 4, 5, so it's core. This one has a neighborhood of 1, 2, 3, 4, so it's core. This one has a neighborhood of just three. This one has a neighborhood of size five; this one, three; this one, two. This one has five, this one has four, and so on. Here are all of our core points.

Now we start making clusters; we'll see this growing loop written out in code below. I'll pick this one to start and add its entire neighborhood, which is three additional points. For every one of those that is a core point, I add its neighborhood too. Coming down here, these two points are in the neighborhood, and that's it: I've included the neighborhoods of all the core points so far. I don't include this one's neighborhood, because it's not a core point. Now I move on to a new cluster. Say I start with this one: I add everybody in its neighborhood, and then the neighborhoods of those points that are core. There are my clusters. I have no noise points and five border points.

A lot depends in this algorithm on the two hyperparameters, epsilon and N_min. As epsilon increases, we're making the neighborhoods larger and larger, so more and more points have enough neighbors to qualify as core points. If epsilon is really big, then everybody's neighborhood includes everyone, you get just one big cluster, and you lose all meaningful information: everybody belongs to a cluster, there's no noise, and you haven't learned anything. On the other hand, as you make epsilon really small, neighborhoods contain fewer points, and potentially you're going to get more clusters and more noise. If you increase N_min, the picture sort of flip-flops: increasing N_min makes it harder to be a core point, so you're likely to get more clusters and more noise. And if you take N_min all the way down to one, then whatever epsilon is, every point is its own neighbor, so every point is a core point, and you're back to something that's not very informative.

Sometimes you have considerations from the application that give you clues about how to choose these numbers, but a lot of the time there's trial and error involved. It's not quite the same as classification, where we can use cross-validation, because clustering isn't meant to be a predictive algorithm the way classification is. If a point wasn't part of the sample to begin with, you wouldn't really consider it a core point; and if it turned out to be close to a whole bunch of other points, would that retroactively make it core? That doesn't make a whole lot of sense.
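The cluster-growing loop walked through above can be sketched as a simple frontier expansion over core points, reusing the neighborhoods helper from the earlier sketch. This is just the logic; scikit-learn's DBSCAN (where N_min is called min_samples) implements the same idea far more efficiently.

```python
import numpy as np

def dbscan_sketch(X, eps, n_min):
    nbhds = neighborhoods(X, eps)          # helper from the previous sketch
    core = np.array([len(nb) >= n_min for nb in nbhds])
    labels = np.full(len(X), -1)           # -1 = unassigned; stays -1 for noise
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1 or not core[i]:
            continue                       # only unassigned core points seed a cluster
        labels[i] = cluster
        frontier = [i]
        while frontier:                    # grow until no new points are added
            j = frontier.pop()
            if not core[j]:
                continue                   # border points join but don't expand the cluster
            for k in nbhds[j]:
                if labels[k] == -1:        # the first cluster to reach a point keeps it,
                    labels[k] = cluster    # which is why border points are order-dependent
                    frontier.append(k)
        cluster += 1
    return labels
```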
So for the most part, you just have to look at silhouettes or other kinds of cluster scores and decide which epsilon and N_min are giving you the best results.

Let's take a look at DBSCAN on an artificial example. All of this code just sets up some points. Again I've got a ring around a blob, and this time I've made sure that there are some points in a kind of no-man's-land as well, so it's pretty challenging. First let's look at all the pairwise distances between these points. I'll define a little function that finds, for each row of the distance matrix, the entries where the distance is within eps, our epsilon value. For example, if I choose a small epsilon of 0.02, then most of the time each point is the only thing in its own neighborhood. Looking at the first five results: point 0's only neighbor is itself, point 1's only neighbor is itself, and so on. Next I'll take the length of each of these lists, which is the number of neighbors for each point. You can see I have 488 of these little singleton neighborhoods and only 12 with any overlap at all; it's a tiny epsilon. When I run DBSCAN with this value of epsilon, even for a small N_min, the labels_ property of the fitted object holds the cluster assignments, and looking at its value counts, the cluster number negative one means noise. It would be a mistake to think of that as a cluster: those aren't points that have anything in common with each other, they're points that are outliers to all the clusters. Beyond that, we found six tiny little clusters of pairs, essentially points that just happened to be very close together. This is not a useful clustering.

Let's go to the other extreme and choose a large value, so now epsilon will be 2. In that case, looking at the distribution of the number of neighbors, all points seem to have at least 40 neighbors, because epsilon is so large. That means that even if N_min is as large as 40, everything is a core point, and everything gets assigned to one big mega-cluster. Again, not very useful. So we've got a Goldilocks situation: we want something between 0.02 and 2. I'll try an epsilon of 0.05 and an N_min of 4. Now we have something we can work with: there are three clusters, two of them pretty big and one of them itty-bitty, and there are eight noise points as well. If I plot the original points colored according to cluster assignment, you can see we've done pretty well already. We do have this extra little spurious cluster out here, as well as the blue noise points. I can get rid of the spurious cluster simply by requiring that there be more points in a cluster: if I take N_min up to 5 at the same value of epsilon, those points are considered noise rather than part of a cluster.
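Here's a hedged reconstruction of that experiment with scikit-learn. The data generation (a 300-point ring around a 200-point blob, with an arbitrary seed) is my guess at the setup; the epsilon and N_min settings are the ones from the narration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)             # arbitrary seed (my assumption)
theta = rng.uniform(0, 2 * np.pi, 300)
ring = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.04 * rng.normal(size=(300, 2))
blob = 0.15 * rng.normal(size=(200, 2))    # blob sitting inside the ring
X = np.vstack([ring, blob])

# Neighbor counts at a tiny epsilon: almost every point is a singleton.
D = pairwise_distances(X)
print(np.bincount((D <= 0.02).sum(axis=1)))

# The four (eps, N_min) settings walked through above.
for eps, n_min in [(0.02, 2), (2.0, 40), (0.05, 4), (0.05, 5)]:
    labels = DBSCAN(eps=eps, min_samples=n_min).fit(X).labels_
    print(eps, n_min, dict(zip(*np.unique(labels, return_counts=True))))
```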
Now let's try DBSCAN on one of our familiar test sets, the handwritten digits, where we'll try to identify just three of the numerals: 4, 5, and 6. This is a bit of a complicated experiment. I'm going to let epsilon range from 20 to 26; I got those values from a little advance experimenting. I set N_min equal to 4 and then use DBSCAN to fit. The labels I get as a result contain the cluster assignments, and noise is a Boolean index that tells me which points were noise points.

I want to count how many clusters there are, so I take all of the cluster labels that are not noise, reduce them to the unique set of labels, and take the length of that set; that's the number of non-noise clusters. When I compute a silhouette score, things get a little tricky. First of all, if there's only one cluster, the silhouette score will throw an error, so in that case I record NaN (not a number) instead. If I do have more than one cluster, I compute the silhouette score, but it wouldn't make sense to treat the noise points like a cluster: they would have very low silhouette values, which would be sort of unfair, since DBSCAN said these points were not part of any cluster. So I compute silhouette values only for the points that are not noise, using their cluster labels.

What we can see is that as epsilon increases, going from small neighborhoods to larger ones, the number of clusters starts to go down, and the number of noise points goes down too, because points have more neighbors. Generally speaking, six, five, and four clusters come out only very briefly, then there's a long stretch of three, then it jumps back to four, then back to three again, and so on. Eventually, as epsilon keeps increasing, we get just two clusters and then one.

Here's all that data in graphical form. The y-coordinate is the mean silhouette value, the color is the number of clusters other than noise, and the size of the dot is the number of noise points. You can see the dots getting smaller from left to right, because there are fewer noise points as epsilon grows. The colors change from three to four, back to three again, and finally to two down here. Notice that when we make the transition from three clusters to four, the silhouette values drop a bit, and when we go from four back to three, the values go back up. That, plus the fact that such a long stretch of epsilon values results in three clusters, suggests that maybe three is the right number to be using here. Of course, we know from external knowledge that three is the right number, but there's some intrinsic evidence here to suggest it, since four clusters lead to noticeably worse silhouette values.

You can also see that as we continue increasing epsilon, the silhouette score goes down a bit even within the three-cluster stretch. What's going on is that the number of noise points is shrinking: points that used to be noise are now being included in clusters, and that brings down the scores a little, because they weren't good fits; if they had been good fits, they would have been included in the clusters already. So it's a bit of a judgment call how important it is that the silhouette value is decreasing here. Should we put more weight on having few noise points, or on having a large silhouette score? The answer here, at least, turns out to be that we want the fewest possible noise points: since noise points in a sense represent a failure, we'd like to minimize the number of cases we fail to assign to a cluster. So I look at just the results with three clusters, find the index that had the minimum noise, and my epsilon will be the value at that location of minimum noise.
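In code, the sweep and the silhouette bookkeeping might look like this sketch. The digits subset and min_samples = 4 come from the narration; the step size and the variable names are my assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

digits = load_digits()
keep = np.isin(digits.target, [4, 5, 6])   # just the numerals 4, 5, 6
X, y = digits.data[keep], digits.target[keep]

results = []
for eps in np.arange(20.0, 26.0, 0.25):    # step size is my assumption
    labels = DBSCAN(eps=eps, min_samples=4).fit(X).labels_
    noise = labels == -1                   # Boolean index of the noise points
    n_clusters = len(np.unique(labels[~noise]))
    if n_clusters > 1:
        # Score only the non-noise points; treating noise as a cluster
        # would unfairly drag the silhouette down.
        sil = silhouette_score(X[~noise], labels[~noise])
    else:
        sil = np.nan                       # silhouette is undefined for one cluster
    results.append((eps, n_clusters, int(noise.sum()), sil))
```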
And now if I do a DBSCAN fit with that value of epsilon, you see I have three clusters of nearly equal size, which is very good; they're all around 180 points, and there are just 23 noise points. I do have ground truth here, so if I compare these labels to the true labels, I get an adjusted Rand score of 0.93, which is really good. Then again, this is considered a pretty easy classification problem. Here you can see some of the digits that ended up as noise points. As a human looking at these, you'd say, well, yes, this one is pretty ambiguous, and maybe this one is pretty ambiguous too, but others are not: this one looks a lot like a five, this looks like a five, this looks like a six. So the reasons they didn't get clustered are not obvious; you'd have to do a little digging to figure them out.
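Putting the selection rule and the final fit together, continuing from the previous sketch (so results, X, y, np, and DBSCAN are assumed to be in scope):

```python
from sklearn.metrics import adjusted_rand_score

three = [r for r in results if r[1] == 3]          # keep only the 3-cluster runs
best_eps = min(three, key=lambda r: r[2])[0]       # the eps with the fewest noise points

labels = DBSCAN(eps=best_eps, min_samples=4).fit(X).labels_
print(dict(zip(*np.unique(labels, return_counts=True))))
print("ARI:", adjusted_rand_score(y, labels))      # the lecture reports about 0.93
```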
DBSCAN
Tobin Driscoll, April 21, 2022