Other than a cut on some value, the only alternative is to ask: is it equal or not equal to some category? So every node makes a really trivial choice, a split. And the goal is to make a split such that the purity of the outgoing samples is maximal with respect to the target variable, which in this case is survived or did not survive: you want the outgoing branches to be maximally homogeneous, everybody who survived on one side, everybody who did not survive on the other. Okay. There are some details that we're going to cover now. The first thing I wanted to show you: I wrote a bunch of cells where I tested some of these decisions, for different variables and different cuts, just to get some numbers. Those are the numbers I showed you in the slides, let me get this into presentation mode, when I gave the quick, high-level demo of how a tree works for the Titanic Kaggle hands-on. On the Titanic, working by hand, you decide your target function, which in this case I made quite trivial. I showed you the mathematical formula; the target functions that we actually use in trees most commonly are similar in spirit but more sophisticated, while this one is simple: just the ratio of the largest class in the outgoing node to the total. That essentially tells you the purity, right? To do that by hand takes a lot of cells of code, but I don't have to do it here, because we already got this far together: we ran the notebook up to the point, I think, where we create a tree with the official scikit-learn implementation. So that's where we're going to pick it up. I'll give you two minutes to run your code up to that cell, and if you have problems, let me know. Ideally you'll be a bit further, because we went a little further last time: we also split into training and test samples, so make sure you get up to there when we start. Does anybody have problems getting the notebooks running? Are they working? Let me check Zoom and see if everybody online is fine as well. I don't see anything in the chat, nobody is screaming, so I'm going to assume that everything is fine. First of all, let me stop here and talk about the dendrogram. The dendrogram, as I said, is the visualization that we normally use to represent trees; it renders the tree as the flowchart we have been describing. The top row of each node of the dendrogram tells you which variable is being used, in this case gender at the top, and what the cut is; here the cut is "smaller or equal than 0.5". Note that that's trivial, because gender was encoded as a binary variable, 0 or 1, so asking whether it is less than or equal to 0.5 is equivalent to asking whether it is 0 or 1, and that decides which branch the passenger goes down. The other thing you'll see is the Gini number: that's the actual target function that has been used. I used the ratio of the largest class to the total; scikit-learn uses something slightly more sophisticated, the Gini index, which we shall see in a minute. You also see what the size of the sample in the node is and what the "value" of the node is, which is the breakdown of survived versus not survived in that node. And you see that for all the nodes. The easiest way to produce this, by far, is to use the package that does it for you. Notice that the package takes the model, so this works in conjunction with scikit-learn.
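To make that concrete, here is a minimal sketch of drawing the dendrogram with export_graphviz. The variable names (a feature DataFrame X with gender, age, and passenger class, and a target y for survival) are illustrative assumptions, not necessarily the exact names used in the lecture notebook.

```python
# Minimal sketch: visualize a fitted decision tree as a dendrogram.
# Assumes a feature DataFrame X (e.g. gender, age, pclass) and a target y
# (survived); the names are illustrative, not the lecture's exact ones.
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_clf.fit(X, y)

dot = export_graphviz(
    tree_clf,
    out_file=None,                      # return the dot source as a string
    feature_names=list(X.columns),      # variable used at each node
    class_names=["died", "survived"],
    filled=True,                        # color nodes by majority class
    impurity=True,                      # show the Gini value in each node
)
graphviz.Source(dot).render("titanic_tree", format="pdf")  # save a PDF, no screenshots
```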
What this function, tree.export_graphviz, does is purely presentational: it does not impact the output of the classifier you have already trained, it is just a graphical representation. I wanted to show you where I imported it. Maybe I install it, maybe I don't; I think somewhere in here there's a cell where I pip install it. From sklearn I import tree and the DecisionTreeClassifier, and I import graphviz as a support package to render the tree object that scikit-learn produces. Good. Keep in mind that this is a very simple model with very few variables; if your tree is very large, this can become a very complex visualization with very small cells and be utterly useless. And you can save it as a PDF file, so you don't have to take a screenshot. I recommend you learn how to save your plots instead of taking screenshots, because people can tell; you may think that they can't, but they really, really can. Okay, so far so good. So I trust that you're here and that you have your training and test samples. We'll talk more about splitting into training and test samples, but first I want you to try and run one of the ensemble methods, because we're going to talk about ensemble methods today. Write your own code if you wish, or just copy that cell; actually, let me put in two of these cells. I only have the random forest here; we'll see the gradient boosted tree in a moment, don't worry about it. Just copy that cell and run it. I want to show you it's trivial: same syntax. In the first cell I'm not changing anything, I'm using exactly the same hyperparameters, but this is a different and more complex method, which I will tell you about in a minute. In the second cell I also set some different hyperparameters, like the number of estimators. That is a hyperparameter that is proper to random forests, not a single-tree hyperparameter, and we'll see what it does shortly. I'm also setting the maximum depth. I'm a big fan of setting the maximum depth: the default is None, which means letting the tree go as far down as it can, and that is a very dangerous thing. In this case, actually, and I'm hesitating because I'm realizing it as I speak, if the tree really went down as far as it can, I would get a training score of 100 percent, because I have a continuous numerical variable, age, which can take a potentially infinite number of values. So in principle, if the maximum depth really is None and the tree goes down as far as it can, it would just, by design, end up classifying every training object correctly. In practice there are other stopping criteria, so even if you set the maximum depth to None it won't necessarily keep going until it has isolated every single object.
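Here is a minimal sketch of that random forest cell. It assumes the train/test split from last time already exists as X_train, X_test, y_train, y_test (features limited to age, gender, and passenger class); the specific seed and counts are illustrative, not the lecture's exact values.

```python
# Minimal sketch of the random forest cell described above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,   # number of trees in the forest (forest-specific hyperparameter)
    max_depth=3,        # keep the trees shallow rather than letting them grow unchecked
    random_state=302,   # fix the seed so the result is reproducible (illustrative value)
)
rf.fit(X_train, y_train)
print("train score:", rf.score(X_train, y_train))
print("test score: ", rf.score(X_test, y_test))
```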
Okay, if you're having problems, tell me now and we can address them. Were you able to run the tree? The one other thing: check your variable names. If your feature matrix still has "survived" in it, that's a problem. The way I ran the notebook last time, I had separated the target variable from the features; you don't want to pass the full Titanic dataframe, the one in which you have everything including "survived". In this particular example I had "titanic_short". It's a really bad name, but it means a short version of the Titanic dataset that only contains age, passenger class, and gender, while "titanic" was the parent dataframe that had all the variables. Is it working now? Anybody else with troubles? Okay. So I want to emphasize one more time that scikit-learn is super great: I had to write fifty cells of code to build the embarrassingly poor tree model above when I was doing it by hand, and scikit-learn does it in a single command. This is brilliant. But scikit-learn is also a package that gives you enough rope to hang yourself, because it's so easy to run models that you may run them without knowing what you're doing. So it is very important that you take good care to learn what the hyperparameters are and how to use them, for every one of the scikit-learn models you use. Yes, so here, the gradient boosted tree: I thought I had it in this notebook. If you want to run the gradient boosted tree, you run the same block of code; it doesn't take the number-of-estimators argument the same way, so remove that, and replace RandomForestClassifier with GradientBoostingClassifier. Okay? You can just run it: it takes a maximum depth, I think it takes the criterion, I pass it a random state. Same syntax, same everything, completely different model; I'll tell you why it's different in a second. So far, so good. Let me check online: everybody's fine, everybody happy. Next. Sorry, yes. Actually, let me check one more time that everybody can switch from random forest to gradient boosted. Yes, if you have the same problem they had, make sure that the input variables you pass only contain age, gender, and passenger class. Thank you. A couple of people fixed exactly that and theirs run now. If it doesn't work, let me know again and we'll take care of it.
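Here is a minimal sketch of that swap, under the same assumptions as the random forest cell above (X_train, X_test, y_train, y_test already defined); I pass only the arguments the lecture mentions.

```python
# Minimal sketch of the swap: same data, same style of call,
# but a gradient boosted ensemble instead of a random forest.
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(
    max_depth=3,        # depth of each individual tree in the sequence
    random_state=302,   # illustrative seed
)
gbt.fit(X_train, y_train)
print("train score:", gbt.score(X_train, y_train))
print("test score: ", gbt.score(X_test, y_test))
```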
So the next thing I want to look at is the feature importance. Let's do this, and then we'll look at what we've actually done and how we've done it. So now I have a tree, a flowchart that has been optimized to make decisions such that, based on the sample that I know, I get the best result in terms of purity of the outgoing leaves. That sentence was a little bit loaded, but it's correct: I have a flowchart that has been optimized so that the outgoing leaves are as pure as possible, based on the constraints that I gave and based on the data that I've seen, that is, based on my training set. There is no guarantee whatsoever that if I take a new data point and follow the flowchart's decisions I get the right result. But if the training set is representative of all of the data that is relevant for the problem, then yes, I have a model that can predict well. If my test data is in any way different from my training data, all bets are off. Is that clear? And that is not specific to trees; that's where machine learning in general comes to bite you. Now, because I have this flowchart, this is in some sense a self-explanatory model: I can look at it, I can see the decisions, so these models are actually very easy to interpret, in principle. I can see that the first feature, gender, was really important; the fact that it was used as the first variable, because it was the variable giving the best initial split, tells me that it's an important variable. And then I can see, for example, that age is used a bunch of times. This tree is shallow, but imagine that I let the model run deeper and age gets used three more times; the fact that I use it repeatedly is another way in which that variable is important, "important" being defined as "I need to use it to make decisions." So this gives us some visibility into the model. That was a very imprecise, high-level description of feature importance, but in reality it is pretty much exactly what I said, except with some math in the background that accounts for what the Gini index actually is at every split, how good each split was, how many splits each variable was used for, and so on. From that you can calculate the importance of each one of the input variables. So again, the high-level picture: if you use a variable many times, and you need it early on for the very big splits, it is very important; if you use it maybe once at one of the later nodes, it matters little or nothing. Okay? And really this is close to the best that we can do when we talk about interpretability of machine learning models: what we generally aspire to is to be able to say which of the variables you gave me were most important, and how I used them. Tree models give you an essentially trivial interpretation of this, because you have the flowchart and you can see the cuts; but it gets more complicated, and for that we have to go back to the slides. First, though, take a couple of minutes to just run it. So here: this was my random forest model. I used the random forest rather than the single tree; I could have done it with the single tree as well, but this way I also get uncertainties on the importance metric, because every tree in the forest has its own feature importances. feature_importances_ is an attribute of my model: after I fit the model, it tells me how important each feature is. It gets built into the model directly when I create the random forest and fit my data, so I already have it. What this cell is doing is plotting the feature importances, but if you just extract feature_importances_ you get the raw numbers, something like 0.19, 0.35, and so on. A value of 1 would mean that feature is the most important and the only one that is used; if a feature gets 0, it means it just never gets used. I will have a whole slide on this, I promise. Any other questions or thoughts? Does everything we've done so far make sense, at a very high level?
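Here is a minimal sketch of pulling feature_importances_ out of the fitted forest and plotting it, with a spread estimated from the individual trees; it assumes the rf object and DataFrame X_train from the earlier sketches.

```python
# Minimal sketch: feature importances from a fitted random forest,
# with tree-to-tree scatter used as an uncertainty estimate.
import numpy as np
import matplotlib.pyplot as plt

importances = rf.feature_importances_                         # one number per input feature
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
spread = per_tree.std(axis=0)                                 # scatter across the trees

order = np.argsort(importances)[::-1]                         # most important first
plt.bar(np.array(X_train.columns)[order], importances[order], yerr=spread[order])
plt.ylabel("feature importance")
plt.show()
```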
So now let's go back to the slides. We did all this: we looked at the tree, we looked at the structure, nodes, roots, leaves; above I have the slide that's just taxonomy. This is the dendrogram. We looked at the hyperparameters briefly, and now we're going to look at them in some detail. There are a lot of hyperparameters. The ones I really care about, as I've said multiple times, for the single decision tree, are the maximum depth and the criterion. The criterion is what I half-assed with my "largest class over total"; the actual criteria most commonly used in tree models are the Gini impurity index and the information gain, or entropy. I think in the model it's called "entropy", and the default is "gini". It doesn't matter enormously which one you pick, but it's nice to have some intuition about them. Here p is the probability that your classification is right at a given node. You don't know that probability a priori, but if your node is 100 percent pure, then the probability is one: you just can't get it wrong, because you know that every person in the node is of the same class. If 50 percent of the people in the node are one class and 50 percent the other, then it's a random guess. So p is the probability that you have the right classification: you pick a person out of the node and you get their classification right. That corresponds exactly to a frequentist probability: it's the fraction of people of one class in the node. And the Gini impurity and the entropy are just slightly different ways of building an objective function out of that quantity. So why do I control the depth? The maximum depth controls how many splits I do, the depth of the tree: here I only split three times. If I allow the maximum depth to run wild by leaving it at None, I get to the point I was talking about earlier: in principle the tree keeps going, because I have a continuous variable, and it will just continue splitting until every leaf is 100 percent pure, with people of one class in each. Never mind that I'm really slicing people between, say, 38.5 years and 38.6 years; that is not something that helps me predict people in the future, because it is not a generalizable characteristic. So we control the depth to avoid overfitting. Someone asked the question earlier, and you would be surprised how small the maximum depth typically is when a model is used appropriately: it is usually a very small number, like 3 or 4, maybe 10 if your dataset is enormous and particularly rich. Okay? There are other ways to avoid overfitting. You could, in principle, let your tree run all the way down and then prune: cut branches from the bottom up and look at the point where you actually start overfitting, for example by comparing the outcome on the training versus the test set. But controlling the maximum depth from the beginning is the most effective. So far so good? Right. So all we've talked about with classification with trees is survived versus not survived, a binary class. For one thing, classification with trees and random forests does not have to be binary: I could have survived, did not survive, don't know, no record; I could have red, green, and blue, or a whole bunch of different classes, and it works in much the same way. I can also do regression with trees. I don't usually use the standard scikit-learn figures for the lectures; those are available to you on the scikit-learn website and perhaps you want something more from me, but in this case they are really helpful. So this is a demonstration from scikit-learn of how your regression can go right or wrong. The simulated data, the orange dots, are points on a sinusoidal curve, some of which have been displaced; that represents noise. Okay? When you want to use trees as regressors, you essentially have to turn your target variable into bins. So you're still doing a kind of classification: you're saying, okay, the value of my target variable is going to be in this bin, and I will predict a representative number for that bin, say 0.25. That is still a classification operation, but if your bins are small enough, then it is indistinguishable, for all intents and purposes, from a regression. Does that make sense? You predict based on some values on the x-axis, all these values right here.
In this case, for the blue curve, you predict a y value of about 0.05 for one range of x; for all the x values between roughly 0.5 and 3.5 you predict a single representative y value, say 0.75. The sophistication here is in deciding how wide your bins are. If your bins are too wide, then of course your model is too coarse and you're not accurate in the prediction: I'm predicting things that go up as high as one, and things that go as low as 0.5, all together as one 0.75 bunch. Is that useful? I don't know, it depends on your problem; it's certainly not very useful if I want to reconstruct whether or not this is a sinusoidal wave, for example. But if my bins are too small, then I end up making bins of one element each, and that's also a problem: the same problem as before, I'm overfitting. So this new parameter, which is effectively the size of the bin, becomes the parameter that exposes you to the risk of overfitting. That's the conceptual description. Algorithmically, or rather methodologically, it's the same thing, the same family of models: sklearn.tree, the same module we used before, but instead of DecisionTreeClassifier you use DecisionTreeRegressor. It has a different criterion, and I'll show you in a second why, and some other different parameters, but it still has your maximum depth, the minimum samples per split, and all the rest of the hyperparameters, and you control them the same way. Your objective function has to be different, though, because you can no longer talk about purity: you can't ask "is this class or that class", you have to ask "how close is this to my predicted value". So we use the typical criteria we've seen before: L2, the mean squared error, which is the default, or maybe you use another one, like the mean absolute error. Okay? That's all I had for regression with trees: same thing, small subtleties, some tiny differences in the hyperparameters.
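Here is a minimal sketch of tree regression on noisy sinusoidal data, in the spirit of the scikit-learn demo described above; the data here is simulated by me, not the exact figure from their site, and the two depths are illustrative of the coarse-bins versus tiny-bins trade-off.

```python
# Minimal sketch: regression with trees on a noisy sinusoid.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)     # sinusoid plus noise

# "squared_error" is the L2 criterion (called "mse" in older scikit-learn versions).
shallow = DecisionTreeRegressor(max_depth=2, criterion="squared_error").fit(X, y)
deep = DecisionTreeRegressor(max_depth=8).fit(X, y)  # very fine bins: risk of overfitting

X_grid = np.linspace(0, 5, 500).reshape(-1, 1)
pred_shallow = shallow.predict(X_grid)   # coarse, step-like approximation (wide bins)
pred_deep = deep.predict(X_grid)         # follows the noise almost point by point
```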
Moving on: we're going to talk about random forests and gradient boosted trees. But before that, let's talk about why we would talk about that: what are the issues with the single-tree models we just studied? The issue is that, like I said, you cannot do an exhaustive search through your flowchart. You cannot ask, for every possible age, what the best split would be at every node, because that computation would be intractable. So the algorithm takes some guesses: let me try a cut at 10, is 10 good? Let me try another; if this observation goes to the left or to the right, which is better, et cetera. You have to make some random guesses, and that means that if you made a different random guess, you have no guarantee that your model would end up in the same place. We call that the variance of a model: I run my model on the same exact data with the same exact hyperparameters ten times, and I get ten different results. Unless I set my random seed and make sure that I only get one version; but I don't know if that version is right. Just because I've set the random seed and I keep getting the same result, there is no guarantee that it is the right result. Okay? So the solution is to run a whole bunch of trees. Kind of like I was telling you last time: when you're sensitive to one parameter, really the only reasonable solution is to try different values of that parameter, see how sensitive you are, and, if you can, backtrack to what the best choice is. This is the same idea: you are sensitive to this parameter, which is the random guess, so you run a whole bunch of trees and see if you can get a stable result out of the ensemble of trees, as opposed to out of a single one. The two main tree ensemble methods are random forests and gradient boosted trees. The way I make it easy for myself to think about it: a random forest is a bunch of trees that run in parallel. Each one is independent of the others and gives one result, so then I have to merge those results into a single output. Typically the decision out of a random forest is a majority decision: for this observation, which has these values for each one of my variables, what do the majority of the trees predict the class to be? And that's what you take; for a regression value you take the average, for a class the majority vote. Okay? So you take a bunch of trees, you randomize, you get different results, and then you figure out the most common outcome for each object. There is a little more to it. One addition: instead of just changing the random seed, you also randomize over either the observations or the variables, or both. Each one of the trees will not use all of the variables: instead of using age, passenger class, and gender, it will use a random subset of the variables, for example passenger class and age. Typically you have many more variables than three, so there are many more permutations. And the other thing you can do is take only some of the observations and leave out others. Why would you do that? Anybody, does anybody know why you would want to do that? What does dropping some of the variables or some of the observations buy you? So when I run a random forest, I'm going to choose a random subset of the input variables, or a random subset of the observations, or a random subset of both, for every tree; every tree runs independently on its own subset. Why would I do that? I think you're overthinking it; the answer is simple. I mean, there are simple and difficult answers, but this one is not particularly complex, it's something with a very low barrier. It's related to bias: it is a way to ensure that you're not dominated by some observations versus others. If you have outliers in your sample, they may skew the results, and they may skew them in a systematic way for all the models, even if you choose a different random seed. But if you throw the outliers away every few trees, then the trees that are affected will get a different result from the others, and the idea is that on average you will get a result that is collectively better. These kinds of techniques are called bootstrapping, and in particular selecting a subset of your observations or features is called bagging: like picking out of a bag, grabbing a bag of trees, right?
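Here is a minimal sketch of where those bagging choices live in the scikit-learn random forest, with the same assumed X_train, X_test, y_train as before; the particular settings are illustrative.

```python
# Minimal sketch of the bagging knobs on a random forest: each tree sees a
# bootstrap sample of the observations and a random subset of the features.
from sklearn.ensemble import RandomForestClassifier

rf_bagged = RandomForestClassifier(
    n_estimators=200,
    max_depth=3,
    bootstrap=True,        # each tree is trained on a resampled set of observations
    max_features="sqrt",   # each split considers only a random subset of the features
    random_state=302,
)
rf_bagged.fit(X_train, y_train)

# The ensemble prediction is the majority vote across trees (the mean for regression).
print(rf_bagged.predict(X_test[:5]))
```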
So that's random forests. Gradient boosted trees: if a random forest is a bunch of trees that run in parallel, independently, a gradient boosted tree is a series of trees that run in sequence, one after the other. You don't run each tree with all the features and observations weighted equally; you assign weights to the features and observations, and the weights you assign in the next tree depend on the result of the previous tree, plus some randomness that ensures you're not falling into a local minimum, so to speak. So each tree uses a set of weights, and you essentially optimize over these weights as you go from tree to tree: the weights you use in the next tree depend on the weights in the previous tree and on how well that model did. This particular method is extremely successful. Most of high-energy particle physics relies on machine learning; maybe now it's a little different, but until a few years ago results like the Higgs boson relied on tree-based methods. Now deep learning and random forests are comparable in many settings, and for a while gradient boosted trees were surpassing everything else: there was a time, until just a few years back, when basically all the Kaggle challenges were won by gradient boosted trees. You could just throw a gradient boosted tree at the problem and it would almost certainly win the challenge. So they are extremely simple conceptually, but extremely powerful methods. Before we look at... yes? [Student] I was looking at the gradient boosted trees; that sounds a lot like neural networks and stochastic gradient descent, since the weights are updated each time. Am I right in seeing a comparison there? Okay, let me think about it. It is similar in the sense that it's a set of models that act in series. But in deep learning, and let's assume we're talking about a fully connected deep neural network, if it's not fully connected it might be a little different, each node makes a decision based on all of the data it receives, and it weights that data in its own way. Whereas here we're still in a tree method, so each node is only using one variable; that is a key difference. The other difference is that with neural networks, every neuron essentially makes a linear-regression-based decision on the input variables, and then you modify that with your activation function, which can be a step function or something smoother, applied to what comes out of the linear combination. Essentially, if your activation functions were all step functions, that would be closer to a gradient boosted tree, but a tree is still using only one variable in each node. Okay? Yeah, that makes sense. And we will cover neural networks next week, well, after Thanksgiving. What I said just now will probably make more sense to those who have seen a neural network before. Any more questions?
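To make the sequential flavor of boosting concrete, here is a minimal sketch using scikit-learn's GradientBoostingClassifier; the learning_rate value and the staged_predict inspection are my additions for illustration, not parameters shown in the lecture.

```python
# Minimal sketch: each boosting stage adds one tree that corrects the ensemble
# built so far; staged_predict lets you watch the test score evolve tree by tree.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

gbt = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,   # how much weight each successive tree gets
    max_depth=3,
    random_state=302,
)
gbt.fit(X_train, y_train)

for i, stage_pred in enumerate(gbt.staged_predict(X_test)):
    if i % 50 == 0:
        print(f"after {i + 1} trees: test accuracy {accuracy_score(y_test, stage_pred):.3f}")
```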
Okay, so let's... I'm trying to think whether I want to do the coding now; let's look at more slides for a second. Something that we haven't talked about in detail, and this transcends trees, it's general machine learning, is how we evaluate models. Once we have our machine learning model, which can be a neural network, a tree, a linear regression... but let's say that it's a classifier, so let's rule out linear regression; say it's some method, covered or not, that gives you a classification outcome rather than a regression outcome. How do I evaluate whether my model is good or not? Trivially, the obvious metric would be: what is the fraction of right classifications versus wrong classifications? But there are slightly more refined and thorough metrics that we typically use in machine learning, which we're going to go through right now. For one thing, a lot of them rely on the concept that there are positive classifications and negative classifications, and then you can make two kinds of mistake: you can predict positive when the truth is negative, and you can predict negative when the truth is positive; otherwise you're predicting it right. In the case of the Titanic we would be tempted to say that surviving is obviously the positive outcome compared to the alternative. But I want you to think about "positive" in a different way: not as a moral connotation of the outcome, but as detection. If you have a machine learning method that is trying to tell you that there is something, for example that there is a car in the road, or a pedestrian in the road, or that your email is a good email and not spam, then "there is something" is the positive. So try to think about what, in your classification, you would consider a positive as opposed to a negative result. The example that I like is the email example, which is also one of the driving problems for which classifiers were developed; this is a very big problem in industry, right? Building a model that prevents the user from getting spam but lets the user get all the important messages. Gmail runs on classifiers, and a lot of Google's work on learning classifiers comes precisely from this problem. In that case, positive is "this is a good, honest, relevant message" and negative is "this is spam". And you can make mistakes, like I was saying. Thinking about it statistically, I would introduce H-naught, the null hypothesis: here the null hypothesis is that your message is spam, and a positive classification is that your message is not spam. And actually I have this drawn the other way around; usually you do the matrix with the true positives and true negatives on the diagonal and the false positives and false negatives off the diagonal. I need to redo this visualization. The point is: with this convention, a false negative is an important, honest message that got flagged as spam, so you lose it and you miss something that mattered; a false positive is some spam that ends up in your inbox. In general, what we try to do in this case is not lose any good messages. In the email example it should be fairly intuitive why: you don't want to miss a message that is important, and you can tolerate some spam in your inbox. But this preference is really not as universal as you might think.
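Here is a minimal sketch of building that table, the confusion matrix, for the Titanic classifier; it assumes the fitted rf and the test split from the earlier sketches, with "survived" treated as the positive class.

```python
# Minimal sketch: confusion matrix for a binary classifier with labels {0, 1}.
from sklearn.metrics import confusion_matrix

y_pred = rf.predict(X_test)
# Rows = true class, columns = predicted class:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
```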
Think, for example, about the medical field: correct and incorrect diagnoses. The relevance of a false positive versus a false negative now depends on what the treatment for that diagnosis is. Is it a diagnosis for which you can treat, or not? Because if it is something you can treat and you have a false negative, then the person does not get treated and possibly dies; that is a pretty poor outcome. But then again, what is the treatment, and what are its side effects? Do they increase the risk to the person compared to no treatment at all? It becomes very complicated very quickly; that's what I'm trying to say. Generally, one of the metrics we use, and this is more from statistics in general than from machine learning, is the likelihood ratio, written here as the number of false negatives over the number of true negatives. Notice it is based only on negatives, so you can already see that it is biased towards one type of error versus the other. For completeness, do I have a slide where I tell you which type of error is which? Yes: a false positive is called a type I error, a false negative is called a type II error. I hate these names because I never remember which is which, but they matter, because when you read them you need to know what people are talking about. Digging a little deeper, we can define metrics that put more emphasis on one of the two possible kinds of error versus the other: precision and recall. A colleague of mine makes these precision-recall plots for his papers, and every single time he shows me one I have to say, wait a minute, can you remind me what precision is and what recall is? I just cannot memorize it; I have to look it up every single time. Precision is true positives divided by true positives plus false positives: in the diagram, the true positives divided by the entire circle of things you called positive. Recall is true positives divided by true positives plus false negatives: the true positives divided by everything that really is positive. So you can see that they put emphasis on different things. And then we can put everything together and report the accuracy, which is true positives plus true negatives divided by the total. [Student] Doesn't this look like what we used in K-means, when we were optimizing? It might look similar mathematically, but it sits in a completely different conceptual setting, because there we were trying to optimize over a continuous variable, whereas here it's strictly categorical... oh, but you're talking about the categorical distances for K-means. Yes, it is the same kind of table, the same concept, you're absolutely right: when we built categorical distances for K-means this is exactly the table we built, and in fact the intersection over union we used there, the Jaccard distance, I think, is intersection over union, is the same construction we use to decide whether we have a true positive or a false positive in, say, a network that identifies images. Okay, fine.
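Here is a minimal sketch of computing those metrics from the same predictions as the confusion matrix above (y_test, y_pred assumed from the previous sketch): precision = TP / (TP + FP), recall = TP / (TP + FN), accuracy = (TP + TN) / everything.

```python
# Minimal sketch: precision, recall, and accuracy for the same binary predictions.
from sklearn.metrics import precision_score, recall_score, accuracy_score

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("accuracy: ", accuracy_score(y_test, y_pred))
```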
So, last thing, and then we'll do some coding; this is the last concept I want to share. Imagine that you run a random forest, or a gradient boosted tree, or even just a single tree model, and you have a result out of a node. Again, the way this works: you build the flowchart, you pre-compute it, and it becomes a lookup table. Then you take a new observation, with its own characteristics, and you follow the flow of the lookup table until you land in a bin at the end, a leaf. The leaf of the tree has some purity itself, right? So you have to make a decision at that point. You can say: I'm going to classify based on the majority output in that leaf; the majority of the people in my outgoing node survived, so I predict that this person survived, a 1 for example. (Let's use the leaf that is not crossed out, otherwise it's more confusing.) For example, take the second leaf from the left on the bottom row, for those on Zoom, the one whose value reads [8, 6]. What do I predict for a person landing in that leaf: survived or did not survive? Eight people in that leaf survived, six died. What are my options? Well, the one I gave you: I can say the majority of the people survived, so I classify this person as a survivor. Do I have any other options? You could call it undetermined, but you can do a little better than that. You can say: if I were to pick one of the fourteen people in that leaf, I would have a probability of 8 out of 14 for them to have survived, and a probability of 6 out of 14 for them to have died. So I can treat each outcome as a probability. Now think about the random forest. Say I decided that each tree reports the majority classification of its leaf: at my leaf I had 8 and 6, 8 is greater than 6, so that tree says "survived". Now I have a whole bunch of leaves, one per tree, in which my prediction may vary from tree to tree, and I told you we make a majority decision: we take what the majority of the trees decided. But there too I can do a little better: I can say, well, 70 percent of the trees decided that this person survived, and 30 percent of the trees decided that this person did not survive. That's a probabilistic classifier. However, when I have a probabilistic classifier I have a problem: true positive, false positive, that table is not built for probabilistic classifications. So how do I fill that table if I decide that the best way to use my trees is as probabilistic classifiers? Okay, bear with me, I'm almost there. Say I have a probabilistic classifier and I want to turn it into a yes-or-no outcome. What am I going to have to do? I'm going to have to choose a threshold, and based on that threshold I will say yes or no. So if my threshold is 75 percent... and this is the important part: the threshold will depend on your willingness to risk a false positive versus your willingness to risk a false negative. If you want to be really sure that the passenger survived, 8 out of 14 doesn't give you much certainty, so you might say no. If you want to be absolutely sure before giving a person a treatment that is very dangerous, then the probability of that person having the disease has to be really high; if the treatment is not dangerous, maybe that probability can be lower. The right answer is a domain question: what is the threshold at which you declare a positive classification? But what you can do is make all the possible threshold choices, plot them, and see how many true positives versus false positives you get based on each probability threshold.
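Here is a minimal sketch of turning the probabilistic output of the forest into a hard yes/no with an explicit threshold; rf and X_test are the assumed objects from before, and the 0.75 value is purely illustrative, not a recommendation.

```python
# Minimal sketch: probabilistic output -> hard classification via a threshold.
p_survived = rf.predict_proba(X_test)[:, 1]   # probability of the "survived" class
                                              # (for a forest, averaged over the trees)

threshold = 0.75                              # domain-dependent, illustrative value
y_pred_strict = (p_survived >= threshold).astype(int)
```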
When we do this, we call it an ROC curve: Receiver Operating Characteristic. Receiver operating... what was the C for? Thank you, Catherine: characteristic. I forgot even though I put it on the slides. So, Receiver Operating Characteristic. For each of my possible threshold choices I can plot a point on this curve. The different lines here are different models that I ran on the Titanic data in the notebooks, and you can run them yourselves: logistic regression, which is a regression that gives you a binary outcome, and also a random forest and gradient boosted trees, with differently encoded categorical variables, which I forgot we have to get to in a second. So I plot them all and I get curves. What do I do with this curve? Where would you like to be in this two-dimensional plane, if you could choose your model to be in only one spot? Very good, that's right: you want your true positive rate, the fraction of positives that were correctly predicted, to be one, and your false positive rate to be zero. So you want to be at the left of the x-axis and at the top of the y-axis. The closer your model gets up there, the better, for a single choice of threshold. But generally we evaluate the model more holistically by asking what the area under that curve is, which tells me, in a sense, about the potential of the model rather than the exact result for the one threshold choice I made. Okay, do I have anything else here? How do you get your model to move in that direction? You tune your hyperparameters. When you run a model, and I don't think we're going to do this live, I'll leave the cells in the notebook for the homework so you can do it, you can play with the hyperparameters by hand. When you do that, you always want to see what result you get on the training set and what result you get on the test set, like we did last time, and you want to set the hyperparameters so that they maximize the training-set performance while the test set gets a similar result. If you push too much in one direction you just get a worse model, and both will degrade. If you push too much in the other direction, towards a model that is too complex, you will get better performance on the training set but you will lose performance on the test set. And we call that... what's the word I used for that? Yes, overfitting. I wanted to bring a mug for people who answer questions; today I have another University of Delaware coffee cup that I was gifted. Next time: get ready to answer questions next time. Okay? So you tune the hyperparameters to get the best model you can, maximizing your area under the curve. You can do it by hand, but it's really painful and time-consuming, and there is a very handy scikit-learn function that does a grid search over all of your parameters: it runs your model a whole bunch of times with different parameters and chooses the best one. I can show you that next.
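Here is a minimal sketch of both ideas: sweeping all thresholds at once with an ROC curve and its area, and letting GridSearchCV try hyperparameter combinations instead of tuning by hand. The probabilistic scores come from the thresholding sketch above, and the parameter grid and scoring choice are illustrative assumptions.

```python
# Minimal sketch: ROC curve / AUC, then a grid search over forest hyperparameters.
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

p_survived = rf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, p_survived)   # one point per threshold
print("area under the ROC curve:", roc_auc_score(y_test, p_survived))

grid = GridSearchCV(
    RandomForestClassifier(random_state=302),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [2, 3, 5, 10]},
    scoring="roc_auc",           # optimize the area under the curve
)
grid.fit(X_train, y_train)
print(grid.best_params_)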
Okay, yes, let me get to the notebook. I have a different notebook that I put in the repository, but since you don't have the code right now, you can just watch. This notebook is about real estate prices in New York City. So much fun. It's been a few years now, but I had so much fun extracting the real estate prices from Zillow, because Zillow has a proprietary algorithm and hides its database. So somebody, credited on this slide somewhere, Chris Miller, had created essentially an interactive widget that web-scraped the Zillow prices, and a window would pop up where you had to certify that you're not a robot: it would scrape a bit, then a window, you'd click "I'm not a robot", it would scrape some more, another window, another click. It took me two days, but it was really fun and satisfying, and at the end of the day I had a comprehensive dataset of Zillow prices in New York. There are two things everybody knows about New York: housing is really expensive and the pizza is really good. I can very much confirm the housing part; people live in very small apartments. And there's another thing that we'll see in a second. So this is about price: predicting real estate market prices is a huge business, and a lot of people who work in machine learning do this for a job. Turns out it's not that hard; well, we can do it, maybe. There's something else in this notebook that I don't have time to explain: I'm using a shapefile to geocode the zip codes so that I can make a map. Just follow along if you're interested; in this class we don't need it so much, but geopandas is a GIS library built on pandas, with very similar syntax, that allows you to do geospatial analysis. A whole different topic in and of itself, and super cool. I do some preprocessing: I drop from my dataset the city, the state, the URL, and the address, because I don't care about those things. I also drop an extra Zillow price column, and I drop the zip code, because it actually appears twice: I did a spatial merge, so the zip code shows up twice, and I drop one copy while keeping the zip code as one of the variables of interest. I know that it's important, and I'll show you how I know. So at the end of the day I end up with zip code, price, square footage, number of bedrooms, number of bathrooms, and sale type in my DataFrame. These are the features I want to use for my random forest. Seems reasonable. Which one would you think is important? Square feet? Okay, why the square feet? Because a larger house costs more. Good, yes. But I call this notebook "location, location, location", so that's definitely my expectation: if I want to buy a condo in Manhattan, I'd better be ready to spend a bit more than if I want to buy one in Queens, or in the Bronx, or somewhere else. Someone suggests the bathrooms; maybe, though that may just track with the size of the place. That's my expectation anyway; let's see how it pans out. So I made the problem a little bit simpler, and this is something that I recommend you do in general: if you have a regression problem that you have to address with machine learning, my strategy is generally to first turn it into a classification problem, and only then treat it as a regression problem. If I know that I can solve "is the price greater than this or lower than this", then I can move on; if I cannot even solve that, I am really going to bang my head against the wall trying to predict the actual price. As a strategic choice, I strongly recommend this in general.
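Here is a minimal sketch of that "classification first" strategy: instead of predicting the sale price itself, predict whether it is above the median. The DataFrame df and the column name "price" are illustrative assumptions about the Zillow-derived dataset, not its exact schema.

```python
# Minimal sketch: turn a regression target into a binary classification target.
import pandas as pd

median_price = df["price"].median()
df["expensive"] = (df["price"] > median_price).astype(int)   # 1 = above the median

y = df["expensive"]
X = df.drop(columns=["price", "expensive"])   # features only, target columns removed
```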
The only other thing I noticed is that I have a lot of categorical variables. First of all, the zip code is encoded as a number, right? My zip code is 10001, but it is not really a number: it's not a quantity, it's a label for a location. It's a category. The sale type, whether it's a condo or a single-family home, is a category too. And as we said, categories need to be turned into numbers, one way or another, for a machine learning method to work on them. So I turn sale type into a numerical code and keep the zip code as it is for right now. Then I whiten my data; we've talked about this, so I'll skip over it, but I pre-process my data so that all the variables are comparable. I don't really need to do that for what I'm mostly going to run, because tree models choose based on one variable at a time, so the scale doesn't matter, but I also want to run a logistic regression, and for that I do need it. Then, and this is what somebody asked me to address last time, I have two ways to encode my categorical variables. I can say: well, I have my zip code, which is 10001; it is a category, but I'm going to pretend I don't know that and use it as an actual number. What's the problem with that? The problem is that I ascribe meaning to its magnitude: 10002 compared to 10001 is "greater". If I use a random forest, I'm going to have nodes that decide based on which zip codes are greater and which are smaller, and that doesn't quite make sense if the number is a category; I'm making a somewhat arbitrary decision. Except I'm going to argue that zip codes, because of the way they are assigned, actually do come in geographic groups of numbers: all of Manhattan is 100-something, for example, and nearby areas get nearby codes. The other option I have is to create N new variables, and that's what you see in these columns: one for each of my categories, one for each zip code. Then, for every observation, if it is in that zip code, that variable is one; if it's not, the variable is zero. Let me fast-forward and tell you what the result is, and then let me show you. So I run a bunch of models; the notebook is there for you to look around: decision trees, the graphviz rendering, a linear regression, a ridge regression, and I store the results. I do my classification, I get certain results, I can plot my ROC curves. Let's look at one of them, the single tree, for example. The score looks excellent: on the training set, the in-sample score, I get 100 percent. That should actually worry me; I should work on my hyperparameters to get it to stop overfitting. The random forest gives me 0.91 when I one-hot encode the zip code; the gradient boosted trees give roughly 0.9 as well. Again, random forest and gradient boosted trees are very powerful methods: 90 percent classification accuracy is common on a lot of datasets. And I run these models in two ways: one with the one-hot encoding, making N categories, one for each zip code, which blows up the dataset, because there are something like a hundred zip codes, so my dataset suddenly has about a hundred and ten columns instead of eleven or twelve; and one where I just use the zip code as a number.
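Here is a minimal sketch of the two encodings, under the same assumed schema as above (a "zipcode" column in the feature DataFrame X; the column name is illustrative).

```python
# Minimal sketch of the two zip-code encodings discussed above.
import pandas as pd

# Option 1: leave the zip code as a number (implicitly ordered, somewhat arbitrary).
X_numeric = X.copy()
X_numeric["zipcode"] = X_numeric["zipcode"].astype(int)

# Option 2: one-hot encode, one 0/1 column per zip code; the feature matrix
# grows by roughly one column per distinct zip code.
X_onehot = pd.get_dummies(X, columns=["zipcode"], prefix="zip")
```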
So at the end of the day, what do I want to show you? This is the feature importance plot, like the one I went over with you earlier. We can look at these feature importances with uncertainties: I ran a random forest, every tree has its own feature importances, so I can put them all together and look at the variance between trees; that's where the error bars come from. So, the punchline: the models are really good with both ways of encoding my categorical variable, about 90 percent in both cases, fine. But look at the feature importances. When I keep the zip code as an actual number, the "wrong" choice, treating the category as if its magnitude had meaning, the results are completely reasonable: the square footage of the house is the most important characteristic, and then where it is. Everybody here would say location, location, location. When I run the one-hot encoded version, I get the same model performance, but in my feature importance the zip code has disappeared: when I do my feature importance analysis, I no longer see that the location of the apartment or house was important in determining its price. Why? We have two minutes; anybody have an idea why? Why have I lost track of the fact that this was an important feature? The model performance is great in both cases. When I use the zip code encoded as a number, my trees make choices based on "greater than this number" and "smaller than that number" for the zip code, and I find that the zip code, the location of the house, is important, which I know is true. When I run the model with the one-hot encoding, I have a hundred variables for a hundred different zip codes, each of which is one for only some observations and zero for all the others, and I no longer see that location is an important feature. [Student] It's kind of in the name, it's kind of phrasing the question, but I was just going to say: it's because it's now a hundred features instead of just one, so the importance is no longer carried by one feature; it's spread over a hundred. Exactly. Remember what I was telling you about how the feature importance is calculated: how many times you use a feature matters, and now each of those dummy variables gets used at most once, because each one only offers one possible choice. Essentially, what's happening is that my model doesn't know that I gave it a hundred features that are strongly covariant; it treats them as if they were independent, and it calculates the feature importance as if they were independent.
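One partial workaround, my suggestion rather than something shown in the lecture, is to sum the importances of all the dummy columns that came from the same original variable, so you recover a single number for "zip code". This assumes a forest rf_onehot fit on the one-hot matrix X_onehot from the previous sketch.

```python
# Minimal sketch (not from the lecture): re-aggregate one-hot importances
# back onto the original categorical variable.
import pandas as pd

importances = pd.Series(rf_onehot.feature_importances_, index=X_onehot.columns)
zip_cols = [c for c in X_onehot.columns if c.startswith("zip_")]
print("combined importance of all zip-code columns:", importances[zip_cols].sum())
```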
So why am I telling you all this, and why did I lure you into thinking that you can encode the zip code, a categorical variable, as a numerical variable, which everyone will tell you not to do? This will take twenty seconds. There are examples everywhere of how to run a one-hot encoder, turning the species of an animal into three variables, cat, bird, and dog, et cetera. Anywhere you read, and anybody you ask, will tell you that you have to one-hot encode categorical variables and cannot use them as numbers. And that is true: the model accuracy is going to be maximized, and mathematically that is the correct thing to do. But there are two caveats. One is that, mathematically, your model generally assumes the variables are not covariant, and now you are feeding it covariance; just make sure that you know that. And secondly, while you might get away with that covariance when you fit your model, and trees in particular are robust to covariance to a large extent, when you look at the feature importance you are no longer insensitive to it. If two features predict the same thing, your tree will only use one of them; the other becomes redundant. But you don't know which one causes the other, or why they are the same. It might be that the one that got ignored is actually the important feature, in the sense that it is the one that causes the other to have the value it has. Distance and luminosity, for example, in astrophysics: if you take stars of the same intrinsic brightness and put them at different distances, one will appear fainter. If you give me both variables, you are giving the same information content twice: your tree method may choose one or the other, and once it has chosen one it will ignore the second, or the importance will be split between them. Okay? There's no real solution to that, by the way. So it's just something to know: you should still one-hot encode, but once you do, be careful how you read the feature importance for that variable. So that's all I had for this. I will put some more pieces of code from these notebooks into the homework, and I'll also make the notebooks available to you; I will remove the errors from the one we've just seen, which apparently I broke this morning. I also want you to read something about the interpretability of these methods; it's just a Two Sigma blog post, so it's very quick. And the homework has been posted for some days, and we'll add something on Wednesday. Yes, we can shift the deadline by a few days; next week is Thanksgiving, so we have no classes, and rather than making it due right after, I don't mind giving you a few extra days this week. Let's just do that: this homework is due on Sunday this week. Is that okay? All right. I can take your questions; otherwise I'm done, and I'm sorry that I ran a couple of minutes over. Yes, a question here. Yes, that is exactly right. That's sort of the first line of attack for tree-based models; think about it like the degree of the polynomial in polynomial regression. There are other things that essentially do the same job. Maximum depth works on all the branches the same way; you could instead, for example, put a minimum on the size of the split, so that if a branch has fewer than ten objects I no longer split it, whatever the result is. That does the same thing, but it's personalized for each branch. What is this score number? It depends on the task. For classification it's typically the accuracy: how many did you get right versus wrong, what fraction of predictions were right. If you're doing a regression, then it's no longer that score; maybe you use the mean squared error, some way of measuring the overall distance between my predictions and the truth. So what you use as a score depends on what you want, and I showed you a few other options where maybe your score is the recall instead of the accuracy.
But whatever score you target, you want the training and the test set results to be similar. In particular, if the test set score is lower than the training set score, you are overfitting; if the training set score is a little lower than the test set, look, it's not nothing, but it's usually just a statistical oddity, so don't read too much into it. Yes? [Question about the ROC plot axes.] The false positive rate is on the x-axis and goes from left to right, so the false positive rate is minimized here on the left, and the true positive rate is maximized at the top; you want to be where the true positive rate is one and the false positive rate is zero, and it looks odd only because we're not used to it. [Student] Sorry, I'm confused about how I can get a model below the chance line; I must have done something wrong. I'll have to get back to you on that. Any other questions, or is everybody's mind blown? Yes, and I'm sorry, I have the symposium that I'm taking care of, so this week will be rough for me, but I will definitely have them; I have most of them done, and I'll definitely finish it up. Any other questions?
DSPS 2021 Lecture 19 | CART 2
From Federica Bianco November 15, 2021