So let me load the cars dataset again, using seaborn, and then ask the DataFrame to describe itself. You see a bunch of numbers here. You certainly know what some of them mean, but we're going to talk about all of them. Before we do, though, just know that when you take a large set of numbers and reduce it to a very small set of numbers, or even one number, you're inevitably throwing away a ton of information. You want to do that as carefully as possible, without distorting what's really there.

Three of the most frequently cited statistics are the mean, the variance, and the standard deviation. Suppose I have n observations labeled x₁ to xₙ. The mean is just the average; we'll use the Greek letter mu for it:

μ = (x₁ + x₂ + ⋯ + xₙ)/n.

Add the values up, then divide by n. Now, if I take the deviations from the mean and add those together, I can break that up into two separate sums. The first one, by the definition of the mean, is just nμ. In the second one, μ is constant, so I'm adding up μ a total of n times, which gives nμ again, and the whole thing is 0:

(x₁ − μ) + ⋯ + (xₙ − μ) = (x₁ + ⋯ + xₙ) − nμ = nμ − nμ = 0.

That means that if I divide by n, the average deviation is also 0. Something more interesting happens if I add up the squared deviations and then divide by n. We get a quantity usually written as σ², with a lowercase sigma, and that's the variance:

σ² = [(x₁ − μ)² + ⋯ + (xₙ − μ)²]/n.

Variance has nice properties; statistically speaking, it's easy to manipulate in a lot of ways. But it's hard to interpret because of its units: if x has physical units, the variance has the units of x squared, so it's not always clear exactly what it means as a number. What we often do is take the square root to get σ, which is called the standard deviation. That does have the same units as x itself, so we can more easily compare it to the values of x. The standard deviation is a measure of the spread, or what's called the dispersion, of the values.

One of the things we often do with the mean and the standard deviation is a process called standardization, or producing what are called z-scores; those are the same thing. The setup is the same as a minute ago: we have n observations x₁ to xₙ with mean μ and standard deviation σ. Then I define

zᵢ = (xᵢ − μ)/σ,

and that's valid for all the possible values of i. So out of the original observations, I get these z-scores instead. Two things happen with these. First, the mean of the zᵢ's is 0; that's the same derivation I did a minute ago with the average value of the deviations. Second, when you divide a statistic by a scalar, you divide its standard deviation by that number as well, so the standard deviation of the zᵢ's is 1. The z-scores are dimensionless: they have no physical units, which is always a nice thing to look for, and they're easy to compare across very different kinds of datasets. It doesn't matter whether I'm comparing nanoseconds and years, in a sense, because I'm always looking at the relative deviations.

It's pretty easy to standardize scores in pandas. Here I'm defining a one-line function. The input is a Series, like a single column of the DataFrame. I calculate the mean and the standard deviation of that series, and I use those to compute the z-scores.
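A minimal sketch of what this might look like; the dataset name "mpg" (seaborn's built-in cars dataset) and the function name standardize are assumptions, not taken verbatim from the lecture:

```python
import seaborn as sns

# Load the cars dataset; seaborn ships one as "mpg" (assumed name)
cars = sns.load_dataset("mpg")

# Summary statistics for every quantitative column
print(cars.describe())

# One-line standardization: takes a Series (one column), returns its z-scores
def standardize(s):
    return (s - s.mean()) / s.std()
```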
Then, at least if I only want to do this on one column at a time, all I have to do is feed that column into the standardization function. Here I'm making the result a new column in the DataFrame, and then I ask those two columns, the old one and the new one, to describe themselves. The first column has the original values; the second has the z-scores. As you can see, the mean is effectively 0. Remember, numbers are rounded off to about 15 digits, so it's rare that you can make it exactly 0. And the standard deviation, to many digits, is equal to 1.

Now here we run into a little bit of subtlety. I've got these n observations. Suppose I were doing a study of the heights of NBA players. There aren't so many of them, only a few hundred, so I could reasonably work with every possible observation. In that case, we call the observations the population. But if I were doing a study of the heights of human beings, I'm not going to sample all seven or eight billion humans currently alive; I have to settle for a subset of them. In that case, we say that these observations form a sample, and this is usually what we have. The statistics of samples are related to the statistics of the population, but they're not exactly identical, so we give them different symbols. For the sample mean we use x̄, but otherwise the formula is the same:

x̄ = (x₁ + ⋯ + xₙ)/n,

and the sample variance is

sₙ² = [(x₁ − x̄)² + ⋯ + (xₙ − x̄)²]/n.

Unfortunately, these symbols aren't completely standardized, and sometimes people are a little bit careless about them.

Now, imagine a thought experiment where I sample n humans over and over again, measure their heights each time, and calculate the statistics. If I look at the average of the sample means, that mean of means will, in a mathematical sense, converge to μ, the population mean. For that reason, we say that the sample mean is an unbiased estimator; unbiased in statistics generally means no error in some limit, or after some averaging. But that's not true of the sample variance sₙ²; it's a biased estimator. If I sampled over and over again and averaged all the sample variances, I would not get the population variance. Part of the reason is that we don't know the population mean exactly, so we're calculating the statistic based on something that itself isn't quite right. The unbiased version has a different denominator:

sₙ₋₁² = [(x₁ − x̄)² + ⋯ + (xₙ − x̄)²]/(n − 1).

This sounds pretty weird, but it's actually routine math, although it's long to write out. By dividing by n − 1 instead of n, this becomes an unbiased estimator of the population variance.

Unfortunately, it's not the case that the square root of that converges on average to the population standard deviation. It's basically impossible to find a simple, general unbiased estimator of σ, so people usually use the square root of either the sample variance or the unbiased estimator. Even that isn't completely standardized. If you ask NumPy for the standard deviation of a sample, it will give you sₙ, as though you had given it a population. Pandas makes the other choice: if you ask for the standard deviation of a Series, you'll get sₙ₋₁.
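Continuing the sketch from above (same assumed cars DataFrame and standardize helper; the column name "mpg_z" is also an assumption), the application and the NumPy/pandas discrepancy might look like this:

```python
import numpy as np

# Apply the standardization to one column and store it as a new column
cars["mpg_z"] = standardize(cars["mpg"])
print(cars[["mpg", "mpg_z"]].describe())   # mean ~ 0, std ~ 1 for the z-scores

# NumPy and pandas default to different denominators
print(np.std(cars["mpg"]))           # divides by n     (ddof=0): the biased s_n
print(cars["mpg"].std())             # divides by n - 1 (ddof=1): s_{n-1}
print(np.std(cars["mpg"], ddof=1))   # override NumPy's default to match pandas
```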
They both estimate the standard deviation of the population; they just do it a little differently.

Next up is the median. The median is defined as the value of x such that 50 percent of the observations are less than or equal to x. Of course, this could be for a population or for a sample. There's a simple way to compute an estimate of the population median if we only have a sample. Take the observations and sort them; I'll call the sorted values y₁ up to yₙ. If n is odd, then our estimator of the median is the value that's right in the middle, which works out to be at position (n + 1)/2 (when n is odd, n + 1 is even). If n is even, say I have six values, there's no value exactly in the middle, so I look at the two that are in the middle and split the difference: I take the average of y at position n/2 and y at position n/2 + 1.

Let's take another look at the empirical cumulative distribution function of one of our columns; here I'm looking at the standardized z-scores. Remember that this converts a value, the z-score, into a proportion or percentage. For example, the proportion of scores that are less than or equal to −1 is about 18 percent. But we can also go in the inverse direction: if I gave you a percentage of 40 percent, or a proportion of 0.4, and asked what value of z gives you that, the answer is right at about −0.3. This inverse, converting from a proportion or percentage back to a value, is called the percentile. To define it, we're given some value p, and the 100p-th percentile is the value of x such that 100p percent of the values are less than or equal to x. The median is the 50th percentile. Calculating that in an unbiased way for a sample is actually not totally trivial; I'm not going to go into it, and fortunately it's all done automatically for us.

If you have a set of percentiles that divide up the proportions equally, that's called a set of quantiles. For example, the 4-quantiles are called quartiles: the 25th percentile, the 50th percentile, and the 75th percentile. Sometimes people throw in the min and the max as well, as though those were the 0th and 100th percentiles. Unfortunately, pandas uses the word quantile as a function name, but it means percentile. I don't know what to tell you; I don't know how that happened, but that's the way it is. Now, if we go back up and look at the output of the describe method: there's a count of how many rows there are, and then for each quantitative column we get the mean and standard deviation, the min and the max, and the quartiles. If we want some other quantile, we can compute it by giving a list of the percentiles we want to take, as shown below.

Another place that quartiles in particular come up is in what's known as a box plot. There are some different definitions of box plot; this one is technically what's called a Tukey box plot. In this box plot, first of all, I'm using a different command: catplot, "cat" for categorical. That means I want to display how something depends on a categorical variable, in this case the origin of the car manufacturer. Again, I have to tell it what the DataFrame is, and then what's on the y-axis, which is the miles-per-gallon statistic. Then I ask it for a box plot, and I get this result: three columns corresponding to the three different origins.
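A sketch of these steps, again assuming the "mpg" dataset and column names from earlier:

```python
# describe() reports the quartiles by default; ask for other percentiles explicitly
print(cars["mpg"].describe(percentiles=[0.10, 0.25, 0.50, 0.75, 0.90]))

# pandas calls this "quantile" but it takes a proportion: here, the 90th percentile
print(cars["mpg"].quantile(0.90))

# Tukey box plot of mpg for each level of the categorical origin column
sns.catplot(data=cars, x="origin", y="mpg", kind="box")
```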
In each one, I've got this box. The top and the bottom of the box show the 75th and 25th percentiles, and the horizontal line inside the box is the median, the 50th percentile. We'll talk about the whiskers and these dots in the future, but the colored boxes are showing you the quartiles of the data.

An alternative to the box plot is a violin plot. Here I've put the three different origins on the y-axis this time. The inner lines show the same thing as the whiskers we haven't really discussed yet. The thick part, though, spans from the 25th percentile up to the 75th percentile, what we call the interquartile range, and the little dot in the middle is the median. So that inner line has the same information as the box and whiskers. Then the sides of the violin, which are symmetric, are a kernel density estimate of the distribution; remember, that's the method for smoothing out the histogram. So with this plot, you see the spread of the data as well as the detailed distribution of it.
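A possible version of the violin plot command, with the origins on the y-axis as described:

```python
# Violin plot: the inner bar spans the interquartile range, the dot is the
# median, and the mirrored sides are a kernel density estimate
sns.catplot(data=cars, y="origin", x="mpg", kind="violin")
```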
DS1: Summary statistics
From Tobin Driscoll, February 02, 2022
- mean, variance, standard deviation
- z-scores
- populations and samples
- median, quantiles
- box and violin plots