I'm going to pause the recording here, so somebody remind me when we start back with the slides. All right, one thing I wanted to do was talk briefly about Bayesian statistics. It's a core subject that underlies a lot of scientific discovery, a lot of machine learning, and really all Monte Carlo methods, which are extremely heavily used in the physical sciences. I swapped the slides around so the numbers are no longer in sync, but this is what we're going to talk about right now before we go back to null hypothesis rejection testing, and it's just a very high-level discussion. The most direct application of Bayesian statistics, as I've just said, is in Monte Carlo methods, particularly MCMC, Markov Chain Monte Carlo. When we do simulations and try to assess the best values, and the uncertainty on the best values, of the parameters of a model fit to data — we'll talk about this more generally when we do linear regression — that leverages Bayesian statistics very heavily.

I'm going to start with something I said last time; I think I put this slide here, right? Yes. We're talking about Bayesian statistics as opposed to frequentist statistics. I've defined frequentist statistics as the interpretation of probability that says the probability of something corresponds to the frequency at which that something happens: less probable events happen at a lower frequency. When we speak about the scientific discovery process, this means a datum has a low probability of taking a specific value — or rather a specific range of values — based on what you know about the probability distribution it comes from.

Bayesian statistics includes something that is missing from the frequentist interpretation of probability. And I want to highlight that Bayesian statistics is not new at all: it is as old as frequentist statistics, and its formalism may even have preceded it by a couple of decades — the theorem comes from the work of Thomas Bayes in the 1700s. The problem is that Bayesian statistics is more complicated and may appear more subjective, so it did not gain ground as rapidly as frequentist statistics.

Bayesian statistics relates the probability of something to the belief that it is true given new evidence. Let me just read it and then unpack it; it's easier that way. The probability that a belief is true given new evidence equals the probability that the belief is true regardless of that new evidence, times the probability of the evidence given that the belief is true, divided by the probability that the evidence is true regardless of your belief. The key point is that we introduce the concept of belief — in a sense, the believability of a model, the plausibility of a model — which is entirely missing from frequentist statistics. By doing that, we can in principle say things like: I have two models, they explain the data equally well from a frequentist point of view, but I'm going to choose the model that is most plausible.
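Written in symbols — this is just the sentence above, with M the belief (a model) and D the new evidence (the data):

$$
P(M \mid D) \;=\; \frac{P(D \mid M)\,P(M)}{P(D)},
$$

where $P(M)$ is the prior (the plausibility of the model on its own), $P(D \mid M)$ is the likelihood, $P(D)$ is the evidence, and $P(M \mid D)$ is the posterior.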
One way in which we leverage that a lot is by associating these concepts with Occam's razor and the concept of parsimony, which means we're going to believe that the simpler model we created is true over the other model, even if, from a frequentist point of view, they gave the same result. In a frequentist sense I can only tell you: this model predicts a value, and I measure a p-value that is inconsistent at the 5% level. It doesn't tell me anything about whether that model is sane — whether, and this is really the core point, the model is too complex to be believable, whether I've introduced so many variables that I could produce a good result no matter what because the model has a tremendous amount of flexibility. We'll come back to all of these concepts, and to model selection, over and over again.

So mathematically, what is written in that sentence means that we relate the probability of the model given the data to the probability of the model itself — the believability of the model, whether the model is reasonable at all — and to the probability of the data given the model. That last piece is the frequentist piece: how good did my p-value come out to be? And I have to divide — and this is a very sore point for Bayesian statistics — by the evidence. The evidence is the intrinsic probability of the data, the probability of the data regardless of any model. If that sounds very confusing, it should sound very confusing: we basically never have the evidence in Bayesian statistics; we never know the probability of the data regardless of any assumptions. The reason this will not matter — and I'm jumping several steps ahead — is that if I compare two models, model 1 and model 2, I can cancel the denominator, because the probability of the data is the same for the two models. So I can still say this model is better than that one based on the prior and on the probability of the data given the model, the likelihood — modulo the evidence, that is, regardless of the evidence, because the evidence is the same: it's the probability of the data regardless of which models I'm considering.

When we do frequentist statistics, by contrast, what we use is the probability of the pivotal quantity given the null hypothesis, not the probability of the null hypothesis given the data — the pivotal quantity plays the role of the data here. When we do that, we're effectively saying that the prior and the evidence don't matter. That's the difference between frequentist and Bayesian statistics.

Let's make sure we're on the same page with the math of the theorem; we're going back to combinatorial statistics. If A and B are not independent, the probability of A given B times the probability of B is equal to the probability of B given A times the probability of A. This was one of the rules we looked at in combinatorial statistics. Fine — I just massage it and move things around, I get the probability of B at the bottom, and this is Bayes' theorem, so long as A is my model and B is my data. Remember yesterday we were talking about the probability of physics given the data: really what we do is compute the probability of models — or model parameters — given the data. So that's Bayes' theorem. Just a quick note on terminology before we move on: I keep writing "the model" here, but most of the time we don't actually have the model itself in that slot.
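For reference, the manipulation just described in symbols — the product rule rearranged into Bayes' theorem, and the model comparison in which the evidence cancels:

$$
P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
\;\Longrightarrow\;
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)\,P(M_1)}{P(D \mid M_2)\,P(M_2)}.
$$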
So the way in which we most often use this machinery is to optimize model parameters. A lot of the time, what goes in that slot is not the model itself but the values of the parameters the model takes. If my model is a line, the equation of the line is y = ax + b: my parameters are a, the slope, and b, the intercept. Normally the way I leverage this is to say: let's assume my model is a line; then what are the best values for the slope and intercept? They are the values that maximize this piece, which we call the posterior distribution. The probability of the parameters themselves is the prior, and the probability of the data given the parameters is called the likelihood. You may have heard of maximum likelihood estimation: MLE basically says ignore the prior, ignore the evidence, and set the posterior probability of the parameters equal to the probability of the data given the parameters — so it's back to frequentist statistics. D is the data, and we've discussed what the evidence is.

The usefulness of the prior, a lot of the time, is that we can impose constraints: even used in a very naive way, the prior lets us exclude models that are unphysical. That's where we inject our domain knowledge. We can say the number of photons is non-negative — photons are unit objects and I will count a positive number of them — so I remove from the parameter space any parameter values that would lead to negative photons. Or, if we know from domain knowledge that two quantities are positively correlated, we can remove parameter values that would imply an anticorrelation. Things like that.

A lot of times we talk about priors being informative or non-informative. People will say: oh, I chose non-informative priors because I had no idea what range of parameters was reasonable. If you say that, you're fooling yourself. Normally "non-informative" means a flat prior, perhaps one that just removes some regions of the parameter space. But a prior like that is only flat in that coordinate system: change coordinates and it will no longer be flat, and therefore it becomes an informative prior. So don't be fooled — there is no such thing as a non-informative prior. What you can do is impose constraints that are completely uncontroversial. If I want to model the distribution of people's weight based on data and some medical information, I can remove parameters that would lead to weights over 1000 pounds, because I know people don't weigh over a thousand pounds. That is "uninformative" only in the sense that, if my analysis is done well, that is not where the parameters would have landed anyway. So the reason we actually put priors like this into analyses that leverage Bayes' theorem is largely just to make the analysis faster.
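In symbols, with θ the model parameters (for the line, the slope and intercept) and D the data:

$$
P(\theta \mid D) \;\propto\; P(D \mid \theta)\,P(\theta),
\qquad
\hat{\theta}_{\rm MLE} = \arg\max_{\theta}\, P(D \mid \theta),
$$

i.e. maximum likelihood estimation keeps only the likelihood, dropping the prior and the evidence.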
We're not going to waste time in the regions of parameter space that we know are not where the solution is going to end up — think about it that way. But be extremely conservative about removing chunks of your parameter space, because you might make your result converge conveniently toward what you want it to be rather than toward what the data says it is. Right — so much for non-informative priors.

The evidence: I already talked about it. We never know it, and it doesn't matter, because we compare models, and when we compare models the evidence is the same for all of them — any given dataset has the same denominator. So the place where the posterior of theta given the data is maximal is the same regardless of what I put there. Essentially, we're not looking for a specific value of the posterior; we're looking for where its maximum is, so that multiplicative constant does not matter. And I think that wraps up the preview — these are the definitions, for your reference, so you can look at them later. That's all we're going to say about Bayesian statistics for now; we'll get back to it when we talk about Markov Chain Monte Carlo, okay?

So, back to ignoring the prior, back to frequentist statistics and to pivotal quantities. I'm going to start again from the individual tests that we looked at yesterday and will look at again today. We started by saying there are a heck of a lot of tests and it is not at all trivial to decide which test is right; we looked at — or will now look at — four of them. What defines a test is the thing that I'm going to measure about my data, based on the question that I have and on the nature of my data: is it numerical, is it categorical? And the thing that I measure has to be a pivotal quantity — something that somebody has defined, studied, and figured out. What makes it pivotal is the fact that, under the assumptions specific to that test's null hypothesis, the pivotal quantity has a specific statistical distribution, so that I have an expectation for what I should measure under the null hypothesis, that is, if the null hypothesis is true.

So the test relates to the question directly. The z-test can only be applied if what I want to know is whether a sample has the same mean as a population. It doesn't tell me whether the sample has the same standard deviation as the population; it doesn't tell me about correlations between things. It only tells me whether, based on that one property of the sample, I can claim that the central tendency of the sample is different from the central tendency of the population you have proposed the sample comes from. I stick with the z-test — even though it has very limited applicability, because I need to have a population and I need to be interested only in the mean — because it's very transparent, so we can unpack it, and from that you can get a sense of how to unpack all the other tests. So: I have a population with some known properties, and suppose I were to operationally pull samples of individuals out of that population — where "individuals" doesn't mean people, it means whatever your data points are.
Then I would find that for every sample I take I can measure a mean, and if I compute this statistic for each of those samples and put them all together, I end up with a Gaussian distribution with mean 0 and standard deviation one. Okay — before continuing I need to unshare my screen for a second... okay. So I go and measure my quantity. In this case it's easy: I take the mean of the population, which is given to me because it is a population (in the real experiment yesterday we did it differently — we calculated our own "population" mean, which wasn't really a population, as some of you correctly pointed out). I measure the mean of my sample, I know the standard deviation of the population, I know the size of my sample, I put them together, and I get a number. And I compare that number to my expectation for what I should get if the null hypothesis — that the sample comes from that population — were true. My expectation, strictly speaking the expectation value of this distribution, is 0. But I'm not going to get exactly 0; that's just the expectation. I'm going to get something slightly different from 0, and I'm going to evaluate whether that "slightly" is consistent with my belief — in this case, the belief that the sample comes from the population — where my threshold is stated before the experiment. If I get something that had a probability of less than 5% of coming out — a p-value threshold of 0.05 is very common — then I'm going to say that is too improbable for me to keep believing that the model assumptions were correct, that the null hypothesis holds.

The z-test is particularly helpful because it gives us a number that is directly relatable to the p-value: the number that comes out is distributed according to a standard normal, a Gaussian with mean 0 and standard deviation one. So if I get 3, that means that in units of standard deviation I am three units away from 0 — I am at three sigma. If I get 2, I'm at two standard deviations, and if I had chosen a p-value of 0.05 I'm essentially right at that threshold (when you land exactly at the threshold you generally accept the null hypothesis, but that's a different story). If I get 0.13 — which I think is what we got yesterday when I checked — I'm well within two standard deviations, I cannot reject the null hypothesis, and it is a null result. So far, so good.

Now, the other tests are not so simple, because I don't know of any other test that produces a number distributed following a standard normal; they all follow something more bizarre. So let's start with the t-test. It's very similar to the z-test: it measures more or less the same thing, but between two samples rather than between a sample and a population. It measures whether, based on the central tendency, it is reasonable to assume that the two samples come from the same population — the same generative process. The distribution of this quantity computed between the two samples is similar to a Gaussian in that it is a symmetric, bell-shaped distribution, but the details are different: its shape is actually different from the shape of the standard normal, and it's determined by a single parameter, the number of degrees of freedom.
Remember we were looking at the table on Wikipedia, and the table tells us the parameters of the distribution we're considering. The only parameter of a t distribution is the number of degrees of freedom. I'll come back to degrees of freedom when we talk about chi-squared — I think it's more familiar there — but let's say we know what the number of degrees of freedom is. In fact we do, because for this particular test the statistic is distributed following a Student's t with a number of degrees of freedom given by this mess right here. It is a mess, but it's just algebra: it looks overwhelming, and it's trivially computable. Fine. So I get a number — and now what? This is not a distribution that trivially tells me how many standard deviations away I landed. It gives me a number, say two, and that number has a probability, so I have to go look up the p-value for getting two or greater. Keep in mind that we're always thinking in cumulative space: we're asking about values that exceed 2 — or, depending on whether it's a one-tailed or two-tailed test, that exceed 2 or fall below minus 2; that's a detail that is important, but I don't have time to really focus on it. So I'm looking at the cumulative distribution and asking where I am in cumulative space: where does 2 land on the x-axis, where does 3 land — what is the value of the cumulative distribution at 3? You can compute that in Python: you take the scipy package, invoke the module for this particular distribution, and ask for the value of the cumulative distribution at x equal to 3.

Traditionally, people look at tables, and the way you read a table like this is: you look up the number of degrees of freedom, and the table gives you the value the t statistic takes at the specific threshold you chose — like 0.95, i.e. your 0.05 p-value, which is the area under the curve you do not want to remove. So, for a one-tailed test, the table tells me what value that x on the previous plot would take if I had one degree of freedom. If that messy expression came out equal to one and I wanted a confidence level of 95%, a p-value of 0.05, then if I get less than 6.3 I cannot reject the null hypothesis, and if I get more I can reject it. Does that make sense? I'm taking the silence as a yes, and I'm moving steadily on.

The next thing we talked about is the KS test, and here I want to do our first breakout-room exercise. I'm going to leave it very open — there is no guiding notebook, nothing to look at and adapt — just a couple of pointers. You want to import scipy and the stats module from scipy; depending on your setup, stats might not become available just by importing scipy, so import it explicitly. Then you want to choose two distributions: one is going to be the standard normal, and the other is a distribution of your choice — chi-squared, log-normal, like you did last time when validating the central limit theorem. You're going to generate data from these distributions.
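Before the exercise, here is the scipy version of the table lookup I just mentioned — a minimal sketch; the degrees-of-freedom values below are just example numbers, not the ones from the test statistic:

```python
from scipy import stats

# The table lookup in scipy: the survival function sf(x) = 1 - cdf(x)
# is the one-tailed p-value for landing at x or beyond.
print(stats.norm.sf(2))          # z-test: ~0.023, i.e. about 2 sigma, one-tailed
print(stats.t.sf(2, df=9))       # t-test: same statistic, but with 9 degrees of freedom
print(stats.t.ppf(0.95, df=1))   # ~6.31: the one-tailed 0.05 threshold from the table, 1 dof
```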
So, back to the exercise. For example, if I wanted to generate data from the standard normal, it would be sp.stats.norm — sorry, scipy.stats.norm — and I generate random values from it with .rvs. By default that gives one value; let's say I want 100. What are the arguments again? loc, scale, size. I'm going to leave loc and scale alone — those determine the mean and the standard deviation — and just say size=100. That gives me 100 numbers: this is my fake dataset. And then we're going to use the scipy KS test. Now, there are two KS tests in scipy, and I'm going to encourage you to start with the one that uses a model.

Let me go back to the slides. What the KS test measures is the distance in cumulative probability space. If I don't have a probability distribution but data, then the cumulative distribution is the cumulative distribution of frequencies of the data points: what this curve represents is the fraction of the generated values that fall below minus five, the fraction that fall below five, and so on. You can compare it with another empirical distribution — another frequency of values — or you can compare it with a theoretical cumulative probability distribution: I can put a smooth curve here that represents a Gaussian. You measure the distance between the two curves, and there is some magic that some clever statistician figured out: that distance is distributed in a specific way. We'll talk about how it's distributed when we come back.

As for the test itself: if you just start typing KS in Python you'll get a lot of options, and several of them relate to the KS test. kstest is the one that does the theoretical comparison: it compares a sample with a theoretical distribution — it tells you, for example, something very important in science: whether a sample is normally distributed. The one that compares two samples is ks_2samp. So start with kstest. Look at its documentation and see how it's used; if you scroll all the way to the bottom, it gives you an example. The second argument is where you put — technically, in Python, it's called a callable — the name of a function, which you can define yourself, or it can be a predefined one; the normal (Gaussian) distribution is predefined as 'norm'. So it compares your experimental dataset with that theoretical model, and it returns two things. I encourage you to work out what the two things mean by yourselves first, by looking at the return values in the documentation and seeing if you can wrap your head around them; otherwise, when you come back, we'll talk about what you got. Do it for the normal data compared with the normal distribution, then pick one of the other distributions we've used, choose some parameters, and play around: see when it looks consistent with a normal distribution and when it doesn't. For example, you should trivially see that if you choose a binomial distribution with a very large mean, it should look more or less like a normal distribution.
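One possible version of the exercise setup being described — a minimal sketch, with a chi-squared sample standing in for "another distribution of your choice":

```python
from scipy import stats

# 100 draws from the standard normal, tested against the theoretical normal CDF
sample = stats.norm.rvs(loc=0, scale=1, size=100, random_state=42)
stat, p = stats.kstest(sample, "norm")   # 'norm' is the predefined callable
print(stat, p)                           # large p-value: cannot reject normality

# A deliberately non-normal sample: chi-squared data is skewed
skewed = stats.chi2.rvs(df=2, size=100, random_state=42)
print(stats.kstest(skewed, "norm"))      # tiny p-value: reject normality
```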
With a very large mean, your KS test should not be able to tell you it isn't normal; but if you choose a very small mean, then your KS test should be able to tell you it has nothing to do with a normal distribution — it's asymmetric, the cumulative distribution is different. Questions? Is that clear? Clear. I'm going to put the instructions in the chat: select a distribution; generate N values from it — the normal first, then another one; use the scipy.stats.kstest function (I'll give you the reference) to test whether the sample is normal or not. I'm going to put you in breakout rooms for six to eight minutes, then send a message to the rooms that there's a break, give you five minutes of rest, and we come back at 2:35.

Alright, so we are looking at the chi-square test — and we missed the beginning of this discussion because I forgot to record, so let me recap for the people watching the recording. The chi-square statistic is the sum of (data minus model) squared — the distance between the model and the data — divided by the uncertainty squared; in Pearson's chi-square test specifically, which is the most commonly used, the uncertainty is the uncertainty in the data. We call it chi-square because this quantity, as it turns out, is distributed like a chi-square distribution. That distribution has a parameter — so which one of the infinitely many chi-square distributions is it? This is where it gets intriguing: the one whose parameter, the number of degrees of freedom, equals the number of observations I have minus the number of parameters in my model — we write it here as N minus 1, and we'll talk about the general case later; for now just assume my model is simple, a one-parameter model. Remember that this parameter determines the mean of the distribution, the standard deviation, et cetera.

There is another great thing about the chi-square test. Since there are some physicists here and somebody has used it before: what value, when you run a chi-square test — the value you would publish — demonstrates that your model is actually a good model? Please, somebody answer. ...It should be equal to the number of degrees of freedom. That's right, because the number of degrees of freedom is the mean of the chi-square distribution: a chi-square distribution has mean equal to its number of degrees of freedom. Very commonly we say a model is good because "my chi-square came out equal to one" — but in that case we're not using the chi-square itself, we're using the chi-square per degree of freedom, the reduced chi-square, which means we divide by the number of observations minus one; very often we just divide by N, because there are going to be many more observations than parameters in the model, so we're a bit wishy-washy with this quantity. Fine. So a chi-square distribution with one degree of freedom has a mean of one, and I can look things up in a table.
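In symbols — a compact version of what was just described, writing $x_i$ for the data, $\mu_i$ for the model values, and $\sigma_i$ for the uncertainties on the data:

$$
\chi^2 = \sum_{i=1}^{N}\frac{(x_i - \mu_i)^2}{\sigma_i^2},
\qquad k = N - (\text{number of fitted parameters}),
\qquad E[\chi^2] = k,
$$

so the reduced chi-square $\chi^2/k$ should come out close to 1 for a good model; for the simple one-parameter case discussed here, $k$ is $N-1$.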
And then I can look up the threshold in the table — and this is the favorite number of a lot of physicists and scientists in general: 3.84 is your null-hypothesis-rejection threshold when you choose a threshold of 0.05, a p-value of 0.05, with one degree of freedom. If your chi-square test gives you a value of more than 3.84, you reject the null hypothesis that the model was a good description of the data — which, if you work within the falsifiability framework, is very exciting, because you've discovered something new about the data; if you were just trying to model the data, it's not such good news.

Okay, let's see — it's three o'clock. I really am going to put you in breakout rooms for quite some time to work on the exercise of reproducing this paper. I'll do that in chunks: you'll do a piece of the paper, we'll come back and look at how I did that piece, then you'll go back into breakout rooms, and so on. I strongly encourage you to work together — because you will learn more and be exposed to things beyond what you usually do. Even if these things are easy for you and you could do them by yourself, you might hear about somebody doing it a different way and learn something, or you might have to explain how you do it, which invariably teaches you a lot, because when we verbalize concepts we understand them much better.

But I'm not going to do that quite yet, because first we have to talk about a physics concept, as opposed to a statistics or data science concept, in order to be able to discuss the paper I'm asking you to reproduce. The physics concept is scaling laws. Scaling laws are really important for the interpretability of phenomena: if there is a scaling law, we have a hint about a relationship and can go investigate why that relationship exists — and for many scaling laws we already have ready-made explanations of why things change the way they do and what the relationship between the variables is.

As an overview — I'll explain what this means in a second — a scaling law just means that the quantities I measure are related by a power law with some exponent. A very trivial scaling law is the one relating the sizes of geometric shapes. A cube has a scaling law between the length of its side, the area of one of its faces, and its volume: the area of a face is related to the side length by a scaling law with power two, and the volume by a scaling law with power three. In this particular case, that means the ratio of the areas of two cubes equals the ratio of their side lengths raised to the power two, and the ratio of their volumes is the same ratio raised to the power three. Those are different scaling laws, but in both cases there is a scaling-law relationship between my variables. And that's regardless of the shape: for other geometric shapes too, the volume, the area, and the linear size always relate by scaling laws with the same powers. There is a constant, sure — when I calculate the volume of a sphere I calculate it differently than the volume of a cube, because there's a constant in front — but that doesn't override the dependence.
The dependence on the side of the cube is a power law with power three. So that's a scaling law — fine. And the existence of a scaling relationship between physical quantities reveals that there is some underlying mechanism driving the relationship. I'm going to give you two examples. One is from astrophysics: one of the most famous scaling laws relates the intrinsic luminosity of spiral galaxies — galaxies of a certain type that look like our own, big spirals — to the rotational velocity (which is spelled wrong here) of the stars in the galaxy. I had a video of the galaxy rotating; you're not missing much, it was just the galaxy rotating. So: I measure the rotational velocity of the gas and stars in the galaxy, I measure the absolute brightness of the galaxy, and when I plot them in log-log space they fall on a line. We're not going to talk about logarithms — I'm going to assume you've seen logarithms before and are comfortable with that level of math; if not, just take my word for it: if two quantities are related by a scaling law and you put them on a log-log plot, they look like a line, and the slope of that line tells you the power in the scaling law. You can do the math or not. The point is that this scaling law was crucial for understanding that there is a relationship between the variables — that the way they change is governed, in a sense, by something in common. And what's in common is gravity: the specific slope of that line led us to believe that the relationship is determined by gravity, that the stars feel the gravitational potential, which determines the rotational velocity. So the Tully-Fisher relationship was paramount in our understanding of galaxy dynamics.

Here is another very important scaling law, from biophysics and biology: the one that relates the metabolism of mammals to their size. I think this is really stunning — look how many orders of magnitude it holds over. It's a log-log plot, so the distances are squished a bit, but it holds from the mouse to the elephant: their body weight and their metabolic rate are related by a scaling law, and that scaling law says that the metabolism, the energy intake, is proportional to the mass to the three-quarters power. Three quarters is less than one, which means an elephant needs proportionally less food than a mouse compared to its weight: something that weighs twice as much as something else only needs two to the three-quarters times as much food to survive. And it turns out that this three quarters is a very significant number: it has been explained as relating to the distribution of resources in the body — specifically, the fact that distribution happens through vessels that are essentially pipes, and the efficiency with which those pipes can move things around. This holds for plants and for mammals, and Geoffrey West showed that it holds for a variety of systems that are describable as networks. In particular — and I think this should blow your mind, because it kind of blew mine — he showed that it holds for urban environments: there are scaling laws that govern the functioning of cities to an incredible accuracy.
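In symbols, the log-log statement used above: a scaling law is a power law, and on a log-log plot it is a straight line whose slope is the exponent (α = 2 and 3 for the cube's area and volume, α ≈ 3/4 for the metabolic rate):

$$
y = C\,x^{\alpha} \quad\Longleftrightarrow\quad \log y = \log C + \alpha \log x .
$$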
The number of crimes, the GDP of a city, the average income of a city, the intellectual creativity of a city as measured by the number of patents it generates — they all scale with the size of the population. This was measured by Geoffrey West around 2007; he has a pretty great TED talk about it, which is in the references, and I recommend it. And he explains it the same way: specifically, the fact that the slope here is consistent with the slope we've seen for metabolic rate indicates that what these quantities really depend on is how resources can be distributed in networks — in this case networks of streets, and in the case of mammals, the blood vessels. So that's our dive into scaling laws, and it justifies the importance of measuring whether scaling laws exist.

So far so good — the slides are over; I just have a bunch of references. Now we're going to look at this paper. It was written by Álvaro Corral, it's fairly recent, 2018, and it's a statistical test for scaling in the inter-event times of earthquakes in California. What this means is that Corral measured whether the time between earthquakes scales with the magnitude of the earthquakes. To do that he used the sample of earthquakes in California — which, as you probably know, is a very seismically active region, so there are tons of earthquakes of all scales — and he did it using a KS test. Let's look at the figure I'm asking you to reproduce. He measures a dimensionless quantity, described in the paper here — and my screen is frozen for a second, okay — a dimensionless quantity built from the time that passes between earthquakes and the magnitude of the earthquakes. He creates a cumulative distribution for these quantities, and he does that by cutting the sample. Remember we talked about how you sub-select part of an array by using a condition as an index into the array — I do want you to use that, because it's the most efficient way to do something like this. He selects earthquakes larger than some magnitude, then larger than a different magnitude, and so on, and shows that the cumulative distributions of this dimensionless quantity, which relates to the rate of the earthquakes, are the same — thus demonstrating that there is a scaling law between these quantities. He does that with the KS test, and he shows you the cumulative distributions because what the KS test really measures is the distance between the cumulative distributions. They all look very close, but I'll leave you in suspense about the conclusions.

So your task is as follows — here's the GitHub window; I made a folder inside my repository. Oh, I've seen a question in the chat, so let me answer that: what function would I use to standardize a distribution with scipy? I don't remember whether scipy has a standardize function; scikit-learn has one, inside the preprocessing package, and I think it's called scale. To be honest, I never use it, because sometimes there is such a thing as writing too many utilities for very simple things: standardizing just means dividing by the standard deviation.
Normalizing means subtracting the mean; generally, when we talk about standardizing, we do both. You can literally do that by subtracting and dividing yourself, and that has the advantage that you retain your mean and your standard deviation, which are often useful later for un-standardizing your sample after the analysis — which, it turns out, is actually quite important in general. (There was then a question, answered partly in Italian, about how the K-S test is applied to these subpopulations and about the dimensionless quantity that enters the scaling; the short answer is that this is not a trivial problem.)

All right. In the NHRT — null hypothesis rejection testing — folder, you'll load the data from this article; I have instructions there. The data from this article I pulled out myself. In principle, I would like you to think about whether this article is reproducible, in the sense that there is a data source that tells you where you can get the data, and it provides the code for doing the analysis — or at least describes the analysis in sufficient detail that I could reproduce it exactly. The problem is that the link provided for getting the data is no longer active. This happens all the time: people create links that go with their papers, but they're not embedded in the publication, so the journals don't enforce the maintenance of those links, and in my experience the majority of on-paper-reproducible research ends up not being reproducible when you actually go and try to get the data and reproduce it. So there are instructions on how I extracted the data — which means that, at the end of the day, you won't get exactly the same result, even if you follow the analysis literally step by step. I collected the data, and I explain in this document how I collected it; I'll give you the link so you can extract it yourself, but that required me to actually deal with a user interface that selects a region of California following the description — I had to go on this map, select this specific region, and then extract the data consistent with being in that region; the paper states what the region is, and so on. It was not a trivial endeavor. At the end of the day I gave you the data, so use the data that I gave you — but if you want to try reproducing the extraction, that might be fun, and you can see if you come up with something different than I did. This is a guided notebook, so the tasks are described one by one: the data extraction and collection task, the data reading and exploratory data analysis task, and then the selection of the data to make the different samples that we'll feed to the K-S test. And this is my reproduction of the figure in the paper.
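Jumping ahead, a rough sketch of what that sample selection and K-S comparison could look like in code — this is not the paper's exact rescaling, the magnitude cutoffs are made-up example values, and the column names (mag, datetime) assume the DataFrame built later in the notebook:

```python
from scipy import stats

def rescaled_interevent_times(df, m_min):
    """Inter-event times for earthquakes above m_min, rescaled by their mean."""
    sub = df[df["mag"] >= m_min].sort_values("datetime")   # condition used as an index
    dt = sub["datetime"].diff().dropna().dt.total_seconds()
    return dt / dt.mean()                                   # dimensionless quantity

tau_a = rescaled_interevent_times(eq_data, 2.5)
tau_b = rescaled_interevent_times(eq_data, 3.5)
print(stats.ks_2samp(tau_a, tau_b))   # distance between the two cumulative distributions
```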
And then there are some more tasks that have to do with the threshold, and more exercises that have to do with the fractions. So now I'd like to put you in breakout rooms and have you work on this, and I think I want to meet you every 10 or 15 minutes or so to talk about the specific tasks. A good goal would be, within 15 minutes, to be done with the descriptive analysis: to have read the data and reproduced these tables that show the values in the data. All the instructions you need should be there — there is a link to the paper and there are instructions on where the data is. I'm going to recreate the breakout rooms and send you there; I recreate them now, and I will come and visit you. Let me check the time — one second. Let's get back together at 3:35; that would be about ten minutes of work. Actually, I want to give you a slightly longer break, because this is a three-hour lecture, so let's meet again at 3:40: in my mind the breakdown is ten to fifteen minutes to read the data plus a five-minute break, and I'll come by your breakout rooms.

To read the data, you create a notebook in the repository, import pandas as pd, and call pd.read_csv with the link to the data file — the earthquake CSV in that folder. One catch: the link has to point to the raw file content (the "Raw" view on GitHub), not to the rendered web page, otherwise pandas cannot parse it. CSV stands for comma-separated values.
You may need to tell read_csv what the separator is — for example sep=' ' — if the file is not actually comma-separated. With a big file you might also worry about memory and read only part of it; I read the whole thing into a DataFrame that I called eq_data. In the interest of time, I'm just going to show you that those functionalities exist: for example skiprows, to skip the initial rows; nrows, to read only a certain number of rows; and usecols, to tell it which columns to read, by number, et cetera. For simplicity, let's just read it all, look at the data, and then remove the data that we don't need.

One quick thing: if I do eq_data.describe() at this point, it also gives me for free some information about which columns were encoded as numerical values and which weren't, and it tells me it found a number of numerical columns. In reality I don't need most of them: I gave you instructions to only retain — let me look back at my instructions — "Unnamed: 5", which is the magnitude, the date, and the time. Those are the only three variables I need. So what I'm going to do is rename those columns and extract only the renamed columns. I'm going to say eq_data.rename(...). You can call rename with different syntaxes; most commonly I call it by passing a dictionary that pairs the old name with the new name. The old names I need to pull out from what is defined here: the date column, the time column, and "Unnamed: 5" — I'll copy and paste them and then process them. So this one I call "date"; this is a dictionary, so the entries are separated by commas; this is the column with the time, so I save it as "time"; these others I don't need; and this one is saved as "mag", the magnitude of the earthquake. This will produce a DataFrame with all the columns retained and those columns renamed — oh, sorry, I need to tell it I want to rename columns, not rows, so axis equals one. It still retains all the other columns, and I don't care about the other columns, so now I pass it a list of my three columns: date, time, mag. Note there are two brackets here: one bracket says "I'm going to give you an index of something that I want"; the other bracket is of a different nature — it says the index is a list. Don't get confused: the two brackets mean different things. So this gives me the dataset that I want, and I want to overwrite the original.
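A minimal sketch of the reading, renaming, and column selection just described — the file name, the separator, and the original date/time labels are assumptions (only "Unnamed: 5" is the label mentioned in the instructions), so adjust them to the actual file:

```python
import pandas as pd

# Read the catalogue; file name and separator are placeholders for the real ones
eq_data = pd.read_csv("earthquake.csv", sep=" ")

# Rename only the columns we need (old label -> new label), along axis=1 (columns)
eq_data = eq_data.rename({"Date": "date",
                          "Time": "time",
                          "Unnamed: 5": "mag"}, axis=1)

# Outer brackets index the DataFrame, inner brackets are a list of labels
eq_data = eq_data[["date", "time", "mag"]]

eq_data.describe()   # quick numerical summary as a sanity check
```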
I have no fear about overwriting the original DataFrame, so eq_data is set equal to eq_data.rename(...) and so on. Note that rename can be done in place, but since I also wanted to extract only some columns, I did not use inplace. So now the only numerical variable is my magnitude, and I still have the same number of entries. That's looking good.

The next thing I need to do is use pd.to_datetime to convert the values of the date and the time. I'm going to run into a couple of problems. First of all, they are split between a date column and a time column, and I really want a single datetime value that tells me the date and the time of each event, because what I want to build is a delta time — how much time has passed between two earthquakes. The other problem I'm going to run into is that some of those times are encoded in a way that will not parse. Let's try some things: pd.to_datetime on eq_data.date — that works fine. What if I do it with eq_data.time? That doesn't work, for the following reason: some of those time strings have 60 seconds — "60.00" — which is not acceptable in the pandas datetime format; seconds should be between 0 and 59, and 60 should roll over to the next minute.

Now I have a couple of options, and I have to find these entries — there's one here; at this stage I don't know how many there are, though I know from having done it before that there are about six — and then decide what to do. One option is to drop those observations: there are only a few, so by and large it would not really hurt my scientific project. But I don't actually know that, because I don't know what the magnitudes of those observations are, and I have an idea that large-magnitude earthquakes are rarer — what if I remove a rare earthquake? That doesn't sound good. So I can modify them instead. I could roll over to the next minute, but I can also do something easier: these times are given to the hundredth of a second, and I have a hunch that I don't need that much precision. So rather than rounding up to the next minute, I turned "60.00" into "59.99", and I felt quite comfortable doing that.

The way I did it was in a for loop — I'm sure there are ways to do it better than a for loop; I couldn't think of them at the time (actually, now I can, but let me just do it with the for loop anyway). Yes, I know it's a for loop; let me keep it simple, because it's not easy to replace a piece of a string while being sure I don't mess up the rest of the string. So: for i in range(len(eq_data)) — I'm creating a loop over all the indices of my DataFrame — and for each of them, eq_data.iloc of that index pulls out the row, and I ask whether the time string ends with — endswith is a method of strings — a particular substring: if it does, the condition is satisfied and we go on to the next command. And what I want to say is: if it ends with "60.00".
If so — and the reason I'm doing this so explicitly, including choosing a for loop, is that I haven't inspected all the data, so I don't know whether a blanket string replacement might catch things it shouldn't. Just for my curiosity, I'm going to print out whatever shows up. This could be problematic if I didn't already know that there aren't many entries — but I know that, so why pretend otherwise. What do I want to replace it with? My proposal is: the root of that element, which is the first six characters — one, two, three, four, five, six — and then, instead of the 60, I append "59.99". For now I'm just running the loop like this to see what would happen, and not actually replacing anything, because if I mess it up I have to go re-read the data, the data is large, and that might cause problems. There it goes: it finds the entries ending in 60 and shows what would be replaced with 59.99, and the code is chugging along looking for all of them. I feel quite comfortable with that. So: eq_data of time — I'm still inside the condition, so this only happens if the condition is satisfied — is set equal to eq_data.iloc of i, the time string up to the sixth character included, with "59.99" added to it. This replaces my i-th entry, the one where I encountered a "60.00", with "59.99". That happened, and now I look at eq_data — and don't worry that I've been a little shabby about not using .loc in this piece; it still works.

Now eq_data has a date and a time. I can try converting them with pd.to_datetime — crossing my fingers — and the time doesn't give me any problems anymore. But what I really want is the date and the time together: if I just convert the time, pandas assigns it today's date and gives me a timestamp on today's date. I don't want that, and I don't want to have to fix the date afterward. So what I'm going to do, before passing things to pd.to_datetime, is put the two strings together: the string that makes the date and the string that makes the time. Let me do it first for the 0th element, without the list comprehension: eq_data.iloc[0].date plus eq_data.iloc[0].time — I have one extra quote there — this creates a string that puts together the date and the time, and I need a space between them. That single string I can convert with pd.to_datetime, and it gives me a timestamp with the right year and everything.

Now I want to do it for everything at once. Instead of a regular for loop, I'm going to use a syntax called a list comprehension. It's a different way to write a for loop in Python, and it turns out — for reasons I frankly don't fully understand; I don't actually know how it's implemented — to be more efficient than a regular for loop. If I wanted to do it as a for loop, I would say for i in range(len(eq_data)) and, inside the loop, replace each entry with this concatenated piece. I'm going to do it slightly differently — keep this "for i in range" piece in mind.
I'm going to use that piece in a list-comprehension syntax. Ignore this for one second: I write the expression that I want — eq_data date plus a space plus eq_data time, which has an i index in there — and then I put the "for i in range(...)" piece after it. This is called a list comprehension. It looks like a for loop, but it's better than a for loop: it returns a list, hence the name — a list of this expression evaluated for all the elements in the loop. In other words, it returns... nothing, because I have one extra parenthesis; but otherwise it returns a very long list of strings that are convertible with pd.to_datetime. If I were rigorous, I would have tested this on a subset of my data first — in fact, let me do that, because it takes some time: instead of len(eq_data), let me test it on the first 100 data points. That gives me the list; let me test that I can convert it — pd.to_datetime, I never remember the name — and it converts. So now I can do it for all my data, and I can assign the result to a new variable inside my DataFrame, which I'll call, for example, eq_data['datetime']. I'm going to let it run, and I expect this is about as far as you could possibly have gotten, if not earlier — so I'm going to let you go back into your breakout rooms, catch up on this, and continue with the rest of the tasks. I'll see you at 4:15. The following tasks, while I open the breakout rooms, include selecting only the range of data that is used for this analysis, and splitting it into groups that have different values for the smallest-magnitude earthquake — remember, the distributions we're comparing are the distributions of time gaps for earthquakes starting at a specific minimum magnitude.

The next step was to select, out of the entire period for which earthquake information was gathered — by me, through that website, into the CSV file I gave you — only the earthquakes that happened during non-active periods. This was very manual: I did it by looking at the paper and at the dates in which the authors identify the ranges of quiescent times, and this was deep in the text of the paper — "the stationary time periods under consideration in this paper are defined as" such-and-such ranges of years. I provided those ranges of years for you in the chat and in Slack, so let me get them from Slack, share my screen, and make sure I'm recording. This was quite laborious and quite manual. What I did — ignore this line of code for now — was first take those values as ranges, which are given as year plus fraction of a year, and then convert them to datetimes. So I had to give my code some information about what the original format was. There's a link — I'm looking for the link in the original instructions.
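As a side note on the datetime step above: here is an equivalent, vectorized version of the fix and the conversion just described — a sketch; the exact trailing string ("60.00") and the column names follow the assumptions made earlier:

```python
import pandas as pd

# Times ending in "60.00" seconds are not parseable: truncate to the minute
# (first six characters, "HH:MM:") and use 59.99 instead of the loop shown in class
bad = eq_data["time"].str.endswith("60.00")
eq_data.loc[bad, "time"] = eq_data.loc[bad, "time"].str[:6] + "59.99"

# Glue date and time into one string per row and parse everything in one call,
# instead of the list comprehension over range(len(eq_data))
eq_data["datetime"] = pd.to_datetime(eq_data["date"] + " " + eq_data["time"])
```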
There's a link to the entire function as I wrote it, but generally speaking, what I did was write a loop: for t in this array of year ranges, I put the results into a container that is not a numpy array. I'm not worried about using memory efficiently here, so I'll use whatever is most convenient to write: a list instead of an array, not pre-allocating memory, just letting it grow as the loop runs. Into this list I append, one after the other, the datetime object corresponding to each date, and I had to be a little careful with the string format. The way I ended up writing it is pd.to_datetime of the integer part of the number, which for, say, 1986.49615 is 1986, and I tell Python that this is in year format, '%Y'. Just to be sure I'm doing the right thing, I pull this out and test it on something like 1985; in fact I have a parenthesis problem, and once that's fixed it gives me the expected timestamp.
Then I had to deal with the 0.5. I would have thought there were better ways to do this in pandas, and I still think there are, but I couldn't figure one out. So what I end up doing is adding a pd.Timedelta, an interval of time: a floating-point number of days corresponding to the fractional part, t[0] minus its integer part (remember, t loops through the year ranges, so t[0] is the first year of a range). That gives me a value in fractions of a year, which isn't quite what I want yet, so I multiply it by 365.25, the number of days in a year to reasonable precision. Let me pull this piece out and see what it does; actually, let me comment it out and use it in a second. pd.Timedelta of (t[0], for example 1985.5, minus its integer part) multiplied by 365.25. I think I have one parenthesis too many: this opens this, I need one more here so that multiplies that, and then I close the Timedelta. This gives me about 183 days, which is a timedelta I can add to my year. Let me see if it works: it gives me a new date 183 days later, which happens to land exactly at midnight since I didn't add any hours, so the time part is always zero.
Now let me sort out the parentheses: that closes that, I need another one here, that closes that and then that. If I were to append this as is, I would only append the beginning of each interval, but I actually want to append two numbers per entry, the beginning and the end of my range. So I write this for t[0] and the same thing for t[1]. That closes that, and I'm fine with it; I need one more parenthesis, I can't keep track. I did something I don't usually do, which is append directly without looking at the result first, but it looks like it worked out. So now I have my array of datetimes, and I can use it to say: I only want things that fall between these pairs of dates.
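A sketch of that conversion, assuming the quiescent periods are given as decimal years; the ranges below are placeholders, not the actual values quoted from the paper:

```python
import pandas as pd

# Hypothetical decimal-year ranges (e.g. 1986.5 means mid-1986).
year_ranges = [(1984.0, 1986.5), (1990.25, 1992.0)]

def decimal_year_to_timestamp(y):
    """Parse the integer part as a year, then add the fractional part as days."""
    return (pd.to_datetime(str(int(y)), format="%Y")
            + pd.Timedelta(days=(y - int(y)) * 365.25))

ss = []  # one [start, end] pair of Timestamps per quiescent period
for t in year_ranges:
    ss.append([decimal_year_to_timestamp(t[0]),
               decimal_year_to_timestamp(t[1])])

print(ss[0])  # the first range as a pair of Timestamps
```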
So I'm going to create an index that goes through all of the earthquake dates and checks whether each one falls in any of these intervals. I'll do it as follows. I'll call it, for example, good_date. If I only wanted the first interval, it would be: eq_data datetime is greater than or equal to ss[0][0] (an unfortunate variable name), and, written with the mathematical operator because I like it better, smaller than the other end of the range, ss[0][1]. So I want dates that are between this and that. Fine. This generates a single set of dates, the ones between those two particular times, but I actually want the ones between any of these pairs of times.
I could do that with a list comprehension, but in the interest of time, let me just write it out, because I think that will be faster and we're almost at the end of class. Note that this is an AND: multiplying two booleans acts as an AND operator, because False equals 0, so if I multiply by 0 the term vanishes, and if either one is False the whole statement is False. Now I want the OR operator: I want to say it's in this range, or in that range, or in that range, any of them is fine. For that I can use the plus sign, because addition acts like OR: 0 plus 1 is 1, so True or False is True. And I can do that for all of the ss entries, however many I know there are. Again, there was a list-comprehension way to do this, and it would have been computationally faster, but it probably would have taken me longer to debug, and we're at the end of class. I need to go up one more index, and the same on the other side. (I forgot to hit record at one point, so this may be a little harder to follow, and my screen has a lot more stuff on it than yours; but hopefully my syntax is right.) This does return what I want, and it does.
I can look at how large the selection is. We won't have time to use it today, but I can take the sum: remember, these are boolean values, but booleans are also zeros and ones, so if I turn the list into an np.array and sum it, that tells me how many values were selected, the same as counting the Trues. And I think I missed something, because in my example the count was 18,852, so I think I missed some ranges here. I don't have time to debug it now, but we'll continue next week. I encourage you to do it yourself; I won't put you in breakout rooms for this next week, but we'll go over how I did it, all the way through the KS test, on Tuesday. That's it for now.
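A sketch of the selection step. The in-class version combined the comparisons arithmetically (* for AND, + for OR on booleans); this sketch uses the equivalent & and | operators in a loop over the assumed ss ranges, with toy data standing in for the real table:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the earthquake datetimes and quiescent ranges.
eq_data = pd.DataFrame(
    {"datetime": pd.to_datetime(["1985-02-01", "1987-06-15", "1991-01-10"])})
ss = [[pd.Timestamp("1984-01-01"), pd.Timestamp("1986-07-02")],
      [pd.Timestamp("1990-04-02"), pd.Timestamp("1992-01-01")]]

# OR together one AND-ed comparison per quiescent interval.
good_date = np.zeros(len(eq_data), dtype=bool)
for start, end in ss:
    good_date = good_date | ((eq_data["datetime"] >= start)
                             & (eq_data["datetime"] < end))

print(good_date.sum())         # booleans are 0/1, so the sum counts selected quakes
quiet_eq = eq_data[good_date]  # earthquakes inside any quiescent period
```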
NHRT in practice: reproducing Correll 2018
From Federica Bianco March 25, 2021