So we're going to talk about transformers now, not Optimus Prime transformers, but the machine learning models that have really shaken up the field of natural language processing, time series analysis, and even image processing and image recognition since 2017. Before that, I want to go back to a visualization that we didn't get to last time. I won't spend a lot of time on visualizations of, or for, neural networks, but just a few words at this stage about the kinds of things we can do to make our neural networks a little more transparent and understandable. We talk about the interpretability of a machine learning model, right? Neural networks are often frowned upon in applications where you require transparency. And we'll talk a lot about these kinds of things today when we talk about transformers: there are very significant ethical implications in the choice of model that you make, along many dimensions, and one of them is how transparent your model is.

I'm going to give you a very simple example. Suppose your model is predicting who should be incarcerated and who should not, or who should be given bail and who should not, based on some data-driven analysis of prior cases that evaluates the probability of reoffending or of skipping bail. Those people have a legal right to explanation, because you can't just tell a person, "you are not being given bail because that's my decision." If the decision is made by an expert, a legal scholar, a judge, the judge can say: I deny you bail because, in my experience, people who have committed this kind of crime and have these other things in common with you tend to skip bail, and this is what I think will happen. But if a computer model makes the decision, it's going to be very hard to give an explanation of what the decision is based on, if the model is inherently obscure. Neural networks, if we unpack them, are built of components that are each very obvious and transparent; each one is a variation on a linear regression. But when we put a hundred million of them together, it becomes very hard to understand how the input data relate to the output. That doesn't mean you're off the hook: you are still bound, morally and in some cases legally, to the right of explanation. So there's a lot of work on trying to increase the interpretability of machine learning models in general and neural networks in particular.

One way, the chief way, in which we do that is to look at which neurons get activated in the decision process. What I'm showing you here is something we'll talk about a lot today as well, because it relates to the concept of attention, which is the chief tool in the transformer models we'll discuss. These are saliency maps: maps of important pixels. It's quite simple, really. This is the input image, and you have a neural network with many, many layers; in this case a convolutional neural network, which we haven't talked about yet (we'll talk about it next class, so I'm skipping a few steps). The convolutional neural network figures out relationships between pixels in certain positions and builds from there, with multiple layers combining those relationships into more and more complex forms, until it gets to the answer to some big question, which might be: what kind of animal is this?
So when we try to understand how the network produced its solution, one thing we can do is look at what part of the image was actually used the most by the model. Saliency maps are maps of important pixels: a heat map that tells you which pixels mattered in making the decision the model made. They should make intuitive sense, right? If you're trying to classify the object in an image, you're less concerned about the background. If this is just object classification, one animal versus another, you're generally interested in the overall shape of the object, so you get a saliency map that focuses on the core region of the object. It will look different if, for example, what you're trying to do is classify the action that the subject of the image is taking, because then the context becomes important. A typical example is when you try to classify something like a frisbee throw: you do want the context of the frisbee in the air, which is not part of the person who threw it. Those are the kinds of things you see in saliency maps. The way you actually build one is by modifying the pixels in the input image and seeing how much your performance degrades; that gives you a direct measure of how important each pixel is.

Fine, but this is image processing, and we're doing time series analysis. What does this look like for time series? These are essentially the same thing, just a little less intuitive. This is for a recurrent neural network, I believe in this particular example a long short-term memory network, an LSTM, where somebody trained it to reproduce Wikipedia entries, like I did with the arXiv. So the input data are Wikipedia entries. What you see here is the input string in blue and green, colored based on the firing of a particular neuron that was chosen: if it's green, that neuron is firing, it's an important neuron. And what's interesting, if you look at this in sequence, is that you can see how different things are being learned by different neurons, or by the network in different training passes. Here the neuron that is activated spans the entire URL; essentially what the network is learning, I imagine, is to recognize when it is inside a URL, so in this particular pass it was learning URL syntax. In a following pass, it learns the double bracket as an important feature of the markup syntax in which these Wikipedia documents are written. And then in red you have the probability of the character that is being predicted. Here is where the input string ends and where the output stream begins, and you can see that at first the probability is high, it's fairly certain about what's going to come next, and then it loses certainty as it goes. For long-range predictions we don't really know what it's going to spin out, and in fact it starts producing more or less nonsense. Making sense?
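Going back to the saliency maps for a moment: to make the perturbation idea concrete, here is a minimal sketch of an occlusion-style saliency map. We slide a gray patch over the image, re-run a hypothetical classifier, and record how much the score for the true class drops at each location. The model.predict call and the patch size are assumptions for illustration, not the exact procedure used to make the figures in the slides.

import numpy as np

def occlusion_saliency(model, image, true_class, patch=8, stride=4):
    """Heat map of the score drop when each region of the image is occluded.
    `model.predict` is assumed (hypothetically) to return class probabilities
    for a batch of images."""
    h, w, _ = image.shape
    base = model.predict(image[None])[0][true_class]       # unperturbed score
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = image.mean()   # gray out a patch
            score = model.predict(occluded[None])[0][true_class]
            heat[i, j] = base - score     # a big drop means those pixels mattered
    return heat

Regions where the score drops the most when they are covered are exactly the "important pixels" that light up in a saliency map.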
So the way we try to make our neural networks a bit more transparent is by understanding the activation of the neurons and how that activation is associated with the output. I added this: there's a whole thesis (there's a link in the slides) about visualization-enabled interpretation of neural networks, particularly recurrent neural networks: RNNs, LSTMs, et cetera. Something that was kind of cool here is that you can select different neurons, and they are of different types. There are some neurons that are devoted to learning full words, and this particular one seemed to be really interested in specific letters, the letter H; it really picks up on all of the H's. That's for a vanilla RNN architecture. Remember, the vanilla RNN architecture is the one that suffers from short-term memory issues, from the vanishing gradient problem. And you can see, when you pass to the LSTM, how much longer the strings are. The network was not given a word tokenization (we'll talk about tokenization in a second), but it's learning entire words rather than single letters, and that's because the short-term memory problem has been partially solved. Okay, and that's all I had for the visualizations. We're going to move on to transformers unless there are questions. Going once, going twice... All right, so transformers.

Let's remind ourselves what problems we were trying to solve when we use neural networks for time series analysis. Neural networks in principle should be a really promising model for time series problems, meaning prediction, forecasting, and classification, because time series have some coherence between their elements, which is often not regular. We don't know in advance where that coherence is going to be, how some points in time relate to other points in time, but there are structures, and we can leverage those structures when we predict or classify. So in principle, a model that is very flexible and can reproduce highly nonlinear relationships is great; we want to use neural networks. The problem we encounter is that we're fitting really long strings of data, and the approach we had for predicting into the future is to predict recursively. But that recursive prediction causes mathematical problems, because the way we train neural networks is by taking gradients, and if you take the gradient of the gradient of the gradient, over and over, you risk (in fact you are all but certain) that the gradients will either explode, getting really large really fast and giving you mathematical instabilities, or, even more commonly, because your sigmoid activation functions get pushed toward one or zero, the gradients will vanish right away. What that means is that we lose memory of steps back, even as few as five or six steps back. So one solution we found is that maybe we can use only some of the steps in time: we can figure out a way to forget intermediate steps, so that we can tap into earlier things that might be more important than the intermediate ones.
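Just to put a number on how fast those recursive gradients die off, here is a quick toy calculation. The recurrent weight is made up; the point is only that the derivative of a sigmoid is at most 0.25, so backpropagating through many time steps multiplies many small numbers together.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
recurrent_weight = 0.9     # made-up recurrent weight, just for illustration
grad = 1.0
for step in range(1, 31):
    # each step back in time multiplies the gradient by the local sigmoid
    # derivative (at most 0.25) times the recurrent weight
    x = rng.normal()
    local_derivative = sigmoid(x) * (1.0 - sigmoid(x))
    grad *= local_derivative * recurrent_weight
    if step in (5, 10, 30):
        print(f"after {step:2d} steps back, gradient magnitude ~ {grad:.1e}")

After only a handful of steps the gradient is already tiny, which is the "memory of five or six steps back" problem in numbers.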
So we implemented a forget gate: a mathematical way of saying that some of these pieces of data can be dropped, so that memory is freed up for other things that are further in the past and more important. The problem is that it still doesn't go very far back. It works, but it doesn't work on very long sequences; it still retains only a few words, or a few hundred characters. There is something else we haven't talked about, which is that I could use convolutional neural networks, which are designed to work in the spatial domain, to learn relationships between characters at some spatial separation, and that too works in some regimes. But it has problems: there is the problem of encoding things in the right way, of encoding a time series into a spatial representation, and it also tends to be expensive to train.

So there's a model that solves all of these problems, and it's called the transformer. Its perhaps most well-known implementation is called GPT-3; we'll get to what that stands for, maybe, later. It became super famous because it can generate text that is, I'm going to use the word, coherent, given very small inputs. In principle, you can say "write a paper about my project on epilepsy and time series analysis," and it's going to write something that sounds coherent. I'm going to start talking about this not by telling you what the model is, but by telling you about the controversy that was stirred up by the existence of this model, which I think is super interesting.

So this is one of the pitches for GPT-3, this implementation of a transformer model. The transformer model is from 2017; GPT-3 became public in 2019, maybe. And the pitch is: it's a better language model. "We've trained a large-scale unsupervised language model." These are time series models, models that work on sequences, though not necessarily in time. The chief application is going to be NLP, natural language processing, so they're generally referred to as language models, but like I said, you can use them for whatever time series you want. It's quite impressive. The pitch continues: a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization, all without task-specific training.

We're going to demonstrate that. I want you all to go to this website and type something, and see how it completes your sentence; just keep feeding it pieces of a sentence and it will make a story. The link to this transformer demo is on the slide (let me double-check that that's the right link). You may be prompted; I don't remember if I had to make an account. I don't think so, I think I just clicked on Start Writing. Notice that the second option below is an arXiv NLP model. That's exactly what I showed you as an example last time: it creates arXiv postings from strings of arXiv text, so it will create an arXiv abstract. But you can write whatever you want. What you want to do is write something and then trigger the autocomplete, which is on the top left. Share it when you get something interesting. You can also tweak the temperature and some of the other parameters.
Those are the hyperparameters of the model, of course, at the bottom left. How's it going on Zoom? Anyone want to share their sentence, their story, an observation? You can also drop a screenshot. I will read my story to get you started, and then anybody else who wants to can read theirs. So I started with "there were vampires out that night," and the model said they would have been more evil had he been able to kill anyone before he actually ate them; in other words, vampires were good people and bad people. Notice that something is going on here; I'll get back to this later because it will be relevant. Anybody else have a paragraph they're proud enough of to want to read? Don't be shy. All right, I'd like all of you to put yours in the chat then: if you're not brave enough to read it out loud, just put it in the chat. And that's where it got stuck, I'm guessing. Interesting. And this is mine. Okay, thank you.

So we have a model that can make up sentences. What could possibly go wrong? Well, for example, something like this. This person (Mr. or Dr., I'm not sure, I think he's a grad student) was probably the first one that I know of to expose some issues with GPT-3's automatic text generation, on Twitter of course, because that's how society works. He asked GPT-3 to answer the question "What ails Ethiopia?" And the answer was something like: the main problem with Ethiopia is that Ethiopia itself is the problem; it seems like a country whose existence cannot be justified, and even if it could be in theory, there are other countries which would be better. Okay. So that's one.

There is another thing that has been noted, which is that the model will make certain decisions for you. For example, if it is asked to translate from a language that is not gendered into a language that is gendered, it's going to have to make a decision about gender, and that decision will be based on its training set. Lo and behold, our society is not unbiased, and therefore the model makes very biased decisions: it decides, for instance, that it is "she" who is beautiful or does the housework, while "he" works or publishes, all of this from sentences whose original pronouns were unknown or gender-neutral. It assigned gender based on the activity, based on the slew of examples it retrieved from the internet, which had their own inherent bias.

So, who has heard of Timnit Gebru? Check on Slack if anybody has heard of Timnit Gebru. Timnit Gebru was an ethicist at Google, because most companies that do AI, like Google, have begun hiring ethicists, one might argue mostly to make sure they have some legal cover. So Timnit Gebru was on the ethics team at Google, and this team, in 2020, maybe 2021, published a paper called "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", which they had submitted, and which had been accepted, at NeurIPS, which is the largest conference on neural networks and artificial intelligence. And she got fired by Google, in a very confusing way. The news (like I said, this is really the world we live in) was shared on Twitter.
And apparently, according to Timnit Gebru, she was fired by Google after objecting to a manager's request that she retract her name from the paper, and to what Google had communicated to her about her publications. Since then, more than 2,100 Google employees have signed a letter demanding more transparency from the company, and so on. So she was fired because they asked her, "we need you to remove your name from this paper," and she refused. The story of the firing is a little more complicated than that, but it doesn't matter here. Let's see what the paper says.

The paper says: the past three years of work in natural language processing, or NLP, have been characterized by the development and deployment of ever larger language models, especially for English. BERT and its variants, GPT-2 and GPT-3, and others, most recently Switch-C, have pushed the boundaries of the possible, both through architectural innovation and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks, as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: how big is too big? What are the possible risks associated with this technology, and what paths are available for mitigating those risks? We provide recommendations, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Okay, let me break down the claims in the paper. We have identified a wide variety of costs and risks associated with the rush toward ever larger language models, including: environmental costs, borne typically by those not benefiting from the resulting technology. Let me give you a bit of detail on that. While the average human is responsible for an estimated 5 tonnes of CO2 per year, the authors of one study trained a big transformer model with neural architecture search and estimated that the training procedure emitted around 284 tonnes of CO2. When we perform a risk-benefit analysis of language technologies, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. It is well documented in the literature on environmental racism that the negative effects of climate change reach and impact the world's most marginalized communities first. Is it fair, for example, that the residents of the Maldives, likely to be underwater by the year 2100, or the 800,000 people in Sudan affected by drastic floods, pay the environmental price of training and deploying ever larger English language models, when similar large-scale models aren't being produced for their own languages? Beyond the environmental cost there are financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques; and opportunity costs, as research effort is steered away from directions that require fewer resources.
And there is the risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should a human encounter seemingly coherent language model output and take it for the words of some person or organization. Who has accountability for what it says? The word accountability is a really critical word in the ethics of AI. Who is accountable for what a model puts out? The researchers who built the model? The people who commissioned it, the people who deployed it, the people who made decisions based on it?

I want to read one more paragraph: size doesn't guarantee diversity. The internet is a large and diverse virtual space, and accordingly it is easy to imagine that very large datasets, such as Common Crawl (petabytes of data collected over eight years of web crawling, a filtered version of which is included in GPT-3's training data), must therefore be broadly representative of the ways in which different people view the world. However, on closer examination, we find several factors that narrow internet participation. Starting with who is contributing to these internet text collections: internet access itself is not evenly distributed, so internet data over-represents younger users and those from developed countries. And it is not just the internet as a whole; there is a question about the specific samples. For instance, GPT-2's training data is sourced by scraping outbound links from Reddit (I don't know if you've been on Reddit, but it's not the most well-balanced place), and a 2016 Pew Internet Research survey revealed that 67% of Reddit users in the United States are men, and 64% are between the ages of 18 and 29. Similarly, surveys of Wikipedians find that only 8.8 to 15 percent are women.

So the punchline is that these language models ingest some version of reality, which includes all of the biases that reality has, and they exacerbate them along axes related to privilege and participation in the internet and in the text corpus, because they use that corpus as the example to build on. The model encodes the bias, and it can exacerbate it, because now GPT-3 is producing text which can then be ingested by GPT-3 itself, or by other language models, which then double down on the same biases. That is what the paper is about. And here is what I propose for the next mandatory homework: we will look at transformer models, and I'll propose some things for you to play with, but I will also ask you to carefully read either this paper or the transformer paper, the paper that introduces the transformer, by next week, and I will ask you to talk about it in your quiz next week. Okay?

So, having talked about the dangers of language models, let's go and look at how transformers work and why they work. There are a few things we haven't talked about, and a few things we have talked about but will need to revisit, to put the transformer architecture together. The model is not particularly complex; it's here on the left. It's a two-element model: it includes an encoder and a decoder. You're familiar with this, because we talked about an encoder and a decoder in... which model that we looked at? Encoder is in the name of it. The autoencoder, exactly.
In the autoencoder, the piece from the input to the bottleneck is the encoder: it encodes the input data into a lower-dimensional representation, and then the decoder decodes that representation back into the data. It's similar here: there will be an encoding piece and a decoding piece. The transformer leverages the attention mechanism (we're going to look at what the attention mechanism is), and specifically it leverages what's called multi-headed attention. But let's start with attention.

Essentially, attention is just matrix multiplication, more or less like all the rest of machine learning. Let's start with a very trivial, simple, and not entirely correct picture of what's happening. Say I have some input and some potential outputs, some words that I could choose to express as output. I want to see how strongly the pieces of this puzzle relate, so I can build a matrix that tells me, say, V1 and K1 don't relate very strongly, whereas V2 and K2 relate very strongly. Let's put in something a little more familiar. Again, this is the Discovery Channel version, not exactly how it works; we'll get a little closer to how it actually works in a second, this is just to wrap our heads around it. This is the same sentence we worked on last time: "the cat that ate was full and happy." "The" is a generic word: it doesn't relate very strongly to anything, and it doesn't have a lot of predictive power. "Cat" and "was" are strongly related, because the fact that I have a singular "cat" tells me it should be "was" and not "were." And maybe "full" and "ate" relate, because they are concepts that are somewhat similar.

So let's talk about this in a slightly more rigorous way. Not too rigorous, but a little bit. The first thing we do when we work with natural language is tokenize the words: we don't use the words as they are, we tokenize them. To tokenize a sentence really means to break the sentence down into units. I can break it down simply by assigning a token per word, but some words are composite, they have prefixes, so I can split those into a token for the prefix and a token for the rest of the word. Some of the syntax may also become tokens, like the punctuation here. So I'm going to break the sentence down into its constituent units. By the way, I've taken these graphics from this video, which is actually quite instructive about the whole attention process as implemented by the BERT model. Next, I'm going to embed those tokens, because we think in words but computers think in numbers, so I have to turn words into tokens and tokens into numbers. Each token is going to be associated with a specific vector. Typically these vectors are hundreds of elements long, because there are a lot of tokens out there and each one has to have a unique representation. The advantage is that now I can do mathematical operations between vectors: for example, if there is a vector for a prefix, I can add it to the vector for the rest of the word, and that mathematical operation results in a complete word, which might correspond to a different token. Or I can do semantic math, change the gender of a word, things like that, through mathematical operations. Fine.
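As a toy illustration of the tokenize-then-embed step (with a made-up vocabulary and random vectors, not BERT's actual tokenizer or learned embeddings), something like this:

import numpy as np

# hypothetical subword vocabulary: whole words plus a prefix token
vocab = ["the", "cat", "that", "ate", "was", "full", "and", "happy", "un-"]
token_id = {tok: i for i, tok in enumerate(vocab)}

rng = np.random.default_rng(42)
embedding = rng.normal(size=(len(vocab), 8))   # 8-dim toy vectors (BERT uses 768)

def embed(tokens):
    """Look up one vector per token."""
    return np.stack([embedding[token_id[t]] for t in tokens])

sentence = ["the", "cat", "that", "ate", "was", "full", "and", "happy"]
X = embed(sentence)
print(X.shape)     # (8 tokens, 8 dimensions)

# vector arithmetic is now possible, e.g. a prefix vector plus a word vector
composite = embedding[token_id["un-"]] + embedding[token_id["happy"]]

In a real model the embedding matrix is learned, so that this kind of arithmetic lands near meaningful tokens; here the vectors are random and only the mechanics are shown.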
So each word, each element of the sentence, is a vector that is some number of elements long; in the case of BERT, the original one, it's 768 elements. Now I can bring in a mathematical operation: the dot product. Remember why I like the dot product? I like the dot product because it tells me when things are similar: if two vectors are parallel, the normalized dot product tells me they are the same, it's one; if they are perpendicular, the dot product is zero. So if I create my token vectors wisely, I can learn whether words have a high probability of having a relationship with each other based on the angle between them.

The example in this video is "walk by the river bank." This is a common sentence for illustrating these things because "bank" is ambiguous, right? The word "bank" can mean the place where you keep your money, but in this case it doesn't; it relates to "river." So the dot product between something like "by," which is a generic element of a sentence, should probably be small with any other element of the sentence. "By" should live in a fairly isolated region of the space, because it doesn't tell me a lot about the context of what I'm reading. Whereas the words "river" and "bank" should have a high dot product, because they are related. We talk about similarity with the dot product, but really what we mean here is not that the words are similar; it's that they are related: they have a large probability of being associated and of determining each other.

All right, so that's what we do when we build an attention matrix: we take all these tokens and look at the relationships between them. Now, if I have a series of candidate tokens that I might, for example, predict in the future, a set of options, words I could choose, I can tell, based on the other words in the sentence, whether a given word has a high or low probability of being relevant at the position where I want to place it. Making sense? So that's the attention mechanism: essentially a big matrix that tells me how tokens relate to each other, in a probabilistic sense. So far so good?

Attention is a set of weights, right? I can interpret the entries of this matrix as weights, and when I apply the attention, what I'm doing is a weighted sum, weighted by the elements of this dot-product matrix. There's one more step: instead of using the actual tokens that make up the sentence, models like GPT-3 use a representation of those tokens, an encoding of those tokens. There is a prior step that converts the sentence into a new space (think of it as a projection, like PCA) into a space of vectors that are all the same length. That's just a mathematical operation to simplify, quite literally, the math and the number of operations required. The language used in this attention work comes, I think, from the 2017 paper, though the paper I quoted earlier in the slides, "Neural Machine Translation by Jointly Learning to Align and Translate," is also a paper about attention.
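Before getting to the keys, values, and queries language, here is the dot-product idea on the "walk by the river bank" example as a toy numpy sketch. The three-dimensional vectors are hand-picked for illustration (real models learn them), chosen so that "river" and "bank" point in similar directions while "by" is nearly orthogonal to everything else; a softmax over each row then turns the similarities into attention-like weights.

import numpy as np

toks = ["walk", "by", "the", "river", "bank"]
# hand-picked 3-d toy embeddings: river and bank are nearly parallel
E = np.array([[1.0, 0.2, 0.0],    # walk
              [0.0, 0.0, 1.0],    # by   (isolated direction)
              [0.1, 0.1, 0.0],    # the  (small, generic)
              [0.9, 1.0, 0.0],    # river
              [0.8, 1.1, 0.1]])   # bank

sim = E @ E.T                                                    # pairwise dot products
weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)   # row-wise softmax

for t, row in zip(toks, weights):
    print(t, np.round(row, 2))
# 'bank' gives most of its weight to 'river' and to itself; 'by' has low
# similarity with everything else, so its row is spread out: it does not
# pin down much about the context.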
If you read these papers, you will find references to keys, values, and queries. I want to unpack that for a second. This comes from the information retrieval context. If you're doing a Google search, you have a query; on the back end, that query is associated with keys, and the association between the query and the keys gives you the values, the result of the query. Think about it the same way here: the embeddings of my tokens are called queries, keys, and values. It gets a little more complicated because there are now three elements and not two, but that's just because the tokens get projected into three different representations. So now I have a way to relate them. I have a little spiel here about how to think about queries, keys, and values, but I don't want to spend a lot of time on it; we'll see more of it when we code it up next class.

So at the end of the day, when I say attention, what I mean is a mathematical operator that acts on queries, keys, and values: it is the softmax of a matrix multiplication, softmax(QKᵀ/√d_k)·V. Q, K, and V are the vectors stacked together into matrices. Queries times keys-transposed is my matrix multiplication; √d_k is the normalization factor that makes sure I don't get crazy numbers; and the result gets multiplied by the values. The softmax, as we know, pushes things that are high even higher and things that are low even lower, it stretches the range, so that I really pay attention to the elements with high values.

So that's the attention mechanism. When, in transformers, we talk about multi-headed attention, what we mean is that instead of having one attention matrix relating a set of keys, values, and queries, we have several, so that I can relate the same token to multiple tokens in the query (or the same future token, the candidate word I want to predict, to multiple previous words in the input text) in different ways. That way I can really begin to capture the contextual structure of a sentence. Because, you know, when we talk about "the cat that ate was full and happy," the fact that the cat ate tells me about the timing, it tells me about the past; it relates to the fact that it's full, since "ate" and "full" are concepts that are close to each other; and the verb form I use is tied to the cat and to the past tense. You know what I mean: one word can be influenced by multiple other words in the sentence. So multi-head attention provides a variety of ways in which words relate to each other, and I can combine them to get the probability of the next word that I'm going to predict. Making sense? Okay.

So next, what is the actual architecture? That was only attention, and it can be implemented in a number of ways. This particular flavor is self-attention; I'll tell you why it's called that in a second. There are other, slightly different ways to implement attention (you can implement it with an additive form instead of a dot product; there are other mathematical formulations), but the goal is always the same, and the general description is not too different. So what is the architecture of a transformer model? Like I said, it's a two-element model: it has an encoder and a decoder.
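Before we get into that architecture, here is a minimal numpy sketch of the formula above, softmax(QKᵀ/√d_k)·V, together with a naive multi-head version that runs several independently projected heads and concatenates the results. The random projection matrices are stand-ins for learned weights; this shows the shape of the computation under those assumptions, not a faithful re-implementation of any particular model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V : scaled dot-product attention."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # token-to-token relevance
    return softmax(scores, axis=-1) @ V   # weighted sum of the values

def multi_head(X, n_heads=4, d_head=16, seed=0):
    """Naive multi-head self-attention: each head gets its own random
    (stand-in for learned) projections of the same input X."""
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(X.shape[1], d_head)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(heads, axis=-1)  # (n_tokens, n_heads * d_head)

X = np.random.default_rng(1).normal(size=(8, 64))   # 8 token embeddings, 64-dim
print(multi_head(X).shape)                           # (8, 64)

Each head sees the same tokens through a different projection, which is the "variety of ways in which words relate to each other" from above.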
The encoder is the piece to which you give your input; it encodes the input. Each one of these graphical elements is one layer of the neural network, and the original vanilla transformer from 2017 stacks six identical copies of this element. Notice that I'm skipping over something: there is a positional encoding. Another thing this particular model introduced is that, because we've now tokenized and projected our tokens into keys, values, and queries, we may lose information about the location of the words, of the tokens, in the sentence. So there is a positional encoding, done through sines and cosines, that tells you in which position of the sentence each particular token appears, and it happens as an intermediate step before the network proper, before the encoder and before the decoder.

The encoder gets just the input. The input is the past of the sentence, whatever you typed, for example, when you played with the demo. You feed that to the encoder, and the encoder uses multi-headed attention to relate elements of that sentence to each other: if the input were "the cat that ate was full," it would be multi-head self-attention that relates pieces of that sentence to one another, so I would have a bunch of these matrices relating the elements of that sentence to each other. It does this six times: each time there's a multi-headed attention and a feed-forward neural network, over and over, because as we know from deep learning, rinse and repeat actually helps.

Then there is the decoder. The decoder does the same thing except reversed: it feeds on itself, on the output embedding. It takes what it has predicted so far, re-encodes it, joins it with the information from the encoder, and then has a feed-forward neural network. So again it's six times the same architecture, but this architecture has two multi-headed attention blocks: one that acts on the output of the decoder itself, and one that acts on the output of the encoder. There's one more caveat: in the decoder, you have to be careful not to use the future; that would be cheating. So the multi-headed attention in the decoder is masked, so that you can only refer to past elements: what you have predicted so far, plus the input that was given to you, can influence the prediction, but future elements cannot. Makes sense? And that's trivially implemented by a mask that multiplies things by zero, or it could be an additive mask; I think the way they implemented it is with a multiplicative mask.

And that's about all I have. That's it, that's the transformer. What I want to do next time, and I'm going to try to make it a little less painful than it might be in its native state, is this: there is a really thorough and comprehensive tutorial (am I still screen-sharing? no) on transformers and multi-headed attention that essentially writes it down, not quite from scratch, it still uses a standard deep learning library, but it puts the architecture together piece by piece. I want to go through some elements of it; for example, we'll look at how they actually build the attention cells, et cetera.
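In the meantime, two small sketches of the pieces I glossed over above: the sine-and-cosine positional encoding that gets added to the token embeddings, and a causal mask for the decoder's self-attention. The additive form shown here (adding a very large negative number to future positions before the softmax) is one common way to do the masking; a multiplicative mask achieves the same effect of giving future tokens zero weight.

import numpy as np

def positional_encoding(n_positions, d_model):
    """Sines on even dimensions, cosines on odd ones, as in the 2017 paper."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe                                      # gets added to the embeddings

def causal_mask(n_tokens):
    """0 where attending is allowed, a huge negative number above the diagonal
    (the future); add this to Q K^T / sqrt(d_k) before the softmax."""
    return np.triu(np.ones((n_tokens, n_tokens)), k=1) * -1e9

print(positional_encoding(5, 8).round(2))
print(causal_mask(4))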
So we'll go through the multi-headed attention piece, so that we get a little more transparency into how we actually build these neural networks. Okay? And like I said, what I would like you to do is either read "Attention Is All You Need," the original transformer paper from 2017. I want to point out one really cool thing in it that I'd like you to be inspired by when you write your report and tell me what each of you contributed: they have a very detailed and very well organized statement of the contributions of the authors of the paper. When you write an article, as you probably all know, it's a bit contentious who gets what position in the author list. Some subfields order by main contributing author; some fields use alphabetical order; in some fields the PI goes last, in other subfields not. And then you go on the job market and you don't have enough first-author papers because of the conventions of the field you were in. Journals and conferences that accept these kinds of contribution statements help solve that problem, and I thought this one was particularly nice: equal contribution, listing order is random; so-and-so proposed replacing the recurrence with attention, and so on; it goes all the way down to statements like "spent countless hours designing various parts of, and implementing, the codebase." Anyhow, most of the paper is not about that; most of the paper is about the architecture. This is the figure I extracted for these slides; it's the same exact figure. What should you pay attention to? Just try to follow how they describe the innovation and what the innovation is, and if you have questions about things they bring up that we haven't covered, put them on Slack. Alternatively, if that's a bit dry for you, I want you to read (where is it) "On the Dangers of Stochastic Parrots," which is Timnit Gebru and colleagues' paper, the one I extracted sentences from before. Neither of them is really long. This one is a little longer, but a little less dense, no mathematical formulas; the other one is a bit shorter. Pick your poison.

And we're actually out early for once, unless you have questions. No questions? None on my end either. I will see you on Thursday; on Thursday I'll do the round-robin again and see how your projects are going. Okay, thanks.

Excuse me, yeah, I do have a question, not for this slide but for an earlier one, the papers one.

Perfect, yes.

I saw the slide and I'm still stuck on the point where you say the LSTM is better than the RNN because it solves the vanishing gradient problem. I'd like a clearer explanation of that.

Yeah. So remember that the LSTM, long short-term memory, is the neural network that implements the forget gate, right? The forget gate is a gate that says: you can ignore elements of the sentence that are intermediate between the one you want to predict and several steps back. And that means that, effectively, the gradient does not get flattened there, because you are essentially not taking the gradient on those steps; you jump over them. Does that make sense?

Not really. What I understand about the vanishing gradient is that when your weights are less than one, then, as the information propagates back through many factors less than one, the total gradient goes to zero.
So essentially, think of it as the forget gate allowing you to hop over some of the elements; some people would call this something like a skip connection. Essentially, you're skipping the connection. Yes. Okay. And another question...

To be honest, it's a little fuzzy to me as well, because the gate is never exactly equal to zero; it pushes the contribution to very small values (I'm looking at the slides right now), but the result is that you are essentially skipping over those steps. Okay?

But then why do we only care about the forget gate? What about the other gates, the output and input gates?

We do care about those, but the forget gate is the one that is connected to c_{t-1}, the previous cell state. We only care about forgetting connections to previous states; we retain the connection with the input, right? But we want to be able to forget the connection between what I'm predicting now and the states I predicted earlier.

Sorry, but you say the forget gate is the only one that connects to the previous cell state? Aren't they all connected?

So c_{t-1} connects to the forget gate through a product: if the forget gate sets it to zero, it's zero from that point on, and from there it flows into c_t. It does then get combined with the input gate, which mixes in a representation of the input, the processed, encoded input state. Let me share this slide; are you seeing it? Yeah, perfect. So the only way c_{t-1} gets into c_t is through the forget gate: the only path from c_{t-1} to c_t goes through that product. The input, on the other hand, has other paths into c_t. So the input does not get forgotten, but the previous state can be forgotten. Yes. And I wouldn't have said in the slides that it solves the problem; I did say, it's written here, that it is not a solution to the vanishing gradient problem, it just alleviates it: it extends the memory past four or five steps.

Yeah, I understand that the LSTM also has a vanishing gradient, just less of one.

Right, it's a less pressing problem: we can extend the memory to tens of inputs, whereas with the vanilla RNN you're stuck with something like four or five cells. You can actually see this in the visualizations from earlier, the ones with the neurons: the vanilla RNN is basically only remembering one cell, while the LSTM remembers several pieces of the sentence. But it still cannot connect the entire story, which is why it can't generate a whole Wikipedia-like entry the way GPT-3 can. Good.

Yeah, and in that figure the color scale was confusing to me; I don't know what blue means and what red means.

So the color is mapped to the tanh, minus one to plus one, the output of the activation function, which is a tanh in the LSTM. But it doesn't actually matter much; what you should take from the figure is that there is a lot of coherence in what is red and what is blue, whereas in the vanilla RNN, essentially every one, two, three characters the color can swap. (In the other visualization it's not a tanh, so the range from blue to red is different, but that's less important.) The symptom of very short-term memory is that your neurons keep alternating, because they only remember a step or two.
Whereas here you can read a full word, sometimes a large piece of a sentence, and it is still colored as being connected in the same way.

Yeah, I see. That answers my question, thank you.

Of course. Thank you. Okay.
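As a postscript to that exchange about the forget gate, here is the standard LSTM cell-state update written out as a toy numpy sketch, just to make the "only path from c_{t-1} to c_t" point concrete. The weights are random placeholders, not a trained model; notice that the previous cell state enters the new cell state only through the elementwise product with the forget gate, while the input enters through the input gate.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_state(c_prev, h_prev, x, W, b):
    """One step of the LSTM cell-state update:
        c_t = f_t * c_{t-1} + i_t * c_tilde_t
    The only route from c_{t-1} into c_t is the product with the forget gate f_t."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i = sigmoid(W["i"] @ z + b["i"])         # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate (encoded input)
    c = f * c_prev + i * c_tilde             # f near 0 means the past is "forgotten"
    return c

d = 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d, 2 * d)) for k in "fic"}
b = {k: np.zeros(d) for k in "fic"}
c_next = lstm_cell_state(np.ones(d), np.zeros(d), rng.normal(size=d), W, b)
print(c_next)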