Natural language processing, robotics, and all that entails. For those who are interested in the academic history: he started at the University of Toronto, where he worked with Geoffrey Hinton, whom some of you may know as one of the pioneers of neural networks over a very long time. He held a postdoc position at MIT, worked back at the University of Toronto, and then joined Carnegie Mellon University (CMU), where he is now the UPMC Professor of Computer Science in the Machine Learning Department — which I think is itself noteworthy, that they created an entire department for machine learning. So I'm really excited to hear more. And for those who are interested, there will be a separate Zoom event afterwards where we can carry on the discussion. So without further ado, I'll let Professor Salakhutdinov begin his talk.

Thank you very much, and thank you very much for the invitation. And in case something happens to the online session and I disappear, I'll try to reconnect; I've just been having issues with the connection today.

Okay. Let me start by saying that over the last decade we've been seeing a lot of impact from machine learning, and especially from the subfield of machine learning called deep learning. We've seen impact in the areas of speech recognition, vision, recommendation engines, language understanding, and even in places like drug discovery and medical image analysis. And if I look at what it is we're trying to do in ML and AI, we want to be able to build algorithms: algorithms that can see and recognize objects around us, that can understand human speech, that can do some basic form of reasoning and understand natural language, that can move around autonomously, and that can explore and achieve high-level goals for us.

Now, if I look at some of the challenges we're facing today — there's obviously a lot of work being done in ML and AI — I typically view them as four complementary areas. One is language understanding and reasoning, and I'll talk a little bit about this in the first part of the talk. Another is the space of what's called embodied AI: building agents using something called deep reinforcement learning and control, and the second half of the talk is going to focus mostly on that topic. There is also a lot of work on how we incorporate domain knowledge, and throughout the talk I'll show you how we can put some prior knowledge or domain knowledge into these complex systems. And there is a lot of work in the space of multi-modal learning, where you look at different modalities like audio and acoustics, and in the space of what's called semi-supervised learning, self-supervised learning, or unsupervised learning.

Okay. So the first part of the talk is going to focus on language understanding and reasoning and how we incorporate structured domain knowledge. A lot of this work was done by my former student, who is now a faculty member at Duke. Let's look at what existing systems are doing. When we think about deep neural networks, we typically think about inputs mapping to outputs through some neural network. For example, the input could be an image and the output could be a class label — what is in the image?
Or take neural machine translation: I give you a sentence in one language, pass it through a deep neural network, and the output is a sentence in another language — say, a French translation. Neural machine translation has been very popular in the last few years; it's actually now pretty much the state of the art. But then if I ask you a question of the form "which coronaviruses do not infect humans?" — there are many different viruses — how do you answer that question? Obviously, you need some domain knowledge to answer it.

The data we use for training these models can come in the form of unstructured data — for example, Wikipedia articles and news articles — or in more structured form, like knowledge bases or tables or other structured data. Some of the key challenges in incorporating domain knowledge into your model are that the data is heterogeneous (it can come in many different forms) and that you need to do some basic form of reasoning. For example, you need to understand how certain parts of the text connect to ontologies or knowledge bases, and how to do inference in order to answer a question. You also have to work with what's called weak supervision: when you train these models, I give you the input — say, the question — and I give you the answer, what the answer should be. But beyond that, it's very hard to tell the model exactly what it should be doing. It's very hard for us to say: you should look at this Wikipedia article, you should look at this sentence, then you should look at this knowledge base, pick up that particular entity, put them together, and then you'll get the answer. Generally, the training data comes in the form of the input and the right answer, and it's very expensive to label the in-between steps.

One of the big areas of research is building knowledge bases. For example, you have a notion of entities — think of them as important concepts. Say you're building a knowledge base about the medical domain: you can extract certain concepts, which in this case would be nouns or phrases, and then you can extract facts about them. For example, you can say diabetes is an instance of a disease, or diabetes can be treated by this specific drug. Knowledge bases have been around for quite some time. If you look at the Google search engine, a lot of it is based on these large-scale knowledge bases. If you ask questions of Google or Siri or Alexa, the state of the art right now is, again, manually constructed knowledge bases. So if I ask what kind of drug treats diabetes, and you can map this question into the knowledge base, then you can figure out what the answer should be. A lot of existing large-scale systems are really based on these large-scale knowledge bases, and typically they are constructed by hand: Google and Apple and Amazon employ thousands of labelers who go in and construct them. So one of the things you can ask is: can we use the knowledge base as a knowledge source?
For example, given a question, we may have, let's say, 50 million passages and 50 million entities in the knowledge base, and we come up with what's called a question subgraph — maybe narrowing it down to 50 passages and 500 entities. Then we can do what's called graph comprehension, using a graph convolutional neural network to analyze this smaller subgraph and provide the answer. One of the key aspects of this recent trend is to look at both structured and unstructured data. Unstructured data is really just free-form data — stuff you get from Wikipedia articles, from news — and structured data is more or less a predefined knowledge base. So how do we combine the two? Say I ask you the question, "Who voices the dog in Family Guy?" In your knowledge base there is the concept Family Guy, but then from a Wikipedia article you extract the sentence saying that the show features a dog called Brian. That gives you a link to another entity in your knowledge base, and then you can find the answer. The reason you want to merge the two together is that these knowledge bases have been shown to be very expensive to construct by hand, and they are also typically incomplete. Before 2019, we didn't have the concept of this coronavirus, right? So when a new concept appears, you want to be able to quickly adapt and answer questions about it as it shows up.

All right. The way we can construct the answers is: say I have a question, and I get a representation of the question using something called a recurrent neural network — that's the representation of the question. To get a kind of hybrid representation of the unstructured data and the structured knowledge base, we use a graph neural network, where information propagates among the entities. Once we have these representations, we can view them as vectors in a real vector space, take the dot product between them, and get the answer.

Let me spend a little bit of time on what these graph neural networks are doing. Say you have a graph and you have a natural language question q. You want a function f such that for every node in your knowledge graph you get a 0/1 output, and the output is one if and only if that particular node is an answer to the query. We can construct, for example, a probabilistic model using a softmax: if we have some representation of the question from a recurrent neural network such as an LSTM, or from a transformer network, and we have h_v, the representation of node v that we get from the graph convolution, then we can compute the probability that node v correctly answers the question.
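To make that scoring step concrete, here is a minimal sketch — not the actual model from the talk — of combining a question encoding with graph-node encodings via dot products and a softmax. The dimensions, the random stand-in encoders, and the variable names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative dimensions (assumptions, not the actual model sizes).
d = 128
rng = np.random.default_rng(0)

# q: question embedding (a random stand-in for an LSTM/transformer encoder output).
q = rng.normal(size=d)

# H: one row per candidate node (a random stand-in for graph-neural-network outputs).
num_nodes = 5
H = rng.normal(size=(num_nodes, d))

# Score each node by a dot product with the question, then softmax
# to get P(node v answers the question | q, graph).
scores = H @ q
probs = softmax(scores)
answer = int(np.argmax(probs))
print(probs, answer)
```

In the real system, q would come from an LSTM or transformer over the question and each row of H from the graph convolution, but the scoring rule is the same dot product followed by a softmax.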
Now, graph convolutional neural networks are an area that has been exploding in the last few years — this notion that we can run neural networks on graphs — and the idea is actually pretty simple. If you have a node, say the red node, you look at the structure of your graph, at all the nodes that have incoming links to that node. You take a linear combination of their representations, add the node's own representation from time t minus 1, and pass the result through a non-linear function like a sigmoid, or any other non-linearity. And you repeat this. So it's a fairly simple algorithm: you collect information from your neighbors, pass it through a non-linear function, and propagate this information onward to the next nodes in the graph. One thing you can do is, if you have graphs with different edge types — say they represent relations — you can have relation-specific parameters W_r. In knowledge bases, edges typically represent relations: for example, "Obama was born in Hawaii," where Obama and Hawaii would be two entities in your knowledge graph and born-in would be the relation. These W_r are tunable parameters: what weights do you put on these edges?

At a high level, you can do two things. One is to look at the unstructured data, get a representation from it using recurrent neural networks, and propagate that information into the knowledge graph — effectively saying: by looking at all of these Wikipedia articles, I can enhance my representation of the knowledge graph. On the other hand, once you have a representation from the knowledge graph, you can propagate it back to get a better representation of the unstructured data. So if I'm embedding a sentence with a recurrent neural network, I can also augment that representation with information coming from the knowledge graph. The two work together, and now you have a hybrid system that says: if I can find the answer in the knowledge graph, I will do that; but if I don't have the answer in the knowledge graph, I'm going to use this unstructured data, like Wikipedia articles, and fuse the information together to provide the answer. We now have a much richer representation, and obviously that allows you to answer a much more complex set of questions.

This is what we see when we look at the performance. For example, there's the WebQuestions dataset and the WikiMovies dataset — questions about different movies and the stars in those movies — and WebQuestions contains a lot of interesting questions that people were actually asking on the web. What you're seeing here is KB completeness, which means how complete your knowledge base is, versus performance — think of these numbers as accuracy, how good your model is. You can see that just using text data gives you one level of performance, and just using the knowledge base gives you another. But if you use graph neural networks and fuse the two together, you get much better performance, because you're combining two sources of data and using the domain knowledge that's coming to you from the knowledge base. So that's nice.
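Before moving on, here is a minimal sketch of the message-passing update described above: each node aggregates its in-neighbors through relation-specific weight matrices W_r, adds its own previous representation through a self-loop weight, and applies a nonlinearity. The toy graph, relation names, and dimensions are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 16
rng = np.random.default_rng(1)

# Edges grouped by relation type r: in_edges[r] = list of (src, dst) pairs.
in_edges = {"born_in": [(0, 1)], "treated_by": [(2, 3), (4, 3)]}
num_nodes = 5

# Relation-specific weights W_r plus a self-loop weight (the tunable parameters).
W = {r: rng.normal(scale=0.1, size=(d, d)) for r in in_edges}
W_self = rng.normal(scale=0.1, size=(d, d))

def gcn_layer(H_prev):
    """One propagation step: h_v(t) = sigma(W_self h_v(t-1) + sum_r sum_{u->v} W_r h_u(t-1))."""
    msg = np.zeros_like(H_prev)
    for r, edges in in_edges.items():
        for u, v in edges:
            msg[v] += W[r] @ H_prev[u]       # message from in-neighbor u along relation r
    return sigmoid(H_prev @ W_self.T + msg)  # combine with own previous state, nonlinearity

H = rng.normal(size=(num_nodes, d))
for _ in range(2):                           # repeat propagation for multiple hops
    H = gcn_layer(H)
```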
But when we look at this sort of system, we also run into a problem, known as multi-hop questions. The idea is the following. Say I ask you: "Where is the company which manufactures this particular drug headquartered?" This is a multi-hop question because to answer it you have to figure out two things. First, which company manufactures this specific drug; once you know the name of the company, you can then ask where that company is headquartered. Humans are actually pretty good at this, and a lot of existing systems are not. If you asked this question of Siri or Google, most of the time the current systems wouldn't be able to answer it, because it requires a little bit of reasoning. To answer the first part, you effectively have to look at, say, Wikipedia and figure out which pharmaceutical company it is. Once you know the company, you can find another piece of evidence that says where it is located, and that gives you the answer.

Obviously, the challenge is that when you ask the question, nobody tells you, "instead of asking this one question, I should really be asking two." The intermediate representation here is unknown — it's what's sometimes called a latent or unobserved variable that you're trying to infer from the data. What's also interesting is that you need to answer the first question before you can even start retrieving the second passage, the second piece of evidence. So how do we do that? There has been prior work, but can we do it in an end-to-end fashion, meaning we can build a system where we optimize all the pieces jointly? Can we do it efficiently, because obviously we want to answer these questions quickly? And can we do it compositionally — can we go from one-hop questions to two-hop and three-hop questions, and as we build more and more complex questions, can we handle the composition?

Now, before I show you the overall system and what we've been doing, let me show you an example. This is a system built jointly with Google. Say I ask the question, "Which company founded by Steve Jobs was based in Redwood City?" What's interesting is that this particular query is analyzing 5 million Wikipedia articles, and it runs on a single desktop with a single GPU and is able to find the answers. It retrieves the first passage, and the part highlighted in yellow is what the model thinks is the most important piece of information: Jobs was the chairman and CEO of Apple, and so on, and he was also a founder and chairman of NeXT, another company. Then the second piece of evidence the model retrieves says that NeXT is the company that was based in Redwood City, California. So if you put the two pieces together, you can find the answer. And the passages are ranked according to how important the model thinks each piece of information is.
So in this particular case, it was able to answer the question correctly. How do we do this? We need to use some prior knowledge, some domain knowledge, and one of the things we've built here is something we call relational following. The idea is: given a set of entities — think of them as a starting set — you follow a relation R to arrive at the next set of entities, and then you can follow another relation from there. In the overall system, say X is the name of the drug and R1 is the manufactured-by relation. You can mentally think about it as: if I look at the question, I can figure out what X is and what the first relation is. So then I can ask: how do I build a system that, given X, follows relation R1 and arrives at a set of entities — a set of possible answers to "who manufactures this drug?" Maybe there are a few of them. And then the second step is: where is the headquarters located? It builds up in this hierarchical, compositional way.

Now let me show you how we can do that. Imagine we have a text corpus, and we have what we call mentions and what we call entities. Entities are nouns or key phrases, and mentions in our case correspond to the sentences in which those entities appear. Just to give you the scale, we've been looking at about 100 million sentences or more, and about 50 million entities. Taking Wikipedia, all the nouns represent your entities and all the sentences represent your mentions, and you have this bipartite connection, because an entity can appear in multiple mentions, and so you can make those connections.

The idea of this follow-the-relation piece is: X is, say, Family Guy, and R is the "dog in the show" relation. We do it in multiple steps. The first step is to expand X to its co-occurring mentions: we have Family Guy, and we look at all the sentences that mention this particular noun — or noun phrase, I guess, in this case. Imagine that here you get on the order of ten thousand, fifty thousand, maybe a hundred thousand mentions. Then we filter the mentions based on the relation, and this is where the learning comes in: we figure out which sentences are important and which are not, and this is done in the form of inner products. And then we combine the scores for the same entity, and that gives us a set of possible answers.

One important thing — the key idea of what we're doing here — is that everything can be done efficiently using inner products. Everything is done in a vector space: we convert all the sentences and so on into vector-space representations, and we do the operations and the learning in that vector space. So let me show you the three steps. First step: I have a one-hot encoding of X — a long vector which says Family Guy occurs here and nowhere else. Then I pre-compute a sparse binary matrix — think of it as 100 million by 50 million — which just tells us in which mentions a particular entity occurs. This is a mapping matrix that can be pre-computed offline, and multiplying by it gives me the representation of all the mentions that mention this specific concept. There's no learning here; it's just a mapping.
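Here is a small sketch of that expansion step, assuming a sparse mention-by-entity matrix A with A[m, e] = 1 when entity e occurs in mention m; multiplying the one-hot entity vector by A picks out the co-occurring mentions. The toy corpus and sizes are invented for illustration — the real matrix is on the order of 100 million by 50 million.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy corpus (the real system uses ~100M mentions and ~50M entities).
entities = ["Family Guy", "Brian", "Seth MacFarlane"]
mentions = ["Family Guy features a dog called Brian",
            "Brian is voiced by Seth MacFarlane",
            "Family Guy premiered in 1999"]

# A[m, e] = 1 iff entity e occurs in mention m; precomputed offline, no learning.
rows, cols = [], []
for m, sent in enumerate(mentions):
    for e, ent in enumerate(entities):
        if ent in sent:
            rows.append(m); cols.append(e)
A = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(len(mentions), len(entities)))

# One-hot vector for the starting entity X = "Family Guy".
x = np.zeros(len(entities)); x[0] = 1.0

# Expansion: nonzero entries mark the mentions that co-occur with X.
mention_mask = A.dot(x)
print([mentions[m] for m in np.flatnonzero(mention_mask)])
```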
Now, the second step is to filter the mentions based on the relation. Say we have R, the relation; we get a representation of it using a transformer architecture. We also get a representation of the mention — the sentence — using a transformer network. Transformer networks are a newer class of deep neural networks; I'm not going to go into the details, but they are now what's most commonly used in language processing. So given a particular sentence, you represent it as a vector — in our case, a 512-dimensional vector — and then you take the dot product between the two. You're basically asking: how well does this particular relation go with this particular sentence? These are pre-trained language models, and of course there are a few pieces in this pipeline with parameters: parameters that represent the question, and parameters that go into this filtering function. This is something we can train using backpropagation, so we can train it to do the filtering correctly.

What we do is build an offline index. This is interesting because it basically says: take every sentence in Wikipedia and convert it to 512 numbers, and now you have this big offline index where every sentence is represented as a vector. Then, once we have the representation of the relation, or the query, we take the dot products and pick the top-k entries, which can be done fairly efficiently. That gives us a representation that says which specific sentences — which specific mentions — are important for this specific relation. We take those representations, pass them through the filtering step, and now we have the mentions that satisfy the relation. Once we have that, we can bring it back: the B matrix is just the transpose of the A matrix; it goes from the mentions back to the entities, and we can get the answer.

So that's the high-level picture of the system. It has this nice compositional structure: we expand, we filter — and that's where most of the learning happens — and then we combine the scores for the same entity. What's interesting is that it's efficient; it's closed under composition, so we can apply multiple relations in sequence; and it's differentiable, so we can backpropagate gradients through the system, which allows us to learn the parameters of the model from training data. One important piece here is that we do something called soft entity linking, which allows us to figure out which mentions we should be looking at. But what's important is that we have this compositional structure.
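Here is a compact sketch of the full expand-filter-aggregate hop, and of how two hops compose for a multi-hop question. The relation encoder below is a random stand-in for the pre-trained transformer, and the sizes, top-k value, and scoring details are assumptions; the point is only that every step is a (sparse) matrix product or an inner product, so it stays fast and composable.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512                      # embedding size mentioned in the talk
num_mentions, num_entities = 6, 4

# Offline index: every mention (sentence) pre-encoded as a d-dimensional vector.
mention_index = rng.normal(size=(num_mentions, d))

# A maps entities to the mentions they occur in; B = A.T maps mentions back to entities.
A = (rng.random((num_mentions, num_entities)) < 0.4).astype(float)
B = A.T

def encode_relation(relation_text: str) -> np.ndarray:
    # Stand-in for the transformer encoder of the relation / question span.
    seed = abs(hash(relation_text)) % (2**32)
    return np.random.default_rng(seed).normal(size=d)

def follow(entity_scores: np.ndarray, relation_text: str, k: int = 3) -> np.ndarray:
    """One hop of relational following: expand -> filter -> aggregate."""
    expanded = A @ entity_scores                       # mentions co-occurring with X
    relevance = np.maximum(mention_index @ encode_relation(relation_text), 0.0)
    topk = np.argsort(relevance)[-k:]                  # top-k inner products over the index
    mention_scores = np.zeros(num_mentions)
    mention_scores[topk] = relevance[topk] * expanded[topk]
    return B @ mention_scores                          # scores over candidate entities

# Multi-hop question = composition of hops: follow(follow(x, r1), r2).
x = np.zeros(num_entities); x[0] = 1.0                 # one-hot start entity
answers = follow(follow(x, "manufactured by"), "headquartered in")
print(answers)
```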
You have these little deep neural networks with transformer architectures that we apply at every single stage. It's not a black-box model in the sense of passing the question through one neural network that spits out the answer; it's structured hierarchically, because we identify the noun, for example, and then the first relation, the second relation, and so on. And we can train the entire system in an end-to-end fashion. So one thing I would like you to take from this talk is that these systems are becoming compositional. They have little modules — and we'll see this in the second part of the talk as well — modules that you can train and that follow this structure, because we know there is going to be a relation-following step in order to answer the question.

If you look at the results, what's interesting is that the performance goes up, particularly on multi-hop questions, as you ask something a little more complex. What's also interesting is that these models are fast: because we operate on these offline indices — we embed essentially all of Wikipedia into this large index of real-valued numbers — the retrieval and search piece becomes fairly efficient. For simpler questions we can answer on the order of 900 questions in a second, and for harder multi-hop questions on the order of 13 questions in a second. Here's another example, just to show you: "What is the shape of the family of viruses containing coronavirus?" Again, you're analyzing 5 million Wikipedia articles, it takes a fraction of a second on a single desktop machine, and you get the answer. It figures out that coronaviruses belong to a specific family of viruses, the Coronaviridae, and that for this family of viruses the shape is known to be spherical. So the answer the model produces is "spherical."

This brings us to a very active area of research in our community: how do we represent knowledge? For example, how do we represent common-sense knowledge? How do we do multi-hop reasoning? Human knowledge is typically abstract; it's built on high-level concepts. For example, when you look at this image of a dog, you know that dogs have four legs. And one open area of research is how to build systems that can do that. All of our existing systems today — take convolutional neural networks: if I take this dog and add two more legs, so there's a dog with six legs, the model will have no problem saying it's a dog, because it does pattern matching and says "I'm seeing enough dog patterns, so the answer is dog." There's also this interesting example where you train your system on images of cows, and the model is really good at detecting cows. Then you take a cow and put it on a beach, so you see a cow on the beach, and the model looks at that and says, "well, that's a boat." The reason it does this is that at training time, most of the time it sees cows with grass, so the grass effectively becomes almost part of what a cow is.
And when we see a cow on a beach, that's just so unusual — we typically see boats on the beach, or other objects — that it's easy to confuse these models. So how we represent knowledge, and how we efficiently integrate it into deep learning models, remains a big, active area of research.

Let me now switch gears a little bit and show you the combination of language with embodied AI, in the sense of reinforcement learning. This is based on the work of my former student Devendra Chaplot, who is now at Facebook AI Research. When we think about behaviors, we think about mapping observations to actions, and we try to learn to do that in order to achieve a particular goal. So now we want to build agents that actually act in our environment. When we think about physical intelligence, the agent observes the environment and then takes an action — for example, go forward, turn left, turn right. Think about a robot in your house: the agent needs to move around the world physically, and whatever action it takes right now can impact its future observations. If you decide to go to the living room, you'll see certain objects; if you decide to go to the kitchen, you'll see other objects. So whatever action you take will impact what you see in the future, and it requires some form of spatial and semantic understanding.

Let's look at one specific example: navigation. Say you have an agent and it needs to get somewhere. If the agent gets there, it receives a positive reward; if it fails, it receives a negative one. Typically the way these systems are built is that you have an observation, you pass it through a deep neural network, and the network produces actions. If you take a correct action you get a positive reward; if you take an incorrect action you get a negative reward, and these error signals backpropagate through the model. This is called reinforcement learning because you get these reward signals: if you've taken a sequence of actions and succeeded in achieving your goal, you get a positive reward; otherwise you get a negative one.

When we think about goal-conditioned navigation, we can look at the problem of point-goal navigation — I give you the coordinates and you need to get there — or, more recently, image-goal navigation — I show you an image of, say, a TV, and I ask you to build me a system that goes there and finds the TV. There are also object goals and language goals. Language goals are becoming more popular because language is a convenient way for humans to interact with agents and robots; it also has a notion of compositionality, and as we've seen in the first part of the talk, you can converse, maybe through dialogue or through instructions — you can tell the agent what it needs to do.
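Here is a minimal sketch — not the actual navigation system — of the reward-driven training loop just described: a policy maps the agent's state to action probabilities, an episode ends with +1 on reaching the goal and -1 otherwise, and a REINFORCE-style update pushes up the probability of action sequences that were rewarded. The one-dimensional corridor, the tabular policy, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D corridor: the agent starts at position 0 and must reach the goal.
# (A real agent would pick forward / turn-left / turn-right from camera pixels.)
GOAL, HORIZON = 4, 10

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Tabular "policy network": one logit per (position, action), action 0 = left, 1 = right.
logits = np.zeros((GOAL + 1, 2))

def run_episode():
    pos, traj = 0, []
    for _ in range(HORIZON):
        probs = softmax(logits[pos])
        a = rng.choice(2, p=probs)
        traj.append((pos, a))
        pos = max(0, min(GOAL, pos + (1 if a == 1 else -1)))
        if pos == GOAL:
            return traj, +1.0          # success: positive reward
    return traj, -1.0                  # failure: negative reward

# REINFORCE-style update: raise the log-probability of actions on rewarded episodes.
lr = 0.1
for _ in range(500):
    traj, reward = run_episode()
    for pos, a in traj:
        probs = softmax(logits[pos])
        grad = -probs
        grad[a] += 1.0                 # gradient of log pi(a | pos) w.r.t. the logits
        logits[pos] += lr * reward * grad
```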
Now, one of the big areas of research here is exploration: how do you explore the environment to find the goal? That's actually a pretty challenging problem. Let me show you one example of a system we've built at CMU, in something called the Habitat environment. These are reconstructions of real scenes used in simulation — you see a little bit of blurriness. And this is the real robot: we have a little LoCoBot that moves around, and as it moves around you see it building some form of semantic map, figuring out where the objects are, how they're structured, and what the relationships between those objects are, in order to be able to find things in your house. This is the actual physical robot running the actual deep learning models. Here, for example, the goal is a potted plant — the robot has to find a potted plant in your house — and after a certain number of steps, the model figures out how to get there. I'll come back and tell you what these systems are and how we built them.

But first, there is a very high-level open area of research known as semantic priors and common sense. Say I have an agent here, and I ask it to find the stove. The agent can choose between three paths. Which path would you take — path number one, two, or three? Many of you would probably say path number one — you can probably see what's over there — and path number two also seems plausible, because stoves are typically in the kitchen and that path might lead to the kitchen, so I'd probably go there. But you would not take path number three, and the reason is that kitchens are typically located on the first floor, and that's where you have the best chance of finding the stove. It's like having a friend in your house who says, "Can I grab a glass of water?" and you say yes, and your friend goes to the second floor and starts looking for the glass of water — you'd say, "What are you doing?" So we, as humans, make use of these semantic priors and common sense when we explore and navigate, and most navigation algorithms struggle to do so, because for these agents all three choices are equally probable — they don't know where things are. So again, it's a big area of research: how do we make use of these semantic priors? For example, we've been looking at this notion of building topological maps, where you build a notion of: here is the master bedroom, there's a kitchen, a hallway, offices — and you try to navigate and figure out the association of objects with these rooms, and where objects tend to occur relative to one another.

Now, when we think about internet data versus what's called embodied data: internet data is where we build our state-of-the-art object detection systems — you see those images at the top. If you're a computer vision researcher and you grab any dataset, these are the examples you will have, because people take pictures, annotate them, and upload them. If you look at embodied data — the data the robot collects — it looks like this: you can see persistence through time. But sometimes you also see images like this, where there's a big door here and a little couch over there. Generally you would never take an image like that with your camera; maybe you'd take an image like this, but rarely one like that.
So there are these kinds of interesting views that the robot sees that we don't have data for, because it's just unnatural for us to collect data like that on the internet. And if we build models based on internet data, which is what a lot of us are doing, this leads to false positives and false negatives. For example, here the goal is to find a chair, and the model produces a false positive — it thinks something is a chair when it isn't. Or another example: you're trying to find a toilet, and you're standing right next to it, but the model doesn't see it, because it's such an odd view that your model just doesn't recognize it from that specific viewpoint.

So when you think about embodied data, you can say: I can take state-of-the-art object detection models, and maybe I can detect these sofas. And as the agent moves around, there is a notion of consistency through time, which comes from the fact that objects don't appear and disappear randomly; as you move around, there's a persistence that you observe. The question is: can we use this information to improve our models? There's this notion of an action-perception loop: we use what's called self-supervised active exploration — the agent moves around the world, explores, and figures out where things are — and then there's visual learning, self-supervised learning, which means that based on what you've seen and what you're observing, you improve your perception model, your object detection model. What's important about this is that you're not learning from static data; the data is dynamic, and the agent collects it itself in a self-supervised way, meaning there's no human involvement. It's not like a human comes in and says, "you need to explore this part, and this is called a sofa, and this is called a chair." You just let the agent move around, and by moving around and exploring, it improves its own perception. There's a lot of work now being done in that space, because ideally, in the future, you bring your robot to your office, you let it run overnight on its own, it just keeps exploring, and by morning it has figured everything out: where the objects are, how they're structured, what they're called — and it has improved its own perception model. So there's this theme of being able to actively explore and actively learn by interacting with the world.

Now let's look at what we do here. This is a framework we call SEAL — Self-supervised Embodied Active Learning. It has two phases. In the first phase, we have observations, and we use a state-of-the-art perception model trained on internet data to try to recognize what we're observing as we move around — that's the initial model. We build what we call a 3D semantic map: a map of the environment. Then we use reinforcement learning, with a specific reward structure that allows us to explore objects efficiently, and given our policy, we take actions.
Think of this as the exploration phase, where the agent moves around and tries to explore the environment as much as possible. Once you've explored the environment, you have these trajectories the agent moved along and recorded, and with the help of the 3D semantic map we do something called label propagation, which basically says: if you keep seeing the couch as you move around it, it's going to be the same couch. That gives us additional labels — consistency labels through time — and then we can update and fine-tune our perception model using this 3D label propagation. Once we have a better model, we repeat phase one: we explore again, and so forth. Let me show you what these two pieces are doing. Both phases require no additional labeled data, so there's no human involved in the process.

So how do we build semantic maps? Here's an example of the agent moving around. It uses a state-of-the-art perception model — something called Mask R-CNN — that tries to detect objects, and as it detects them it builds the semantic map. The semantic map essentially tells us which objects are where. This is what the 3D map looks like — think of it as a voxel-like representation — and we try to identify that this is a couch, this is a chair, this is a bed, and so forth.

Now, how do we learn a policy that explores this environment? Given the semantic map, we can sum across length, breadth, and height and count the voxels with confidence of, say, 90% or above — the voxels we're confident about. This count becomes the reward function for our policy training, which we call gainful curiosity. What it's doing is basically saying: learn a policy such that you gather as much confident information about objects as possible. For example, if you look at a particular object and you can't really tell what it is from where you're standing, you should move around the object so that you can confidently predict what it is — you know this is a couch. The model is discouraged from just moving around in one area and not exploring the others: you want to explore as much as possible, but at the same time you also want to find objects that your perception model confidently detects.
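Here is a small sketch of that exploration reward, assuming the 3D semantic map is stored as per-category confidence scores over voxels; the reward counts the voxels whose confidence clears the threshold (90% in the talk). Turning that count into a per-step reward via its increase after each action is an assumption on my part, kept here only to show the mechanics.

```python
import numpy as np

CONF_THRESHOLD = 0.9   # "90% or above" from the talk

def gainful_curiosity_reward(sem_map_prev, sem_map_curr, threshold=CONF_THRESHOLD):
    """Reward = increase in the number of confidently labeled object voxels.

    sem_map_*: arrays of shape (num_categories, L, B, H) holding per-voxel
    confidence scores for each object category (an assumed representation).
    """
    confident_prev = (sem_map_prev > threshold).sum()
    confident_curr = (sem_map_curr > threshold).sum()
    return float(confident_curr - confident_prev)

# Toy usage: exploring reveals more confident voxels, yielding a positive reward.
rng = np.random.default_rng(4)
before = rng.random((15, 8, 8, 4)) * 0.5          # nothing confident yet
after = before.copy()
after[3, :2, :2, :2] = 0.95                       # a couch becomes confidently mapped
print(gainful_curiosity_reward(before, after))    # 8.0
```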
Once you've done that, you go to the second phase: the 3D label propagation phase for perception. The idea is the following. Say your agent is located here, and that's the agent's view. Because you have a semantic map, you can project the objects in the map back into the pixel space of that view. And now you can say: these objects have to be there — it's not that for some views they're there and for other views they're not — so you can label them. But if you run the state-of-the-art perception model on that frame, it doesn't recognize this couch and it doesn't recognize this bookshelf; it recognizes these plants, but it doesn't recognize this chair. So by looking at the scene from different viewpoints, you're essentially creating labels for your perception model automatically. As the agent moves around, by using this consistency objective — and here, for example, is a false positive, meaning your best internet-trained perception model says this is a chair when it's actually a table — you can say: I'm going to train my perception model, update it using this consistency, and keep improving it.
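Here is a very simplified sketch of the label-propagation idea, using a toy top-down map instead of a real 3D reconstruction and a label counter instead of actually fine-tuning Mask R-CNN: the aggregated map is "projected" into each recorded view, and whatever confident objects fall inside the view become pseudo-labels for that frame. The grid, the views, and the object ids are invented placeholders.

```python
import numpy as np

# Toy top-down semantic map: each cell holds an object id (0 = free space).
COUCH, CHAIR = 1, 2
semantic_map = np.zeros((10, 10), dtype=int)
semantic_map[2:4, 2:5] = COUCH
semantic_map[7, 7] = CHAIR

def pseudo_labels_for_view(sem_map, top_left, size=4):
    """'Project' the map into a square camera view and return the objects inside it."""
    r, c = top_left
    window = sem_map[r:r + size, c:c + size]
    return set(window[window > 0].tolist())

# Frames recorded along the exploration trajectory (poses are view corners here).
trajectory = [(0, 0), (1, 2), (6, 5)]

# "Fine-tuning" stand-in: count how often each object id receives a propagated label,
# i.e. the extra supervision the detector would be trained on.
propagated = {}
for pose in trajectory:
    for obj in pseudo_labels_for_view(semantic_map, pose):
        propagated[obj] = propagated.get(obj, 0) + 1
print(propagated)   # {1: 2, 2: 1}: the couch is labeled in two views, the chair in one
```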
These are the results, just to highlight quickly. There are different tasks — object detection and instance-level segmentation — which are different ways of localizing these objects. What I want to show you is that if you take a state-of-the-art model and apply it directly to this data, you get about 34% accuracy. There has been a lot of work trying to improve that number, with some improvement but not much. Whereas if you use this consistency — the 3D label propagation and the building of the semantic map — you can improve the numbers quite substantially. We're still far from actually solving the task, but the numbers go up quite a bit without requiring any additional labeled supervision. This is very encouraging, because it means we can adapt to a specific environment, and it allows us to build more precise models. Again, it's action and perception, and you can keep looping between them.

And then you can also do what I showed you at the very beginning — object-goal navigation — because you now have a better perception model. This is what it looks like: as you move around the environment, this is the semantic map you're building, where different colors are different objects — in this case we have 15 different object categories — and you can find relationships between those objects. And this is what the system actually looks like: it moves around, builds the semantic map, and finds, in this case, the potted plant. This is the little LoCoBot we've been using. And there is a substantial improvement in success rate — success rate meaning whether you succeed in finding the correct object or not — from this semantic exploration plus the idea of self-supervised improvement of your own perception model.

Finally, there is a lot of work happening in the space of sim-to-real: how do you train models in simulation and then transfer them to the real world? This is an active area of research, because in simulation you can train across different environments, you can parallelize, you can do a lot of things that are hard to do in the real world. But then taking the system and transferring it to the real world is itself an active area of research. I've shown you an example: we train in the Habitat simulator, which is not perfect, but then we successfully take the models and adapt them to the real world, where you have a LoCoBot that actually moves around your apartment. There is something called the physical domain gap — we have actuation noise and sensor noise — and there's a visual domain gap: images in simulation look different from the images you see in the real world. There's a lot of work trying to bridge the gap between simulation and the real world.

Ultimately, on a last note, there's this notion of: can we build agents — and I believe that hopefully in the not-too-distant future we will — that move around, that explore, that can do this self-supervised learning of the environment, and that can also understand us, understand human speech, and do reasoning and natural language understanding? I'm not showing it here, but there are also tasks where people are trying to execute instructions: I give my agent an instruction about what it needs to do, and it goes and tries to execute it. For example, I can say, "go to the kitchen, get me the coffee, and bring it back" — how do you do that? So on that note, let me finish, and let me also thank all of my students. There are a lot of different moving pieces, and I only covered a very small part of what the lab is actually doing, but none of it would be possible without the work of the graduate students in my lab. So thank you, everybody. We're happy to take questions.

[Host:] That was a great tour of the state of the art, and it gave us a lot of thought-provoking ideas. I think some students and faculty here have a lot of overlap with some of this; I hope some of them will ask questions. Please introduce yourself. If you have any questions, feel free to ask.

[Audience member:] This is not really my area, but I was just curious — and perhaps you went over this and I apologize if I missed it — what kind of compute resources are you using to do this? How much computation does it take?

[Salakhutdinov:] That's a very good question; I sort of brushed over it. For example, in the last setting I showed you, we have a dataset of, I think, about 3,000 houses, and I believe that to train the system it takes on the order of one week with about five to ten GPUs. We want to be able to scale it; right now Devendra, who is at Facebook, is scaling these results massively to many more scenes and many more objects, and that goes to the scale of something like 100 GPUs running for a week of training. It's one of those things where, when we look at these numbers, obviously we want them to be in the high 80s or 90s, which basically means you almost perfectly recognize everything, and a lot of it has to do with how much compute you have. At CMU we have a cluster of GPUs — on the order of 20 machines with four GPUs each, so something like 100 GPUs — and by current standards that's small for scaling up these systems.

[Host:] Thank you. Somebody just knocked at my door, so I have to step away for a moment. Kate, please go ahead and state your question.
[Host:] It shows you're unmuted, but we can't hear you — it might be an audio problem. Oh, there's a question in the chat: if the perception model picks an incorrect label, how many attempts does it need to arrive at the correct one?

[Salakhutdinov:] Yes, very good question; I can answer it quickly. The question is: if the perception model makes an incorrect prediction, how many attempts does it take to actually correct it? There is one caveat I glossed over, which is the following. We rely on the initial perception model to get an estimate of what the objects are. The idea is that we move around the environment, and for every single frame we make a prediction using the state-of-the-art perception model, say Mask R-CNN. These labels can differ as we move around: from this view the model says it's a couch, from that view it says it's a chair, from another view it says it's a couch again. So what we do is aggregate this information across multiple steps, threshold the confidence, and take the most confident prediction. Once we have the most confident prediction, that goes into the semantic map — we call it a chair or we call it a couch — and then that label propagates to all the views.

If the model makes an incorrect prediction — say it looks at a chair and, from several different viewpoints, confidently says it's a couch — then we place that object in the semantic map and call it a couch, even though it's a chair, and we propagate that information. So there is obviously a bias, in the sense that if our initial model is confidently wrong, we're going to propagate that wrong label further. We rely on the fact that the initial model can make some confidently correct predictions, at least for some views; if it's not confident, we discard it. The inspiration here is that if you walk a kid around the environment and point at something and say "that's a couch," then when the kid looks at it from a different viewpoint, it's still a couch — even if it's really a chair. We suffer from the same thing, because we need the initial model to tell us what the thing is, and then we propagate that to all the different views, including views with occlusions and so forth.
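One plausible reading of the aggregation rule just described, as a sketch rather than the actual implementation: per-view predictions for the same physical object are pooled, and only the most confident label, if it clears a threshold, is written into the semantic map and then propagated. The threshold value and the averaging rule are assumptions.

```python
from collections import defaultdict

CONF_THRESHOLD = 0.9

def aggregate_views(per_view_predictions, threshold=CONF_THRESHOLD):
    """per_view_predictions: list of (label, confidence) for one physical object,
    one entry per viewpoint. Returns the label written into the semantic map,
    or None if nothing is confident enough."""
    totals = defaultdict(list)
    for label, conf in per_view_predictions:
        totals[label].append(conf)
    # Average confidence per label across the views that proposed it.
    scored = {label: sum(c) / len(c) for label, c in totals.items()}
    best_label, best_conf = max(scored.items(), key=lambda kv: kv[1])
    return best_label if best_conf >= threshold else None

# The object is really a chair, but the detector confidently calls it a couch
# from most views, so the (wrong) couch label is what gets propagated.
views = [("couch", 0.95), ("chair", 0.40), ("couch", 0.92)]
print(aggregate_views(views))   # 'couch'
```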
[Audience member:] One more question, if I may: is that an explicit map, or is there a neural representation in its place?

[Salakhutdinov:] That's a good question. There have been a few approaches to building neural maps — an implicit representation that typically sits inside recurrent neural networks — and there have been approaches to building explicit maps. In my lab I have students working on both. In this talk we're using explicit maps: we're actually building the explicit 3D map because we can do planning with it. There's a planning component that says: if I'm at a particular point and I need to go somewhere, and I have this partial map, I can actually make a plan to get there. And it seems, from the last couple of years, that with these explicit maps you have an advantage in the sense that you can explicitly plan. But the research is happening on both fronts: there's something called a neural map, where you build the map as part of the state of your recurrent neural network, and those models have advantages as well. For the tasks we're looking at right now, though, the explicit maps work better for us.

[Host:] Thank you. Let's take the next question.

[Audience member:] Thanks — I'm not in this area, so thank you very much for an extremely interesting talk; it was just right for my level of understanding. Here's my question. I was intrigued by your term "common sense knowledge." Intuitively, that makes a lot of sense. However, I fail to see how common sense knowledge would be different from other types of knowledge. To me, in order to build that knowledge base, or derive any type of knowledge, you need a lot of — maybe you could call it contextual knowledge. How would you distinguish what you call common sense knowledge from other types of facts and knowledge that you would need to consider?

[Salakhutdinov:] That's a fair point. When we talk about common sense, there are roughly two schools of thought. One says: can we get this knowledge and somehow write it down in some form — and then the question is what form, a logical form or something else — and treat it as a set of constraints that we can enforce our model to follow. Common sense knowledge, I agree, is not very well defined, and that's why it's an open area of research: how do we capture it? What we've been trying to do is capture it from data. For example, you can figure out things like "stoves are typically located on the first floor" by looking at 3,000 apartments and noticing that certain objects occur in certain configurations relative to other objects. The question is how you represent that: you can have a neural network that represents it in the parameters of the model, or maybe you write down some form of constraints explicitly. It's not clear at this point what the best way to proceed is.

There are also other types of knowledge I didn't talk about. There's some work on logical rules and how you incorporate them into your model. One quick example that's been quite successful is in sentiment analysis. If I look at a sentence and try to predict its sentiment, and the sentence has the form "something, but something else," then typically you ignore the first part, take the sentiment of the second part, and make a prediction. That's one logical rule: for every sentence of the form "A but B," just look at B and predict the sentiment of B. These ideas have been deployed somewhat successfully within deep learning. For example, with a sentence like "this is a terrific graduate student, but the graduate student could have done better," even though you have words like "terrific" and "fantastic," the sentiment of the whole sentence is actually negative.
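As a tiny illustration of that "A but B" rule, here is a sketch that layers it on top of a stand-in sentence scorer (a real system would use a trained sentiment model): for sentences containing "but," only the clause after "but" is scored. The word lists are purely illustrative.

```python
POSITIVE = {"terrific", "fantastic", "great"}
NEGATIVE = {"terrible", "awful", "poor"}

def clause_sentiment(clause: str) -> int:
    # Stand-in scorer; a trained sentiment model would go here.
    words = [w.strip(".,!?") for w in clause.lower().split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def sentiment_with_but_rule(sentence: str) -> int:
    # Logic rule: for "A but B", ignore A and predict the sentiment of B only.
    lowered = sentence.lower()
    if " but " in lowered:
        _, clause_b = lowered.split(" but ", 1)
        return clause_sentiment(clause_b)
    return clause_sentiment(sentence)

sentence = "This is a terrific graduate student, but the student could have done better."
print(clause_sentiment(sentence))          # +1: fooled by "terrific"
print(sentiment_with_but_rule(sentence))   # 0: the praise before "but" is ignored
```

A trained model would then score the second clause as negative on its own; the point of the rule is just that the praise before "but" no longer dominates the prediction.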
So there are different types of rules and forms that people are looking at, and again, it's an active area of research. As for common sense — I don't know exactly; when I say common sense, I mean things like "dogs have four legs," and we don't yet know how to build systems that can extract that kind of information.

[Audience member:] Thank you.

[Host:] All right. Kate, is your audio working?

[Kate:] Yeah — can you hear me now?

[Host:] Yes, now I can.

[Kate:] All right, that's a relief. Well, thank you for the excellent talk; I really enjoyed it. I have two questions on the part where we were discussing knowledge bases and extracting information from a huge corpus of text to answer questions. The first question: are there any restrictions on the type of questions? Because the problem statement we considered really maps to factual questions — you want to extract answers that are finite, factual answers. So in the Steve Jobs example, if I change the text of the question to something like "Was there a company founded by Steve Jobs in Redwood City?" or "Is there a company...?", does that change the system's approach a little bit, or not?

[Salakhutdinov:] Yes, I think that would change the system, because right now we are primarily focusing on factual questions, which means that given the question, there is one answer. There is work we're doing right now that expands beyond factual questions and looks more at the context. For example, if I ask you a question, the answer could be "yes, provided these conditions hold," or "no, provided these other conditions hold." So there is a next generation that goes beyond factual question answering, and it does require slightly different systems — though I think a lot of the pieces are still there. For example, a question could be: "I'm a graduate student and an international student; am I allowed to get health insurance?" And the answer is yes, you are allowed to get health insurance, but only if you're a TA for a class; if you're not TAing a class, you're not allowed to get it. These types of questions become much trickier because you have to do a bit more reasoning to build the answer. That's the next thing the community is working on, and we're doing it at CMU as well.

[Kate:] And the second question — I'll keep it quick — is about the multi-hop reasoning, which was really interesting, because typically systems don't work that well there; it's like a chain of conditional questions that you form. What I was curious about is the "X follow R" operation: is that R generated per question on the fly, or is it an input provided to the system?

[Salakhutdinov:] The way we build these systems right now is that we have a transformer architecture that takes the question and spits out R1. Now we want to train it.
We don't know what R1 is during training, so the model spits out some representation. Given this representation that we call R1, it goes and produces a set of intermediate answers; then that goes into another transformer architecture, which produces the final answer. By doing backpropagation, we hope that in many cases the model will figure out what the correct R1 is and what the correct R2 is, and I think something like 80% of the time it actually does. What's even more interesting is that even when the question is a one-hop question, the model learns to answer the first hop and then basically just copy the result to the second one. And of course, if you have some additional labeled examples that tell us what R1 and R2 are, that can definitely help the optimization.

[Kate:] Okay, well, thank you — those were my only questions.

[Host:] The next question, which I think will be our final one, is essentially about evolving these models, and how rapidly: they're trained on historical data, but you want them to stay up to date.

[Salakhutdinov:] That's a good question. This is one of the drawbacks of existing purely knowledge-base systems: like I mentioned, the concept of COVID didn't exist before 2019, and then it came up. If you rely solely on knowledge bases, then — if you're Google — what you do is hire a bunch of people who go and label and modify your knowledge base. But as things evolve, that's very difficult to keep up with. Hence this idea of using unstructured text: news comes in and you extract information from the news. Of course, one problem with news is that if you're extracting from unreliable sources — if the news keeps telling you that Obama was born in Kenya, and a lot of the data you see says that — then that's the "fact" you're going to extract from the data and essentially embed in your knowledge base. So one caveat: right now we're assuming that the data we have is factually correct. As things evolve, as you keep seeing more and more sentences, you keep building a bigger and bigger offline index, so you can start answering questions about new facts right away. In that respect, I think that with deep learning and unstructured text it's much easier to keep up to date with the facts that you see.

[Host:] Great. Well, I want to thank the speaker. Please join us in the data science community hour afterwards if you have any other questions, and we'll let you return to Western Pennsylvania life. Thank you so much.

[Salakhutdinov:] Thank you.
ECE Distinguished Lecture Series: Russ Salakhutdinov, Ph.D.
From Austin Brockmeier October 12, 2021