Welcome, everyone. This is the Data Science Community Hour. We've made it to April, to April Fool's Day, and luckily nothing has happened so far that's not within the normal range. We have two guests tonight. First is Moumita Bhattacharya, an alumna of the University of Delaware, where she got her PhD in Computer and Information Sciences; her thesis advisor was Dr. Hagit Shatkay. As many of you know, she has stayed involved with the Data Science Institute since she left. Moumita has been at Etsy, where she was doing information retrieval for the two-sided marketplace. Many of you may know Etsy for artistic crafts and other things, and connecting customers with sellers there is an interesting problem. And now she's at Netflix. So give us a brief little blurb, if you would, about what you're doing, and then we can start asking some questions about data science in industry.

Awesome. Thank you, Austin. Very nice to virtually attend an event at UD; I'm definitely looking forward to getting back to the East Coast. Since I graduated I was initially in New York, and now I'm in California. In terms of area, my trajectory started toward the end of grad school, when I got interested in recommendation systems. During my thesis I was focusing on disease prediction and patient stratification, working with longitudinal EHR data. One of the projects I was working on was identifying comorbidities in kidney patients: what are some of the additional conditions that we can identify for these patients, and can we automate that process? I was using topic modeling, and that led to some literature review in the space. Toward the end I was considering both a postdoc and industry, but once I decided I wanted to explore recommender systems, it made even more sense to go the industry route and look for teams working mainly on recommendation systems.

That's where I landed at Etsy. It was a small team at that point, just three scientists, and I was the fourth, and we wanted to really build out the recommender systems for the company. In 2019 Etsy was still quite small compared to... are you able to hear me? Oh yeah, we're fine. Okay, my internet is slow. So there I worked on various problems, including building out the core personalized recommendation module for Etsy users, which is primarily called "Our Picks for You." It essentially looks at all of a user's history and usage on the platform and builds out a personalized set of recommended items. Other kinds of recommendations would be item-to-item recommendations: if you're looking at a product on Etsy or Amazon, what are some of the other items we could recommend? Questions there include whether to add a personalization layer or not, and whether to include sequence models that capture the temporality of the user's usage. After about eight months I became the tech lead of the recommendations team, and I was able to grow the team from, at that point, maybe two people in that specific part of recommendations to twelve people by the time I left. There, in addition to working as an IC, I got exposure to thinking beyond my own tasks:
thinking about what future projects to pick up, and thinking holistically about recommendation systems. And then Netflix happened. Because I was in the recommendation space in industry, and Netflix is pretty well known for its recommender systems, I was very excited about this role. At Netflix I'm currently working as a senior scientist on the core recommendations and search team, and more recently I've been trying to dabble a little bit in search algorithms as well, in addition to recommendation algorithms. Happy to chat more and take questions.

Wow, I have a couple queued up. I tend to dominate these interviews, so if anyone else wants to, ask a question. You mentioned a couple of things that I think are quite relevant to the community here. The first one was the fact that you began leading a team. The moment you start growing a team, you have to look at it and say, what do we need on this team? For the graduate students here: what were you looking for? What sort of skills and experiences really stood out for you?

Yeah, great question, and I would love to think out loud on this. Both of the teams I have worked on since graduation have been at least 50/50: about 50 percent of the people were scientists, and the other 50 percent were ML engineers with a master's and some experience in scalable machine learning. So depending on the kind of candidate we were interviewing, there would be a different skill set we would assess the candidate on. For folks who had a PhD, we definitely looked for a broader understanding of machine learning, both breadth and depth, but at the same time problem-solving skills on a given case. I would often prepare a case study relevant to the particular product. For example, I already mentioned Our Picks for You. I would give the question: if you had to design Our Picks for You, how would you go about it? What kind of data would you look at, what metrics would you evaluate the model on, and how would you build a baseline? Then, what would be a good next step: if you had two months, how would you improve the model, and how, along the way, would you evaluate how good your model is? So for candidates with more of a research background, there would be some of these kinds of open-ended, applied machine learning questions, where we would expect the candidate to at least get to a reasonable solution. And of course there's the generic coding part: whether you have a PhD or not, you have to be really good at coding, so there will be coding rounds. But I personally steered away from companies that only assessed me on coding. When I was looking for opportunities, I had this condition in my mind that the company has to ask me to present my thesis work; otherwise it doesn't make sense, it's too generic, and they don't really want to cater to your skill set. Both at Etsy and at Netflix, a PhD candidate has to present their research, and if they have industry experience, the work they have been doing in industry. And communication is another thing that I realized pretty early on,
after joining industry, is super important, especially in applied machine learning. I worked purely as an engineer before my PhD, and the core difference I would describe is this: as an engineer, when I did not have the applied machine learning side, I was working in a silo. The work is challenging there as well, but you solve a certain part of the system while sitting in your own silo. As an applied scientist, you're looking at a problem end to end, from the business side to the application, to the metrics, to the A/B test, where you're talking to product, and after the experiment there are product decisions to be made. So you're really looking at the lifecycle of the entire pipeline, which is super exciting, but at the same time there's an additional skill set that helps you be successful, and communication is part of it. So: technical skills, of course; for a PhD, ML breadth and depth in at least a few areas; and then communication and coding skills, which we assess across the board as applicable.

I'm very happy to have you touch on communication. We do have a unique perspective here, since you were a software engineer before you got your PhD. You talked about how the roles differ, but what do you feel you gained from doing your graduate work? Did your perspective change, and not only in terms of the roles you're suitable for?

Yeah, for sure. Great question, because I've had so many discussions with people who have gotten their PhD, people who have not, and their opinions. My perspective would be that what a PhD brings to the table really depends on where you're coming from, what your previous experiences were, and who you are. For me, one component that really was affected and enhanced by the PhD was confidence. The fact that you are spending however many years doing this independent research, in a state where you are responsible for the body of work, and you are the drive: you have to be proactive to publish the papers, you have to figure out how to work with collaborators, and at the same time you are the only IC on the task. You're the only one who is mostly doing the coding and thinking about the problem. Of course you have your advisor and your lab members, but more or less it's your baby. I think that process gave me a lot of insight into my own stretch goals and my own abilities and weaknesses. In a very philosophical way, it truly helped me understand my strengths and weaknesses, where I want to take this knowledge and how I want to apply it, and what kind of roles I want in the future. Up until then, no form of education I was involved in gave me that space to just be doing some work purely for the sake of the work, which then leads to publications and talks. So the confidence boost and the ability to understand my own strengths and weaknesses were my biggest takeaways from the PhD. Having said that, very pragmatically, I know as grad students we all want to know about the career trajectory as well.
Financially, if you think about it, given what machine learning and data science can do, if you're looking at industry, and without giving out numbers, I would say there is no comparison between a beginning master's role in machine learning and AI versus a beginning post-PhD role, if you're getting into the right kind of positions. By position I mean a team that expects you to do some research. Of course, I know people who have joined Google and are still working as software engineers after the PhD, and then the difference gets blurred. But if after a master's you're able to join as an ML engineer, whereas after a PhD you're able to join as a research scientist, then in terms of pay and in terms of career growth there's a huge difference. That's more of the pragmatic, true-outcome part. But I would not say that should be the goal going into a PhD.

Thank you. Any questions from the gallery? Otherwise I'll ask a couple more.

Hi Moumita, good to see you again. Yeah, likewise. There are a lot of students here, so my question to you is: what is your take on students doing internships during the course of their PhD program?

Internships are great, I think, and two aspects come to my mind. One is that an internship definitely gives you some additional bullet points beyond your thesis and grad school to add to your resume, and if it's a relevant internship, you may be able to take your models, or different machine learning models, or whatever your space is, and apply them in a scalable system on a much larger dataset than is typical. And the second, which actually helped me, is that it's a stopping point for you to internalize where you want to be after your PhD. Most PhDs go through this process of academia versus industry, and during a PhD you're getting plenty of exposure to the world of academia. Of course, not all PhDs are writing grants with their advisor, but in terms of publications, in terms of conferences, however much you are involved, it gives you exposure to academic life. If you get some exposure to industry through an internship, I think it gives you a better perspective for making that decision post-PhD. That's something that really helped me, though until the very end I was still considering a postdoc. So the confusion may still remain, but you at least have some concrete examples to compare: I did an internship, I liked the team, I liked the fact that you work on really large-scale data in production systems. Or you might not like it, and you might decide, no, academia is what I want to do. I think that's a good perspective.

Do you have a follow-up question? I do. For my PhD I'm working on a healthcare project, and you mentioned that you worked with Professor Shatkay, so you were also working in the healthcare field. I was wondering, what were the big differences you experienced moving from a healthcare project to working for Etsy or Netflix, which are a different domain, a different field?
And how easy or hard was it to transfer the skills that you learned in the healthcare field to your Etsy and Netflix projects?

Awesome, great question. I think the main thing is that, depending on your research, and my research primarily was focused on building models, the models themselves were somewhat agnostic of the application; the data is what I was consuming in my models. The data had certain characteristics because it was patient data, healthcare data. If the data is instead user-usage data on Netflix, the characteristics of the data will be different, but the space of models you try can still remain the same. For example, if you're looking at the historical usage of a user, you can apply RNNs, LSTMs, or attention-based models, and you can do the same with patients' sequential data. You could apply those models even though patient data has a limitation: it's usually not as large as industrial-scale data, so you might not even have to enter the deep learning world; traditional models can work really well. But in some ways, because my research was applied research, it helped me learn this skill set: given a problem, which in academia comes from the collaboration and in industry comes from the product, how do we take that problem, propose a machine learning solution, and then implement that solution? Those three blocks remain consistent whether it's academia, or at least my applied research, or industry. Of course, if your research is very theoretical, there's a little more work you need to put in toward transitioning from pen-and-paper research to an industrial setting. So in some sense, because of applied research, I'm grateful I got the opportunity to work with collaborators and actual physicians. That also helped me build the skill that it's not just about building the models or seeing the results; it's also about communicating the results back to the collaborators in a consumable way, so they can get value out of them. Even though the domain was completely different from where I am now, the experience with those building blocks really helped me ramp up in industry. Did that answer your question?

It did, thank you. We have time for one more question, so this will be the last one.

I have a curiosity. I am familiar with the Netflix challenge and the recommender systems of that era, but I don't know what they're doing in the modern day with regard to scalability. There seems to be a fork in the road between cloud computing, where you don't need to worry, it'll run over a week and you get results back, and real-time systems that still have very hard constraints on making things run as fast as possible. I'm sure there are cases of each in the realms where you work now. But looking forward, what directions do you see as providing scalability? Is it going to be better computing still? Is Moore's law still helping, because you're really putting Moore's law, the cloud, into play? How is this a bottleneck in industry? In academia we're always saying, oh, we want more time on the cluster and that sort of thing, right?
You need to run things on your HPC systems, whereas in industry you can probably get a lot of bang for the buck, but still you don't want to waste it, right?

Yeah. In fact, at least in this application of recommendation systems and search, cloud computing so far has really not been... actually, let me rephrase. One of the key important parameters is latency. You might think we could use a lot of training time; actually we cannot, because most models need to be fresh. Fresh means a few hours fresh, or, at the max that I've seen at Etsy or Netflix, one day. We cannot lose users' updated information for more than one day; the model cannot be more than one day stale. What that means is that the training of the model often cannot take more than one day at the max. And then there's inference time: how quickly we are able to make predictions for an incoming user, whether it's Netflix or Etsy. In the literature you will see this latency component brought up at Amazon, at Facebook, everywhere; it's conveyed as key. So there is the component of freshness of the model, which directly correlates with how much time the model takes for training and parameter tuning. Of course, there are workarounds: there are checkpoints, where you can start from the parameters of the previous day's run and then fine-tune with the new data. And the other part is serving latency: if your model is very heavy, then even inference can take a really long time. In search, the latency budgets are literally nanoseconds; if it goes to milliseconds, that's too slow, and we cannot launch that model. So I think in the space of recommendation and search, the worry is more around latency and freshness versus staleness, rather than the model being so large that we have to train the whole thing for a few days. At least, I haven't seen that use case. I'm sure in heavier computer vision kinds of work, where the model is not serving a live production system, the use cases and the problems are probably different. But so far my exposure has been working closely with production systems: my models are all launched in production systems and then A/B tested, and if the A/B test wins, the model becomes part of the platform. And once it's part of the platform, again, latency, performance, and freshness become important.

Thank you. I love getting into the technical details. So if I needed to explain to Hassan what you mean by freshness: I know, because I've used Etsy, that I make one search and then it's like, oh, you really want this, I'll show you everything possibly relevant. So your past browsing history has to be really up to date, right? That's what you mean?

Yeah, that's one part; that's like really, really fresh. But at the same time, if the training time of a model takes more than 24 hours, let's say 36 hours, then essentially the model parameters are not reflecting the last 24 hours of actions that happened on the platform, so you're always catching up with the activity on the platform. Before the training starts you need to update all the data, and if the training itself takes more than 24 hours, then you're not able to reflect it, unless it's continuous training, and I haven't seen continuous training in a real production system yet. So you're feeding in data from 24 hours before, and if your model takes more than 24 hours, you're losing on that: the model is not yet updated for whatever usage happened in the meantime. For example, at Netflix every week, I think, more than 10 to 12 originals, don't quote me on that, but a large number of originals, are dropped. And if your model training takes more than 24 hours, a newly launched title will not be recommended at all, which is a huge loss for the business. That's what I mean by model freshness: new activity on the platform and new content on the platform all need to be consumed on a timely basis for the recommendations and search to truly and correctly reflect what the business wants.
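To make the checkpoint-and-fine-tune workaround she mentions above concrete, here is a minimal sketch of the warm-start pattern in PyTorch. Everything here (the `RankingModel` architecture, the checkpoint paths) is a hypothetical placeholder for illustration, not anything from Netflix's or Etsy's actual stacks:

```python
import os
import torch

# Hypothetical ranking model; stands in for whatever architecture is in use.
class RankingModel(torch.nn.Module):
    def __init__(self, n_features: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_features, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

model = RankingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

CKPT = "checkpoints/model_yesterday.pt"  # hypothetical path
if os.path.exists(CKPT):
    # Warm start: resume from yesterday's parameters instead of retraining
    # from scratch, so today's run only fine-tunes on the newest data.
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])

# ... fine-tune on the last day's interactions here ...

# Save today's parameters so tomorrow's run can warm-start from them.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()},
           "checkpoints/model_today.pt")
```

The point of the design is freshness: each daily run is a short fine-tune rather than a multi-day retrain, keeping the model no more than a day stale.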
That reminds me: we had a guest from Twitch, not last night, last week, and we talked about the cold start problem. A new user just gets on, or a new movie just gets on: how does it get into people's feeds? It's kind of a zero-shot learning case, as some people might know.

Yes, you very correctly pointed out that recommendation systems, and also search, but more so recommendation, suffer a lot from cold start. Not just at the level of personalization for individual users, but also at the product level. (Twitch, which I actually don't use, is video games, right? Yeah, they're recommending channels of video game viewing. Okay.) So in the context of Netflix it's the shows or movies, and in the context of Etsy it's the products, the items that are shown. And there is pretty big interest, as I'm sure many of you know from the literature, in thinking about cold start. For example, one of the things we have tried: let's say we have a neural factorization model which generates embeddings for all the items. For a new item, the new movie that just got launched, retraining the neural factorization model might take a long time. What we can do is have an additional layer that generates a pseudo-embedding, reusing what was learned by the neural factorization model but feeding in only the new title's information. So there are a lot of workarounds, separate additions to whatever model you use, for these cold-start subproblems. And I could pull up some papers if you're interested; I can share some afterwards.
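To make the pseudo-embedding idea concrete, here is a minimal sketch of one way such a content-to-embedding layer could look. This is a generic illustration of the pattern she outlines, not Netflix's or Etsy's actual model; the feature counts, dimensions, and names are made up:

```python
import torch
import torch.nn as nn

EMB_DIM = 32  # dimensionality of the learned item embeddings (assumed)

class PseudoEmbedder(nn.Module):
    """Maps item metadata (genre, language, etc.) into the embedding space
    learned by a factorization model, so a brand-new item gets a usable
    embedding before any interaction data exists for it."""

    def __init__(self, n_metadata_features: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(n_metadata_features, 64),
            nn.ReLU(),
            nn.Linear(64, EMB_DIM),
        )

    def forward(self, metadata: torch.Tensor) -> torch.Tensor:
        return self.proj(metadata)

embedder = PseudoEmbedder(n_metadata_features=100)
loss_fn = nn.MSELoss()

# Training idea: for existing items, regress onto the embeddings the
# factorization model already learned; at serving time, apply the trained
# layer to a new item's metadata to get its pseudo-embedding.
# loss = loss_fn(embedder(metadata_batch), factorization_embeddings_batch)
```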
Well, I think that's a great stopping point for the interview part. If you can, stick around while Mauricio is presenting, and then we can take follow-up questions afterwards. Does that sound good? Yeah, sounds good. Alright, great, so we'll switch the spotlight over. Let's see if that works.

Awesome, thanks; let me check it. So, my name is Mauricio Ferrato. I'm a third-year PhD student here at the University of Delaware in the computer science department. I work with the CRPL lab, under Professor Sunita Chandrasekaran. Basically, what I wanted to do today is put together a collection of best practices, or best tips, that can be useful for someone who is learning about machine learning, deep learning, or data science in general and wants to start their first project. I assume there are people from different backgrounds here, so I'll definitely appreciate any feedback, or a heads-up about things that need to be clarified or could be done better. And for people who already have experience and have tips to share with us, I would greatly appreciate that too.

So I'll just go ahead and begin. This is basically how the presentation is set up. First, I'll talk about the steps I've found useful when you first start a project: defining the project and finding the best dataset for the problem you're working on. Then I'll talk about the machine learning workflow and the different steps it takes to work through this type of project. Along the way I'll also talk about some popular and state-of-the-art tools and techniques.

First things first: if you want to start a project, you need to find a problem you want to solve. One of the things that really resonated with me when I first started working on these types of projects is something Professor Liao, whose classes some of you have taken, said: all models are wrong, but some are useful. I thought this was interesting because it sets the way you want to approach the project, in the sense that you cannot really create a model that captures a really complex scenario 100 percent. But you can break that problem into smaller pieces, solve some of those smaller pieces, and that's a good way to start working step by step toward fixing the complex problem. With that said, the one thing you want to do is make sure that, whatever problem you want to tackle, you have a good, measurable objective that you can answer. This is important because it will determine the type of dataset you end up using, and it will determine the type of techniques you end up using. Say, for example, you are trying to predict the price of a house based on the location of the house, how many bedrooms it has, how many bathrooms, and other information. This would be considered a regression problem. I've seen in the past that people don't really think about it, and they end up applying really complex techniques, or classification techniques they found online, rather than thinking about the problem and saying: okay, this is a regression problem, let me try a simple linear regression first and take it from there. So that's why it's important to define a problem that you know you can answer and that you can apply to the real world.
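As a minimal sketch of that "try linear regression first" baseline, assuming a hypothetical house-price table with the features he mentions (the CSV name and column names are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset with the features mentioned above.
df = pd.read_csv("houses.csv")  # columns: bedrooms, bathrooms, zip_code, price
X = pd.get_dummies(df[["bedrooms", "bathrooms", "zip_code"]],
                   columns=["zip_code"])  # encode location as dummy variables
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The simple baseline: fit, predict, and report an interpretable error.
baseline = LinearRegression().fit(X_train, y_train)
print("Mean absolute error:",
      mean_absolute_error(y_test, baseline.predict(X_test)))
```

Only if this baseline proves inadequate is it worth reaching for something more complex.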
Okay. After you have the problem you want to answer, the next thing to do is look for a dataset. In my experience, the quality of the dataset will determine whether the project is successful, and finding a good dataset can sometimes be really hard, depending on the type of problem you're trying to solve. There are things to look out for. Take the size of the dataset: you don't want a dataset that's too small, because then you won't have enough samples to train your model with and get good predictions. But depending on the systems you're using and the resources you have available, you also don't want a dataset that's too big, because it will take a long time to train, and then you won't be able to do research in a reasonable time. You want to check whether the dataset has too many missing values. If you're using images, maybe the images are too obscure or too bright, and that can affect the model. And is the dataset balanced or imbalanced? This is actually something close to my heart, because I work with healthcare data, and the two biggest problems with the datasets I use are that they are too small and they are also imbalanced: there are more people who do not have the disease than people who have it. One of the first projects I tried to tackle had this issue where the dataset was too small, and we were overconfident in the results we were getting. So it kind of sucked that I had to spend a couple of months working on a dataset from which I never got the results I wanted. But that was down to my inexperience; it was my first project, and I didn't really understand how these data science projects work, or what type of dataset I should be looking for. I feel that's something really important to know. And to follow up on that: you want to have multiple dataset options. You don't want just one dataset, because if that dataset doesn't work, you're going to have to go out and find other ones. Also, depending on the problem you're trying to solve, you could even use a benchmark dataset, or create a synthetic dataset if you want to.

Here are a couple of good dataset sources. Kaggle is a very popular one; it's really good, especially for people who are starting out and want free datasets they can use, and projects too. There is Google Dataset Search, which works just like Google Scholar, if you've used Google Scholar. There is data.gov, with government-provided data, as well as various other data banks, some of which are domain-specific. If you're looking for healthcare data, you'll most likely end up looking at data banks related to, say, the National Institutes of Health or something like that, which will have those datasets available.
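Here is a quick sketch of how you might screen a candidate dataset for the issues above (size, missing values, class balance) before committing to it; the file name and label column are hypothetical:

```python
import pandas as pd

df = pd.read_csv("candidate_dataset.csv")  # hypothetical file

# Size: enough samples relative to the number of features?
print(f"{df.shape[0]} rows x {df.shape[1]} columns")

# Missing values: which columns are mostly empty?
print(df.isna().mean().sort_values(ascending=False).head(10))

# Class balance: a heavily skewed label is a warning that plain accuracy
# will be a misleading metric later on.
print(df["disease"].value_counts(normalize=True))  # hypothetical label
```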
Okay. Then, after you have your problem and your dataset, you want to build your workflow. From my experience, what I find most useful is an iterative process: build a workflow, go end to end through all the steps, and get some results. Then, if I don't get the results I want, I go back to each step, modify or adjust that step to get new results, and do this over and over. It's good to break the work down into steps and to keep it simple; don't take on too much. If you're just starting out and you just want to test how a machine learning model works, there might be some steps you can skip. Maybe they could give you better performance, but if you just want to check out how the end-to-end process works, you don't have to do them; that's something you can add later on as you continue through the project.

Another thing I'd say is useful your first time is to consider using an automated machine learning tool to automate the workflow for you. There are a couple listed on the slide. It can be good to do this just so you can get the workflow the tool produces and kind of reverse-engineer it: break it down and see how these tools are targeting these problems and breaking them into steps.

Okay, so before I begin talking about the machine learning workflow I typically use, does anybody have any questions or feedback?

I've seen some of those AutoML tools reported in research, but I haven't heard personal experience with using one. Have you used any?

So, I've used the one called TPOT, because I was interested in the fact that it uses a genetic algorithm. But I didn't have a good experience with it, mostly because that was back when I was working with the really small dataset whose problems I mentioned at the beginning. I haven't touched any AutoML tool since then, but it's something I do want to give a try at some point, because it should be simple: you just input the data and the tool is supposed to automate things for you, right? So it doesn't really require much effort or time to sit down and try it out.
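For reference, here is a minimal sketch of what trying TPOT, the genetic-algorithm AutoML tool he mentions, looks like; a built-in scikit-learn toy dataset stands in for real data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in toy dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Small generations/population keep the evolutionary search cheap
# for a first try; larger values search more pipelines.
automl = TPOTClassifier(generations=5, population_size=20,
                        random_state=42, verbosity=2)
automl.fit(X_train, y_train)
print("Held-out accuracy:", automl.score(X_test, y_test))

# Export the winning pipeline as plain scikit-learn code, which is
# exactly what makes the reverse-engineering he suggests possible.
automl.export("best_pipeline.py")
```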
So, the machine learning workflow that I've found most useful, and this is a very generalized and simple one, breaks down into four steps: data pre-processing, feature engineering, model training, and validation. I'll go through each of the steps. First, data pre-processing, and before we do that, we have to set up the environment we're going to be working in. Here I'd say you can choose whatever programming language you feel more comfortable with. From my experience, I've used mostly Python; I've tried R and Julia in the past, but I always end up going back to Python. I prefer Python, but really, at the end of the day, it's whatever you're more comfortable with. Some languages have, you could say, more maturity, in the sense that there are more tools and software available than for others, but all of them work. The one thing I think is very important is to understand what compute system you're using first, so you can choose the best approach to setting up your environment. If you have a project to do and maybe you have a laptop that doesn't have a lot of power in it, you could consider something like a Jupyter notebook or Google Colab, an interactive notebook where you write your script and run it on Google's cloud services, so you don't actually need to have the resources yourself; Google provides a GPU or TPU or other resources, and you can do it for free. Some people prefer to use IDEs, and there are a couple listed there that are good in terms of visualizing the work at every step: visualizing the data, the models, and so on. And if, say, you're at a point where you want to work with large datasets on clusters or a big computer, then something like a containerized environment might be more useful, using something like Anaconda, or Singularity or Docker containers. I think it's good to dabble in all of these approaches, because the right one really depends on the maturity, the production angle, of your project. If you're just starting out, an interactive notebook will be much easier than setting up something else, but if you're trying to get to production, you will eventually have to set up some sort of container or environment like that.

For data pre-processing itself, I'd say it's best to use tools that let you manage and inspect the data. There's pandas, which is the most popular one, where you build DataFrames. The really cool thing about it is that your data is shaped into tables, where you have rows and columns you can manipulate, and you don't have to worry about any sort of nested arrays or anything like that.
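A tiny sketch of the row-and-column manipulation he's describing with pandas DataFrames (the data is made up):

```python
import pandas as pd

# A DataFrame shapes the data into a labeled table of rows and columns.
df = pd.DataFrame({
    "bedrooms": [2, 3, 4, 3],
    "bathrooms": [1.0, 2.0, 2.5, None],
    "price": [150_000, 225_000, 320_000, 240_000],
})

# Impute a missing value, filter rows, select columns, all by label,
# with no manual handling of nested arrays.
df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())
affordable = df[df["price"] < 250_000]
print(affordable[["bedrooms", "price"]])
```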
Alright, are there any questions on data pre-processing? Okay, so we move on to feature engineering. What feature engineering is, is using domain knowledge or statistical analysis to find the features that you think are going to be most relevant to the question you're trying to answer. This will definitely improve the performance of your model, and in my opinion this is the most important part of the project: being able to provide the model with the relevant features that you know will give you the best answer definitely makes it easier for the model to give you better performance. Based on the problem, there are multiple things you can do. You can combine multiple features into a single relevant feature: for example, if you have length and width, maybe area is more useful for what you're trying to solve, and it's better to have area than those two features. You can create new features: maybe calculating the mean of one of your features would be useful, so you go ahead and calculate the mean and add it as an extra feature. But I think what most people end up doing is the opposite, where you filter out the non-relevant features; that's called feature selection. The way to think about feature selection is that it's very important because of this thing called the curse of dimensionality. What that means is this: in some situations, especially in healthcare in my case, where I work with genomic whole-exome sequencing datasets, we're dealing with a really small number of samples and a lot of features. What happens is that if you train a model with a lot of features, especially more features than samples, the model ends up overfitting: it learns the noise the dataset has and fits too well to that dataset, so when it's given new data it has never seen before, it struggles. We do feature selection because, by using fewer features, you not only improve your performance, since you're doing less work, but you also eliminate outliers and prevent overfitting.

Here are a couple of different feature selection techniques: association rules, feature importance, and so on. It really depends on the type of dataset you have. For example, in my case we tried to use autoencoders to reduce the dimensionality of these genome sequences, and I ended up having a lot of trouble with the autoencoders, because with a genomic sequence you're talking about hundreds of thousands, sometimes even millions, of features. So I had memory issues with the autoencoders, and other issues, where maybe using a simpler feature selection technique would have made things easier for me. I have listed a couple of tools. Scikit-learn is, I think, the best tool to start out with, just because it has implementations of basically every single step of the workflow. But I also recommend Featuretools, since it kind of automates the process of trying out different feature selection techniques for you, using scikit-learn as a back-end. So instead of trying those feature selection techniques one by one, you can let Featuretools try them all for you. Okay?
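Here is a minimal sketch of the simpler scikit-learn route to feature selection he alludes to: univariate selection that keeps the k features with the strongest statistical association with the label. Toy synthetic data stands in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-in: few samples, many features, like the genomic case described.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=10, random_state=42)

# Keep only the 20 features whose ANOVA F-score against the label is highest.
selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (100, 1000) -> (100, 20)
print("Kept feature indices:", selector.get_support(indices=True))
```

Unlike an autoencoder, this runs in seconds and needs no GPU or large memory, which is the trade-off he is pointing at.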
Okay. Now I'll talk about model training, but before that, one thing I'll point out is that we want to make sure the data we use to validate the model is data the model has never seen before. This is important because, if the model has already seen the test data during training, then the model already knows the answers, and at that point we're not really testing anything. There are at least two ways I know of to handle this; there are probably more. You can randomly sample from the dataset to build three different groups: a training set, a validation set, and a test or holdout set. This really varies with the dataset size, and sometimes, if the dataset is not that big, maybe separating it into two groups is a better option. You can also use a cross-validation technique, where you break the dataset into chunks, train on everything except the first chunk, test on that first chunk, and then repeat this for every other chunk in the dataset. So those are two ways you can accommodate for this.

When it comes to choosing a model, I'd say using simple solutions first is the way to go, because if you really think about it, there's no need to spend time building really complex models when a simple solution can give you the same result. This is basically the problem I had with the autoencoders, and the problem I had when I initially started: neural networks are the big trend everybody talks about, but sometimes, for tabular data, a more classical or simpler model, say a random forest, which is an ensemble model, is a better option. I'll also say it's good to understand the assumptions each model makes when it makes its predictions. What I mean by that is that you shouldn't look at machine learning models as a black box, where you input some data and it outputs some magical results. It's important to know how they're making those decisions and finding those results, because it could be that the model you're using is not the correct one for the type of dataset you have. An example would be images: the most popular algorithms there are convolutional neural networks, or CNNs, and if you don't realize that, you may end up using an algorithm that doesn't really work well with images. The same goes for techniques in general: you want to use more than one technique, not just focus on one, and try different things out. Maybe the first technique doesn't work, but if you go and try other ones, you may get better results.

Here's a list of different machine learning techniques and the tools out there you can use. The big takeaway is that scikit-learn usually has most of these implementations, so if you're just starting out, scikit-learn is the way to go, and you can test all of them. Then, as you move toward much bigger datasets, you can start looking into something like NVIDIA RAPIDS, which basically does the machine learning work on GPUs. For artificial neural networks there are also multiple tools available; TensorFlow and PyTorch are the most popular. I think the big thing to know, the way to differentiate which one to use, is static graphs versus dynamic graphs. When you build a neural network with TensorFlow, the representation is one graph that you build completely and that doesn't change throughout your training process, while with PyTorch you have dynamic graphs, which can change as you train the model. This matters if, say, you want to use a neural network that has memory, like a long short-term memory network, that has to call back to its previous decisions.

Okay. At the time of training the model, like I said before, make sure you train only on the training split, and be mindful of the time the model takes to train. Sometimes the default parameters for the model don't give you the best performance, so you want to use some sort of hyperparameter tuning technique, an optimization technique that can automate the process of trying out the different parameter settings that would give you the best predictions. Okay?
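Pulling the last few points together, here is a minimal sketch of holding out a test set the model never sees, then letting a cross-validated grid search automate the trying of parameter settings for a simple random forest. Toy data stands in for a real dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # toy stand-in dataset

# Hold out a test set that plays no part in training or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation on the training split scores each
# combination of hyperparameters; the best one is refit automatically.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_test, y_test))
```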
Okay, and the last step: validation. When it comes to validating any new project, there are a couple of tips I'd like to share. Visualization is key. Sometimes it's better to provide graphs, histograms, box-and-whisker plots, or some other visual representation: it's nicer to look at, it catches people's attention, and it can provide more information than a single number, a single value, could give you. It's also important to learn how to properly read graphs, and I'll give an example of that in a moment. The other tip is that sometimes accuracy isn't enough. This goes back to the beginning, to establishing your problem: the metrics you would use to validate a project where you're trying to find the price of a house are not the same as the metrics you would use in a healthcare project where you're trying to predict whether someone has a disease. This is important; it's good to know what metrics you want to use for your scores and use them to go back and improve your model. And then you also want to be able to map your scores back to the real-world scenario and make sure you answer the question you were trying to answer. A couple of tools that are good for plotting graphs: Matplotlib, seaborn, et cetera.

We can actually skip this example, but the example I did want to talk about is about learning how to read graphs. This is from one of the projects we worked on, when we first generated these ROC AUC curves. What an ROC AUC curve gives you is a good representation of how your model is performing at different thresholds. In this case we're doing binary classification, so we're just classifying something as disease or not disease. What we have here is four different datasets, where each has a different number of what we call spikes, and a spike is just an important feature that tells you the sample has the disease. The one thing we noticed is that for the dataset that had no spikes, we were seeing very high results: an area under the curve of 80 percent, which is high. By identifying that, we were able to realize: okay, by running the feature selection technique, we are still selecting certain features that carry a small signal, which gives the model a way to make seemingly accurate predictions. But that was unwanted, because we wanted to compare this setup against a fair alternative: doing the whole project again without any feature selection or feature engineering, just running on the full dataset, and as it stood we didn't think it would be a fair comparison. So we had to normalize the curves, bringing them down so that the baseline, the zero-spike dataset, which is supposed to be a random dataset, actually looks random. By doing that we were able to normalize the other curves too, and then we had a proper comparison between the two setups.
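For reference, here is a minimal sketch of producing the kind of ROC AUC curve he is describing, using scikit-learn and Matplotlib on a toy binary classification problem:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# roc_curve sweeps the decision threshold, which is exactly what makes
# the curve a picture of performance "at different thresholds".
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "--", label="random baseline")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```

The dashed diagonal is the random baseline he normalized against: a dataset with no real signal should hug that line.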
I don't want to take up too much of your time; let's try to stop within the hour. Yeah, I'm about to be done, two more slides.

So ultimately, what you want to do is have fun when you're working on these projects, and you want to be patient. Don't let bad results demotivate you; no result is unreasonably bad. Projects take time, and there's a little back and forth before you start getting results. If the model doesn't work, it might just need adjustments, and then you're going to start seeing better predictions. And be patient: it took me two years to feel comfortable talking about machine learning. There are a lot of different aspects to it, a lot of different things you can learn about. At least from my experience, when I first started I felt scared talking about machine learning, because I saw these really complex algorithms and didn't really understand the fundamentals. But that's something that just takes time, and it's different for everybody. I will say: try small projects first, try joining small competitions, for example on Kaggle, and never be afraid to ask questions. And that's the end of the presentation.

Let's let our guest ask the first follow-up question. Is there a question for me? Yeah, if you want to ask our guest expert today.

First of all, great presentation. I do have to run, but in addition to everything you shared, I think one component that helped me through my PhD was, instead of focusing too much on the available tools, going back to the first principles of whatever algorithm you are using. Think about what the objective function is and what the math is that drives the algorithm. Machine learning is a field that is probably changing the fastest, so learning tools and programming languages is definitely part of any graduate school experience, but what really helps are the fundamentals. In fact, if you think about it, one basic unit of a neural network is nothing but a logit function; essentially, if you understand logistic regression really well, or linear regression really well, you can understand a deep neural network easily as well. So really trying to understand the first principles of every model you're thinking about has helped me in the long run. When I finished my PhD, deep learning was just coming up, and I was working on healthcare, so I didn't really have that much large-scale data for applying deep learning. What really helped was the realization of where and how all this mathematics converges, which helps you learn new things quicker. But that's it for me.

That's a great perspective. Awesome. I do have to drop off now; thanks so much, Austin and everybody. Yeah, the recording will live on, so the future data science community will get to see the interview, and I hope that helps. Thank you so much for joining us today. Yes.
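As a small closing illustration of that first-principles point, here is a sketch showing that a single neural-network unit with a sigmoid activation computes exactly the logistic-regression function; the numbers are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One "neuron": a weighted sum of inputs plus a bias, squashed by a sigmoid.
w = np.array([0.8, -1.5, 0.3])   # arbitrary weights
b = 0.1                          # arbitrary bias
x = np.array([1.0, 2.0, 0.5])    # one input sample

p = sigmoid(w @ x + b)  # identical in form to logistic regression's P(y=1|x)
print(f"P(y=1 | x) = {p:.3f}")
```

A deep network stacks many such units, which is why understanding logistic regression well goes a long way toward understanding neural networks.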
Data Science Community Hour (April 1, 2021): Dr. Bhattacharya @Netflix and Ferrato's "How to approach your first data analytics project"
From Austin Brockmeier April 01, 2021
Interview with Dr. Bhattacharya (UD alumna now at Netflix)
Followed by Mauricio Ferrato's "How to approach your first data analytics project"