First, I want to show the link — down at the bottom, the GitHub repository for this HPC lecture series. There you have the examples from this lecture and the previous ones, plus other resources, links, and the command lines for everything we will talk about today.

Today we will talk about profiling and debugging tools, and about some of the common parallelism bugs. Tristan and I worked on the slides together, and he will present most of the material.

So: we will cover profiling and debugging — why we want to profile, what it is, some tools that can do it, and why the size of your data set matters when profiling. We will talk about race conditions and deadlock using OpenMP examples, and we will finish with communication interlock in MPI. At the end, if there is time, we have some example MPI and OpenMP codes that we can go through together and try to find where the errors are.

Why do we profile? We profile because, as shown in red on the slide, a program cannot execute faster than its slowest path. Even if you run in parallel, if you have a long sequential stretch of work, that sequential part will still take a long time no matter how much OpenMP you use elsewhere. So you need to know where to look when you want to optimize and parallelize. The problem is that real-world applications are large and have many flow paths, and you need to find which ones are the performance bottlenecks. If you try to do that just by reading your code and by trial and error — modifying the code and seeing what happens — it will take you a long time and you will probably not optimize much. You need profiling tools to find which parts of the code you should read and analyze in order to actually accelerate the program.

So today we will talk about gprof, the GNU profiling tool; Valgrind, another common Linux tool, and two of its sub-tools, memcheck and callgrind; and finally VTune, an Intel tool for tuning. Then we will take a break.

gprof, as I said, is the GNU profiling tool. You need to compile your program with the special -pg option to enable profiling. Another thing: be sure your program is working before you profile it. Profiling a buggy program will not help you much — it is not a debugger — so debug and test your application first.

To get a profile, you just run the application as usual. That generates a gmon.out file. This file is not readable by humans; it is just the raw monitoring data the application recorded. You then run gprof on the binary, and it produces a readable report. A minimal sketch of the whole workflow is below.
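Just as a minimal sketch of that workflow, assuming a toy program (the file and function names here are illustrative, not from the lecture's example):

    /* profile_demo.c -- toy program to illustrate the gprof workflow.
     *
     * Compile with profiling enabled:  gcc -pg profile_demo.c -o profile_demo
     * Run normally (writes gmon.out):  ./profile_demo
     * Produce a readable report:       gprof ./profile_demo gmon.out > report.txt
     */
    #include <stdio.h>

    static double slow_sum(int n)        /* should dominate the run time */
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                s += (double)i * j;
        return s;
    }

    static double fast_part(int n)       /* should show up with a small percentage */
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += (double)i;
        return s;
    }

    int main(void)
    {
        printf("%f\n", slow_sum(20000) + fast_part(20000));
        return 0;
    }

The gprof report should then show slow_sum near the top of its table, which is the kind of information the lecture reads off the SVM example next.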
Here is the beginning of the gprof output. It is a lot of data — I don't know if you can read it clearly on your screen. This first table, where my pointer is, orders the functions by time. The first column is the percentage of the total application time spent in that function, and it accumulates across calls: if a function A calls a function B, the time spent in B is accumulated into A.

Here it is applied to an SVM application — a support vector machine computation — so it is mostly linear algebra. The function we spend the most time in is the matrix-multiply routine, which is not surprising: in this application that is where the computation is. Then we have further information: for each function, which other functions it calls and how much time is spent in each.

As you can see, even after gprof has processed the data, the output is not really easy to read. Personally I find it genuinely difficult to make good sense of, so we will shortly move on to tools where you can really see what is happening.

A word on thread safety: you would not want to profile a multi-threaded program with gprof, because it is not guaranteed to be thread safe — some implementations of gprof are not. It will run, but the statistics that are captured will not necessarily be correct; you might get numbers, but they may be useless. So you should only profile sequential applications with gprof. That is why we now move to more powerful profiling tools such as Valgrind.

Valgrind is, in fact, not just a profiling tool; it is a dynamic analysis framework. It is a complex and powerful application that provides several smaller tools: memcheck, the memory checker, which is the default tool for Valgrind — maybe some of you have already used Valgrind to track down a segfault, and that was memcheck; we will come back to it later. There is cachegrind, which gives you an idea of how the cache is used — we will not talk about that — and callgrind, which analyzes the call graph of the application dynamically. Callgrind is the one we use for profiling. The big advantage is that Valgrind is thread-safe, meaning that even in a threaded environment it gives you coherent results.

One note: some of these tools, Valgrind for example, are not necessarily installed on the Mills cluster. What we are showing you are tools used by developers of sequential and parallel programs. If you see a tool that looks interesting, you should perhaps install it on your own system, and if you find it useful enough, make a request to the Mills cluster administrators — I am sure they will be happy to install it for you. We did check, and I don't think Valgrind was there; GDB, at least, is available on the compute nodes, but this is the kind of thing we need to verify. And since the Mills cluster is still relatively new, and it is not necessarily used by what I would call application developers, it is used more by the domain scientists.
The scientists are bringing their applications to the Mills cluster — MATLAB, R, those kinds of tools. The serious development tools that high-performance computing developers use, however, are still not being installed on Mills. So we are introducing you to some of these tools, and if you find any of them useful, you should make a recommendation to have them installed.

I think there was a question about the performance tools: certain tools need support installed in the kernel to be able to take advantage of them, and whether that would change the overall performance of the kernel. I don't know all the details, but we should definitely talk about it and make requests for the kinds of things people might need — there is no harm in asking. Some of these tools are user-space tools, but they do require hooks into the kernel to access special features of the processor. Those hooks shouldn't reduce the performance of the kernel; they just haven't been enabled because nobody has asked for them. You do have to patch the kernel with the specific functionality that accesses those features.

Basically, many of these profiling tools access low-level hardware features called performance counters, and you need some special kernel support and privileges to access them. So although some tools are installed on the Mills cluster — TAU is there, and the AMD code analysis tool, and OProfile — the full functionality of those tools cannot be used because the kernel hasn't been patched yet. They have done about half of the job, and there still needs to be some more patching of the kernels to get the full functionality of these tools. There is also the fact that Mills is a production cluster: it is not rebooted every day like a PC — in fact it hasn't been rebooted in months — so you can't take advantage of some of those patches unless you reboot. We have talked about scheduling a reboot; I guess some of those patches can be loaded dynamically and don't need a reboot. On the other hand, if you don't consider the Mills cluster a development system, then some of those tools don't necessarily have to be on it: people would develop their applications on another cluster, or part of Mills could be made the development part — you could have a queue where some of the nodes run the patched kernel and some do not — but I don't know if it is worth it.

For those of you who came late: we will take a break between profiling and debugging.

All right. The first tool we have in Valgrind is memcheck. If you look at the command line, we call valgrind and specify the tool — memcheck is the default, so we don't even have to specify it — and then run the application under Valgrind. What can we do with it? Memory error detection.
Memcheck verifies that you are not accessing memory you should not access — because you are outside the heap, outside the stack, or because you have already freed that memory — and it tells you when you do. If you are using an undefined value — in C++, for example, an attribute of an object that you never initialized — Valgrind catches it, and it also catches values that were themselves computed from an undefined value. It will also tell you if you are double-freeing memory, or if you mismatch malloc/free with new/delete and the array forms, and things like that. And it catches problems like memcpy on overlapping regions — copying from one array to another when the two overlap in memory.

So basically memcheck will see all the memory problems. When you have a segfault and you cannot find where it comes from with GDB, Valgrind will give you most of the information you need to find it. It can also be run on a threaded application: you can see which thread accessed which memory at which moment, which helps you find some of the race conditions we will talk about later.

A quick note on memcheck and memory bugs in general: memory bugs are some of the hardest bugs to find, and they can cause problems in your production codes because they often go unnoticed during development. Many of the memory bugs Tristan just described do not necessarily cause your program to crash. So you may think your program is fully debugged while you are actually reading and writing garbage values that are very hard to spot, because you are dealing with huge amounts of data. Then, one day, a production run happens to crash because of one of those garbage values. Only then do you find out that you had a memory bug, and you have to go through the whole process of doing what you should have done at the beginning: running one of these tools to get rid of most, if not all, of the memory bugs in your code.

These errors happen quite often in C and C++ programs. Applications written in Java, for example, don't have as many of them, because Java does garbage collection — you don't free memory yourself, it is freed for you automatically — and Java also checks that you are not overrunning your arrays or your stack. In a language like C or C++, the reason they are high-performance languages, and the reason you use them, is precisely that they don't do those checks: no garbage collection, no checks for overrunning the heap or your arrays. That high performance comes at a cost: it is much easier to put memory bugs into your programs.
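As a small, self-contained illustration (not one of the lecture's examples — the file name and the bugs are made up for this sketch), here is the kind of program memcheck is good at catching:

    /* memcheck_demo.c -- toy program with two deliberate memory bugs.
     *
     * Compile with debug info:  gcc -g memcheck_demo.c -o memcheck_demo
     * Run under memcheck:       valgrind --leak-check=full ./memcheck_demo
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *a = malloc(10 * sizeof(int));

        for (int i = 0; i < 10; i++)
            a[i] = i;

        /* Bug 1: read one element past the end of the heap block --
         * memcheck reports an "Invalid read".                        */
        printf("%d\n", a[10]);

        int uninitialized;
        /* Bug 2: branch on a value that was never initialized --
         * memcheck reports a conditional jump that depends on an
         * uninitialised value.                                       */
        if (uninitialized > 0)
            printf("positive\n");

        free(a);
        return 0;
    }

Neither bug necessarily crashes the program, which is exactly the point made above: without a tool like memcheck they can sit unnoticed until a production run goes wrong.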
The other tool in Valgrind is callgrind. Callgrind lets you build a call graph — a profile graph — of your application. You compile with the same -pg option as for gprof, and then you run the application under Valgrind, just telling it that the tool to use is callgrind. The output of this is again not human-readable, so we use a GUI application called KCachegrind, a Linux application for visualizing the graph.

This is the same application as the one whose gprof output I showed you. gprof gave us text; here we have exactly the same information, but as a graph showing which function calls which, how many times, and how much of the total time is spent in each one. At the top we see the system — Linux starting the application — until we arrive at main, and we spend 99% of the time in main, which is not surprising. main does a handful of things: it reads the data file, calls the polynomial kernel function, calls the get-alpha-from-training-set routine, and does a deep copy of the training set.

What we see is that 60% of the time is spent in get-alpha-from-training-set, but we also see that there are multiple ways to arrive at the polynomial function: directly from main, or through get-alpha-from-training-set via examine-example, take-step, and the learning function. So we have the whole call graph, annotated. In the end, 44% of the total application time is spent in the polynomial function, which is basically doing matrix multiplies — so about half of the time is spent on matrix multiplication. But we also see that 24% of the time is spent just reading the input file. And that is with a fairly large data set; with a very small data set, something like 90% or 95% of the time would be spent reading the file, which is bad for profiling, because we cannot really optimize file reading — well, you can, but that is not what we are here to talk about.

The second example shows one very specific thing that happens with OpenMP. I profiled the OpenMP matrix multiply from lecture two. What you see is that main calls a function with a name like main.omp_fn — a function generated by the compiler from the OpenMP construct. Only about 12% of the time is spent in main itself, and main calls that outlined function once; but the function is called seven more times by the thread start routines, because OpenMP starts worker threads, and those threads never touch main at all. They are not the master thread — only one thread ever runs main — they go straight into the OpenMP-generated code.

I wanted to show that because, as you can see, it is demo time, so I will do a small demo on my laptop. Here you can see the command line to compile with OpenMP: the -fopenmp flag. A sketch of the kind of code being profiled, together with the commands involved, is below.
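The lecture-two code itself isn't reproduced here, so this is only a minimal sketch of the kind of OpenMP matrix multiply being profiled, with the commands the demo uses collected in the header comment (sizes and names are illustrative):

    /* omp_matmul.c -- minimal OpenMP matrix multiply standing in for the
     * lecture-two example.
     *
     * Compile:              gcc -fopenmp -pg -O2 omp_matmul.c -o omp_matmul
     * Plain run:            ./omp_matmul
     * Run under callgrind:  valgrind --tool=callgrind ./omp_matmul
     * View the result:      kcachegrind callgrind.out.<pid>
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 512

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = (double)rand() / RAND_MAX;
                b[i][j] = (double)rand() / RAND_MAX;
            }

        /* The compiler outlines this parallel loop into a separate
         * "main.omp_fn"-style function, which is what shows up in the
         * gprof and callgrind output instead of the source line.      */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }

        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }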
Tristan is running this locally on his laptop, and he has this nice tool called System Monitor, a standard Linux tool that lets you watch what is going on on your local system. The interesting panel is the CPU graph at the top, which shows eight hardware threads — eight cores. The program's threads are going to run on those cores, and we will see the activity of the threads running on each one.

Tristan and I were wondering whether there is a tool like this that users of the Mills cluster can access, to see the activity of the nodes their job is running on. There is top — right, but you would have to run it on the actual node; you would have to run this kind of interface on the node itself. There is also the web-based monitoring log, but it samples on the order of minutes, and it does not break things down by thread — it only looks at the system as a whole. Is there a technical reason this couldn't be used? You could just use X: if you have an SSH tunnel with X forwarding you can run an application like this remotely. What is it called? gnome-system-monitor. I think it is a useful tool for development, because it reports memory and swap, and the number of threads running on each core, so you can see dynamically, over time, where some of the bottlenecks are happening.

Anyway — here I am just running the OpenMP application natively, and you see the cores all spike to the top: full thread concurrency.

Now, what happens if we compile with -pg and run it under Valgrind with callgrind? In this case the application takes a lot more time to run, and you can see that the cores are not all fully used — I would say about 40%, or even less, of the compute capacity. That is because Valgrind is thread-safe: it takes care that no thread writes to shared memory without checking that another thread is not doing the same thing at the same time, so the threads have to synchronize with each other through Valgrind, and that makes everything much slower. That is also probably why Valgrind will not scale. You can see how long it is taking for this run to finish.

Just to emphasize: Valgrind and all of its associated tools cause significant slowdowns to your application. Memcheck, for example, might make your program run a hundred times slower. For the amount of work it is doing and the errors it catches, that factor of a hundred may well be worth it to you. But it does mean you are not going to run a full production job, with the full, large data set, under these tools.
If a run would normally take maybe days, multiply that by a hundred: that is how long it would take under Valgrind. So you would not do this on a production run. If you have a big MPI application and you want to profile or debug it at a finer grain, start with a small run — a couple of processes or threads — and see what you find there first.

OK, it has finished now. It produced a file called callgrind.out.<something>. Why did the CPUs never reach 100%? Because of the synchronization, I think: every time one of the threads does something, callgrind says "I need to see what happened here, I need to log it", and to write that log entry it has to be the only one writing at that moment. With eight threads concurrently trying to record data, there is, in effect, a lot of I/O happening: the CPUs fill up a log in memory, that log is flushed to disk, and while that writing is happening the CPUs are not running at full capacity. That is an interesting point in general: if you run your application and you see 40%, or anything well below 100%, you want to investigate things like whether your application is doing too much I/O.

So we have this callgrind.out file and, as you can see, it is not readable by humans either. What we do is use the KCachegrind application. It lets you see what is happening: you have several views — you can see the color map here — and the area of each block is proportional to the time spent in that function, so you see at a glance where the time goes. There is one shortcoming I had not thought about: with OpenMP, the compiler automatically generates the outlined functions when you compile, so there is no corresponding source code for them in the viewer. In the sequential case you do get the corresponding source code, with the time spent on each line, which is very useful.

So that is KCachegrind, which is basically the best free application for this kind of profiling; past that, you have to pay — for VTune and tools like it, which we will talk about later.

To sum up Valgrind and parallel applications: Valgrind is thread-safe, so there is no problem using it with OpenMP. For MPI, MPI forks separate processes and each Valgrind instance controls one process, so there is no problem there either — you can run it by simply adding it to the command, as you can see here: mpirun, some MPI options, then valgrind, the Valgrind options, and your application. That is what you would put in your qsub script for the queue system. The problem is that the memcheck tool will report a lot of errors in your code simply because it does not understand what happens inside the MPI functions.
The fact that you have message passing — data arriving from another machine — is not something Valgrind understands directly. But there is a wrapper, an MPI wrapper; you have the link to its documentation at the end, and in the GitHub repository as well. It helps Valgrind keep track of the state of the different buffers in memory, and it reduces the number of false positives — reported errors that don't actually exist. You still have to go through the output and filter what is a real bug and what is not, so it is still hard; but an MPI application is an extremely complicated thing to debug and analyze anyway. If you search on Google Scholar you will see that people regularly publish papers on how to profile, analyze, optimize, and debug MPI applications.

Now let's talk about VTune. VTune is an Intel tool used to analyze, detect, and solve program performance problems. The first part is the analysis: monitoring — dynamic analysis at run time, so basically profiling the application. Then there are several tools, like the VTune performance analyzer and the thread profiler, that are used to analyze those profiles. All of this comes in the VTune Amplifier, which is installed on the cluster, along with the Intel ICC compiler. I am not sure whether what is installed is an evaluation version; I don't think we got the cluster version, but I thought Amplifier came with it — we can double-check.

So what can you do with VTune? It will detect your hotspots. You would normally build with the Intel compiler, and I had no license to do that, so I could not actually compile with it myself. Anyway, the main point is that it generates a profile result file that you then copy back to your own computer and analyze locally. The same is true, by the way, for callgrind: instead of launching KCachegrind through an SSH tunnel, you just copy the output file back and launch the viewer locally. It works better, especially with the large displays you get here.

VTune finds the bottlenecks using what is called hardware event-based sampling: it uses the hardware performance counters and hardware-triggered events — a cache miss triggers an event, a page miss triggers an event (we will talk more about those next lecture). It also does locks-and-waits analysis: the time a thread spends waiting instead of working. We want to know how long we spend in each wait and on each lock, because if you spend too long on one lock, it probably means that lock is not well placed and something needs to be rebalanced. Then there is thread profiling, where you can see the interactions between threads and the balance of the workload, which is what the next screenshots show. It provides a lot of the functionality we talked about before, but it is professional, proprietary software, so it is very clean and easy to use.

Here is one screenshot, rendered on Windows. Each line here is a function, like we had before in KCachegrind, and for each function you have the run time — the total time spent in it.
The time spent doing useful computation is shown with a color, and the color is based on how well each core is actually used: if a core is taking a lot of cache misses, it spends most of its time doing nothing, whereas the ideal is a core that does useful work on every cycle. So for each function you can see how well it performs.

Below that, there is a section showing the individual threads. Green means the thread exists — it has been created — and brown means the thread is actually doing something, computing. A thread existing does not mean it is computing. Here we see one region where the CPUs start to be well used, but most of the time not much is happening.

We can use this to analyze OpenMP applications. As we said, we can look at the synchronization, the waits, and the locks, so we know how long we spend in synchronization or in I/O operations, and we can understand how those waits are hurting — or sometimes helping: occasionally, making a thread wait a little longer actually makes the application behave better.

There is more detailed information available: timing information, and — this one is interesting — the waiting-objects view, which tells you on which object you spend the most time waiting. The first one here is the OpenMP join barrier. After a parallel region — for example an OpenMP parallel for — the work is spread over multiple threads, and at the end each thread waits for all the others to finish: that is the join. And that is where most of the waiting time goes here. It means the different threads are not perfectly balanced: some threads have to wait for the others to finish their share of the work. So there is probably a way to balance that — yes, the thread computation should be balanced.

To see what happens thread by thread, you use this timeline view: for each thread you see when it is working — solid green — and when it is idle — light green — which is when it is not doing anything, just waiting. The transitions and communication are shown in yellow. You can see idle time here: it is there because everyone was waiting on this one thread to finish. If we could split that first chunk of work into two pieces, it would take less time, and we would not have these small idle gaps, because the synchronization point would be reached by this thread at about the same moment as by the others. Basically, all those little gaps you see here can be removed simply by balancing the load — the amount of computation each thread does. A small sketch of this kind of imbalance, and a common way to fix it, follows below.
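This is not the lecture's code, just a minimal sketch of the kind of imbalance that shows up as time spent in the OpenMP join barrier — a loop whose iterations do very different amounts of work — together with one common way to rebalance it, changing the loop schedule:

    /* imbalance_demo.c -- iteration i does O(i) work, so with the default
     * static schedule the threads that receive the high-numbered iterations
     * finish much later and everyone else waits at the implicit barrier.
     *
     * Compile: gcc -fopenmp imbalance_demo.c -o imbalance_demo
     */
    #include <stdio.h>

    int main(void)
    {
        const int n = 10000;
        double total = 0.0;

        /* schedule(static)  : big per-thread imbalance, long barrier waits.
         * schedule(dynamic) : iterations are handed out as threads free up,
         *                     so the load evens out and the join-barrier
         *                     wait shrinks.                                  */
        #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < i; j++)       /* triangular amount of work */
                s += (double)j;
            total += s;
        }

        printf("total = %f\n", total);
        return 0;
    }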
And this last view, the thread concurrency histogram, basically shows how many threads are occupied at any one time, summed over the run. When the histogram shows mostly one thread working, it means a lot of the computation is being done on the master thread alone, and we probably have a problem balancing the computation between threads — we are not giving every thread enough work to do. There is a scroll bar at the bottom, so you can fast-forward and rewind to the moment in the run that you are interested in and scroll through your application's execution. And once again, this is one run, only one: it gives you information for the data set you used. With different data, you might get a completely different pattern.

We can also use VTune with MPI applications; you drive it from the command line. MPI, as we know, launches multiple processes, and if you look at the command used in the example: when your application has been built with the VTune instrumentation, each process creates an output file, so we need a way to collect all of that output. We use the amplxe-cl command-line tool for that. It goes after mpiexec — where we say we want four processes — and then amplxe-cl does the collection: here it collects hotspots into a result directory from the application, roughly mpiexec -n 4 amplxe-cl -collect hotspots -result-dir my_result my_app, and this creates one result folder per process. Afterwards you have the GUI viewer, which gives you the same kinds of views as before and lets you analyze the application. And in this screenshot they have gone even further: it is a hybrid OpenMP-plus-MPI application, and they are monitoring all the threads from every MPI process. You can imagine launching 240 threads and then having 240 threads to analyze, to understand how they are connected and how they communicate with each other. But you can do it — you already get that information with this kind of tool.

(Is that available on the cluster?) It seems it is not part of what is on the cluster yet. MPI and OpenMP are what is supported right now; there seems to be additional cluster functionality they are going to add, but I don't know exactly what it will be.

One more thing to say about profiling: the data set. It is extremely important to have good data sets. You need good data sets for testing and debugging — those should be small. You need good data sets for profiling — larger, but not too large, because as we have seen, running under Valgrind, or even gprof, takes a long time. Basically you need something that is not too small, because otherwise you only see the initialization — essentially the time spent reading your data — and not too large, because you don't want to spend three weeks running your application just to get one profile. You also want good coverage: if your code has data-dependent control flow and that sort of thing, you want to cover all of the code, take all of the possible paths, so that you understand what happens in each part of your code and what is slow. And genericity: don't use a matrix full of zeros to do your test — it is not good; that is for the people who were not paying attention in the previous lesson. So yes, genericity: really be careful about it.
So that is about it for this part; we will take a break in a moment. First, a note on the Intel Cluster Studio that is going to be available next month — it probably coincides with Supercomputing, the big high-performance computing conference in November. What they are saying is that it will have additional support for large clusters: it will scale up to 120,000 MPI processes. You don't have to worry about that for Mills, because I don't think we are anywhere near 120,000 — (you mean threads?) — I mean 120,000 MPI processes. (Mills has about 5,000 cores, but nobody owns them all, so you couldn't run MPI at that scale anyway; the most you could get through standby was on the order of a hundred or two. We also just added another standby queue for jobs that run less than four hours — if you need at least eight hours you use the existing one, but if four hours is enough, the new resource lets you get up to 816 processors.) OK, so that would be roughly 800 MPI processes, but the catch is that the job has to finish within four hours.

So this might be an interesting tool to look into if people want to look at massive MPI parallelism — and if it also gives information about the OpenMP inside each node, it is probably useful — well, that part can already be done with the Intel Amplifier tool; Cluster Studio is for massive parallelism. I don't know whether Amplifier goes up to 800 MPI processes. If there were a group of people regularly using the entire Mills cluster for hundreds of MPI processes, then you might want to look into Cluster Studio. But if you are doing smaller scale — on the order of tens, maybe up to a hundred MPI processes, with OpenMP underneath in each process doing the shared-memory thread-level work on the cores, so MPI across the nodes and OpenMP within each node — then you can probably just use the Intel Amplifier. Part of what Cluster Studio adds is the communication side: they are bundling low-latency MPI libraries with it. OK, so let's take a five-minute break, and then we will move on to common parallelism bugs.

All right, let's talk about common parallelism bugs. I will use the same simple example program throughout. If you remember, at the beginning of lecture two we talked about the decomposition of an application — how we can decompose an application by data or by task. Here we do a task decomposition. The application has two data structures, called A and B, and we do four different things: we initialize A and initialize B — basically wiping them — then one task writes A and reads B, and the other task writes B and reads A. We assume the two read/write tasks do not interfere with each other: the part of B that one task reads is not the part the other task writes, and likewise for A.
So basically the only dependency is that both of the read/write tasks need to wait for both of the initialization tasks, as the arrows in the graph show. The two colors are there because we want to say: OK, take two processors — the coloring is basically the allocation of the tasks to the processors.

We will try to implement that with OpenMP, using sections. Do you remember sections from the second lecture? You have the big sections directive, which declares a region containing several section blocks, and the sections are independent pieces of work: each section is executed by one thread, and OpenMP treats them as independent of each other.

Here we have two sections. One section initializes A and then writes B and reads A — that is the green allocation — and the other, the purple one, initializes B and then writes A and reads B. A rough sketch of this code is below.

What happens when we do that? Does anyone have an idea of what the result will be? It is what we call a race condition. What can happen is that one thread enters the first section and executes the initialization of A, the write of B, and the read of A before the second thread, in the other section, has finished initializing B. So you write into B before B has been fully initialized. Basically, the order of execution — the timing of the threads — changes the outcome: your result depends on the timing.

In the worst case — actually it is not even really the worst case — one of the initializations is doing an allocation and has not had time to finish it when the other thread tries to write. That memory is not allocated, so you get a segfault: your program simply crashes.

And what happens with something like a += 3? That means you want to take the previous value of a and add three to it. Say you read a and it was 2, so you compute 5 and are about to write it back — but in the meantime another thread comes along and initializes a to some other value. Whichever write lands last, the final value is not the one you intended; the exact numbers don't matter, the point is that you do not get the correct result. That is the race: one thread races the other to update the data.
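Here is a rough sketch of the code being described — the data-structure names and sizes are made up, and the lecture's slide has its own version, but the structure is the same. The two halves of the arrays are kept disjoint so that the read/write tasks themselves do not conflict; the only problem is the missing ordering with respect to the initializations:

    /* compile with: gcc -fopenmp race_demo.c -o race_demo */
    #include <stdlib.h>

    #define N 1000

    int main(void)
    {
        double *a = NULL, *b = NULL;

        #pragma omp parallel sections shared(a, b)
        {
            #pragma omp section
            {                                      /* "green" task            */
                a = calloc(N, sizeof(double));     /* init A                  */
                for (int i = 0; i < N / 2; i++)    /* write B, read A -- but  */
                    b[i] = a[i] + 1.0;             /* B may not even be       */
            }                                      /* allocated yet: race     */
            #pragma omp section
            {                                      /* "purple" task           */
                b = calloc(N, sizeof(double));     /* init B                  */
                for (int i = N / 2; i < N; i++)    /* write A, read B -- same */
                    a[i] = b[i] + 1.0;             /* problem the other way   */
            }
        }

        free(a);
        free(b);
        return 0;
    }

Depending on timing, this either works, writes into a structure that has not been initialized, or dereferences a NULL pointer and crashes — exactly the race being described.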
The slide shows the correct behaviour on the left: task 1 and task 2 both operate on some shared data. In the correct case, task 1 reads the data, modifies it, and writes it; after that, task 2 reads it, modifies it, and writes it — that works. But if you don't take care — and it won't happen every time, even if your program seems to do the right thing when you try it — both tasks may read the same data, both modify it in their own way, and both try to write it back. Then you lose one of the writes: one write is erased by the other, and you don't even know which task wrote last. Here it is labelled "update from task 2 gets overwritten by task 1", but it could just as well be the other way around, with task 2 writing first and task 1 overwriting it afterwards. So with this error you are not even sure what result you will get: it can be correct, or incorrect in more than one way. That is the race condition.

Now that we have seen that we have a race condition in our program, we will try to synchronize those accesses. We will use locks. Locks are a very classical synchronization object: a lock can be held by only one thread. You can have several threads asking for the lock, and only one of them obtains it; and only the thread that obtained the lock can release it. All the other threads have to wait for that thread to release the lock before one of them can obtain it in turn, and so on. That is exactly what we want.

It is also important to think about initializing and destroying the locks — OpenMP has an API for that (omp_init_lock and omp_destroy_lock); we can come back to it. The important functions are omp_set_lock and omp_unset_lock. omp_set_lock tries to obtain the lock; if the lock is already held by another thread, your thread waits — it sits idle. That is one of the waiting times we were talking about before, one of those moments where you are waiting for something to become available and cannot do anything, and one of the things you want to reduce when you optimize. omp_unset_lock is what you call when you are done accessing the variable you reserved for yourself — the protected region — so that other threads can come in. There is also omp_test_lock, if you want to do an active wait instead, but never mind that.

So here is the same program with locks added (a sketch is below). We still have the two sections, and we have lock_a and lock_b which, if you look at the parallel directive at the top of the slide, are shared. The locks are shared data among the threads. That matters: if you made them private, each thread would have its own copy of the lock and would always succeed in obtaining it, because it is its own lock.

What we do with lock_a and lock_b is: lock_a protects the accesses to A, and lock_b protects the accesses to B. I have been told that I must hold the lock whenever I access the shared variable, so we put set and unset of lock_a around the accesses to A, and the same with lock_b around the accesses to B; and in the other section we do it the other way around — lock_b around everything that touches B, and lock_a around the accesses to A.

So what will happen here? Both of these sections can still run at the same time, correct? Yes — they will be run by different threads, but since they share the locks, the locks are supposed to keep them safe from each other. That is what we want; but I can tell you that this version is still not working — it is an example of another bug. What is going to happen here? Does anybody see it? It looks like each thread can end up waiting on the lock the other one holds — exactly. This is a special kind of bug, and I am sure you know the name: a deadlock. A deadlock is basically when a synchronization resource like a lock, which can only be held by one thread at a time, is being waited on by several threads, each of which is holding something the others need; none of them can back out, so none of them ever proceeds. This will be easier to explain on the next slide.
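Again just a hedged sketch of what that locked version looks like — the names are mine, not the slide's — with the nested, opposite-order lock acquisition that causes the trouble:

    /* compile with: gcc -fopenmp deadlock_demo.c -o deadlock_demo */
    #include <omp.h>
    #include <stdlib.h>

    #define N 1000

    int main(void)
    {
        double *a = NULL, *b = NULL;
        omp_lock_t lock_a, lock_b;

        omp_init_lock(&lock_a);
        omp_init_lock(&lock_b);

        #pragma omp parallel sections shared(a, b, lock_a, lock_b)
        {
            #pragma omp section
            {
                omp_set_lock(&lock_a);              /* holds A ...           */
                a = calloc(N, sizeof(double));      /* init A                */
                omp_set_lock(&lock_b);              /* ... and now wants B   */
                for (int i = 0; i < N / 2; i++)
                    b[i] = a[i] + 1.0;              /* write B, read A       */
                omp_unset_lock(&lock_b);
                omp_unset_lock(&lock_a);
            }
            #pragma omp section
            {
                omp_set_lock(&lock_b);              /* holds B ...           */
                b = calloc(N, sizeof(double));      /* init B                */
                omp_set_lock(&lock_a);              /* ... and now wants A   */
                for (int i = N / 2; i < N; i++)
                    a[i] = b[i] + 1.0;              /* write A, read B       */
                omp_unset_lock(&lock_a);
                omp_unset_lock(&lock_b);
            }
        }

        omp_destroy_lock(&lock_a);
        omp_destroy_lock(&lock_b);
        free(a);
        free(b);
        return 0;
    }

Note that even on the runs where the deadlock does not trigger, nothing here forces the initialization of B to happen before the first section writes into B, so the original race is still present — which is the point made a little further on.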
(Did you say earlier that only one thread does the locking?) Sorry — when I said that, I meant that when the threads call omp_set_lock, only one of them obtains the lock. All of the threads make the locking call; at any one time, only one thread can hold the lock. The very classic example: there is one key for the restroom, and if you want to go, you need to obtain the key; if someone else has the key, you cannot get it and you cannot go. That is the example I was given when I was in class.

So what happens here: say thread 1 takes the left section and thread 2 takes the right one. Thread 1 takes lock A; thread 2 takes lock B — they hold one lock each. They continue, they do their initializations, and then thread 1 wants to obtain lock B while thread 2 wants to obtain lock A. For thread 1 to ever release lock A, it first needs to acquire lock B and finish its work; and thread 2 will not release lock B until it can acquire lock A. Neither can back out, so neither ever releases anything. That is the deadlock — like the application that just sat there until someone finally came along and said, "I don't think this should have been running for two weeks."

One of the problems with this kind of bug is that it is nondeterministic: the bug may not occur on every run of your program. This one will not deadlock if one section happens to run completely before the other starts — then you never deadlock. But you still always have the race condition. (If they run serially, why is there still a bug?) Because we want both initializations — of A and of B — to be done before the read/write tasks run. (Go back two slides. The first section initializes A, then reads A and writes B; the next section initializes B, then reads B and writes A. If you execute those sequentially, isn't that correct?) No, because the initialization of B needs to execute before the write to B. You need the write to B — and the read of B, if you prefer — to come after B's initialization.
So the issue is that you are writing into B before B has been initialized. (You mean you need B initialized before you write to it — because of a memory allocation, for example?) Yes — in this example they are abstract tasks, but think of "init" as allocating or setting up the data structure. (I was reading "init" as just putting initial values in.) Either way, the point stands: if you execute the two sections sequentially, you avoid the deadlock, but you still have the race condition — the missing ordering.

So we have the deadlock, and here is another solution we want to look at. This one is cleaner: while we initialize A it is protected, so nothing can write A at the same time, and the same for B; and all the lock/unlock pairs are properly nested — we never rely on taking a second lock before we release the first — so there is no risk of deadlock. If both sections run at the same time and both try to set their locks, it works. But the problem is that there is still no ordering synchronization, and the execution can still interleave badly: for example, one thread executes the initialization of A and then gets switched out, the other thread takes its lock, and we start writing into B before the initialization of B has happened. So we again have the race condition — this time without the deadlock.

So how do we get a correct version? There are basically three solutions, I think. (Don't you need to set the lock before the initialization?) No — what you want is for the two initializations to run in parallel without needing to protect them against each other. One thread can initialize A while the other initializes B; they are two different variables, so they can happen in parallel. What the locks in this version do is prevent an initialization from running at the same time as the read/write task that touches the same variable. What they do not do is enforce that the read/write tasks happen after the corresponding initializations — and that ordering is the synchronization we still need to add.

I have not written out all the solutions, but the simplest one is to use a barrier. Instead of two sections you use four, and you don't even need the locks: a first sections region with the initialization of A and the initialization of B, then an OpenMP barrier, and then a second sections region with the two read/write tasks — see the sketch below. The locks are low-level synchronization; most of the time it is easier to use the higher-level OpenMP synchronization constructs.
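A minimal sketch of that barrier-based version, again with made-up names. One detail glossed over in the talk: in OpenMP a barrier cannot sit inside a sections construct, so the natural way to write it is two sections regions back to back — the end of the first one already carries an implicit barrier (an explicit #pragma omp barrier between the two regions would also be legal):

    /* compile with: gcc -fopenmp barrier_fix.c -o barrier_fix */
    #include <stdlib.h>

    #define N 1000

    int main(void)
    {
        double *a = NULL, *b = NULL;

        #pragma omp parallel shared(a, b)
        {
            /* Phase 1: the two initializations, free to run in parallel. */
            #pragma omp sections
            {
                #pragma omp section
                a = calloc(N, sizeof(double));     /* init A */
                #pragma omp section
                b = calloc(N, sizeof(double));     /* init B */
            }   /* implicit barrier: nobody goes on until both inits are done */

            /* Phase 2: the two read/write tasks, now safe without locks. */
            #pragma omp sections
            {
                #pragma omp section
                for (int i = 0; i < N / 2; i++)
                    b[i] = a[i] + 1.0;             /* write B, read A */
                #pragma omp section
                for (int i = N / 2; i < N; i++)
                    a[i] = b[i] + 1.0;             /* write A, read B */
            }
        }

        free(a);
        free(b);
        return 0;
    }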
The barrier between the two sections regions makes everyone wait until the initializations are done, and only then can the other two sections execute. The problem with a barrier is that everybody waits at it. Here we only have two threads, but if you had three threads and a barrier, all three have to reach it; with lower-level synchronization, maybe only the two that actually depend on each other wait for one another, while the third, which has a lot more work, keeps going and arrives later. So low-level synchronization can shave off those small gaps. Locks also let you have, say, five different threads doing five different things on one common data structure, because each needs a different part of it — locks make that possible — but you have to use them carefully, as the deadlock example showed.

All right — we still have MPI, and about ten minutes left, so let's talk about communication interlock. It is another kind of problem in the same family as race conditions. What we will do is try to build a communication ring in MPI: we have N MPI processes, each one sends a message to its predecessor, and rank 0 sends to rank N-1 — so they form a ring.

To build up to the ring, we start with a linear chain. The linear chain is really simple: if your rank is greater than zero, you send your data to the process whose rank is your own rank minus one — not literally "rank -1", I mean one less than your own rank. And every process except the last one posts a receive for the message coming from its neighbour above it.

So what happens here? Each block on the slide is one process. Each process tries to send first and then to receive — except the first one, which only receives, and the last one, which only sends. Now remember that MPI_Send and MPI_Recv are blocking: they wait for the communication to complete before they return. So ranks 1 through N-1 are all waiting for their send to complete before they can call receive. What happens is that rank 0 posts its receive, which completes the communication with rank 1; rank 1 can then post its receive, which completes the communication from rank 2; and so on up to rank N-1. You have linearized the communication just by structuring it this way. That is the first problem with this design of the ring.

The second problem: OK, now let's close the ring, because we don't yet have the link between rank 0 and rank N-1. Simple — just compute the previous rank differently: if my rank is 0, the previous rank is the number of tasks minus one; otherwise it is just my rank minus one. Same thing for "next". So now every process starts by trying to send to its predecessor — a sketch of this ring, together with the non-blocking fix we are about to discuss, is below.
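A hedged sketch of the ring (variable names and the tag are mine): the commented-out blocking version is the one that interlocks, and the non-blocking version underneath is the fix discussed next.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        int prev = (rank == 0) ? ntasks - 1 : rank - 1;   /* who I send to      */
        int next = (rank == ntasks - 1) ? 0 : rank + 1;   /* who I receive from */
        int sendbuf = rank, recvbuf = -1;

        /* Blocking version: every process sits in MPI_Send waiting for its
         * predecessor to post a receive, but the predecessor is stuck in
         * its own MPI_Send -- a communication interlock.
         *
         *   MPI_Send(&sendbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD);
         *   MPI_Recv(&recvbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD,
         *            MPI_STATUS_IGNORE);
         */

        /* Non-blocking version: post both operations, then wait for both. */
        MPI_Request reqs[2];
        MPI_Isend(&sendbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recvbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, recvbuf, next);

        MPI_Finalize();
        return 0;
    }

Whether the blocking version actually hangs can depend on the message size and on the MPI implementation's internal buffering, which is exactly what makes this kind of bug nondeterministic; the non-blocking version is safe regardless.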
But nobody can receive, because everybody is stuck in its send: each process is waiting for its neighbour to post a receive, and that neighbour will never get to its receive, because it will never finish its own send. We have a communication interlock: everybody tries to send, nobody receives, and everybody needs its send to complete before it can start receiving. It is a classic problem in MPI communication: people usually say, "I'll just start by sending and then receive," but you have to be careful that the process you are sending to will actually post a receive before it tries to send data back your way.

So if I really want to write it this way, how do I do it? MPI has non-blocking — non-waiting — send and receive: MPI_Isend and MPI_Irecv. The code is exactly the same; we just use Isend and Irecv, and we add a last parameter to each of them, an MPI_Request object. Then we say we want to wait on those requests: we have a send request and a receive request, and we wait for both of them to finish. What happens then is that each process says "I am sending this data, and I am receiving that data," posts both, and then they all work together — nobody has to wait for its send to complete before it can receive. The communications all happen at the same time, without the linearization and without the deadlock.

(Does that wait-all work like a barrier?) Sort of — it is a kind of barrier, but local to those requests. You can picture the implementation a bit like a lock: the send and the receive each hold something, and when the operation finishes it is released; the wait is trying to obtain it. I don't know exactly how it is implemented, but basically this call to MPI_Waitall will not return before both the send and the receive have finished.

We were supposed to have some OpenMP and MPI exercises with bugs to find, but we don't have time for that today. There are good resources on the Lawrence Livermore National Laboratory computing website: they have a set of tutorials for high-performance computing, including an MPI tutorial and an OpenMP tutorial, and most of the information you saw here in fact comes from there and from this site. The link to the slides is in the GitHub repository (the slides are a Google Doc, and the link is also in the GitHub README), and there is also documentation about Valgrind — the memcheck manual and the small section about the MPI wrapper — linked from the GitHub.

Let's try to spend maybe 15-20 minutes at the start of the next lecture going over some of those examples. That makes sense, because we are going to cover more profiling and debugging tools, so we can talk more about the bugs and then go into those tools. Yes, we can do that — I can even give them as homework. Thanks, Tristan.
[Source video: "Mills Profiling and Debugging I", uploaded by Anita Schwartz, May 07, 2019]