Okay, welcome back. Let's go over some parallelization basics and then design patterns; you'll see what all that means as we go along.

So, parallelization. Typically, if you're starting fresh and you don't have your application written yet, I would in fact recommend writing the sequential version of that application first. The reason is that it's good to have a version of the application that you trust, that's well debugged, and that you can use as a comparison: hopefully the parallel version gets results that you can then verify against the sequential version.

And I'll just mention here, as a footnote, that the reason you do parallelization is performance. There's really no other reason to parallelize your application. So if the sequential version of your application is already fast enough, there's really no reason to parallelize it, and many times your sequential application is good enough in terms of performance. The reason that matters is that parallelization is often more difficult. A lot of times your initial parallel implementations will actually be slower than your sequential implementation. Often you have bugs that are much harder to find, diagnose, and fix than sequential bugs. I see a few smiles in the audience, so maybe some of you have already encountered some of these very hard-to-find bugs in your parallel programs. Debugging is already hard enough; once you go to the parallel world, it becomes that much harder.

Okay, so let's say you start out and write a fresh new application, or you're given an application whose performance you need to improve, so you have a sequential version of the application. The first thing is to study the problem, or the code, assuming you've decided to parallelize it. You should focus on the data structures that are iterated over in the loops of your application. Typically you have one or more loops; if you don't have any loops in your application, it's hard to really justify parallelization. Most of the time is spent in the loops of your application, and those loops are where you want to focus your energy in restructuring and parallelizing.

I guess this is a good time to mention that there are other forms of parallelization that don't really require restructuring your application. If you have a sequential application that you run many times with different parameters, sometimes called a parameter sweep, then you don't need to parallelize that sequential application at all. You obtain parallelism with what's called an embarrassingly parallel run of the program, where you're basically running the sequential version multiple times on different nodes, or on different processors of a node, and you can get parallelism that way.
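To make that concrete, here's a minimal sketch of such a parameter sweep in C with OpenMP. The run_simulation() function is a hypothetical stand-in for the unmodified sequential program; every iteration is independent, so there's no communication and no shared state to worry about.

```c
#include <stdio.h>

/* Hypothetical stand-in for the unmodified sequential application. */
static double run_simulation(double param) {
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += param / (i + 1.0);
    return acc;
}

int main(void) {
    enum { NPARAMS = 16 };
    double results[NPARAMS];

    /* Embarrassingly parallel: each iteration runs one full
     * sequential simulation with its own parameter. */
    #pragma omp parallel for
    for (int i = 0; i < NPARAMS; i++)
        results[i] = run_simulation(0.1 * i);

    for (int i = 0; i < NPARAMS; i++)
        printf("param %4.1f -> %f\n", 0.1 * i, results[i]);
    return 0;
}
```

On a cluster you'd more likely launch independent jobs through the batch scheduler, but the structure is the same: one sequential run per parameter, no coordination between them.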
And that's often a very nice way of getting parallelism, because it scales very easily: if you have a thousand processors, or a million processors, you just run a thousand sequential copies of your application, or a million. However, it's not often the case that you have an embarrassingly parallel scenario. So you want to parallelize your code, which means breaking down, or decomposing, your sequential computation into a variety of tasks; that's what I'm picturing here. At a high level, the goal is to partition the application. You start out by dividing the application into multiple tasks, and those tasks can come from the different functions in your application, or from the different iterations of the loops in your application. By dividing your application into multiple tasks, you're looking for the parallelization opportunities. And often you want to find more parallelism than the available resources on the machine. The reason is that your software, as is often the case, outlives the useful life of the machine. The Mills cluster will eventually be retired and replaced by a new cluster, and your application will have to run on that new cluster, and on subsequent clusters. So you want to identify as much parallelism as possible, so you can take advantage of future clusters and the parallelism that comes with them. Once you've broken your sequential application down into tasks, you want to map those tasks onto processes or threads. All of that corresponds to the decomposition.

Okay, so as I mentioned, you have to study the code. There's no free lunch here. It's not just a matter of saying, oh, here's a loop, let me just throw an OpenMP pragma at it, and it's going to work. That's often how you end up with a bunch of bugs, and then it's harder to debug your program because you don't understand it, so you have to understand it, obviously; I'll show a small sketch of this pitfall in a moment. Parallelizing may also require rewriting or restructuring your algorithm, even coming up with a whole new algorithm. And many times the new algorithm that is a better fit for the parallel resources of the target architecture is not a good algorithm for the sequential case. So it's often the case that you want to rewrite your application specifically with parallelism in mind.

You obviously want to have enough tasks to keep your processors busy; that corresponds to the partitioning part here. I mentioned breaking the program down into tasks; you then assign those tasks to processes. You want to try to balance the computation, dividing the tasks as evenly as possible among the different threads or processes. That's because very often there's a synchronization point at the end where all the threads have to wait for the slowest thread to finish. Even with a balanced computation you still wait for the slowest thread, but if the computation is well balanced across all the threads, they'll finish around the same time, so that wait is short.
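Here's the sketch of that pitfall I promised: a hypothetical summation loop where just throwing a pragma at it introduces a data race, followed by the fix. The lesson is that you have to understand the loop's dependences before parallelizing it.

```c
#include <stdio.h>

int main(void) {
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) a[i] = 1.0;

    /* BUGGY: every thread updates sum at the same time, a data race.
     * The result is nondeterministic and usually wrong. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        sum += a[i];
    printf("racy sum:    %f\n", sum);

    /* Correct: declare the dependence on sum as a reduction, so each
     * thread accumulates privately and the results are combined. */
    sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];
    printf("reduced sum: %f\n", sum);
    return 0;
}
```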
You also want to reduce communication; that goes without saying. Communication is actually one of the most important bottlenecks in a parallel program. So if you can group the tasks that communicate heavily amongst each other into one thread, that's a good thing. Why is that important? These threads may well be executing on different cores (not always, but many times they do), and if there's a lot of communication between them, that communication goes through the memory hierarchy. If the cores the two threads are executing on are far apart in terms of the memory hierarchy, and there's a lot of communication, that's going to be a bottleneck; that's going to slow you down. So tasks with a lot of communication amongst each other you want to group together into the same thread.

By task, I really mean computation. I don't mean data right now, because data can be shared amongst different threads; I'm really talking about the instructions. You do want to restructure your data and data structures as well, but I'm not talking about that right now; that's something I'll discuss later in this lecture, or next time.

Okay, so the approach of dividing the work into tasks and then putting the tasks together into processes can be done in structured ways, and that's why I decided to discuss design patterns. Design patterns are basically a way of capturing repeated scenarios that you see time and time again as you parallelize your application, or different applications, I should say.

Okay, so granularity is important here, and this is a good time to define it. Granularity corresponds to the ratio between computation and communication. With fine granularity, you're able to partition your program into very small, fine-grained tasks, and you can easily intersperse the communication; hopefully there's not much communication going on inside those finer-grained tasks. Also, when you're able to partition your code into fine-grained tasks, it makes it easier to do load balancing, that is, to balance your computation across all the threads evenly: because the tasks are smaller, you typically have many more of them, and you can spread them across the threads. However, if the granularity is too fine, you end up doing a lot of communication, and the communication may outweigh the computation. You may be doing so much communication that you don't see good gains in performance.

The other extreme is coarse-grained parallelism, where you have big tasks. You do a lot of computation between communications: a lot of computation, then some communication afterwards. As I mentioned, communication can be a bottleneck, so doing a lot of computation per communication can increase performance; however, it makes load balancing harder. For example, in the extreme case where I have only one big computation, I have nothing to balance: I run it on one thread. I may break my program into two tasks, which is still coarse-grained, but then I can only run the computation on two threads, because I basically have two tasks. The more you chop your code into finer and finer tasks, the more threads you can keep busy.
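As a minimal sketch of that trade-off, here's OpenMP's chunked scheduling applied to a hypothetical loop. The chunk size is exactly the granularity knob: small chunks give fine-grained tasks that balance well but add scheduling overhead, while large chunks amortize the overhead but can leave threads idle near the end.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    int chunk = 1000;  /* the granularity knob: tune per application and machine */

    /* Each chunk of iterations is one schedulable task; threads grab
     * chunks dynamically, so many small chunks balance the load. */
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < N; i++)
        x[i] = (double)i * i;

    printf("x[N-1] = %f\n", x[N - 1]);
    return 0;
}
```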
Of course, the right granularity depends on the application and the architecture. But as a recommendation, I would say it's better to go with a fine-grained partitioning of your code. Why might that be better than coarse-grained? It corresponds to what I was saying earlier about future clusters coming online: you want to be able to take advantage of the new resources. You have more tasks available, and although you might not have the hardware resources right now to take advantage of all those tasks, later on you will, and then it will be easier to exploit the additional hardware. What you end up with, with finer-grained tasks, is a granularity knob: you can say, okay, I have fine-grained tasks; maybe I'll group some of them together for the hardware resources I have available now, and later, when I have more hardware resources, I can split those tasks back up into the fine-grained tasks I had before. That's easier than starting with coarse-grained tasks and having to restructure your algorithm again to get finer-grained tasks. But again, the right granularity depends on the algorithm and the hardware.

Okay, so then, as I mentioned, we want to map or assign those tasks to processes, and then do what's called orchestration and mapping: basically, the scheduling of the parallel program. I have my processes, and I want to schedule the communication between all the processes in my parallel program; that scheduling of computation and communication is what's called orchestration. And the most efficient applications are those that can overlap the two, starting useful computation while communication is going on. If you have some communication you need to do, and you have some computation that's independent of that communication, then you want to overlap them. Those are the most efficient parallel programs.
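Here's a minimal sketch of that overlap in MPI, assuming two ranks exchanging a buffer. The nonblocking MPI_Isend/MPI_Irecv calls start the transfer, the independent computation runs while the message is in flight, and only then does MPI_Waitall block on the communication.

```c
#include <stdio.h>
#include <mpi.h>

#define N 100000

int main(int argc, char **argv) {
    int rank, size;
    static double sendbuf[N], recvbuf[N], local[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        int peer = 1 - rank;
        MPI_Request reqs[2];

        for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

        /* Start the exchange, but don't wait for it yet. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Useful work that doesn't depend on recvbuf overlaps
         * with the message transfer. */
        for (int i = 0; i < N; i++) local[i] = sendbuf[i] * 2.0;

        /* Only now block until the communication has finished. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d: recvbuf[0] = %f, local[0] = %f\n",
               rank, recvbuf[0], local[0]);
    }
    MPI_Finalize();
    return 0;
}
```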
And as I mentioned here, preserving locality of data just means I want to keep my data close to where it's needed. As in the example before: if I have data that's shared between two threads that are far apart in the processor, I have that bottleneck of communicating through the memory hierarchy, and that doesn't preserve locality. So if I can keep my data local to the same thread as much as possible, then I have locality: my data is near where the computation is.

A question from the audience: in situations like that, can you copy the data to keep it local? You can sometimes make copies, but often you can't, because you have a dependency: some tasks are creating data for other tasks. If the data is completely independent, then yes, you can make a copy of it. But if you have a task that's creating data for another task, you want to keep those tasks together on the same thread if possible, especially if there's a lot of communication going on between them.

Okay, so I've orchestrated my parallel program, and then I want to map the processes of the parallel program onto my processors. You can do that with MPI, and with OpenMP as well.

Okay, so let's talk about design patterns for the last ten minutes. As I mentioned, parallelization is difficult; it's hard to fully exploit all the available hardware. But there are recurring patterns that come up when people parallelize their applications, and from those, people developed what are called design patterns. The idea actually goes back to the architecture of buildings, where a man named Christopher Alexander saw that there were recurring patterns in how people designed buildings, and wrote a book, A Pattern Language, with which you could design a building using a cookbook of such patterns. Then some people in object-oriented languages took that concept of design patterns and applied it to developing object-oriented programs, since the same kinds of recurring patterns were coming up there, and they developed a cookbook, basically an encyclopedia, of every pattern you would need. You think about your application, you try to understand what kind of pattern it needs, and then you take one of these patterns and put it into your application. The same thing applies in parallel programming as in object-oriented programming: there's a set of parallel design patterns, which I'm going to talk about. It also gives us a nice vocabulary, so we can all talk about the same thing. If you go to a conference on parallel programming, you'll hear people name a pattern, say the inspector-executor pattern, and people know right away what you're talking about, as opposed to having to spend the first fifteen minutes explaining what your application is doing.

Okay, so here's the patterns book, Patterns for Parallel Programming, by Timothy Mattson and a couple of co-authors. So let's go into the design patterns. It starts with finding concurrency, which is the decomposition I mentioned before: coming up with the tasks that correspond to the concurrency, or parallelism, in your program. You can also decompose your application by the data, or you can use the pipeline pattern; we'll talk about each of those. And then there are certain issues you have to deal with in parallel programs, which crop up as control or data dependencies; we'll talk a little bit about what those are in a moment.

Okay, so there are different ways to break up your program: either by the data or by the tasks. Depending on your program, it's more natural to think of breaking it up one way or the other, and breaking up the program by its tasks implies breaking it up by its data as well. Still, it may be more natural to think of breaking the program up into tasks, as opposed to breaking it up by the data.
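As a tiny illustration of the task-style split, here's a minimal sketch using OpenMP sections, with two hypothetical independent functions f() and g() standing in for separable pieces of a real application. Contrast this with data decomposition, where one operation is split across pieces of a large data structure.

```c
#include <stdio.h>

/* Hypothetical independent stages of an application. */
static double f(void) {
    double s = 0.0;
    for (int i = 1; i < 1000000; i++) s += 1.0 / i;
    return s;
}

static double g(void) {
    double p = 1.0;
    for (int i = 0; i < 40; i++) p *= 1.01;
    return p;
}

int main(void) {
    double rf, rg;

    /* Each independent function becomes one task; the two sections
     * can run on different threads at the same time. */
    #pragma omp parallel sections
    {
        #pragma omp section
        rf = f();
        #pragma omp section
        rg = g();
    }
    printf("f = %f, g = %f\n", rf, rg);
    return 0;
}
```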
Okay, so let's go into some detail on each of those, and then there's the special case of pipelining.

First, data decomposition, also called domain decomposition. The first question to ask is: what is the natural decomposition of my application? How should I break it up most naturally? Cases where splitting the program up by the data is a good starting point are when you have one large data structure that you're operating on: a graph, for example, or a tree, or an array. Then you might think, okay, it makes sense to partition that data and operate on the different parts of it on different processors, so that similar operations run on different parts of the data structure. For instance, in graphics we have an image, and the image can be represented as a large array of pixels. That's a case where I may want to just break the large array up into sub-arrays. I'm doing data decomposition there, and it's a natural way of thinking about partitioning that particular application, just breaking up the data into sub-arrays, as opposed to trying to think, okay, let me look at this application and do this part of it here and that part of it there. Of course, these are just rules of thumb; I'm not saying that if you have a large data structure you have to do data decomposition. I'm just saying that, as a rule of thumb, if you have a large data structure, it might be more natural to break up the application by its data.

Other common data decompositions: as I mentioned, with an array I might want to break it up by rows and operate on each row individually, or by columns, or by blocks (tiles) of the array. Or I might have a recursive data structure, like a tree or a graph, and then I might want to decompose it into smaller trees or graphs.

Here's a data decomposition for finding the largest element of an array. What I would do is split the array into multiple parts, or sub-arrays, and then on each CPU find the largest element of its sub-array, doing the similar operation in parallel: each processor looks at each of its elements and records the largest element of its sub-array. Once I've gone through the elements of each sub-array in parallel, I do what's called a reduction: I look at the largest element found by each processor and take the largest of those. This is a toy example, but it gives you a sense of how I might do data decomposition on an array.
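Here's that toy example as a minimal OpenMP sketch. Each thread scans its own chunk of the array for a local maximum, and reduction(max:...) performs the final combining step across the per-thread results; the array contents here are arbitrary filler.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = (int)(((long long)i * 7919) % 104729);  /* arbitrary data */

    int best = a[0];
    /* Each thread finds the largest element of its sub-array, then
     * the per-thread maxima are reduced into one result. */
    #pragma omp parallel for reduction(max:best)
    for (int i = 1; i < N; i++)
        if (a[i] > best) best = a[i];

    printf("largest element: %d\n", best);
    return 0;
}
```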
As I mentioned before, you want to break up the problem as fine-grained as possible, and then you end up with a granularity knob that's parameterized by the size of the data chunk. The particular chunk size that works well on the Mills cluster today will be different on the next iteration of the Mills cluster; you might want a different chunk size there, maybe a larger sub-array. But if you develop your application with that in mind, it's easy to port it and get the best performance on the next Mills cluster, or whatever cluster you end up running your application on.

You also want comparable-sized data chunks, because you want to do load balancing; that makes sense. And you want to keep the decomposition as simple as possible, though you can only make it so simple. There's the famous saying: keep it as simple as possible, but no simpler. The point is, if it's a complex data structure, make the decomposition simple enough that it doesn't make debugging a nightmare, because, as I said, debugging a parallel program is going to be very difficult.

Okay, so, task decomposition. If you have a situation where your program naturally decomposes into, say, several functions that can be executed in parallel, that's a case where you can do task, or functional, decomposition. As I've been saying all along, it's easier to start with too many tasks and fuse some later, if you find the granularity is too fine, than to go the other way. By decomposing your problem into functions, you implicitly require some data decomposition: you then decide what data needs to be accessed by each function. You've done the decomposition already, the functional decomposition, and you're splitting up your data based on it. And, as Anita mentioned before, you can perhaps also do some copying of the data structure to make duplicates: if there's no dependency between the different functions, that is, if the data can be processed independently by the different functions, then I can make duplicate copies of the data.

Okay, it's four o'clock, so this is a good place to stop. I'll continue with this lecture at the beginning of Thursday, and then we'll go into MPI and OpenMP programming.