Algorithms that find good partitionings of irregular graphs are critical for the efficient execution of scientific simulations on high-performance parallel computers. In these simulations, computation is performed iteratively on each element (and/or node) of a physical two- or three-dimensional mesh and then information is exchanged between adjacent mesh elements. For example, computation is performed on each triangle of the two-dimensional mesh shown in Figure 1. Then information is exchanged for every face between adjacent triangles. The efficient execution of such simulations on parallel machines requires a mapping of the computational mesh onto the processors such that each processor gets roughly an equal number of mesh elements and that the amount of inter-processor communication required to perform the information exchange between adjacent elements is minimized. Such a mapping is commonly found by solving a graph partitioning problem [9]. For example, a graph partitioning algorithm was used to decompose the mesh in Figure 1. Here, the mesh elements have been colored to indicate the processor to which they have been mapped.
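The quantity a partitioner minimizes is commonly the edge-cut: the number of edges whose endpoints fall in different subdomains, a proxy for the inter-processor communication volume. A minimal sketch (the graph and the two candidate partitions below are invented for illustration, not taken from Figure 1):

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different subdomains."""
    return sum(1 for u, v in edges if part[u] != part[v])

# A 6-vertex mesh-like graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Both bisections give each "processor" three vertices, but they
# differ greatly in the communication they induce.
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
bad  = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

print(edge_cut(edges, good))  # 1
print(edge_cut(edges, bad))   # 5
```

Both bisections are perfectly balanced; only the second objective, minimizing the cut, distinguishes them.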
The graph partitioning problem is known to be NP-complete. Therefore, it is not possible to compute optimal partitionings for graphs of interesting size in a reasonable amount of time. This fact, combined with the importance of the problem, has led to the development of several heuristic approaches. See [9] for a recent survey of these schemes. Of these, multilevel algorithms are widely recognized as the state-of-the-art, as they are able to robustly compute high-quality partitionings quickly. Furthermore, many of these are available as serial (Chaco [2], JOSTLE [13], and MeTiS [4]) or parallel (ParMeTiS [6] and PJOSTLE [12]) software libraries.
While graph partitioning algorithms have enabled the efficient execution of a wide range of scientific simulations on parallel machines, other applications have a number of additional requirements for their mesh decompositions that traditional partitioners are unable to satisfy. For example, many scientific simulations consist of a number of computational phases separated by synchronization steps (i.e., multi-phase simulations). These require that each of the phases be individually load balanced. Still other scientific simulations model multiple physical phenomena (i.e., multi-physics simulations) or employ multiple meshes simultaneously (i.e., multi-mesh simulations). These also impose additional requirements that the partitioning algorithm must take into account. In this article, we describe some of these classes of simulations, as well as highlight new, generalized partitioning formulations and algorithms designed for them.
Multi-phase simulations consist of a number of distinct computational phases, each separated by an explicit synchronization step. In general, the amount of computation performed for each element of the mesh is different for different phases. The existence of the synchronization steps between the phases requires that each phase be individually load balanced. That is, it is not sufficient to simply sum up the relative times required for each phase and to compute a decomposition based on this sum. Doing so may lead to some processors having too much work during one phase of the computation (and so, these may still be working after other processors are idle), and not enough work during other phases (and so these may be idle while other processors are still working). Instead, it is critical that every processor have an equal amount of work from all of the phases of the computation. A traditional graph partitioning scheme can be used to balance the load across the processors for a single phase of the computation. However, the load may be seriously imbalanced for the other phases. Another method is to use a different partitioning for every phase, each of which balances the load of a single phase only. This method requires that costly data redistribution be performed after each phase in order to realize the partitioning corresponding to the next phase. A better method is to compute a single partitioning that simultaneously balances the work performed in each of the phases. In this case, no redistribution of the data is necessary, and all of the phases are well balanced.
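The pitfall of balancing only the summed work can be seen in a small worked example (the per-phase work values are invented for illustration):

```python
# Per element, (phase-1 work, phase-2 work). Every element costs 4
# units in total, so any 2/2 split balances the *sum*.
work = [(4, 0), (4, 0), (0, 4), (0, 4)]

def phase_loads(part, nparts=2):
    """Sum each processor's work separately per phase."""
    loads = [[0, 0] for _ in range(nparts)]
    for elem, (w1, w2) in enumerate(work):
        loads[part[elem]][0] += w1
        loads[part[elem]][1] += w2
    return loads

# Balanced on total work (8 units per processor), yet each phase is
# fully serialized: processor 1 idles in phase 1, processor 0 in phase 2.
print(phase_loads([0, 0, 1, 1]))  # [[8, 0], [0, 8]]

# Balanced within every phase: both processors work in both phases.
print(phase_loads([0, 1, 0, 1]))  # [[4, 4], [4, 4]]
```

With the synchronization step between phases, the first decomposition runs at half the speed of the second despite identical total loads.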
Figures 2 and 3 give examples of multi-phase simulations. Figure 2 shows a mesh and the particles from a particle-in-mesh simulation. This computation is composed of two phases. The first phase is the mesh-based computation, and the second phase is the particle-based computation. In order to load balance such a simulation, each processor must have a roughly equal amount of both the mesh computation and the particle computation. This is not trivial because the number of particles within each mesh element can be different. Therefore, while ensuring that each processor has an equal number of mesh elements will load balance the mesh-based computation, it does not guarantee that each processor has an equal number of particles. Likewise, ensuring that each processor has an equal number of particles does not guarantee that they also have equal numbers of mesh elements. The dark line in Figure 2 gives a single bisection that splits both the mesh elements and the particles evenly. Figure 3 illustrates the mesh associated with the simulation of the ports and the combustion chamber of an internal combustion engine. Here, the simulation is performed in six computational phases. (Each of these corresponds to a different color in the figure.) In order to solve such a multi-phase computation efficiently on a parallel machine, every processor should contain an equal number of mesh elements of all six different colors. Figure 4 shows two subdomains from an 8-way partitioning of the mesh in Figure 3. This partitioning (computed by the multi-constraint partitioning algorithm implemented in MeTiS [4]) balances all six of the phases while also minimizing the inter-processor communications. (Note that not all of the colors are visible in Figures 3 and 4.)
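The kind of bisection drawn in Figure 2 can be sketched as a one-dimensional sweep that searches for a cut balancing both counts at once. This is a toy sketch, not the article's algorithm, and the element coordinates and particle counts below are invented:

```python
# (x-coordinate of element centroid, number of particles in element)
elements = [
    (0.1, 3), (0.2, 1), (0.3, 0), (0.4, 0),
    (0.6, 2), (0.7, 0), (0.8, 1), (0.9, 1),
]
total_elems = len(elements)
total_particles = sum(p for _, p in elements)

# Sweep left to right; after each element, evaluate the worse of the
# two imbalances (elements, particles) if we were to cut there.
best = None  # (imbalance, number of elements left of the cut)
elems_left = particles_left = 0
for i, (x, p) in enumerate(sorted(elements)):
    elems_left += 1
    particles_left += p
    imb = max(abs(2 * elems_left - total_elems) / total_elems,
              abs(2 * particles_left - total_particles) / total_particles)
    if best is None or imb < best[0]:
        best = (imb, i + 1)

print(best)  # (0.0, 4): cutting after 4 elements balances both counts
```

Here a single cut happens to balance both constraints exactly; in general no such cut exists along one axis, which is why the general multi-constraint formulation described later is needed.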
Many computations simulate a variety of materials and/or physical phenomena together. An example is the elastic-plastic soil-structure interaction computations that are used to simulate static and dynamic (earthquake) loading events. In such simulations, elastic computations are performed on each of the elements of the mesh. After these are completed, a yield condition is checked for every mesh element. This check determines whether or not the plastic computation must be performed on the element. Typically, zones in a three-dimensional solid may become plastic (i.e., load) and then elastic (i.e., unload). Thus, the extent of the plastic zone changes dynamically. This change can be both slow and rapid. Slow change usually occurs during the initial loading phases, while the later deformation tends to localize rapidly in narrow zones [3]. (See Figure 5.) This is an example of a dynamically evolving computation that has multiple (i.e., two) phases. Due to its dynamic nature, not only is a static decomposition required, but periodic load balancing must also be performed to maximize efficiency.
Another important class of emerging methods is multi-mesh computations. Multiple meshes arise in several settings that use grids to discretize partial differential equations. For example, some operations, such as radiation transport sweeps, are innately more efficient on structured grids. However, complex geometries are better fitted with unstructured meshes. In some simulations, both kinds of grids may be used throughout the computation. Similarly, various codes that solve for multiple physical quantities (e.g., multi-physics computations) may use separate grids to solve the appropriate equations for each variable. For example, consider a simulation of the welding of a joint between two parts, a process in which the parts are pressed together and thermally annealed [7]. One grid could be used for the solution of the stress-strain relations that mediate the mechanical deformation of the parts. A second grid could be used to solve the heat equation for thermal conduction in the system. Since the regions of high strain may be distinct from those with high thermal gradients, each grid can be individually tailored to accurately represent the relevant physics.
Now consider the implementation of such a multi-mesh example on a distributed-memory parallel machine. A typical time-step consists of computing a solution on the first mesh, interpolating the result to the second mesh, computing a solution on the second mesh, interpolating it back to the first mesh, and so on. One way of performing this type of computation in parallel is to partition the meshes separately so that every processor has a portion of each mesh. This approach will balance the computations and minimize the communications during each of the solution phases. However, because the different meshes are partitioned independently, there is no assurance that an individual processor will own portions of the meshes that spatially overlap. Therefore, the amount of communication performed during the interpolation and transfer of the solution data can be quite high, even if an efficient approach is used to manage this communication [7]. Ideally, we would like to partition the different meshes such that each processor performs an equal amount of work for every mesh, and at the same time, the inter-processor communications required during the computations of the solutions, as well as those required during the interpolation and transfer of the solutions, are minimized.
The common characteristic of these problems is that they all require the computation of partitionings that satisfy more than one balance constraint. Traditional graph partitioning techniques have been designed to balance only a single constraint (i.e., the vertex weight). An extension of the graph partitioning problem formulation is to assign a weight vector of size m to each vertex. The problem then becomes that of finding a partitioning that minimizes the inter-processor communication, subject to the constraints that each of the m weights is balanced across the subdomains. This multi-constraint graph partitioning problem [5] is able to effectively model all of the problems described above.
Figure 6 illustrates an example with three constraints. The graph here is derived from a multi-phase simulation in which each vertex is active during one or more computational phases. The specific phases in which a vertex is active depend upon the region of the graph in which that vertex is located. For example, the vertex in the upper-left corner of Figure 6(a) is in the region of the graph that is active only during the first phase of the computation, while the vertex in the bottom-middle of the graph is active during both the first and second phases. Figure 6(b) shows the weight vectors that are assigned to each vertex. Here, an entry of one indicates that the vertex is active during the corresponding phase and an entry of zero indicates that the vertex is not active during this phase.
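This weight-vector construction, and a check that a partitioning balances all m constraints at once, fits in a few lines. The phase assignments below are invented for illustration and are not those of Figure 6:

```python
m = 3  # number of phases, i.e., balance constraints

# vertex -> set of phases in which that vertex is active
active_phases = {
    0: {0}, 1: {0}, 2: {0, 1}, 3: {1},
    4: {1, 2}, 5: {2}, 6: {2}, 7: {0, 1, 2},
}

# Weight vector per vertex: entry j is 1 if active in phase j, else 0.
wvec = {v: [1 if j in phases else 0 for j in range(m)]
        for v, phases in active_phases.items()}

def constraint_loads(part, nparts=2):
    """Per-subdomain sum of each of the m vertex weights."""
    loads = [[0] * m for _ in range(nparts)]
    for v, w in wvec.items():
        for j in range(m):
            loads[part[v]][j] += w[j]
    return loads

# A bisection that balances all three constraints simultaneously:
# every subdomain carries 2 units of work in every phase.
part = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1}
print(constraint_loads(part))  # [[2, 2, 2], [2, 2, 2]]
```

A multi-constraint partitioner must find such an assignment while also minimizing the edge-cut, which is what makes the problem harder than its single-constraint counterpart.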
We have developed serial and parallel, static and adaptive multi-constraint graph partitioners [5, 8, 10] that are based on this generalized formulation. These have been shown to be effective in computing high-quality partitionings for real applications while simultaneously balancing a number of constraints [1, 8, 11]. Serial versions of these algorithms [5] are included in the widely-used MeTiS 4.0 graph partitioning library [4]. Parallel formulations of our multi-constraint algorithms [8, 10] have been developed and will be included in the next version of the ParMeTiS library [6].
Kirk Schloegel, George Karypis, and Vipin Kumar
This work was supported by DOE contract number LLNL B347881, by NSF grant CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, and by Army High Performance Computing Research Center cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008; the content does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Additional support was provided by the IBM Partnership Award and by the IBM SUR equipment grant. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputer Institute.
The MeTiS and ParMeTiS libraries are available at
http://www-users.cs.umn.edu/~karypis/metis/