Improving OpenMP Scaling

Cori’s Knights Landing (KNL) processors will each have more than 60 cores, with each core having the ability to support 4 hardware threads, leading to the possibility of up to 240 threads per single-socket node. Your application is likely to run on KNL without significant modification, but achieving good performance while using all the cores and threads may be more difficult. Applications may not fit into the memory of a Cori node if they try to use an MPI task for each core because of the memory overhead for each MPI task.

To get around this limitation, a mixed MPI/OpenMP programming model is recommended to achieve scalability on the node. Using a hybrid MPI/OpenMP model can reduce memory usage by having fewer MPI tasks per node, thus reducing the need to store local copies of variables and data structures for each MPI task.


For example, a grid-based application that requires ghost cells for each MPI task can use hybrid MPI/OpenMP to reduce the memory spent on storing ghost cells by using fewer MPI tasks per node. In addition, with hybrid MPI/OpenMP there will be fewer MPI messages with larger message sizes, both of which help to improve application performance across the high-speed interconnect.

Good thread scaling is important for an application to get good performance on Cori. The best combination of MPI tasks and threads per task should be determined by experimentation. You should run your code with various numbers of MPI tasks and OpenMP threads to find "sweet spots" that minimize the overall run time, such as 4 MPI tasks * 15 OpenMP threads per task, or 8 MPI tasks * 10 OpenMP threads per task, etc. We expect that the most common configurations will use 4, 8, 16, or 32 MPI tasks and somewhere between 4 and 16 OpenMP threads per MPI task.

You can use Babbage and Edison today to start experimenting with mixed MPI and OpenMP programming. This web page illustrates some tips and considerations for efficiently scaling hybrid MPI/OpenMP programs, and presents a few case studies on optimization techniques for real applications.

Fine Grain and Coarse Grain Models

Below is an example code with the fine-grain model. The program is single threaded except where it actively uses multiple threads, such as in loop processing. A fine-grain model is easier to implement within an MPI program; however, there is more thread overhead, and the serial sections of the code may become bottlenecks.
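A minimal sketch of the fine-grain style (illustrative C; the arrays and loop bodies are assumptions, not the original example):

#include <omp.h>

/* Fine-grain model: the code is serial except inside individual
   parallel loops, where threads are spawned and then joined. */
void fine_grain_step(double *a, const double *b, int n)
{
    /* ... serial setup work ... */

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];          /* threads active only in this loop */

    /* ... more serial code between the parallel loops ... */

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i] * b[i];
}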

Below is an example code with the coarse-grain model, where most of the code runs within an OpenMP parallel region. The coarse-grain model has low thread creation overhead and consistent thread affinity. However, it is harder to code and is prone to race conditions.
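A minimal sketch of the coarse-grain style (illustrative C; the manual index partitioning is one possible way to divide the work among threads):

#include <omp.h>

/* Coarse-grain model: a single parallel region encloses most of the
   work; each thread computes its own index range and persists across
   all phases of the computation. */
void coarse_grain_step(double *a, const double *b, int n)
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();
        int chunk    = (n + nthreads - 1) / nthreads;
        int start    = tid * chunk;
        int end      = (start + chunk < n) ? start + chunk : n;

        for (int i = start; i < end; i++)
            a[i] = 2.0 * b[i];

        #pragma omp barrier              /* synchronize between phases */

        for (int i = start; i < end; i++)
            a[i] += b[i] * b[i];
    }
}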

In order to get optimal performance it is best for each thread of execution to allocate memory "closest" to the core on which it is executing. This is accomplished by allocating and immediately initializing memory from the thread that will be using it. This is often referred to as a “first touch” policy because the OS allocates memory as close as possible to the thread that "first touches" (initializes) it.

In the following code snippet, the initialization and compute sections use the same OpenMP parallel loop, so the portion of each array assigned to a given thread is placed in memory local to that thread. If the first "#pragma" directive is removed, the initialization of each array is done by the master thread only, and the memory associated with the arrays will be local to the master thread and potentially farther from, and thus slower to access by, the other threads.
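A minimal sketch of such a snippet (illustrative C; the array names and sizes are assumptions):

#include <stdlib.h>

#define N 10000000                /* large enough to span many memory pages */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* Initialization: each thread "first touches" the array sections
       it will later use, so those pages land in its local memory. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    /* Compute: the same (default static) loop partitioning, so each
       thread works on data resident in its own NUMA domain. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    free(a); free(b); free(c);
    return 0;
}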

(Advanced notes: In this example, we assume the arrays are large enough that the section local to one thread does not share a page with the section of another thread. For smaller shared variables, it is better for the thread that "potentially owns" the most data within a page to do the first touch. Another workaround for a small array is to declare the variable "threadprivate", so that each thread keeps a complete copy of the array in its own local memory; the threads can then update the shared result with atomic operations afterwards.)
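A minimal sketch of the threadprivate workaround (illustrative C; the array size, the amount of work, and the names are assumptions):

#define M 64                               /* small shared array */

static double local_copy[M];
#pragma omp threadprivate(local_copy)      /* full copy per thread, in local memory */

double shared_result[M];

void accumulate(int nsteps)
{
    #pragma omp parallel
    {
        for (int i = 0; i < M; i++)        /* first touch of the private copy */
            local_copy[i] = 0.0;

        #pragma omp for
        for (int s = 0; s < nsteps; s++)
            for (int i = 0; i < M; i++)
                local_copy[i] += 1.0;      /* purely thread-local updates */

        /* combine the per-thread copies into the shared array */
        for (int i = 0; i < M; i++) {
            #pragma omp atomic
            shared_result[i] += local_copy[i];
        }
    }
}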

It is very hard to do "perfect touch" for real applications, so a common strategy is to limit the number of threads per MPI task to no more than the number of cores in a NUMA domain (assuming one thread per core). On Hopper the number of cores in a NUMA domain is 6, and on Edison it is 12. For this reason you might expect good performance on Hopper using 4 MPI tasks per node with 6 OpenMP threads each, and on Edison using 2 or 4 MPI tasks per node with 12 or 6 OpenMP threads, respectively. When we have more experience with early Knights Landing (KNL) hardware, NERSC will be able to give more precise recommendations on memory affinity. For now, what we know is that Intel has revealed that there will be multiple NUMA domains on a KNL node and that the many compute cores on the node will be connected in a 2D mesh architecture, so we expect memory affinity to be important.

False Sharing

Data are accessed from memory in units of cache lines, and threads hold local copies of the same (global) data in their caches. Cache coherence keeps these local copies consistent with the global data: when one thread writes to its local copy, the other copies of that cache line must be updated or invalidated. When multiple threads write to data that lie on the same cache line, the updates are serialized to ensure correctness; this is called false sharing, or cache thrashing, and it hurts parallel performance.

The following example illustrates false sharing. Array a is shared, and the chunk size of 1 causes each thread to update a different element of a. Since a[0] and a[1] are contiguous, they fall on the same cache line: when thread 0 updates a[0], the main copy of the cache line must be updated before thread 1 can update a[1], so the loop is effectively carried out sequentially.
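A minimal sketch of such a loop (illustrative C; the schedule(static,1) clause produces the one-element-per-thread distribution described above):

#define NTHREADS 4
#define NITERS   10000000

double a[NTHREADS];                /* adjacent elements share a cache line */

void false_sharing_example(void)
{
    #pragma omp parallel for schedule(static,1) num_threads(NTHREADS)
    for (int i = 0; i < NTHREADS; i++)
        for (long j = 0; j < NITERS; j++)
            a[i] += 1.0;           /* each thread repeatedly writes its own a[i],
                                      but the writes ping-pong the shared cache line */
}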

With hybrid MPI/OpenMP, MPI process affinity can be controlled by the run time environment variable "I_MPI_PIN_DOMAIN", which defines the size of the domains and the layout of each domain. A domain is a group of logical cores; on MIC, where each core provides 4 hardware threads, its size must be a multiple of 4. One MPI rank is placed in each domain, and the OpenMP threads of that rank can then be pinned inside the domain. More details on MIC process affinity settings can be found here.

Programming Tips for Adding OpenMP

In the above figure, each line represents multiple runs using a fixed total number of cores (= #MPI tasks x #OpenMP threads per task). Scaling may depend on the kernel algorithms and problem sizes. In this test case, 15 MPI tasks with 8 OpenMP threads per task is optimal.

Nested OpenMP

Sometimes it is beneficial to use nested OpenMP to allow more fine-grained thread parallelism. Using the best achievable process and thread affinity is crucial to getting good performance with nested OpenMP, yet it is not straightforward to do so. Please refer to the Nested OpenMP page for how to achieve the best affinity on Edison, Hopper, and Babbage. A combination of OpenMP environment variables and run time flags is needed for different compilers on different systems.
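A minimal sketch of nested parallel regions (illustrative C; the thread counts are assumptions, and the affinity settings discussed above are omitted):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                    /* allow nested parallel regions */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(4)   /* outer, coarse-grained level */
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(3)   /* inner, fine-grained level */
        {
            int inner = omp_get_thread_num();
            printf("outer thread %d, inner thread %d\n", outer, inner);
        }
    }
    return 0;
}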

Tools for OpenMP

Adding OpenMP to Your Program

On Hopper or Edison, Cray Reveal helps to perform scope analysis and suggests OpenMP compiler directives. It is based on CrayPat performance analysis and uses Cray compiler optimization information. Keep in mind that it can only be used under the Cray programming environment; the resulting OpenMP-enabled code can then be compiled and run with any compiler.

Performance tools available on Hopper and Edison are: Cray performance tools, IPM, Allinea MAP and Perf Reports, and TAU. VTune is also available on Edison. Debugging tools available on Hopper and Edison are: Allinea DDT, Totalview, LGDB, and Valgrind.

The figure below shows the total run time with incremental steps of adding OpenMP. There are four major functions in this application: Collision, PostCollision, Stream, and PostStream. In the original serial execution, the run time of Collision is > 2500 sec (plotted below as 200 sec only for better display), which accounts for > 95% of the total run time.

Step 1: Add OpenMP to the hotspot Collision; a 60X speedup is achieved in Collision. Step 2: The new bottleneck is now Stream; add OpenMP to Stream and the remaining functions, and an 89X speedup is achieved in Stream. Step 3: Add vectorization, which gives an additional 5X speedup in Collision.

This requires at least the MPI_THREAD_FUNNELED level of MPI thread support: while the master (or a single) thread makes MPI calls, the other threads keep computing. This is usually very hard to do, because the application code must be separated into parts that can run before and after the halo information is received, and automatic compiler optimizations are usually lost.
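A minimal sketch of this pattern for a 1-D halo exchange (illustrative C; the stencil, array layout, and neighbor ranks are assumptions; MPI must have been initialized with MPI_Init_thread requesting at least MPI_THREAD_FUNNELED):

#include <mpi.h>

#define N 1000000

/* u: N local points; unew: output; halo[0]/halo[1]: values received
   from the left/right neighbors. Only the master thread makes MPI calls. */
void step(const double *u, double *unew, double *halo,
          int left, int right, MPI_Comm comm)
{
    #pragma omp parallel
    {
        #pragma omp master
        {
            MPI_Sendrecv(&u[0],     1, MPI_DOUBLE, left,  0,
                         &halo[1],  1, MPI_DOUBLE, right, 0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N - 1], 1, MPI_DOUBLE, right, 1,
                         &halo[0],  1, MPI_DOUBLE, left,  1,
                         comm, MPI_STATUS_IGNORE);
        }   /* no implied barrier: the other threads are already computing below */

        /* interior points do not depend on the halo */
        #pragma omp for schedule(static)
        for (int i = 1; i < N - 1; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        /* implied barrier of the loop: the master has finished the exchange by now */

        #pragma omp single
        {
            unew[0]     = 0.5 * (halo[0] + u[1]);      /* boundary points need the halo */
            unew[N - 1] = 0.5 * (u[N - 2] + halo[1]);
        }
    }
}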

The figure below shows the speedup of various hybrid MPI/OpenMP implementations compared to pure MPI on Hopper, using a total of 12,096 processors. Hybrid A uses 2016 MPI tasks with 6 OpenMP threads per MPI task. Hybrid B is Hybrid A plus merging MPI_Reduce and MPI_Scatter into MPI_Reduce_scatter, and merging MPI_Gather and MPI_Bcast into MPI_Allgatherv. Hybrid C is Hybrid B plus overlapping row-group MPI communications with computation. And Hybrid D is Hybrid C plus overlapping (most) column-group communications with computation.

There are two computational sections: Loop Nests and GetBlock. OpenMP is added at the outermost loops of the Loop Nests hotspots, which scale well up to 120 threads; GetBlock is not parallelized with OpenMP. The figure below shows the baseline OpenMP results. The total time scales perfectly from 1 to 16 threads, and the best total time is reached at 120 threads. Hyper-threading hurts performance, and the "balanced" affinity setting gives the best performance.

Optimization is then applied to the GetBlock function. The techniques applied include: parallelizing the sort and unrolling loops; reordering array indices to match loop indices; merging adjacent loop indices to increase the number of iterations; aligning arrays to 64-byte boundaries; and exploiting OpenMP loop control directives, removing loop dependencies, and providing compiler hints. The figure below shows the optimized OpenMP run time. The total speedup over the baseline is 2.3x.

The figure below shows the total time spent in each of the top 10 subroutines (which take 75% of the total CPU time) of the TEXAS integral package with top-level threading, for different numbers of OpenMP threads. The total number of MPI ranks is fixed at 60; OMP=N means N threads per MPI rank. The original code uses a shared global task counter for dynamic load balancing across MPI ranks. OMP=1 has overhead compared to pure MPI, while OMP=2 gives the best overall performance in many routines.

Three OpenMP implementations were attempted. OpenMP #1 uses flat MPI up to 60 MPI processes, then 2, 3, and 4 threads per MPI rank. OpenMP #2 and #3 are pure OpenMP: OpenMP #2 uses module-level parallelism, and OpenMP #3 uses an OpenMP task implementation. The pseudo code below illustrates the OpenMP tasks. To avoid two threads updating the Fock matrix simultaneously, each thread uses a local copy, with a reduction at the end.
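A minimal sketch of this task-based pattern (illustrative C; the matrix size, the number of tasks, and the compute_task body are assumptions, not the application's actual code):

#include <stdlib.h>
#include <omp.h>

#define NBASIS 64            /* assumed Fock-matrix dimension */
#define NTASKS 1000          /* assumed number of integral tasks */

/* placeholder for the real integral work: adds one task's contribution
   into a (per-thread) copy of the Fock matrix */
static void compute_task(int task, double *fock_copy)
{
    for (int j = 0; j < NBASIS * NBASIS; j++)
        fock_copy[j] += (double)task * 1e-6;
}

void build_fock(double *fock /* NBASIS*NBASIS, zero-initialized */)
{
    int nthreads = omp_get_max_threads();

    /* one private Fock-matrix copy per thread, so two threads never
       update the same copy concurrently */
    double *copies = calloc((size_t)nthreads * NBASIS * NBASIS, sizeof(double));

    #pragma omp parallel
    {
        #pragma omp single                 /* one thread creates all tasks */
        for (int t = 0; t < NTASKS; t++) {
            #pragma omp task firstprivate(t)
            {
                double *mine = copies +
                    (size_t)omp_get_thread_num() * NBASIS * NBASIS;
                compute_task(t, mine);     /* update the executing thread's copy */
            }
        }
    }   /* implicit barrier: all tasks have completed */

    /* reduction: accumulate the per-thread copies into the shared matrix */
    for (int i = 0; i < nthreads; i++)
        for (int j = 0; j < NBASIS * NBASIS; j++)
            fock[j] += copies[(size_t)i * NBASIS * NBASIS + j];

    free(copies);
}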