Title | : | Automatic Code Generation for Graph Algorithms on GPUs |
Speaker | : | Shashidhar G (IITM) |
Details | : | Wed, 22 Mar, 2017 3:00 PM @ BSB 361 |
Abstract: | : | Graphics Processing Units (GPUs) have been used in accelerating data parallel applications as they provide thousands of threads. Several optimizations have been proposed based on the behaviour exibited by applications like memory patterns, work patterns etc. CUBlas(for Matrix operations), Pregel(for graph processing) etc. are examples of libraries built for specific purposes.
Graph algorithms exhibit enough parallelism to keep several cores busy. Former research has proposed effective techniques to parallelize irregular graph algorithms on various platforms, such as multicore, GPUs, clusters, as well as their combinations. However, extracting parallel performance out of a graph algorithm implementation is still very challenging. For instance, on GPUs, minimizing thread-divergence, improving memory coalescing, exploiting the memory hierarchy, reducing communication across CPU and GPU, overlapping kernel execution with memory transfer, etc. are the tasks of an HPC expert. On the other hand, graph algorithms are increasingly being used by domain experts, such as in bioinformatics, imaging, visualization, mining, chemical reactions, weather forecasting etc. It is ideal to have a domain-specific language for the domain experts especially for graph algorithms (like MATLAB for matrices). Currently, to the best of our knowledge, there exists no compiler which can generate efficient code for different platforms from the same specification of a graph algorithm. We have domain-specific languages targeting multicore CPUs, graph libraries for GPUs, and frameworks for clusters. For generating efficient code, a domain expert needs to modify the algorithmic specification to suit a framework. We propose LightHouse, a GPU code-generator for a graph language named Green-Marl for which a multicore CPU backend already exists. This allows a user to seamlessly generate both the multicore as well as the GPU backends from the same specification of a graph algorithm. This restriction of not modifying the language poses several challenges as we work with an existing abstract syntax tree of the language, which is not tailored to GPUs. LightHouse overcomes these challenges with various optimizations such as reducing the number of atomics and collapsing loops. We illustrate its effectiveness by generating efficient CUDA codes for four graph analytic algorithms, and comparing performance against their multicore OpenMP versions generated by Green- Marl. In particular, our generated CUDA code performs comparable to 4 to 64-threaded OpenMP versions for different algorithms. This work is published in the 29th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2016), Rochester, New York. The code is publicly available at the PACE Lab website. |