High Performance Computing Group

Compiler optimizations in the scalar world have, many researchers believe, run their course. We're squeezing about as much performance as we can out of a single processor, at least through compiler techniques. As a result, the most active area of compiler research addresses parallel computer architectures, at both the fine-grained (instruction) level and the more coarse-grained (distributed) level. One observation is that many existing compiler and optimization techniques can be adapted to parallel architectures, sometimes with unexpected benefits. The focus of my research follows this general trend.


Project: Automatic Parallelization

Purpose: Develop automatic parallelization compilation techniques for discovering and exploiting thread-level and other coarse-grained parallelism

Researchers: Thomas Way

Researcher Alumni: Rushikesh Katikar, Puroshothan Ch

Applications: Compiler optimizations for high-performance computing

Abstract of current research paper:

Automatic parallelization in a compiler is becoming more important as computer technologies expand to include more distributed computing. This paper presents a comparative study of past and present techniques for automatic parallelization, covering scalar analysis, array analysis, and commutativity analysis, along with other related approaches.

Past compiler research on automatic parallelization dealt with parallelizing sections of a source program with a specific target system in mind. Such compilers would parallelize loops and other segments of code, and introduce synchronization and message-passing code to maintain correctness. They worked only for limited versions of imperative languages, such as subsets of C or FORTRAN, and relied on scalar and array analysis techniques that made it difficult to use language features like pointers in a section to be parallelized automatically, a restriction that has become increasingly impractical as such features have become pervasive.
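
As a minimal illustration (not drawn from any of the compilers surveyed here) of what these scalar and array analyses look for, consider the loop below: a dependence test can prove that no iteration touches data written by another, so the loop may be marked for parallel execution, shown here with an OpenMP annotation as one thing an auto-parallelizer might emit. The function and names are hypothetical.

    // Hypothetical example: array dependence analysis can prove that iteration i
    // reads only a[i] and b[i] and writes only c[i], so there are no
    // cross-iteration dependences and the iterations may run in parallel.
    void vector_add(const double* a, const double* b, double* c, long n) {
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // Note: if a, b, and c may alias, the compiler must also prove they do not
    // overlap, which is exactly why pointer-heavy code was hard to parallelize
    // with these analyses. A loop such as a[i] = a[i-1] + 1 carries a dependence
    // between iterations and would be left sequential.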

Recently, a paradigm shift began and researchers started to look into other techniques for automatically generating parallelized programs, such as commutativity analysis applied over a broader window of the source code. This type of analysis works quite well for subsets of newer languages like C++, which offer more flexibility in today's development environments. Meanwhile, the need for automatic parallelization in compilers is growing as clusters and other forms of distributed computing become more popular and CPU technology trends towards higher degrees and coarser granularities of parallelism. Thus, there is a need for static compiler techniques that identify coarse-grain tasks in program source, breaking the source into independent segments that can be executed in parallel across multiple processors.
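
The following is a small, hypothetical C++ sketch of the intuition behind commutativity analysis: if the compiler can prove that two invocations of an operation leave an object in the same final state regardless of their order, it may execute those invocations in parallel even though they update shared state. The Histogram class and its methods are illustrative, not taken from any cited work.

    #include <mutex>

    // Hypothetical example: Histogram::add commutes with itself, because
    // additions to a counter can be applied in any order and yield the same
    // final state. A commutativity-analysis pass that proves this property may
    // run such calls in parallel, provided the shared state is updated safely
    // (here, under a lock).
    class Histogram {
    public:
        void add(int bucket, int amount) {
            std::lock_guard<std::mutex> guard(lock_);
            counts_[bucket] += amount;   // integer addition commutes and associates
        }
        int count(int bucket) const { return counts_[bucket]; }
    private:
        std::mutex lock_;
        int counts_[64] = {};
    };

    // add(3, 1) followed by add(7, 2) produces the same Histogram as
    // add(7, 2) followed by add(3, 1), so the two calls are safe to reorder
    // or to execute concurrently.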

In this paper, we review known parallelization techniques, propose new techniques for identifying thread-level parallelism in programs, and argue that these same techniques may also apply to generalized coarse-grain task identification.


Project: Source Code Characterization

Purpose: Develop analysis techniques for characterization of the machine requirements of source code.

Researchers: Tom Way

Researcher Alumni: Rushikesh Katikar, Puroshothan Ch

Applications: Compiler optimizations for high-performance computing, machine description generation for reconfigurable computing.

Description:

Execution of any program depends on the architecture of the processor, which includes memory organization, registers, functional units and other components. As reconfigurable computing becomes more popular, research is needed to determine how to make the best use of this greater processing flexibility.

This project will develop source code analyses, implemented within a research compiler framework, to characterize the machine requirements of the input source code, including specific hardware needs, opportunities for parallelization, and similar properties, producing a corresponding machine configuration or description.

The resulting machine description can be used to set up a reconfigurable computer, improving the performance of the original source program by providing best-fit hardware on which to execute it.
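
As a rough, hypothetical sketch of what such a characterization analysis might compute, the code below tallies operation categories over a simplified, made-up intermediate representation; the resulting counts could seed a machine description (for example, the mix of integer, floating-point, and memory units a reconfigurable target should provide). Neither the IR nor the pass structure reflects the Cetus or SUIF APIs.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical, simplified IR: each operation is tagged with a category.
    enum class OpKind { IntALU, FloatALU, Memory, Branch };
    struct Op { OpKind kind; };

    // Illustrative characterization pass: count how many operations of each
    // kind the program contains. The profile could then be mapped to a machine
    // description, e.g. how many floating-point units or load/store ports a
    // reconfigurable target should be configured with.
    std::map<std::string, int> characterize(const std::vector<Op>& program) {
        std::map<std::string, int> profile;
        for (const Op& op : program) {
            switch (op.kind) {
                case OpKind::IntALU:   ++profile["int_alu"];   break;
                case OpKind::FloatALU: ++profile["float_alu"]; break;
                case OpKind::Memory:   ++profile["memory"];    break;
                case OpKind::Branch:   ++profile["branch"];    break;
            }
        }
        return profile;
    }

    int main() {
        std::vector<Op> program = {{OpKind::FloatALU}, {OpKind::FloatALU},
                                   {OpKind::Memory},   {OpKind::Branch}};
        for (const auto& entry : characterize(program))
            std::printf("%s: %d\n", entry.first.c_str(), entry.second);
    }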

Resources

  • Cetus - project home page, source-to-source compiler for C/C++
  • ANTLR - compiler language tool, needed to run Cetus
  • SUIF - research compiler
  • Trimaran - research compiler
  • compiler-tools.org - Catalog of research compiler infrastructures and tools
  • Compiler research links - Clean list of useful compiler research links

Topics to explore

  • Source code characterization
  • Machine requirements compiler analysis

Project: Partial Inlining

Purpose: Explore partial inlining

Researchers: Tom Way

Applications: Compiler optimizations for high-performance computing

Description:

This research involves Explicitly Parallel Instruction Computing (EPIC), a form of Instruction-Level Parallel (ILP) computing in which parallelism is indicated more directly by the compiler. It builds on earlier research on Region-based Compilation, an interprocedural analysis and optimization technique that was integrated into the Trimaran research compiler platform. That earlier research led to discoveries in demand-driven inlining, improvements to cloning, useful metrics for measuring how well regions are formed, and a novel approach to partial inlining.

The current research looks at the efficiency of partial inlining techniques. A standard compiler technique for many years has been procedure inlining, in which a call to a procedure (or function or method) is replaced with the body of that procedure when doing so is judged beneficial to the resulting program's performance. The well-known problem with this technique is "code growth": the program can grow very large because a copy of the procedure body is introduced at each of its "call sites" throughout the program.
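
A small, hypothetical example of conventional whole-procedure inlining and the code growth it causes: the callee's body replaces each call, so a procedure called from many sites is copied many times. The function names below are illustrative only.

    // Hypothetical example of whole-procedure inlining.
    int square(int x) { return x * x; }

    int before(int a, int b) {
        return square(a) + square(b);   // two call sites
    }

    // After inlining, each call is replaced by a copy of square's body:
    int after(int a, int b) {
        int t1 = a * a;                 // inlined copy #1
        int t2 = b * b;                 // inlined copy #2
        return t1 + t2;
    }
    // The calls disappear, exposing further optimization opportunities, but
    // every additional call site adds another copy of the body, which is the
    // source of code growth.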

Partial inlining is a technique in which just the most "important" portions of a procedure are inlined at a call site. Since only a small portion of a typical procedure is usually executed frequently, it makes sense to inline only that portion. Inlining this "hot" code increases the opportunities for performance-enhancing improvements, often called "optimizations," because there is more useful code to work with in a given portion of the program. Partial inlining thus keeps code growth under control while retaining the performance benefits of traditional inlining.
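
The sketch below illustrates the partial-inlining idea with hypothetical names: only the frequently executed ("hot") early-exit path of the callee is copied into the caller, while the rarely executed ("cold") remainder is kept out of line behind a call. It is an illustration of the general technique, not the algorithm developed in this project.

    // Hypothetical callee whose early-return path is "hot" and whose fallback is "cold".
    int lookup_cache(int key) { return key % 2 == 0 ? key : -1; }  // stand-in: "hits" on even keys
    int recompute(int key)    { return key * key; }                // stand-in for expensive work

    // Original procedure before any inlining.
    int get_value(int key) {
        int cached = lookup_cache(key);
        if (cached >= 0)
            return cached;             // hot path, taken almost every time
        return recompute(key);         // cold path, rarely taken
    }

    // Cold remainder, split out ("outlined") so it is not copied into callers.
    int get_value_cold(int key) {
        return recompute(key);
    }

    // Caller after partial inlining: only the hot path is duplicated here; the
    // cold path remains a single out-of-line call, so code growth stays small
    // while the common case avoids call overhead entirely.
    int get_from_caller(int key) {
        int cached = lookup_cache(key);    // inlined hot path
        if (cached >= 0)
            return cached;
        return get_value_cold(key);        // cold path stays out of line
    }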

Research subprojects:

  • Journal article on "Demand-driven inlining in a region-based optimizer for ILP architectures", gathering and then extending my research in this area.  Submitted February 13, 2005 to the Journal of Instruction-Level Parallelism (JILP).
  • Journal article on "Design and Implementation of a Region-based Compiler with Partial Inlining and Cloning", which gathers and extends my research in those areas.  Deadline (artificial) of Mar. 15, 2005.  Potential journals for submission:
  • Implementation of a partial inlining algorithm within a production-quality compiler (gcc?)

Papers:


updated: 10/01/09

actlab.csc.villanova.edu