Parallel Computing

 

Daniel S. Priece

 

Villanova University Computer Science Department

Villanova, PA 19085

daniel.priece@villanova.edu

 

 

Abstract

The rapid rate at which technology is evolving has led to a need for more sophisticated techniques that can take full advantage of the available hardware.  Software has traditionally lagged behind hardware, and the next step for new multiprocessor systems is to develop software that maximizes their performance.  This paper reviews the fundamentals of parallel computing with a focus on implicit, or automatic, parallelization.  We examine different software techniques as well as the restrictions that arise when creating parallel code.

 

1. Introduction

 

“Parallel computing is the use of several computers, processors, or processes working together on a common task in a concurrent manner” [1].  The purpose of exploring parallel computing is to cut down the execution time of processor-intensive code.  Instructions are run simultaneously and then synchronized back together before output is produced. 

 

1.1  Background

 

The earliest known example of parallel computing is a calculating tablet dated to roughly 100 B.C. [2].  The tablet contained three calculating areas that allowed the user to compute at higher speed.  The idea has been present since the earliest digital computers were built in the 1940s, and multi-processor machines were being developed by the late 1960s.  Today the most powerful computer can deliver roughly 20 teraflops of processing power [1].

 

2. Parallel Hardware Setups

 

Today there are four standard setups for parallel systems [1].  The first is the shared-memory multiprocessor, found in a standard multi-processor machine: multiple processors compute while sharing one common memory area.  The next most commonly thought of parallel system is network computing, which usually refers to a large network of machines cooperating over an internet connection.  The problem with network computing in this form is that the participating computers have different CPUs and connection speeds, which makes the computations both inefficient and slow.  A smaller network, known as a cluster, works over a local area network connection; although clusters are faster than a larger internet-based network, they are still limited by their communication speeds.  The final type of parallel system is the parallel vector processor.  A vector processor is a type of CPU that can apply a single operation to many data elements simultaneously [3].  Vector processors were used in supercomputers in the past but, due to cost, are used less today.  The choice of parallel system is important, as are the architecture of the processors and memory.

 

2.1  Flynn’s Taxonomy

 

Flynn’s Taxonomy classifies parallel computers into four possible architectures according to their instruction and data streams [4].  A standard non-parallel computer is referred to as a single instruction, single data machine because during any one clock cycle there is one instruction stream being executed on one data stream as input.  Single instruction, multiple data refers to a machine in which, during each clock cycle, every processor executes the same code but on its own separate input. 

 

 

Figure 1: Single Instruction, Multiple Data Execution [4]

 

For example, if during the first clock cycle processor 1 is executing a store instruction, then every other processor is executing a store instruction at the same time, each with different data to store.  As the figure above shows, every processor executes the same set of instructions; however, P1 uses “1” as input, P2 uses “2”, and P3 uses “n”.  In an ideal situation the processors compute their inputs simultaneously, finish execution at the same time, and can synchronize their output.  The remaining two categories are the multiple instruction, single data machines and the multiple instruction, multiple data machines, which apply a separate instruction stream per processor to a single data stream or to multiple data streams, respectively.  Most modern parallel computers fall into the multiple instruction, multiple data category.
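As a rough illustration (a minimal Python sketch, not real SIMD hardware; the worker function and the input values are invented for this example), the snippet below applies the same instruction sequence to several data streams side by side.  Strictly speaking this is process-level data parallelism rather than lock-step SIMD, but the structure mirrors Figure 1: one set of instructions, inputs 1 through n, executed concurrently and gathered at the end.

# Illustrative sketch only: one instruction stream applied to many data
# streams, in the spirit of Figure 1 (process-level parallelism, not true SIMD).
from multiprocessing import Pool

def same_instructions(x):
    # Every worker runs this identical code, but on its own input value.
    return x * x + 1

if __name__ == "__main__":
    inputs = [1, 2, 3, 4]               # P1 gets 1, P2 gets 2, ... Pn gets n
    with Pool(processes=len(inputs)) as pool:
        results = pool.map(same_instructions, inputs)
    print(results)                      # results are gathered (synchronized) at the end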

 

3. Programming Models

 

Currently there are many programming models in use, but I will cover only four of the more common and less difficult ones [4].  These models are abstract ideas that can be implemented on different hardware architectures.  In the shared memory model, semaphores control when a processor has access to the shared memory; this also simplifies execution because data does not have to be sent between processors, since it is all stored in a common space.  Next is the threads model, which allows code to have multiple execution paths by using threads that act like subroutines in a serial program.  Third is the message passing model, where each process has its own memory and can send data to and receive data from other processes.  Finally, the last model is the data parallel model, where the processes all work on the same set of data but on different partitions within it.
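As a small illustration of the shared memory and threads models (the counter, thread count, and iteration count below are invented for the sketch), the following Python code keeps one value in common memory and uses a lock, playing the role of the semaphore described above, so that only one thread at a time may update it.

# Sketch of the shared-memory / threads models: all threads see the same
# 'counter' variable, and a lock (acting as a binary semaphore) serializes updates.
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:                      # acquire exclusive access to shared memory
            counter += 1                # critical section: read-modify-write

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                            # synchronization point: wait for every thread
print(counter)                          # 4000 on every run, because updates are serialized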

 

4. Automatic Parallelism

 

Automatic parallelism, or implicit parallelism, is when a program is made parallel at compile time by the compiler, rather than being written as parallel code from the start, which is referred to as manual parallelism.  An example of serial code converted to parallel code is shown in the two figures below.

 
npoints = 10000
circle_count = 0
do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle
  then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints

                             Figure 2: Serial Code [13]

 

 
npoints = 10000
circle_count = 0
p = number of processors
num = npoints/p
find out if I am MASTER or WORKER
do j = 1,num
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle
  then circle_count = circle_count + 1
end do
if I am MASTER
  receive from WORKER their circle_counts
  compute PI (use MASTER and WORKER calculations)
else if I am WORKER
  send to MASTER circle_count
endif

                            Figure 3: Parallel Code [13]

 

The pseudo-code above shows an algorithm for approximating PI in serial and in parallel form.  In Figure 2, the serial algorithm inscribes a circle in a square, counts how many randomly generated points in the square also fall inside the circle, divides that count by the total number of points, and multiplies by 4 to obtain an approximation of PI.  In Figure 3, the added lines are the ones that make the code parallel: they divide the loop among the processors and handle the master/worker communication at the end.  In the execution of the parallel code, a master processor is set up that collects the information from the other processors.  The worker processors each take a different section of the loop, work on their shares simultaneously, and then send their counts back to the master processor.  This takes less time because more computations are processed per clock cycle.
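For concreteness, here is a minimal runnable Python version of the same master/worker scheme (the worker count and sample size are arbitrary choices for the sketch; the pseudocode above does not prescribe a language).  Each worker counts the hits in its own share of the points and returns its count to the master process, which combines them into the approximation of PI.

# Runnable sketch of the master/worker PI approximation from Figures 2 and 3,
# using Python's multiprocessing in place of the pseudocode's message passing.
import random
from multiprocessing import Pool

def count_hits(num_points):
    # WORKER: test this share of the points and return its circle_count.
    hits = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:        # the point falls inside the circle
            hits += 1
    return hits

if __name__ == "__main__":
    npoints, p = 100000, 4              # total samples and number of workers (arbitrary)
    with Pool(processes=p) as pool:
        # MASTER: hand each worker npoints/p samples, then gather the counts.
        counts = pool.map(count_hits, [npoints // p] * p)
    pi = 4.0 * sum(counts) / npoints
    print(pi)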

 

Because manual parallelism carries many time and budget constraints, I chose to explore automatic parallelism more deeply.  Automating the work in the compiler provides a less error-prone, less time-consuming, and less complex way to parallelize code.

 

4.1 Compilers

 

To understand how a parallel compiler works, we first have to look at a standard serial compiler, which commonly consists of seven parts [5].

 

Figure 4: Compiler Diagram [5]

 

The first part of the compiler is the Scanner, also known as the Lexical Analyzer.  Here the source code is tokenized: white space and comments are removed, and lexical errors are handled at this step.  Next is the Parser, which makes sure the syntax of the language is followed according to the language’s context-free grammar; in a parallel compiler, this is where the code is tested for parallelism and separated.  The Semantic Analyzer then handles type checking.  From there an abstract syntax tree is passed on to generate and optimize a lower-level code, which is in turn translated and optimized into machine code.
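To make the Scanner stage concrete, here is a toy lexer sketch (the token set and regular expressions are invented for illustration and are far simpler than any real compiler’s): it discards white space and comments, turns the remaining characters into tokens, and reports a lexical error for anything it cannot classify.

# Toy illustration of the Scanner (lexical analysis) stage: white space and
# comments are dropped, everything else becomes a (kind, text) token.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"[ \t\n]+|#[^\n]*"),    # white space and comments are discarded
    ("ERROR",  r"."),                   # anything else is a lexical error
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue                    # removed, exactly like white space and comments
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens

print(tokenize("count = count + 1  # increment"))
# [('IDENT', 'count'), ('OP', '='), ('IDENT', 'count'), ('OP', '+'), ('NUMBER', '1')]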

 

4.2 Problems to Consider

 

When designing any kind of parallel compiler, a few problems must be considered [6].  Latency is the time it takes to send a zero-length message from one processor to another.  High latency requires more planning of how messages are sent; smaller messages should be clustered together before sending in order to be more efficient.  Bandwidth refers to the speed at which data is transferred between processors during communication; low bandwidth calls for as little communication as possible.  Determining which tasks have to communicate with each other is the problem of scope.  All communication also has to be synchronized, which means synchronization points need to be created so that all of the processing power is used and no process sits waiting for others to finish.  One more problem to consider when creating parallel code is dependencies. 
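As a rough sketch of the advice to cluster small messages (the message count and values are made up for illustration), the snippet below has a worker send one hundred small results through a pipe as a single batched message; the commented-out alternative would pay the per-message latency cost one hundred times instead of once.

# Illustration of clustering small messages: one batched send replaces many
# small sends, so the per-message latency cost is paid once instead of N times.
from multiprocessing import Process, Pipe

def worker(conn, results):
    conn.send(results)                  # batched: a single message carrying all values
    # Latency-sensitive alternative (many tiny messages):
    # for r in results:
    #     conn.send(r)
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end, list(range(100))))
    p.start()
    batch = parent_end.recv()           # one receive matches the one batched send
    p.join()
    print(len(batch))                   # 100 values delivered in a single message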

 

4.2.1 Dependencies

 

Dependencies within code are the largest obstacle to parallelism, and to examine the dependencies within a piece of code we use a data dependency graph.

 

Figure 5: Dependency Graph [7]

 

Flow dependencies are those where there is an execution path between two statements: a value is written and then read.  An output dependency is where the same memory location is written more than once.  An anti dependency is when the value of a variable is read and then written [8].  From a dependency graph we can reduce the dependencies so that the code can be made as parallel as possible.  For the dependencies that remain, synchronization points must be used to preserve correctness while keeping as much parallelism as possible.
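The three kinds of dependency can be seen in a few lines of straight-line code (the variable names are arbitrary); in each pair of statements below, reordering or running the two statements in parallel would change the result.

# Flow (true) dependency: S2 reads the value that S1 writes (write, then read).
a = 2 * 3          # S1: write a
b = a + 1          # S2: read a

# Anti dependency: S4 writes a variable that S3 still needs to read (read, then write).
c = b + 1          # S3: read b
b = 7              # S4: write b

# Output dependency: S5 and S6 write the same location; their order fixes the final value.
d = 10             # S5: write d
d = 20             # S6: write d (must come last for d to end up as 20)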

 

4.2.2 Avoiding Dependencies

 

There are a few methods that have been implemented to get around the problem of dependencies [9].  Some values of a program are not known until run time, which makes splitting code difficult.  For small network setups, multiple path allocation is a good method: every possible path that the data could take is scheduled at compile time, and the appropriate one is run at run time.  This becomes very inefficient for larger numbers of processors, because the number of possible paths grows rapidly as processors are added.  Another technique, called dynamic destination selection, uses a single data stream instead of multiple streams between processors.  The single stream carries all of the data, each item marked with the destination it is going to; this way the run-time system can quickly send dependent data to the same place.
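As a sketch of the dynamic destination selection idea (the queue, the tags, and the dispatcher below are invented for illustration; [9] describes the technique abstractly), every item travels in a single stream tagged with its destination, and a simple dispatcher routes each item at run time.

# Sketch of dynamic destination selection: one data stream carries every item,
# each tagged with its destination, and routing happens at run time.
from queue import Queue

stream = Queue()                        # the single data stream
for dest, value in [("P1", 10), ("P2", 20), ("P1", 30)]:
    stream.put((dest, value))           # each datum is tagged with where it must go
stream.put(None)                        # end-of-stream marker

inboxes = {"P1": [], "P2": []}          # stand-ins for the receiving processes
while (item := stream.get()) is not None:
    dest, value = item
    inboxes[dest].append(value)         # routed at run time, not fixed at compile time
print(inboxes)                          # {'P1': [10, 30], 'P2': [20]}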

 

5. Related Topics

 

There are many compilers, systems, and tools already available that use parallel computing. 

 

5.1 BOINC

 

According to its website, “the [Berkeley Open Infrastructure for Network Computing] is a software platform for volunteer computing and desktop grid computing” [10].  This means that users volunteer their processors’ computing power over an anonymous connection or a small private network.  Projects such as SETI@home have been built on it, where anyone can sign up and let their home computer’s idle time contribute large amounts of parallel processing to the search for extraterrestrial intelligence. 

 

5.2 Tools

 

There are many tools available to programmers, such as Microsoft’s Task Parallel Library (TPL) [11], which provides methods for parallelism.  Microsoft compares the efficiency of the TPL Parallel.For method for loops with the .NET ThreadPool class: as the number of processors increases, the advantage of the TPL grows.
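The TPL itself is a .NET library; as a rough Python analogue only (this is not the TPL API, and the loop body is invented for the sketch), the pattern behind a parallel for loop is to hand the loop’s independent iterations to a pool of workers instead of running them one after another.

# Rough analogue of a parallel for loop: independent iterations are distributed
# to a pool of worker processes rather than executed sequentially.
from concurrent.futures import ProcessPoolExecutor

def body(i):
    # Stand-in for an expensive, independent loop body.
    return i * i

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(body, range(16)))    # iterations run concurrently
    print(results)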

 

Figure 6: Efficiency Graph [11]

 

One example that Microsoft gives for the TPL uses ray tracing, a way to produce photorealistic renderings that requires a great deal of computation [11], [12].  In ray tracing, a path is traced from an imaginary eye through each pixel of the image, and the lighting along that path determines the pixel’s color; each such path is called a ray.

 

Figure 7: Rendered Image [11]

 

Figure 7 is an image rendered using the ray tracing method and the Microsoft TPL.  The result was a roughly sevenfold speedup on an eight-processor machine, from 1.7 frames per second to 12 frames per second in its animations.

 

6. Conclusion

 

In starting this research, it quickly became apparent that parallel computing is a very broad field and that narrowing it down to a single problem would not be simple.  The majority of the information I could find was very abstract and mostly served as an introduction to the topic.  In order to understand the problem I was researching, I first had to develop an extensive background.  It was not until I started searching for information on implicit parallelism and compilers that I began to reach more technical sources.  Material from one of my classes helped shape my basic understanding of a compiler and how it works.  Eventually I was led to the topic of dependencies, which is a major problem in parallelism; ways around it must be found in order to use parallelism to its full potential.  Unfortunately, most of the information available on dependencies is again abstract, with little or no implementation detail or examples.  I strongly believe that as hardware development slows, parallel software and parallel systems will become more popular, and with better automatic parallelization new software will be simpler and less expensive to write.

 

 

 

References:

 

[1] M. Katz, G. Bruno, “Guide to Parallel Computing,” CTBP & SDSC Education Module. [Online]. Available: http://ctbp.ucsd.edu/pc/html/intro1.html. [Accessed: Nov. 4, 2007].

 

[2] [Online]. Available: http://www.buyya.com/microkernel/chap1.pdf. [Accessed: Nov. 2, 2007].

 

[3] “Vector Processing,” Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Vector_processor. [Accessed: Nov. 4, 2007].

 

[4] B. Barney, "Introduction to Parallel Computing," Introduction to Parallel Computing, Livermore Computing, June 2007. [Online]. Available: http://www.llnl.gov/computing/tutorials/parallel_comp. [Accessed: Sept. 11, 2007].

 

[5] V. Gehlot, “CSC8310 Linguistics of Programming Languages,” Villanova University. [Online]. Available: http://www.csc.villanova.edu/~gehlot/8310/lec/Week2.ppt. [Accessed: Oct. 28, 2007].

 

[6] S. Chang, “Parallelization of Codes on the SGI Origins,” NASA Ames Research Center. [Online]. Available: http://people.nas.nasa.gov/~schang/origin_parallel.html#fraction. [Accessed: Sept. 26, 2007].

 

[7] “Parallel Compiler Tutorial,” Tutorial-Reports.com. [Online]. Available: http://www.tutorial-reports.com/computer-science/parallel-compiler/tutorial.php. [Accessed Oct. 28, 2007].

 

[8] “Data Dependency,” The Everything Development Company. [Online]. Available: http://everything2.com/index.pl?node=data%20dependency.  [Accessed: Oct 28, 2007].

 

[9] C. Metcalf, “Approaches to Data Dependency.” [Online]. Available: http://home.comcast.net/~cdmetcalf/papers/cop/node67.html. [Accessed: Nov. 2, 2007].

 

[10] “BOINC,” Berkeley Open Infrastructure for Network Computing. [Online]. Available: http://boinc.berkeley.edu/. [Accessed: Sept. 28, 2007].

 

[11] D. Leijen and J. Hall, “Optimize Managed Code For Multi-Core Machines,” The Microsoft Journal for Developers, 2007. [Online]. Available: http://msdn.microsoft.com/msdnmag/issues/07/10/Futures/default.aspx. [Accessed: Sept. 28, 2007].

 

[12] “Ray Tracing,” Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Ray_tracing . [Accessed: Nov. 4, 2007].

 

[13] “Introduction to Parallel Programming,” Maui High Performance Computing Center. [Online]. Available: http://www.mhpcc.edu/training/workshop/parallel_intro/MAIN.html. [Accessed: Nov 6, 2007].