titulo.jpg (22800 bytes)

 

 

Status report on the parallelization of the VPBAR Algorithm

Carlos Figueira, Emely Arraiz, Maruja Ortega and Alejandro Teruel



boton.jpg (1101 bytes)Introduction

A parallel version of the last VPBAR algorithm (January 1996) has been coded and fixed for a while; we have been experimenting with various input test networks. The version used for the tests reported here uses the eMP communication library, and has been run on the CESMA'S MC-3E Parsytec machine (described below). Since then, we have ported the parallel application to the Power Xplorer. Additionnally, this new version use the standard MPI communications library. Preliminary results show that the application runs about ten times faster on an 8 processors Power Xplorer than on a 16 processors MC-3E. We are currently running this version on a definite set of inputs on the Power Xplorer, in order to gather the actual results which will be included in an article jointly prepared with GTE labs people. This document here report the results of the tests achieved on the CESMA'S MC-3E Parsytec machine (described below). The results are preceded with a brief presentation of the hardware and software platforms, and with a description of the parallel algorithm.

boton.jpg (1101 bytes)Hardware and Software Platforms

The parallel versions of the optimal dynamic virtual path bandwidth allocation algorithm (called VPBAR Algorithm) run on a Parsytec MC-3DE machine with 16 INMOS Transputer (T805) processor nodes. Each Transputer has 4 Megabytes of local memory and runs at 30 MHz. The processors are linked in a two dimensional array. The front-end to the Parsytec machine is a Sun Sparc workstation, running at 70 MHz. The programs are written in C. They include calls to the eMP (easy Message Passing) communications library which is written on top of Parsytec's Parix operating system communication routines. Processes are assigned to processors statically.

boton.jpg (1101 bytes)Implementation of the Parallel Algorithm

We designed the parallel algorithm using the SPMD/DM (Single Program, Multiple Data / Distributed Memory) paradigm with synchronous message passing. The algorithm is based on a static uniform distribution of all scenarios amongst processors. Each processor runs all computations corresponding to its assigned scenario's topologies, which are read from files at the beginning of the execution. There is a ``master processor'', to which is assigned the Zero Failure Scenario. The processors compute on their assigned scenarios and communicate their local results to the master processor, which in turn evaluates convergence conditions, then broadcast the test result to all processors. The master processor writes out the results to a file upon completion of the parallel algorithm.

boton.jpg (1101 bytes)Results

We ran the parallel algorithm with a variety of test networks (from 12 nodes to 20 nodes ATM networks) and load demand combinations (low, medium and high). The results are summarized in the following tables.

 

  • Topology: 12 Nodes (A)

  • Topology: 12 Nodes (B). Medium load

  • Topology: 16 Nodes (A). Medium load

  • Topology: 16 Nodes (B). Medium load

  • Topology: 20 Nodes (A). Medium load

  • Topology: 20 Nodes (B). Medium load

     

The algorithm exhibited coarse granularity, which increases on the network size and the load demand. For our smaller ATM network (12 Nodes), the speed-up quickly saturates, barely improving for more than 8 processors. For larger and heavily loaded networks, the speed-up improves steadily through 16 processors. Clearly the processor load increases faster than communication load. This property of the parallel algorithm allows us to expect much better execution times and efficiency on MIMD architectures with more powerful nodes, like the Power Xplorer's PowerPc 601.

boton.jpg (1101 bytes)Preliminary conclusions

After summarizing the results of the about 90 experiments, we extract the following information:

  • The Parallel Algorithm exhibits very good speedups, as expected. This means it is highly parallelizable, and confirm that it is a coarse grain application, very suitable for MIMD machines as the ones we have.
  • Nevertheless, the speedup growth rate decays as we increase the number of processors, except for the largest and/or heavily loaded networks.

boton.jpg (1101 bytes)Information

        Carlos Figueira, Emely Arraiz, Maruja Ortega and Alejandro Teruel


Última modificación realizada por Julio Rodríguez  el día viernes 14 de enero de 2000