FERMILAB-Conf-88/111

# The Fermilab Advanced Computer Program Multi-Array Processor System (ACPMAPS) A Site Oriented Supercomputer for Theoretical Physics<sup>†</sup>

T. Nash, H. Areti, R. Atac, J. Biel, A. Cook, J. Deppe, M. Edel, M. Fischler, I. Gaines, D. Husby, T. Pham, and T. Zmuda

Advanced Computer Program

Fermi National Accelerator Laboratory

P.O. Box 500, Batavia, Illinois 60510

E. Eichten, G. Hockney, P. Mackenzie, H. B. Thacker, and D. Toussaint\*

Theoretical Physics Group

Fermi National Accelerator Laboratory

P.O. Box 500, Batavia, Illinois 60510

August 1988

<sup>†</sup>Presented by Thomas Nash at the Adriatico Conference on the "Impact of Digital Microelectronics and Microprocessors on Particle Physics," International Centre for Theoretical Physics, Trieste, Italy, March 28-30, 1988.




### ABSTRACT

The ACP Multi-Array Processor System (ACPMAPS) is a highly cost effective, local memory parallel computer designed for floating point intensive grid based problems. The processing nodes of the system are single board array processors based on the FORTRAN and C programmable Weitek XL chip set. The nodes are connected by a network of very high bandwidth 16 port crossbar switches. The architecture is designed to achieve the highest possible cost effectiveness while maintaining a high level of programmability. The primary application of the machine at Fermilab will be lattice gauge theory. The hardware is supported by a transparent site oriented software system called CANOPY which shields theorist users from the underlying node structure.

### 1. INTRODUCTION

The ACP Multi-Array Processor System (ACPMAPS) is a highly cost effective, local memory parallel computer designed for floating point intensive grid based problems. The project is a joint effort of Fermilab's Advanced Computer Program (ACP) and Theoretical Physics Group. The processing nodes of the system are single board array processors based on the FORTRAN and C programmable Weitek XL chip set. The nodes are connected by a network of very high bandwidth 16 port crossbar switches. The architecture is designed to achieve the highest possible cost effectiveness while maintaining a high level of programmability. The primary application of the machine at Fermilab will be lattice gauge theory.

<sup>\*</sup>Talk given by Thomas Nash at the Adriatico Conference on the Impact of Digital Microelectronics and Microprocessors on Particle Physics, International Centre for Theoretical Physics, Trieste, Italy, March 28-30, 1988.

<sup>+</sup>Fermilab is operated by Universities Research Association, Inc., under contract with the U.S. Department of Energy.

<sup>\*</sup>On leave from Department of Physics, University of California, San Diego, La Jolla, California 92093 USA.

To estimate the computing needs of lattice quantum chromodynamics, one can consider the calculation of the deconfining temperature in SU(3) gauge theory without quarks,<sup>1]</sup> which is one of the most solid four dimensional calculations done so far. Something like 500,000 MFlop-hours (peak, ~70% delivered) were used on a Star ST-100 array processor. This calculation required a lattice spacing of less than 0.1 fermi and a volume of close to (2 fermi)<sup>3</sup>, resulting in lattices with spatial sizes of up to 19<sup>3</sup>. It is virtually certain that calculations with quarks will require even larger lattices than this for comparable accuracy. Lattices with space-time sizes of 32<sup>4</sup> to 64<sup>4</sup>, requiring 1-20 GBytes of data memory, are a reasonable guess. Calculations of hadron masses in the approximation of ignoring dynamical quarks have not yet achieved a reasonable understanding of calculational errors, even on Cray-sized supercomputers. Although algorithms for the inclusion of dynamical quark effects have made tremendous progress in the last few years, at present they still seem to require at least two orders of magnitude more computer time than comparable calculations without quarks. It is thus clear that large increases in combined CPU power and algorithmic power are still required for even simple QCD calculations.
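As a rough check on the scale of these memory figures, the sketch below counts the bytes needed for the gauge field alone on an L<sup>4</sup> lattice. The storage layout is an assumption made for illustration and is not taken from the text: four links per site, each a 3x3 complex matrix held as 18 words of 4 bytes. The function name is likewise hypothetical. Quark propagators and work vectors come on top of this, so the result is a lower bound on storage rather than a reproduction of the 1-20 GByte estimate.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold the gauge field on an L^4 lattice.
   Assumptions (illustrative, not from the paper): 4 links per site,
   each link a 3x3 complex matrix = 18 real words, 4 bytes per word. */
uint64_t gauge_field_bytes(uint64_t L)
{
    assert(L > 0);
    const uint64_t sites = L * L * L * L;
    const uint64_t bytes_per_site = 4u * 18u * 4u;   /* 288 bytes */
    return sites * bytes_per_site;
}
```

Under these assumptions `gauge_field_bytes(32)` is about 0.3 GBytes and `gauge_field_bytes(64)` is about 4.8 GBytes.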

The aim of the Fermilab ACP Multi-Array Processor System (ACPMAPS) is to deliver such large amounts of memory and CPU power at the lowest possible cost, without compromising the programmability required for rapid algorithm development, which is just as important as raw computing power in achieving the goals of lattice gauge theory. For other large-scale computer projects aimed at lattice gauge theory, see Reference 2.

A 16 node system will be built in the summer. Two switch prototypes are working and are now undergoing a rigorous testing program. The Floating Point Array Processor (FPAP) node design is being simulated. PC board layout is underway and a prototype is planned for May. Fermilab intends to proceed to a 256 node system (5 GFlops for about \$750K) as soon as the 16 node system is operational. The maximum system size is 2048 nodes. The system is being designed in the ACP tradition to be commercialized and made available to other institutions.

### 2. ARCHITECTURE OVERVIEW

A block diagram of the system is shown in Figure 1. The individual single board Floating Point Array Processors have a peak performance of 20 MFlops. Each contains 8 MBytes of data memory and 2 MBytes of program memory. The FPAPs are plugged into a crate whose backplane is a 16 fold bidirectional high speed crossbar. The nodes can communicate with each other in pairs, all pairs simultaneously, at a full 20 MBytes/sec per pair. The architecture of ACPMAPS is a hypercube network of such crossbar switch crates, each supporting 8-16 FPAPs. In a typical configuration 8 array processor nodes will be plugged into each switch crate along with up to 8 I/O modules that interconnect crates in a hypercube (or better). The switches handle intercrate routing automatically. The system therefore does not operate with all nodes in lock step like an SIMD machine, as is the case in other projects of this type (Columbia, IBM GF11 and APE). It also does not strongly favor local communication (as existing hypercubes do). It thus allows for any conceivable new lattice algorithm, unconstrained by synchronous or local communication requirements. Despite its algorithmic flexibility the system ranks as the best (or nearly so, we won't argue) in terms of cost effectiveness in MFlops/\$.
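To give a feel for the crate level topology, the following sketch counts crate-to-crate hops in a plain hypercube of crates. It assumes, purely for illustration, that nodes are numbered consecutively, that a fixed number of nodes share each crate, and that crates carry binary hypercube addresses, so that the hop count is the Hamming distance between two crate addresses. The function names are hypothetical, and the actual routing is performed by the switch hardware.

```c
#include <assert.h>

/* Number of set bits in x (Hamming weight). */
static unsigned popcount_u(unsigned x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Crate-to-crate hops between two nodes in a hypercube of crates.
   Assumption: crate address = node id / nodes_per_crate.
   A result of 0 means both nodes sit on the same crossbar. */
unsigned crate_hops(unsigned node_a, unsigned node_b, unsigned nodes_per_crate)
{
    assert(nodes_per_crate > 0);
    unsigned crate_a = node_a / nodes_per_crate;
    unsigned crate_b = node_b / nodes_per_crate;
    return popcount_u(crate_a ^ crate_b);
}
```

With 256 nodes and 8 nodes per crate there are 32 crates, a five dimensional hypercube, so under these assumptions no two nodes are more than five crate hops apart.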

The system is controlled by a host microVAX (or a mainframe VAX) operating part time. The host starts a control program running on a control node, which is identical to all the other nodes of the system except in software. This node controls the lattice wide parts of the program and starts subprocesses on the individual nodes.

As noted, the nodes operate completely asynchronously. The use of MIMD (multiple instruction, multiple data) rather than SIMD (single instruction, multiple data) architecture is one of the most important features of the system. There are many advantages to this type of architecture. It is very flexible: it can handle problems which are awkward or impossible in SIMD, such as heat bath and incomplete LU decomposition algorithms and random lattice problems. The allowed sizes and shapes of the lattices are independent of the details of the hardware. The node structure of the machine can be made invisible in much or all of the high-level user code, resulting in improved programmability. This also results in improved fault tolerance, since the system can be reconfigured with one fewer node when one node is down, without requiring changes in user software or the setting aside of any spare nodes. Complications which have to be faced with MIMD include the potential for synchronization conflicts, which cannot occur with SIMD. This requires care in designing and understanding the communications system. In addition, a nontrivial system software design effort is required to ensure that overheads associated with the communications software are kept to acceptable levels.



Figure 1. ACP Multi Array Processor System 256 node configuration.

A major new package of software (CANOPY) is being developed for this system. Theorist users will think only in terms of sites and fields on sites. The system will automatically allocate sites to nodes and handle all site to site communication, whether on the same node or another. Thus users will not have to know details of the hardware for effective use of the system. Routines that are used heavily will be microcoded. The skeleton of all programs will be written in FORTRAN or C using a series of special subroutine calls that will make the programs particularly readable for lattice gauge theorists and others with site oriented algorithms. In this way, despite ease of use and flexibility, the system will approach 10 MFlops per \$3000 node in FORTRAN or C.\*

### 3. HARDWARE

The nodes are single board floating point array processors using the Weitek XL chip set, which contains a 32 bit, 20 MFlop (peak) floating point unit, an integer processor, 32 floating point and 32 integer registers, and an instruction sequencer. The chip set as a whole is programmable in FORTRAN and C, at some sacrifice in performance. Thus, these modules incorporate the functions of a high level language programmable single board computer and a high performance floating point array processor. No external CPU is required as a controller for these stand alone floating point engines.

The FPAP modules (Figure 2) contain the XL chip set, the data and code memory, and the interface logic and input and output queues for communicating with the crossbar switch crates. One floating point unit is used per node, in contrast to the designs of most of the SIMD machines aimed at lattice gauge theory. In addition to being a flexible and sensible design for a wide variety of problems, this was dictated by the desire to be able to use the Weitek FORTRAN and C compilers for the XL chip set.

The 2 MBytes of program memory and 8 MBytes of data memory are made from 1 Mbit, 80 nsec access time, page mode dynamic RAM chips. In page mode these memory systems can deliver data at a rate of one word per 100 nsec. This rate is fast enough that little additional efficiency would be gained in most lattice algorithms by replacing some of the DRAM by faster, more expensive static RAM. The memory chips constitute almost a third of the total cost of the machine. The ratio of memory to computing power provided (8 MBytes to 20 MFlops) is larger than that provided by most other machines of this type, and is larger than is required by presently existing algorithms for simulating full QCD including internal fermion loops. It is approximately appropriate for calculations in the valence approximation, ignoring fermion loops. Algorithmic improvements over the next few years will certainly change the required ratio. It seems likely that the possibilities which will increase the required ratio (preconditioning and Fourier acceleration of quark propagator calculation, Fourier acceleration of gauge simulation) are currently more promising than those which reduce the amount of memory required per CPU cycle (such as adding nonlocal operators to the action to reduce finite lattice spacing errors), and that the large amount of memory could easily become crucial in the years to come.

The nodes are plugged into a network of switch crates (described in the accompanying paper<sup>3</sup>) whose backplanes handle full sixteen port crossbar switching at bandwidths of 20 MBytes/second per connection. This yields a total bandwidth of 2.56 GBytes/sec for a 256 node machine. A cluster of 8-12 nodes is attached to each switch. The switches are connected in a hypercube, which may be augmented by additional communication channels along heavily used paths. This structure allows the nodes to communicate as if they were connected in a conventional hypercube arrangement, but more than this, it allows any node to communicate at full speed with *any* other node, allowing efficient running of algorithms requiring nonlocal communications. The switch crates allow any node to access any other node's data memory without needing to know where the other node is located on the network. With the current switch crate hardware, systems of up to 2048 nodes are possible before this transparent nonlocal communication feature is lost. The switch crates are based on SN74AS8840 sixteen input crossbar switch chips from Texas Instruments. They will also be used in a variety of high performance experimental particle physics applications of the ACP Multiprocessor System,<sup>3</sup> including the "Level-3" programmable trigger for the Collider Detector at Fermilab (CDF).

<sup>\*</sup>Memory prices are fluctuating widely at this writing. FPAP costs given here are based on Fall 1987 DRAM prices.
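The system level figures follow from the per node numbers quoted in the text. The sketch below scales them to a given node count. The struct and function names are hypothetical, and the bandwidth line rests on one assumption: that the quoted total counts all nodes talking in disjoint pairs, with one 20 MBytes/sec connection per pair.

```c
#include <assert.h>

/* Per-node figures quoted in the text. */
enum {
    PEAK_MFLOPS_PER_NODE = 20,  /* peak floating point rate */
    DATA_MBYTES_PER_NODE = 8,   /* data memory */
    LINK_MBYTES_PER_SEC  = 20   /* per crossbar connection */
};

typedef struct {
    unsigned peak_mflops;
    unsigned data_mbytes;
    unsigned pair_bandwidth_mbytes_per_sec;
} system_totals;

/* Scale the per-node figures to a system of `nodes` nodes. */
system_totals totals_for(unsigned nodes)
{
    assert(nodes % 2 == 0);   /* pairing assumes an even node count */
    system_totals t;
    t.peak_mflops = nodes * PEAK_MFLOPS_PER_NODE;
    t.data_mbytes = nodes * DATA_MBYTES_PER_NODE;
    /* Assumption: nodes communicate in disjoint pairs, one link per pair. */
    t.pair_bandwidth_mbytes_per_sec = (nodes / 2) * LINK_MBYTES_PER_SEC;
    return t;
}
```

For 256 nodes this gives 5120 MFlops peak, 2048 MBytes of data memory, and 2560 MBytes/sec of simultaneous pairwise traffic, consistent with the 5 GFlop and 2.56 GBytes/sec figures.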



Figure 2. Schematic Design of the Floating Point Array Processor (FPAP)

Low cost video technology tape drives will be used for checkpointing long calculations and for archiving gauge fields and propagators. These tape drives cost a few thousand dollars and can store 2 GBytes of data on an 8 mm video tape cartridge costing \$5-\$10. One drive will be attached to every switch crate, enabling all of memory to be stored in under five minutes.
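The five minute figure implies only a modest sustained tape rate. As a rough check, assume 8 nodes per crate (the typical configuration) and count only the 8 MBytes of data memory per node, so that each drive must absorb 64 MBytes. Both choices are assumptions made for this estimate:

$$
R_{\min} = \frac{N_{\text{nodes/crate}} \times M_{\text{node}}}{T}
         = \frac{8 \times 8\ \text{MBytes}}{300\ \text{s}}
         \approx 0.21\ \text{MBytes/sec}
$$

Including the 2 MBytes of program memory per node raises this to about 0.27 MBytes/sec per drive.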

### 4. SOFTWARE

Lattice gauge theories are part of a large class of grid-based problems derived from discretization of a set of differential equations which are very suitable for a parallel architecture like this one. The natural breakdown of the problem is to assign a certain subset of the sites in the space or spacetime to each node, which stores the data for the field variables defined on the sites assigned to it in its local memory and does calculations for its sites. The system software,<sup>4]</sup> known as CANOPY, has been designed to shield the user as much as possible from the hardware dependent node structure of the parallel architecture. The user thinks in terms of sites, not nodes.

User programs are divided conceptually into two pieces: the control program, which is called from a microVAX host or mainframe VAX and runs on the control node, and site subroutines, which run on the individual nodes. The control program controls the execution of lattice-wide tasks. It will typically be written in ordinary FORTRAN or C augmented by a set of system subroutines for dealing with global concepts (e.g., field memory, lattice wide tasks) which are distributed over all the nodes and require special treatment. The beginning of the control program will include statements like the following:

```
call define_periodic_lattice ( ndims, sizes, lat1 )
call define_field ( lat1, quarksize, q )
call define_field ( lat1, quarksize, q1 )
call complete_definitions
```

The routine define\_periodic\_lattice tells the system that our problem contains one lattice called lat1 of ndims dimensions, with the size of each dimension contained in the array sizes and with standard hypercube connectivity. More general user defined connectivities will be allowed. It is possible to define several lattices in the same program for block spin renormalization group or multigrid algorithms. The routine define\_field tells the system that memory will be required for storing two fields identified by q and q1, each with quarksize components for each site of lat1. The routine complete\_definitions calls routines which assign specific sites to specific nodes, allocate memory in the nodes for the field data and site structures, and set up structures for each site pointing to the memory areas of adjacent sites of the lattice.
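The two bookkeeping jobs described here, finding the neighbors of a site on a periodic lattice and deciding which node holds a site, can be sketched in C as follows. This is not CANOPY code. The lexicographic site numbering, the contiguous block assignment, and both function names are assumptions chosen for illustration.

```c
#include <assert.h>

#define MAX_DIMS 4

/* Lexicographic index of the neighbor of `site` one step along dimension
   `dim` in direction `dir` (+1 or -1), with periodic wraparound. */
long periodic_neighbor(long site, int ndims, const int sizes[], int dim, int dir)
{
    assert(ndims > 0 && ndims <= MAX_DIMS && dim >= 0 && dim < ndims);
    assert(dir == 1 || dir == -1);
    long stride = 1;
    for (int d = 0; d < dim; ++d)
        stride *= sizes[d];
    long coord = (site / stride) % sizes[dim];            /* coordinate along dim */
    long next = (coord + dir + sizes[dim]) % sizes[dim];  /* wrap at the edges */
    return site + (next - coord) * stride;
}

/* Hypothetical block assignment: contiguous runs of sites go to each node. */
int node_of_site(long site, long nsites, int nnodes)
{
    assert(nnodes > 0 && site >= 0 && site < nsites);
    long per_node = (nsites + nnodes - 1) / nnodes;   /* ceiling division */
    return (int)(site / per_node);
}
```

Because the node count enters only through `node_of_site`, the same lattice can be redistributed over one fewer node without touching code that works in terms of sites.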

A control node subroutine which operates on a field q with an operator dslash\_ and stores the result in another field q1 would be written as follows.

The system subroutine do\_task passes to all the nodes a pointer to a subroutine dslash\_ which operates on a single site and a pointer to a list of sites on which to operate, which may be the entire lattice lat1 or some previously defined set of sites such as red\_sites. A system routine on the node, invisible to the user, calls dslash\_ for all the sites in the set of sites which have been assigned to the node. do\_task may also be used to pass (pass\$) to the nodes parameters required by the site subroutine (like the field identifiers q and q1) and to integrate (integrate\$) data returned from the individual nodes.
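The dispatch pattern described above can be sketched in C as follows. This is not CANOPY code and does not reproduce its calling sequence: the names `run_task_on_node` and `do_task_serial`, the `owner` table, and the choice of summation as the integration step are all assumptions. The nodes are visited one after another in a loop, purely to show the control flow.

```c
#include <assert.h>
#include <stddef.h>

/* A site subroutine: works on one site, may use passed parameters,
   and returns a value to be accumulated. */
typedef double (*site_routine)(long site, const void *params);

/* Node-side loop: apply `routine` to every site of the set that this
   node owns and sum the results. owner[s] is the node holding site s. */
double run_task_on_node(int this_node, const int owner[],
                        const long sites[], size_t nsites_in_set,
                        site_routine routine, const void *params)
{
    assert(routine != NULL);
    double partial = 0.0;
    for (size_t i = 0; i < nsites_in_set; ++i) {
        long s = sites[i];
        if (owner[s] == this_node)
            partial += routine(s, params);
    }
    return partial;
}

/* Control-side view: hand the task to every node and integrate (here, sum)
   the partial results. Executed serially for illustration only. */
double do_task_serial(int nnodes, const int owner[],
                      const long sites[], size_t nsites_in_set,
                      site_routine routine, const void *params)
{
    double total = 0.0;
    for (int node = 0; node < nnodes; ++node)
        total += run_task_on_node(node, owner, sites, nsites_in_set,
                                  routine, params);
    return total;
}

/* Example site routine: contributes 1 per site visited. */
static double count_site(long site, const void *params)
{
    (void)site;
    (void)params;
    return 1.0;
}

/* Eight sites split over two nodes; the task runs on the even sites only. */
double demo_count_even_sites(void)
{
    const int owner[8] = {0, 0, 0, 0, 1, 1, 1, 1};
    const long even_sites[4] = {0, 2, 4, 6};
    return do_task_serial(2, owner, even_sites, 4, count_site, NULL);
}
```

Each site in the set is visited exactly once, by the node that owns it, whichever way the sites are distributed.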

The site subroutines access and replace data from global fields with subroutines like:

```
call get_field ( q, site, qtemp )
call put_field ( q1, site, qtemp )
```

which determine if the desired data is already resident on the node and open a channel to the communications hardware if necessary.
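The local versus remote decision made by these access routines can be illustrated with a small simulation. Everything in this sketch is hypothetical: the in-memory array standing in for the nodes' data memories, the block layout of sites, the `_sketch` function names, and the counter standing in for opening a channel through the switch.

```c
#include <assert.h>
#include <string.h>

enum { NCOMP = 3, SITES_PER_NODE = 4, NNODES = 2 };

/* Simulated field memory: node_memory[node][local site][component]. */
static float node_memory[NNODES][SITES_PER_NODE][NCOMP];
static int remote_accesses = 0;   /* counts simulated channel openings */

/* Find the owner and local index of a site; note whether it is remote. */
static float *locate(int this_node, long site)
{
    assert(site >= 0 && site < (long)NNODES * SITES_PER_NODE);
    int owner = (int)(site / SITES_PER_NODE);
    int local = (int)(site % SITES_PER_NODE);
    if (owner != this_node)
        ++remote_accesses;   /* a real system would open a channel here */
    return node_memory[owner][local];
}

void get_field_sketch(int this_node, long site, float out[NCOMP])
{
    memcpy(out, locate(this_node, site), sizeof(float) * NCOMP);
}

void put_field_sketch(int this_node, long site, const float in[NCOMP])
{
    memcpy(locate(this_node, site), in, sizeof(float) * NCOMP);
}

/* Node 0 touches one remote site twice and one local site once. */
int demo_remote_accesses(void)
{
    float v[NCOMP] = {1.0f, 2.0f, 3.0f};
    float w[NCOMP];
    remote_accesses = 0;
    put_field_sketch(0, 5, v);   /* site 5 lives on node 1: remote */
    get_field_sketch(0, 5, w);   /* remote again */
    get_field_sketch(0, 1, w);   /* site 1 lives on node 0: local */
    return remote_accesses;
}
```

The caller's code is identical for local and remote sites; only the access routine knows the difference.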

Most site subroutines can be written in FORTRAN or C. CPU intensive kernels such as SU(3) matrix multiplication and essential routines like dslash\_ will be microcoded for maximum efficiency. We expect that lattice gauge algorithms prepared in this way will run at 8-10 MFlops per node.

### 5. PHYSICS PROGRAM

The main interest of the Fermilab lattice group is the application of lattice gauge theory to QCD and "beyond the standard model" phenomenology. During the initial phase of running the machine we will pursue a program which as much as possible serves the multiple purposes of machine shakedown, algorithm development, careful error analysis for heavily studied quantities like the hadron spectrum, and production of new physics results.

Enormous progress has been made in the last few years on algorithms for inclusion of dynamical quark loops in QCD calculations. Many (possibly most) groups with large machines have begun attempts to calculate hadron masses with the new algorithms. Since there do not yet exist definitive spectrum results in the approximation of ignoring quark loops, this may seem somewhat premature. It is dictated to some extent by the availability of machines with a low ratio of memory to CPU power, which cannot easily go to lattices with larger volumes and smaller lattice spacing. With our machine, it will make more sense to do a careful study of the statistical, finite volume, and finite lattice spacing errors in the valence approximation before going on to the inclusion of internal quark loops.

In the valence approximation, CPU time is dominated by the efficiency of the quark matrix inversion algorithm (as opposed to the product of the efficiencies of the quark matrix inversion and the simulation algorithms in dynamical fermion algorithms). The relative efficiencies of these algorithms are very dependent on lattice size, the quark mass, etc. Our first algorithm project will be the testing on large lattices of the most promising of these methods (conjugate gradient, minimal residual, ...) and various methods of preconditioning (Fourier acceleration, incomplete LU decomposition).
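For readers unfamiliar with the first of these methods, here is a minimal conjugate gradient sketch in C. It is a generic textbook iteration applied to a small dense symmetric positive definite matrix, used here as a stand-in. It is not the lattice quark matrix, it has no preconditioning, and the matrix, tolerance, and function names are illustrative assumptions.

```c
#include <assert.h>

enum { N = 3 };

static double dot(const double a[N], const double b[N])
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        s += a[i] * b[i];
    return s;
}

static void matvec(const double A[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < N; ++j)
            y[i] += A[i][j] * x[j];
    }
}

/* Solve A x = b for symmetric positive definite A, starting from x = 0.
   Stops when the residual norm drops below tol. Returns iterations used. */
int conjugate_gradient(const double A[N][N], const double b[N], double x[N],
                       double tol, int max_iter)
{
    assert(tol > 0.0 && max_iter > 0);
    double r[N], p[N], Ap[N];
    for (int i = 0; i < N; ++i) {
        x[i] = 0.0;
        r[i] = b[i];   /* residual b - A x with x = 0 */
        p[i] = b[i];   /* first search direction */
    }
    double rr = dot(r, r);
    int k = 0;
    while (k < max_iter && rr > tol * tol) {
        matvec(A, p, Ap);
        double alpha = rr / dot(p, Ap);      /* step length along p */
        for (int i = 0; i < N; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;           /* keeps directions conjugate */
        for (int i = 0; i < N; ++i)
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++k;
    }
    return k;
}

/* Solve a small test system whose exact solution is (1, 2, 3) and
   return the largest componentwise error. */
double demo_cg_max_error(void)
{
    static const double A[N][N] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    static const double b[N] = {6, 10, 8};   /* A times (1, 2, 3) */
    static const double expected[N] = {1, 2, 3};
    double x[N];
    double worst = 0.0;
    conjugate_gradient(A, b, x, 1e-12, 50);
    for (int i = 0; i < N; ++i) {
        double e = x[i] - expected[i];
        if (e < 0.0)
            e = -e;
        if (e > worst)
            worst = e;
    }
    return worst;
}
```

Each iteration costs one matrix-vector product plus a few vector operations. On a lattice the matrix-vector product is the step that requires neighbor data from other sites.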

During the initial period of running, we will be doing careful analysis of errors in the valence approximation on a variety of lattice spacings and volumes. We intend to have as complete a set of analysis programs as possible for quantities which can be calculated from the basic set of quark propagators to run as a standard package on all data sets produced. These will include meson and baryon heavy quark potentials, the light hadron spectrum and decay constants, the kaon bag parameter, and the properties of D and B mesons in the 1/M expansion, including masses, spin splittings, decay constants and bag parameters.

### 6. REFERENCES

- 1] Gottlieb, S.A., et al., Phys. Rev. Lett. <u>55</u> 1958 (1985); Christ, N.H. and Terrano, A.E., Phys. Rev. Lett. <u>56</u> 111 (1986).
- 2] Christ, N.H. and Terrano, A.E., IEEE Trans. Comp. <u>C-33</u>(4) 344 (1984); Seitz, C.L., Commun. ACM <u>28</u> 22 (1985); Beetem, J., Denneau, M. and Weingarten, D., J. Stat. Phys. <u>43</u> 1171 (1986); Albanese, M., et al., The APE Computer, ROM2F/87/005 (1987).
- 3] Gaines, I., Areti, H., Atac, R., Biel, J., Cook, A., Deppe, J., Edel, M., Fischler, M., Hance, R., Husby, D., Nash, T., Pham, T. and Zmuda, T., "Multi-Processor Developments in the United States for Future High Energy Physics Experiments and Accelerators" presented at the Adriatico Conference on the Impact of Digital Microelectronics and Microprocessors on Particle Physics, International Centre for Theoretical Physics, Trieste, Italy, March 28-30, 1988.
- 4] "CANOPY ACP MAPS Software Specifications Document", Fermilab Advanced Computer Program internal document. To be released. Availability of ACP publications may be determined by reference to the file at the HEPNET location, FNACP::ACPDOCS\_ROOT:[DOCS]DOCLIST.DOC