# An overview of the APEmille parallel computer

**Article**

*in* Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 389(1-2):56-58 · April 1997


## Abstract

We describe the architecture of the APEmille Parallel Computer, the new generation of the APE family of processors optimized for Lattice Gauge Theory simulations. We emphasize the features of the new machine potentially useful for applications in other areas of computational physics.

Citing excerpt (Sect. 4, "Implementation on the APEmille supercomputer"):

    Init(Q)
    d(s) ← 0;  S ← {s}
    Update(Q, L_s(0))
    /* Main loop */
    while ¬Empty(Q) do
        (v, d) ← DeleteMin(Q)
        d(v) ← d;  S ← S ∪ {v}
        Update(Q, L_v(d))
        for all w ∈ I_v^par do
            if w ∉ S then remove v from L_w
        end for
    end while

We investigate the practical merits of a parallel priority queue through its use in the development of a fast and work-efficient parallel shortest-path algorithm, originally designed for an EREW PRAM. Our study reveals that an efficient implementation on a real supercomputer requires considerable effort to reduce the communication cost (which in theory is assumed to take constant time). It turns out that the most crucial part of the implementation is the mapping of the logical processors to the physical processing nodes of the supercomputer. We achieve the required efficient mapping through a new graph-theoretic result of independent interest: computing a Hamiltonian cycle on a directed hyper-torus. No such algorithm was known before for the case of directed hyper-tori. Our Hamiltonian cycle algorithm allows us to considerably improve the communication cost and thus the overall performance of our implementation.

- Conference Paper (full text available) · Dec 2006
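The pseudocode excerpt above is the classical priority-queue formulation of single-source shortest paths that the paper parallelizes. As a point of reference, a minimal sequential sketch with a binary-heap priority queue (illustrative code, not taken from the paper) could read:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths; graph maps node -> [(neighbor, weight), ...]."""
    dist = {source: 0}
    settled = set()                 # plays the role of the set S in the pseudocode
    queue = [(0, source)]           # priority queue keyed by tentative distance
    while queue:
        d, v = heapq.heappop(queue)  # DeleteMin
        if v in settled:
            continue                 # stale entry; lazy deletion
        settled.add(v)
        for w, weight in graph.get(v, []):
            nd = d + weight
            if w not in dist or nd < dist[w]:
                dist[w] = nd
                heapq.heappush(queue, (nd, w))  # Update with improved distance
    return dist
```

Here the `Update` step of the excerpt corresponds to lazily pushing improved tentative distances onto the heap; the parallel version's difficulty lies in distributing exactly this queue across processors.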
We study the feasibility of a PC-based parallel computer for medium- to large-scale lattice QCD simulations. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware cost is spent on communication). According to our benchmark measurements this type of communication results in around a 40% communication time fraction for lattices up to 48³·96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that fits in our parallel computer. The communication software is freely available upon request for non-profit organizations.

- Article · May 2003 · COMPUT PHYS COMMUN
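In the two-dimensional mesh described above, each node talks only to its four nearest neighbors, with wrap-around links closing the mesh into a torus. A small sketch of a rank-to-neighbor mapping (a generic illustration; the names and row-major layout are assumptions, not the authors' driver code):

```python
def mesh_neighbors(rank, nx, ny):
    """Return the four nearest-neighbor ranks of `rank` on an nx-by-ny
    periodic mesh, as a dict with keys 'left'/'right'/'down'/'up'."""
    x, y = rank % nx, rank // nx          # node coordinates, row-major order
    return {
        "left":  (x - 1) % nx + y * nx,   # periodic wrap-around in x
        "right": (x + 1) % nx + y * nx,
        "down":  x + ((y - 1) % ny) * nx,  # periodic wrap-around in y
        "up":    x + ((y + 1) % ny) * nx,
    }
```

With such a mapping, a full halo exchange is four point-to-point transfers per node, which is what makes commodity Gigabit Ethernet links sufficient.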
In this talk I report on the status of the apeNEXT project. apeNEXT is the last of a family of parallel computers designed, in a research environment, to provide multi-teraflops computing power to scientists involved in heavy numerical simulations. The architecture and the custom chip are optimized for Lattice QCD (LQCD) calculations, but the favourable price/performance ratio and the good efficiency for other kinds of calculations make it an interesting tool for a large class of scientific problems.

- Article · Feb 2005 · Nucl Phys B Proc Suppl
The experience described in this paper relates to the implementation on the parallel computer APEmille of a model for large-scale atmosphere motion, originally developed in Fortran for a conventional architecture. The most critical aspects of this work are described: the mapping of a two-dimensional problem onto the three-dimensional toroidal architecture of the parallel machine, the choice of a data distribution strategy that minimizes internode communication needs, the definition of an internode communication algorithm that minimizes communication costs by performing only first-neighbour communications, and the implementation of machine-dependent optimizations that made it possible to exploit the pipelined architecture of the APEmille processing node and its large register file. An analysis of the performance is reported, compared both to the APEmille peak performance and to the performance on other conventional sequential architectures. Finally, a comparison with the original physical results is presented.

- Conference Paper · Jan 2004
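The communication-minimizing data distribution mentioned above comes down to a perimeter-versus-area argument: with first-neighbour communication only, each node exchanges just the boundary sites of its local block, so the decomposition should minimize the block perimeter for a given block area. A toy sketch (the function name and the equal-block model are illustrative assumptions):

```python
def halo_sites(nx, ny, px, py):
    """Boundary (halo) sites each node must exchange per step when an
    nx-by-ny grid is split into equal blocks over a px-by-py node mesh;
    smaller is cheaper."""
    assert nx % px == 0 and ny % py == 0, "blocks must tile the grid evenly"
    lx, ly = nx // px, ny // py          # local block dimensions
    return 2 * (lx + ly)                 # one halo row/column per side
```

For a 64×64 grid on 8 nodes, a 1×8 strip layout gives `halo_sites(64, 64, 1, 8) == 144` sites per node, while a squarer 2×4 layout gives 96: squarer blocks communicate less for the same compute volume.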
We briefly describe the Poor Man's Supercomputer (PMS) project carried out at Eötvös University, Budapest. The goal was to develop a cost-effective, scalable, fast parallel computer to perform numerical calculations of physical problems that can be implemented on a lattice with nearest-neighbour interactions. To this end we developed the PMS architecture using PC components and designed special, low-cost communication hardware and its driver software for Linux OS. Our first implementation of PMS includes 32 nodes (PMS1). The performance of PMS1 was tested by Lattice Gauge Theory simulations. Using SU(3) pure gauge theory or the bosonic MSSM on PMS1 we obtained $3/Mflops and $0.45/Mflops price-to-sustained-performance for double and single precision operations, respectively. The design of the special hardware and the communication driver are freely available upon request for non-profit organizations.

- Article · Dec 1999 · COMPUT PHYS COMMUN

We describe the software environment available for the APE100 parallel processor. We discuss the parallel programming language that we have defined for APE100 and its optimizing compiler. We then describe the operating system that allows APE100 to be controlled from a host computer.

- Article · Nov 2011 · INT J MOD PHYS C
We summarize the present status of the GF11 parallel computer project.

- Article · Sep 1990 · Nucl Phys B Proc Suppl
We describe APEmille, the latest generation of the APE parallel processors. This machine, an evolution of the APE100 concept, is very efficient for LGT simulations as well as for a broader class of applications requiring massive floating-point computations. Several new features characterise this evolution; in particular, local addressing capabilities are added to all computing nodes. APEmille also exhibits a higher degree of integration with a network of workstations acting as a global host system. An APEmille system in the Teraflops range will be completed in three to four years. The architecture proposed in this paper is currently being simulated and evaluated.

- Article · Apr 1995 · Nucl Phys B Proc Suppl
The first successful operation of QCDPAX with 432 nodes is reported. The peak speed is about 12.5 GFLOPS and the total memory is 2.6 GByte. It is planned to increase the number of nodes to 480 in the near future. After brief descriptions of the system architecture, hardware and software development, the results of a study of the phase transition in pure gauge theory are briefly presented.

- Article · May 1991 · Nucl Phys B Proc Suppl
The first two weeks of successful operation of our 16 Gigaflop, 256-node parallel computer are reported. The characteristics of this machine and the two preceding versions are briefly reviewed. The physics program being run on the 16- and 64-node machines is described, with some discussion of new string-tension results on large lattices obtained with the 64-node machine. Finally, plans for a future 1 Teraflop machine are considered.

- Article · Sep 1990 · Nucl Phys B Proc Suppl
The APE computer is a high performance processor designed to provide massive computational power for intrinsically parallel and homogeneous applications. APE is a linear array of processing elements and memory boards that execute in parallel in SIMD mode under the control of a CERN/SLAC 3081/E. Processing elements and memory boards are connected by a ‘circular’ switchnet. The hardware and software architecture of APE, as well as its implementation, are discussed in this paper. Some physics results obtained in the simulation of lattice gauge theories are also presented.

- Article · Aug 1987 · COMPUT PHYS COMMUN
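Data movement on such a linear array with a ‘circular’ switchnet amounts to a lockstep ring shift: every processing element simultaneously passes its datum to the neighbour a fixed number of positions away. A sketch of the effect, simulated in software (the real machine does this in hardware, and the function name is an illustrative assumption):

```python
def circular_shift(values, step=1):
    """Return the array after every element moves `step` positions along
    the ring, as a lockstep SIMD nearest-neighbour shift would do."""
    n = len(values)
    # element i receives the datum held by the PE `step` positions behind it
    return [values[(i - step) % n] for i in range(n)]
```

Because every element moves the same distance in the same direction, one such shift is a single SIMD communication step regardless of the array length.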
APE100 processors are based on a simple Single Instruction Multiple Data architecture optimized for the simulation of Lattice Field Theories or other complex physical systems. This paper describes the hardware implementation of the first APE100 machine.

- Article · Mar 1993 · INT J MOD PHYS C
In this paper we describe an implementation of the Lattice Boltzmann Equation method for fluid-dynamics simulations on the APE100 parallel computer. We have performed a simulation of a two-dimensional Rayleigh-Bénard convection cell, and have tested the theory proposed by Shraiman and Siggia for the scaling of the Nusselt number vs. the Rayleigh number.

- Article · Feb 1993 · INT J MOD PHYS C
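The Shraiman-Siggia theory tested above predicts a power law Nu ∝ Ra^(2/7), so the comparison reduces to estimating a scaling exponent from simulation data. A sketch of the log-log least-squares fit on synthetic data (illustrative only, not the paper's measurements):

```python
import math

def fit_exponent(ra, nu):
    """Least-squares slope of log(nu) vs log(ra): the scaling exponent
    gamma in Nu ~ Ra**gamma."""
    xs = [math.log(r) for r in ra]
    ys = [math.log(n) for n in nu]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Feeding it points generated from an exact Nu = 0.2·Ra^(2/7) law recovers the exponent 2/7 ≈ 0.286, as expected for data following the predicted scaling.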
A special-purpose processor for Monte-Carlo simulation of the three-dimensional Ising model is described. This device performs the Monte-Carlo updating algorithm on 25 million spins per second on a 64³ lattice. The device is also capable of measuring the energy and magnetization of the system or passing the updated lattice to a host computer.

- Article · Aug 1983 · J COMPUT PHYS

We have designed and built the Orrery, a special computer for high-speed, high-precision orbital mechanics computations. On the problems the Orrery was designed to solve, it achieves approximately 10 Mflops in about 1 ft³ of space while consuming 150 W of power. The specialized parallel architecture of the Orrery, which is well matched to orbital mechanics problems, is the key to obtaining such high performance. In this paper we discuss the design, construction, and programming of the Orrery. Copyright © 1985 by The Institute of Electrical and Electronics Engineers, Inc.
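In software form, the update that the Ising machine described above performs in dedicated hardware is the standard Metropolis single-spin sweep. A minimal sketch, assuming a flat-array spin layout with periodic boundaries (nothing here reflects the actual device design):

```python
import math
import random

def metropolis_sweep(spins, L, beta, rng=random.random):
    """One Metropolis sweep of an L*L*L Ising lattice with periodic
    boundaries; `spins` is a flat list of +1/-1 spins, index x + L*(y + L*z)."""
    for z in range(L):
        for y in range(L):
            for x in range(L):
                i = x + L * (y + L * z)
                # sum of the six nearest-neighbour spins
                nn = (spins[(x + 1) % L + L * (y + L * z)]
                      + spins[(x - 1) % L + L * (y + L * z)]
                      + spins[x + L * ((y + 1) % L + L * z)]
                      + spins[x + L * ((y - 1) % L + L * z)]
                      + spins[x + L * (y + L * ((z + 1) % L))]
                      + spins[x + L * (y + L * ((z - 1) % L))])
                dE = 2 * spins[i] * nn      # energy cost of flipping spin i
                if dE <= 0 or rng() < math.exp(-beta * dE):
                    spins[i] = -spins[i]
    return spins
```

At large beta a fully aligned lattice stays aligned, since every flip then costs energy; a hardware implementation pipelines exactly this neighbour-sum, accept/reject logic at one spin per cycle.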