An overview of the APEmille parallel computer

Abstract
We describe the architecture of the APEmille Parallel Computer, the new generation of the APE family of processors optimized for Lattice Gauge Theory simulations. We emphasize the features of the new machine that are potentially useful for applications in other areas of computational physics.

  • Citation context (excerpt of the citing paper's parallel shortest-path pseudocode, leading into its Section 4, "Implementation on the APEmille supercomputer", 4.1 "The APEmille architecture": APEmille [2] [16] ...):
        ... Init(Q); d(s) ← 0; S ← {s}; Update(Q, L_s(0))
        /* Main loop */
        while ¬Empty(Q) do
            v(d) ← DeleteMin(Q)
            d(v) ← d; S ← S ∪ {v}
            Update(Q, L_v(d))
            for all w ∈ I_v in parallel do
                if w ∉ S then remove v from L_w
            end for
        end while
    Conference Paper
    Full-text available
    We investigate the practical merits of a parallel priority queue through its use in the development of a fast and work-efficient parallel shortest-path algorithm, originally designed for an EREW PRAM. Our study reveals that an efficient implementation on a real supercomputer requires considerable effort to reduce the cost of communication (which in theory is assumed to take constant time). It turns out that the most crucial part of the implementation is the mapping of the logical processors to the physical processing nodes of the supercomputer. We achieve the required efficient mapping through a new graph-theoretic result of independent interest: computing a Hamiltonian cycle on a directed hyper-torus. No such algorithm was previously known for directed hyper-tori. Our Hamiltonian cycle algorithm allows us to considerably reduce the communication cost and thus improve the overall performance of our implementation.
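    To make the DeleteMin/Update structure of the loop quoted in the citation context concrete, here is a minimal sequential Python sketch of the same priority-queue-driven shortest-path computation; the graph representation, the function name and the use of a binary heap are illustrative assumptions, and the parallel bucket lists L_v of the paper are not reproduced here.

        import heapq

        def shortest_paths(adj, s):
            # adj: dict mapping each vertex to a list of (neighbour, edge weight) pairs.
            # Returns the dict of final distances d(v) from the source s.
            d = {s: 0}
            settled = set()                  # the settled set S
            q = [(0, s)]                     # priority queue Q keyed by tentative distance
            while q:                         # while not Empty(Q)
                dist, v = heapq.heappop(q)   # DeleteMin(Q)
                if v in settled:
                    continue                 # stale entry; the paper avoids these via the lists L_w
                settled.add(v)               # S <- S + {v}
                for w, weight in adj[v]:     # Update(Q, L_v(d)): relax the edges leaving v
                    new_d = dist + weight
                    if w not in settled and new_d < d.get(w, float("inf")):
                        d[w] = new_d
                        heapq.heappush(q, (new_d, w))
            return d

        # Example: shortest_paths({0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}, 0) == {0: 0, 1: 2, 2: 3}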
  • Article
    We study the feasibility of a PC-based parallel computer for medium- to large-scale lattice QCD simulations. The Eötvös University Institute for Theoretical Physics cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM each. The 32-bit, single-precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost-effective (only 30% of the hardware cost is spent on communication). According to our benchmark measurements, it results in a communication-time fraction of around 40% for lattices of up to 48³·96 sites in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that fits in our parallel computer. The communication software is freely available upon request for non-profit organizations.
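    As an illustration of the communication architecture described above, the following Python sketch computes the four nearest-neighbor ranks of a node in a periodic two-dimensional mesh; the mesh shape and the row-major rank ordering are assumptions made for the example, not the cluster's actual configuration.

        def mesh_neighbours(rank, nx, ny):
            # Place a linear rank on a periodic nx-by-ny mesh (row-major ordering assumed)
            # and return the ranks of its four nearest neighbours.
            x, y = rank % nx, rank // nx
            return {
                "east":  (x + 1) % nx + y * nx,
                "west":  (x - 1) % nx + y * nx,
                "north": x + ((y + 1) % ny) * nx,
                "south": x + ((y - 1) % ny) * nx,
            }

        # Example: on a 4x4 mesh, node 0 exchanges data only with nodes 1, 3, 4 and 12.
        # mesh_neighbours(0, 4, 4) == {"east": 1, "west": 3, "north": 4, "south": 12}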
  • Article
    In this talk I report on the status of the apeNEXT project. apeNEXT is the last of a family of parallel computers designed, in a research environment, to provide multi-teraflops computing power to scientists involved in heavy numerical simulations. The architecture and the custom chip are optimized for Lattice QCD (LQCD) calculations, but the favourable price/performance ratio and the good efficiency for other kinds of calculations make it an interesting tool for a large class of scientific problems.
  • Conference Paper
    The experience described in this paper relates to the implementation on the parallel computer APEmille of a model for large-scale atmosphere motion, originally developed in Fortran for a conventional architecture. The most critical aspects of this work are described: the mapping of a two-dimensional problem onto the three-dimensional toroidal architecture of the parallel machine, the choice of a data distribution strategy that minimizes internode communication needs, the definition of an internode communication algorithm that minimizes communication costs by performing only first-neighbour communications, and the implementation of machine-dependent optimizations that exploit the pipelined architecture of the APEmille processing node and its large register file. An analysis of the performance is reported, compared with both the APEmille peak performance and the performance on conventional sequential architectures. Finally, a comparison with the original physical results is presented.
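    A toy Python sketch of the first-neighbour exchange pattern that such a data distribution relies on: each node pads its local block with one-cell halos taken from its four neighbours. The single-process setting, block shapes and orientation conventions are illustrative assumptions; on the real machine each copy would become a point-to-point message to a neighbouring node.

        import numpy as np

        def halo_exchange(local, west, east, north, south):
            # Surround a node's local 2D block with one-cell halos copied from the
            # adjacent blocks (periodic decomposition assumed); corner cells are not
            # needed for a first-neighbour stencil and are left unset.
            ny, nx = local.shape
            padded = np.empty((ny + 2, nx + 2), dtype=local.dtype)
            padded[1:-1, 1:-1] = local
            padded[1:-1, 0]  = west[:, -1]    # last column of the western block
            padded[1:-1, -1] = east[:, 0]     # first column of the eastern block
            padded[0, 1:-1]  = north[-1, :]   # last row of the northern block
            padded[-1, 1:-1] = south[0, :]    # first row of the southern block
            return padded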
  • Article
    We briefly describe the Poor Man's Supercomputer (PMS) project carried out at Eötvös University, Budapest. The goal was to develop a cost-effective, scalable, fast parallel computer to perform numerical calculations of physical problems that can be implemented on a lattice with nearest-neighbour interactions. To this end we developed the PMS architecture using PC components and designed special, low-cost communication hardware and the driver software for Linux. Our first implementation of PMS includes 32 nodes (PMS1). The performance of PMS1 was tested by Lattice Gauge Theory simulations. Using SU(3) pure gauge theory or the bosonic MSSM on PMS1 we obtained price-to-sustained-performance ratios of $3/Mflop and $0.45/Mflop for double- and single-precision operations, respectively. The design of the special hardware and the communication driver are freely available upon request for non-profit organizations.