An overview of the APEmille parallel computer

Abstract
We describe the architecture of the APEmille Parallel Computer, the new generation of the APE family of processors optimized for Lattice Gauge Theory simulations. We emphasize the features of the new machine that are potentially useful for applications in other areas of computational physics.

  • Citation context (excerpt of the citing paper's parallel shortest-path pseudocode, leading into its Section 4, "Implementation on the APEmille supercomputer", 4.1 "The APEmille architecture": APEmille [2] [16] ...):
        ... Init(Q); d(s) ← 0; S ← {s}; Update(Q, L_s(0))
        /* Main loop */
        while ¬Empty(Q) do
            v(d) ← DeleteMin(Q)
            d(v) ← d; S ← S ∪ {v}
            Update(Q, L_v(d))
            for all w ∈ I_v in parallel do
                if w ∉ S then remove v from L_w
            end for
        end while
    Conference Paper
    Full-text available
    We investigate the practical merits of a parallel priority queue through its use in the development of a fast and work-efficient parallel shortest-path algorithm, originally designed for an EREW PRAM. Our study reveals that an efficient implementation on a real supercomputer requires considerable effort to reduce the cost of communication (which in theory is assumed to take constant time). It turns out that the most crucial part of the implementation is the mapping of the logical processors to the physical processing nodes of the supercomputer. We achieve the required efficient mapping through a new graph-theoretic result of independent interest: computing a Hamiltonian cycle on a directed hyper-torus. No such algorithm was previously known for directed hyper-tori. Our Hamiltonian cycle algorithm allows us to considerably reduce the communication cost and thus improve the overall performance of our implementation.
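    To make the DeleteMin/Update structure of the loop quoted in the citation context concrete, here is a minimal sequential Python sketch of the same priority-queue-driven shortest-path computation; the graph representation, the function name and the use of a binary heap are illustrative assumptions, and the parallel bucket lists L_v of the paper are not reproduced here.

        import heapq

        def shortest_paths(adj, s):
            # adj: dict mapping each vertex to a list of (neighbour, edge weight) pairs.
            # Returns the dict of final distances d(v) from the source s.
            d = {s: 0}
            settled = set()                  # the settled set S
            q = [(0, s)]                     # priority queue Q keyed by tentative distance
            while q:                         # while not Empty(Q)
                dist, v = heapq.heappop(q)   # DeleteMin(Q)
                if v in settled:
                    continue                 # stale entry; the paper avoids these via the lists L_w
                settled.add(v)               # S <- S + {v}
                for w, weight in adj[v]:     # Update(Q, L_v(d)): relax the edges leaving v
                    new_d = dist + weight
                    if w not in settled and new_d < d.get(w, float("inf")):
                        d[w] = new_d
                        heapq.heappush(q, (new_d, w))
            return d

        # Example: shortest_paths({0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}, 0) == {0: 0, 1: 2, 2: 3}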
  • Article
    We study the feasibility of a PC-based parallel computer for medium- to large-scale lattice QCD simulations. The Eötvös University Institute for Theoretical Physics cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM each. The 32-bit, single-precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost-effective (only 30% of the hardware cost is spent on communication). According to our benchmark measurements, it results in a communication-time fraction of around 40% for lattices of up to 48³·96 sites in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that fits in our parallel computer. The communication software is freely available upon request for non-profit organizations.
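    As an illustration of the communication architecture described above, the following Python sketch computes the four nearest-neighbor ranks of a node in a periodic two-dimensional mesh; the mesh shape and the row-major rank ordering are assumptions made for the example, not the cluster's actual configuration.

        def mesh_neighbours(rank, nx, ny):
            # Place a linear rank on a periodic nx-by-ny mesh (row-major ordering assumed)
            # and return the ranks of its four nearest neighbours.
            x, y = rank % nx, rank // nx
            return {
                "east":  (x + 1) % nx + y * nx,
                "west":  (x - 1) % nx + y * nx,
                "north": x + ((y + 1) % ny) * nx,
                "south": x + ((y - 1) % ny) * nx,
            }

        # Example: on a 4x4 mesh, node 0 exchanges data only with nodes 1, 3, 4 and 12.
        # mesh_neighbours(0, 4, 4) == {"east": 1, "west": 3, "north": 4, "south": 12}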
  • Article
    In this talk I report on the status of the apeNEXT project. apeNEXT is the last of a family of parallel computers designed, in a research environment, to provide multi-teraflops computing power to scientists involved in heavy numerical simulations. The architecture and the custom chip are optimized for Lattice QCD (LQCD) calculations, but the favourable price/performance ratio and the good efficiency for other kinds of calculations make it an interesting tool for a large class of scientific problems.
  • Conference Paper
    The experience described in this paper relates to the implementation on the parallel computer APEmille of a model for large-scale atmosphere motion, originally developed in Fortran for a conventional architecture. The most critical aspects of this work are described: the mapping of a two-dimensional problem onto the three-dimensional toroidal architecture of the parallel machine, the choice of a data distribution strategy that minimizes internode communication needs, the definition of an internode communication algorithm that minimizes communication costs by performing only first-neighbour communications, and the implementation of machine-dependent optimizations that exploit the pipelined architecture of the APEmille processing node and its large register file. An analysis of the performance is reported, compared with both the APEmille peak performance and the performance on conventional sequential architectures. Finally, a comparison with the original physical results is presented.
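    A toy Python sketch of the first-neighbour exchange pattern that such a data distribution relies on: each node pads its local block with one-cell halos taken from its four neighbours. The single-process setting, block shapes and orientation conventions are illustrative assumptions; on the real machine each copy would become a point-to-point message to a neighbouring node.

        import numpy as np

        def halo_exchange(local, west, east, north, south):
            # Surround a node's local 2D block with one-cell halos copied from the
            # adjacent blocks (periodic decomposition assumed); corner cells are not
            # needed for a first-neighbour stencil and are left unset.
            ny, nx = local.shape
            padded = np.empty((ny + 2, nx + 2), dtype=local.dtype)
            padded[1:-1, 1:-1] = local
            padded[1:-1, 0]  = west[:, -1]    # last column of the western block
            padded[1:-1, -1] = east[:, 0]     # first column of the eastern block
            padded[0, 1:-1]  = north[-1, :]   # last row of the northern block
            padded[-1, 1:-1] = south[0, :]    # first row of the southern block
            return padded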
  • Article
    We briefly describe the Poor Man's Supercomputer (PMS) project carried out at Eötvös University, Budapest. The goal was to develop a cost-effective, scalable, fast parallel computer to perform numerical calculations of physical problems that can be implemented on a lattice with nearest-neighbour interactions. To this end we developed the PMS architecture using PC components and designed special, low-cost communication hardware and the driver software for Linux. Our first implementation of PMS includes 32 nodes (PMS1). The performance of PMS1 was tested by Lattice Gauge Theory simulations. Using SU(3) pure gauge theory or the bosonic MSSM on PMS1 we obtained price-to-sustained-performance ratios of $3/Mflop and $0.45/Mflop for double- and single-precision operations, respectively. The design of the special hardware and the communication driver are freely available upon request for non-profit organizations.