# A special-purpose N-body machine GRAPE-1

Abstract

We have designed and built GRAPE-1 (GRAvity PipE 1), a special-purpose computer for astrophysical N-body calculations. It is designed as a back-end processor that calculates the gravitational interaction between particles. All other calculations are performed on a host computer connected to GRAPE-1. For large-N calculations (N ≳ 10^4), GRAPE-1 achieves about 100 Mflops-equivalent on a single board of about 40 cm by 30 cm, consuming 2.5 W of power. The key to this high performance is the pipelined architecture of GRAPE-1, which is specialized and optimized for the N-body calculation. The design and construction of the GRAPE-1 system are discussed.

- ... In order to accelerate this kind of N-body simulations, we have developed a series of special-purpose hardware, starting with GRAPE-1 [8]. The basic idea of the GRAPE (GRAvity piPE) architecture is to develop a fully pipelined processor specialized for the calculation of the gravitational interaction between particles. ...
- Conference Paper
- Dec 2003

In this paper, we describe the performance characteristics of GRAPE-6, the sixth-generation special-purpose computer for gravitational many-body problems. GRAPE-6 consists of 2048 custom pipeline chips, each of which integrates six pipeline processors specialized for the calculation of gravitational interaction between particles. The GRAPE hardware performs the evaluation of the interaction. The frontend processors perform all other operations, such as the time integration of the orbits of particles, I/O, on-the-fly analysis etc. The theoretical peak speed of GRAPE-6 is 63.4 Tflops. We present the result of benchmark runs, and discuss the performance characteristics. We also present the measured performance for a few real scientific applications. The best performance so far achieved with real applications is 35.3 Tflops.

- ... GRAPE (GRAvity PipE; Sugimoto et al. 1990) is a special purpose hardware that calculates Newtonian gravitational forces efficiently in large scale N-body simulations. A series of GRAPE was developed by Ito et al. (1990; GRAPE-1), Ito et al. (1991; GRAPE-2), Okumura et al. (1993; GRAPE-3), Makino et al. (1997; GRAPE-4), and Kawai et al. (2000; GRAPE-5). GRAPE is connected with, and controlled by a typical workstation or PC. ...

We describe a new method to accelerate neighbor searches on GRAPE, i.e. a special-purpose hardware that efficiently calculates the gravitational forces and potentials in N-body simulations. In addition to gravitational calculations, GRAPE simultaneously constructs lists of neighbor particles that are necessary for Smoothed Particle Hydrodynamics (SPH). However, the data transfer of the neighbor lists from GRAPE to the host computer is time-consuming, and can be a bottleneck. In fact, the data transfer can take about the same time as the calculations of the force themselves.
Making use of GRAPE’s special treatment of neighbor lists, we can reduce the amount of data transfer if we search neighbors in the order that the neighbor lists, constructed in a single GRAPE run, overlap each other. We find that the Morton ordering requires very low additional calculation and programming costs, and results in successful speed-up on data transfer. We show some benchmark results in the case of GRAPE-5. Typical reduction in transferred data becomes as much as 90%. This method is suitable not only for GRAPE-5, but also for GRAPE-3 and the other versions of GRAPE.
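The Morton (Z-order) ordering mentioned above can be illustrated with a short Python sketch. It interleaves the bits of quantized x, y, z cell indices into one key and sorts particles by that key, so that particles close in space tend to be close in the processing order. This is only a generic illustration of Morton ordering, not the authors' implementation; the function names, the 10-bit resolution, and the assumption that positions lie in the unit cube are all hypothetical choices made here.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def morton_order(positions, bits=10):
    """Return particle indices sorted by Morton key.

    Assumption: every coordinate lies in [0, 1).
    """
    cells = 1 << bits

    def key(i):
        # Quantize each coordinate to a grid cell, clamped to the last cell.
        ix, iy, iz = (min(cells - 1, int(c * cells)) for c in positions[i])
        return morton_key(ix, iy, iz, bits)

    return sorted(range(len(positions)), key=key)


if __name__ == "__main__":
    pts = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.1, 0.1, 0.15)]
    print(morton_order(pts))
```

In the example, the two particles near (0.1, 0.1, 0.1) end up adjacent in the ordering, which is the property that lets consecutive neighbor lists overlap.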
- ... One early approach to high performance N-body codes was the development of specialized parallel hardware to perform the explicit all pairs interaction. The Gravity Pipeline, or GRAPE, processor was one of these implementations [73, 40, 49]. While the computation suffers for its O(N^2) scaling, the speed of the dedicated processors has led to some impressive performance numbers, eventually leading to the winning of two Gordon Bell prizes for high performance parallel computation. ...
- Article
- Aug 1997

Introduction. Multipole-based algorithms provide O(N) or O(N log N) solutions to the computation of N-body interaction problems. These types of problems are seen in a wide variety of applications including molecular dynamics, fluid dynamics, and astrophysical simulation, among others. Efficient O(N) implementations can provide a significant time savings over other methods for medium to large sized simulations. Maximizing the performance of these algorithms involves improvements to the base algorithms as well as efficient parallel and distributed implementations. The proposed dissertation research will cover three major areas: 1. A distributed memory implementation of the Parallel Multipole Tree Algorithm (DPMTA). 2. Computation of Isotropic Constant Pressure Simulations within the framework of the Parallel Multipole Tree Algorithm. 3. Efficient distributed load balancing techniques applied to

- ... In the classical implementation of the tree-code algorithm all the work is done on the CPU, since special purpose hardware was not available at that time [1]. With the introduction of GRAPE special purpose hardware [16, 17], it became computationally favourable to let the special purpose hardware, instead of the CPU, calculate accelerations. Construction of the interaction list in these implementations takes nearly as much time as calculating the accelerations. ...
- Article
- May 2010

We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree-construction and calculation of multipole moments are carried out on the host CPU, while the force calculation, which consists of tree walks and evaluation of interaction lists, is carried out on the GPU. In this way we achieve a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ≈0.5. The code has a convenient user interface and is freely available for use.

- ... We started GRAPE project in 1988. The first machine we completed, the GRAPE-1 [5] was a single-board unit on which around 100 IC and LSI chips were mounted and wire-wrapped. We used commercially available IC and LSI chips to implement force calculation pipeline. ...
- Article
- Dec 2002
- J COMPUT APPL MATH

We overview our GRAvity PipE (GRAPE) project to develop special-purpose computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation of gravitational interaction between particles to a general-purpose programmable computer. By this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. Our newest machine, GRAPE-6, achieved the peak speed of , and sustained performance of , for the total budget of about 4 million USD. We also discuss relative advantages of special-purpose and general-purpose computers and the future of high-performance computing for science and technology.

- ... An extreme example is a special-purpose computer that can calculate usually only one kind of calculation. One of the most famous special-purpose computers may be GRAPE hardware that was at first developed at University of Tokyo [14]. The type of calculation is limited to the gravitational force for n stars. ...
- Article
- Jan 2008

RIMS Research Meeting Proceedings (February 18 to 20, 2008), principal investigator: Takashi Sakajo (坂上 貴之). 1606: Fast Algorithms in Computational Fluids: theory and applications.

A two-dimensional simulation model of the "magnetohydrodynamic (MHD)" vortex method, the current-vortex method, is developed. The concept is based on the previously developed current-vortex filament model in three-dimensional space. It is assumed that electric current and vorticity have discontinuous filamentary (point) distributions on the two-dimensional plane, and both the point electric current and the point vortex are confined in a filament. In other words, they share the same point on the two-dimensional plane, which is called the "current-vortex filament." The spatial profiles of the electric current and the vorticity are determined by the sum of such filaments. Time development equations for a filament are obtained by integrating the two-dimensional MHD equations around the filament. It is found that a special-purpose computer, MDGRAPE-2, is capable not only of molecular dynamics simulations but also of MHD simulations, because MDGRAPE-2 accelerates calculations of the Biot-Savart integral. The current-vortex method on MDGRAPE-2 reproduces the result obtained by the traditional MHD code on a general-purpose computer. © 2003 American Institute of Physics.
- Article
- Jan 1991
- ASTROPHYS SPACE SCI

The merger process of a binary globular cluster is discussed for a pair of unequal-mass components. We calculated the case of mass ratio 10.5 by means of an N-body code with 6144 particles in total. We found the following. The mass exchange between the components takes place through the Roche-lobe overflow. In the early stages, however, the dynamical evolution is mainly governed by escape of particles from the system. As the particles escape carrying angular momentum with them, the separation between the component clusters shrinks. The time-scale of this shrinkage depends upon the size of the clusters. When a critical separation is reached, the orbital angular momentum is transferred unstably to the spins of the component clusters. This is the process of the synchronization instability which was found in a previous study on binary clusters of equal masses. As a result the component clusters merge into a single cluster. The structures of the mergers are quite similar among different cases except for the central cores, which retain their initial central concentrations. In particular, the ellipticity and the rotation curve are quite close to each other among models of different initial radii and of different mass ratios.
- Article
- May 2011

We describe the astrophysical and numerical basis of N-body simulations, both of collisional stellar systems (dense star clusters and galactic centres) and collisionless stellar dynamics (galaxies and large-scale structure). We explain and discuss the state-of-the-art algorithms used for these quite different regimes, attempt to give a fair critique, and point out possible directions of future improvement and development. We briefly touch upon the history of N-body simulations and their most important results.
- Article
- Dec 1993
- Vistas Astron

A review of recent observations and computer simulations of galaxy groups is given. The most important aspects of the review, such as the main features and evolutionary status of compact and loose groups, dynamical mass estimations, hidden mass problem, galaxy merging, are considered from the point of view of observational data and numerical studies. A short description of the basic N-body algorithms is also given. - Article
- Dec 1994
- J Comput Chem

The special-purpose computer GRAPE-2A accelerates the calculation of pairwise interactions in many-body systems. This computer is a back-end processor connected to a host computer through a Versa Module Europe (VME) bus. GRAPE-2A receives coordinates and other physical data for particles from the host and then calculates the pairwise interactions. The host then integrates an equation of motion by using these interactions. We did molecular dynamics simulations for two systems of liquid water: System 1 (1000 molecules), and System 2 (1728 molecules). The time spent for one step of molecular dynamics was 3.9 s (System 1), and 10.2 s (System 2). The larger the molecular system, the higher the performance. The speed of GRAPE-2A did not depend on the formula describing the pairwise interaction. The cost performance was about 20 times better than that of the fastest workstations available today, and GRAPE-2A cost only $22,000. © 1994 by John Wiley & Sons, Inc.
- Article
- Dec 1999
- PARALLEL COMPUT

GRAPE (GRAvitational PipelinE) is a parallel computer dedicated to solving classical gravitational many-body problems in astrophysics. Its prototype was first designed in 1988, and the machine GRAPE-4 attained a performance of 1 teraflops equivalent in 1995. Now, a sub-petaflops machine is expected to be complete before the beginning of the 21st century by a group that is led by one of the members of the former GRAPE project. In the gravitational N-body problems, calculation of gravitational forces between pairs of bodies attracting each other is the heaviest part of computation. It requires calculations of the order of N^2 for one time-step, whereas the other calculations, such as the time marching, are only of the order of N. Therefore, in the GRAPE system, the former is performed on a dedicated computer and the latter on a general-purpose computer as a host machine. Only the former is parallelized; it is easy because we can compute forces between different pairs of bodies independently of each other when the positions of the bodies are given. Since details of GRAPE are already published elsewhere, we discuss in the present paper how such a concept of GRAPE is related to the nature of the problem. A heterogeneous system, in which problem-specific, ultra-speed machines are connected with general-purpose machines, will be one of the solutions for highly computation intensive problems in natural science. Cooperation between computer scientists and computational scientists is urgently needed for its realization.

The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics well suited for GPU(s). Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations.
We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling named GOTHIC, which adopts both the tree method and the hierarchical time step. The code adopts some adaptive optimizations by monitoring the execution time of each function on-the-fly and minimizes the time-to-solution by balancing the measured time of multiple functions. Results of performance measurements with realistic particle distribution performed on NVIDIA Tesla M2090, K20X, and GeForce GTX TITAN X, which are representative GPUs of the Fermi, Kepler, and Maxwell generation of GPUs, show that the hierarchical time step achieves a speedup by a factor of around 3 to 5 compared to the shared time step. The measured elapsed time per step of GOTHIC is 0.30 s or 0.44 s on GTX TITAN X when the particle distribution represents the Andromeda galaxy or the NFW sphere, respectively, with 2^24 = 16,777,216 particles. The averaged performance of the code corresponds to 10 to 30% of the theoretical single precision peak performance of the GPU.
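To make the idea of a hierarchical (block) time step concrete, the following Python sketch rounds each particle's desired step down to a power-of-two fraction of a maximum step and reports which levels reach a step boundary at a given tick. It is a generic illustration of the block time step scheme, not GOTHIC's actual implementation; the function names, the `max_level` cap, and the tick convention are assumptions made here.

```python
import math


def block_time_step(dt_desired, dt_max, max_level=20):
    """Return (level, dt) with dt = dt_max / 2**level and dt <= dt_desired.

    If the level is capped at max_level (an assumed safety limit), the
    returned dt may exceed dt_desired.
    """
    if dt_desired <= 0.0:
        raise ValueError("dt_desired must be positive")
    if dt_desired >= dt_max:
        return 0, dt_max
    level = math.ceil(math.log2(dt_max / dt_desired))
    level = min(level, max_level)
    return level, dt_max / (1 << level)


def active_levels(tick, finest_level):
    """Levels whose step boundary falls on this tick.

    One tick equals the finest step, dt_max / 2**finest_level.
    """
    return [
        level
        for level in range(finest_level + 1)
        if tick % (1 << (finest_level - level)) == 0
    ]


if __name__ == "__main__":
    print(block_time_step(0.3, 1.0))
    for tick in range(5):
        print(tick, active_levels(tick, finest_level=2))
```

With a shared time step every particle would advance at the finest step; here only the particles on the finest level are touched on every tick, which is where the saving comes from.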
- Article
- Dec 1992
- PUBL ASTRON SOC PAC

Recent observations have shown that globular clusters contain a substantial number of binaries most of which are believed to be primordial. We discuss different successful optical search techniques, based on radial-velocity variables, photometric variables, and the positions of stars in the color-magnitude diagram. In addition, we review searches in other wavelengths, which have turned up low-mass X-ray binaries and more recently a variety of radio pulsars. On the theoretical side, we give an overview of the different physical mechanisms through which individual binaries evolve. We discuss the various simulation techniques which recently have been employed to study the effects of a primordial binary population, and the fascinating interplay between stellar evolution and stellar dynamics which drives globular-cluster evolution. - Article
- May 1992
- ASTROPHYS J

A simple model is presented for the evolution of a primordial binary population in a globular cluster. Monte Carlo simulations are given for an initial population of 50,000 binaries against a fixed background population of 500,000 single stars in a tidally truncated cluster model. Individual histories of all binaries are followed through mass segregation, scattering recoil, escape from the cluster, or coalescence. It is found that most binaries are destroyed by binary-binary interactions, with the rest escaping in the point-mass approximation. In a more realistic model, the majority of the rest merge. At any instant, most of the remaining binaries are drifting in toward the center before their first strong encounter. A typical binary spends most of its active life in or near the cluster core. The few binaries which receive a recoil sufficient to place them in the halo past the half-mass radius remain there long enough to make a significant contribution to the radial binary distribution. - Article
- Jan 2005
- J KOREAN ASTRON SOC

We overview the GRAPE (GRAvity piPE) project. The goal of the GRAPE project is to accelerate astrophysical N-body simulations. Since almost all computing time is spent on the evaluation of the gravitational force between particles, we can greatly accelerate many N-body simulations by developing specialized hardware for the force calculation. In 1989, the first such hardware, GRAPE-1, was completed, with the peak speed of 120 Mflops. In 2003, GRAPE-6 was completed, with the peak speed of 64 Tflops, which is nearly 10^6 times faster than GRAPE-1 and was the fastest computer at that time. In this paper, we review the basic concept of the GRAPE hardware, the history of the GRAPE project, and two ongoing projects, GRAPE-DR and Project Milkyway.
- Conference Paper
- Dec 2000

As an entry for the 2000 Gordon Bell performance prize, we report the performance achieved on a prototype GRAPE-6 system. GRAPE-6 is a special-purpose computer for astrophysical N-body calculations. The present configuration has 96 custom pipeline chips, each containing six pipeline processors for the calculation of gravitational interactions between particles. Its theoretical peak performance is 2.889 Tflops. The complete GRAPE-6 system will consist of 3072 pipeline chips and will achieve a peak speed of 100 Tflops. The actual performance obtained on the present 96-chip system was 1.349 Tflops, for a simulation of massive black holes embedded in the core of a galaxy with 786,432 stars. For a short benchmark run with 1,400,000 particles, the average speed was 1.640 Tflops.
- Conference Paper
- Apr 1995

We are constructing a one tera-flops machine dedicated to astronomical many-body problems. It consists of parallelized GRAPE machines connected to a host workstation. The GRAPE machines only calculate forces between particles in the system by pipeline architecture. We designed and fabricated LSI chips for it, and about 2000 chips are being connected in parallel. The machine will be in operation by summer of 1995. General concept and features of the machine, mode of parallelization, and their merits are discussed in addition to scientific objectives of the project - Conference Paper
- Dec 1994

We describe the GRAPE-4 (Gravity Pipe 4) system, a special-purpose computer for astrophysical N-body simulations. In N-body simulations, most of the computing time is spent calculating the force between particles, since the number of interactions is proportional to the square of the number of particles. For many problems the accuracy of fast algorithms such as the particle-mesh scheme is not sufficient and we have to use the straightforward direct summation. In order to accelerate the force calculation, we have developed a series of hardware, the GRAPE (Gravity Pipe) systems. The basic idea of our GRAPE systems is to develop hardware specialized for the force calculation. The rest of the calculation is performed on the general-purpose computer connected to GRAPE. The GRAPE-4 system is our newest hardware, scheduled to be completed in early 1995. Planned peak speed is 1.15 Tflops. This speed is achieved by running 1920 pipeline LSIs, each providing 600 Mflops, in parallel. A prototype system was completed in July 1994, and the full system is now being manufactured.
- Conference Paper
- Feb 1992

The authors have developed a highly parallelized special-purpose computer GRAPE (GRAvity PipE)-3 for gravitational many-body simulations. It accelerates gravitational force calculations which are the most expensive part of the many-body simulations. The peak computing speed is equivalent to about 15 GFLOPS. The GRAPE-3 system consists of two identical boards connected to a host computer through a VME bus. Each board has 24 custom LSI GRAPE chips which calculate gravitational forces in parallel. The gravitational force calculation is easily parallelized because the forces on different particles can be calculated independently. Using the pipelined architecture, one GRAPE chip calculates one gravitational force between a pair of particles at every clock cycle. The number of floating point operations needed to calculate one force is about 30. Therefore, one GRAPE chip running at 10 MHz clock-rate has a computing speed equivalent to 0.3 GFLOPS. The GRAPE-3 with 48 GRAPE chips achieves about 15 GFLOPS. One GRAPE chip has 110000 transistors in an 8 mm×8 mm area and its power consumption is 1.2 W at 10 MHz. Its package is ceramic PGA with 181 pins. One GRAPE-3 board is a 9U Eurocard, on which 159 chips are wire-wrapped.
- Article
- Nov 1997

We overview the GRAPE-6 project, a follow-up of the teraflops GRAPE-4 project. GRAPE-6 will be completed by 1999-2000 and its planned peak speed is 200 Tflops. Its architecture will be largely similar to that of GRAPE-4, which is a specialized hardware to calculate the gravitational interaction between particles. The improvement in the speed will come mainly from the advance in the silicon semiconductor technology. GRAPE-6 will enable us to directly simulate the evolution of star clusters with up to 1 million stars.

1. Introduction. In 1988, we started the development of special-purpose computers for astrophysical N-body problems (GRAPE; [11]). The basic idea was to build a simple and small hardware, which is designed specifically to calculate gravitational interactions between particles. This hardware operates in cooperation with a general-purpose programmable computer, which performs all other calculations such as time integration and I/O (see figure 1). We believe this approach...
- Article
- May 1997

Recently, special-purpose computers have surpassed general-purpose computers in the speed with which large-scale stellar dynamics simulations can be performed. Speeds up to a Teraflops are now available, for simulations in a variety of fields, such as planetary formation, star cluster dynamics, galactic nuclei, galaxy interactions, galaxy formation, large scale structure, and gravitational lensing. Future speed increases for special-purpose computers will be even more dramatic: a Petaflops version, tentatively named the GRAPE-6, could be built within a few years, whereas general-purpose computers are expected to reach this speed somewhere in the 2010-2015 time frame. Boards with a handful of chips from such a machine could be made available to individual astronomers. Such a board, attached to a fast workstation, will then deliver Teraflops speeds on a desktop, around the year 2000. - A high performance system for the molecular dynamics simulation of biological molecules was constructed by combining a software package, Peach, with a special-purpose computer, Grape. The resultant simulator "Peach-Grape system" was used to analyze several important biological molecules including the Hin/DNA complex, the trp-Repressor/DNA complex, and Calmodulin. In addition to those simulations performed by the Peach-Grape system, other simulation studies of biomolecules by special-purpose computers are briefly reviewed.
- We performed numerical N-body simulations of a violent relaxation process using a special-purpose computer, GRAPE-1. We found that violent relaxation enhances segregation in energy space. In violent relaxation, particles with higher energy gain more energy than particles with lower energy. In two-body relaxation, however, particles with higher energy lose energy, and particles with lower energy gain it. Therefore, a self-gravitating system doesn't approach thermal equilibrium through violent relaxation. Violent relaxation changes the energies of particles through a time variation of the mean potential field caused by the coherent motion of particles. The change in the kinetic energy does not depend on the energy of the particle, itself, since coherent motion, which has a much larger energy, heats all particles in order to approach energy equipartition, regardless of their velocities. The magnitude of the change in the mean potential field in the outer region is higher than that in the inner region. Therefore, the gain in the average energy in the outer region is larger than that in the inner region. This difference in the energy gain enhances segregation in energy space, since particles with higher energy receive a greater amount of energy. Therefore, violent relaxation does not lead a self-gravitating system to thermal equilibrium. Our result implies that the energy of each particle in a merger remnant is strongly correlated with its initial energy. Therefore, the radial structures of galaxies, such as the color gradient, are likely to survive violent relaxation.
- In this paper, we describe the architecture and performance of the GRAPE-4 system, a massively parallel special-purpose computer for N-body simulation of gravitational collisional systems. The calculation cost of N-body simulation of collisional self-gravitating system is O(N^3). Thus, even with present-day supercomputers, the number of particles one can handle is still around 10,000. In N-body simulations, almost all computing time is spent calculating the force between particles, since the number of interactions is proportional to the square of the number of particles. Computational cost of the rest of the simulation, such as the time integration and the reduction of the result, is generally proportional to the number of particles. The calculation of the force between particles can be greatly accelerated by means of a dedicated special-purpose hardware. We have developed a series of hardware systems, the GRAPE (GRAvity PipE) systems, which perform the force calculation. They are used with a general-purpose host computer which performs the rest of the calculation. The GRAPE-4 system is our newest hardware, completed in the summer of 1995. Its peak speed is 1.08 TFLOPS. This speed is achieved by running 1692 pipeline large-scale integrated circuits (LSIs), each providing 640 MFLOPS, in parallel.
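The peak speeds quoted in these abstracts follow from multiplying the number of pipeline chips by the per-chip speed, and the quadratic cost follows from counting ordered particle pairs. The Python sketch below reproduces that arithmetic using only figures stated in the abstracts above; the function names are illustrative and are not part of any GRAPE software.

```python
def peak_flops(units, flops_per_unit):
    """Aggregate peak speed of identical units running in parallel."""
    return units * flops_per_unit


def pair_interactions(n):
    """Ordered particle pairs evaluated per step by direct summation."""
    return n * (n - 1)


# (number of chips, per-chip speed in flops), as quoted in the abstracts.
machines = {
    "GRAPE-3": (48, 0.3e9),
    "GRAPE-4 (planned)": (1920, 600e6),
    "GRAPE-4 (completed)": (1692, 640e6),
}

if __name__ == "__main__":
    for name, (chips, per_chip) in machines.items():
        print(f"{name}: {peak_flops(chips, per_chip) / 1e12:.3f} Tflops")
    # Pairwise force evaluations per step for a 10,000-particle system.
    print(pair_interactions(10_000))
```

For GRAPE-4 as completed, 1692 chips at 640 MFLOPS each give about 1.08 TFLOPS, in agreement with the quoted peak speed.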
- We describe the implementation of a smoothed particle hydrodynamics (SPH) scheme using GRAPE-1A, a special-purpose processor used for gravitational N-body simulations. The GRAPE-1A calculates the gravitational force exerted on a particle from all other particles in a system, while simultaneously making a list of the nearest neighbors of the particle. It is found that GRAPE-1A accelerates SPH calculations by direct summation by about two orders of magnitude for a ten thousand-particle simulation. The effective speed is 80 Mflops, which is about 30 percent of the peak speed of GRAPE-1A. Also, in order to investigate the accuracy of GRAPE-SPH, some test simulations were executed. We found that the force and position errors are smaller than those due to representing a fluid by a finite number of particles. The total energy and momentum were conserved within 0.2-0.4 percent and 2-5 × 10^-5, respectively, in simulations with several thousand particles. We conclude that GRAPE-SPH is quite effective and sufficiently accurate for self-gravitating hydrodynamics.
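The combination described above, a direct-summation force sweep that also collects a neighbor list, can be sketched in plain Python as follows. This is a software illustration of the idea only, not the GRAPE-1A pipeline: the Plummer softening length `eps`, the fixed search radius `h`, units with G = 1, and all names are assumptions introduced for this example.

```python
def forces_and_neighbors(pos, mass, eps, h):
    """Direct-summation accelerations plus neighbor lists in one sweep.

    pos  : list of (x, y, z) tuples
    mass : list of particle masses
    eps  : softening length (assumed Plummer softening)
    h    : neighbor search radius (assumed the same for all particles)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2]
            # Record j as a neighbor of i while the pair is being visited.
            if r2 < h * h:
                neighbors[i].append(j)
            inv_r3 = (r2 + eps * eps) ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc, neighbors


if __name__ == "__main__":
    pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    acc, nbrs = forces_and_neighbors(pos, [1.0, 1.0, 1.0], eps=0.0, h=2.0)
    print(acc[0], nbrs)
```

Both results come out of the same double loop over pairs, so the neighbor search adds a comparison per pair rather than a second pass over the particles.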
- Chapter
- Jan 2013

We describe the architecture and performance of GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction). It operates as an accelerator attached to general-purpose PCs or x86-based servers. The processor chip of a GRAPE-DR board has 512 cores operating at the clock frequency of 400 MHz. The peak speed of a processor chip is 410 Gflops (single precision) or 205 Gflops (double precision). A GRAPE-DR board consists of four GRAPE-DR chips, each with its own local memory of 256 MB. Thus, a GRAPE-DR board has the theoretical peak speed of 1.64 SP and 0.82 DP Tflops. Its power consumption is around 200 W. The application area of GRAPE-DR covers particle-based simulations such as astrophysical many-body simulations and molecular-dynamics simulations, quantum chemistry calculations, various applications which require dense matrix operations, and many other compute-intensive applications. The architecture of GRAPE-DR is in many ways similar to those of modern GPUs, since the evolutionary tracks are rather similar. GPUs have evolved from specialized hardwired logic for specific operations to a more general-purpose computing engine, in order to perform complex shading and other operations. The predecessor of GRAPE-DR is GRAPE (GRAvity PipE), which is a specialized pipeline processor for gravitational N-body simulations. We have changed the architecture to extend the range of applications. There are two main differences between GRAPE-DR and GPGPU. One is the transistor and power efficiency. With 90 nm technology and 400M transistors, we have integrated 512 processor cores and achieved the speed of 400 Gflops at 400 MHz clock and 50 W. A Fermi processor from NVIDIA integrates 448 processors with 3B transistors and achieved the speed of 1.03 Tflops at 1.15 GHz and 247 W. Thus, Fermi achieved 2.5 times higher speed compared to GRAPE-DR, with 2.9 times higher clock, 8 times more transistors, and 5 times more power consumption.
The other is the external memory bandwidth. GPUs typically have the memory bandwidth of around 100 GB/s, while our GRAPE-DR card, with 4 chips, has only 16 GB/s. Thus, the range of application is somewhat limited, but for suitable applications, the performance and performance per watt of GRAPE-DR are quite good. The single-card performance of the HPL benchmark is 480 Gflops for a matrix size of 48 k, and for 81 cards 37 Tflops.
- Article
- Jun 2014

A parallel single-instruction multiple data implementation of a two-level nested loop, which uses shared memory, is implemented via general-purpose computing on a graphics processing unit. The general-purpose computing on a graphics processing unit implementation is compared to MATLAB (R), C, and other implementations of the same algorithm, which are primarily executed on the central processing unit. The general-purpose computing on a graphics processing unit implementation is determined to be decisively faster (80 times) than the fastest single threaded implementation. A linear algebra implementation is determined to consume excessive memory without a corresponding increase in computational performance. Although the speedup is hardware dependent, the general-purpose computing on a graphics processing unit algorithm exploits cache memory in a manner that is severely constrained on conventional multicore central processing units. For this reason, the nested loop described here is a natural fit for the single-instruction multiple data shared memory architecture. Details of the implementation are provided. The algorithm is applied to the simulation of vortex dynamics. In particular, it is applied to simulate the rollup of a vortex filament and carry out an unsteady simulation of a thin plate in ground effect. The cases presented here would be intractable to compute without the acceleration offered by the general-purpose computing on a graphics processing unit. - Article
- Dec 2014
- Int J Quant Chem

A large number of molecular dynamics (MD) simulations have been carried out so far on personal computer clusters and conventional supercomputers using general-purpose MD software such as NAMD, GROMACS, CHARMM, and AMBER. The development of MD simulation program is closely related to the architecture of the computers. Recent trend of MD simulation is headed to large-scale calculation, long-time calculation of small systems, and calculation for large statistics. The most challenging one is the large-scale calculation. To realize the large-scale calculation, we have to efficiently control the massively parallel supercomputers with several 10 thousands of nodes. There, it is essential to develop software having a compatibility of such computer architecture. Recently, we have developed general-purpose MD software MODYLAS (MOlecular DYnamics simulation software for LArge Systems) for massively parallel supercomputers such as K-computer. Here, we describe the outline, important features, and computation of MODYLAS. An example result of the 200 ns-MD simulation of the viral capsid system consisting of about 6,500,000 atoms is also presented. © 2014 Wiley Periodicals, Inc. - Article
- Mar 2008
- JPN J MATH

We give an overview of our GRAPE (GRAvity PipE) and GRAPE-DR projects to develop dedicated computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer, dedicated to the calculation of gravitational interaction between particles, to a general-purpose programmable computer. With this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. GRAPE-6, completed in 2002, achieved the peak speed of 64 Tflops. The next machine, GRAPE-DR, will have the peak speed of 2 Pflops and will be completed in 2008. We discuss the physics of stellar systems, evolution of general-purpose high-performance computers, our GRAPE and GRAPE-DR projects and issues of numerical algorithms. - Conference Paper
- Nov 2012

In this paper, we describe the design and performance of the GRAPE-8 accelerator processor for gravitational N-body simulations. It is designed to evaluate gravitational interaction with cutoff between particles. The cutoff function is useful for schemes like TreePM or Particle-Particle Particle-Tree, in which the gravitational force is divided into short-range and long-range components. A single GRAPE-8 processor chip integrates 48 pipeline processors. The effective number of floating-point operations per interaction is around 40. Thus the peak performance of a single GRAPE-8 processor chip is 480 Gflops. A GRAPE-8 processor card houses two GRAPE-8 chips and one FPGA chip for the PCI-Express interface. The total power consumption of the board is 46 W. Thus, the theoretical peak performance per watt is 20.5 Gflops/W. The effective performance of the total system, including the host computer, is around 5 Gflops/W. This is more than a factor of two higher than the highest number in the current Green500 list. - Article
- Jan 1991

Observations about dynamical aspects of compact groups of galaxies are reviewed, and N-body numerical simulations are applied in order to lend theoretical interpretations to the observational results. In connection with dynamical evolution time scale the possible importance of an extended massive dark matter envelope is pointed out in which a compact galaxy group is embedded. - Article
- Aug 2000
- PUBL ASTRON SOC JPN

We have developed a special-purpose computer for gravitational many-body simulations, GRAPE-5. GRAPE-5 accelerates the force calculation which dominates the calculation cost of the simulation. All other calculations, such as the time integration of orbits, are performed on a general-purpose computer (host computer) connected to GRAPE-5. A GRAPE-5 board consists of eight custom pipeline chips (G5 chip) and its peak performance is 38.4 Gflops. GRAPE-5 is the successor of GRAPE-3. The differences between GRAPE-5 and GRAPE-3 are: (1) The newly developed G5 chip contains two pipelines operating at 80 MHz, while the GRAPE chip, which was used for GRAPE-3, had one at 20 MHz. The calculation speed of GRAPE-5 is 8 times faster than that of GRAPE-3. (2) The GRAPE-5 board adopted a PCI bus as the interface to the host computer instead of VME of GRAPE-3, resulting in a communication speed one order of magnitude faster. (3) In addition to the pure 1/r potential, the G5 chip can calculate forces with arbitrary cutoff functions, so that it can be applied to the Ewald or P^3M methods. (4) The pairwise force calculated on GRAPE-5 is about 10 times more accurate than that on GRAPE-3. On one GRAPE-5 board, one timestep with a direct summation algorithm takes 14 (N/128k)^2 seconds. With the Barnes-Hut tree algorithm (θ = 0.75), one timestep can be done in 15 (N/10^6) seconds. - Article
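The two GRAPE-5 timing formulas quoted in this abstract (14 (N/128k)^2 seconds per step for direct summation, 15 (N/10^6) seconds per step for the tree algorithm) can be compared with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and reading "128k" as 128 × 1024 particles is an assumption.

```python
K = 1024  # assumption: "128k" in the abstract means 128 * 1024 particles


def direct_step_seconds(n):
    """Quoted direct-summation cost: 14 (N/128k)^2 seconds per timestep."""
    return 14.0 * (n / (128 * K)) ** 2


def tree_step_seconds(n):
    """Quoted Barnes-Hut (theta = 0.75) cost: 15 (N/10^6) seconds per timestep."""
    return 15.0 * (n / 1e6)


def crossover_n():
    """Particle number at which both quoted costs are equal.

    Solving 14 (N/128k)^2 = 15 N/10^6 for N gives N = 15 (128k)^2 / (14 * 10^6).
    """
    return 15.0 * (128 * K) ** 2 / (14.0 * 1e6)


if __name__ == "__main__":
    for n in (4096, 32768, 128 * K, 10**6):
        print(n, direct_step_seconds(n), tree_step_seconds(n))
    print("crossover near N =", round(crossover_n()))
```

Under these assumptions the quoted tree cost falls below the quoted direct-summation cost once N exceeds roughly 1.8 × 10^4; the formulas are taken at face value and say nothing about accuracy differences between the two methods.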
- Apr 2008

In this paper, we briefly review some aspects of the gravitational many-body problem, which is one of the oldest problems in the modern mathematical science. Then we review our GRAPE project to design computers specialized to this problem. - Article
- May 2011
- Proc Int Astron Union

I'll overview the past, present, and future of the GRAPE project, which started as the effort to design and develop specialized hardware for gravitational N-body problem. The current hardware, GRAPE-DR, has an architecture quite different from previous GRAPEs, in the sense that it is a collection of small, but programmable processors, while previous GRAPEs had hardwired pipelines. I'll discuss pros and cons of these two approaches, comparisons with other accelerators and future directions. - Article
- Dec 1993
- Vistas Astron

In the universe there is a diversity of structures on large and small scales. On the other hand, the universe began in a state of thermal equilibrium, as evidenced by the black-body spectrum of the cosmic background radiation. The emergence of structures or information from the thermal equilibrium state awaits interpretation. Interpretations are given for different strata of structures. For the largest region, where the expansion of the universe played an essential role, the maximum possible entropy, which corresponded to the thermal equilibrium state, increased with the expansion of the universe so rapidly that the entropy production due to irreversible processes fell behind. On smaller scales, spatial structure appears as a result of the segregation of lower-entropy regions from higher-entropy regions, which is promoted by the gravothermal catastrophe. These two were studied with gaseous models, but there are collisionless systems such as large star clusters and galaxies. Though phase mixing takes place in such systems, there is no entropy production except that due to coarse graining. Using large N-body simulations, the implication of violent relaxation is critically discussed. The structures originating from collisions and mergers of galaxies and the evolution of clusters of galaxies are discussed as examples of structure formation through collisionless evolution. Such problems await further large simulations. In August 1992 we initiated a tera-flops special-purpose computer project. - Article
- Dec 2003
- PUBL ASTRON SOC JPN

In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively-parallel special-purpose computer for astrophysical $N$-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is simulations of collisional systems, though it can also be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are (a) the processor chip of GRAPE-6 integrates 6 force-calculation pipelines, compared to one pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction), (b) the clock speed is increased from 32 to 90 MHz, and (c) the total number of processor chips is increased from 1728 to 2048. These improvements resulted in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6. - Article
- Nov 1991
- PUBL ASTRON SOC JPN

We discuss the performance of a hierarchical timestep algorithm, which is Aarseth's individual timestep algorithm for N-body problems modified for use with GRAPE hardware and/or vector processors. In Aarseth's original algorithm, each particle has its own time and timestep. At each integration step, we update only one particle. To obtain the force on that particle, we predict the positions of all other particles at its time. In our GRAPE-2 system this prediction is performed on the general-purpose host computer, while the force calculation is performed on fast special-purpose hardware. Since the calculation cost of the prediction and the force calculation are comparable, the total speed is limited by the speed of the host computer. In the hierarchical timestep algorithm, we update several particles simultaneously. Therefore, we predict the positions of other particles only once for these particles. In order to update several particles, we organize the timesteps of particles in a hierarchy, where timesteps are "quantized" to powers of two. Theoretically, the number of particles that can be updated simultaneously is N_g ~ O(N^(2/3)), where N is the number of particles, if the system can be regarded as being homogeneous. For 50 <= N <= 1000, experimentally we obtained N_g ~ 0.5 N^(2/3) for a Plummer model. The efficiency that we obtained on the GRAPE-2 system is about 70% for N = 1024. - Article
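The idea of quantizing timesteps to powers of two, so that several particles come due at exactly the same time, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the `dt_max` parameter are hypothetical, and the prediction step, the force calculation, and the rules for changing a particle's timestep are all omitted.

```python
def quantize_timestep(dt_desired, dt_max=1.0):
    """Largest step of the form dt_max / 2**k (k >= 0) not exceeding dt_desired."""
    if dt_desired <= 0:
        raise ValueError("timestep must be positive")
    dt = dt_max
    while dt > dt_desired:
        dt /= 2.0  # halving keeps every step an exact power-of-two fraction
    return dt


def next_block(times, steps):
    """Return the earliest next update time and the particles due at that time.

    Because all steps are power-of-two fractions of dt_max, particles on the
    same hierarchy level reach identical times and can be updated together.
    """
    t_next = min(t + dt for t, dt in zip(times, steps))
    block = [i for i, (t, dt) in enumerate(zip(times, steps)) if t + dt == t_next]
    return t_next, block


if __name__ == "__main__":
    desired = [0.3, 0.26, 0.6, 1.5]          # illustrative per-particle steps
    steps = [quantize_timestep(d) for d in desired]
    times = [0.0] * len(steps)
    for _ in range(4):
        t, block = next_block(times, steps)
        print("t =", t, "update particles", block)
        for i in block:
            times[i] = t                      # a real code would integrate here
```

With the illustrative values above, particles 0 and 1 share the step 0.25 and are always updated together, so the positions of the other particles need to be predicted only once per block rather than once per particle.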
- Dec 1992
- ASTROPHYS J

Results of scattering experiments involving hard binaries with binding energies up to a few hundred times larger than the kinetic energy of the incoming field star are reported in the form of total and differential cross sections for a variety of processes. An accurate description of equal mass binary-single star scattering over a complete range of parameters is provided. The heating of star clusters through three-body processes, when stellar collisions can be ignored, as is the case for encounters involving degenerate stars, is illustrated by plotting the average amount of energy exchange between binaries and field stars as a function of binary hardness. - Article
- May 1991
- PUBL ASTRON SOC JPN

We have designed and built GRAPE-2 (GRAvity PipE 2), the second experimental machine for gravitational many-body systems. GRAPE-2 is designed to calculate the dynamical evolution of astronomical collisional systems which require an accuracy higher than that provided by GRAPE-1, our first experimental machine. GRAPE-2 has a word length of 32/64 bits and calculates gravitational forces at the speed of 40 Mflops. It has been built on a 43 cm by 32 cm board. - Article
- May 1993
- PUBL ASTRON SOC JPN

We describe the software system used for GRAPE processors, special-purpose computers for gravitational N-body simulations. In gravitational N-body simulations, almost all of the calculation time is spent to calculate the gravitational force between particles. The GRAPE hardware calculates the gravitational force between particles using hardwired pipelines with a speed in the range of 100 Mflops to 10 Gflops, depending on the model. All GRAPE hardware systems are connected to general-purpose workstations, on which the user program runs. In order to use the GRAPE hardware, a user program calls several library subroutines that actually control GRAPE. In this paper, we present an overview of the user interface of GRAPE software libraries and describe how they work. We also describe how the GRAPE system is used with sophisticated algorithms, such as the tree algorithm or the individual time step algorithm. - Article
- Jul 1991
- PUBL ASTRON SOC JPN

We describe an implementation of the modified Barnes-Hut tree algorithm for a gravitational N-body calculation on a GRAPE (GRAvity PipE) backend processor. GRAPE is a special-purpose computer for N-body calculations. It receives the positions and masses of particles from a host computer and then calculates the gravitational force at each coordinate specified by the host. To use this GRAPE processor with the hierarchical tree algorithm, the host computer must maintain a list of all nodes that exert force on a particle. If we create this list for each particle of the system at each timestep, the number of floating-point operations on the host and that on GRAPE would become comparable, and the increased speed obtained by using GRAPE would be small. In our modified algorithm, we create a list of nodes for many particles. Thus, the amount of the work required of the host is significantly reduced. This algorithm was originally developed by Barnes in order to vectorize the force calculation on a Cyber 205. With this algorithm, the computing time of the force calculation becomes comparable to that of the tree construction, if the GRAPE backend processor is sufficiently fast. The obtained speed-up factor is 30 to 50 for a RISC-based host computer and GRAPE-1A with a peak speed of 240 Mflops.
- We performed a set of N-body simulations of the merging of two identical spherical galaxies using a special-purpose computer, GRAPE-1. We found that the kinematic properties of the resulting merger remnants are consistent with those of observed elliptical galaxies. The ratios of the maximum rotation velocity (V_max) to the velocity dispersion (σ_0) are smaller than 0.6, even in the case of large impact parameters. This is in good agreement with the lack of elliptical galaxies whose V_max/σ_0 exceeds 0.6.
Previous N-body calculations have contradicted the observed kinematic studies of elliptical galaxies, because relaxation was numerically enhanced due to the small number of particles (approximately 500). Our simulations with N = 16384 resolve this problem. The central velocity dispersion increased by 10-20% through merging for a wide range of impact parameters. This increase is consistent with the Faber-Jackson relation.
- We have designed and built GRAPE (GRAvity PipE)-1A, a special-purpose computer for N-body simulations using the O(N log N) tree code. GRAPE-1A calculates the gravitational force between particles. A host computer, which is connected to GRAPE-1A, performs all other calculations. GRAPE-1A differs from its predecessor, GRAPE-1, which was designed for the N^2 direct summation, in three respects: a) it can handle particles with different masses, b) the speed of communication between the host computer and GRAPE-1A has been increased, and c) a neighbor list unit has been added. The peak speed of GRAPE-1A is equivalent to 240 Mflops. The effective speed is > 200 Mflops for the direct summation, and around 80 Mflops for the tree algorithm. Using GRAPE-1A and the modified tree algorithm code, one time step takes 2 s for N = 4096 and 20 s for N = 32768. GRAPE-1A can complete either a simulation of interacting galaxies (N = 10^5, 5000 timesteps) or a cosmological simulation (N = 10^6, 500 timesteps) within one week. GRAPE-1A can also be applied to particle-based hydrodynamical simulations such as smoothed-particle hydrodynamics (SPH). The performance of GRAPE-1A is comparable to that of a supercomputer with a peak speed of 500 Mflops for the direct summation, the modified tree algorithm, and SPH alike.
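The two timings quoted for GRAPE-1A with the modified tree algorithm (2 s per step at N = 4096, 20 s per step at N = 32768) can be checked against N log N scaling with a short calculation. The sketch below assumes a pure N log2 N cost calibrated on the N = 4096 figure; that cost model and the function names are illustrative assumptions, not something the abstract states.

```python
import math


def tree_step_seconds(n, n_ref=4096, t_ref=2.0):
    """Per-step time assuming cost ~ N log2 N, calibrated at (n_ref, t_ref)."""
    return t_ref * (n * math.log2(n)) / (n_ref * math.log2(n_ref))


def run_days(n, timesteps):
    """Wall-clock days for a run of `timesteps` steps under the same model."""
    return tree_step_seconds(n) * timesteps / 86400.0


if __name__ == "__main__":
    print("N = 32768:", tree_step_seconds(32768), "s per step")
    print("interacting galaxies:", run_days(10**5, 5000), "days")
    print("cosmological run:", run_days(10**6, 500), "days")
```

Going from N = 4096 to N = 32768 multiplies N by 8 and log2 N by 15/12, a factor of 10 in total, which reproduces the quoted 20 s. Under the same assumed model both example runs come out at roughly four to five days, in line with the "within one week" statement.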
- Article
- Jan 1993
- PUBL ASTRON SOC JPN

We have developed a highly parallelized special-purpose computer, GRAPE (GRAvity PipE)-3, for gravitational many-body simulations. Its peak computing speed is equivalent to 15 Gflops. The GRAPE-3 system comprises two identical boards connected to a host computer (workstation) through the VME bus. Each board has 24 custom LSI chips (GRAPE chips) which calculate gravitational forces in parallel. The calculation of the gravitational forces is easily parallelized, since the forces on different particles can be calculated independently. One GRAPE chip running at a 10 MHz clock has a computing speed equivalent to 0.3 Gflops; the GRAPE-3 system with 48 GRAPE chips thus achieves a peak speed of 15 Gflops. The sustained speed of the GRAPE-3 system reached 10 Gflops-equivalent. - Article
- Oct 1996
- ASTRON ASTROPHYS REV

Galactic globular clusters, which are ancient building blocks of our Galaxy, represent a very interesting family of stellar systems in which some fundamental dynamical processes have taken place on time scales shorter than the age of the universe. In contrast with galaxies, these clusters represent unique laboratories for learning about two-body relaxation, mass segregation from equipartition of energy, stellar collisions, stellar mergers, and core collapse. In the present review, we summarize the tremendous developments, as much theoretical as observational, that have taken place during the last two decades, and which have led to a quantum jump in our understanding of these beautiful dynamical systems.

- Article
- Dec 1986
- NATURE

A novel method is described of directly calculating the force on N bodies in the gravitational N-body problem, with a cost that grows only as N log N. The technique uses a tree-structured hierarchical subdivision of space into cubic cells, each of which is recursively divided into eight subcells whenever more than one particle is found to occupy the same cell. This tree is constructed anew at every time step, avoiding ambiguity and tangling. Advantages over potential-solving codes include accurate local interactions, freedom from geometrical assumptions and restrictions, and applicability to a wide class of systems, including planetary, stellar, galactic, and cosmological ones. Advantages over previous hierarchical tree-codes include simplicity and the possibility of rigorous analysis of error. - Article
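The recursive subdivision described in this abstract, where a cubic cell is split into eight subcells as soon as it would hold more than one particle, can be sketched in Python as follows. This is an illustration rather than the published code: the class and function names are hypothetical, and the opening-angle test (`theta`), the softening length `eps`, and units with G = 1 are assumptions that the abstract does not spell out. Particles are assumed to have distinct positions inside the root cube.

```python
class Cell:
    """Cubic cell holding its total mass, centre of mass, and up to 8 subcells."""

    def __init__(self, center, half):
        self.center, self.half = center, half
        self.mass = 0.0
        self.com = [0.0, 0.0, 0.0]
        self.body = None       # (pos, mass) while the cell holds exactly one particle
        self.children = None   # list of 8 subcells once the cell has been split

    def _child_for(self, pos):
        # Bit k of the index is set when pos lies on the upper side along axis k.
        return self.children[sum((pos[k] >= self.center[k]) << k for k in range(3))]

    def insert(self, pos, m):
        if self.children is None and self.body is None:
            self.body = (pos, m)                   # empty cell: store the particle
        else:
            if self.children is None:              # second particle: split into 8
                h = self.half / 2
                self.children = [
                    Cell([self.center[k] + (h if (i >> k) & 1 else -h)
                          for k in range(3)], h)
                    for i in range(8)
                ]
                old_pos, old_m = self.body
                self.body = None
                self._child_for(old_pos).insert(old_pos, old_m)
            self._child_for(pos).insert(pos, m)
        total = self.mass + m                      # update mass and centre of mass
        self.com = [(self.com[k] * self.mass + pos[k] * m) / total for k in range(3)]
        self.mass = total


def build_tree(bodies, center=(0.0, 0.0, 0.0), half=1.0):
    """Build the tree from scratch, as is done at every time step."""
    root = Cell(list(center), half)
    for pos, m in bodies:
        root.insert(pos, m)
    return root


def accel(cell, pos, theta=0.75, eps=1e-3):
    """Acceleration at pos; a cell is used whole when its size / distance < theta."""
    if cell.mass == 0.0:
        return [0.0, 0.0, 0.0]
    d = [cell.com[k] - pos[k] for k in range(3)]
    r2 = sum(x * x for x in d)
    if cell.children is None or (2 * cell.half) ** 2 < theta ** 2 * r2:
        if r2 == 0.0:
            return [0.0, 0.0, 0.0]                 # the particle itself
        f = cell.mass / (r2 + eps * eps) ** 1.5
        return [f * x for x in d]
    out = [0.0, 0.0, 0.0]
    for child in cell.children:                    # too close: open the cell
        a = accel(child, pos, theta, eps)
        out = [out[k] + a[k] for k in range(3)]
    return out
```

Setting `theta=0` opens every cell and reduces the walk to a direct sum over individual particles, which gives a simple way to check the tree against the N^2 result.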
- Dec 1982
- PHYS LETT A

We report a molecular-dynamics simulation of very large two-dimensional Lennard-Jones systems (up to 16383 particles). The simulation was carried out on a special-purpose processor built by Bakker. The processor's principal features and possibilities are described. Preliminary results of measurements along the isochore ϱ∗ = 0.94 are presented. - Article
- Jun 1989
- Comput Phys Rep

The author evaluates the performance of the Connection Machine, a highly parallel computer with 65,536 processors and a peak speed of 10 Gflops, on several types of gravitational N-body simulations. He compares the results with similar tests on a variety of more traditional supercomputers. For either type of computer, the most efficient algorithm for simulating an arbitrary, very large system of self-gravitating particles, such as star clusters or galaxies, has a force calculation pattern based on a tree structure. This tree structure is highly irregular and rapidly changing in time. This algorithm therefore presents an extreme challenge for hardware as well as software of fast computers. The author presents benchmarks for this algorithm and also for a much simpler algorithm, together with a detailed analysis of the factors which determine the efficiency. - Article
- May 1990
- Nature

A processor has been constructed using a 'pipeline' architecture to simulate many-body systems with long-range forces. It has a speed equivalent to 120 megaflops, and the architecture can be readily parallelized to make teraflop machines a feasible possibility. The machine can be adapted to study molecular dynamics, plasma dynamics and astrophysical hydrodynamics with only minor modifications. - Article
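The division of labour behind such a pipeline machine, where the back end sums pairwise long-range forces and a host does everything else, can be illustrated with a small Python sketch. It only mimics the arithmetic in software: the function names are hypothetical, and the softened inverse-square force (softening length `eps`), units with G = 1, and the leapfrog integrator on the host side are all assumptions made for illustration.

```python
def pipeline_forces(pos, mass, eps=0.01):
    """Accelerations from direct summation over all pairs (the back-end job).

    For each particle i this accumulates m_j (x_j - x_i) / (r_ij^2 + eps^2)^(3/2),
    the same fixed sequence of operations for every pair, which is what makes
    the calculation suitable for a hardwired pipeline.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * inv_r3 * d[k]
    return acc


def host_step(pos, vel, mass, dt, eps=0.01):
    """One kick-drift-kick leapfrog step (the host job), updating pos and vel in place."""
    acc = pipeline_forces(pos, mass, eps)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]
            pos[i][k] += dt * vel[i][k]
    acc = pipeline_forces(pos, mass, eps)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]


if __name__ == "__main__":
    pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    vel = [[0.0, -0.5, 0.0], [0.0, 0.5, 0.0]]
    mass = [1.0, 1.0]
    for _ in range(10):
        host_step(pos, vel, mass, dt=0.01)
    print(pos)
```

The force loop costs O(N^2) operations per step while the host loop costs O(N), which is why moving only the force loop into dedicated hardware captures most of the work.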
- Nov 1988

Globular star clusters provide a unique laboratory. In astronomy they present an opportunity to study dense stellar systems that are more accessible than galactic nuclei. In statistical mechanics, concepts of negative heat capacity and resulting gravothermal instabilities challenge the present framework of statistical descriptions of dynamical systems. In computer science, modelling their evolution poses an extreme challenge to the hardware and software capabilities of the next generation of parallel computers, and provides an ideal test case for teraflop machines.
- We have designed and built the Orrery, a special computer for high-speed high-precision orbital mechanics computations. On the problems the Orrery was designed to solve, it achieves approximately 10 Mflops in about 1 ft^3 of space while consuming 150 W of power. The specialized parallel architecture of the Orrery, which is well matched to orbital mechanics problems, is the key to obtaining such high performance. In this paper we discuss the design, construction, and programming of the Orrery. Copyright © 1985 by The Institute of Electrical and Electronics Engineers, Inc.
- J. Makino and P. Hut, Comput. Phys. Rep. 9 (1989) 199.
- R.W. Hockney and J.W. Eastwood, Computer Simulation using Particles (McGraw-Hill, New York, 1981).
- A.F. Bakker and C. Bruin, in: Special Purpose Computers, ed. Berni J. Alder (Academic Press, New York, 1988) p. 183.
- D. Sugimoto, Y. Chikada, J. Makino, T. Ito, T. Ebisuzaki and M. Umemura, Nature 345 (1990) 33.
- P. Hut, J. Makino and S.L.W. McMillan, Nature 336 (1988) 31.
- J. Barnes and P. Hut, Nature 324 (1986) 446.
- J.H. Applegate, M.R. Douglas, Y. Gürsel, P. Hunter, C.L. Seitz and G.J. Sussman, IEEE Trans. Comput. C 34 (1985) 822.