Search Results: Records 1-20 of 43 displayed on this page


Journal Articles

Tree cutting approach for domain partitioning on forest-of-octrees-based block-structured static adaptive mesh refinement with lattice Boltzmann method

Hasegawa, Yuta; Aoki, Takayuki*; Kobayashi, Hiromichi*; Idomura, Yasuhiro; Onodera, Naoyuki

Parallel Computing, 108, p.102851_1 - 102851_12, 2021/12

The aerodynamics simulation code based on the lattice Boltzmann method (LBM) using forest-of-octrees-based block-structured local mesh refinement (LMR) was implemented, and its performance was evaluated on GPU-based supercomputers. We found that the conventional space-filling-curve-based (SFC) domain partitioning algorithm results in costly halo communication in our aerodynamics simulations. Our new tree cutting approach improved the locality and the topology of the partitioned sub-domains and reduced the communication cost to one-third or one-fourth of that of the original SFC approach. In the strong scaling test, the code achieved a maximum speedup of $$\times 1.82$$ at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs. In the weak scaling test, the code achieved 9620 MLUPS on 128 GPUs with 4.473 billion grid points, with a parallel efficiency of 93.4% from 8 to 128 GPUs.
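
As background, a minimal sketch of the conventional SFC partitioning that the tree cutting approach improves on: leaf blocks are ordered by a Morton (Z-order) key and the curve is cut into equal contiguous chunks. This is an illustrative reconstruction, not code from the paper; all names are ours.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <vector>

    // Spread the lower 10 bits of v so that two zero bits separate each bit.
    static uint32_t spread(uint32_t v) {
        v &= 0x3ff;
        v = (v | (v << 16)) & 0x030000ff;
        v = (v | (v <<  8)) & 0x0300f00f;
        v = (v | (v <<  4)) & 0x030c30c3;
        v = (v | (v <<  2)) & 0x09249249;
        return v;
    }

    // 30-bit Morton key: the position of a leaf block along the Z-order curve.
    static uint32_t mortonKey(uint32_t x, uint32_t y, uint32_t z) {
        return spread(x) | (spread(y) << 1) | (spread(z) << 2);
    }

    // Sort blocks along the curve, then cut it into nRanks contiguous chunks.
    std::vector<int> partitionBySFC(std::vector<std::array<uint32_t, 3>>& blocks,
                                    int nRanks) {
        std::sort(blocks.begin(), blocks.end(), [](const auto& a, const auto& b) {
            return mortonKey(a[0], a[1], a[2]) < mortonKey(b[0], b[1], b[2]);
        });
        std::vector<int> owner(blocks.size());
        for (size_t i = 0; i < blocks.size(); ++i)
            owner[i] = static_cast<int>(i * nRanks / blocks.size());
        return owner;
    }

Contiguous chunks of the curve can still form ragged, poorly connected sub-domains, which is the halo-communication problem the tree cutting approach addresses.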

Journal Articles

Optimization strategy for a performance portable Vlasov code

Asahi, Yuichi; Latu, G.*; Bigot, J.*; Grandgirard, V.*

Proceedings of 2021 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) (Internet), p.79 - 91, 2021/11

This paper presents optimization strategies dedicated to a kinetic plasma simulation code that makes use of OpenACC/OpenMP directives and the Kokkos performance portability framework to run across multiple CPUs and GPUs. We evaluate the impacts of the optimizations on multiple hardware platforms: Intel Xeon Skylake, Fujitsu Arm A64FX, and Nvidia Tesla P100 and V100. After the optimizations, the OpenACC/OpenMP version achieved speedups of 1.07 to 1.39, and the Kokkos version in turn achieved speedups of 1.00 to 1.33. Since the impact of the optimizations is demonstrated under multiple combinations of kernels, devices and parallel implementations, this paper provides a broadly applicable approach to accelerating a code while keeping performance portability. To achieve excellent performance on both CPUs and GPUs, Kokkos is a reasonable choice, offering more flexibility to manage multiple data and loop structures within a single codebase.
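
For illustration, a minimal Kokkos kernel in the single-source style the paper evaluates: the same C++ code compiles to whichever backend was chosen at build time (OpenMP on CPUs, CUDA on GPUs). The stencil itself is our own toy example, not taken from the code under study.

    #include <Kokkos_Core.hpp>

    // A 2D averaging stencil written once and run on the configured backend.
    void smooth(Kokkos::View<double**> out, Kokkos::View<const double**> in) {
        const int nx = static_cast<int>(out.extent(0));
        const int ny = static_cast<int>(out.extent(1));
        Kokkos::parallel_for("smooth",
            Kokkos::MDRangePolicy<Kokkos::Rank<2>>({1, 1}, {nx - 1, ny - 1}),
            KOKKOS_LAMBDA(const int i, const int j) {
                out(i, j) = 0.25 * (in(i - 1, j) + in(i + 1, j)
                                  + in(i, j - 1) + in(i, j + 1));
            });
    }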

Journal Articles

Improved domain partitioning on tree-based mesh-refined lattice Boltzmann method

Hasegawa, Yuta; Aoki, Takayuki*; Kobayashi, Hiromichi*; Idomura, Yasuhiro; Onodera, Naoyuki

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 6 Pages, 2021/05

We introduce an improved domain partitioning method called the "tree cutting approach" for our aerodynamics simulation code based on the lattice Boltzmann method (LBM) with forest-of-octrees-based local mesh refinement (LMR). The conventional domain partitioning algorithm based on the space-filling curve (SFC), which is widely used in LMR, caused costly halo data communication, which became a bottleneck of our aerodynamics simulations on GPU-based supercomputers. Our tree cutting approach adopts a hybrid domain partitioning: a coarse structured block decomposition combined with SFC partitioning within each block. This hybrid approach improved the locality and the topology of the partitioned sub-domains and reduced the amount of halo communication to one-third of that of the original SFC approach. The code achieved a $$\times 1.23$$ speedup on 8 GPUs, and a $$\times 1.82$$ speedup at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs in the strong scaling test.
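
The hybrid partitioning can be pictured as a two-level sort key: a coarse block index first, the SFC position within that block second. A hypothetical sketch (our names), reusing the mortonKey() from the earlier sketch:

    #include <cstdint>

    // Two-level partition key for the hybrid decomposition: leaves are grouped
    // by the coarse structured block they fall in, and ordered by the
    // space-filling curve only within that block.
    struct HybridKey {
        uint32_t coarseBlock;  // index of the coarse structured block
        uint32_t sfcKey;       // Morton key of the leaf inside that block
        bool operator<(const HybridKey& o) const {
            if (coarseBlock != o.coarseBlock) return coarseBlock < o.coarseBlock;
            return sfcKey < o.sfcKey;
        }
    };

Cutting the sorted key sequence into contiguous chunks then keeps each rank's cells inside a few coarse blocks, which is what improves the sub-domain topology.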

Journal Articles

Acceleration of locally mesh allocated Poisson solver using mixed precision

Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Shimokawabe, Takashi*; Aoki, Takayuki*

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 3 Pages, 2021/05

We develop a mixed-precision preconditioner for the pressure Poisson equation in the two-phase flow CFD code JUPITER-AMR. The multi-grid (MG) preconditioner is constructed based on the geometric MG method with a three-stage V-cycle and a cache-reuse SOR (CR-SOR) method at each stage. The numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. The MG-CG solver in single precision shows the same convergence histories as the double-precision version, while requiring about 75% of the double-precision computational time. In the strong scaling test, the MG-CG solver in single precision is accelerated by 1.88 times between 32 and 96 GPUs.
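
The mixed-precision pattern described here can be sketched as follows: the outer CG iteration stays in double precision while the MG V-cycle preconditioner runs entirely in single precision. A minimal sketch under our own names; the float V-cycle is a placeholder for the three-stage CR-SOR cycle.

    #include <vector>

    // Apply the preconditioner z = M^{-1} r with the V-cycle done in float:
    // demote the double-precision residual, run the float V-cycle, promote back.
    void applyPreconditioner(const std::vector<double>& r, std::vector<double>& z,
                             void (*vcycleF)(const std::vector<float>&,
                                             std::vector<float>&)) {
        std::vector<float> rf(r.begin(), r.end());  // demote residual to float
        std::vector<float> zf(r.size(), 0.0f);
        vcycleF(rf, zf);                            // heavy work in single precision
        z.assign(zf.begin(), zf.end());             // promote result to double
    }

Since the preconditioner only needs to approximate the inverse, the reduced precision does not change the outer solver's convergence, while halving the memory traffic of the most expensive kernels.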

Journal Articles

Multi-resolution steady flow prediction with convolutional neural networks

Asahi, Yuichi; Hatayama, Sora*; Shimokawabe, Takashi*; Onodera, Naoyuki; Hasegawa, Yuta; Idomura, Yasuhiro

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 4 Pages, 2021/05

We develop a convolutional neural network model to predict multi-resolution steady flows. Based on the state-of-the-art image-to-image translation model Pix2PixHD, our model can predict the high-resolution flow field from the signed distance function. By splitting the high-resolution data into patches, the memory requirements of our model are reduced compared to Pix2PixHD.
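
The patching idea is simple index arithmetic: the high-resolution field is split into fixed-size tiles that are fed to the network independently, bounding peak memory. A minimal sketch (our names; assumes the field dimensions are divisible by the patch size):

    #include <vector>

    // Split an H x W field into non-overlapping p x p patches.
    std::vector<std::vector<float>> toPatches(const std::vector<float>& field,
                                              int H, int W, int p) {
        std::vector<std::vector<float>> patches;
        for (int by = 0; by < H; by += p)
            for (int bx = 0; bx < W; bx += p) {
                std::vector<float> patch(p * p);
                for (int y = 0; y < p; ++y)
                    for (int x = 0; x < p; ++x)
                        patch[y * p + x] = field[(by + y) * W + (bx + x)];
                patches.push_back(std::move(patch));
            }
        return patches;
    }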

Journal Articles

GPU acceleration of multigrid preconditioned conjugate gradient solver on block-structured Cartesian grid

Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Yamashita, Susumu; Shimokawabe, Takashi*; Aoki, Takayuki*

Proceedings of International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021) (Internet), p.120 - 128, 2021/01

 Times Cited Count: 0, Percentile: 0.01

We develop a multigrid preconditioned conjugate gradient (MG-CG) solver for the pressure Poisson equation in the two-phase flow CFD code JUPITER. The MG preconditioner is constructed based on the geometric MG method with a three-stage V-cycle, and a red-black SOR (RB-SOR) smoother and its variant with cache-reuse optimization (CR-SOR) are applied at each stage. The numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. The MG-CG solvers with the RB-SOR and CR-SOR smoothers reduce the number of iterations to less than 15% and 9%, respectively, of that of the original preconditioned CG method, leading to 3.1- and 5.9-times speedups. The obtained performance indicates that the MG-CG solver designed for the block-structured grid is highly efficient and enables large-scale simulations of two-phase flows on GPU-based supercomputers.
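
For reference, one red-black SOR sweep on a uniform-grid Poisson problem looks like the sketch below: points of one color read only points of the other color, so each color can be updated fully in parallel, which is what makes the smoother GPU-friendly. A generic textbook sketch, not the JUPITER implementation.

    // One red-black SOR sweep for -lap(u) = f on an n x n grid with spacing h.
    // color is 0 (red) or 1 (black); (i + j) % 2 selects the color.
    void rbsorSweep(double* u, const double* f, int n, double h, double omega,
                    int color) {
        for (int j = 1; j < n - 1; ++j)
            for (int i = 1; i < n - 1; ++i) {
                if (((i + j) & 1) != color) continue;
                const double gs = 0.25 * (u[(j - 1) * n + i] + u[(j + 1) * n + i]
                                        + u[j * n + i - 1] + u[j * n + i + 1]
                                        + h * h * f[j * n + i]);
                u[j * n + i] += omega * (gs - u[j * n + i]);  // SOR relaxation
            }
    }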

Journal Articles

Performance portable implementation of a kinetic plasma simulation mini-app with a higher level abstraction and directives

Asahi, Yuichi; Latu, G.*; Bigot, J.*; Grandgirard, V.*

Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.218 - 224, 2020/10

Performance portability is expected to be a critical issue in the upcoming exascale era. We explore a performance portable approach for a fusion plasma turbulence simulation code employing the kinetic model, namely the GYSELA code. For this purpose, we extract the key features of GYSELA, such as the high dimensionality (more than 4D) and the semi-Lagrangian scheme, and encapsulate them into a mini-application which solves a similar but simplified Vlasov-Poisson system. We implement the mini-app with OpenACC, OpenMP4.5 and Kokkos, avoiding unnecessary duplication of code lines. Based on our experience, we discuss the advantages and disadvantages of OpenACC, OpenMP4.5 and Kokkos from the viewpoint of performance portability, readability and productivity.
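
To make the directive comparison concrete, here is the same toy loop offloaded with each of the two directive models the paper compares (the Kokkos variant would look like the earlier Kokkos sketch in this list). The loop body is our own example, not GYSELA code.

    // One loop, two directive models. Compile with -DUSE_OPENACC for OpenACC.
    void scale(double* x, const double* y, double a, int n) {
    #ifdef USE_OPENACC
        #pragma acc parallel loop copyout(x[0:n]) copyin(y[0:n])
    #else  // OpenMP 4.5 target offload
        #pragma omp target teams distribute parallel for \
                map(from: x[0:n]) map(to: y[0:n])
    #endif
        for (int i = 0; i < n; ++i)
            x[i] = a * y[i];
    }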

Journal Articles

Ensemble wind simulations using a mesh-refined lattice Boltzmann method on GPU-accelerated systems

Hasegawa, Yuta; Onodera, Naoyuki; Idomura, Yasuhiro

Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.236 - 242, 2020/10

The wind conditions and plume dispersion in urban areas are strongly affected by buildings and plants, which are hardly resolved in conventional mesoscale simulations. To resolve this issue, we developed a GPU-based CFD code using a mesh-refined lattice Boltzmann method (LBM), which enables real-time plume dispersion simulations with a resolution of several meters. However, such high-resolution simulations are highly turbulent, and the time histories of the results are sensitive to various simulation conditions. In order to improve the reliability of such chaotic simulations, we developed an ensemble simulation approach, which enables a statistical estimation of the uncertainty. We examined the developed code against the field experiment JU2003 in Oklahoma City. In the comparison, the wind conditions showed good agreement, and the average values of the tracer gas concentration satisfied factor-2 agreement between the ensemble simulation data and the experiment.
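
The factor-2 criterion used in such validations is commonly quantified as the FAC2 score: the fraction of predicted concentrations within a factor of two of the observed values. A minimal sketch of that metric (our code; assumes strictly positive observations):

    #include <vector>

    // FAC2: fraction of predictions within a factor of 2 of the observations.
    double fac2(const std::vector<double>& pred, const std::vector<double>& obs) {
        int hits = 0;
        for (size_t i = 0; i < pred.size(); ++i) {
            const double r = pred[i] / obs[i];
            if (r >= 0.5 && r <= 2.0) ++hits;
        }
        return static_cast<double>(hits) / static_cast<double>(pred.size());
    }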

Journal Articles

GPU-acceleration of locally mesh allocated two phase flow solver for nuclear reactors

Onodera, Naoyuki; Idomura, Yasuhiro; Ali, Y.*; Yamashita, Susumu; Shimokawabe, Takashi*; Aoki, Takayuki*

Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.210 - 215, 2020/10

This paper presents a GPU-based Poisson solver on a block-based adaptive mesh refinement (block-AMR) framework. The block-AMR method is essential for GPU computation and for an efficient description of the nuclear reactor geometry. In this paper, we successfully implement a conjugate gradient method with a state-of-the-art multi-grid preconditioner (MG-CG) on the block-AMR framework. GPU kernel performance was measured on the GPU-based supercomputer TSUBAME3.0. The vector-vector sum, the matrix-vector product, and the dot product in the CG kernel achieved good bandwidth performance, at about 60% of the peak. In the MG kernel, the smoothers in a three-stage V-cycle MG method are implemented using a mixed-precision RB-SOR method, which also gave good performance. For a large-scale Poisson problem with $$453.0 \times 10^6$$ cells, the developed MG-CG method reduced the number of iterations to less than 30% of that of the original preconditioned CG method and achieved a $$\times 2.5$$ speedup.

Journal Articles

GPU-acceleration of locally mesh allocated Poisson solver

Onodera, Naoyuki; Idomura, Yasuhiro; Ali, Y.*; Shimokawabe, Takashi*; Aoki, Takayuki*

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 25, 4 Pages, 2020/06

We have developed the stencil-based CFD code JUPITER for simulating three-dimensional multiphase flows. A GPU-accelerated Poisson solver based on the preconditioned conjugate gradient (P-CG) method with a multigrid preconditioner was developed for JUPITER with a block-structured AMR mesh. All Poisson kernels were implemented in CUDA, and the GPU kernel functions are well tuned to achieve high performance on GPU supercomputers. The developed multigrid solver reduces the number of iterations to about 1/7 of that of the original P-CG method, and a $$\times 3$$ speedup is achieved in the strong scaling test from 8 to 216 GPUs on TSUBAME 3.0.

Journal Articles

A Large-scale aerodynamics study on bicycle racing

Aoki, Takayuki*; Hasegawa, Yuta

Jidosha Gijutsu, 74(4), p.18 - 23, 2020/04

Aerodynamics studies for bicycle racing have been carried out using CFD simulations based on an LES model. For a single cyclist and for groups of 2-4 cyclists, the computed drag forces are in good agreement with wind-tunnel experiments. Different group formations and two competing teams are also studied. A large-scale computation for a group of 72 cyclists has been performed using 2.23 billion mesh cells on a GPU supercomputer.

Journal Articles

Inner and outer-layer similarity of the turbulence intensity profile over a realistic urban geometry

Inagaki, Atsushi*; Wangsaputra, Y.*; Kanda, Manabu*; Yücel, M.*; Onodera, Naoyuki; Aoki, Takayuki*

SOLA (Scientific Online Letters on the Atmosphere) (Internet), 16, p.120 - 124, 2020/00

 Times Cited Count: 0, Percentile: 0.01 (Meteorology & Atmospheric Sciences)

The similarity of the turbulence intensity profile under inner-layer and outer-layer scalings was examined for an urban boundary layer using numerical simulations. The simulations consider a developing neutral boundary layer over realistic building geometry. The computational domain covers 19.2 km by 4.8 km and extends up to a height of 1 km, with 2-m grids. Several turbulence intensity profiles are defined locally in the computational domain. The inner- and outer-layer scalings work well in reducing the scatter of the turbulence intensity within the inner and outer layers, respectively, regardless of the surface geometry. Although the main scatter among the scaled profiles is attributed to a mismatch between the part of the layer and the scaling parameters, its behavior can also be explained by introducing a non-dimensional parameter composed of the ratio of the length or velocity scales.
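
As a generic illustration of the two scalings (our notation, hedged; the paper's exact definitions may differ): with streamwise turbulence intensity $$\sigma_u$$, friction velocity $$u_*$$, boundary-layer depth $$\delta$$ and a surface length scale such as the mean building height $$z_H$$, the outer- and inner-layer scalings collapse the profiles as

$$ \sigma_u(z)/u_* = F_{outer}(z/\delta), \qquad \sigma_u(z)/u_* = F_{inner}(z/z_H), $$

so profiles from different locations coincide when plotted against the appropriately normalized height.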

Journal Articles

Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

Matsumoto, Kazuya*; Idomura, Yasuhiro; Ina, Takuya*; Mayumi, Akie; Yamada, Susumu

Journal of Supercomputing, 75(12), p.8115 - 8146, 2019/12

 Times Cited Count: 1, Percentile: 26.51 (Computer Science, Hardware & Architecture)

A communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU-GPU cluster, targeting the performance acceleration of the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code GT5D. In addition to the CA-GMRES, we implement and evaluate a modified variant of CA-GMRES (M-CA-GMRES) proposed in our previous study to reduce the amount of floating-point calculations. This study demonstrates that the beneficial features of the CA-GMRES are its minimal number of collective communications and its highly efficient calculations based on dense matrix-matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 GPUs per compute node. The evaluation results show that the M-CA-GMRES is 1.09x, 1.22x and 1.50x faster than the CA-GMRES, the generalized conjugate residual method (GCR), and the GMRES, respectively, when 64 GPUs are used.
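
The communication-avoiding idea, stated generically (our notation): instead of one global reduction per Krylov vector, $$s$$ basis vectors are generated at once,

$$ V_s = [\, v, \; Av, \; A^2 v, \; \dots, \; A^s v \,], $$

and orthogonalized with a single block reduction (e.g. a tall-skinny QR), so the number of collective communications per $$s$$ iterations drops from $$O(s)$$ to $$O(1)$$; in practice a better-conditioned basis (e.g. a Newton basis) replaces the monomial one.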

Journal Articles

GPU acceleration of communication avoiding Chebyshev basis conjugate gradient solver for multiphase CFD simulations

Ali, Y.*; Onodera, Naoyuki; Idomura, Yasuhiro; Ina, Takuya*; Imamura, Toshiyuki*

Proceedings of 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA 2019), p.1 - 8, 2019/11

 Times Cited Count: 6, Percentile: 99.17

Iterative methods for solving large linear systems are common parts of computational fluid dynamics (CFD) codes. The preconditioned conjugate gradient (P-CG) method is one of the most widely used iterative methods. However, in the P-CG method, global collective communication is a crucial bottleneck, especially on accelerated computing platforms. To resolve this issue, communication-avoiding (CA) variants of the P-CG method are becoming increasingly important. In this paper, the P-CG and preconditioned Chebyshev basis CA CG (P-CBCG) solvers in the multiphase CFD code JUPITER are ported to the latest V100 GPUs. All GPU kernels are highly optimized to achieve about 90% of the roofline performance, the block Jacobi preconditioner is re-designed to extract the high computing power of GPUs, and the remaining bottleneck of halo data communication is hidden by overlapping communication with computation. The overall performance of the P-CG and P-CBCG solvers is determined by the competition between the CA properties of the global collective communication and the halo data communication, indicating the importance of the inter-node interconnect bandwidth per GPU. The developed GPU solvers are accelerated up to 2x compared with the former CPU solvers on KNLs, and excellent strong scaling is achieved up to 7,680 GPUs on Summit.
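
The communication/computation overlap mentioned here typically follows the standard non-blocking MPI pattern sketched below: post the halo messages, update the interior cells that need no halo, then complete the boundary. Field, Halo, updateInterior and updateBoundary are hypothetical placeholders of ours.

    #include <mpi.h>

    struct Field;  // hypothetical grid data
    struct Halo {  // hypothetical halo buffers and neighbor ranks
        double *sendLo, *sendHi, *recvLo, *recvHi;
        int count, lo, hi;
    };
    void updateInterior(Field&);
    void updateBoundary(Field&, const Halo&);

    // One time step with the halo exchange overlapped by interior computation.
    void stepOverlapped(Field& f, Halo& h, MPI_Comm comm) {
        MPI_Request reqs[4];
        MPI_Irecv(h.recvLo, h.count, MPI_DOUBLE, h.lo, 0, comm, &reqs[0]);
        MPI_Irecv(h.recvHi, h.count, MPI_DOUBLE, h.hi, 1, comm, &reqs[1]);
        MPI_Isend(h.sendLo, h.count, MPI_DOUBLE, h.lo, 1, comm, &reqs[2]);
        MPI_Isend(h.sendHi, h.count, MPI_DOUBLE, h.hi, 0, comm, &reqs[3]);
        updateInterior(f);                          // needs no halo data
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        updateBoundary(f, h);                       // halo has arrived
    }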

Journal Articles

Fuel debris' air cooling analysis using a lattice Boltzmann method

Onodera, Naoyuki; Idomura, Yasuhiro; Kawamura, Takuma; Uesawa, Shinichiro; Yamashita, Susumu; Yoshida, Hiroyuki

Proceedings of 27th International Conference on Nuclear Engineering (ICONE-27) (Internet), 6 Pages, 2019/05

A dry method is one of the practical methods for decommissioning TEPCO's Fukushima Daiichi Nuclear Power Station. The Japan Atomic Energy Agency (JAEA) has been evaluating the air cooling performance by using the JUPITER code. However, the JUPITER code requires a large computational cost to capture the debris structures. To accelerate such CFD analyses, we use the CityLBM code, which is based on the lattice Boltzmann method (LBM) and is highly optimized for GPUs. The CityLBM code is validated against free convective heat transfer experiments at JAEA, and accuracy similar to that of the JUPITER code is confirmed for the prediction of heat transfer and the resulting temperature distributions. It is also shown that the elapsed time of a CityLBM simulation on GPUs is reduced to 1/6 of that of the corresponding JUPITER simulation on CPUs, with the same number of GPUs and CPUs. The results show that the LBM is promising for accelerating thermal convective simulations.
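
For readers unfamiliar with the method, a minimal single-relaxation-time (BGK) lattice Boltzmann update on a D2Q9 lattice is sketched below: compute the macroscopic moments, relax toward the local equilibrium, then stream along the discrete velocities. A textbook toy, far simpler than CityLBM.

    // D2Q9 lattice: rest velocity, four axis directions, four diagonals.
    constexpr int Q = 9;
    constexpr int cx[Q] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
    constexpr int cy[Q] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };
    constexpr double w[Q] = { 4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                              1.0/36, 1.0/36, 1.0/36, 1.0/36 };

    // Second-order equilibrium distribution for density rho, velocity (ux, uy).
    double feq(int q, double rho, double ux, double uy) {
        const double cu = 3.0 * (cx[q] * ux + cy[q] * uy);
        const double u2 = ux * ux + uy * uy;
        return w[q] * rho * (1.0 + cu + 0.5 * cu * cu - 1.5 * u2);
    }

    // One BGK collide-and-stream step on an nx x ny periodic grid.
    void collideStream(const double* f, double* fnew, int nx, int ny, double omega) {
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                double rho = 0, ux = 0, uy = 0;      // macroscopic moments
                for (int q = 0; q < Q; ++q) {
                    const double fq = f[(y * nx + x) * Q + q];
                    rho += fq; ux += fq * cx[q]; uy += fq * cy[q];
                }
                ux /= rho; uy /= rho;
                for (int q = 0; q < Q; ++q) {        // collide, then stream
                    const double fq = f[(y * nx + x) * Q + q];
                    const double fpost = fq + omega * (feq(q, rho, ux, uy) - fq);
                    const int xn = (x + cx[q] + nx) % nx;
                    const int yn = (y + cy[q] + ny) % ny;
                    fnew[(yn * nx + xn) * Q + q] = fpost;
                }
            }
    }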

Journal Articles

Communication Reduced Multi-time-step Algorithm for Real-time Wind Simulation on GPU-based Supercomputers

Onodera, Naoyuki; Idomura, Yasuhiro; Ali, Y.*; Shimokawabe, Takashi*

Proceedings of 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA 2018) (Internet), p.9 - 16, 2018/11

 Times Cited Count: 5, Percentile: 94.81

We develop a communication-reduced multi-time-step (CRMT) algorithm for a lattice Boltzmann method (LBM) based on block-structured adaptive mesh refinement (AMR). The algorithm is based on the temporal blocking method and can improve computational efficiency by replacing a communication bottleneck with additional computation. The proposed method is implemented in an extreme-scale airflow simulation code, CityLBM, and its impact on the scalability is tested on the GPU-based supercomputers TSUBAME and Reedbush. Thanks to the CRMT algorithm, the communication cost is reduced by $$\sim 64\%$$, and weak and strong scaling is improved up to $$\sim 200$$ GPUs. The obtained performance indicates that real-time airflow simulations for an approximately 2 km square area with a wind speed of $$5\,\mathrm{m/s}$$ are feasible at 1 m resolution.
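
Temporal blocking trades communication for redundant computation roughly as sketched below: the halo is widened to NT layers and exchanged once per NT sub-steps, with the not-yet-communicated halo cells recomputed locally each sub-step. exchangeHalo and update are hypothetical placeholders of ours.

    struct Field;  // hypothetical grid data
    void exchangeHalo(Field&, int width);      // one message, width halo layers
    void update(Field&, int validHaloLayers);  // advance everything still valid

    // Advance nSteps time steps, communicating only every NT steps.
    void advance(Field& f, int nSteps, int NT) {
        for (int t = 0; t < nSteps; t += NT) {
            exchangeHalo(f, NT);               // one wide exchange instead of NT
            for (int k = 0; k < NT; ++k)
                update(f, NT - 1 - k);         // valid halo shrinks each sub-step
        }
    }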

Journal Articles

Acceleration of wind simulation using locally mesh-refined Lattice Boltzmann Method on GPU-Rich supercomputers

Onodera, Naoyuki; Idomura, Yasuhiro

Lecture Notes in Computer Science 10776, p.128 - 145, 2018/00

 Times Cited Count: 9, Percentile: 94.31

We developed a CFD code based on the adaptive mesh-refined lattice Boltzmann method (AMR-LBM). The code was developed on the GPU-rich supercomputer TSUBAME3.0 at Tokyo Tech, and the GPU kernel functions are tuned to achieve high performance on the Pascal GPU architecture. Weak scaling performance from 1 node to 36 nodes is examined. The GPUs (NVIDIA Tesla P100) achieved more than 10 times the node performance of the CPUs (Broadwell).

Journal Articles

A Stencil framework to realize large-scale computations beyond device memory capacity on GPU supercomputers

Shimokawabe, Takashi*; Endo, Toshio*; Onodera, Naoyuki; Aoki, Takayuki*

Proceedings of 2017 IEEE International Conference on Cluster Computing (IEEE Cluster 2017) (Internet), p.525 - 529, 2017/09

Stencil-based applications such as CFD codes have succeeded in obtaining high performance on GPU supercomputers. The problem sizes of these applications are limited by the GPU device memory capacity, which is typically smaller than the host memory. On GPU supercomputers, a locality improvement technique using the temporal blocking method with memory swapping between host and device enables large computations beyond the device memory capacity. Our high-productivity stencil framework automatically applies temporal blocking to the boundary exchange required for stencil computation, and supports automatic memory swapping provided by an MPI/CUDA wrapper library. A framework-based application for the airflow in an urban city maintains 80% performance even with a problem size twice as large as the GPU memory capacity, and has demonstrated good weak scalability on the TSUBAME 2.5 supercomputer.
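
Host-device memory swapping of this kind usually follows the double-buffered out-of-core pattern sketched below: subdomains live in pinned host memory and are cycled through two device buffers, prefetching the next one on a copy stream while the current one is computed. launchStencil and all names are our placeholders, not the framework's API.

    #include <cuda_runtime.h>

    // Hypothetical stencil launch on a given stream (stands for the real kernel).
    void launchStencil(float* devField, cudaStream_t stream);

    // Cycle nSub host-resident subdomains through two device buffers.
    void sweep(float** hostSub, float* devBuf[2], size_t bytes, int nSub,
               cudaStream_t compute, cudaStream_t copy) {
        cudaMemcpyAsync(devBuf[0], hostSub[0], bytes, cudaMemcpyHostToDevice, copy);
        for (int s = 0; s < nSub; ++s) {
            cudaStreamSynchronize(copy);       // current subdomain is resident
            if (s + 1 < nSub)                  // prefetch the next one
                cudaMemcpyAsync(devBuf[(s + 1) % 2], hostSub[s + 1], bytes,
                                cudaMemcpyHostToDevice, copy);
            launchStencil(devBuf[s % 2], compute);
            cudaMemcpyAsync(hostSub[s], devBuf[s % 2], bytes,
                            cudaMemcpyDeviceToHost, compute);
            cudaStreamSynchronize(compute);    // result written back to host
        }
    }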

Journal Articles

A Numerical study of turbulence statistics and the structure of a spatially-developing boundary layer over a realistic urban geometry

Inagaki, Atsushi*; Kanda, Manabu*; Ahmad, N. H.*; Yagi, Ayako*; Onodera, Naoyuki; Aoki, Takayuki*

Boundary-Layer Meteorology, 164(2), p.161 - 181, 2017/08

 Times Cited Count: 12, Percentile: 53.98 (Meteorology & Atmospheric Sciences)

The applicability of outer-layer scaling is examined by numerical simulation of a developing neutral boundary layer over a realistic building geometry of Tokyo. Large-eddy simulations are carried out over a large computational domain of 19.2 km $$\times$$ 4.8 km $$\times$$ 1 km, with a fine grid spacing (2 m), using the lattice Boltzmann method on massively parallel graphics processing units. Results from the simulations show that outer-layer features are maintained for turbulence statistics in the upper part of the boundary layer, as well as for the width of the predominant streaky structures throughout the entire boundary layer. This is caused by the existence of very large streaky structures extending throughout the entire boundary layer, which follow outer-layer scaling with a self-preserving development. We assume a top-down mechanism in the physical interpretation of the results.

Oral presentation

High performance implementation of nuclear fusion simulation code on GPU cluster

Matsumoto, Kazuya; Asahi, Yuichi*; Ina, Takuya; Idomura, Yasuhiro

no journal

We present the implementation and performance evaluation results of the plasma physics simulation code GT5D on a GPU cluster. In this study, an iterative matrix solver, which is identified as a performance bottleneck in the code, is tuned on the GPU. The measured performance is compared with the attainable performance calculated by the roofline model. Additionally, we show an implementation with direct communications between GPUs for utilizing many GPUs.
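
The roofline model referred to here bounds a kernel's attainable performance by its arithmetic intensity $$I$$ (flop/byte), the peak floating-point rate $$P_{peak}$$ and the memory bandwidth $$B$$ (our generic statement of the model):

$$ P_{attainable} = \min\left(P_{peak},\; B \times I\right). $$

Memory-bound solver kernels, such as the sparse matrix-vector products that dominate iterative solvers, sit on the $$B \times I$$ slope, so their measured performance is naturally compared against this bandwidth bound.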
