Idomura, Yasuhiro; Ina, Takuya*; Ali, Y.*; Imamura, Toshiyuki*
Dai-34-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 6 Pages, 2020/12
A new communication avoiding (CA) Krylov solver with a FP16 (half precision) preconditioner is developed for a semi-implicit finite difference solver in the Gyrokinetic Toroidal 5D full-f Eulerian code GT5D. In the solver, the bottleneck of global collective communication is resolved using a CA-Krylov subspace method, and halo data communication is reduced by the FP16 preconditioner, which improves the convergence property. The FP16 preconditioner is designed based on the physics properties of the operator and is implemented using the new support for FP16 SIMD operations on A64FX. The solver is also ported to GPUs, and the performance of ITER-size simulations with 0.1 trillion grids is measured on Fugaku (A64FX) and Summit (V100). The new solver accelerates GT5D compared with the conventional non-CA solver, and excellent strong scaling is obtained up to 5,760 CPUs/GPUs on both Fugaku and Summit.
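As an illustration of the communication-avoiding idea described above, the following minimal NumPy sketch (not the GT5D implementation; the monomial basis and the toy SPD matrix are assumptions for illustration) builds an s-step Krylov basis and then forms all of the inner products needed for the next s iterations in a single Gram-matrix product, which on a distributed machine corresponds to one global reduction per s steps instead of one or more per iteration.

    import numpy as np

    def s_step_basis(A, r, s):
        """Monomial s-step Krylov basis [r, A r, ..., A^s r] (illustrative; a production
        CA solver may use a better-conditioned basis)."""
        V = np.empty((r.size, s + 1))
        V[:, 0] = r
        for k in range(s):
            V[:, k + 1] = A @ V[:, k]          # local SpMV + halo exchange in a parallel code
        return V

    rng = np.random.default_rng(0)
    n, s = 200, 4
    A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # toy SPD operator
    b = rng.standard_normal(n)

    V = s_step_basis(A, b, s)
    G = V.T @ (A @ V)      # every inner product needed for the next s iterations at once:
    print(G.shape)         # one small global reduction instead of several per iteration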
Idomura, Yasuhiro; Ina, Takuya*; Ali, Y.*; Imamura, Toshiyuki*
Proceedings of International Conference on High Performance Computing, Networking, Storage, and Analysis (SC 2020) (Internet), p.1318 - 1330, 2020/11
The multi-scale full-f simulation of the next-generation experimental fusion reactor ITER based on a five-dimensional (5D) gyrokinetic model is one of the most computationally demanding problems in fusion science. In this work, a Gyrokinetic Toroidal 5D Eulerian code (GT5D) is accelerated by a new mixed-precision communication-avoiding (CA) Krylov method. The bottleneck of global collective communication on accelerated computing platforms is resolved using a CA Krylov method. In addition, a new FP16 preconditioner, which is designed using the new support for FP16 SIMD operations on A64FX, reduces both the number of iterations (halo data communication) and the computational cost. The performance of the proposed method for ITER-size simulations with 0.1 trillion grids on 1,440 CPUs/GPUs on Fugaku and Summit shows 2.8x and 1.9x speedups, respectively, over the conventional non-CA Krylov method, and excellent strong scaling is obtained up to 5,760 CPUs/GPUs.
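The mixed-precision idea can be sketched as follows: the Krylov iteration stays in FP64 while the preconditioner data and its application are held in FP16. The snippet below is a hedged stand-in using a simple Jacobi-type preconditioner in NumPy's float16; the paper's actual preconditioner is physics-based and exploits A64FX FP16 SIMD instructions.

    import numpy as np

    def make_fp16_jacobi(A):
        """Store a Jacobi-type preconditioner in half precision (an illustrative stand-in
        for the physics-based FP16 preconditioner described in the paper)."""
        return (1.0 / np.diag(A)).astype(np.float16)

    def apply_fp16_preconditioner(d16, r):
        z16 = d16 * r.astype(np.float16)       # preconditioner applied entirely in FP16
        return z16.astype(np.float64)          # result promoted back to the FP64 Krylov space

    rng = np.random.default_rng(1)
    n = 100
    A = np.diag(rng.uniform(1.0, 10.0, n))     # toy diagonally dominant operator
    r = rng.standard_normal(n)
    z = apply_fp16_preconditioner(make_fp16_jacobi(A), r)
    print(np.linalg.norm(A @ z - r))           # small but nonzero: FP16 gives a cheap, approximate solve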
Idomura, Yasuhiro; Ina, Takuya*; Ali, Y.*; Imamura, Toshiyuki*
Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.225 - 230, 2020/10
A new communication avoiding (CA) Krylov solver with a FP16 (half precision) preconditioner is developed for a semi-implicit finite difference solver in the Gyrokinetic Toroidal 5D full-f Eulerian code GT5D. In the solver, the bottleneck of global collective communication is resolved using a CA-Krylov subspace method, while halo data communication is reduced by improving the convergence property with the FP16 preconditioner. The FP16 preconditioner is designed based on the physics properties of the operator and is implemented using the new support for FP16 SIMD operations on A64FX. The solver is ported to Fugaku (A64FX) and Summit (V100), which respectively show 63x and 29x speedups in socket performance compared to the conventional non-CA Krylov solver on JAEA-ICEX (Haswell).
Idomura, Yasuhiro; Onodera, Naoyuki; Yamada, Susumu; Yamashita, Susumu; Ina, Takuya*; Imamura, Toshiyuki*
Supa Kompyuthingu Nyusu, 22(5), p.18 - 29, 2020/09
A communication avoiding multigrid preconditioned conjugate gradient method (CAMGCG) is applied to the pressure Poisson equation in the multiphase CFD code JUPITER, and its computational performance and convergence properties are compared against conventional Krylov methods. The CAMGCG solver has robust convergence properties regardless of the problem size, and shows both communication reduction and convergence improvement, leading to a higher performance gain than CA Krylov solvers, which achieve only the former. The CAMGCG solver is applied to extreme-scale multiphase CFD simulations with 90 billion DOFs, and its performance is compared against the preconditioned CG solver. In this benchmark, the number of iterations is reduced and a speedup is achieved while keeping excellent strong scaling up to 8,000 nodes on the Oakforest-PACS.
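A minimal sketch of a multigrid-preconditioned CG iteration is given below, assuming a toy 1D Poisson operator and a single two-grid cycle as the preconditioner; the CAMGCG solver in JUPITER uses more levels and a CA formulation, but the structural idea, a cheap multilevel solve applied as M^{-1} inside CG, is the same.

    import numpy as np

    def poisson1d(n):
        """Toy 1D Poisson operator, a stand-in for the pressure Poisson matrix."""
        return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def interpolation(n):
        """Linear interpolation from (n-1)//2 coarse points to n fine points."""
        nc = (n - 1) // 2
        P = np.zeros((n, nc))
        for j in range(nc):
            i = 2 * j + 1
            P[i, j] = 1.0
            P[i - 1, j] = 0.5
            P[i + 1, j] = 0.5
        return P

    def two_grid(A, r, P, Ac, nu=2, omega=2.0 / 3.0):
        """One two-grid cycle (weighted-Jacobi smoothing + exact coarse solve) used as M^{-1} r."""
        d = np.diag(A)
        x = np.zeros_like(r)
        for _ in range(nu):                               # pre-smoothing
            x += omega * (r - A @ x) / d
        x += P @ np.linalg.solve(Ac, P.T @ (r - A @ x))   # coarse-grid correction
        for _ in range(nu):                               # post-smoothing
            x += omega * (r - A @ x) / d
        return x

    def pcg(A, b, precond, tol=1e-8, maxit=200):
        x = np.zeros_like(b)
        r = b - A @ x
        z = precond(r)
        p = z.copy()
        rz = r @ z
        for k in range(1, maxit + 1):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                return x, k
            z = precond(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x, maxit

    n = 255
    A = poisson1d(n)
    P = interpolation(n)
    Ac = P.T @ A @ P                                      # Galerkin coarse operator
    b = np.ones(n)
    x, iters = pcg(A, b, lambda r: two_grid(A, r, P, Ac))
    print(iters, np.linalg.norm(A @ x - b))               # converges in far fewer iterations than plain CG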
Yamada, Susumu; Machida, Masahiko; Imamura, Toshiyuki*
Parallel Computing; Technology Trends, p.105 - 113, 2020/00
no abstracts in English
Ali, Y.*; Onodera, Naoyuki; Idomura, Yasuhiro; Ina, Takuya*; Imamura, Toshiyuki*
Proceedings of 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA 2019), p.1 - 8, 2019/11
Iterative methods for solving large linear systems are common parts of computational fluid dynamics (CFD) codes. The Preconditioned Conjugate Gradient (P-CG) method is one of the most widely used iterative methods. However, in the P-CG method, global collective communication is a crucial bottleneck, especially on accelerated computing platforms. To resolve this issue, communication avoiding (CA) variants of the P-CG method are becoming increasingly important. In this paper, the P-CG and Preconditioned Chebyshev Basis CA CG (P-CBCG) solvers in the multiphase CFD code JUPITER are ported to the latest V100 GPUs. All GPU kernels are highly optimized to achieve about 90% of the roofline performance, the block Jacobi preconditioner is re-designed to extract the high computing power of GPUs, and the remaining bottleneck of halo data communication is avoided by overlapping communication and computation. The overall performance of the P-CG and P-CBCG solvers is determined by the competition between the CA properties of the global collective communication and the halo data communication, indicating the importance of the inter-node interconnect bandwidth per GPU. The developed GPU solvers are accelerated up to 2x compared with the former CPU solvers on KNLs, and excellent strong scaling is achieved up to 7,680 GPUs on Summit.
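A hedged sketch of the block Jacobi preconditioner structure mentioned above: each diagonal block is factorized and applied independently of all others, which is what makes the preconditioner communication-free and amenable to batched GPU kernels. The block size and dense toy matrix below are illustrative assumptions, not the JUPITER data layout.

    import numpy as np

    def block_jacobi_setup(A, block_size):
        """Invert only the diagonal blocks; every block is independent of the others."""
        blocks = []
        for start in range(0, A.shape[0], block_size):
            end = min(start + block_size, A.shape[0])
            blocks.append((start, end, np.linalg.inv(A[start:end, start:end])))
        return blocks

    def block_jacobi_apply(blocks, r):
        z = np.empty_like(r)
        for start, end, Binv in blocks:        # each block is a small dense product: batchable on a GPU
            z[start:end] = Binv @ r[start:end]
        return z

    rng = np.random.default_rng(2)
    n = 64
    A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # toy SPD matrix
    z = block_jacobi_apply(block_jacobi_setup(A, 16), rng.standard_normal(n))
    print(z[:4])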
Idomura, Yasuhiro; Ina, Takuya*; Yamashita, Susumu; Onodera, Naoyuki; Yamada, Susumu; Imamura, Toshiyuki*
Proceedings of 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA 2018) (Internet), p.17 - 24, 2018/11
A communication avoiding (CA) multigrid preconditioned conjugate gradient method (CAMGCG) is applied to the pressure Poisson equation in the multiphase CFD code JUPITER, and its computational performance and convergence properties are compared against CA Krylov methods. In the JUPITER code, the CAMGCG solver has robust convergence properties regardless of the problem size, and shows both communication reduction and convergence improvement, leading to a higher performance gain than CA Krylov solvers, which achieve only the former. The CAMGCG solver is applied to extreme-scale multiphase CFD simulations with 90 billion DOFs, and it is shown that, compared with a preconditioned CG solver, the number of iterations is reduced and a speedup is achieved while keeping excellent strong scaling up to 8,000 nodes on the Oakforest-PACS.
Yamada, Susumu; Imamura, Toshiyuki*; Machida, Masahiko
Lecture Notes in Computer Science 10776, p.243 - 256, 2018/00
no abstracts in English
Idomura, Yasuhiro; Ina, Takuya*; Mayumi, Akie; Yamada, Susumu; Imamura, Toshiyuki*
Lecture Notes in Computer Science 10776, p.257 - 273, 2018/00
A preconditioned Chebyshev basis communication-avoiding conjugate gradient method (P-CBCG) is applied to the pressure Poisson equation in the multiphase thermal-hydraulic CFD code JUPITER, and its computational performance and convergence properties are compared against a preconditioned conjugate gradient (P-CG) method and a preconditioned communication-avoiding conjugate gradient (P-CACG) method on the Oakforest-PACS, which consists of 8,208 KNLs. The P-CBCG method reduces the number of collective communications while keeping robust convergence properties. Compared with the P-CACG method, communication-avoiding step counts that are an order of magnitude larger are enabled by the improved robustness. It is shown that the P-CBCG method is faster than both the P-CG and P-CACG methods at 2,000 processors.
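The key ingredient that allows larger communication-avoiding step counts is the Chebyshev polynomial basis, which stays well conditioned where the monomial basis A^k r degenerates. A minimal NumPy sketch is given below; the known spectral bounds and the toy SPD operator are assumptions for illustration.

    import numpy as np

    def chebyshev_basis(A, r, s, lmin, lmax):
        """Basis vectors T_k((A - c I)/h) r from the three-term Chebyshev recurrence, with
        the spectrum [lmin, lmax] mapped to [-1, 1]."""
        c, h = 0.5 * (lmax + lmin), 0.5 * (lmax - lmin)
        V = np.empty((r.size, s + 1))
        V[:, 0] = r
        V[:, 1] = (A @ V[:, 0] - c * V[:, 0]) / h
        for k in range(1, s):
            # T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)
            V[:, k + 1] = 2.0 * (A @ V[:, k] - c * V[:, k]) / h - V[:, k - 1]
        return V

    rng = np.random.default_rng(3)
    n, s = 100, 12
    A = np.diag(np.linspace(1.0, 50.0, n))     # toy SPD operator with a known spectrum
    r = rng.standard_normal(n)
    Vc = chebyshev_basis(A, r, s, 1.0, 50.0)
    Vm = np.column_stack([np.linalg.matrix_power(A, k) @ r for k in range(s + 1)])
    print(np.linalg.cond(Vc), np.linalg.cond(Vm))   # the Chebyshev basis is far better conditioned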
Yamada, Susumu; Imamura, Toshiyuki*; Machida, Masahiko
Parallel Computing is Everywhere, p.27 - 36, 2018/00
no abstracts in English
Idomura, Yasuhiro; Ina, Takuya*; Mayumi, Akie; Yamada, Susumu; Matsumoto, Kazuya*; Asahi, Yuichi*; Imamura, Toshiyuki*
Proceedings of 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA 2017), p.7_1 - 7_8, 2017/11
A communication-avoiding generalized minimal residual (CA-GMRES) method is applied to the gyrokinetic toroidal five-dimensional Eulerian code GT5D, and its performance is compared against the original code with a generalized conjugate residual (GCR) method on the JAEA ICEX (Haswell), the Plasma Simulator (FX100), and the Oakforest-PACS (KNL). The CA-GMRES method has higher arithmetic intensity than the GCR method, and thus is suitable for future exascale architectures with limited memory and network bandwidths. In the performance evaluation, it is shown that, compared with the GCR solver, the computing kernels are accelerated and the fraction of the total cost spent on data reduction communication is substantially reduced at 1,280 nodes.
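The structure of a CA-GMRES cycle can be sketched as follows: s matrix-vector products are applied back to back (halo exchanges only), and the resulting block of basis vectors is then orthogonalized in one shot, so the data-reduction communication happens once per s steps. The toy operator and the use of a plain QR in place of a parallel tall-skinny QR are assumptions for illustration.

    import numpy as np

    def matrix_powers_block(A, v, s):
        """s back-to-back matrix-vector products: halo exchanges only, no global reductions."""
        V = np.empty((v.size, s + 1))
        V[:, 0] = v / np.linalg.norm(v)
        for k in range(s):
            V[:, k + 1] = A @ V[:, k]
        return V

    rng = np.random.default_rng(4)
    n, s = 300, 6
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # toy nonsymmetric operator
    V = matrix_powers_block(A, rng.standard_normal(n), s)
    Q, R = np.linalg.qr(V)    # one block orthogonalization (a parallel TSQR in practice):
                              # a single data reduction replaces s rounds of dot products
    print(np.linalg.norm(Q.T @ Q - np.eye(s + 1)))    # ~ machine epsilon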
Yamada, Susumu; Ina, Takuya*; Sasa, Narimasa; Idomura, Yasuhiro; Machida, Masahiko; Imamura, Toshiyuki*
Proceedings of 2017 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW) (Internet), p.1418 - 1425, 2017/08
no abstracts in English
Mayumi, Akie; Idomura, Yasuhiro; Ina, Takuya; Yamada, Susumu; Imamura, Toshiyuki*
Proceedings of 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA 2016) (Internet), p.17 - 24, 2016/11
The left-preconditioned communication avoiding conjugate gradient (LP-CA-CG) method is applied to the pressure Poisson equation in the multiphase CFD code JUPITER. The arithmetic intensity of the LP-CA-CG method is analyzed and is dramatically improved by loop splitting for inner product operations and for three-term recurrence operations. Two LP-CA-CG solvers, one with block Jacobi preconditioning and one with underlap preconditioning, are developed. It is shown that on the K computer, the LP-CA-CG solver with block Jacobi preconditioning is faster, because the performance of local point-to-point communications scales well, while the convergence property becomes worse with underlap preconditioning. The LP-CA-CG solver shows good strong scaling up to 30,000 nodes, where it achieves higher performance than the original CG solver by reducing the cost of global collective communications by 69%.
Sasa, Narimasa; Yamada, Susumu; Machida, Masahiko; Imamura, Toshiyuki*
Nonlinear Theory and Its Applications, IEICE (Internet), 7(3), p.354 - 361, 2016/07
Round-off error accumulation in the iterative use of the FFT is discussed. Using numerical simulations of partial differential equations, we show numerically that round-off errors from the iterative use of the FFT tend to accumulate. To avoid a loss of precision, we carry out numerical simulations using quadruple-precision floating-point numbers, which ensure sufficient precision against the round-off errors introduced by the FFT.
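The effect can be reproduced with a short NumPy experiment: repeatedly transforming and inverse-transforming a signal at a fixed working precision shows the round-off drift accumulating. NumPy has no portable quadruple precision, so this sketch only contrasts single and double precision, and the cast inside the loop emulates running the whole computation at the lower precision.

    import numpy as np

    def roundtrip_drift(x0, dtype, iters=10000):
        """Repeated FFT/IFFT at a fixed working precision; NumPy computes the FFT internally
        in double precision, so the cast back to `dtype` emulates the lower-precision loop."""
        x = x0.astype(dtype)
        for _ in range(iters):
            x = np.fft.ifft(np.fft.fft(x)).astype(dtype)
        return float(np.max(np.abs(x.astype(np.complex128) - x0)))

    rng = np.random.default_rng(5)
    x0 = rng.standard_normal(256) + 1j * rng.standard_normal(256)
    print(roundtrip_drift(x0, np.complex64))    # round-off accumulates visibly in single precision
    print(roundtrip_drift(x0, np.complex128))   # far smaller, but still grows with the iteration count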
Yamada, Susumu; Imamura, Toshiyuki*; Machida, Masahiko
Parallel Computing; On the Road to Exascale, p.361 - 369, 2016/00
no abstracts in English
Kawamura, Takuma; Idomura, Yasuhiro; Miyamura, Hiroko; Imamura, Toshiyuki*; Takemiya, Hiroshi
Shisutemu Seigyo Joho Gakkai Rombunshi, 28(5), p.221 - 227, 2015/05
Although remote volume visualization is important for obtaining knowledge from complicated large-scale simulation results on supercomputers, rendering speed and data transfer speed become bottlenecks of conventional client/server volume visualization techniques. A client/server visualization system using particle-based volume rendering enables interactive volume visualization by converting the original volume data into small, lightweight particle data on the supercomputer and transferring them to the client PC. This system generated the particle data in a few seconds using parallel processing on the K supercomputer, with strong scaling up to 1,000 processors.
Matsuoka, Seikichi*; Satake, Shinsuke*; Idomura, Yasuhiro; Imamura, Toshiyuki*
Proceedings of Joint International Conference on Mathematics and Computation, Supercomputing in Nuclear Applications and the Monte Carlo Method (M&C + SNA + MC 2015) (CD-ROM), 13 Pages, 2015/04
The quality and performance of a parallel pseudo-random number generator (PRNG), KMATH_RANDOM, are investigated using a Monte Carlo particle simulation code for plasma transport. The library is based on Mersenne Twister with jump routines and provides a numerical tool which is suitable and easy to use on massively parallel supercomputers such as the K computer. The library enables the particle code to increase its parallelization up to several thousand processes without losing the quality and performance of the PRNG. As a result, the particle code can use large amounts of random numbers, which removes unphysical phenomena caused by numerical noise.
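The jump-based stream splitting can be sketched with NumPy's PCG64 bit generator as a stand-in for the Mersenne Twister jump routines in KMATH_RANDOM: each rank jumps a shared generator ahead by a rank-dependent number of fixed, astronomically large strides, so the per-rank streams are guaranteed not to overlap. The seed, rank count, and generator choice below are illustrative assumptions.

    import numpy as np

    def rank_local_rng(seed, rank):
        """One non-overlapping stream per MPI rank: jump a shared PCG64 generator ahead by a
        rank-dependent number of fixed, very large strides (a stand-in for the Mersenne
        Twister jump routines in KMATH_RANDOM)."""
        return np.random.Generator(np.random.PCG64(seed).jumped(rank))

    # emulate four ranks drawing particle positions from independent streams
    streams = [rank_local_rng(12345, rank) for rank in range(4)]
    for rank, gen in enumerate(streams):
        print(rank, gen.random(3))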
Yamada, Susumu; Imamura, Toshiyuki*; Machida, Masahiko
Parallel Computing; Accelerating Computational Science and Engineering (CSE), p.427 - 436, 2014/03
no abstracts in English
Idomura, Yasuhiro; Nakata, Motoki; Yamada, Susumu; Machida, Masahiko; Imamura, Toshiyuki*; Watanabe, Tomohiko*; Nunami, Masanori*; Inoue, Hikaru*; Tsutsumi, Shigenobu*; Miyoshi, Ikuo*; et al.
International Journal of High Performance Computing Applications, 28(1), p.73 - 86, 2014/02
Idomura, Yasuhiro; Nakata, Motoki; Yamada, Susumu; Machida, Masahiko; Imamura, Toshiyuki*; Watanabe, Tomohiko*; Nunami, Masanori*; Inoue, Hikaru*; Tsutsumi, Shigenobu*; Miyoshi, Ikuo*; et al.
Proceedings of 31st JSST Annual Conference; International Conference on Simulation Technology (JSST 2012) (USB Flash Drive), p.234 - 242, 2012/09