Hasegawa, Yuta; Idomura, Yasuhiro; Onodera, Naoyuki
EPJ Web of Conferences, 302, p.03005_1 - 03005_9, 2024/10
Times Cited Count:0 Percentile:0.00(Computer Science, Interdisciplinary Applications)
We implemented an ensemble data assimilation (DA) method, the local ensemble transform Kalman filter (LETKF), in the mesh-refined lattice Boltzmann method (LBM) for turbulent flows. Both the LETKF and the mesh-refined LBM were fully implemented on GPUs, so that they run efficiently on modern GPU-based supercomputers. We examined the DA accuracy for the flow around a square cylinder. The results showed that our method enabled accurate DA with spatially and temporally sparse observation data: the error of the assimilated velocity field was smaller than the amplitude of the observation noise, with the observation interval given relative to the period of the Kármán vortex and the observation resolution given relative to the diameter of the square cylinder (1.56% of the total computational grid points).
Hasegawa, Yuta; Idomura, Yasuhiro; Onodera, Naoyuki
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 29, 4 Pages, 2024/06
We implemented the ensemble data assimilation (DA) of turbulence by using the mesh-refined lattice Boltzmann method with the local ensemble transform Kalman filter (LBM-LETKF). We examined the accuracy of the data assimilation for a turbulent flow around a three-dimensional square cylinder. The DA error was comparable to or less than the observation noise when the observation interval was half of the period of the Kármán vortex street and the number of observation points was 0.195% of the computational grid points. The LBM-LETKF enables DA of turbulence with spatially and temporally sparse observations.
Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Asahi, Yuichi; Inagaki, Atsushi*; Shimose, Kenichi*; Hirano, Kohin*
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 28, 4 Pages, 2023/05
We have developed a multi-scale wind simulation code named CityLBM that can resolve scales from entire cities down to individual streets. CityLBM enables real-time ensemble simulation of an area several km square by applying the locally mesh-refined lattice Boltzmann method on GPU supercomputers. However, real-world wind simulations involve complex boundary conditions that cannot be modeled, so data assimilation techniques are needed to reflect observed data in the simulation. This study proposes an optimization method for the ground surface temperature bias based on an ensemble Kalman filter to reproduce wind conditions within urban city blocks. As a verification of CityLBM, an Observing System Simulation Experiment (OSSE) is conducted for the central Tokyo area to estimate boundary conditions from observed near-surface temperature values.
Asahi, Yuichi; Maeyama, Shinya*; Fujii, Keisuke*
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 28, 4 Pages, 2023/05
We have developed a deep-learning model that serves as a surrogate for the effect of small-scale fluctuations on large-scale fluctuations. We have constructed sub-grid-scale (SGS) models based on the Mori-Zwanzig projection operator method and neural networks. We have performed large eddy simulations (LESs) of Kuramoto-Sivashinsky turbulence with these SGS models. We have demonstrated that the time-averaged energy spectrum of the LESs agrees with that of the direct numerical simulation (DNS).
Asahi, Yuichi; Onodera, Naoyuki; Hasegawa, Yuta; Shimokawabe, Takashi*; Shiba, Hayato*; Idomura, Yasuhiro
Boundary-Layer Meteorology, 186(3), p.659 - 692, 2023/03
Times Cited Count:2 Percentile:32.22(Meteorology & Atmospheric Sciences)
We develop a Transformer-based deep learning model to predict plume concentrations in an urban area under uniform flow conditions. Our model has two distinct input layers: Transformer layers for sequential data and convolutional layers, as used in convolutional neural networks (CNNs), for image-like data. The model can predict the plume concentration from realistically available data such as time-series monitoring data at a few observation stations, the building shapes, and the source location. It is shown that the model gives reasonably accurate predictions orders of magnitude faster than CFD simulations. It is also shown that exactly the same model can be applied to predict the source location, again with reasonable prediction accuracy.
Asahi, Yuichi; Padioleau, T.*; Latu, G.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*
Dai-36-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 8 Pages, 2022/12
We implemented a kinetic plasma simulation code with multiple performance portable frameworks and evaluated its performance on Intel Icelake, NVIDIA V100 and A100 GPUs, and AMD MI100 GPU. Relying on the language standard parallelism stdpar and the proposed language standard multi-dimensional array support mdspan, we demonstrate a performance portable implementation without harming readability or productivity. With stdpar, we obtain good overall performance for a kinetic plasma mini-application, within 20% of the Kokkos version on Icelake, V100, A100 and MI100. We conclude that stdpar can be a good candidate for developing performance portable and productive code targeting Exascale era platforms, assuming this programming model will be available on AMD and/or Intel GPUs in the future.
Hasegawa, Yuta; Onodera, Naoyuki; Asahi, Yuichi; Idomura, Yasuhiro
Dai-36-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 5 Pages, 2022/12
This study implemented and tested the ensemble data assimilation (DA) of turbulent flows using the lattice Boltzmann method and the local ensemble transform Kalman filter (LBM-LETKF). The computational code was implemented fully on GPUs. The test was carried out for the 3D turbulent flow around a square cylinder with 32 ensemble members using 32 GPUs. The time interval of the DA in the test was half of the period of the Kármán vortex shedding. The normalized mean absolute errors (NMAE) of the lift coefficient were 132%, 148%, and 13.2% for the non-DA case, the nudging case (a simpler DA algorithm), and the LETKF case, respectively. It was found that the LETKF achieved good DA accuracy even though the observations were not frequent enough to resolve the small-scale turbulence, while the nudging showed systematic delays in its solution and could not maintain DA accuracy.
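The NMAE figures quoted above can be read with a small sketch in mind. The abstract does not state how the error is normalized, so the function below assumes normalization by the mean absolute value of the reference signal; the function name and the sample values are hypothetical and are not taken from the paper.

```python
import numpy as np

def nmae(estimate, reference):
    """Normalized mean absolute error.

    ASSUMPTION: the error is normalized by the mean absolute value of the
    reference signal. The abstract does not give the authors' definition.
    """
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.mean(np.abs(estimate - reference)) / np.mean(np.abs(reference))

# Hypothetical lift-coefficient samples, for illustration only.
reference_lift = [1.0, 2.0]
estimated_lift = [2.0, 4.0]
error = nmae(estimated_lift, reference_lift)
print(f"NMAE = {error:.0%}")
```

Under this assumed normalization, a value above 100% means that the mean error is larger than the typical magnitude of the reference signal itself.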
Asahi, Yuichi; Padioleau, T.*; Latu, G.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*
Proceedings of 2022 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) (Internet), p.68 - 80, 2022/11
Times Cited Count:4 Percentile:78.31(Computer Science, Theory & Methods)
This paper presents a performance portable implementation of a kinetic plasma simulation code with C++ parallel algorithms that runs across multiple CPUs and GPUs. Relying on the language standard parallelism stdpar and the proposed language standard multi-dimensional array support mdspan, we demonstrate that a performance portable implementation is possible without harming readability or productivity. We obtain good overall performance for a mini-application, within 20% of the Kokkos version on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate for developing performance portable and productive code targeting Exascale era platforms, assuming this approach will be available on AMD and/or Intel GPUs in the future.
Asahi, Yuichi; Onodera, Naoyuki; Hasegawa, Yuta; Shimokawabe, Takashi*; Shiba, Hayato*; Idomura, Yasuhiro
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 27, 5 Pages, 2022/06
We have ported the GPU-accelerated lattice Boltzmann method code "CityLBM" to the AMD MI100 GPU. We present the performance of CityLBM achieved on NVIDIA P100, V100, and A100 GPUs and on the AMD MI100 GPU. Using host-to-host MPI communications, the performance on the MI100 GPU is around 20% better than on the V100 GPU. It turned out that most of the kernels are successfully accelerated, except for the interpolation kernels of the Adaptive Mesh Refinement (AMR) method.
Hasegawa, Yuta; Imamura, Toshiyuki*; Ina, Takuya; Onodera, Naoyuki; Asahi, Yuichi; Idomura, Yasuhiro
Proceedings of 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH22) (Internet), p.10 - 17, 2022/00
The ensemble data assimilation of computational fluid dynamics simulations based on the lattice Boltzmann method (LBM) and the local ensemble transform Kalman filter (LETKF) is implemented and optimized on a GPU supercomputer based on NVIDIA A100 GPUs. To connect the LBM and LETKF parts, data transpose communication is optimized by overlapping computation, file I/O, and communication based on the data dependency in each LETKF kernel. In two-dimensional forced isotropic turbulence simulations, the optimized implementation achieved a speedup over the naive implementation, in which the LETKF part is not parallelized. The main computing kernel of the local problem is the eigenvalue decomposition (EVD) of real symmetric dense matrices, which is computed by a newly developed batched EVD in EigenG. The batched EVD in EigenG outperforms that in cuSolver.
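To show where the eigenvalue decomposition of a real symmetric matrix enters an ensemble transform Kalman filter, here is a minimal NumPy sketch of one analysis step in the standard ensemble-transform form. It is not the authors' GPU code: it omits the localization that makes the filter "local", assumes a linear observation operator `H` and a scalar observation error variance `r_var`, and uses `numpy.linalg.eigh` in place of a batched GPU solver. All names and the toy data are hypothetical.

```python
import numpy as np

def etkf_analysis(Xb, y, H, r_var):
    """One ensemble transform Kalman filter analysis step (no localization).

    Xb    : (n, k) background ensemble, one member per column
    y     : (p,)   observations
    H     : (p, n) linear observation operator (assumption)
    r_var : scalar observation error variance (assumption: R = r_var * I)
    """
    n, k = Xb.shape
    xb_mean = Xb.mean(axis=1)
    Xp = Xb - xb_mean[:, None]            # background perturbations
    Yb = H @ Xb                           # ensemble in observation space
    yb_mean = Yb.mean(axis=1)
    Yp = Yb - yb_mean[:, None]

    C = Yp.T / r_var                      # Yp^T R^-1, shape (k, p)
    A = (k - 1) * np.eye(k) + C @ Yp      # real symmetric positive definite, (k, k)

    # Eigenvalue decomposition of the k x k symmetric matrix.
    evals, evecs = np.linalg.eigh(A)
    Pa_tilde = (evecs / evals) @ evecs.T                     # A^-1
    Wa = (evecs * np.sqrt((k - 1) / evals)) @ evecs.T        # symmetric square root

    wa = Pa_tilde @ C @ (y - yb_mean)     # weights for the mean update
    W = Wa + wa[:, None]
    return xb_mean[:, None] + Xp @ W      # analysis ensemble, (n, k)

# Toy problem with hypothetical sizes: 4 state variables, 6 members, 2 observations.
rng = np.random.default_rng(0)
Xb = rng.normal(size=(4, 6))
H = np.zeros((2, 4))
H[0, 0] = 1.0
H[1, 2] = 1.0
y = np.array([0.5, -0.3])
Xa = etkf_analysis(Xb, y, H, r_var=0.1)
```

In a localized filter, a decomposition of this kind is repeated independently for many small local problems, which is why a batched solver for small symmetric matrices matters for performance.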
Hasegawa, Yuta; Onodera, Naoyuki; Asahi, Yuichi; Idomura, Yasuhiro
Dai-35-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 3 Pages, 2021/12
We are developing a real-time urban wind simulation code called CityLBM. In this paper, a performance measurement of CityLBM was carried out using Tesla A100 GPUs. To optimize the communication on heterogeneous network architectures with intra-node (NVLink) and inter-node (InfiniBand) connections, we designed a blocked two-dimensional domain partitioning with 2 × 2 or 2 × 4 subdomains, which are confined within each node. The strong scaling with 2.4 billion grid points was tested. The result showed good strong scalability and performance, leading to a 2.81× speedup from 80 GPUs to 256 GPUs and a 1.15× speedup with the blocked domain partitioning. Finally, the simulation with 1 m resolution over a 5.7 km × 5.7 km horizontal region exceeded real-time performance, with a computational speed faster than real time.
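The strong-scaling figure above can be turned into a parallel efficiency with a few lines of arithmetic. The sketch below assumes the usual definition, measured speedup divided by the ideal speedup given by the ratio of GPU counts; the function name is hypothetical.

```python
def strong_scaling_efficiency(speedup, gpus_from, gpus_to):
    """Measured speedup relative to the ideal speedup gpus_to / gpus_from."""
    ideal_speedup = gpus_to / gpus_from
    return speedup / ideal_speedup

# Values quoted in the abstract: 2.81 speedup from 80 GPUs to 256 GPUs.
efficiency = strong_scaling_efficiency(2.81, 80, 256)
print(f"ideal speedup: {256 / 80:.2f}, efficiency: {efficiency:.1%}")
# prints "ideal speedup: 3.20, efficiency: 87.8%"
```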
Hasegawa, Yuta; Aoki, Takayuki*; Kobayashi, Hiromichi*; Idomura, Yasuhiro; Onodera, Naoyuki
Parallel Computing, 108, p.102851_1 - 102851_12, 2021/12
Times Cited Count:6 Percentile:47.10(Computer Science, Theory & Methods)
An aerodynamics simulation code based on the lattice Boltzmann method (LBM) using forest-of-octrees-based block-structured local mesh refinement (LMR) was implemented, and its performance was evaluated on GPU-based supercomputers. We found that the conventional Space-Filling-Curve-based (SFC) domain partitioning algorithm results in costly halo communication in our aerodynamics simulations. Our new tree cutting approach improved the locality and the topology of the partitioned sub-domains and reduced the communication cost to one-third or one-fourth of that of the original SFC approach. In the strong scaling test, the code achieved its maximum speedup at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs. In the weak scaling test, the code achieved 9620 MLUPS on 128 GPUs with 4.473 billion grid points, while the parallel efficiency was 93.4% from 8 to 128 GPUs.
Asahi, Yuichi; Latu, G.*; Bigot, J.*; Grandgirard, V.*
Proceedings of 2021 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) (Internet), p.79 - 91, 2021/11
This paper presents optimization strategies dedicated to a kinetic plasma simulation code that makes use of OpenACC/OpenMP directives and the Kokkos performance portable framework to run across multiple CPUs and GPUs. We evaluate the impact of the optimizations on multiple hardware platforms: Intel Xeon Skylake, Fujitsu Arm A64FX, and Nvidia Tesla P100 and V100. After the optimizations, the OpenACC/OpenMP version achieved speedups of 1.07× to 1.39×, and the Kokkos version achieved speedups of 1.00× to 1.33×. Since the impact of the optimizations is demonstrated for multiple combinations of kernels, devices, and parallel implementations, this paper provides a widely applicable approach to accelerating a code while keeping its performance portability. To achieve excellent performance on both CPUs and GPUs, Kokkos could be a reasonable choice, since it offers more flexibility to manage multiple data and loop structures with a single codebase.
Asahi, Yuichi; Hatayama, Sora*; Shimokawabe, Takashi*; Onodera, Naoyuki; Hasegawa, Yuta; Idomura, Yasuhiro
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 4 Pages, 2021/05
We develop a convolutional neural network model to predict multi-resolution steady flow. Based on the state-of-the-art image-to-image translation model Pix2PixHD, our model can predict the high-resolution flow field from the signed distance function. By patching the high-resolution data, the memory requirement of our model is reduced compared to Pix2PixHD.
Hasegawa, Yuta; Aoki, Takayuki*; Kobayashi, Hiromichi*; Idomura, Yasuhiro; Onodera, Naoyuki
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 6 Pages, 2021/05
We introduce an improved domain partitioning method called the "tree cutting approach" for an aerodynamics simulation code based on the lattice Boltzmann method (LBM) with forest-of-octrees-based local mesh refinement (LMR). The conventional domain partitioning algorithm based on the space-filling curve (SFC), which is widely used in LMR, caused costly halo data communication, which became a bottleneck of our aerodynamics simulation on GPU-based supercomputers. Our tree cutting approach adopts a hybrid domain partitioning that combines a coarse structured block decomposition with SFC partitioning in each block. This hybrid approach improved the locality and the topology of the partitioned sub-domains and reduced the amount of halo communication to one-third of that of the original SFC approach. The code achieved a speedup on 8 GPUs, and in the strong scaling test on 128 GPUs it reached a performance of 2207 MLUPS (mega-lattice updates per second).
Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Shimokawabe, Takashi*; Aoki, Takayuki*
Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 3 Pages, 2021/05
We develop a mixed-precision preconditioner for the pressure Poisson equation in the two-phase flow CFD code JUPITER-AMR. The multi-grid (MG) preconditioner is constructed based on the geometric MG method with a three-stage V-cycle and a cache-reuse SOR (CR-SOR) method at each stage. The numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. The MG-CG solver in single precision shows the same convergence history as in double precision, while taking about 75% of the computational time of the double-precision solver. In the strong scaling test, the MG-CG solver in single precision is accelerated by 1.88 times between 32 and 96 GPUs.
Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Yamashita, Susumu; Shimokawabe, Takashi*; Aoki, Takayuki*
Proceedings of International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021) (Internet), p.120 - 128, 2021/01
Times Cited Count:0 Percentile:0.00(Computer Science, Hardware & Architecture)
We develop a multigrid preconditioned conjugate gradient (MG-CG) solver for the pressure Poisson equation in the two-phase flow CFD code JUPITER. The MG preconditioner is constructed based on the geometric MG method with a three-stage V-cycle, and an RB-SOR smoother and its variant with cache-reuse optimization (CR-SOR) are applied at each stage. The numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. The MG-CG solvers with the RB-SOR and CR-SOR smoothers reduce the number of iterations to less than 15% and 9% of the original preconditioned CG method, leading to 3.1- and 5.9-times speedups, respectively. The obtained performance indicates that the MG-CG solver designed for the block-structured grid is highly efficient and enables large-scale simulations of two-phase flows on GPU-based supercomputers.
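For readers unfamiliar with the RB-SOR smoother named above, the following is a minimal NumPy sketch of red-black successive over-relaxation for a 2D Poisson problem on a uniform grid with zero Dirichlet boundaries. It only illustrates the generic technique: the grid, the relaxation factor, the test problem, and all function names are assumptions, and it reflects neither the block-structured GPU implementation nor the cache-reuse variant (CR-SOR).

```python
import numpy as np

def residual_norm(u, f, h):
    """Max-norm residual of the 5-point discretization of -laplace(u) = f."""
    r = np.zeros_like(u)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - (
        4.0 * u[1:-1, 1:-1]
        - u[2:, 1:-1] - u[:-2, 1:-1] - u[1:-1, 2:] - u[1:-1, :-2]
    ) / h**2
    return np.max(np.abs(r))

def rb_sor_sweep(u, f, h, omega):
    """One red-black SOR sweep; u is updated in place."""
    ii, jj = np.indices(u.shape)
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1] = True
    for color in (0, 1):
        # Points of one color only have neighbors of the other color,
        # so all points of a color can be updated simultaneously.
        mask = interior & ((ii + jj) % 2 == color)
        neighbors = np.zeros_like(u)
        neighbors[1:-1, 1:-1] = (
            u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
        )
        gauss_seidel = 0.25 * (neighbors + h**2 * f)
        u[mask] = (1.0 - omega) * u[mask] + omega * gauss_seidel[mask]
    return u

# Hypothetical test problem with known solution sin(pi x) sin(pi y).
n = 17
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")
exact = np.sin(np.pi * X) * np.sin(np.pi * Y)
f = 2.0 * np.pi**2 * exact

u = np.zeros((n, n))
r_initial = residual_norm(u, f, h)
for _ in range(200):
    rb_sor_sweep(u, f, h, omega=1.5)
r_final = residual_norm(u, f, h)
```

Because each color is updated independently of itself, the two half-sweeps map naturally onto data-parallel hardware, which is one reason this ordering is common in GPU smoothers.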
Onodera, Naoyuki; Idomura, Yasuhiro; Ali, Y.*; Yamashita, Susumu; Shimokawabe, Takashi*; Aoki, Takayuki*
Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.210 - 215, 2020/10
This paper presents a GPU-based Poisson solver on a block-based adaptive mesh refinement (block-AMR) framework. The block-AMR method is essential for GPU computation and for an efficient description of the nuclear reactor. In this paper, we successfully implement a conjugate gradient method with a state-of-the-art multi-grid preconditioner (MG-CG) on the block-AMR framework. GPU kernel performance was measured on the GPU-based supercomputer TSUBAME3.0. The bandwidth of a vector-vector sum, a matrix-vector product, and a dot product in the CG kernel showed good performance, at about 60% of the peak performance. In the MG kernel, the smoothers in a three-stage V-cycle MG method are implemented using a mixed-precision RB-SOR method, which also gave good performance. For a large-scale Poisson problem, the developed MG-CG method reduced the number of iterations to less than 30% and achieved a 2.5× speedup compared with the original preconditioned CG method.
Asahi, Yuichi; Latu, G.*; Bigot, J.*; Grandgirard, V.*
Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.218 - 224, 2020/10
Performance portability is expected to be a critical issue in the upcoming exascale era. We explore a performance portable approach for a fusion plasma turbulence simulation code employing the kinetic model, namely the GYSELA code. For this purpose, we extract the key features of GYSELA, such as the high dimensionality (more than 4D) and the semi-Lagrangian scheme, and encapsulate them into a mini-application that solves a Vlasov-Poisson system similar to, but simpler than, that of GYSELA. We implement the mini-app with OpenACC, OpenMP4.5 and Kokkos, suppressing unnecessary duplication of code lines. Based on our experience, we discuss the advantages and disadvantages of OpenACC, OpenMP4.5 and Kokkos from the viewpoint of performance portability, readability and productivity.
Hasegawa, Yuta; Onodera, Naoyuki; Idomura, Yasuhiro
Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.236 - 242, 2020/10
The wind conditions and the plume dispersion in urban areas are strongly affected by buildings and plants, which are poorly represented in conventional mesoscale simulations. To resolve this issue, we developed a GPU-based CFD code using a mesh-refined lattice Boltzmann method (LBM), which enables real-time plume dispersion simulations with a resolution of several meters. However, such high-resolution simulations are highly turbulent, and the time histories of the results are sensitive to various simulation conditions. To improve the reliability of such chaotic simulations, we developed an ensemble simulation approach, which enables a statistical estimation of the uncertainty. We examined the developed code against the field experiment JU2003 in Oklahoma City. In the comparison, the wind conditions showed good agreement, and the average values of the tracer gas concentration agreed within a factor of 2 between the ensemble simulation data and the experiment.