
WP2 news: Welcome to AI4HPC - the newest addition to the UAIF!


AI4HPC is an open-source library for training AI models for Computational Fluid Dynamics (CFD) applications on High-Performance Computing (HPC) systems. The library has been developed in the context of the CoE RAISE project and is part of the Unique AI Framework (UAIF). AI4HPC offers various routines for distributed training of Machine Learning (ML) models targeted at CFD problems. In particular, the library focuses on improving the training performance of AI models on heterogeneous HPC systems, with the goal of scaling up to Exascale-sized systems.

 

The core functionalities include multiple distributed training frameworks, data augmentation and manipulation routines for CFD datasets, ML models for investigating CFD problems, optimizations for improving parallel code performance on HPC systems, and a Hyperparameter Optimization (HPO) suite. AI4HPC also includes a benchmarking suite to test the performance and Exascale capabilities of heterogeneous HPC systems.


Figure 1: Data distribution strategy employed to train the AI models

The Distributed Data Parallel (DDP) strategy (shown in Figure 1) is based on splitting the data across multiple workers and then exchanging and averaging the gradients between them. The library offers a collection of DDP frameworks, namely PyTorch-DDP, Horovod, DeepSpeed, and HeAT. In terms of ML networks developed for CFD applications, the library provides a Convolutional Autoencoder, a Convolutional Regression Network, and a Convolutional Defiltering Model. AI4HPC is under active development, and further networks will be added over time.
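To illustrate the data-parallel pattern described above, the following is a minimal sketch using plain PyTorch DistributedDataParallel rather than the AI4HPC interface itself; the model, dataset, and hyperparameters are illustrative placeholders.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # The launcher (e.g., torchrun or the HPC batch system) provides rank,
    # world size, and the rendezvous environment variables.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    else:
        device = torch.device("cpu")

    # Placeholder dataset standing in for a CFD dataset.
    data = TensorDataset(torch.randn(1024, 64), torch.randn(1024, 1))
    sampler = DistributedSampler(data)            # splits the data across workers
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    model = torch.nn.Linear(64, 1).to(device)     # placeholder for a CFD network
    model = DDP(model, device_ids=[device.index] if device.type == "cuda" else None)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                  # reshuffle data shards every epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()       # gradients are averaged across workers here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()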

 

The HPO suite contains three widely used algorithms: Asynchronous Successive Halving (ASHA), Bayesian Optimization Hyperband (BOHB), and Population Based Training (PBT).

 

ASHA and BOHB are both based on the concept of early stopping: hyperparameter trials that perform below average are terminated before they are fully trained. PBT, in contrast, leverages principles from evolutionary optimization to discover good configurations and is particularly suitable for optimizing sequences of hyperparameters, such as learning rate schedules.

 

Figure 2 shows the difference between a naive random search approach to HPO and the ASHA scheduling algorithm. While random search blindly trains all trials to completion, ASHA terminates poorly performing trials early on. This way, only the most promising trial, i.e., the one with the best accuracy, is trained to completion.


Figure 2: Comparison of learning curves for Random Search (top) and ASHA (bottom)

Using one of these algorithms makes the HPO process more efficient and can save significant amounts of computational resources. These methods are documented in detail in the AI4HPC repository, enabling users to apply them to their own problems.
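To make the early-stopping idea concrete, below is a minimal, self-contained Python sketch of successive halving, the principle underlying ASHA (ASHA additionally processes the rungs asynchronously). The toy objective and all names are illustrative placeholders, not part of the AI4HPC HPO suite.

import random

def toy_accuracy(lr, epochs):
    # Toy learning curve standing in for a real training run.
    acc = 0.0
    for _ in range(epochs):
        acc += lr * (1.0 - acc)
    return acc

# Sample a population of hyperparameter trials (here: only a learning rate).
trials = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(27)]

budget, reduction_factor = 1, 3
while len(trials) > 1:
    # Evaluate every surviving trial at the current epoch budget ...
    scored = sorted(trials, key=lambda t: toy_accuracy(t["lr"], budget), reverse=True)
    # ... and keep only the best third for the next, longer rung.
    trials = scored[: max(1, len(scored) // reduction_factor)]
    budget *= reduction_factor

best = trials[0]
print("best trial:", best, "accuracy:", toy_accuracy(best["lr"], budget))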

 

More details and documentation about this library can be found here: https://ai4hpc.readthedocs.io/en/latest/index.html.
