Blueprint of AI Framework for Exascale conceived using Application Co-Design

The CoE RAISE project has significantly advanced towards designing the first blueprint of an AI framework ready for future Exascale HPC systems as described in our ‘AI at Exascale’ Webpage article. This framework is an enabler for highly scalable applications accelerating scientific discovery and advancing engineering in a wide variety of domains. As mentioned in our May 2021 news, it is co-designed by the RAISE Use Cases based on proper software engineering methodologies like Interaction Rooms and Fact Sheets. This news article highlights some of the critical outcomes of using these methodologies.

One of the most important goals of CoE RAISE is to support the development of Artificial Intelligence (AI) technologies towards their Exascale application using cutting-edge HPC systems. Building, training, and evaluating Machine Learning (ML) and Deep Learning (DL) models is therefore of utmost importance. The Fact Sheet development process, together with the design and discussion sessions in the continuously accessible virtual Interaction Rooms, enabled the identification of an initial set of relevant AI/HPC methods, which helped CoE RAISE meet another highly relevant project milestone. Table 1 summarizes these methods. A vital requirement of the RAISE AI framework is thus to support the design, development, and use of the AI models listed in this table. In other words, Table 1 represents the outcome of the requirement analysis concerning the model side of the RAISE AI framework. Furthermore, the framework's underlying ML and DL libraries, packages, and tools need to support these models.

Table 1: Initial outcome of the deeper CoE RAISE Use Case analysis requiring AI/HPC methods.

Apart from the listed AI/HPC methods per Use Case in Table 1, we identified additional software and hardware infrastructure requirements (RQ1-9), as shown in Figure 1. Taking these requirements into account, the software framework serves as a blueprint for important components to build and deploy AI applications for Exascale.

The framework is a universal, reusable software environment that offers particular functionality as part of a larger software platform (e.g., specific HPC system module deployments) to facilitate the development of AI Use Cases in RAISE. Based on the requirements, it includes supporting programs, code libraries, toolsets, and APIs that bring together all the different components needed to develop innovative AI models. WP3 Compute-Intensive CoE RAISE Use Cases (Figure 1 A) and WP4 Data-Intensive CoE RAISE Use Cases (Figure 1 B) contributed to the framework design. One of the CoE RAISE Reference Architecture goals is to enable other Exascale HPC & AI Community Use Cases (Figure 1 C), such as those driven by other CoEs or future Digital Twins (e.g., Destination Earth).

Figure 1: Initial AI framework for Exascale blueprint co-designed by CoE RAISE use case applications.

The RAISE Reference Architecture needs to support both low-level (RQ5) and high-level (RQ4) access methods. In this initial blueprint, low-level access is provided via SSH, with batch scheduler scripts used for automation (Figure 1 D). However, the requirement analysis of the WP3 and WP4 Use Cases revealed that interactive access via Jupyter notebooks is also required (Figure 1 E) to enable rapid prototyping of DL algorithms. In this context, it is possible to create SSH sessions out of Jupyter notebooks on HPC systems. Furthermore, the framework needs to be hardware-agnostic with respect to accelerators (RQ8) and requires high I/O performance to work with large quantities of data (RQ9). Abstracting specific DL tasks (RQ2) by using the ONNX format (Figure 1 F) enables portability between different DL frameworks and increases reproducibility (RQ6). A more comprehensive view of the use cases reveals that platforms like MLflow (Figure 1 G) are essential to share and re-use existing AI models among the broader AI & HPC community (RQ6).
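To illustrate what automating low-level access with batch scheduler scripts could look like, the following minimal Python sketch renders a job script from an abstract job description. It assumes a Slurm scheduler, which the blueprint does not prescribe; the `JobSpec` class, its fields, and the `train.py` command are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Abstract description of a training job (hypothetical, for illustration)."""
    name: str
    nodes: int
    gpus_per_node: int
    walltime: str   # format HH:MM:SS
    command: str    # command launched once per task


def render_slurm_script(job: JobSpec) -> str:
    """Render a batch script as text. Assumes Slurm; other schedulers differ."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        # One task per GPU is a common, but not universal, convention.
        f"#SBATCH --ntasks-per-node={job.gpus_per_node}",
        f"#SBATCH --gres=gpu:{job.gpus_per_node}",
        f"#SBATCH --time={job.walltime}",
        "",
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    spec = JobSpec(name="raise-demo", nodes=2, gpus_per_node=4,
                   walltime="00:30:00", command="python train.py")
    print(render_slurm_script(spec))
```

The rendered text could be written to a file and submitted over an SSH session; the exact resource flags depend on how each HPC centre configures its scheduler.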

The AI framework layout employs wrapper functionality based on the Facade pattern (Figure 1 H). It maps the abstract specifications above to specific software and hardware configurations (RQ1) of the available HPC systems (Figure 1 I), shielding users from the need to know low-level version details. However, in this context, it is vital to check which versions are available in which modules on the specific HPC systems (Figure 1 J), as RAISE’s Use Cases considered this a major obstacle to using AI technologies on HPC systems today.
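The idea of a facade that hides system-specific details can be sketched in a few lines of Python. This is only an illustration of the pattern, not the RAISE implementation: the system names are taken from this article, while the capability keys, module names, and version numbers in the catalogue are invented placeholders.

```python
from typing import Dict, List

# Hypothetical catalogue: abstract capability -> concrete module per system.
# Module names and versions are placeholders, not real deployments.
SYSTEM_MODULES: Dict[str, Dict[str, str]] = {
    "JUWELS": {"dl-framework": "PyTorch/1.8.1", "scaling": "Horovod/0.20.3"},
    "DEEP": {"dl-framework": "PyTorch/1.7.0", "scaling": "Horovod/0.20.3"},
}


class AIFrameworkFacade:
    """Single entry point that hides which module version a system provides."""

    def __init__(self, system: str) -> None:
        if system not in SYSTEM_MODULES:
            raise ValueError(f"unknown HPC system: {system}")
        self._system = system
        self._modules = SYSTEM_MODULES[system]

    def resolve(self, capability: str) -> str:
        """Map an abstract capability to the concrete module on this system."""
        try:
            return self._modules[capability]
        except KeyError:
            raise LookupError(
                f"{capability!r} is not available on {self._system}"
            ) from None

    def setup_commands(self, capabilities: List[str]) -> List[str]:
        """Return the 'module load' lines a user would otherwise write by hand."""
        return [f"module load {self.resolve(c)}" for c in capabilities]


if __name__ == "__main__":
    facade = AIFrameworkFacade("JUWELS")
    for line in facade.setup_commands(["dl-framework", "scaling"]):
        print(line)
```

A user asks for a capability such as `"dl-framework"` and never handles version strings directly; a missing capability surfaces as an explicit error, which mirrors the need to check what each system actually offers.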

The initial Reference Architecture blueprint in Figure 1 includes specific versions of packages within its software infrastructure provided by HPC centres in Europe. These software packages include basic science libraries (Figure 1 K) such as NumPy or pandas. In addition, the RAISE Reference Architecture will leverage specific harmonized versions of the DL packages TensorFlow and PyTorch (Figure 1 L), as well as PyTorch-DDP and Horovod for scaling towards Exascale (Figure 1 M). At the time of writing, the complementary benchmark activities of WP2 in CoE RAISE are investigating how well these packages scale towards Exascale and where their limits lie. To enable portability, users of the RAISE AI framework based on this Reference Architecture can use specially prepared Singularity containers with Use-Case-specific software stacks (Figure 1 P).
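Checking an environment against a set of harmonized package versions can be sketched with the Python standard library alone. The snippet below is a hedged illustration: the function names are hypothetical, and the version numbers in the usage example are placeholders rather than the versions RAISE has agreed on.

```python
from importlib import metadata
from typing import Callable, Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(
    required: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Compare required versions with what the lookup function reports."""
    report: Dict[str, str] = {}
    for package, wanted in required.items():
        found = lookup(package)
        if found is None:
            report[package] = "missing"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"mismatch (found {found}, wanted {wanted})"
    return report


if __name__ == "__main__":
    # Placeholder versions for illustration only.
    wanted = {"numpy": "1.21.0", "torch": "1.8.1"}
    for name, status in check_versions(wanted).items():
        print(f"{name}: {status}")
```

Passing the lookup as a parameter keeps the comparison logic independent of the current environment, so the same check could be run against a container image or a module listing.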

Finally, the Reference Architecture in Figure 1 also includes a hardware infrastructure that hosts the software infrastructure above and that is accessible to RAISE Use Case members. At present, the HPC systems available to RAISE are the DEEP prototype (Figure 1 N), the JUWELS system at the Jülich Supercomputing Centre (Figure 1 O), and the MareNostrum system at BSC-CNS (Figure 1 Q). Other machines not shown in Figure 1 include HPC systems available via the PRACE Rapid Access programme. Many more HPC systems are becoming available and are working towards supporting the basic blueprint of the CoE RAISE framework. Future work entails improving the blueprint (e.g., adding hyper-parameter optimization tools like Ray Tune or Optuna), identifying more HPC/AI methods to extend those listed in Table 1, and working with cutting-edge technologies such as quantum annealers.