Unique AI Framework (UAIF)
CoE RAISE follows the rules of open science and publishes its results open-access when they are ready for wider application. All developments of CoE RAISE are being integrated into the Unique AI Framework (UAIF), which will contain not only the trained models but also documentation on how to use them on current Petaflop and future Exascale HPC, prototype, and disruptive systems. The UAIF developed by CoE RAISE supports processing-intensive applications from a wide variety of scientific and engineering domains.
UAIF in the context of the larger European Ecosystem of Projects and Initiatives
A - Compute- and Data-Intensive CoE RAISE Use Cases
Component (A) in Fig. 1 represents the co-design efforts of the UAIF based on compute- and data-intensive use cases. Fact Sheets have been produced for each use case, describing how novel AI methods map to available UAIF components. They foster a general understanding of the contributions added to the UAIF over time, including aspects of scalability and utility for Exascale. In this context, several tasks in WP2 contributed to benchmarking and proving the scalability of selected UAIF components on various production and prototype HPC systems. Detailed co-design activities have been performed via the Interaction Room methodology and Mural Boards. During the project, and especially in the last reporting period, a clear picture emerged of which components are relevant for the UAIF.
B - Domain-Specific CoE Use Cases
A wide variety of CoEs have been funded in different domain-specific areas, providing use cases that leverage simulation sciences or AI/HPC methods to utilize emerging Exascale computing. At the time of writing, another EuroHPC JU Work Programme (WOPRO) outlining future funding of CoEs addresses the needs of large user communities in four specific application domains. As shown in Fig. 1 (B), these CoEs are encouraged to adopt the UAIF so that AI developers in domain-specific sciences do not duplicate effort.
C - NCC and Industrial Use Cases
A pan-European network of NCCs has been created under the EuroCC-1 and EuroCC-2 project umbrella to enable industry and Small and Medium Enterprises (SMEs) to leverage HPC resources made available via EuroHPC. Component (C) of Fig. 1 has been added to represent adoptions of the UAIF by NCCs and the significant potential for governmental, academic, industry, and SME partners to speed up and scale up their applications towards Exascale.
D - Digital Twins (DT) Use Cases
DTs and corresponding workflows, as developed, e.g., in the Destination Earth or interTwin projects, are becoming important for scientific and engineering HPC users in Europe. Component (D) has been added to Fig. 1 to represent the processing-intensive applications of DTs, which are also highly relevant for CoE RAISE, either through DTs adopting parts of the UAIF components or by including new DT use cases in CoE RAISE.
Reference Architecture Elements
This section describes the reference architecture components relevant for the UAIF for Exascale HPC/AI methods, which are listed in Fig. 1 in the second layer (components (E) – (O)). This covers descriptions of the Secure Shell (SSH) low-level access, Jupyter notebooks high-level access, application workflows, LAMEC API Open Neural Network Exchange (ONNX) standard elements, LAMEC API community platform integration, community platform OpenML interoperability, ClearML MLOps platform interoperability, LAMEC API facade pattern implementation, LAMEC API batch script repository, LAMEC API batch script generator, and open HPC/AI script generator web page(s).
E - Secure Shell (SSH) Low-Level Access
As shown in Fig. 1 (E), the first reference architecture element incorporates the SSH protocol into the plan. Primarily a means to log in remotely to HPC systems and submit batch scheduler scripts via the Simple Linux Utility for Resource Management (SLURM), it remains one of the integral access methods for HPC applications and needs to be provided to researchers. One example of relevance for CoE RAISE is that AI researchers often use batch scripts for distributed training of DL models to leverage the high number of Graphics Processing Units (GPUs) available on HPC systems.
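The batch-script-driven access pattern described above can be sketched in Python. All SLURM directive values and module names below are hypothetical placeholders, not the configuration of any specific system:

```python
# Sketch: composing a minimal SLURM batch script for distributed DL training.
# Partition-independent directives only; the module name is a placeholder
# and must be adapted to the target HPC system.

def make_batch_script(job_name: str, nodes: int, gpus_per_node: int) -> str:
    """Return a minimal SLURM script requesting several GPUs per node."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested per node
        "#SBATCH --time=01:00:00",
        "",
        "module load PyTorch  # placeholder module name",
        "srun python train.py",
    ])

script = make_batch_script("ddp-training", nodes=2, gpus_per_node=4)
print(script)
```

Once copied to the HPC system over SSH, such a script would be submitted with the standard SLURM `sbatch` command.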
F - Jupyter Notebooks High-Level Access
AI researchers frequently require some form of interactive access to HPC systems to facilitate rapid prototyping of Machine Learning (ML) and DL algorithms and models. Component (F) in Fig. 1 represents the acknowledgement of this need in the UAIF by offering Jupyter notebooks and JupyterLab. In addition to interactive graphical access via web interfaces, this component can extend the SSH component by creating SSH sessions from Jupyter notebooks on HPC systems, complemented by the Jupyter environment and a wide variety of useful extensions. CoE RAISE offers access to JupyterLab instances running at JSC through its service portal.
G - Application Workflows
Component (G) in Fig. 1 is a new addition to the UAIF and supports application workflows and workflow automation, including pre- and post-processing of task data. The UAIF recommends Apache Airflow, a platform to programmatically author, schedule, and monitor workflows. The UAIF application workflow set may include more workflow automation tools, such as Elyra, in the future.
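The core idea of programmatically authoring a workflow, as Airflow does with its DAGs, can be illustrated with a standard-library sketch. The task names are invented for illustration, and Airflow's actual API differs; this only shows the dependency-ordering principle:

```python
from graphlib import TopologicalSorter

# Sketch: a workflow as a directed acyclic graph of tasks, in the spirit
# of Airflow DAGs. Each key lists the tasks it depends on; names are
# illustrative placeholders only.
workflow = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "postprocess": {"evaluate"},
}

# A workflow engine executes tasks in an order where every dependency
# finishes before its dependents start.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

A real workflow tool adds scheduling, retries, and monitoring on top of exactly this ordering step.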
H - LAMEC API ONNX Standard Elements
As illustrated in Fig. 1 (H), a crucial component of the overall UAIF LAMEC API is the fast portability between different DL frameworks and the reproducibility achieved by using the standard ONNX format where possible. A short ONNX report has been developed in CoE RAISE and is available in WP2's internal document repository. However, the implementation of the overall UAIF LAMEC API using ONNX is still in progress.
I - LAMEC API Community Platform Integration
Another element of the overall UAIF LAMEC API, shown in Fig. 1 as component (I), represents seamless integration with other tools. The goal is to use the LAMEC API to share and reuse existing CoE RAISE AI models with community platforms, industry tools, and datasets, and to enable Transfer Learning (TL). While initial discussions with community platforms have occurred, the implementation of the overall UAIF LAMEC API integration and provisioning of AI models is still work in progress.
J - Community Platform OpenML Interoperability
OpenML is an open community platform for sharing datasets, algorithms, models, and experiments in the realm of AI, covering a wide variety of traditional ML approaches. One approach to enlarging the user community of CoE RAISE is to integrate UAIF components into the OpenML platform, allowing experiments to also run on cutting-edge HPC systems where available. Hence, component (J) in Fig. 1 represents how this community might leverage the LAMEC API components. Initial integration with OpenML has started, and a joint training with OpenML is available on the CoE RAISE YouTube Channel. At the time of writing, the CoE RAISE team is focusing on the core functionality of the LAMEC API, with OpenML integration planned for the second half of 2023.
K - ClearML MLOps Platform Interoperability
ClearML is a Machine Learning Operations (MLOps) platform used to develop, orchestrate, and automate ML workflows at scale. The CoE RAISE consortium provides an installation of ClearML for its internal and external users. Another approach to enlarging the user community of CoE RAISE is to integrate UAIF components into MLOps platforms like ClearML, allowing their tasks to also run on cutting-edge HPC systems where useful. Component (K) in Fig. 1 represents how this community might leverage the LAMEC API components through integration with MLOps platforms. Experience with ClearML exists in CoE RAISE, and training on ClearML is available on the CoE RAISE YouTube Channel. Full integration with ClearML is planned for the second half of 2023.
L - LAMEC API Facade Pattern Implementation
To map the abstract specifications of software and hardware needs by AI researchers to specific software and hardware HPC infrastructure elements, the general design of the UAIF LAMEC API uses a facade pattern. Hence, as represented by component (L) in Fig. 1, the UAIF software layout design employs an abstract wrapper that maps the abstract specifications from users to specific software and hardware configurations. The core of this API is split into two elements: a batch script repository and an API that uses this repository to generate new batch script elements.
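A minimal sketch of the facade idea follows; the system names and module lists are invented placeholders, not actual LAMEC API content. The user states an abstract need (a framework name), and the facade resolves it to a system-specific configuration:

```python
# Sketch of the facade pattern behind the LAMEC API design.
# System identifiers and module lists are hypothetical placeholders.

_CONFIGS = {
    ("system-a", "pytorch"): ["Stages/2023", "PyTorch/1.13"],
    ("system-a", "tensorflow"): ["Stages/2023", "TensorFlow/2.11"],
}

class EnvironmentFacade:
    """Maps an abstract user request to a concrete module configuration."""

    def __init__(self, system: str):
        self.system = system

    def modules_for(self, framework: str) -> list:
        # The caller only names a framework; the facade hides the
        # system-specific module details behind this single call.
        key = (self.system, framework.lower())
        if key not in _CONFIGS:
            raise ValueError(f"no configuration for {framework} on {self.system}")
        return _CONFIGS[key]

facade = EnvironmentFacade("system-a")
print(facade.modules_for("PyTorch"))
```

The facade pattern keeps the user-facing interface stable even when the underlying module stacks change per system or per software release.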
M - LAMEC API Batch Script Repository
As described in Deliverable L, the first core element of the UAIF LAMEC API is a batch script repository. It contains batch scripts for specific HPC systems with a correct setup of the modules needed to use specific UAIF AI tools (see Deliverables P - S). An implementation of this component is demonstrated on the CoE RAISE YouTube Channel: RAISE CoE Training: Towards a CoE RAISE Unique AI Software Framework for Exascale. As described for component (M) in Fig. 1, one idea is to use this repository with the UAIF LAMEC API (see Deliverable N). The repository in itself is a valuable resource for AI/HPC researchers who know how to deal with changing HPC modules in batch scripts.
N - LAMEC API Batch Script Generator
As referenced in Deliverable M, the second core element of the UAIF LAMEC API, represented by component (N) in Fig. 1, uses the above-mentioned repository to generate new batch script segments. This approach lowers the barrier to entry for AI researchers who may not be familiar with modules in HPC environments, and it saves time through automation for experienced users. Additional components, such as AI model scripts or datasets for training and inference, are planned for later inclusion. The implementation is work in progress, but both core elements of the UAIF LAMEC API were demonstrated on selected HPC systems during the all-hands meeting at the European Organization for Nuclear Research (CERN) in January 2023.
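The repository-plus-generator split can be sketched as follows; the repository entries and system names are invented placeholders, not actual LAMEC API content:

```python
# Sketch: generating a batch script segment from a repository of
# per-system module setups. All entries are hypothetical placeholders.

MODULE_REPOSITORY = {
    "system-a": {"horovod": ["module load Horovod  # placeholder"]},
    "system-b": {"horovod": ["module load OpenMPI  # placeholder",
                             "module load Horovod  # placeholder"]},
}

def generate_segment(system: str, tool: str, command: str) -> str:
    """Combine the stored module setup for a tool with the user's run command."""
    lines = MODULE_REPOSITORY[system][tool] + [command]
    return "\n".join(lines)

segment = generate_segment("system-b", "horovod", "srun python train.py")
print(segment)
```

The value of the split is that only the repository must be updated when a site changes its module stack; the generator and the user-facing call stay unchanged.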
O - Open HPC/AI Script Generator Web Page(s)
Component (O) in Fig. 1 represents the open HPC/AI job script generator web page(s) that employ the implementation of the UAIF LAMEC API. The concept is based on existing job script generators such as those available at the Swiss National Supercomputing Centre (CSCS) or the National Energy Research Scientific Computing Center (NERSC). The difference lies in the use of UAIF components with a specific focus on AI toolsets. The job script generator may be hosted at several sites to offer seamless access to AI tools on a variety of HPC systems, rather than one generator per HPC site.
The software infrastructure layer components (P) – (S), depicted in Fig. 1, are presented in this section. This layer contains four components, i.e., basic science libraries (P), DL libraries (Q), distributed DL tools (R), and hyperparameter tuner (S), which are described in the following Deliverables P - S. Again, a green arrow represents adoptions (see the connection to the hardware layer in Fig. 1, presented in the Deliverable Hardware Infrastructures).
P - Basic Science Libraries
Despite the massive increase in DL tools and packages and their uptake in the AI communities, there remains a core of basic science libraries heavily used by CoE RAISE communities. Examples of these basic science libraries for AI are NumPy and scikit-learn. This UAIF building block (P) in Fig. 1 also includes simulation science codes, e.g., those using numerical methods based on known physical laws that have the potential to benefit from coupling to AI models. Since the CoE RAISE project focuses primarily on AI models, the various relevant simulation science codes have been kept out of the UAIF software layout plan. Instead, the reader is referred to the Fact Sheets of the CoE RAISE use case applications described in "D2.10 - Monitoring Report" (M18), which include simulation science codes where relevant.
Q - Deep Learning Libraries
The UAIF recommends the use of PyTorch and TensorFlow. CoE RAISE has tested their performance and scalability in depth using various applications during the last two years. Although these two libraries were featured in previous UAIF software layout plans, this component (Q) is marked as ‘NEW’ in Fig. 1 due to the inclusion of the NVIDIA Data Loading Library (DALI) since the last reporting period. DALI further increases the performance of PyTorch and TensorFlow. This inclusion is shown in parentheses in component (Q) of Fig. 1 due to DALI's proprietary nature and its support only for NVIDIA GPUs. At the time of writing, CoE RAISE continues to investigate libraries from other GPU vendors, such as Advanced Micro Devices (AMD).
R - Distributed Deep Learning Tools
Component (R) in Fig. 1 outlines three supported libraries used to accelerate distributed AI model training by leveraging the large number of GPUs available at cutting-edge HPC sites today. Earlier implementations of this component are available as part of a training on CoE RAISE’s YouTube Channel: RAISE CoE Training: Distributed Deep Learning. PyTorch Distributed Data Parallel (DDP) and Horovod were already included in earlier UAIF software layout plans.
S - Hyperparameter Tuner
One of the most successful aspects of the current adoptions of the UAIF is the set of hyperparameter tuning (HPO) tools represented by component (S) in Fig. 1. Trainings reflecting this component and its implementation are available on CoE RAISE’s YouTube Channel: RAISE CoE Training: Hyperparameter Tuning with Ray Tune. In addition to the previously included Ray Tune tool, this component was updated with the addition of the Optuna and DeepHyper tools.
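The core loop behind such HPO tools can be sketched with a standard-library random search. The objective function below is a toy stand-in for an actual training-and-validation run, and the search space is invented for illustration:

```python
import random

# Sketch: random-search hyperparameter optimization, the basic loop
# behind tools like Ray Tune or Optuna (which add smarter samplers,
# schedulers, and distributed execution on top).

search_space = {
    "lr": (1e-4, 1e-1),          # learning-rate range
    "batch_size": [32, 64, 128],  # discrete choices
}

def objective(lr: float, batch_size: int) -> float:
    # Toy validation loss standing in for a real training run;
    # minimized near lr=0.01 and batch_size=64.
    return (lr - 0.01) ** 2 + abs(batch_size - 64) / 1000

random.seed(0)  # deterministic for reproducibility
best = None
for _ in range(50):
    trial = {
        "lr": random.uniform(*search_space["lr"]),
        "batch_size": random.choice(search_space["batch_size"]),
    }
    loss = objective(**trial)
    if best is None or loss < best[0]:
        best = (loss, trial)

print("best loss:", best[0], "config:", best[1])
```

Production tuners replace the uniform sampling with strategies such as Bayesian optimization or early stopping of unpromising trials, but the evaluate-and-keep-best loop is the same.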
The hardware infrastructure layer components (T) – (Y) depicted in Fig. 1 are presented in this section. This layer contains components on prototype HPC systems, the D-Wave Quantum Annealing (QA) system, the Modular Supercomputing Architecture (MSA) Juelich Wizard for European Leadership Science (JUWELS), container technologies, EuroHPC JU hosting sites, and EU HPC systems.
T - Prototype HPC Systems
The benchmarking and porting activities of WP2 have been performed on a number of interesting prototype HPC systems that feature new and emerging technologies. Since the beginning of the project, the Dynamical Exascale Entry Platform (DEEP) system has been used to experiment with the MSA type of HPC architecture. This component (T) includes the addition of two new prototype systems, the Advanced Reduced Instruction Set Computer Machine (ARM)-based CTE-ARM and CTE-AMD, hosted at the Barcelona Supercomputing Centre (BSC) in Spain. CTE-ARM is a supercomputer based on 192 A64FX ARM processors, with a Linux Operating System (OS) and a Tofu interconnect network (6.8 GB/s). CTE-AMD is a cluster based on AMD EPYC processors, with a Linux OS and an InfiniBand interconnection network. Its main characteristic is the availability of two AMD MI50 GPUs per node, making it an ideal cluster for GPU applications.
U - D-Wave Quantum Annealer System
Quantum Computing (QC) is gaining momentum, as the EuroHPC JU recently funded, together with national contributions, several QC systems. Multiple CoE RAISE use case applications have successfully engaged in QC by utilizing the D-Wave QA system available via the Juelich UNified Infrastructure for Quantum computing (JUNIQ) at JSC in Germany. As represented by component (U) in Fig. 1, the quantum AI models implemented were Support Vector Machines (SVMs), used for regression tasks via Support Vector Regression (SVR). An implementation of this component is also available as part of a training on SVMs on the CoE RAISE YouTube Channel: RAISE CoE Training: Quantum Support Vector Machine Algorithms.
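For reference, the classical training problem behind these models is the standard soft-margin SVM dual; running it on a quantum annealer requires encoding the real-valued coefficients in binary variables. The encoding shown below is one variant from the literature and not necessarily the exact formulation used in CoE RAISE:

```latex
% Standard soft-margin SVM dual problem (kernel $k$, labels $y_n = \pm 1$):
\max_{\alpha} \; \sum_{n} \alpha_n
  - \frac{1}{2} \sum_{n,m} \alpha_n \alpha_m \, y_n y_m \, k(x_n, x_m)
\quad \text{s.t.} \quad 0 \le \alpha_n \le C, \qquad \sum_{n} \alpha_n y_n = 0 .
% One illustrative binary encoding with K bits per coefficient and base B:
\alpha_n = \sum_{k=0}^{K-1} B^{k} \, a_{Kn+k}, \qquad a_{Kn+k} \in \{0, 1\} .
```

Substituting the binary encoding and moving the constraints into penalty terms turns the dual into a Quadratic Unconstrained Binary Optimization (QUBO) problem, the input format a quantum annealer accepts.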
V - Modular HPC System JUWELS
The MSA-based HPC system JUWELS is massively used within CoE RAISE for co-designing the UAIF and performing necessary speed-up and scaling benchmarks of its components, see component (V) in Fig. 1. It is an ideal HPC system for AI workloads.
W - Container Technologies
Container technologies are an important tool within larger AI communities to facilitate porting applications and datasets between systems. One such example in CoE RAISE is shown as component (W) in Fig. 1, where a containerized application is ported from JUWELS at JSC to the MareNostrum 4 system at BSC. This transparent deployment of containerized code is made possible by the support of Apptainer (previously named Singularity) available at both sites. Initial tests have been performed with containers on both HPC platforms. More application use case uptake is foreseen in the second half of 2023. This component of the UAIF is crucial to support more industry applications and to enable easy porting of data science applications that have not used HPC systems before.
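Constructing the corresponding container invocation can be sketched with the standard library. The image and script names are hypothetical placeholders; only the `apptainer exec <image> <command>` shape of the CLI is real:

```python
import shlex

# Sketch: building an Apptainer command line for running a containerized
# training script. The image and script paths are hypothetical; the same
# .sif image file can be moved between sites that support Apptainer.

def apptainer_command(image: str, *command: str) -> list:
    """Return the argv list for running `command` inside `image`."""
    return ["apptainer", "exec", image, *command]

argv = apptainer_command("training_env.sif", "python", "train.py")
print(shlex.join(argv))
```

Because the environment travels inside the image, the same invocation works unchanged on any system providing the Apptainer runtime, which is exactly the portability property component (W) relies on.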
X - EuroHPC JU Hosting Sites
Component (X) in Fig. 1 covers the major EuroHPC JU hosting sites, which represent stakeholders to adopt the UAIF. Several European HPC systems available within CoE RAISE contributed, with applications, to the co-design of the UAIF software layout and design. It is the goal of CoE RAISE to support as many EuroHPC JU systems as possible during and beyond the lifetime of the project by building on the sustainability strategy developed in WP5 (Business Development). Initial discussions with some of these sites have been started by WP2 partners to encourage the adoption of the UAIF and to engage the HPC sites in a CoE RAISE certification process jointly with WP5. At the time of writing, the broader adoption strategy is in its initial stages, while components such as the LAMEC API are expected to be further developed, adding support for more EuroHPC JU systems over time. One highlight of the planned adoption will be the integration with the first European Exascale system, JUPITER, which will be installed at JSC in 2024.
Y - EU HPC Systems
It is important to consider that the whole landscape of European HPC systems is broader than the EuroHPC JU hosting sites described above. New users of the UAIF often start on regional or university-level systems before scaling up to larger systems. Component (Y) in Fig. 1 contains examples such as the university-level system Rudens of Riga Technical University (RTU) or the HPC systems of Rheinisch-Westfälische Technische Hochschule Aachen - RWTH Aachen University (RWTH). Both of these sites are in the process of adopting parts of the UAIF and are in discussions with CoE RAISE concerning certification steps. Another example is the Belgian regional HPC systems of the Vlaams Supercomputer Centre (VSC), which are in use by CoE RAISE. There is a wide variety of other HPC system providers, such as commercial and industrial systems in Iceland (e.g., Responsible Compute), that are not shown in Fig. 1 (Y) but are in discussions with CoE RAISE to adopt elements of the UAIF.