Using the Slurm REST API to integrate with distributed architectures on AWS – HPCwire

Since 1987 – Covering the Fastest Computers in the World and the People Who Run Them
By Amazon Web Services
November 24, 2021
The Slurm Workload Manager by SchedMD is a popular HPC scheduler and is supported by AWS ParallelCluster, an elastic HPC cluster management service offered by AWS. Traditional HPC workflows involve logging into a head node and running shell commands to submit jobs to a scheduler and check job status. Modern distributed systems often use representational state transfer (REST) API operations to programmatically communicate between system components. This blog post provides details on the high-level design and use of the Slurm REST API with AWS ParallelCluster, and how you can use it to integrate HPC workloads securely and elastically with other AWS services. After reading this blog post, you will be ready to integrate HPC workflows into a distributed cloud-based architecture.
AWS services are built with API operations. Many of these API operations use HTTP as their communication protocol and function as REST API operations. When connecting Slurm to these services in a distributed architecture, the Slurm REST API provides a more native integration than the alternative method of invoking shell commands in a head node. This allows you to design integrations in a scalable and secure way.
SchedMD provides documentation and support for Slurm while maintaining its open source implementation. Part of the Slurm package supported by SchedMD is the REST API. The Slurm REST API was originally implemented as v0.0.35 in Slurm 20.02. As of this writing, the most recent stable Slurm release is 20.11 and includes v0.0.36 of the REST API. Past versions of the REST API are included in each Slurm release as separate endpoints for backward compatibility and are marked deprecated when they are slated for removal in subsequent versions. This makes it a sustainable interface that maintains compatibility while being improved and expanded.
The first section of this blog outlines the functionality of the Slurm REST API. The second section presents a brief overview of integrating custom private API operations into your AWS architecture. The third section provides example solutions that use the Slurm REST API and AWS ParallelCluster to provide HPC capabilities in a distributed application. The example solution architectures are:
The design patterns used in these examples can easily be extended to other Slurm-based clusters on-premises or in hybrid environments. For assistance building with Slurm, SchedMD offers professional services in AWS Marketplace to help customers create advanced HPC solutions.
In this section, we discuss the REST API architecture from the perspective of scalability, flexibility, and security.
The Slurm REST API is provided through a daemon named slurmrestd. It runs alongside the Slurm command line interface applications (sbatch, sinfo, scontrol, and squeue), so you can interact with Slurm through either interface. A Slurm cluster is controlled by the Slurm controller daemon running on the head node (slurmctld), while slurmrestd functions only as the REST API interface. slurmrestd functions synchronously with slurmctld, which means that a request is only considered complete after the HTTP response code is sent. slurmrestd is also stateless: after a request is complete, any state associated with it is discarded. These features allow it to function as a highly scalable interface.
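To make the request flow concrete, here is a minimal Python sketch of how a client might address a versioned slurmrestd endpoint. It only builds the request; sending it is left to a separate helper. The host name, port, and user name are hypothetical placeholders, and the sketch assumes the `/slurm/<version>/<path>` URL layout and the `X-SLURM-USER-NAME` / `X-SLURM-USER-TOKEN` headers used by slurmrestd with JWT authentication. Check the endpoints against the REST API reference for your Slurm release before relying on them.

```python
import json
import urllib.request

# Assumptions for illustration only: this address is hypothetical.
BASE_URL = "http://head-node.internal:6820"
API_VERSION = "v0.0.36"  # REST API version included with Slurm 20.11


def build_request(path, user, token, body=None,
                  base_url=BASE_URL, version=API_VERSION):
    """Build (but do not send) a request for a versioned slurmrestd endpoint."""
    url = f"{base_url}/slurm/{version}/{path.lstrip('/')}"
    headers = {
        "X-SLURM-USER-NAME": user,    # Slurm user the token was issued for
        "X-SLURM-USER-TOKEN": token,  # JWT created for that user
    }
    data = None
    if body is not None:
        # A JSON body turns the call into a POST.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(request, timeout=10):
    """Send the request and return (HTTP status, decoded JSON body).

    Each call is independent: slurmrestd keeps no state between requests.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))
```

For example, `call(build_request("jobs", "alice", token))` would list jobs for a hypothetical user `alice`, roughly the REST counterpart of running squeue on the head node. Because the version is part of the URL, a client can stay pinned to one API version while the cluster is upgraded.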
Slurm has traditionally used Munge for authentication, which is based on the UID and GID of the process calling Slurm. With the addition of slurmrestd, JSON Web Tokens (JWTs), an open standard defined in RFC 7519, were added as a new means of authentication. In Slurm, JWTs can function as authentication for slurmctld and slurmdbd. Any call to slurmrestd must include a JWT, which is passed to these daemons for authentication.
To use JWTs as Slurm authentication, you must configure Slurm to use them, create a JWT using scontrol, and provide the JWT to Slurm when using the API or CLI. The JWT functions similarly to Munge authentication in that it is associated with a user defined by a UID/GID. Due to this symmetry, both JWT and Munge authentication can be used concurrently and the familiar Slurm user/group definitions are not changed.
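The token-handling step can be sketched in Python as follows. This is an illustrative helper, not part of Slurm: it assumes that `scontrol token` prints a line of the form `SLURM_JWT=<token>`, and it reads the standard `exp` (expiry) claim from the token payload so a client can tell when to request a new one. The function names are hypothetical, and the payload is decoded without verifying the signature, which is acceptable for inspection only; Slurm itself performs the actual validation.

```python
import base64
import json
import time


def parse_scontrol_token(output):
    """Extract the token from output of the form SLURM_JWT=<token>."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("SLURM_JWT="):
            return line.split("=", 1)[1]
    raise ValueError("no SLURM_JWT=<token> line found")


def jwt_claims(token):
    """Decode the JWT payload WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("a JWT has three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token, now=None):
    """Return how many seconds remain before the token's `exp` claim."""
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] - now
```

A client could call `seconds_until_expiry` before each batch of API requests and obtain a fresh token from scontrol when the remaining lifetime gets short.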
The Slurm REST API is intended for use in a distributed architecture, but is not intended to be externally facing. Because slurmrestd does not include HTTPS support by default, REST API traffic that leaves a trusted network should be wrapped in TLS.
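On the client side, one way to enforce this is to refuse to send tokens over anything other than HTTPS. The Python sketch below assumes that something in front of slurmrestd (for example a reverse proxy or load balancer, which is not shown here) terminates TLS. The helper names are hypothetical, and the optional `ca_file` argument is for a private certificate authority if you use one.

```python
import ssl
import urllib.parse
import urllib.request


def make_tls_context(ca_file=None):
    """Create a client TLS context that verifies the server certificate."""
    context = ssl.create_default_context(cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def require_https(url):
    """Return the URL unchanged, or raise if it would send data in clear text."""
    if urllib.parse.urlsplit(url).scheme != "https":
        raise ValueError(f"refusing to send Slurm credentials over non-HTTPS URL: {url}")
    return url


def open_over_tls(request, context, timeout=10):
    """Send an already-built urllib request only if its URL uses HTTPS."""
    require_https(request.full_url)
    return urllib.request.urlopen(request, context=context, timeout=timeout)
```

Inside a trusted network you might skip this check, but keeping it in shared client code makes it harder to leak a JWT by accidentally pointing the client at a plain HTTP address.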
Read the full blog to learn how to set up the Slurm REST API in AWS ParallelCluster.
Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.
 

© 2021 HPCwire. All Rights Reserved. A Tabor Communications Publication
HPCwire is a registered trademark of Tabor Communications, Inc. Use of this site is governed by our Terms of Use and Privacy Policy.
Reproduction in whole or in part in any form or medium without express written permission of Tabor Communications, Inc. is prohibited.
