This page has information on how to use Slurm to submit, manage, and analyze jobs.
Resource sharing and allocations on the cluster are handled by a combination of a resource manager (tracking which computational resources are available on which nodes) and a job scheduler (determining when and to which available resources to submit a particular job, and then monitoring it). To accomplish both tasks, the Strelka computing cluster uses the Slurm queue manager.
There are two primary reasons to use Slurm. First, other than for basic, short testing, no “real” work should be performed on the login node, which has several responsibilities such as managing users, handling logins, monitoring the other nodes, etc. For that reason, nearly all work should be performed on the compute nodes, and Slurm acts as the “gateway” into those systems. Second, because Slurm keeps track of which resources are available on the compute nodes, it is able to allocate the most efficient set of them for your tasks, as quickly as possible.
Slurm is a powerful and flexible program, and as such it is beyond the scope of this document to provide an exhaustive tutorial. Rather, the examples provided here should be sufficient to get started, and a wide array of online resources is available for further guidance.
Slurm References
For a detailed reference, the Slurm site has extensive documentation
Slurm Quick Reference Guide - a two-page document with common command references
Slurm Man Pages - the full reference for Slurm user commands
This page itself is modeled after the excellent CÉCI Slurm tutorial.
Gathering Information
Slurm offers a variety of commands to query the nodes, which can provide a snapshot of the overall computational ecosystem, list jobs in process or that are queued up, and more.
sinfo
The sinfo command lists available partitions and some basic information about each. A partition is a logical grouping of physical compute nodes. Running sinfo produces output similar to this; the list is dynamic and represents a current snapshot of which partitions are available, which systems comprise a given partition, and an idea of the availability of those systems:
[jsimms1@strelka ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute*     up     infinite   4      alloc  himem[02-03],node[01,06]
compute*     up     infinite   5      resv   himem01,node[02-03,05,07]
compute*     up     infinite   1      mix    node04
himem        up     infinite   2      alloc  himem[02-03]
himem        up     infinite   1      resv   himem01
hicpu        up     infinite   1      mix    hicpu01
gpu          up     infinite   2      mix    gpu[01-02]
interactive  up     5:00:00    2      mix    gpu[01-02]
squeue
The squeue command displays a list of jobs that are currently running (denoted with R) or that are pending (denoted with PD). Here is example output:
$ squeue
JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
12345     debug  job1  dave  R  0:21      4 node[9-12]
12346     debug  job2  dave PD  0:00      8 (Resources)
12348     debug  job3    ed PD  0:00      4 (Priority)
In this example, job 12345 is running on nodes 9-12 within the debug partition, job 12346 is pending because requested resources are unavailable, and job 12348 is pending because it has a lower priority than currently-running jobs. The other columns are largely self-explanatory, though note that TIME is how long a given job has been running so far. The squeue help page describes many other options available to control what information is displayed and its formatting.
Job Scheduling
As is commonly the case with HPC clusters, there are often insufficient resources to run all jobs immediately when they are submitted; submitted jobs are therefore placed into the job queue. Each job’s position in the queue depends on a number of factors. Slurm updates the priority queue every 5 seconds, so a job’s priority, and thus its position, may move up or down over time.
Slurm also uses backfill scheduling to “fill in” slots when, for example, a job completes earlier than estimated, so it is possible, especially for shorter jobs, that a job will start earlier than projected. For this reason, it is critical to estimate the time required for your job as accurately as possible. While you should not underestimate, excessive overestimation can make it appear that subsequent jobs won’t start for a long time. A good rule of thumb, when possible, is to request about 10-15% more time than you think is required.
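For example, if a script is expected to need roughly eight hours of wall time, a request along the following lines (the figure is purely illustrative) leaves reasonable headroom without grossly inflating the estimate:
#SBATCH --time=9:00:00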
Prior to submitting a job, you can check when it is estimated to run:
sbatch --test-only myscript.sh
For a job that has already been submitted, you can check its estimated start time:
squeue --start -j <jobid>
For a list of your job IDs:
squeue -u <username>
You can check your usage information with sshare; note that RawUsage corresponds to CPU seconds:
sshare -u <username>
You can also see a formatted list of all queued jobs, sorted by current priority (which, as noted, constantly updates):
squeue -o '%.7i %.9Q %.9P %.8j %.8u %.8T %.10M %.11l %.8D %.5C %R' -S '-p' --state=pending | less
Partition Configuration and Job Priority
Strelka's compute nodes are grouped into logical partitions, or collections of physical nodes; one common use, for example, is to designate a set of nodes owned by a particular researcher, to ensure that they have priority access to those resources. The general approach is that, for the most part, any user can use any node, but a node's owner and those they designate can preempt jobs (on nodes they purchased) from other users, resulting in those jobs being requeued.
Users who do not own specific nodes will typically work with three partitions: compute, unowned, and interactive:
Partition | Description | Advantages | Disadvantages |
---|---|---|---|
compute | The default partition, containing nearly every node on Strelka; if you do not specify a different partition, your jobs will run here. | Due to the large pool of nodes available, jobs will start as quickly as possible. | If a job ends up on a node owned by someone else, it is possible that the job will be preempted and requeued. This may not be a concern if the job uses regular checkpointing. |
unowned | Contains only nodes that have not been purchased by a specific researcher; that is, they are "owned" by ITS. | Jobs running in this partition cannot be preempted. | Because the available node pool is comparatively small, jobs may wait in the queue for a longer period of time. |
interactive | Contains nodes reserved for interactive sessions, either via the command line or Open OnDemand. | Permits an interactive session that cannot be preempted. | Interactive sessions have a time limit of 5:00:00 (five hours). |
It is possible to designate the desired partition to which you want to submit a job, as discussed below.
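For example, the partition can be set either inside the submission script or directly on the command line (the unowned partition is used here purely as an illustration):
#SBATCH --partition=unowned
sbatch --partition=unowned sample.sh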
Creating and Submitting Jobs
Slurm offers two primary ways to submit jobs to the compute nodes: interactive and batch. Interactive is the simpler method, but its usefulness is somewhat limited; it is generally used to work with software interactively. Batch is more complex and requires greater planning, but it is by far the most common way to use Slurm and provides a great deal of flexibility and power.
Interactive
Command line
The simplest way to connect to a set of resources is to request an interactive shell, which can be accomplished with the salloc command. Here is a basic example:
[user@strelka ~]$ salloc -t 60 --cpus-per-task=1 --mem-per-cpu=32gb --partition=interactive
[user@node01 ~]$
This example allocates an interactive shell session for 60 minutes (-t 60), provides one CPU (--cpus-per-task=1) and 32gb of memory to the session (--mem-per-cpu=32gb), and designates that the job should run on the interactive partition (--partition=interactive). As the second line shows, the requested resources were allocated using node01 and the interactive session switched to that node, ready for commands. At the end of 60 minutes, the session will be terminated, demonstrating why it is important to request a suitable amount of time (if you leave off the -t flag and do not specify a time, your session will be allocated only 5 minutes).
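The -t flag also accepts longer formats such as hours:minutes:seconds and days-hours; for example, this variant (values chosen purely for illustration) requests a two-hour session:
salloc -t 2:00:00 --cpus-per-task=1 --mem-per-cpu=32gb --partition=interactive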
Once your interactive session starts, you will be in your home directory and can begin performing work. If you wish to run software with a GUI, however, you must explicitly indicate that by adding the --x11 flag:
[user@strelka ~]$ salloc -t 60 --cpus-per-task=1 --mem-per-cpu=32gb --partition=interactive --x11
salloc is extremely powerful and there are a number of other options you can leverage. One of the most useful flags is --ntasks-per-node, which will allocate a specific number of computational cores to the session. This can be useful when running software that is optimized for parallel operations, such as Stata. For instance, the following example modifies the previous command to also request 8 cores:
[user@strelka ~]$ salloc -t 60 -N 1-1 --ntasks-per-node=8 --mem=32gb --partition=interactive
[user@node01 ~]$
When finished with your session, the exit command will terminate it and return you to the login node:
[user@node01 ~]$ exit
[user@strelka ~]$
Virtual desktop / GUI
If a virtual desktop is preferred, or is required to run a GUI program, a second option is to request an interactive session through Open OnDemand.
Batch
The most common way to work with Slurm is to submit batch jobs and allow the scheduler to manage which resources are used, and at which times. So what, exactly, is a job? A job has two separate parts:
- a resource request, which specifies things like the required number of cores, memory, GPUs, etc.
- a list of one or more job steps, which are basically the individual commands to be run sequentially to perform the actual tasks of the job.
The best way to manage these two parts is within a single submission script that Slurm uses to allocate resources and process your job steps. Here is an extremely basic sample submission script (we’ll name it sample.sh):
#!/bin/bash
#SBATCH --job-name=sample
#SBATCH --output=/home/username/samplejob/output/output_%j.txt
#SBATCH --partition=unowned
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=100mb
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE
#SBATCH --mail-user=username@swarthmore.edu

cd $HOME/samplejob
srun my_code.sh
Following the first (shebang) line are any number of SBATCH directives, which handle the resource request and other data (e.g., job name, output file location, and potentially many other options) associated with your job. These all must appear at the top of the file, prior to any job steps. In this file, multiple #SBATCH directives define the job:
Setting | Meaning | Value |
---|---|---|
#SBATCH --job-name=sample | Provide a short-ish descriptive name for your job | sample |
#SBATCH --output=/home/username/samplejob/output/output_%j.txt | Where to save output from the job; note that any content that normally would be output to the terminal will be saved in the file. | /home/username/samplejob/output/output_%j.txt (%j will be replaced by the job number assigned by Slurm; note that Slurm will default to producing an output file in the directory from which the job is submitted) |
#SBATCH --partition=unowned | Which partition to use | unowned |
#SBATCH --time=1:00:00 | Time limit of the job | 1:00:00 (one hour) |
#SBATCH --ntasks=1 | Number of CPU cores to request | 1 (this can be increased if your code can leverage additional cores) |
#SBATCH --mem-per-cpu=100mb | How much memory to request | 100mb (this is per-core and can be expressed in gb, etc.) |
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE | Decide when to receive an email | BEGIN,END,FAIL,REQUEUE (this will send an email when the job actually starts running, when it ends, if it fails, and if the job is requeued) |
#SBATCH --mail-user=username@swarthmore.edu | Email address | username@swarthmore.edu |
After the parameters are set, the commands to run the code are added. Note that this is effectively a modified shell script, so any commands that work in such scripts will typically work. It is important, however, to precede any actual commands with srun.
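For example, a job that runs several steps in sequence might end with lines like these (the script and program names are hypothetical placeholders):
cd $HOME/samplejob
srun ./prepare_data.sh
srun python analyze.py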
Submit a job to the queue
Once you have a job submission script created (e.g., sample.sh), use sbatch to send it into the queue:
sbatch sample.sh
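If the submission is accepted, sbatch replies with the job ID assigned by Slurm (the number below is just an example), which you can then use with commands such as squeue and scancel:
Submitted batch job 12345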
Cancel a job
Use scancel to cancel a job that is either waiting in the queue or currently running:
scancel <jobid>
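scancel can also cancel all of your queued and running jobs at once by specifying a username rather than a job ID:
scancel -u <username>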
Job Information and Analysis
Get details about a running job
scontrol show jobid -d <jobid>
This command shows detailed information about a running job, including how many nodes and cores were requested, how much memory was requested, and the start, elapsed, and end times. It can be run for any job, including those you did not submit.
List status info for a currently running job
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
The sstat command can only be run on jobs that you submitted. It provides detailed, configurable reporting on a running job. See detailed options on the Slurm sstat reference page.
Get efficiency statistics for a currently running job
seff <jobid>
Use the seff command to see CPU and memory usage for a job. The command shows how efficiently a job is using both CPU and memory. If you notice low utilization, you may be able to request fewer resources.
Get details about a completed job
sacct -o jobid,jobname,start,end,NNodes,NCPUS,ReqMem,CPUTime,AveRSS,MaxRSS -S <start date> -E <end date> --user=<username> --units=G
The sacct command shows information about completed jobs, which can be helpful for seeing how much memory was actually used. Check the sacct Slurm reference page for the full list of available attributes.
Example:
sacct -o jobid,jobname,start,end,NNodes,NCPUS,ReqMem,CPUTime,AveRSS,MaxRSS -S 2021-06-01 -E 2021-07-01 --user=apaul1 --units=G
Cluster Information
Show busy/free cores for the entire cluster
sinfo -o "%C"
Example output (A=allocated, I=idle, O=other, T=total):
CPUS(A/I/O/T)
296/24/0/320
In this example, 296 cores in the cluster are allocated (busy), 24 cores are idle, 0 cores are other (e.g. unavailable, down), and there are 320 total cores.
Show busy/free cores for each partition
This command shows how many cores are allocated and idle for each partition.
sinfo -o "%R %C"
Example output (A=allocated, I=idle, O=other, T=total):
PARTITION CPUS(A/I/O/T)
compute 296/24/0/320
himem 120/0/0/120
gpu 76/4/0/80
Show busy/free cores for each node
This command shows how many cores are allocated and idle for each node, along with each node's status (idle, allocated, or mixed).
sinfo -o "%n %T %C"
Example output (A=allocated, I=idle, O=other, T=total):
HOSTNAMES STATE CPUS(A/I/O/T)
node01 mixed 30/10/0/40
node03 mixed 30/10/0/40
himem02 allocated 40/0/0/40
gpu02 mixed 36/4/0/40
gpu01 allocated 40/0/0/40
himem01 allocated 40/0/0/40
himem03 allocated 40/0/0/40
node02 allocated 40/0/0/40
In this example, some nodes have mixed status and some are completely allocated.
Show reserved nodes
Researchers who have purchased nodes on Strelka may reserve them for exclusive use during periods of intense work. Use this command to see the list of reserved nodes:
sinfo -T
Example output:
RESV_NAME STATE START_TIME END_TIME DURATION NODELIST
group1 ACTIVE 2021-08-31T10:23:01 2021-12-15T09:23:01 106-00:00:00 node[01,02]
This indicates that node01 and node02 are reserved and can only be used by members of group1.
Reporting
Generate a report for historical usage
sreport cluster UserUtilizationByAccount -t Hours start=<start date> Users=$USER
The sreport command can generate a report of cluster usage for a specific user. This can be helpful when analyzing how much CPU time was needed to perform a set of jobs. Detailed information is available on the Slurm sreport reference page.
Example:
sreport cluster UserUtilizationByAccount -t Hours start=2021-01-01 Users=$USER
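To limit the report to a specific window, an end date can also be supplied (the dates here are just examples):
sreport cluster UserUtilizationByAccount -t Hours start=2021-01-01 end=2021-07-01 Users=$USER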