The Strelka computing cluster uses the Slurm workload manager to schedule jobs on the cluster.  This page explains how to use Slurm to submit, manage, and analyze jobs.

Slurm References

For a detailed reference, the Slurm site has extensive documentation:

Slurm Quick Reference Guide - a two-page summary of common commands

Slurm Man Pages - full reference for Slurm user commands

Job Management

Submit a job to the queue

sbatch <submission script file>
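
For reference, here is a minimal example submission script.  The job name, memory request, and program shown here are placeholders; adjust them for your own work.

#!/bin/bash
#SBATCH --job-name=myjob        # placeholder job name
#SBATCH --partition=compute     # one of the partitions listed by sinfo
#SBATCH --ntasks=1              # number of tasks (cores)
#SBATCH --time=1:00:00          # time limit (HH:MM:SS)
#SBATCH --mem=4G                # memory request

./my_program                    # placeholder; replace with your own program

Save the script to a file (e.g. job.sh) and submit it with sbatch job.sh.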

Cancel a job 

scancel <jobid>

This command cancels a job, whether it is waiting in the queue or already running.
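
For example, to cancel a hypothetical job with ID 12345, or to cancel all of your own jobs at once:

scancel 12345
scancel -u <username>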

Partition Information

sinfo

This command shows information about each partition on the cluster.  You can see the names of the nodes in each partition and the status of each partition.
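
Illustrative output (the partitions, time limits, and node lists here are examples only; the * marks the default partition):

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2    mix node[01,03]
compute*     up   infinite      1  alloc node02
gpu          up   infinite      1    mix gpu02
gpu          up   infinite      1  alloc gpu01
himem        up   infinite      3  alloc himem[01-03]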

Request an interactive node

It is possible to run a job in interactive mode, which runs the code on a compute node.  Use the Slurm command salloc to start an interactive job.  See the Slurm help page for salloc for options.

Example: Request a single core for 1 hour in the GPU partition

salloc --ntasks=1 --time=1:00:00 --partition=gpu

Once the allocation is created, execute code with the srun command.
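
For example, to run a hypothetical Python script inside the allocation:

srun python my_script.py

When you are finished, type exit to release the allocation.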

Job Information and Analysis

Show all jobs in the Slurm queue

squeue

The squeue command lists all the jobs submitted to Slurm, both running and queued.  This can be helpful for seeing how busy the cluster is.
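
Illustrative output (the job IDs, names, and username are placeholders; the ST column shows the job state, e.g. R = running, PD = pending):

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 12345   compute   my_job   apaul1  R    1:23:45      1 node01
 12346       gpu  gpu_job   apaul1 PD       0:00      1 (Resources)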

Show all jobs in the Slurm queue submitted by a specific user

squeue -u <username>

Replace "<username>" with a username.  This is useful for listing all your jobs

Get details about a running job

scontrol show jobid -d <jobid>

This command shows detailed information about a running job, including how many nodes and cores were requested, how much memory was requested, and the start, elapsed, and end times.  It can be run for any job, including those you did not submit.
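
For example, for a hypothetical job 12345:

scontrol show jobid -d 12345

The output is a list of Key=Value pairs, including fields such as JobState, RunTime, TimeLimit, NumNodes, NumCPUs, and TRES.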

List status info for a currently running job

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

The sstat command can only be run on a job that you submitted.  It provides detailed, configurable reporting on a running job.  See detailed options on the Slurm sstat reference page.
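
Illustrative output for a hypothetical job 12345 (memory sizes are reported in kilobytes by default):

    AveCPU   AvePages     AveRSS  AveVMSize        JobID
---------- ---------- ---------- ---------- ------------
  01:23:45          0   2621440K   4194304K      12345.0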

Get efficiency statistics for a currently running job

seff <jobid>

Use the seff command to see CPU and memory usage for a job.  The command shows how efficiently a job is using both CPU and memory.  If you notice low utilization, you may be able to request fewer resources.
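
Illustrative output for a hypothetical four-core job (the numbers here are examples only):

Job ID: 12345
Cluster: strelka
User/Group: apaul1/apaul1
State: RUNNING
Nodes: 1
Cores per node: 4
CPU Utilized: 02:15:30
CPU Efficiency: 85.01% of 02:39:24 core-walltime
Job Wall-clock time: 00:39:51
Memory Utilized: 2.50 GB
Memory Efficiency: 62.50% of 4.00 GB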

Get details about a completed job

sacct -o jobid,jobname,start,end,NNodes,NCPUS,ReqMem,CPUTime,AveRSS,MaxRSS -S <start date> -E <end date> --user=<username> --units=G

The sacct command shows information about completed jobs, which can be helpful for seeing how much memory was used.  Check the sacct Slurm reference page for the full list of available attributes.

Example:

sacct -o jobid,jobname,start,end,NNodes,NCPUS,ReqMem,CPUTime,AveRSS,MaxRSS -S 2021-06-01 -E 2021-07-01 --user=apaul1 --units=G
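
Illustrative output (the job and its batch step are hypothetical; with --units=G, memory values are reported in gigabytes):

       JobID    JobName               Start                 End   NNodes      NCPUS     ReqMem    CPUTime     AveRSS     MaxRSS
------------ ---------- ------------------- ------------------- -------- ---------- ---------- ---------- ---------- ----------
12345            my_job 2021-06-15T10:00:00 2021-06-15T12:00:00        1          4         8G   08:00:00
12345.batch       batch 2021-06-15T10:00:00 2021-06-15T12:00:00        1          4              08:00:00      2.50G      3.10G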

Cluster Information

Show busy/free cores for the entire cluster

sinfo -o "%C"

Example output (A=allocated, I=idle, O=other, T=total):

CPUS(A/I/O/T)
296/24/0/320

In this example, 296 cores in the cluster are allocated (busy), 24 cores are idle, 0 cores are other (e.g. unavailable, down), and there are 320 total cores.  

Show busy/free cores for each partition

This command shows how many cores are allocated and idle in each partition.

sinfo -o "%R %C"

Example output (A=allocated, I=idle, O=other, T=total):

PARTITION CPUS(A/I/O/T)
compute 296/24/0/320
himem 120/0/0/120
gpu 76/4/0/80

Show busy/free cores for each node

This command shows how many cores are allocated and idle on each node.  The %T field also shows each node's state (idle, allocated, or mixed).

sinfo -o "%n %T %C"

Example output (A=allocated, I=idle, O=other, T=total):

HOSTNAMES STATE CPUS(A/I/O/T)
node01 mixed 30/10/0/40
node03 mixed 30/10/0/40
himem02 allocated 40/0/0/40
gpu02 mixed 36/4/0/40
gpu01 allocated 40/0/0/40
himem01 allocated 40/0/0/40
himem03 allocated 40/0/0/40
node02 allocated 40/0/0/40

In this example, some nodes are in a mixed state (partially allocated) and the rest are completely allocated.

Show reserved nodes

Researchers who have purchased nodes on Strelka may reserve them for exclusive use during periods of intense work.  Use this command to see the list of reserved nodes:

sinfo -T

Example output:

RESV_NAME      STATE           START_TIME             END_TIME     DURATION  NODELIST
group1       ACTIVE  2021-08-31T10:23:01  2021-12-15T09:23:01  106-00:00:00  node[01,02]

This indicates that node01 and node02 are reserved and can only be used by members of group1.

Reporting

Generate a report for historical usage

sreport cluster UserUtilizationByAccount -t Hours start=<start date> Users=$USER

The sreport command can generate a report of cluster usage for a specific user.  This can be helpful when analyzing how much CPU time was needed to perform a set of jobs.  Detailed information is available on the Slurm sreport reference page.

Example:

sreport cluster UserUtilizationByAccount -t Hours start=2021-01-01 Users=$USER
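
Illustrative output (header lines omitted; the account name, proper name, and hours are placeholders, and the Used column is in CPU hours because of -t Hours):

  Cluster     Login     Proper Name         Account     Used   Energy
--------- --------- --------------- --------------- -------- --------
  strelka    apaul1      Alice Paul          group1     1234        0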