The Strelka computing cluster uses the Slurm queue manager to schedule jobs to run on the cluster. This page has information on how to use Slurm to submit, manage, and analyze jobs.
Slurm References
For detailed reference, the Slurm site has extensive documentation:
Slurm Quick reference guide - a two-page document with common command references
Slurm Man Pages - full reference for Slurm user commands
Job Management
Submit a job to the queue
sbatch <submission script file>
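The submission script is a shell script whose #SBATCH comment lines describe the resources the job needs. Below is a minimal sketch; the job name, partition, resource amounts, and program name are illustrative placeholders, not site defaults:
#!/bin/bash
#SBATCH --job-name=myjob        # hypothetical job name
#SBATCH --partition=compute     # one of the partitions shown by sinfo
#SBATCH --ntasks=1              # number of cores
#SBATCH --mem=4G                # memory request
#SBATCH --time=01:00:00         # wall-clock limit (HH:MM:SS)

srun ./my_program               # hypothetical executable
Save it as, for example, myjob.sh and submit it with sbatch myjob.sh.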
Cancel a job
scancel <jobid>
This command cancels a job, whether it is waiting in the queue or already running.
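For example, to cancel a job with ID 123456 (a made-up ID for illustration):
scancel 123456
To cancel all of your own jobs at once:
scancel -u $USER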
Partition Information
sinfo
This command shows information about each partition on the cluster, including the names of each partition's nodes and its current status.
Request an interactive node
It is possible to run a job in interactive mode, meaning the code runs directly on a compute node. Use the Slurm command salloc to start an interactive job. See the Slurm help page for salloc for available options.
Example: Request a single core for 1 hour in the GPU partition
salloc --partition=gpu --ntasks=1 --time=1:00:00
Once the allocation is created, execute code with the srun command.
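For example, once salloc has granted the allocation, the following confirms that commands run on the allocated node rather than the login node:
srun hostname
Any other command, such as a hypothetical srun ./my_program, runs the same way inside the allocation.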
Job Information and Analysis
Show all jobs in the Slurm queue
squeue
The squeue command lists all the jobs submitted to Slurm, both running and queued. This can be helpful to see how busy the cluster is.
Show all jobs in the Slurm queue submitted by a specific user
squeue -u <username>
Replace "<username>
" with a username. This is useful for listing all your jobs
Get details about a running job
scontrol show jobid -d <jobid>
This command shows detailed information about a running job, including the number of nodes and cores requested, the memory requested, and the start, elapsed, and end times. It can be run for any job, including those you did not submit.
List status info for a currently running job
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
The sstat command can only be run on a job that you submitted. It provides detailed, configurable reporting on a running job. See detailed options on the Slurm sstat reference page.
Get efficiency statistics for a currently running job
seff <jobid>
Use the seff command to see CPU and memory usage for a job. It shows how efficiently the job is using both CPU and memory; if utilization is low, you may be able to request fewer resources.
Get details about a completed job
sacct -o jobid,jobname,start,end,NNodes,NCPUS,ReqMem,CPUTime,AveRSS,MaxRSS -S <start date> -E <end date> --user=<username> --units=G
The sacct command shows information about completed jobs, which can be helpful for checking how much memory a job actually used. Check the sacct Slurm reference page for the full list of available attributes.
Example:
sacct -o jobid,jobname,start,end,NNodes,NCPUS,ReqMem,CPUTime,AveRSS,MaxRSS -S 2021-06-01 -E 2021-07-01 --user=apaul1 --units=G
Cluster Information
Show busy/free cores for the entire cluster
sinfo -o "%C"
Example output (A=allocated, I=idle, O=other, T=total):
CPUS(A/I/O/T) 296/24/0/320
In this example, 296 cores in the cluster are allocated (busy), 24 cores are idle, 0 cores are other (e.g. unavailable, down), and there are 320 total cores.
Show busy/free cores for each partition
This command shows how many cores are allocated and idle in each partition.
sinfo -o "%R %C"
Example output (A=allocated, I=idle, O=other, T=total):
PARTITION CPUS(A/I/O/T)
compute 296/24/0/320
himem 120/0/0/120
gpu 76/4/0/80
Show busy/free cores for each node
This command shows how many cores are allocated and idle on each node, along with each node's state (idle, allocated, or mixed).
sinfo -o "%n %T %C"
Example output (A=allocated, I=idle, O=other, T=total):
HOSTNAMES STATE CPUS(A/I/O/T)
node01 mixed 30/10/0/40
node03 mixed 30/10/0/40
himem02 allocated 40/0/0/40
gpu02 mixed 36/4/0/40
gpu01 allocated 40/0/0/40
himem01 allocated 40/0/0/40
himem03 allocated 40/0/0/40
node02 allocated 40/0/0/40
In this example, some nodes are in a mixed state and others are fully allocated.
Show reserved nodes
Researchers who have purchased nodes on Strelka may reserve them for exclusive use during periods of intense work. Use this command to see the list of reserved nodes:
sinfo -T
Example output:
RESV_NAME STATE START_TIME END_TIME DURATION NODELIST
group1 ACTIVE 2021-08-31T10:23:01 2021-12-15T09:23:01 106-00:00:00 node[01,02]
This indicates that node01 and node02 are reserved and can only be used by members of group1.
Reporting
Generate a report for historical usage
sreport cluster UserUtilizationByAccount -t Hours start=<start date> Users=$USER
The sreport command can generate a report of cluster usage for a specific user. This can be helpful when analyzing how much CPU time was needed to perform a set of jobs. Detailed information is available on the Slurm sreport reference page.
Example:
sreport cluster UserUtilizationByAccount -t Hours start=2021-01-01 Users=$USER