All compute node job scheduling on the mind cluster uses SLURM, and all job requests are submitted from the headnode.
The Rosetta Stone of Workload Managers is a good starting point to learn about SLURM:
SLURM’s Quickstart User Guide provides some information about how SLURM is architected. The example commands and output should be useful:
Useful information about SLURM
- SLURM prevents jobs from taking more resources than are asked for in the job request.
- SLURM will not assign a job to a node that doesn’t have the resources to accommodate the requested job resources. (For example, if you ask for 40GB of RAM, your job will not be assigned to a node that only has 24GB of RAM. If a node has 128GB of RAM but a different user asked for 100GB of RAM and was assigned that node, the scheduler won’t assign your job to it, even if that user is only using a small portion of the memory they requested.)
- If a job process requires more resources than the user requested, the job/process will fail rather than being allowed to take additional resources.
- If your jobs are failing, it is unlikely that the cause is other users’ jobs. Common causes of failed jobs are insufficient resources or bugs/user error. Spend some time understanding the resource requirements of your job(s). You should especially concentrate on memory and time requirements.
- Don’t assume that the preset defaults will work for your processing. These are a few defaults set for the cpu queue that may not match your needs:
DefaultTime=4:00:00 DefMemPerCPU=1024
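These defaults mean that a job which doesn’t specify its own limits gets 1024MB (1GB) of memory per CPU and a 4-hour time limit. If your work needs more, request the resources explicitly; a minimal sketch (the values here are placeholders, not recommendations):
$ srun -p cpu --cpus-per-task=1 --mem-per-cpu=4GB --time=08:00:00 --pty bash
The same overrides can go in a job script as #SBATCH --mem-per-cpu=4GB and #SBATCH --time=08:00:00 lines.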
My jobs are failing and I don’t know why. What should I do?
There are multiple methods for diagnosing the causes of failed jobs, and sometimes the output or error logs don’t help. One simple suggestion is to run a job (or multiple jobs) in an interactive session. When requesting a compute node, ask for a generous amount of memory so the job doesn’t fail by exceeding its request. Then use tools such as htop (htop is installed cluster-wide) to monitor what resources your job is actually using.
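For example, from an interactive shell on a compute node you can watch your processes with htop while they run, and once a job has finished you can ask SLURM’s accounting records what it actually used. A sketch, assuming job accounting (sacct) is available on the cluster; the job ID is a placeholder:
$ htop -u $USER
$ sacct -j 49318 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
A MaxRSS close to the memory you requested, or an Elapsed time near your time limit, is a strong hint about which resource request needs to grow.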
Balancing job length and the number of nodes you are using
Essentially, we ask all of our users to be sure they are not dominating cluster resources in ways that cause other users to unfairly wait for resources. This really falls under “The Reasonable Person Principle” mentioned on our cluster policies page.
The cluster uses SLURM to balance resources among the active users so everyone has fair access. SLURM uses a complicated equation to do this and it works very well in many cases. However, there are limitations to what it can do, especially when considering long jobs.
What is considered a long job? We recommend keeping your job length under 4 hours. Why 4 hours? We feel this is long enough to get real work done while keeping the time other users have to wait for resources on a busy cluster reasonable. Ideally, users should break their work down into steps so that each job runs in under 4 hours. We also understand this is not always possible. For any job that is going to run longer than 4 hours, we therefore ask the user to take some additional care and attention to how their work will impact other users.
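One way to break a long pipeline into shorter jobs is SLURM’s job dependency feature: each stage is submitted as its own short job, and later stages wait for earlier ones to finish successfully. A minimal sketch, where stage1.sh and stage2.sh are hypothetical job scripts that each request well under 4 hours:
$ jid=$(sbatch --parsable stage1.sh)
$ sbatch --dependency=afterok:$jid stage2.sh
The second job sits pending, holding no resources, until the first completes without error.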
If a user has jobs that surpass the 4-hour threshold, they should, at minimum, take one of the following two options:
- Throttle back your consumption to 1 or 2 nodes.
- Continue using multiple nodes, but actively monitor the queue for other users’ jobs waiting for resources. If appropriate, cancel enough of your jobs to make room for the waiting user(s); some example monitoring commands follow this list. There are a lot of factors that go into balancing workload, so we realize this can be a difficult task for a user.
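As an example of that kind of monitoring, you can check whether jobs are pending on the partition you are using, list your own jobs, and cancel some of them if needed (the job IDs here are placeholders):
$ squeue -p cpu --state=PENDING
$ squeue -u $USER
$ scancel 49320 49321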
There are also other factors that go into allocating resources that should be considered in order to be fair to all of our users. For instance, some labs have a priority queue because their PI/lab purchased nodes for the cluster, and users of those priority queues have priority on the nodes in that queue. However, users of those priority queues should make sure that, while taking advantage of their priority queue, they aren’t also blocking other users from fair access to the general queues.
Examples of resource requests
Interactive session request on a specific cpu node
$ srun -p cpu --cpus-per-task=1 --mem=10GB --time=4:00:00 --nodelist=mind-0-11 --pty bash
Interactive session request for a GPU node – 1 GPU for 4 hours and 10GB RAM
$ srun -p gpu --cpus-per-task=1 --gres=gpu:1 --mem=10GB --time=4:00:00 --pty bash
Interactive session request with X11 display on a non-GPU node
$ srun --x11 -p cpu --cpus-per-task=1 --mem=10GB --time=4:00:00 --pty $SHELL
For an interactive session with two GPUs, you could use something like this:
$ srun -p gpu -N1 --gres=gpu:2 --pty $SHELL
An example of a job script which asks for a cpu node and 5GB of memory for 30 minutes. The script only prints the hostname of the compute node the job lands on, loads/unloads a Matlab module, echoes some output, and sleeps for 10 seconds; you can modify it to do your own processing:
#!/bin/bash
## Job name
#SBATCH --job-name=dpane_test
## Mail events (NONE, BEGIN, END, FAIL, ALL)
###############################################
########## example  #SBATCH --mail-type=END,FAIL
##############################################
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=dpane@cmu.edu
## Run on a single CPU
#SBATCH --ntasks=1
# Submit job to cpu queue
#SBATCH -p cpu
## Job memory request
#SBATCH --mem=5gb
## Time limit days-hrs:min:sec
#SBATCH --time 00-00:30:00
## Standard output and error log
#SBATCH --output=/user_data/dpane/exampleOut.out

hostname
echo "job starting"
module load matlab-8.6
cd /user_data/dpane/from_Brian/klab_Suite2P_SLURM_201906
echo "RUNNING MATLAB"
sleep 10
module unload matlab-8.6
echo "job finished"
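Assuming the script above is saved as example_job.sh (the filename is up to you), submit it from the headnode with sbatch, which prints the ID of the job it creates (the ID shown here is just an example):
$ sbatch example_job.sh
Submitted batch job 49319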
To see what jobs are running and which jobs are in the queue:
$ squeue
  JOBID PARTITION  NAME     USER ST   TIME NODES NODELIST(REASON)
  49318       cpu  bash  nblauch  R  40:26     1 mind-0-13
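To narrow the listing to your own jobs, or to see the scheduler’s current estimate of when your pending jobs will start, squeue takes a few useful flags (the start-time estimate changes as other jobs finish or are submitted):
$ squeue -u $USER
$ squeue -u $USER --start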
There are a variety of sample SLURM job submission scripts made available by other institutions using SLURM. Many of these should be fairly portable, though they may contain some bits specific to their environments.