On the Mind cluster, you are only permitted to SSH into a compute node after Slurm has officially granted you an active allocation on that node (via an active srun or sbatch job).
However, opening a separate, raw SSH window to check on your code or run an interactive process such as VS Code bypasses Slurm’s automatic environment protections. If you do not manually bind your new SSH window to your assigned GPUs, your code will default to physical GPU 0, which may belong to another user, and can crash their jobs with Out-of-Memory (OOM) errors.
The following steps explain how to identify the exact physical GPUs allocated to you and safely route GPU work from your SSH session to them.
Step 1: Find Your Assigned Physical GPU IDs
Before running anything in a direct SSH window, you must find out which specific physical hardware slots Slurm has reserved for your active job.
From your active Slurm terminal window, run the following command (Slurm will automatically substitute your active Job ID for you):
scontrol show job $SLURM_JOB_ID -d | grep "Nodes="
How to read the output:
Look at the end of the returned line for the GRES and IDX labels.
- If it says GRES=gpu:3(IDX:5-7), Slurm has assigned you physical GPUs 5, 6, and 7.
- If it says GRES=gpu:L40S:2(IDX:2-3), Slurm has assigned you physical GPUs 2 and 3.
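If you would rather not expand the IDX range by hand, the reading rule above can be sketched as two small shell helpers. The function names extract_idx and expand_idx are hypothetical (they are not part of Slurm), and the sketch assumes a single-node job whose detail line contains exactly one (IDX:...) group in the format shown above.

```bash
# Hypothetical helpers, not part of Slurm.

# Print the text between "(IDX:" and ")" from a scontrol detail line.
# Lines without an IDX group produce no output.
extract_idx() {
    printf '%s\n' "$1" | sed -n 's/.*(IDX:\([^)]*\)).*/\1/p'
}

# Turn an index spec such as "5-7" or "0,2-3" into "5,6,7" or "0,2,3".
# The body runs in a subshell, so IFS and the loop variables do not leak out.
expand_idx() (
    IFS=','
    out=""
    for part in $1; do
        case "$part" in
            *-*)
                # A range: count from the low end to the high end.
                i="${part%-*}"
                hi="${part#*-}"
                while [ "$i" -le "$hi" ]; do
                    out="${out}${i},"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single index: copy it through.
                out="${out}${part},"
                ;;
        esac
    done
    printf '%s\n' "${out%,}"
)
```

From your active Slurm terminal window, something like expand_idx "$(extract_idx "$(scontrol show job $SLURM_JOB_ID -d | grep "Nodes=")")" would then print the comma-separated list to use in Step 2.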
Step 2: Bind Your SSH Window to Your GPUs
Once you know your physical IDX numbers, open your separate terminal tab and log into your assigned compute node via SSH.
Before you launch any Python script, Jupyter notebook, or background daemon in that SSH window, you must set your environmental boundary. Run the export command using your exact assigned indices:
# Replace the numbers with your actual allocated physical indices
export CUDA_VISIBLE_DEVICES=5,6,7
What this does behind the scenes
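Setting this variable makes CUDA, and the machine learning frameworks built on it (PyTorch, TensorFlow, etc.), hide the rest of the node’s GPUs from every process you start in your SSH session after the export. Note that nvidia-smi does not honor this variable and will still list every GPU on the node.
To your code, your assigned cards are the only cards that exist on the machine, renumbered from 0 in the order you listed them. As long as the variable is set before a process starts, that process will not accidentally bleed over onto your neighbor’s hardware.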
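To make the "only cards that exist" idea concrete, here is a minimal sketch that prints how the device numbers your code sees would line up with the physical indices you exported. The function name show_gpu_mapping is hypothetical, and the sketch assumes CUDA_VISIBLE_DEVICES holds plain numeric indices as in the example above; it only reads the variable and does not query the GPUs themselves.

```bash
# Hypothetical helper: list which physical GPU each in-process device number
# refers to, based only on the current value of CUDA_VISIBLE_DEVICES.
show_gpu_mapping() (
    # Subshell body keeps the IFS change local to this function.
    IFS=','
    logical=0
    for physical in ${CUDA_VISIBLE_DEVICES:-}; do
        printf 'device %d in your code -> physical GPU %s\n' "$logical" "$physical"
        logical=$((logical + 1))
    done
)
```

With export CUDA_VISIBLE_DEVICES=5,6,7 in effect, the helper prints three lines, one each for devices 0, 1, and 2, so a script that asks for device 0 would land on physical GPU 5 rather than on someone else's card.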
Best Practices Checklist
- You can always double-check that your boundary is active in your current SSH terminal by running echo $CUDA_VISIBLE_DEVICES.
- A clean, neighbor-safe workflow inside an SSH window looks like this:
# 1. Lock down your hardware footprint
export CUDA_VISIBLE_DEVICES=5,6,7
# 2. Activate your environment and run your code
conda activate my_env
python my_daemon.py &
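If you want a safety net for the checklist above, one option is a small wrapper that refuses to launch anything until the boundary is set. This is only a sketch: gpu_binding_ok and run_bound are hypothetical names, and the check assumes CUDA_VISIBLE_DEVICES contains a comma-separated list of numeric indices, as in this guide (it would reject other forms of the variable, such as GPU UUIDs).

```bash
# Hypothetical guard: succeed only if CUDA_VISIBLE_DEVICES looks like "5" or "5,6,7".
gpu_binding_ok() {
    case "${CUDA_VISIBLE_DEVICES:-}" in
        # Reject: empty, any character other than digits and commas,
        # a leading comma, a trailing comma, or an empty entry (",,").
        ''|*[!0-9,]*|,*|*,|*,,*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical wrapper: run the given command only when the guard passes.
run_bound() {
    if ! gpu_binding_ok; then
        echo "CUDA_VISIBLE_DEVICES is not set to a list of GPU indices; refusing to run: $*" >&2
        return 1
    fi
    "$@"
}
```

After pasting these functions into your SSH session (or your shell startup file), run_bound python my_daemon.py & would start the daemon only if the export from step 1 has already been done.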