
Using SSH Safely Alongside Slurm

On the Mind cluster, you are only permitted to SSH into a compute node after Slurm has officially granted you an active allocation on that node (via an active srun or sbatch job).

However, opening a separate, raw SSH window to check on your code or run an interactive process such as VS Code bypasses Slurm’s automatic environment protections. If you do not manually bind your new SSH window to your assigned GPUs, your code will default to physical GPU 0, which may belong to another user, and can crash their jobs with out-of-memory (OOM) errors.

The steps below explain how to identify the physical GPUs allocated to you and safely route GPU work from your SSH window to them.

Step 1: Find Your Assigned Physical GPU IDs

Before running anything in a direct SSH window, you must find out which specific physical hardware slots Slurm has reserved for your active job.

From your active Slurm terminal window, run the following command (Slurm sets $SLURM_JOB_ID inside your job, so the shell fills in your active job ID for you):

Bash

scontrol show job $SLURM_JOB_ID -d | grep "Nodes="

How to read the output:

The grep may also match other lines, such as the one containing NumNodes=. Look at the end of the line that begins with Nodes= for the GRES and IDX labels.

  • If it says GRES=gpu:3(IDX:5-7), Slurm has assigned you physical GPUs 5, 6, and 7.

  • If it says GRES=gpu:L40S:2(IDX:2-3), Slurm has assigned you physical GPUs 2 and 3.
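If you prefer not to read the range by eye, the conversion from an IDX range to a comma-separated list can be sketched in the shell. This is a minimal sketch, not a Slurm feature: the helper names extract_gpu_idx and expand_gpu_idx are hypothetical, and it assumes a single-node job whose output contains one (IDX:...) group in the format shown above.

```shell
# Hypothetical helpers for illustration; neither is part of Slurm.

# extract_gpu_idx: read scontrol-style text on stdin and print the contents
# of the first "(IDX:...)" group, e.g. "5-7".
extract_gpu_idx() {
    sed -n 's/.*(IDX:\([0-9,-]*\)).*/\1/p' | head -n 1
}

# expand_gpu_idx: turn an IDX spec such as "5-7" or "0,2-3" into a
# comma-separated list such as "5,6,7" or "0,2,3".
expand_gpu_idx() {
    # Split on commas, expand each "a-b" range with seq, then rejoin.
    printf '%s\n' "$1" | tr ',' '\n' | while IFS='-' read -r start end; do
        if [ -n "$end" ]; then
            seq "$start" "$end"
        else
            printf '%s\n' "$start"
        fi
    done | paste -sd, -
}

# Example usage inside your active Slurm terminal (commented out here
# because it needs a running job):
#   idx="$(scontrol show job "$SLURM_JOB_ID" -d | extract_gpu_idx)"
#   expand_gpu_idx "$idx"    # for IDX:5-7 this prints 5,6,7
```

The printed list is the value to use in Step 2. Check it against the raw scontrol output before relying on it.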

Step 2: Bind Your SSH Window to Your GPUs

Once you know your physical IDX numbers, open your separate terminal tab and log into your assigned compute node via SSH.

Before you launch any Python script, Jupyter notebook, or background daemon in that SSH window, you must set your environmental boundary. Run the export command using your exact assigned indices:

# Replace the numbers with your actual allocated physical indices
export CUDA_VISIBLE_DEVICES=5,6,7

What this does behind the scenes

Setting this variable makes CUDA, and the machine learning frameworks built on it (such as PyTorch and TensorFlow), hide the rest of the node’s GPUs from every process you start in that SSH session.

To your code, your assigned cards are the only cards that exist on the machine. This guarantees your code will never accidentally bleed over onto your neighbor’s hardware.
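One consequence worth knowing: CUDA renumbers the visible cards starting from 0, in the order you listed them. With CUDA_VISIBLE_DEVICES=5,6,7, your code sees devices 0, 1, and 2, which correspond to physical GPUs 5, 6, and 7. The sketch below illustrates that mapping. The function name physical_gpu_for is hypothetical and exists only for this example.

```shell
# Hypothetical helper for illustration; not part of CUDA or Slurm.
# Given a logical CUDA device index (0, 1, 2, ...), print the matching
# entry from CUDA_VISIBLE_DEVICES. Fails if the variable is unset or the
# index is out of range.
physical_gpu_for() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is not set" >&2
        return 1
    fi
    # Line N+1 of the comma-split list corresponds to logical device N.
    entry="$(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | sed -n "$(($1 + 1))p")"
    if [ -z "$entry" ]; then
        echo "logical device $1 is outside CUDA_VISIBLE_DEVICES" >&2
        return 1
    fi
    printf '%s\n' "$entry"
}

# Example usage:
#   export CUDA_VISIBLE_DEVICES=5,6,7
#   physical_gpu_for 0    # prints 5
#   physical_gpu_for 2    # prints 7
```

This matters when reading logs: a framework message about "device 0" in your SSH session refers to the first card in your list, not to physical GPU 0.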

Best Practices Checklist

  • You can always double-check that your boundary is active in your current SSH terminal by running echo $CUDA_VISIBLE_DEVICES.

  • A clean, neighbor-safe workflow inside an SSH window looks like this:

    # 1. Lock down your hardware footprint
    export CUDA_VISIBLE_DEVICES=5,6,7
    
    # 2. Activate your environment and run your code
    conda activate my_env
    python my_daemon.py &
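
To make the checklist harder to forget, you could wrap your launches in a small guard that refuses to start anything while the variable is unset. This is only a sketch under that assumption: run_bound is a hypothetical function name, not a Slurm or CUDA command, and it checks only that the variable is set, not that the indices match your allocation.

```shell
# Hypothetical wrapper for illustration; define it in your SSH session
# (or in a shell startup file) before launching GPU work.
run_bound() {
    # Refuse to launch if no GPU boundary has been exported yet.
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "Refusing to run: CUDA_VISIBLE_DEVICES is not set." >&2
        return 1
    fi
    # Boundary is set, so run the requested command unchanged.
    "$@"
}

# Example usage:
#   export CUDA_VISIBLE_DEVICES=5,6,7
#   run_bound python my_daemon.py &
```

The guard does not replace Step 1. It only catches the case where the export was skipped entirely.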