{"id":722,"date":"2026-06-15T15:43:37","date_gmt":"2026-06-15T15:43:37","guid":{"rendered":"https:\/\/ni.cmu.edu\/computing\/?page_id=722"},"modified":"2026-06-15T15:43:37","modified_gmt":"2026-06-15T15:43:37","slug":"using-ssh-safely-alongside-slurm","status":"publish","type":"page","link":"https:\/\/ni.cmu.edu\/computing\/using-ssh-safely-alongside-slurm\/","title":{"rendered":"Using SSH Safely Alongside Slurm"},"content":{"rendered":"<p data-path-to-node=\"4\">On the Mind cluster, you are only permitted to SSH into a compute node after Slurm has officially granted you an active allocation on that node (via an active <code data-path-to-node=\"4\" data-index-in-node=\"155\">srun<\/code> or <code data-path-to-node=\"4\" data-index-in-node=\"163\">sbatch<\/code> job).<\/p>\n<p data-path-to-node=\"5\">However, opening a separate, raw SSH window to check on your code or run an interactive process such as VS Code bypasses Slurm&#8217;s automatic environment protections. If you do not manually bind your new SSH window to your assigned GPUs, your code will blindly default to physical GPU 0, accidentally hijacking another user&#8217;s hardware and possibly causing their jobs to crash with Out-of-Memory (OOM) errors.<\/p>\n<p data-path-to-node=\"6\">The following explains how to identify the exact physical GPUs allocated to you and safely route GPU work from your SSH session to them.<\/p>\n<h2 data-path-to-node=\"8\">Step 1: Find Your Assigned Physical GPU IDs<\/h2>\n<p data-path-to-node=\"9\">Before running anything in a direct SSH window, you must find out which specific physical hardware slots Slurm has reserved for your active job.<\/p>\n<p data-path-to-node=\"10\">From your active Slurm terminal window, run the following command (Slurm sets the SLURM_JOB_ID variable inside your job, so your active job ID is substituted automatically):<\/p>\n<div class=\"code-block ng-tns-c3678280791-359 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation\">\n<div class=\"formatted-code-block-internal-container 
ng-tns-c3678280791-359\">\n<div class=\"animated-opacity ng-tns-c3678280791-359\">\n<pre class=\"ng-tns-c3678280791-359\"><code class=\"code-container formatted ng-tns-c3678280791-359\" role=\"text\" data-test-id=\"code-content\">scontrol show job <span class=\"hljs-variable\">$SLURM_JOB_ID<\/span> -d | grep <span class=\"hljs-string\">\"Nodes=\"<\/span>\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<h3 data-path-to-node=\"12\">How to read the output:<\/h3>\n<p data-path-to-node=\"13\">In the returned output, look for the <code data-path-to-node=\"13\" data-index-in-node=\"45\">GRES<\/code> and <code data-path-to-node=\"13\" data-index-in-node=\"54\">IDX<\/code> labels at the end of a line.<\/p>\n<ul data-path-to-node=\"14\">\n<li>\n<p data-path-to-node=\"14,0,0\">If it says <code data-path-to-node=\"14,0,0\" data-index-in-node=\"11\">GRES=gpu:3(IDX:5-7)<\/code>, Slurm has assigned you physical GPUs 5, 6, and 7.<\/p>\n<\/li>\n<li>\n<p data-path-to-node=\"14,1,0\">If it says <code data-path-to-node=\"14,1,0\" data-index-in-node=\"11\">GRES=gpu:L40S:2(IDX:2-3)<\/code>, Slurm has assigned you physical GPUs 2 and 3.<\/p>\n<\/li>\n<\/ul>\n<h2 data-path-to-node=\"16\">Step 2: Bind Your SSH Window to Your GPUs<\/h2>\n<p data-path-to-node=\"17\">Once you know your physical <code data-path-to-node=\"17\" data-index-in-node=\"28\">IDX<\/code> numbers, open a separate terminal tab and log in to your assigned compute node via SSH.<\/p>\n<p data-path-to-node=\"18\">Before you launch any Python script, Jupyter notebook, or background daemon in that SSH window, you must set your environment boundary. 
Run the <code data-path-to-node=\"18\" data-index-in-node=\"146\">export<\/code> command using your exact assigned indices:<\/p>\n<div class=\"code-block ng-tns-c3678280791-360 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation\">\n<div class=\"formatted-code-block-internal-container ng-tns-c3678280791-360\">\n<div class=\"animated-opacity ng-tns-c3678280791-360\">\n<pre class=\"ng-tns-c3678280791-360\"><code class=\"code-container formatted ng-tns-c3678280791-360\" role=\"text\" data-test-id=\"code-content\"><span class=\"hljs-comment\"># Replace the numbers with your actual allocated physical indices<\/span>\r\n<span class=\"hljs-built_in\">export<\/span> CUDA_VISIBLE_DEVICES=5,6,7\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<blockquote>\n<h3 data-path-to-node=\"20\">What this does behind the scenes<\/h3>\n<p data-path-to-node=\"21\">Setting this variable makes the CUDA runtime, and the machine learning frameworks built on it (such as PyTorch and TensorFlow), hide the rest of the node&#8217;s GPUs from every program you start in that SSH session after the export.<\/p>\n<p data-path-to-node=\"22\">To your code, your assigned cards are the <i data-path-to-node=\"22\" data-index-in-node=\"42\">only<\/i> cards that exist on the machine. 
As long as the variable is set before you launch anything, your code will not accidentally spill over onto your neighbor&#8217;s hardware.<\/p>\n<\/blockquote>\n<h2 data-path-to-node=\"24\">Best Practices Checklist<\/h2>\n<ul data-path-to-node=\"25\">\n<li>\n<p data-path-to-node=\"25,0,0\">You can always double-check that your boundary is active in your current SSH terminal by running <code data-path-to-node=\"25,0,0\" data-index-in-node=\"119\">echo $CUDA_VISIBLE_DEVICES<\/code>. The setting applies only to the shell where you ran the export, so repeat it in every new SSH window.<\/p>\n<\/li>\n<li>\n<p data-path-to-node=\"25,1,0\">A clean, neighbor-safe workflow inside an SSH window looks like this:<\/p>\n<div class=\"code-block ng-tns-c3678280791-361 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation\">\n<div class=\"formatted-code-block-internal-container ng-tns-c3678280791-361\">\n<div class=\"animated-opacity ng-tns-c3678280791-361\">\n<pre class=\"ng-tns-c3678280791-361\"><code class=\"code-container formatted ng-tns-c3678280791-361\" role=\"text\" data-test-id=\"code-content\"><span class=\"hljs-comment\"># 1. Lock down your hardware footprint<\/span>\r\n<span class=\"hljs-built_in\">export<\/span> CUDA_VISIBLE_DEVICES=5,6,7\r\n\r\n<span class=\"hljs-comment\"># 2. Activate your environment and run your code<\/span>\r\nconda activate my_env\r\npython my_daemon.py &amp;\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>On the Mind cluster, you are only permitted to SSH into a compute node after Slurm has officially granted you an active allocation on that node (via an active srun or sbatch job). 
However, opening a separate, raw SSH window to check on your code or run an interactive process like&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-722","page","type-page","status-publish","hentry"],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/pages\/722","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/comments?post=722"}],"version-history":[{"count":1,"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/pages\/722\/revisions"}],"predecessor-version":[{"id":723,"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/pages\/722\/revisions\/723"}],"wp:attachment":[{"href":"https:\/\/ni.cmu.edu\/computing\/wp-json\/wp\/v2\/media?parent=722"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}