Kodiak GPUs

To run a CUDA-enabled (i.e., NVIDIA GPU) program, specify the gpu queue when you submit your job so that it runs on one of the gpu nodes. Each gpu node has two GPUs, so request 18 processors per node (or 36 if your job needs both GPUs), even if your program runs as a single process. If you request fewer than 18 processors, multiple jobs may run on the node simultaneously, each competing for the same GPU.

$ qsub -q gpu -l nodes=1:ppn=18 my_gpu_program.sh

Note: Be sure to load the cuda##/toolkit module (for example, cuda92/toolkit/9.2.88) in your qsub-ed shell script before running your CUDA-enabled program.

Each gpu node has two GPUs, designated as device 0 and device 1. So assuming your program is only going to use a single GPU, when your job starts, which device will it run on? By default, it will run on device 0 even if another process is also running on device 0.

NVIDIA provides a mechanism for specifying which GPU to run on: set the environment variable CUDA_VISIBLE_DEVICES to the device number(s) that you want to use. For example, to run only on device 0, add export CUDA_VISIBLE_DEVICES=0 to your qsub-ed script somewhere before your program runs. Likewise, to run on device 1, add export CUDA_VISIBLE_DEVICES=1. To allow your program to run on either device, use export CUDA_VISIBLE_DEVICES=0,1 (or just don't set the environment variable at all), but this will usually mean that the program runs on device 0.
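
The difference between exporting the variable for the rest of the script and setting it for a single command can be sketched as follows. The report function is a hypothetical stand-in for a CUDA program: it only prints the value that a child process would inherit, so the sketch runs without a GPU.

```shell
# Hypothetical stand-in for a CUDA program: a child shell that prints
# the CUDA_VISIBLE_DEVICES value it inherited.
report() {
    sh -c 'echo "${CUDA_VISIBLE_DEVICES-unset}"'
}

# Exported: every program started later in the script sees device 1 only.
export CUDA_VISIBLE_DEVICES=1
exported=$(report)

# Per-command: only this one child sees device 0; the export is unchanged.
one_shot=$(CUDA_VISIBLE_DEVICES=0 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')
after=$(report)

echo "exported=$exported one_shot=$one_shot after=$after"
```

In a job script, the export form is usually the simpler choice, since it applies to every program the script starts afterwards.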

You should set CUDA_VISIBLE_DEVICES to a GPU device that is currently idle and not in use by any other processes. But how can you know if a device is idle? If you are logged in to a gpu node interactively, you can run the nvidia-smi utility:

$ module load cuda92/toolkit/9.2.88

$ nvidia-smi
Tue Jul  9 16:46:45 2019
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   29C    P0    26W / 250W |      0MiB / 16280MiB |     85%      Default |
|   1  Tesla P100-PCIE...  On   | 00000000:82:00.0 Off |                    0 |
| N/A   26C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0    348961      C   some_cuda_program                          15387MiB |
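
For a quicker check than reading the full table, nvidia-smi can print selected fields as CSV. The sketch below assumes the --query-gpu and --format options found in recent drivers, and it treats a device as idle when its reported utilization is 0%, which is a simplification (a process can hold a GPU while momentarily showing 0%). The idle_devices helper name is hypothetical.

```shell
# Hypothetical helper: read "index, utilization" CSV lines on stdin and
# print the index of each device whose utilization is 0.
idle_devices() {
    awk -F',' 'NF >= 2 {
        gsub(/ /, "", $1)
        gsub(/ /, "", $2)
        if ($2 + 0 == 0) print $1
    }'
}

# On a gpu node, feed it live data (skipped if nvidia-smi is not available).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu \
               --format=csv,noheader,nounits | idle_devices
fi
```

Given the utilization figures in the sample output above (85% on device 0, 0% on device 1), the helper would print 1.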


But if you submit your job normally, as a non-interactive batch job, the nvidia-smi utility isn't an option. Instead, Kodiak provides a simple utility, idlegpu, that outputs the device number of any GPU that is idle. If device 0 is currently in use but device 1 is not, it will output "1". Similarly, if device 1 is in use but device 0 is idle, it will output "0". If both devices are idle, it will output "0,1". Finally, if both devices are in use, it will output nothing. (This final case shouldn't happen if everyone "plays by the rules" and requests 18 or 36 ppn for their job.)

Note: You can specify a preferred device if both happen to be idle with the -p option. If you run idlegpu -p 1 and both are idle, idlegpu will output "1" instead of "0,1".

So to use idlegpu in your qsub-ed shell script, you will want to set the value of CUDA_VISIBLE_DEVICES to the output of idlegpu by enclosing it in backticks, i.e., `idlegpu`.

export CUDA_VISIBLE_DEVICES=`idlegpu`
echo "Program running on device $CUDA_VISIBLE_DEVICES"
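
Because idlegpu prints nothing when both devices are busy, the assignment above can leave CUDA_VISIBLE_DEVICES empty, and an empty value generally hides every GPU from a CUDA program. A minimal sketch of a guard follows. The pick_device helper is hypothetical: it takes the idlegpu output as an argument so that it can be tried without the utility installed, and it assumes the job wants exactly one GPU.

```shell
# Hypothetical helper: given idlegpu-style output ("", "0", "1", or "0,1"),
# print one device number, or fail if no device is idle.
pick_device() {
    idle="$1"
    if [ -z "$idle" ]; then
        echo "no idle GPU on this node" >&2
        return 1
    fi
    # When both devices are idle ("0,1"), keep only the first one.
    echo "${idle%%,*}"
}

# In a job script this might be used as:
#   device=$(pick_device "$(idlegpu)") || exit 1
#   export CUDA_VISIBLE_DEVICES="$device"
```

A job that needs both GPUs would skip the trimming step and use the idlegpu output as-is.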