Running Your Program on Kodiak



The Batch System

Although you will typically edit and compile your programs on the Kodiak login node, you actually run the programs on Kodiak's compute nodes. But there are many compute nodes, possibly running other programs already. So how do you choose which compute node or nodes to run your program on?

You don't, at least, not explicitly. Instead you use Kodiak's batch system, also sometimes called a job scheduler. You tell the batch system the resources, such as how many nodes and processors, that your program requires. The batch system knows what is currently running on all of the compute nodes and will assign unused nodes to your program if they are available and automatically run your program on them. If none are available, your program will wait in a queue until they are. When dealing with the batch system, you will occasionally see the abbreviation "PBS" which stands for "Portable Batch System".

Note: Just to reiterate - Do not run resource intensive programs on the login node. Use the batch system as described below to run your program on one or more compute nodes.

Submitting a Job

Below is a simple C program, "howdy.c". When we run it on the login node, we can see that the hostname is "login001" as expected. Again, you will not run your program on the login node. This is just an example.

$ pwd
/home/bobby/examples/howdy

$ ls
howdy.c  howdy.sh

$ cat howdy.c
#include <stdio.h>
#include <unistd.h>

int main()
{
   char hostname[80];

   gethostname(hostname, 80);
   printf("Node %s says, \"Howdy!\"\n", hostname);
}

$ gcc -o howdy howdy.c

$ ./howdy
Node login001 says, "Howdy!"

To run your program on Kodiak, you need to submit it to the batch system. This is done with the qsub command and is known as submitting a job. Don't try to submit the executable program itself. Instead, submit a shell script that runs your program. The shell script can be extremely basic, possibly a single line that runs the program. Or it can be complex, performing multiple tasks in addition to running your program.

$ pwd
/home/bobby/examples/howdy

$ ls
howdy  howdy.c  howdy.sh

$ cat howdy.sh
cd /home/bobby/examples/howdy
./howdy

$ qsub howdy.sh
8675309.batch

Notice that the qsub command above returned the text "8675309.batch". This is known as the job id. You will usually only care about the numeric part. At this point, your program is either running on one of the compute nodes or is in the queue waiting to run.

You can see in the howdy.sh shell script above that we included a cd command to change the working directory to the directory where the program is actually located. This is because a submitted job is essentially a new login session, just without a terminal. When it starts, its working directory is your $HOME directory, just as if you had logged in interactively. We changed to the program's directory and ran howdy with the relative path "./howdy", but we could have run it using its full path instead.
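
For example, howdy.sh could equivalently have been written as a single line that runs the program by its full path, with no cd at all:

/home/bobby/examples/howdy/howdy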

More detailed instructions on submitting jobs can be found below.

Job Output

The "howdy" program normally prints its output on the terminal, i.e., standard output. But when you submit your job it runs on a compute node that doesn't have a terminal. So where does the output go? By default, standard output is saved to a file named ".o". The default job_name is the name of the shell script submitted with qsub. So for our job above, the output file is "howdy.sh.o8675309". There is a similar file, "howdy.sh.e8675309" for standard error output as well. (Hopefully, the error file will be empty...) Note that these files will contain stdout and stderr only. If your program explicitly creates and writes to other data files, or if it uses ">" to redirect standard output to a file, that output will not appear in the job's output file.

$ ls
howdy  howdy.c howdy.sh  howdy.sh.e8675309  howdy.sh.o8675309

$ cat howdy.sh.o8675309
Node n005 says, "Howdy!"

Getting Job Info

When you submit a job on Kodiak, it is placed in a queue. If your job's requested resources (nodes and processors) are available, then the job runs on a compute node right away. Because Kodiak runs jobs for many users, it is possible that all of the processors on all of the nodes are currently in use by others. When that happens, your job waits in the queue until the requested resources are free. The queue is, for the most part, "first come, first served". If two jobs are waiting and both require the same resources, then the job that was submitted earlier will run first.

You can get a list of all of the jobs currently running or waiting to run with the qstat command.

$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
8675291.batch     prog_01a         betty             1414:57: R batch
8675293.batch     prog_01b         betty             1414:57: R batch
8675294.batch     prog_02a         betty             1228:40: R batch
8675295.batch     prog_02b         betty             1228:40: R batch
8675296.batch     prog_03a         betty                    0 Q batch
8675297.batch     prog_03b         betty                    0 Q batch
8675301.batch     test             bubba             00:12:13 R gpu
8675309.batch     howdy.sh         bobby                    0 R batch

You can see each job's job id, name, user, time used, job state, and queue. (There are actually multiple queues on Kodiak. The default queue is named "batch" and is the one you will usually use.) The "S" column shows the job's current state. An "R" means the job is running; "Q" means it is queued and waiting to run. Other state values that you might see are E (exiting), C (complete), and H (held).

You can display just your jobs by adding a -u option. You can also display the nodes that a job is using with the -n option. You can see below that your job is running on node n005.

$ qstat -n -u bobby

batch: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
8675309.batch   bobby    batch    howdy.sh   139656   1   1    --  1488: R 00:00
   n005/1

Stopping a Job

If, for some reason, you want to stop your job, you can do so with the qdel command. If your job is currently running, qdel will terminate it on the compute node(s); otherwise it will simply remove it from the queue. Specify the job's job id without the ".batch" suffix. You can only qdel your own jobs.

$ qdel 8675309

More On Submitting Jobs

Above, we submitted a minimal shell script with qsub. By default, qsub will allocate one processor on one node to the job. If you are submitting a parallel/MPI or multi-threaded program, you would need to specify more than that. Below is the parallel/MPI version of the "howdy" program, called "howdies", compiled with mpicc. In addition to printing out the hostname, it prints out the MPI "rank" and "size".

Note: The mpicc and mpiexec (see below) commands are enabled when you load an MPI environment module. The examples in this document use OpenMPI, but there are several different implementations of MPI available.

$ cat howdies.c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int rank, size;
   char hostname[80];

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   gethostname(hostname, 80);
   printf("Howdy! This is rank %d of %d running on %s.\n", rank, size, hostname);

   MPI_Finalize();

   return 0;
}

$ module load openmpi-gcc/2.0.1

$ mpicc -o howdies howdies.c

To run an MPI program, you don't just run the program itself. Instead you use the mpiexec command with your program as one of its arguments and it will launch multiple instances of your program for you. You specify how many processes to launch with the -np option and on which nodes to launch them with the -hostfile option. The "host file" is a simple text file containing a list of nodes to run on.

$ cat nodes.txt
n001
n001
n001
n001
n001
n001
n001
n001
n002
n002
n002
n002
n002
n002
n002
n002

$ mpiexec -np 16 -hostfile nodes.txt ./howdies

So we need to put the above "mpiexec" line in a shell script and submit it with qsub. But what about the host file? How do we know which nodes to place in it? We don't. Recall that the batch system allocates the nodes for our job based on availability. It will also create the host file for us and put the path to that host file in an environment variable, $PBS_NODEFILE, that our submitted script can use. Below is a quick test script that just displays $PBS_NODEFILE and its contents. Note that the created host file and the $PBS_NODEFILE environment variable only exist while a submitted job is executing on a compute node, so we'll need to qsub the script and look at the results in the job's output file.

$ cat test.sh
echo "Host file: ${PBS_NODEFILE}
echo "Contents:"
cat $PBS_NODEFILE

$ qsub test.sh
8675309.batch

$ cat test.sh.o8675309
Host file: /var/spool/pbs/aux/8675309.batch
Contents:
n005
n005
n005
n005
n005
n005
n005
n005
n012
n012
n012
n012
n012
n012
n012
n012

We see that there are 16 lines in the host file, eight n005s and eight n012s. So when we call mpiexec with the -np 16 option, eight processes will launch on n005 and eight will launch on n012. Recall that earlier we stated that the default behavior of the batch system is to allocate one processor on one node. So how did the batch system know to put 16 entries in the file instead of just 1? Because this document cheated. The qsub command above would not actually create the host file that was displayed. We also need to tell qsub how many nodes we want as well as the number of processors on each node. We do this with the -l nodes=N:ppn=P option. (That's a lowercase L, not a one, and it stands for "resource_list".)

With the -l option, you are actually requesting the number of processors per node. MPI programs typically run one process per processor so it is often convenient to think of "ppn" as "processes per node" instead of "processors per node". But that's not entirely accurate, and not always the case, even with MPI programs.

The -l option actually used to create the host file above was:

qsub -l nodes=2:ppn=8

which is 2 nodes x 8 processors per node = 16 total processors.

Recall that earlier in this document, our submitted shell script had a cd command to change the working directory from $HOME to the directory where we ran qsub. The batch system sets an environment variable, $PBS_O_WORKDIR, to the directory from which the job was submitted, so we just need to cd $PBS_O_WORKDIR in our script and we won't need to hard-code full paths to our program or data files.
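
For example, a submitted script can start with something like the following (a minimal sketch; the howdies.sh script below uses the same pattern):

#!/bin/bash

# $PBS_O_WORKDIR is set by the batch system to the directory where qsub was run.
cd $PBS_O_WORKDIR
./howdy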

So now back to our "howdies" program...

In addition to specifying (with module load) which version of MPI to use when compiling our program, we also need to specify the same version of MPI when we run it. This allows the system to find the correct MPI libraries and mpiexec command. To do this, add the appropriate module load command at the top of the script. It is probably a good idea to module purge any other loaded modules first, just to make sure there are no conflicts with modules that may have been loaded in your .bashrc or .bash_profile login scripts.

For this example, we want to run it with 8 processes, but only 4 per node. We will need to change the "mpiexec -np" option to run 8 MPI processes instead of 16.

$ pwd
/home/bobby/examples/howdies

$ ls
howdies  howdies.c  howdies.sh

$ cat howdies.sh
#!/bin/bash

module purge
module load openmpi-gcc/2.0.1

cd $PBS_O_WORKDIR
echo "Job working directory:" $PBS_O_WORKDIR
echo

mpiexec -np 8 -hostfile $PBS_NODEFILE ./howdies

$ qsub -l nodes=2:ppn=4 howdies.sh
8675309.batch

$ ls
howdies  howdies.c  howdies.sh   howdies.sh.e8675309  howdies.sh.o8675309

$ cat howdies.sh.o8675309
Job working directory: /home/bobby/examples/howdies

Howdy! This is rank 5 of 8 running on n027
Howdy! This is rank 1 of 8 running on n009
Howdy! This is rank 2 of 8 running on n009
Howdy! This is rank 3 of 8 running on n009
Howdy! This is rank 0 of 8 running on n009
Howdy! This is rank 7 of 8 running on n027
Howdy! This is rank 6 of 8 running on n027
Howdy! This is rank 4 of 8 running on n027

Note: You can see in the output file above that the printed statements are in no particular order.

Now let's run it with only 4 processes (i.e., qsub -l nodes=2:ppn=2). Again, we need to remember to modify the script and change the mpiexec -np option. It would be useful if we could somehow calculate the total number of processes from the value(s) in qsub's -l option. Remember that the host file ($PBS_NODEFILE) lists the nodes, one line per processor, so the total number of processes is just the number of lines in that file. We can use the following:

cat $PBS_NODEFILE | wc -l

to return the number of lines. By using "command substitution", i.e., placing that code within backticks (the ` character) or within "$( xxx )", we can assign its result to a variable in our shell script and use that variable with mpiexec.
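
For example, using the "$( )" form (a sketch equivalent to the backtick version used in the script below):

num=$(cat $PBS_NODEFILE | wc -l)
mpiexec -np $num -hostfile $PBS_NODEFILE ./howdies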

Note: In addition to being a useful shortcut, calculating the number of processes dynamically has another benefit. It helps you avoid "oversubscription", where your job uses more resources than are allocated to it. If you request 8 processors but forget to update the script, which is still trying to run on 32 processors, your job may run much slower. More importantly, it may cause another user's job to run much slower.

Another useful thing is to run the uniq or sort -u command on $PBS_NODEFILE. This prints each node name only once, essentially giving you a list of the nodes your job is running on. Normally, it shouldn't matter which node(s) your job runs on. But occasionally something goes wrong on the system, and knowing which compute node(s) the job was running on makes tracking down issues much easier.

$ cat howdies.sh
#!/bin/bash

module purge
module load openmpi-gcc/2.0.1

cd $PBS_O_WORKDIR
echo "Job working directory: $PBS_O_WORKDIR"
echo

num=`cat $PBS_NODEFILE | wc -l`
echo "Total processes: $num"
echo "Node(s):"
uniq $PBS_NODEFILE
echo

mpiexec -n $num -machinefile $PBS_NODEFILE ./howdies

$ qsub -l nodes=2:ppn=2 howdies.sh
8675309.batch

$ cat howdies.sh.o8675309
Job working directory: /home/bobby/examples/howdies

Total processes: 4
Node(s):
n006
n013

Howdy! This is rank 0 of 4 running on n006.
Howdy! This is rank 1 of 4 running on n006.
Howdy! This is rank 2 of 4 running on n013.
Howdy! This is rank 3 of 4 running on n013.

Below are descriptions of several other options that can be included with the qsub command. A convenient feature of the batch system is that you can include qsub options inside your script rather than having to type them at the command line every time. These are called "PBS directives". To do this, add lines of the form "#PBS <option>" at the top of your script, after any "#!" shebang line but before any other commands. For example, if your script will always run with "-l nodes=1:ppn=16" then add:

#PBS -l nodes=1:ppn=16

at the top of your shell script. Then just submit it with qsub without the "-l" command line option. You can use multiple PBS directives in your script. From now on, this document will often include PBS directives in the sample scripts instead of qsub command line options.

You can have the batch system send you an email when the job begins, ends and/or aborts with the -m and -M options:

-m bea -M Bobby_Baylor@baylor.edu

With the qsub -N option, you can specify a job name other than the name of the submitted shell script. The job name is shown in the output listing of the qstat command and is also used for the first part of the output and error file names.

If you are submitting jobs often, you may find yourself overwhelmed by the number of output files (jobname.o#####) and error files (jobname.e#####) in your directory. Some useful qsub options are -o and -e which allow you to specify the job's output and error file names explicitly rather than using the job name and job id. Subsequent job submissions will replace the contents of these files.

$ qsub -l nodes=1:ppn=2 -o howdies.out -e howdies.err howdies.sh
1748946.n131.localdomain

$ ls
howdies  howdies.c  howdies.err  howdies.out  howdies.sh
howdy    howdy.c    howdy.sh     test.sh

$ cat howdies.out
Job working directory: /home/bobby/howdy

Total processes: 2
Node(s):
n119

Howdy! This is rank 0 of 2 running on n119
Howdy! This is rank 1 of 2 running on n119

If your qsub options don't change between job submissions, you don't have to type them every time you run qsub. Instead, you can add PBS directives to your shell script. These are special comment lines that appear immediately after the "#!" shebang line; each directive line starts with "#PBS" followed by a qsub option and any arguments.

$ cat howdies.sh
#!/bin/sh
#PBS -l nodes=1:ppn=4
#PBS -o howdies.out
#PBS -e howdies.err
#PBS -N howdies
#PBS -m be -M Bobby_Baylor@baylor.edu

module purge
module load mvapich2/1.9-gcc-4.9.2

echo "------------------"
echo
echo "Job working directory: $PBS_O_WORKDIR"
echo

num=`cat $PBS_NODEFILE | wc -l`
echo "Total processes: $num"
echo

echo "Job starting at `date`"
echo

cd $PBS_O_WORKDIR
mpiexec -n $num -machinefile $PBS_NODEFILE ./howdies

echo
echo "Job finished at `date`"

$ qsub howdies.sh
1748952.n131.localdomain

$ cat howdies.out
Job working directory: /home/bobby/howdy

Total processes: 2

Howdy! This is rank 0 of 2 running on n119
Howdy! This is rank 1 of 2 running on n119
------------------

Job working directory: /home/bobby/howdy

Total processes: 4
Nodes:

Job starting at Fri Dec  6 12:23:49 CST 2013

Howdy! This is rank 0 of 4 running on n075
Howdy! This is rank 1 of 4 running on n075
Howdy! This is rank 2 of 4 running on n075
Howdy! This is rank 3 of 4 running on n075

Job finished at Fri Dec  6 12:23:51 CST 2013

The batch system's queue is more or less FIFO, "first in, first out", so it's possible that your job may be waiting behind another user's queued job. If that other job requests multiple nodes and processors, some nodes will sit idle while the batch system waits for enough additional nodes to free up so that the big job can run. The batch system tries to be clever: if it knows that your job can run and finish before the big job would be able to start, it will allocate an idle node to your job and let it run rather than force it to wait. The default time limit for jobs is 5000 hours, which the batch system considers "infinite". You can specify a much shorter time for your job by adding a "walltime=hh:mm:ss" limit to the qsub command's -l option.

qsub -l nodes=1:ppn=1,walltime=00:30:00 howdy.sh

The job may not run immediately, but it should run before the other user's "big" job does. Be aware that the specified walltime is a hard limit: if your job actually runs longer than its walltime, it will be terminated.
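
If you always want the limit, it can also be set as a PBS directive in the script, for example:

#PBS -l nodes=1:ppn=1,walltime=00:30:00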

In the "Getting Job Info" section above, we saw that you can see the nodes that a job is running on by adding a -n option to qstat. If you added the uniq $PBS_NODEFILE command to your shells script, the nodes will be listed at the top of the job's output file. If necessary, you could log in to those nodes with ssh, and run top or ps to see the status of the processes running on the node. You could use ssh to run ps on the node without actually logging in.

[bobby@login001 howdy]$ qsub howdies.sh
1754935.n131.localdomain

[bobby@login001 howdy]$ qstat -u bobby -n

n131.localdomain: 
                                                                         Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
1754935.n131.loc     bobby       batch    howdies           23006     1      1    --  5000: R   -- 
   n010/0

[bobby@n130 howdy]$ ssh n010

[bobby@n010 ~]$ ps u -u bobby
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bobby    23006  0.0  0.0  65924  1336 ?        Ss   10:52   0:00 -bash
bobby    23065  0.0  0.0  64348  1152 ?        S    10:52   0:00 /bin/bash /var/spool/torque/...
bobby    23072  0.2  0.0  82484  3368 ?        S    10:52   0:00 mpiexec -n 1 -machinefile /var/spool/torque/aux//1754935.n131.local
bobby    23073  0.3  0.0 153784  7260 ?        SLl  10:52   0:00 ./howdies
bobby    23078  0.0  0.0  86896  1708 ?        S    10:52   0:00 sshd: bobby@pts/1
bobby    23079  0.2  0.0  66196  1612 pts/1    Ss   10:52   0:00 -bash
bobby    23139  0.0  0.0  65592   976 pts/1    R+   10:52   0:00 ps u -u bobby

[bobby@n010 ~]$ exit
[bobby@n130 howdy]$

[bobby@n130 howdy]$ ssh n010 ps u -u bobby
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bobby    23006  0.0  0.0  65924  1336 ?        Ss   10:52   0:00 -bash
bobby    23065  0.0  0.0  64348  1152 ?        S    10:52   0:00 /bin/bash /var/spool/torque/...
bobby    23072  0.0  0.0  82484  3368 ?        S    10:52   0:00 mpiexec -n 1 -machinefile /var/spool/torque/aux//1754935.n131.localdomain ./howdies
bobby    23073  0.1  0.0 153784  7260 ?        SLl  10:52   0:00 ./howdies
bobby    23146  0.0  0.0  86896  1672 ?        S    10:53   0:00 sshd: bobby@notty
bobby    23147  2.0  0.0  65592   972 ?        Rs   10:53   0:00 ps u -u bobby

This was easy for a trivial, one-process job, but if your job is running on several nodes, it can be a hassle. A useful trick is to add the following function to your .bashrc startup script:

function myjobs()
{
   if [ -f "$1" ]
   then
      for i in `cat "$1"`
      do
         echo "-----"
         echo "Node $i:"
         echo
         ssh $i ps u -u $LOGNAME
      done
   fi
}

You could also turn the code above into a standalone shell script if you wanted. Now, within the shell script that you submit with qsub, add the following code:

# Parse the job id (e.g., 12345.n131.localdomain) to get just the number part

jobid_num=${PBS_JOBID%%.*}
nodelist=$PBS_O_WORKDIR/nodes.$jobid_num
sort -u $PBS_NODEFILE > $nodelist

What the above code does is parse the job id (environment variable $PBS_JOBID) to strip off the ".n131.localdomain" part and then create a file, "nodes.12345", that contains the list of nodes that the job is running on. Once the job has completed, the nodes.12345 file is no longer useful, so you could rm $nodelist at the bottom of the shell script if you wanted.
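
For example, the end of the submit script might look like the following sketch (the clean-up is optional, and since the rm only runs when the job finishes, the node list is still available while the job is running, as in the example below):

mpiexec -n `cat $PBS_NODEFILE | wc -l` -machinefile $PBS_NODEFILE ./howdies

# Optional clean-up: remove the node list file once the job has finished.
rm -f $nodelist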

$ qsub howdies.sh
1754949.n131.localdomain

$ ls nodes*
nodes.1754949

$ myjobs nodes.1754949
-----
Node n022:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bobby    26684  0.0  0.0  73080  3108 ?        Ss   14:01   0:00 orted -mca ess tm...
bobby    26685  0.2  0.0 221420  7992 ?        SLl  14:01   0:00 ./howdies
bobby    26686  0.2  0.0 221420  7980 ?        SLl  14:01   0:00 ./howdies
bobby    26687  0.2  0.0 221420  7980 ?        SLl  14:01   0:00 ./howdies
bobby    26688  0.1  0.0 221420  7988 ?        SLl  14:01   0:00 ./howdies
bobby    26702  0.0  0.0  86004  1668 ?        S    14:02   0:00 sshd: bobby@notty
bobby    26703  0.0  0.0  65584   976 ?        Rs   14:02   0:00 ps u -u bobby
-----
Node n023:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bobby    22127  0.1  0.0  65908  1332 ?        Ss   14:01   0:00 -bash
bobby    22128  0.0  0.0  13284   804 ?        S    14:01   0:00 pbs_demux
bobby    22183  0.0  0.0  64332  1160 ?        S    14:01   0:00 /bin/bash /var/spool/...
bobby    22190  0.1  0.0  83004  3956 ?        S    14:01   0:00 mpiexec -n 8 -machinefile /var/spool/torque/aux//1754949.n131.localdomain ./howdies
bobby    22191  0.2  0.0 221420  7988 ?        SLl  14:01   0:00 ./howdies
bobby    22192  0.2  0.0 221420  7976 ?        SLl  14:01   0:00 ./howdies
bobby    22193  0.1  0.0 221420  7984 ?        SLl  14:01   0:00 ./howdies
bobby    22194  0.1  0.0 221420  7980 ?        SLl  14:01   0:00 ./howdies
bobby    22205  0.0  0.0  86004  1668 ?        S    14:02   0:00 sshd: bobby@notty
bobby    22206  2.0  0.0  65576   976 ?        Rs   14:02   0:00 ps u -u bobby

Special Cases


Exclusive Access

Because Kodiak is a multi-user system, it is possible that your job will share a compute node with another user's job if your job doesn't explicitly request all of the processors on the node. But what if your program is very memory intensive and you need to run fewer processes on each node? You can specify -l nodes=4:ppn=2 and run just 2 processes each on 4 different nodes (8 processes total). Unfortunately, although your job will only run 2 processes on each node, the batch system sees that there are still 6 unused processors on each node and can assign them to other, possibly memory intensive, jobs. This defeats the purpose of requesting fewer processors per node.

Instead, to get exclusive access to a node, you need to force the batch system to allocate all 8 processors on each node by specifying ppn=8. If we specify -l nodes=2:ppn=8, we get a $PBS_NODEFILE that looks like the following:

n001
n001
n001
n001
n001
n001
n001
n001
n002
n002
n002
n002
n002
n002
n002
n002

When we call mpiexec -np 16 -machinefile $PBS_NODEFILE ..., the first process (rank 0) will run on the first host listed in the file (n001), the second process (rank 1) will run on the second host listed (also n001), and so on. The ninth process (rank 8) will run on the ninth host listed (now n002), etc. If there are fewer lines in the machine file than the number of processes specified with -np, mpiexec will wrap around and start back at the beginning of the list. If the machine file looked like the following:

n001
n002

The first process (rank 0) will run on host n001, the second (rank 1) will run on host n002, the third (rank 2) will run on host n001, the fourth (rank 3) will run on host n002, and so on. So all we have to do is call

mpiexec -np 8 -machinefile $PBS_NEW_NODEFILE ./howdies

and 4 processes will run on n001 and 4 will run on n002. But because the batch system has reserved all 8 processors on n001 and n002 for our job, no other jobs will be running on the nodes. Our job will have exclusive access, which is what we want. So how do we turn the original $PBS_NODEFILE, created by the batch system, into our trimmed down version? One simple way would be to use the sort -u command to sort the file "uniquely", thus keeping one entry for each host.

PBS_NEW_NODEFILE=$PBS_O_WORKDIR/trimmed_machinefile.dat
sort -u $PBS_NODEFILE > $PBS_NEW_NODEFILE
mpiexec -np 8 -machinefile $PBS_NEW_NODEFILE ./howdies

Because modifying the $PBS_NODEFILE itself will cause problems with the batch system, you should always create a new machine file and use it instead.
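
Putting it all together, a sketch of a submit script for this exclusive-access approach might look like the following (the module name and process counts are just the values used in this document's examples):

#!/bin/bash
#PBS -l nodes=2:ppn=8

module purge
module load openmpi-gcc/2.0.1

cd $PBS_O_WORKDIR

# Keep one line per node so the trimmed machine file lists each host once.
PBS_NEW_NODEFILE=$PBS_O_WORKDIR/trimmed_machinefile.dat
sort -u $PBS_NODEFILE > $PBS_NEW_NODEFILE

# 8 processes over 2 unique hosts: mpiexec wraps around the list, so 4
# processes run on each node while all 16 allocated processors stay reserved.
mpiexec -np 8 -machinefile $PBS_NEW_NODEFILE ./howdies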

Multithreaded Programs

If your program is multithreaded, either explicitly by using OpenMP or POSIX threads, or implicitly by using a threaded library such as the Intel Math Kernel Library (MKL), be careful not to have too many threads executing concurrently. Because each thread runs on a separate processor, you need to tell the batch system how many processors to allocate to your process: request several processors per node (ppn) but launch fewer processes, often just one. This is similar to the method described in the "Exclusive Access" section above.

By default, OpenMP and MKL will use all of the processors on a node (currently 8). If, for some reason, you want to decrease the number of threads, you can do so by setting environment variables within the shell script that you submit via qsub. The threaded MKL library uses OpenMP internally for threading, so you can probably get by with modifying just the OpenMP environment variable.

OMP_NUM_THREADS=4
export OMP_NUM_THREADS

MKL_NUM_THREADS=4
export MKL_NUM_THREADS

You can also set the number of threads at runtime within your code.

// C/C++

#include "mkl_service.h"

mkl_set_num_threads(4);


! Fortran

use mkl_service

call mkl_set_num_threads(4)

Below is a trivial, multithreaded (OpenMP) program, "thready". All it does is display the number of threads that will be used and initialize the values of an array. Then, in the parallel region, each thread modifies a specific value of the array and sleeps for two minutes. After exiting the parallel region, we display the new (hopefully correct) values of the array. The program is compiled with the Intel C compiler icc using the -openmp option.

$ cat thready.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define MAX 8

int main(int argc, char **argv)
{
   int i, arr[MAX];
   int num;

   num = omp_get_max_threads();
   printf("Howdy! We're about to split into %d threads...\n", num);

   for (i=0; i<MAX; i++) arr[i] = -1;

   printf("Before: \n");
   for (i=0; i<MAX; i++) printf(" arr[%d] = %d\n", i, arr[i]);

   #pragma omp parallel
   {
      int id = omp_get_thread_num();

      arr[id] = id;   /* each thread modifies its own element of the array */
      sleep(120);     /* sleep for two minutes so we can inspect the threads */
   }

   printf("After: \n");
   for (i=0; i<MAX; i++) printf(" arr[%d] = %d\n", i, arr[i]);

   return 0;
}

$ icc -openmp -o thready thready.c

$ ./thready
Howdy! We're about to split into 4 threads...
Before: 
 arr[0] = -1
 arr[1] = -1
 arr[2] = -1
 arr[3] = -1
 arr[4] = -1
 arr[5] = -1
 arr[6] = -1
 arr[7] = -1
After: 
 arr[0] = 0
 arr[1] = 1
 arr[2] = 2
 arr[3] = 3
 arr[4] = -1
 arr[5] = -1
 arr[6] = -1
 arr[7] = -1

Next we need to submit the thready program. For this example, we want it to use 4 threads so we call qsub with a -l nodes=1:ppn=4 option to reserve 4 processors for our threads.

$ cat thready.sh
#!/bin/bash

cd $PBS_O_WORKDIR

echo "Node(s):"
sort -u $PBS_NODEFILE
echo

export OMP_NUM_THREADS=4

echo "OMP_NUM_THREADS: $OMP_NUM_THREADS"
echo

echo "Job starting at `date`"
echo

./thready

echo
echo "Job finished at `date`"

$ qsub -l nodes=1:ppn=4 thready.sh
1755333.n131.localdomain

$ tail thready.sh.o1755333
Node(s):
n005

OMP_NUM_THREADS: 4

Job starting at Wed Dec 18 11:35:32 CST 2013

Now let's make sure that the process and threads running on the compute node are what we expect. We can see that the job is running on node n005. If we run the ps -f -C thready command on node n005, we can see that only one instance of the thready program is running. The only useful information is that its PID (process ID) is 27934. But if we add the -L option to the ps command, we can also get information about the threads.

$ ssh n005 ps -f -C thready
UID        PID  PPID  C STIME TTY          TIME CMD
bobby    27934 27931  0 10:22 ?        00:00:00 ./thready

$ ssh n005 ps -f -L -C thready
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
bobby    27934 27931 27934  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27935  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27936  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27937  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27938  0    5 10:22 ?        00:00:00 ./thready

The LWP column above is the "lightweight process" ID (i.e., thread ID), and we can see that the thready process (PID 27934) has a total of 5 threads. The NLWP (number of lightweight processes) column confirms it. But why are there 5 threads instead of the 4 that were specified? Notice that the first thread has the same thread ID as the process ID. That is the master or main thread, which is created when the program first starts and exists until the program exits. When the program reaches the parallel region, the other 4 threads are created and do their work while the main thread waits. At that point, there are 5 threads total. When the program finishes with the parallel region, the 4 threads are terminated and the master thread continues.

Interactive Sessions

Occasionally, you may need to log in to a compute node. Perhaps you want to run top or ps to check the status of a job submitted via qsub. In that case, you can simply ssh to the compute node. But there may be times where you need to do interactive work that is more compute intensive. Although you could just ssh to some arbitrary compute node and start working, this is ill-advised because there may be other jobs already running on that node. Not only would those jobs affect your work, your work would affect those jobs.

Instead, you should use the interactive feature of qsub by including the -I option. This will use the batch system to allocate a node (or nodes) and a processor (or processors) just like you would for a regular, non-interactive job. The only difference is that once the requested resources are available, you will be automatically logged into the compute node and get a command prompt. The "batch" job will exist until you exit or log out from the interactive session.

[bobby@n130 ~]$ qsub -I
qsub: waiting for job 1832724.n131.localdomain to start
qsub: job 1832724.n131.localdomain ready

[bobby@n066 ~]$ top

top - 15:47:36 up 102 days, 23:20,  0 users,  load average: 7.50, 7.19, 7.57
Tasks: 289 total,   9 running, 280 sleeping,   0 stopped,   0 zombie
Cpu(s): 55.7%us,  1.0%sy,  0.0%ni, 42.8%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16438792k total,  2395964k used, 14042828k free,   162524k buffers
Swap: 18490804k total,   438080k used, 18052724k free,   454596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 6715 betty     25   0 1196m 134m 1168 R 99.0  0.8   9:44.35 prog_01a                                                               
 6801 betty     25   0 1196m 134m 1168 R 99.0  0.8   7:56.28 prog_01a                                                               
 6832 betty     25   0 1196m 134m 1168 R 99.0  0.8   4:55.46 prog_01a                                                               
 6870 betty     25   0 1196m 134m 1164 R 99.0  0.8   1:25.80 prog_01a                                                               
 6919 betty     25   0 1196m 134m 1164 R 99.0  0.8   1:04.84 prog_01a                                                               
 6923 betty     25   0 1196m 134m 1164 R 99.0  0.8   0:55.40 prog_01a                                                               
 7014 betty     25   0 1196m 134m 1164 R 99.0  0.8   0:46.67 prog_01a                                                               
 7075 bobby     15   0 12864 1120  712 R  1.9  0.0   0:00.01 top                                                                    
    1 root      15   0 10344   76   48 S  0.0  0.0   9:44.65 init                                                                   
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.82 migration/0                                                            
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.13 ksoftirqd/0                                                            
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                                             
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.59 migration/1            

[bobby@n066 ~]$ cat $PBS_NODEFILE
n066

[bobby@n066 ~]$ exit
logout

qsub: job 1832724.n131.localdomain completed

[bobby@n130 ~]$ 

If there aren't enough available nodes, your interactive job will wait in the queue like any other job. If you get tired of waiting, press ^C.

$ qsub -I -l nodes=20:ppn=8
qsub: waiting for job 1832986.n131.localdomain to start^C
Do you wish to terminate the job and exit (y|[n])? y
Job 1832986.n131.localdomain is being deleted

If you are logged into Kodiak with X11 forwarding enabled (i.e., you logged in with ssh -X bobby@kodiak.baylor.edu from a Mac or Linux system, or enabled the SSH X11 forwarding option in PuTTY for Windows), you can run X11 applications while logged into a compute node interactively. Because X11 from a compute node (or Kodiak's login node) can be slow, you typically won't want to do this. But there may be times when it is required, for example, when you need to run MATLAB interactively or debug a program with a graphical debugger. This assumes you are running an X server (XQuartz on a Mac, or Cygwin/X or Xming on Windows) on your desktop system.

First, on the Kodiak login node, make sure that X11 is, in fact, working. The xlogo and xmessage commands are a simple way to test this.

$ echo $DISPLAY
localhost:10.0

$ xlogo &
[1] 9927

$ echo 'Howdy from node' `hostname` | xmessage -file - &
[2] 9931

If you are sure that X11 works from the login node, start an interactive session, as above, but add a -X to tell qsub that you wish to forward X11 from the compute node.

[bobby@n130 howdy]$ qsub -I -X
qsub: waiting for job 2444343.n131.localdomain to start
qsub: job 2444343.n131.localdomain ready

[bobby@n083 ~]$ echo 'Howdy from node' `hostname` | xmessage -file -

[bobby@n083 ~]$ exit

qsub: job 2444343.n131.localdomain completed

[bobby@n130 howdy]$

Multiple Job Submissions

There may be times where you want to run, or at least submit, multiple jobs at the same time. For example, you may want to run your program with varying input parameters. You could submit each job individually, but if there are many jobs, this might be problematic.

Job Arrays

If each run of your program has the same resource requirements, that is, the same number of nodes and processors, you can run multiple jobs as a job array. Below is a trivial program that simply outputs a word passed to it at the command line.

$ cat howdy2.c
#include <stdio.h>

int main(int argc, char **argv)
{
   if (argc == 2)
   {
      printf("Howdy! You said, \"%s\".\n", argv[1]);
   }
   else
   {
      printf("Howdy!\n");
   }
}

What we want to do is run the program three times, each time with a different word. To run this as a job array, add the -t option along with a range of numbers to the qsub command, for example, qsub -t 0-2 array_howdy.sh. This will run the array_howdy.sh script three times, each time with an extra environment variable, $PBS_ARRAYID, set to a unique value from the range. You can test the value of $PBS_ARRAYID and run your program with different arguments.

$ cat array_howdy.sh
#!/bin/sh
#PBS -o array.out
#PBS -e array.err

cd $PBS_O_WORKDIR

if [ $PBS_ARRAYID == 0 ]
then
   ./howdy2 apple
elif [ $PBS_ARRAYID == 1 ]
then
   ./howdy2 banana
elif [ $PBS_ARRAYID == 2 ]
then
   ./howdy2 carrot
fi

For a more complex real-world program, you might have multiple input files named howdy_0.dat through howdy_N.dat. Instead of an unwieldy if-then-elif-elif-elif-etc construct, you could use $PBS_ARRAYID to specify an individual input file and then use a single command to launch your program.

$ ls input
howdy_0.dat  howdy_1.dat  howdy_2.dat

$ cat input/howdy_0.dat
apple

$ cat input/howdy_1.dat
banana

$ cat input/howdy_2.dat
carrot

$ cat array_howdy.sh
#!/bin/sh
#PBS -o array.out
#PBS -e array.err

cd $PBS_O_WORKDIR

DATA=`cat ./input/howdy_$PBS_ARRAYID.dat`

./howdy2 $DATA

When you submit your job as a job array, you will see a slightly different job id, one with "[]" appended to it. This job id represents all of the jobs in the job array. For individual jobs, specify the array id within the brackets (e.g., 123456[0]). Normally, qstat will display one entry for the job array. To see all of them, add a -t option.

The job array's output and error files will have the individual array id numbers appended to their names, as shown below.

$ qsub -t 0-2 array_howdy.sh
2462930[].n131.localdomain

$ qstat -u bobby

n131.localdomain: 

Job ID               Username    Queue    Jobname          
-------------------- ----------- -------- ---------------- ...
2462930[].n131.l     bobby       batch    array_howdy.sh   

$ qstat -t -u bobby

n131.localdomain: 

Job ID               Username    Queue    Jobname          
-------------------- ----------- -------- ---------------- ...
2462930[0].n131.     bobby       batch    array_howdy.sh-1 
2462930[1].n131.     bobby       batch    array_howdy.sh-2 
2462930[2].n131.     bobby       batch    array_howdy.sh-3 

$ ls array.out*
array.out-0  array.out-1  array.out-2

$ cat array.out-0
Howdy! You said, "apple".

$ cat array.out-1
Howdy! You said, "banana".

$ cat array.out-2
Howdy! You said, "carrot".

Sequential, Non-concurrent Jobs

When you submit a job array as above, all of the individual jobs will run as soon as possible. But there may be times when you don't want them all to run concurrently. For example, the output of one job might be needed as the input for the next. Or perhaps you want this program to run, but not at the expense of some other, higher priority program you also need to run. Whatever the reason, you can limit the number of jobs that run simultaneously by adding a slot limit to the -t option. For example, qsub -t 0-9%1 will create a job array with 10 jobs (0 through 9), but only 1 will run at a time. The others will wait in the queue until the one that is running finishes. When running qstat, you can see that the status of each non-running job is H (held) as opposed to Q (queued).

$ qsub -t 0-2%1 array_howdy.sh
2463526[].n131.localdomain

$ qstat -t -u bobby

n131.localdomain: 
                                                            
Job ID               Username    Queue    Jobname              S Time
-------------------- ----------- -------- ---------------- ... - -----
2463526[0].n131.     bobby       batch    array_howdy.sh-0     R   -- 
2463526[1].n131.     bobby       batch    array_howdy.sh-1     H   -- 
2463526[2].n131.     bobby       batch    array_howdy.sh-2     H   -- 

This works if you can run your jobs as a job array. But what if you can't? For example, you may want to submit multiple jobs, but if each requires a different number of nodes or processors, a job array won't work and you will need to qsub the jobs individually. You might be tempted to add a qsub command at the end of your submitted script. However, this will not work because the script runs on the compute nodes, and the qsub command only works on Kodiak's login node. Instead, to have a job wait for another job to finish before starting, add a -W ("additional attributes") option to the qsub command to specify a job dependency for the new job. The general format for the -W option is:

qsub -W attr_name=attr_list ...

There are several possible attribute names that can be specified. In this case, the attribute name we want is "depend", and its value consists of the type of dependency ("after" or "afterok") followed by the id of the job we are waiting on. The difference between after and afterok is that with the former, the new job can start when the first one finishes, no matter what the reason; with the latter, the new job will start only if the first one finished successfully and returned a 0 exit status. So the full option will be:

qsub -W depend=after:job_id ...

For the job id, you can use either the full job id (e.g., 123456.n131.localdomain) or just the number part.

$ cat howdies.sh
#!/bin/bash
#PBS -o howdies.out
#PBS -e howdies.err
#PBS -N howdies

module purge
module load mvapich2/1.9-gcc-4.9.2

cd $PBS_O_WORKDIR

echo "------------------"
echo
echo "Job id: $PBS_JOBID"
echo

num=`cat $PBS_NODEFILE | wc -l`
echo "Total processes: $num"
echo

echo "Job starting at `date`"
echo

mpiexec -n $num -machinefile $PBS_NODEFILE ./howdies

echo
echo "Job finished at `date`"
echo

$ qsub -l nodes=1:ppn=8 howdies.sh
2463553.n131.localdomain

$ qsub -l nodes=1:ppn=4 -W depend=afterok:2463553.n131.localdomain howdies.sh
2463555.n131.localdomain

$ qsub -l nodes=1:ppn=2 -W depend=afterok:2463555.n131.localdomain howdies.sh
2463556.n131.localdomain

$ qsub -l nodes=1:ppn=1 -W depend=afterok:2463556.n131.localdomain howdies.sh
2463557.n131.localdomain

$ qstat -u bobby

n131.localdomain: 
                                                              Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463553.n131.loc  bobby     batch    howdies    10460   1   8    --  5000: R 00:00
2463555.n131.loc  bobby     batch    howdies      --    1   4    --  5000: H   -- 
2463556.n131.loc  bobby     batch    howdies      --    1   2    --  5000: H   -- 
2463557.n131.loc  bobby     batch    howdies      --    1   1    --  5000: H   -- 

$ qstat -u bobby

n131.localdomain: 
                                                              Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463555.n131.loc  bobby     batch    howdies    15434   1   4    --  5000: R 00:00
2463556.n131.loc  bobby     batch    howdies      --    1   2    --  5000: H   -- 
2463557.n131.loc  bobby     batch    howdies      --    1   1    --  5000: H   -- 

$ qstat -u bobby

n131.localdomain: 
                                                              Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463556.n131.loc  bobby     batch    howdies    15525   1   2    --  5000: R 00:01 
2463557.n131.loc  bobby     batch    howdies      --    1   1    --  5000: H   -- 

$ qstat -u bobby

n131.localdomain: 
                                                              Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463557.n131.loc  bobby     batch    howdies    15613   1   1    --  5000: R 00:04

$ cat howdies.out
------------------

Job id: 2463553.n131.localdomain

Total processes: 8

Job starting at Thu Jan 30 13:48:41 CST 2014

Howdy! This is rank 7 of 8 running on n017
Howdy! This is rank 1 of 8 running on n017
Howdy! This is rank 5 of 8 running on n017
Howdy! This is rank 4 of 8 running on n017
Howdy! This is rank 6 of 8 running on n017
Howdy! This is rank 2 of 8 running on n017
Howdy! This is rank 3 of 8 running on n017
Howdy! This is rank 0 of 8 running on n017

Job finished at Thu Jan 30 13:53:43 CST 2014

------------------

Job id: 2463555.n131.localdomain

Total processes: 4

Job starting at Thu Jan 30 13:53:44 CST 2014

Howdy! This is rank 3 of 4 running on n082
Howdy! This is rank 0 of 4 running on n082
Howdy! This is rank 1 of 4 running on n082
Howdy! This is rank 2 of 4 running on n082

Job finished at Thu Jan 30 13:58:46 CST 2014

------------------

Job id: 2463556.n131.localdomain

Total processes: 2

Job starting at Thu Jan 30 13:58:47 CST 2014

Howdy! This is rank 1 of 2 running on n082
Howdy! This is rank 0 of 2 running on n082

Job finished at Thu Jan 30 14:03:49 CST 2014

------------------

Job id: 2463557.n131.localdomain

Total processes: 1

Job starting at Thu Jan 30 14:03:50 CST 2014

Howdy! This is rank 0 of 1 running on n082

Job finished at Thu Jan 30 14:08:52 CST 2014