ARCS System Support
What computer systems are available?
ARCS supports two computer systems for academic and research purposes: Kodiak, a 128-node (1024-core) Linux cluster; and Rush, a 4-processor (dual-core) Linux system. For details on these systems, please see the ARCS Systems page.
How do I get a computer account?
To get a computer account on one of the ARCS systems, contact Mike Hutcheson.
How do I connect and log in?
Use ssh to log in to the ARCS systems. Telnet is not supported.
Windows users should use PuTTY. Faculty and staff can download PuTTY from Baylor's AppCenter page. Students can get PuTTY from the official site. PuTTY is also installed on many of the public access PCs on campus.
Mac OS X users should connect with the ssh command in a Terminal window.
Linux users should connect with the ssh command in an xterm window or the console.
To connect to Kodiak with ssh, enter the following:
% ssh -l username kodiak.baylor.edu
or
% ssh username@kodiak.baylor.edu
If your username on the ARCS system is the same as on your local system, you can omit the "-l username".
Occasionally we update the operating system on the ARCS systems, so you may get the following message when you try to connect:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
To fix this, edit the file $HOME/.ssh/known_hosts and delete the line for the system and then reconnect. If you are concerned, you can contact us for assistance.
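If you would rather not edit the file by hand, the deletion can be scripted. The following is a minimal sketch, not an ARCS-provided tool: forget_host is a hypothetical helper name, and it assumes the entry is stored in plain text (not hashed) with the host name in the first field. It keeps a backup copy of the file before rewriting it.

```sh
#!/bin/sh
# forget_host HOST [FILE]: drop HOST's entry from a known_hosts file.
# Hypothetical helper. Assumes unhashed entries whose first field is a
# comma-separated list of host names. A backup is kept in FILE.bak.
forget_host() (
    host=$1
    file=${2:-$HOME/.ssh/known_hosts}
    cp "$file" "$file.bak" || exit 1
    # Keep every line whose host list does not contain HOST exactly.
    awk -v h="$host" '{
        n = split($1, names, ",")
        keep = 1
        for (i = 1; i <= n; i++) if (names[i] == h) keep = 0
        if (keep) print
    }' "$file.bak" > "$file"
)
```

For example, "forget_host kodiak.baylor.edu" followed by a fresh ssh connection. OpenSSH also ships "ssh-keygen -R hostname", which does the same job and handles hashed entries as well.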
The first time you access your account, you should change your password. To do so, type the command passwd and follow the prompts.
On Linux systems, "bash" is the default shell. If you require a different shell, please let the system administrator know.
How do I compile my program?
To compile serial (non-parallel) programs, use the Intel C/C++ compiler (icc), the GNU C/C++ compiler (gcc), or the Intel Fortran compiler (ifort).
To compile parallel MPI programs on Kodiak, use mpicc (for C/C++) and mpifort (for Fortran).
How do I run my program?
Although technically you can run your program from the command line, you should use the batch system and the qsub command. To do so, you will first need to create a shell script and place the command inside to run your binary. Below is a sample shell script called run_myprog.sh. This assumes that your program is serial, i.e., not parallel (MPI), is named "myprog" and is in a directory, "/home/username/myprog/".
#!/bin/sh
/home/username/myprog/myprog
To submit the job to the batch system, use the qsub command. Note that you cannot qsub your binary directly but instead must use a shell script.
% qsub -N myprog ./run_myprog.sh
1234.n130
The qsub command will return with the job identifier assigned to the job by the batch system, e.g., 1234 above. The "n130" (sometimes seen as "n130.localdomain") is the name of the batch server, not the queue, and can be omitted on most commands.
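If you submit jobs from a script, it can be handy to keep only the numeric part of the identifier. The sketch below uses plain shell parameter expansion to cut everything from the first dot onward; job_number is a hypothetical helper name, not a batch system command.

```sh
#!/bin/sh
# job_number ID: print the numeric part of a batch job identifier,
# e.g. "1234.n130" or "1234.n130.localdomain" both give "1234".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Possible use on the cluster (qsub is only available there):
#   jobid=$(job_number "$(qsub -N myprog ./run_myprog.sh)")
#   qstat "$jobid"
```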
When the job finishes, you should find myprog.e1234 and myprog.o1234 in the myprog directory. These correspond to your program's stderr and stdout. If you want to see what is currently running in batch, use the qstat command:
% qstat
Job id Name User Time Use S Queue
------------------- ---------------- ---------------- -------- - -----
1234.n130.loca myprog me 0 R batch
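The S column holds the job state ("R" in the listing above means the job is running). To pull the state of one job out of a listing like this, a small awk filter is enough. This is a sketch under stated assumptions: job_state is a hypothetical helper, and it assumes the column layout shown above, with no spaces inside the job name or user name.

```sh
#!/bin/sh
# job_state ID: read qstat output on stdin and print the S (state)
# column of the job whose identifier starts with "ID.".
job_state() {
    awk -v id="$1" 'index($1, id ".") == 1 { print $5; exit }'
}

# Possible use on the cluster:
#   qstat | job_state 1234
```

The filter prints nothing if the job is not in the listing.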
If your program is running (or waiting in the batch queue) and you want to stop it, use the qdel command:
% qdel 1234
Note for MATLAB users: The "matlab" command requires specific arguments in order to function correctly in batch. There's a sample shell script located in /usr/local/examples called run_matlab.sh that you may copy to your account and use to submit your MATLAB jobs to the batch system.
Here's an example of how to submit a MATLAB program to the batch system:
% qsub -N MyMatlabProgram ./run_matlab.sh
How do I run my parallel (MPI) program?
Note: In the past, users were instructed to use scasub to run their parallel programs. Use of scasub is now deprecated. Although scasub will continue to work for now, if you are using it, you should begin using the qsub command as described below.
To run your parallel program, you should submit it to the batch system with the qsub command. To do so, you will first need to create a shell script and use the mpirun command inside to run your binary. Also included in the shell script are batch system "directives" where you specify the number of processes you want to run, the name of the job, etc. Below is a sample shell script called run_myprog.sh. This assumes that your program is parallel, is named "myprog" and is in a directory, "/home/username/myprog/". A description of the shell script follows.
#!/bin/sh
#PBS -l nodes=2:ppn=8
#PBS -N test1
#PBS -o test1.out
#PBS -e test1.err
num=`cat $PBS_NODEFILE | wc -l`
mpirun -np $num -machinefile $PBS_NODEFILE /home/username/myprog/myprog
To submit the job to the batch system, use the qsub command.
% qsub ./run_myprog.sh
1234.n130
The batch system directives, i.e., the "#PBS" lines, appear at the top of the shell script. The most important directive is the "-l" (that's a lowercase L and not a 1) directive, which specifies the number of nodes and processes per node. (Actually, the -l directive can specify other resources such as particular hosts and maximum cpu time as well.) In this example, it requests 2 nodes with 8 processes each for a total of 16 processes. If your program is memory intensive, you might want to run fewer processes per node, so instead you could use "-l nodes=4:ppn=4" (4 nodes x 4 processes per node = 16 processes). If your program requires a number of processes that is not divisible by 8, you would add them to the directive with a "+", so if your job requires 20 processes, then the -l directive would be "-l nodes=2:ppn=8+1:ppn=4". That would be 2 nodes x 8 processes per node + 1 node x 4 processes per node, i.e., 16 + 4 = 20 processes.
The other batch directives in the example are -N, which specifies the name of the job, and -o and -e, which specify the output and error files respectively. These directives are optional.
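The arithmetic behind the "+" form can be written down as a small shell function. This is only an illustrative sketch: nodes_spec is a hypothetical helper name, and it assumes you want to fill as many nodes as possible with a fixed number of processes per node (8 by default) and put the remainder on one extra node.

```sh
#!/bin/sh
# nodes_spec TOTAL [PPN]: print a resource string for the -l directive
# that covers TOTAL processes, using full nodes of PPN processes each
# (default 8) plus, if needed, one partially used node.
nodes_spec() (
    total=$1
    ppn=${2:-8}
    full=$((total / ppn))   # completely filled nodes
    rest=$((total % ppn))   # processes left over for one partial node
    spec=""
    if [ "$full" -gt 0 ]; then
        spec="nodes=$full:ppn=$ppn"
    fi
    if [ "$rest" -gt 0 ]; then
        if [ -n "$spec" ]; then
            spec="$spec+1:ppn=$rest"
        else
            spec="nodes=1:ppn=$rest"
        fi
    fi
    printf '%s\n' "$spec"
)
```

With this definition, "nodes_spec 20" prints "nodes=2:ppn=8+1:ppn=4" and "nodes_spec 16 4" prints "nodes=4:ppn=4", matching the examples in the text.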
Note that all of the directives can also be specified at the command line as options to the qsub command instead of within the shell script.
The next line in the shell script dynamically calculates the total number of processes to be specified in the mpirun command. The way it works is that the batch system uses the -l directive to figure out how many (and which) nodes to use and creates the mpirun command's "machine file" ($PBS_NODEFILE), which lists the nodes to use, one line per process. Add "cat $PBS_NODEFILE" to your script if you want to see what it looks like. The "cat $PBS_NODEFILE | wc -l" simply counts the number of lines. Yes, since you already know the number of processes (because you specified them with the -l directive), you could just use that value in the mpirun command. However, when/if you change the directive, it's easy to forget to change the mpirun command as well. It is safer to simply calculate it dynamically.
The last line of the shell script is the call to mpirun itself. This command specifies the number of processes calculated above as well as the machine file. Note that you use the full path to your executable.
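To make the counting step concrete, here is a small sketch that can be tried on any file with the same shape as the machine file. It assumes the file names each node once per process slot, so a request of 2 nodes with 2 processes each would give 4 lines; count_procs and count_nodes are hypothetical helper names.

```sh
#!/bin/sh
# count_procs FILE: number of lines in a machine file, i.e. the value
# to pass to "mpirun -np" (assumes one line per process slot).
count_procs() {
    echo $(( $(wc -l < "$1") ))
}

# count_nodes FILE: number of distinct node names in the file.
count_nodes() {
    echo $(( $(sort -u "$1" | wc -l) ))
}

# Possible use inside a job script, where $PBS_NODEFILE is set:
#   num=$(count_procs "$PBS_NODEFILE")
```

Reading the file with "wc -l < file" avoids the extra cat process, and the arithmetic expansion strips any padding that wc adds on some systems.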
What are the runtime limits, quotas, etc.?
How do I move files between my computer and the ARCS systems?
Use sftp (Secure FTP). You can also use scp (secure copy) but sftp is recommended. "Classic" FTP is not enabled on the ARCS systems.
To transfer files between a Windows system and an ARCS system, you should use WinSCP. Faculty and staff can download WinSCP from Baylor's AppCenter page. Students can get WinSCP from the official site. WinSCP is also installed on many of the PCs in the Electronic Library Compute Facilities.
Mac OS X and Linux users should use the sftp (or scp) command from a Terminal window or console.
How can I keep my jobs from waiting in the queue?
Sometimes, when you submit a job it may not execute right away because the batch resource manager doesn't have enough nodes available with the number of processors you specified for each node.
The solution is to first find out how many nodes are available with the number of processors that you're asking for before you submit your job.
There is a utility called freenodes that will help you.
Suppose you have a job that requires 48 processors and you want to submit it to six nodes, eight processors per node. First use freenodes to find out if there are at least six nodes with eight processors available.
For example:
$ freenodes 8
2
In this example, freenodes is called with an argument of 8 processors. It returned a 2, which means that there are only two nodes available with eight processors. This means you can either submit the job as is and let it wait to run, or look for another combination of nodes and processors.
Are there twelve nodes with four processors available per node?
$ freenodes 4
2
No. How about 24 nodes with two processors available?
$ freenodes 2
36
Yes. There are 36 nodes with at least two processors available. Since the job only needs 24 nodes with two processors available, there are enough resources available to submit it. If you are using scasub to submit, you would specify your resources as follows:
$ scasub -np 48 -npn 2 ...
If you use the qsub command, use this:
$ qsub -l nodes=24:ppn=2 ...
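The trial-and-error search above can be automated. The following is a sketch only: pick_layout is a hypothetical helper, it assumes freenodes prints a single integer as in the examples above, and it tries only 8, 4 and 2 processors per node. The second argument lets you substitute another lookup command, which is useful for trying the function away from the cluster.

```sh
#!/bin/sh
# pick_layout TOTAL [LOOKUP]: print the first "nodes=N:ppn=P" layout
# for TOTAL processors that fits the currently free nodes.
# LOOKUP is the command that reports free nodes (default: freenodes).
pick_layout() (
    total=$1
    lookup=${2:-freenodes}
    for ppn in 8 4 2; do
        # Skip layouts that do not divide the job evenly.
        [ $((total % ppn)) -eq 0 ] || continue
        need=$((total / ppn))
        free=$("$lookup" "$ppn") || exit 1
        if [ "$free" -ge "$need" ]; then
            printf 'nodes=%s:ppn=%s\n' "$need" "$ppn"
            exit 0
        fi
    done
    exit 1   # nothing fits right now; the job would have to wait
)

# Possible use on the cluster:
#   qsub -l "$(pick_layout 48)" ./run_myprog.sh
```

Keep in mind that another job may claim the free nodes between the check and the submission, so the result is a hint rather than a guarantee.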
What if a job has already been submitted but is waiting in a queued state because there aren't enough nodes available with the number of processors that you asked for per node?
First, use the freenodes command to see if there is a combination of processors and nodes that will work. If there is, then use the qalter command to set your job free:
$ scasub -np 64 -npn 8 -mpimon mpi_prime
145891.n130.localdomain
$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
145814 joe Running 384 INFINITY Wed Jan 6 10:48:23
145879 joe Running 14 INFINITY Wed Jan 6 10:52:18
145880 bob Running 48 INFINITY Wed Jan 6 10:53:50
145881 bob Running 49 INFINITY Wed Jan 6 10:55:35
145882 ralph Running 256 INFINITY Wed Jan 6 13:11:40
145883 bob Running 24 INFINITY Wed Jan 6 13:53:23
145884 joe Running 36 INFINITY Wed Jan 6 13:54:55
145885 joe Running 36 INFINITY Wed Jan 6 13:55:51
145886 joe Running 36 INFINITY Wed Jan 6 14:29:30
145887 joe Running 48 INFINITY Wed Jan 6 14:39:17
10 Active Jobs 931 of 1024 Processors Active (90.92%)
126 of 128 Nodes Active (98.44%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
145891 smith Idle 64 INFINITY Wed Jan 6 17:41:41
1 Idle Job
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 11 Active Jobs: 10 Idle Jobs: 1 Blocked Jobs: 0
$ checkjob 145891
checking job 145891
State: Idle
Creds: user:smith group:elib class:batch qos:DEFAULT
WallTime: 00:00:00 of INFINITY
SubmitTime: Wed Jan 6 17:57:41
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
Total Tasks: 64
Req[0] TaskCount: 64 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '145891' ( INFINITY -> INFINITY Duration: INFINITY)
PE: 64.00 StartPriority: 1
job cannot run in partition DEFAULT (idle procs do not meet requirements : 16 of 64 procs found)
idle procs: 360 feasible procs: 16
Rejection Reasons: [CPU : 43][State : 83]
$ freenodes 4
2
$ freenodes 2
36
$ qalter -l nodes=32:ppn=2 145891   # 145891 is the job id of the user's job waiting to run
$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
.
.
.
.
.
.
.
145891 smith Running 64 INFINITY Wed Jan 6 17:49:09
11 Active Jobs 995 of 1024 Processors Active (97.17%)
126 of 128 Nodes Active (98.44%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 11 Active Jobs: 11 Idle Jobs: 0 Blocked Jobs: 0
So the job was already submitted requesting 64 processors, 8 per node, but was in the idle state as seen with the showq command. Why is this? The checkjob command indicates that the job is waiting in the idle state in the queue until enough nodes with eight processors become available. There's no telling how long it might have to wait before it runs. Using freenodes, first with four processors per node and then with two, we see that there are 36 nodes available with two processors. This job will run using 32 nodes with 2 processors each, so the qalter command is used to tell the scheduler to run the job with 32 nodes and 2 processors per node. After the scheduler has had enough time to process the change request, the job is moved from the idle state to the running state.
How can I get Linux installed on my computer?
How can I get the Tivoli backup client installed on my Linux system?
I need help with some other Linux/Unix/VMS issue.