Running Jobs
Jobs on Saguaro execute on “compute” nodes dedicated to that job. These are distinct from the shared “login” nodes that host interactive sessions. To run a job, you write a batch script containing a number of job control directives and submit it with the “qsub” command. Small, short parallel jobs can also be run interactively.

Login nodes: Saguaro has 3 login nodes that can be reached through SSH and are shared by multiple users at the same time. These nodes are intended only for interactive tasks such as editing files and submitting jobs.

*Do not run applications on the login nodes.

Compute nodes: These nodes are where computation happens; jobs run on them after the required modules are loaded.

Different processor architectures can be requested when launching a job, either on the command line or in the PBS script.

Please refer to the RESOURCES -> Hardware section for processor architecture details.

Launching interactive jobs: To run a job interactively, request compute nodes from a login node with the command:

qsub -I -V -q queuename -l nodes=n:ppn=i,walltime=00:10:00

* The -I flag specifies an interactive job.
* The -V flag passes your current environment variable settings to the compute environment.
* The -q flag specifies the name of the queue.

* The -l flag specifies the resources to request: nodes= the number of nodes, ppn= the number of processors per node, and walltime= the maximum run time.

Go to the directory from which you submitted the job and launch your program: cd $PBS_O_WORKDIR.

The default walltime for interactive sessions is 24 hours.
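A concrete request, for instance for 2 nodes with 8 processors each for 30 minutes, would look like the following (the queue name "normal" is taken from the batch examples later on this page; substitute the queue appropriate for your account):

qsub -I -V -q normal -l nodes=2:ppn=8,walltime=00:30:00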

For example, to launch an OpenMP job interactively:

Load the module:

module load intel

Compile the program:

icc omphello.c -o omphello -openmp

Run the program:

./omphello
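The OpenMP runtime typically uses all of the cores on the node unless told otherwise. To control the thread count explicitly, set OMP_NUM_THREADS before running, as the batch OpenMP example later on this page does (the value 8 here is an assumption matching a ppn=8 request):

export OMP_NUM_THREADS=8
./omphello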

Launching batch jobs: Batch jobs run non-interactively through a script, a text file containing a number of job directives and Linux commands. Batch scripts are submitted to Torque, Saguaro's batch system, which implements the Portable Batch System (PBS) interface for job submission.
Directive lines starting with #PBS specify details such as the nodes to reserve and how long to reserve them. The example below is a Bourne shell script for submitting a job to the PBS queue with the qsub command.

Batch script with detailed explanation:

#!/bin/bash
#PBS -N Jobname 
#PBS -l nodes=3:ppn=8
#PBS -j oe
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -l walltime=00:10:00
#PBS -m abe
#PBS -M sample_email@asu.edu
#PBS -p 0 
#PBS -q queuename

cd $PBS_O_WORKDIR
module load mvapich2/2.0a-intel-13.0
mpiexec -np 24 program_name < inputfile > outputfile

A line beginning with # is a comment; a line beginning with #PBS is a PBS directive.
All PBS directives must appear before the first shell command in the script.

Line-by-line explanation:
#PBS -N JobName
    Sets the name of the job (up to 15 characters, no blank spaces, must start with an alphanumeric character).
#PBS -l nodes=3:ppn=8
    Requests the number of nodes and the number of processors per node for the job.
#PBS -j oe
    Merges the standard error stream into the standard output stream, intermixed, as standard output.
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
    Name the standard error and standard output files. (When -j oe is given, errors go to the output file.)
#PBS -l walltime=00:10:00
    Maximum run time, in hhhh:mm:ss format. If the job has not finished when the walltime is reached, it is terminated.
#PBS -m abe
    PBS can send informative email messages about the status of your job: "a" when the job is aborted, "b" when it begins, and "e" when it ends.
#PBS -M sample_email@asu.edu
    Specifies the email address for those messages.
#PBS -p 0
    Sets the priority of the job. The default is 0.
#PBS -q queuename
    Specifies the queue in which the job should run.
cd $PBS_O_WORKDIR
    Changes to the directory from which the job was submitted ($PBS_O_WORKDIR holds that path).

module load mvapich2/2.0a-intel-13.0: loads the required MPI and compiler environment from the list of available modules.

mpiexec -np 24 program_name < inputfile > outputfile 

The mpiexec launcher sets up the runtime environment for your parallel program and runs it on 24 processors/cores (3 nodes x 8 processors per node in this example).
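To keep the process count in sync with the resource request, a common pattern (shown here as a sketch, not something the example above requires) is to derive it from $PBS_NODEFILE, which contains one line per allocated processor:

# count the allocated processors and start one MPI process per entry
NPROCS=$(wc -l < $PBS_NODEFILE)
mpiexec -np $NPROCS program_name < inputfile > outputfile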

Submit the batch script to the system with the “qsub” command; the job then runs in the background on the compute nodes:

qsub myscript.pbs

In other words, log in to a login node, save the script in a file such as “my_job_script.pbs”, and submit it to the queue with the qsub command.
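A typical submission looks like this (the job ID shown is illustrative; the actual number and server name depend on the system):

qsub my_job_script.pbs
123456.saguaro

The returned job ID can then be passed to qstat, qdel, and the other commands listed below.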

PBS exit codes: When a job exits the queue it returns a value X indicating the status of the job, e.g. X = 0 for successful execution of the job.

These codes fall into ranges; X < 0 means the job could not be executed. Torque provides the following exit codes:

JOB_EXEC_OK 0 job exec successful
JOB_EXEC_FAIL1 -1 job exec failed, before files, no retry
JOB_EXEC_FAIL2 -2 job exec failed, after files, no retry
JOB_EXEC_RETRY -3 job execution failed, do retry
JOB_EXEC_INITABT -4 job aborted on MOM initialization
JOB_EXEC_INITRST -5 job aborted on MOM init, chkpt, no migrate
JOB_EXEC_INITRMG -6 job aborted on MOM init, chkpt, ok migrate
JOB_EXEC_BADRESRT -7 job restart failed
JOB_EXEC_CMDFAIL -8 exec() of user command failed

The exit status can be viewed using the qstat -f command.
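For example, to check the exit status of a finished job you can filter the detailed report for the exit_status field (the job ID is hypothetical, and the field is only available while the server still retains the completed job's record):

qstat -f 123456 | grep exit_status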

The most commonly used PBS environment variables are:

echo $PBS_O_HOST       - host on which qsub was run
echo $PBS_O_QUEUE      - originating queue
echo $PBS_QUEUE        - executing queue
echo $PBS_O_WORKDIR    - directory from which the job was submitted
echo $PBS_ENVIRONMENT  - execution mode (batch or interactive)
echo $PBS_JOBID        - job identifier
echo $PBS_JOBNAME      - job name
echo $PBS_NODEFILE     - file listing the nodes allocated to the job
echo $PBS_O_HOME       - home directory of the submitting user
echo $PBS_O_PATH       - PATH of the submitting environment
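A minimal batch script that simply prints a few of these variables can be useful for checking what the environment looks like inside a job (a sketch; the queue name is a placeholder):

#!/bin/bash
#PBS -N env_check
#PBS -l nodes=1:ppn=1,walltime=00:05:00
#PBS -q queuename

echo "Host:           $PBS_O_HOST"
echo "Job ID:         $PBS_JOBID"
echo "Job name:       $PBS_JOBNAME"
echo "Submitted from: $PBS_O_WORKDIR"
echo "Node file:      $PBS_NODEFILE"
cat $PBS_NODEFILE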

Storage: All of the compute nodes have access to 215 TB of high-speed parallel Lustre scratch storage (scalable to petabytes) and 1.5 PB of spinning-disk primary storage. The scratch space is purged every 30 days.

qsub batch_script - Submits a batch script to the queue. The output of qsub is a job ID.
qdel jobid - Deletes a job from the queue. (To delete a job from the saguaro xfer queue, users must add an additional parameter.)
qhold jobid - Puts a job on hold in the queue.
qrls jobid - Releases a job from hold.
qalter [options] jobid - Changes attributes of a submitted job, e.g. -l nodes=n:ppn=8.
qmove new_queue jobid - Moves a job to a new queue. Remember, the new queue must be one of the submission queues (premium, regular, or low).
qstat -a - Lists jobs in submission order (more useful than qstat without options). Also takes -u username and -f jobid options.
qstat -f jobid - Produces a detailed report for the job. Note: if used on the login node from which the job was submitted, jobid need only contain the numerical portion of the job ID.
qs - NERSC-provided wrapper that shows jobs in priority order. Takes -u username and -w options.
apstat - Shows the number of up nodes and idle nodes and lists currently pending and running jobs. apstat -r displays all node reservations.
showq - Lists jobs in priority order in three categories: active jobs, eligible jobs, and blocked jobs. showq -i lists details of all eligible jobs.
showstart jobid - Displays the earliest possible start time for jobs that request the same amount of resources (nodes, walltime, memory, etc.).
checkjob jobid - Displays the current job state and whether nodes are available to run the job now.
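For example, a quick way to check on a submitted job (the job ID is hypothetical) is:

qstat -a
checkjob 123456
showstart 123456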

To alter the resources for a current job in the queue:

qalter -lwalltime=new_walltime jobid
qalter -lmppwidth=new_mppwidth jobid

Examples Batch Scripts:

1. Basic batch script

#PBS -q normal
#PBS -lnodes=16:ppn=8
#PBS -l walltime=00:20:00
#PBS -N jobname
#PBS -e jobname.$PBS_JOBID.err
#PBS -o jobname.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
module load openmpi
mpiexec -n 128 ./job_executable

2. Running Hybrid MPI/OpenMP applications

#PBS -q normal
#PBS -lnodes=2:ppn=8
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V 

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
module load openmpi
mpiexec -n 2 ./job_executable

3. OpenMP

#PBS -q normal
#PBS -lnodes=1:ppn=8
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
module load intel
./my_executable