Login nodes: Saguaro has 3 login nodes, which are reached through SSH and are shared by multiple users at the same time. These nodes are used only for interactive tasks such as editing files.
*Do not run applications on the login nodes.
Compute nodes: These nodes are used for computation. Jobs run here after the required modules have been loaded.
Different processor architectures can be chosen during run time or in the PBS script.
Please refer to the RESOURCES -> Hardware section for processor architecture details.
Launching interactive jobs: To run jobs interactively, request compute nodes from a login node using the command:
qsub -I -V -q queuename -l nodes=n:ppn=i -l walltime=00:10:00
* The -I flag specifies an interactive job.
* The -V flag passes your current environment variable settings to the compute environment.
* The -q flag specifies the name of the queue.
* The -l nodes=n:ppn=i option determines how many nodes are allocated for your job: nodes is the number of nodes and ppn is the number of processors per node.
Once the session starts, change to the directory you submitted from and launch your job: cd $PBS_O_WORKDIR.
The default walltime for interactive sessions is 24 hours.
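The flags described above can be combined as in the following minimal sketch. It only assembles and prints the command line rather than running it, so it can be tried without a scheduler. The helper name build_interactive_qsub and the queue name "queuename" are placeholders for illustration, not part of PBS.

```shell
# Sketch: assemble (but do not run) an interactive qsub command line.
# build_interactive_qsub is a hypothetical helper, not a PBS/Torque command.
build_interactive_qsub() {
    local queue="$1" nodes="$2" ppn="$3" walltime="$4"
    # -I: interactive job, -V: export current environment,
    # -q: queue name, -l: resource list (nodes, ppn, walltime)
    echo "qsub -I -V -q ${queue} -l nodes=${nodes}:ppn=${ppn},walltime=${walltime}"
}

# Print the command for 1 node with 8 processors for 10 minutes.
build_interactive_qsub queuename 1 8 00:10:00
```

Printing the command first makes it easy to check the resource request before actually submitting it.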
For example, to launch an OpenMP job interactively:
Load the module:
module load intel
Compile the program:
icc omphello.c -o omphello -openmp
Run the program:
./omphello
Batch script with detailed explanation:
#!/bin/bash
#PBS -N Jobname
#PBS -l nodes=3:ppn=8
#PBS -j oe
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -l walltime=00:10:00
#PBS -m abe
#PBS -M firstname.lastname@example.org
#PBS -p 0
#PBS -q queuename
cd $PBS_O_WORKDIR
module load mvapich2/2.0a-intel-13.0
mpiexec -np 24 program_name < inputfile > outputfile
A line beginning with # is a comment. A line beginning with #PBS is a PBS directive.
The PBS directives must always be listed first.
|PBS script line||Explanation|
|#PBS -N JobName||Sets the name of the job (up to 15 characters, no blank spaces, starting with an alphanumeric character).|
|#PBS -l nodes=1:ppn=1||Assigns the number of nodes and the number of processors per node for the requested job to run.|
|#PBS -j oe||Merges the standard error stream into the standard output stream, so both are written, intermixed, as standard output.|
|#PBS -e my_job.$PBS_JOBID.err, #PBS -o my_job.$PBS_JOBID.out||Names of the standard error and standard output files.|
|#PBS -l walltime=00:10:00||Format: hhhh:mm:ss. If the job does not finish within this time, it is terminated.|
|#PBS -m abe||PBS can send informative email messages to you about the status of your job. Send mail “a” (job is aborted), “b” (job begins), and “e” (job terminates).|
|#PBS -M email@example.com||Specifies the email address to which the messages are sent.|
|#PBS -p 0||Set the priority for the job to run. The default is 0.|
|#PBS -q queuename||Specific queue in which you want the job to run|
|cd $PBS_O_WORKDIR||Changes to the directory from which you submitted your job. $PBS_O_WORKDIR holds the path to that directory.|
|module load mvapich2/2.0a-intel-13.0||Loads the required module (here MVAPICH2 built with the Intel compiler) from the list of available modules.|
|mpiexec -np 24 program_name < inputfile > outputfile||Sets up the runtime environment for your parallel program and runs it on 24 processors/cores.|
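In the script above, the value passed to mpiexec -np (24) is the product of nodes (3) and ppn (8). The following sketch generates a shortened version of that script so the two cannot drift apart. The helper name make_pbs_script is hypothetical, and only a subset of the directives from the table is included.

```shell
# Sketch: print a PBS batch script whose mpiexec process count
# always equals nodes * ppn. make_pbs_script is a hypothetical helper.
make_pbs_script() {
    local nodes="$1" ppn="$2"
    local nprocs=$(( ${nodes} * ${ppn} ))   # total number of MPI processes
    # \$PBS_O_WORKDIR is escaped so it is expanded when the job runs,
    # not when the script is generated.
    cat <<EOF
#!/bin/bash
#PBS -N Jobname
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=00:10:00
#PBS -q queuename
cd \$PBS_O_WORKDIR
module load mvapich2/2.0a-intel-13.0
mpiexec -np ${nprocs} program_name < inputfile > outputfile
EOF
}

# Show the generated mpiexec line for 3 nodes with 8 processors each.
make_pbs_script 3 8 | grep mpiexec
```

To use the result, redirect it to a file, for example make_pbs_script 3 8 > my_job_script.pbs.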
Use a batch script to submit jobs that run in the background on the compute nodes. After logging in to a head node, save the script in a file such as “my_job_script.pbs” and submit it to the queue with the “qsub” command:
qsub my_job_script.pbs
PBS exit codes: When a job exits the queue, it returns a value X indicating the status of the job.
These codes fall into ranges, e.g.:
X=0 (successful execution of the job)
X<0 (job could not be executed)
Torque provides the following exit codes:
|JOB_EXEC_OK||0||job exec successful|
|JOB_EXEC_FAIL1||-1||job exec failed, before files, no retry|
|JOB_EXEC_FAIL2||-2||job exec failed, after files, no retry|
|JOB_EXEC_RETRY||-3||job execution failed, do retry|
|JOB_EXEC_INITABT||-4||job aborted on MOM initialization|
|JOB_EXEC_INITRST||-5||job aborted on MOM init, chkpt, no migrate|
|JOB_EXEC_INITRMG||-6||job aborted on MOM init, chkpt, ok migrate|
|JOB_EXEC_BADRESRT||-7||job restart failed|
|JOB_EXEC_CMDFAIL||-8||exec() of user command failed|
The exit status can be viewed using the qstat -f command.
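As a rough sketch, the exit codes in the table above can be decoded in a script. Both helper names are hypothetical. The parser assumes that qstat -f reports the value on a line of the form "exit_status = N"; check this against the output on your system.

```shell
# Sketch: translate a Torque exit status into the symbolic name from the table.
# torque_exit_name is a hypothetical helper.
torque_exit_name() {
    case "$1" in
        0)  echo "JOB_EXEC_OK" ;;
        -1) echo "JOB_EXEC_FAIL1" ;;
        -2) echo "JOB_EXEC_FAIL2" ;;
        -3) echo "JOB_EXEC_RETRY" ;;
        -4) echo "JOB_EXEC_INITABT" ;;
        -5) echo "JOB_EXEC_INITRST" ;;
        -6) echo "JOB_EXEC_INITRMG" ;;
        -7) echo "JOB_EXEC_BADRESRT" ;;
        -8) echo "JOB_EXEC_CMDFAIL" ;;
        *)  echo "UNLISTED" ;;   # value not covered by the table
    esac
}

# Read qstat -f style text on stdin and print the exit status value.
# Assumption: the line looks like "    exit_status = -3".
extract_exit_status() {
    sed -n 's/^[[:space:]]*exit_status = //p'
}

# Demonstration with a made-up line instead of real qstat output.
torque_exit_name "$(printf '    exit_status = -8\n' | extract_exit_status)"
```

On the cluster, the demonstration line would be replaced by something like qstat -f jobid | extract_exit_status.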
The most commonly used PBS environment variables are:
echo $PBS_O_HOST - host on which qsub was run
echo $PBS_O_QUEUE - originating queue
echo $PBS_QUEUE - executing queue
echo $PBS_O_WORKDIR - working directory
echo $PBS_ENVIRONMENT - execution mode
echo $PBS_JOBID - job identifier
echo $PBS_JOBNAME - job name
echo $PBS_NODEFILE - node file
echo $PBS_O_HOME - home directory
echo $PBS_O_PATH - PATH
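One common use of these variables is to derive the process count from $PBS_NODEFILE instead of hard-coding it. The sketch below assumes the usual Torque layout, in which the node file contains one line per allocated processor (so a node with ppn=8 appears eight times). The helper names and the host names in the demonstration are made up.

```shell
# Sketch: count allocated processors and distinct nodes from a PBS node file.
# Both helpers are hypothetical; they take the node file path as argument.
count_allocated_cores() {
    grep -c . "$1"              # number of non-empty lines
}

count_allocated_nodes() {
    sort -u "$1" | grep -c .    # number of distinct host names
}

# Demonstration with a temporary file standing in for $PBS_NODEFILE.
demo_nodefile=$(mktemp)
printf 'node1\nnode1\nnode2\nnode2\n' > "$demo_nodefile"
count_allocated_cores "$demo_nodefile"   # prints 4
count_allocated_nodes "$demo_nodefile"   # prints 2
rm -f "$demo_nodefile"
```

Inside a job, the same helper could feed mpiexec, for example mpiexec -np "$(count_allocated_cores "$PBS_NODEFILE")" program_name.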
Storage: All of the compute nodes have access to 215 TB of high-speed parallel Lustre scratch storage (scalable to petabytes) and 1.5 PB of spinning-disk primary storage. The scratch space is purged once every 30 days.
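Because scratch is purged, it can help to list files that are old enough to be at risk before the next purge. The sketch below is one way to do that with find. It assumes the purge is based on modification time, which the text above does not state, and the helper name and the directory in the usage comment are placeholders.

```shell
# Sketch: list regular files under a directory whose age in whole days
# exceeds 30, based on modification time.
# list_purge_candidates is a hypothetical helper; whether the purge uses
# modification time or access time is an assumption here.
list_purge_candidates() {
    find "$1" -type f -mtime +30
}

# Usage (the path is a placeholder for your own scratch directory):
#   list_purge_candidates /path/to/your/scratch/dir
```

Files reported this way should be copied to primary storage if they are still needed.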
|qsub batch_script||Submits the batch script to the queue. The output of qsub is the jobid.|
|qdel jobid||Deletes a job from the queue. To delete a job from the saguaro xfer queue, users must add an additional parameter.|
|qhold jobid||Puts a job on hold in the queue.|
|qrls jobid||Releases a job from hold.|
|qalter [options] jobid||Changes attributes of a submitted job. nodes=:ppn=8|
|qmove new_queue jobid||Moves a job to a new queue. Remember, the new queue must be one of the submission queues (premium, regular, or low).|
|qstat -a||Lists jobs in submission order (more useful than qstat without options). Also takes the -u and -f [jobid] options.|
|qstat -f jobid||Produce a detailed report for the job. Note: if used on the login node from which the job was submitted then jobid need only contain the numerical portion of the job id|
|qs||NERSC provided wrapper that shows jobs in priority order. Takes -u username and -w options.|
|apstat||Shows the number of up nodes and idle nodes and a list of current pending and running jobs. The apstat -r command displays all node reservations.|
|showq||Lists jobs in priority order in three categories: active jobs, eligible jobs and blocked jobs. showq -i lists details of all eligible jobs.|
|showstart jobid||Takes a jobid as its argument and displays the earliest possible start time for a job requesting the same amount of resources (nodes, walltime, memory, etc.).|
|checkjob jobid||Takes a jobid as its argument and displays the current job state and whether nodes are available to run the job currently.|
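Most of the commands in this table take a jobid, and qsub prints one when a job is submitted. Torque job ids have the form number.servername, and the table notes that the numerical portion alone is often enough. The sketch below extracts that portion. The helper name and the sample job id are made up for illustration.

```shell
# Sketch: extract the numerical portion of a Torque job id.
# job_number is a hypothetical helper.
job_number() {
    # Remove everything from the first dot onward: 12345.host -> 12345
    echo "${1%%.*}"
}

# Demonstration with a made-up job id.
job_number "12345.headnode.example"   # prints 12345

# On the cluster, the id would typically be captured at submission time:
#   jobid=$(qsub my_job_script.pbs)
#   qstat -f "$(job_number "$jobid")"
```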
To alter the resources for a job currently in the queue:
qalter -l walltime=new_walltime jobid
qalter -l mppwidth=new_mppwidth jobid
Example Batch Scripts:
1. Basic batch script
#PBS -q normal
#PBS -l nodes=16:ppn=8
#PBS -l walltime=00:20:00
#PBS -N jobname
#PBS -e jobname.$PBS_JOBID.err
#PBS -o jobname.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
module load openmpi
mpiexec -n 128 ./job_executable
2. Running Hybrid MPI/OpenMP applications
#PBS -q normal
#PBS -l nodes=2:ppn=8
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
module load openmpi
mpiexec -n 2 ./job_executable
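In the hybrid script, 2 nodes with 8 processors each give 16 cores, and with 8 OpenMP threads per MPI process that leaves room for 2 MPI processes, which is the value passed to mpiexec -n. The sketch below performs the same arithmetic for other layouts. The helper name mpi_ranks_for is hypothetical.

```shell
# Sketch: compute how many MPI processes fit on the allocation when each
# process runs a fixed number of OpenMP threads.
# mpi_ranks_for is a hypothetical helper.
mpi_ranks_for() {
    local nodes="$1" ppn="$2" threads="$3"
    local cores=$(( ${nodes} * ${ppn} ))   # total allocated cores
    if [ $(( ${cores} % ${threads} )) -ne 0 ]; then
        echo "error: ${cores} cores cannot be split evenly into ${threads}-thread processes" >&2
        return 1
    fi
    echo $(( ${cores} / ${threads} ))
}

# Layout of the hybrid example: 2 nodes, ppn=8, OMP_NUM_THREADS=8.
mpi_ranks_for 2 8 8   # prints 2
```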
3. Running OpenMP applications
#PBS -q normal
#PBS -l nodes=1:ppn=8
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
module load intel
./my_executable