Job Submission
In the following sections, john is used as an example
username. Be sure to replace this with the appropriate
Michigan Tech ISO username.
Transferring Files
Suppose that all the files and folders needed to run a
simulation (or a set of simulations) are located in the campus
home directory, in $HOME/project_X. Then the following command,
when run from any Linux machine that is maintained by
Information Technology Services, will securely transfer them,
as is, to Rama:
export RAMA="rama-login.research.mtu.edu"
rsync -ave ssh -hPz \
$HOME/project_X/ john@${RAMA}:/research/john/project_X/
When the simulation (or set of simulations) has completed
successfully, the following command, when run from any Linux
machine that is maintained by Information Technology Services,
will securely transfer the project folder, including the
results, as is, from Rama:
export RAMA="rama-login.research.mtu.edu"
rsync -ave ssh -hPz \
john@${RAMA}:/research/john/project_X/ $HOME/project_X/
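The two transfers above differ only in direction, so they can be wrapped in small helper functions. The following is a minimal sketch, not an official utility: the names RAMA_USER, PROJECT, remote_path, push_project and pull_project are assumptions introduced for illustration. Passing rsync's --dry-run option lists what would be transferred without copying anything.

```shell
# Assumption: RAMA_USER, PROJECT and the function names are illustrative only.
RAMA="rama-login.research.mtu.edu"
RAMA_USER="john"        # replace with your Michigan Tech ISO username
PROJECT="project_X"

# Print the remote location of a project folder under /research/<user>/.
remote_path() {
    printf '%s@%s:/research/%s/%s/\n' "$RAMA_USER" "$RAMA" "$RAMA_USER" "$1"
}

# Copy the local project folder to Rama. Extra arguments are passed on to
# rsync, so "push_project --dry-run" only lists what would be transferred.
push_project() {
    rsync -ave ssh -hPz "$@" "$HOME/$PROJECT/" "$(remote_path "$PROJECT")"
}

# Copy the project folder back from Rama; accepts the same extra arguments.
pull_project() {
    rsync -ave ssh -hPz "$@" "$(remote_path "$PROJECT")" "$HOME/$PROJECT/"
}
```

A cautious workflow would be to run push_project --dry-run first, check the listed files, and then run push_project without the extra option.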
Queues
rama.q (compute-0-N; N: 0-8)
All users have access to this queue.
Batch Submission Scripts
qgenscript is a fairly extensive interactive utility that
generates the necessary submission script for all
computational science/engineering suites available on Rama.
It should be called from the folder where all the files
(or folders) needed to run the job are located. For example,
cd /research/john/project_X
qgenscript
Apart from generating a script based on the user-supplied
information, qgenscript also displays the wait time
statistics and the command necessary to submit that script
to the queue. It is critical that users do not edit this
script, and do not re-use or re-purpose an old script for a
newer job.
For every job, users can expect to receive
several emails (or SMS notifications) with status (begin,
end, abort/kill and suspend) information.
Submitted jobs can often stay in qw mode for longer than
normal periods of time, depending on the availability of
resources (processors, memory, number of licenses, etc.).
Scheduling Policy
Pre-emption (the act of temporarily interrupting/suspending
a task, with the intention of resuming it at a later point
in time) is disabled on Rama. In other words, any job that
starts will run to completion unless it reaches the wall time
limit (for example, when running on short.q) or it is stopped
for another reason (terminated by the user, terminated by the
administrator with the user's approval, power cycling of
node(s), a bug in the source code/compiler, etc.).
Array Jobs
Suppose that a project (or a phase of one) involves running a
large number of simulations, say 1000, and suppose that all
these simulations use the same software suite and command but
run on different data sets, named input_1.txt, input_2.txt,
input_3.txt, ..., input_1000.txt.
Using qgenscript a thousand times to generate the required
batch submission script for each simulation is tedious, time
consuming and inefficient, and it also makes monitoring (or
deleting) the jobs a hassle.
To handle such a scenario efficiently, qgenscript offers an
optional feature with which users may specify that a given
simulation is an array job and then specify the range
(i.e., 1-1000). The numbers 1, 2, 3, ..., 1000 are then
referred to as task IDs, each identifying a specific data set.
The result is just one batch submission script that not
only makes it easier to submit and monitor the job (or
terminate some/all tasks, if need be) but also puts
significantly less burden on the queuing system.
As many tasks as possible will run concurrently, provided
all requested resources are available. Output and other
files will also be tagged with the respective task ID for
unique identification. Users must note that the queuing
system will send email (or SMS) notifications for each task.
If a given scenario is more complex than the one described
above (e.g., running a task only if its output file is not
already present, selecting the jth line from a file as a
parameter for the jth task, using the task ID itself as a
parameter for a function call, and so on), users are strongly
encouraged to contact Information Technology Services.
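To make two of these scenarios concrete when discussing them with Information Technology Services, the sketch below shows the general idea in shell. It is only an illustration and not what qgenscript generates. It assumes that the queuing system exposes the task ID in an environment variable named SGE_TASK_ID (as Grid Engine style schedulers do), and the file names params.txt and output_<ID>.txt, as well as the function names, are hypothetical.

```shell
# Assumptions: SGE_TASK_ID is set by the queuing system for each task of an
# array job; params.txt and output_<ID>.txt are hypothetical file names.

# Print line number $2 of file $1 (the jth line for the jth task).
param_for_task() {
    sed -n "${2}p" "$1"
}

# Succeed only if the output file for task $1 does not exist yet.
needs_run() {
    [ ! -e "output_${1}.txt" ]
}

task_id="${SGE_TASK_ID:-1}"   # fall back to 1 when run outside the queue

if [ -r params.txt ] && needs_run "$task_id"; then
    param="$(param_for_task params.txt "$task_id")"
    echo "task ${task_id} would run with parameter: ${param}"
fi
```

The actual submission script should still come from qgenscript or from Information Technology Services, in line with the policy on not editing generated scripts.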
Exclusive Compute Node Access
qgenscript offers an optional feature with which users may
request exclusive access to a compute node (for any job that
uses up to a maximum of 12 processors).
Advantage:
If chosen, a job will be sent to a compute node that is
completely free, and no other job will be assigned to that
node until that given job is completed. This feature will
satisfactorily handle CPU and/or memory intensive jobs.
Disadvantage:
If chosen, a job will very likely wait in qw mode for longer
than usual while the queuing system finds a node that is
completely free. For normal jobs (i.e., those that are not CPU
and/or memory intensive), this feature will hog the entire
node and prevent it from being used by other normal jobs.
If chosen for a job with more than 12 processors, the job will
never run and will stay in qw mode forever.
SMS Notifications
Users who wish to receive SMS notifications in lieu of the
usual email notifications will need to create a file,
$HOME/.mobile. Please note that the service provider will
charge the usual messaging rates. Supposing that
(906) 370-1234 is the 10 digit cell number, please refer to
the table below for the contents of $HOME/.mobile.
Service Provider | Contents of $HOME/.mobile |
Alltel | 9063701234@message.alltel.com |
AT&T | 9063701234@txt.att.net |
Nextel | 9063701234@messaging.nextel.com |
Sprint | 9063701234@messaging.sprintpcs.com |
T-Mobile | 9063701234@tmomail.net |
Verizon | 9063701234@vtext.com |
Users will have the opportunity to select (or decline) this
option, on a per job basis, when generating a submission
script using qgenscript.
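As a minimal sketch of preparing that file, the helper below strips the formatting from a phone number and appends a gateway domain taken from the table above. The function name mobile_address is an assumption introduced for illustration; only the address format comes from the table.

```shell
# Build the one-line contents of $HOME/.mobile from a phone number ($1)
# and a gateway domain ($2) taken from the table above.
mobile_address() {
    digits="$(printf '%s' "$1" | tr -cd '0-9')"   # keep digits only
    if [ "${#digits}" -ne 10 ]; then
        echo "expected a 10 digit number, got: $1" >&2
        return 1
    fi
    printf '%s@%s\n' "$digits" "$2"
}

# Usage (writes the file only if the number is valid; uncomment to apply):
# addr="$(mobile_address "(906) 370-1234" "vtext.com")" && echo "$addr" > "$HOME/.mobile"
```

Checking the digit count before writing avoids leaving an unusable address in the file.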
Running Programs on Front End/Login Node(s)
When a user runs a program on Front End/Login Node(s),
either inadvertently or intentionally, the system sends a
warning email. When the same user is found to be running a
program on Front End/Login Node(s) even after the first
warning, the system logs the user out without further warning
and blocks access until further notice. A meeting with the
user and advisor(s) will be necessary before the user can
access Rama again.
Researchers are strongly encouraged to use less, more, grep,
etc., or a qlogin session, to analyze (larger) files, as
using vi (or vim) can consume a large percentage of CPU and
memory and in turn result in a violation.
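A minimal sketch of inspecting a large output file with grep and tail rather than an editor is given below. The function name peek_matches, the file name simulation.log and the search pattern are assumptions introduced for illustration.

```shell
# Print how many lines of file $2 match pattern $1, then the last $3
# matching lines (default 5) together with their line numbers.
peek_matches() {
    pattern="$1"; file="$2"; last="${3:-5}"
    echo "matching lines: $(grep -c -- "$pattern" "$file")"
    grep -n -- "$pattern" "$file" | tail -n "$last"
}

# Usage (simulation.log and the pattern are hypothetical):
# peek_matches "ERROR" simulation.log 3
# less simulation.log        # page through the file instead of opening an editor
```

Both grep and tail stream through the file, so only the matching lines are printed.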
Running programs on Front End/Login Node(s) can cause
serious problems such as drastically reduced performance
for everyone, a corrupted file system, missing features,
users being unable to log in, and the cluster shutting down.
In the event that Information Technology Services has to
rebuild the cluster from scratch to repair these problems,
it will take a considerable amount of time and researchers
using this cluster might not be able to meet their project
deadlines.
qlogin: Compiling Programs & Minimal Analysis
When there is a need to compile programs and/or utilities,
and/or run scripts/tools/utilities to create the input files
and/or analyze the output files, users are expected to do so
on one of the compute nodes. Using the following command
qlogin
will not only grant legitimate access to an otherwise
inaccessible compute (or tile) node but also protect the
user from potential violations, as described in the
Running Programs on Front End/Login Node(s) section.
A standard set of compilers, libraries, tools and utilities
will always be available. However, depending on the
availability of resources, the number of concurrent qlogin
sessions (2), the duration of each such session (4 hours)
and/or the list of accessible software suites might be
limited.
Disk Usage
The cluster is not backed up and users are responsible for
their data. Owing to limited storage space, the cluster is
set up such that every user, without exception, will get an
email once the respective HOME folder exceeds a pre-set limit
(and also, when applicable, once the respective RESEARCH
folder exceeds a pre-set limit), and will continue to get
such emails a few times a day until disk usage drops below
the set limit.
After the 12th consecutive notification, the user's account
will be locked and will require a request from the advisor to
have the user's account re-enabled.
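To see what is taking up space before the notifications accumulate, the standard du and sort utilities are usually enough. The following is a minimal sketch; the function name largest_entries is an assumption introduced for illustration, and /research/john follows the example username used throughout.

```shell
# List the $2 largest entries (default 10) directly inside directory $1,
# largest first, with sizes in kilobytes. Hidden files are not included.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Usage:
# du -sh "$HOME" /research/john     # total size of each folder
# largest_entries "$HOME" 5
```

Since walking a large folder takes time and I/O, it may be preferable to run this from a qlogin session rather than on the login node.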
Useful Commands
Built-in | |
qstat -j JOB_ID | More information about a running/waiting job |
qhold JOB_ID | Hold/Pause a running/waiting job |
qrls JOB_ID | Release a job from hold |
qdel JOB_ID | Delete a job from the queue |
qlogin | Log into a compute/tile node |
qacct -j JOB_ID | Accounting information for a job that has been completed |
qacct -o USERNAME | Accounting information for a given user |
qhost -u USERNAME | All jobs & queues used by a given user |