Job Submission


In the following sections, john is used as an example username. Be sure to replace this with the appropriate Michigan Tech ISO username.

 

Transferring Files

Suppose that all the files (and/or folders) needed to run a (set of) simulation(s) are located in the campus home directory, in $HOME/project_X. Then the following command, when run from any Linux machine maintained by Information Technology Services, will securely transfer them, as is, to Rama:

export RAMA="rama-login.research.mtu.edu"
rsync -avhPz -e ssh \
$HOME/project_X/ john@${RAMA}:/research/john/project_X/

When the (set of) simulation(s) has completed successfully, the following command, when run from any Linux machine maintained by Information Technology Services, will securely transfer the files (including any results), as is, from Rama:

export RAMA="rama-login.research.mtu.edu"
rsync -avhPz -e ssh \
john@${RAMA}:/research/john/project_X/ $HOME/project_X/
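
Before a large transfer, it can help to preview what rsync would copy. The sketch below wraps the two transfers in shell functions so that extra rsync options, such as --dry-run, can be passed through. The function names push_to_rama and pull_from_rama and the overridable RSYNC variable are illustrative assumptions, not part of Rama's tooling; john is the example username.

```shell
RAMA="rama-login.research.mtu.edu"
RSYNC="${RSYNC:-rsync}"   # assumption: overridable so the command can be inspected

# push_to_rama [extra rsync options...]: campus home directory -> Rama
push_to_rama() {
  "$RSYNC" -avhPz -e ssh "$@" \
    "$HOME/project_X/" "john@${RAMA}:/research/john/project_X/"
}

# pull_from_rama [extra rsync options...]: Rama -> campus home directory
pull_from_rama() {
  "$RSYNC" -avhPz -e ssh "$@" \
    "john@${RAMA}:/research/john/project_X/" "$HOME/project_X/"
}

# Usage:
#   push_to_rama --dry-run   # list what would be sent, transfer nothing
#   push_to_rama             # perform the transfer
```

With --dry-run, rsync reports the files it would transfer without changing anything on either side.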

Queues

  1. rama.q (compute-0-N; N: 0-8)

    All users have access to this queue.

Batch Submission Scripts

qgenscript is a fairly extensive interactive utility that generates the necessary batch submission script for every computational science/engineering suite available on Rama. It should be called from the folder that contains all the files (or folders) needed to run the job. For example,

cd /research/john/project_X
qgenscript

Apart from generating a script based on the user-supplied information, qgenscript also displays the wait time statistics and the command necessary to submit that script to the queue. It is critical that users neither edit this script nor re-use/re-purpose an old one for a newer job.

For every job, users can expect to receive several emails (or SMS notifications) with status (begin, end, abort/kill and suspend) information. Submitted jobs can often stay in qw mode for longer than usual, depending on the availability of resources (processors, memory, number of licenses, etc.).

Scheduling Policy

Pre-emption (the act of temporarily interrupting/suspending a task, with the intention of resuming it at a later point in time) is disabled on Rama. In other words, any job that starts will run to completion unless it reaches the wall time limit (e.g., when running on short.q) or is stopped [terminated by the user, terminated by the administrator with the user's approval, power cycling of node(s), a bug in the source code/compiler, etc.].

Array Jobs

Suppose that a (phase of a) project involves running a large number of simulations, say 1000, and suppose that all these simulations use the same software suite & command but run on different data sets, named input_1.txt, input_2.txt, input_3.txt, ..., input_1000.txt.

Using qgenscript a thousand times to generate a batch submission script for each simulation is tedious, time consuming and inefficient, and it makes monitoring (or deleting) the resulting jobs a hassle.

To handle such a scenario efficiently, qgenscript offers an optional feature with which users may specify that a given simulation is an array job and then specify the range (i.e., 1-1000; the numbers 1, 2, 3, ..., 1000 are then referred to as task IDs, each identifying a specific data set).

The result is just one batch submission script that not only makes it easier to submit and monitor the job (or terminate some/all tasks, if need be) but also puts significantly less burden on the queuing system. As many tasks as possible will run concurrently, provided all requested resources are available. Output and other files will also be tagged with the respective task ID for unique identification. Users must note that the queuing system will send email (or SMS) notifications for each task.

If a given scenario is more complex than the one described above (e.g., running a task only if its output file is not present, selecting the jth line from a file as a parameter for the jth task, using the task ID itself as a parameter for a function call, and so on), users are strongly encouraged to contact Information Technology Services.
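
To make the task ID idea concrete, the sketch below shows how a task ID can be mapped to its input file, to the jth line of a parameter file, and to a check for an existing output file. It is only an illustration, not the script that qgenscript generates, and the advice to contact Information Technology Services for complex cases still applies. It assumes that the queuing system exports the task ID as the SGE_TASK_ID environment variable (the Grid Engine convention); the helper names, params.txt and the output_N.txt naming are hypothetical.

```shell
# Assumption: the queuing system sets SGE_TASK_ID for each task of an array job.

# Name of the input file for a given task ID (input_1.txt ... input_1000.txt).
input_for_task() {
  printf 'input_%s.txt\n' "$1"
}

# j-th line of a parameter file, for use as the parameter of the j-th task.
param_for_task() {
  sed -n "${1}p" "$2"
}

# Succeeds if the task still needs to run, i.e. its (hypothetical) output
# file output_<task ID>.txt is not present in the current folder.
needs_run() {
  [ ! -e "output_${1}.txt" ]
}

task="${SGE_TASK_ID:-1}"   # fall back to 1 when run outside the queue
if needs_run "$task"; then
  echo "task $task would process $(input_for_task "$task")"
fi
```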

Exclusive Compute Node Access

qgenscript offers an optional feature with which users may request exclusive access to a compute node (for any job with up to a maximum of 12 processors).

Advantage: If chosen, a job will be sent to a compute node that is completely free, and no other job will be assigned to that node until that given job is completed. This feature will satisfactorily handle CPU and/or memory intensive jobs.

Disadvantage: If chosen, a job will very likely wait in qw mode for longer than usual while the queuing system finds a node that is completely free. When used for normal jobs (i.e., not CPU and/or memory intensive), this feature will hog the entire node and prevent it from being used by other normal jobs. If chosen for a job with more than 12 processors, the job will never run and will stay in qw mode indefinitely.

SMS Notifications

Users who wish to receive SMS notifications in lieu of the usual email notifications will need to create a file, $HOME/.mobile. Please note that the service provider will charge its usual messaging rates. Supposing that (906) 370-1234 is the 10-digit cell number, the table below lists the contents of $HOME/.mobile for each service provider.


Service Provider    Contents of $HOME/.mobile
Alltel              9063701234@message.alltel.com
AT&T                9063701234@txt.att.net
Nextel              9063701234@messaging.nextel.com
Sprint              9063701234@messaging.sprintpcs.com
T-Mobile            9063701234@tmomail.net
Verizon             9063701234@vtext.com


Users can select or decline this option, on a per-job basis, when generating a submission script with qgenscript.
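
As an illustration of how the contents of $HOME/.mobile follow from the table, the sketch below builds the address from a phone number and a provider's gateway domain. The helper name mobile_address is hypothetical, and the sketch assumes the file holds just the single address line shown in the table.

```shell
# mobile_address PHONE DOMAIN: print the address to store in $HOME/.mobile.
mobile_address() {
  # Keep only the digits: "(906) 370-1234" -> "9063701234"
  digits=$(printf '%s' "$1" | tr -cd '0-9')
  if [ "${#digits}" -ne 10 ]; then
    echo "expected a 10-digit number, got: $1" >&2
    return 1
  fi
  printf '%s@%s\n' "$digits" "$2"
}

# Example (Verizon row of the table):
#   mobile_address "(906) 370-1234" vtext.com > "$HOME/.mobile"
```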

Running Programs on Front End/Login Node(s)

When a user runs a program on Front End/Login Node(s), either inadvertently or intentionally, the system sends a warning email. When the same user is found to be running a program on Front End/Login Node(s) even after the first warning, the system logs the user out without further warning and blocks access until further notice. A meeting with the user and advisor(s) will be necessary before the user can access Rama again.

Researchers are strongly encouraged to use tools such as less, more or grep, or a qlogin session, to analyze (larger) files, since vi (or vim) can consume a large percentage of CPU and memory on such files and, in turn, result in a violation.
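
A minimal sketch of inspecting a large file without opening it in an editor follows. The helper name summarize_file is hypothetical; it relies only on the standard wc, grep and tail utilities.

```shell
# summarize_file FILE PATTERN: print the line count, the number of lines
# matching PATTERN, and the last 3 lines of FILE.
summarize_file() {
  file="$1"
  pattern="$2"
  echo "lines:   $(wc -l < "$file" | tr -d ' ')"
  echo "matches: $(grep -c -- "$pattern" "$file")"
  tail -n 3 "$file"
}

# Example:
#   summarize_file simulation.log ERROR
```

For browsing rather than summarizing, less reads the file a page at a time instead of loading all of it.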

Running programs on Front End/Login Node(s) can cause serious problems: drastically reduced performance for everyone, a corrupted file system, missing features, users being unable to log in, and the cluster shutting down. In the event that Information Technology Services has to rebuild the cluster from scratch to repair the damage, it will take a considerable amount of time, and researchers using this cluster might not be able to meet their project deadlines.

qlogin: Compiling Programs & Minimal Analysis

When there is a need to compile programs and/or utilities, and/or run scripts/tools/utilities to create the input files and/or analyze the output files, users are expected to do so on one of the compute nodes. Using the following command

qlogin

will not only grant legitimate access to an otherwise inaccessible compute (or tile) node but also protect the user from potential violations, as described in the Running Programs on Front End/Login Node(s) section.

A standard set of compilers, libraries, tools and utilities will always be available. However, depending on the availability of resources, the number of concurrent qlogin sessions (2), the duration of each such session (4 hours) and/or the list of accessible software suites might be limited.

Disk Usage

The cluster is not backed up, and users are responsible for their data. Owing to limited storage space, the cluster is set up such that every user, without exception, will get an email once the respective HOME folder exceeds a pre-set limit (and, when applicable, once the respective RESEARCH folder exceeds a pre-set limit). The user will continue to get such emails a few times a day until disk usage drops below the limit.

After the 12th consecutive notification, the user's account will be locked, and re-enabling it will require a request from the user's advisor.
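
To find out what is consuming space before (or after) such a notification arrives, the standard du utility can be used. The sketch below is illustrative: the helper names usage_report and largest_items are hypothetical, /research/john is assumed to be the RESEARCH folder of the example user, and the pre-set limits themselves are not stated here.

```shell
# usage_report DIR...: print the total size of each existing folder.
usage_report() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      du -sh "$dir"
    else
      echo "skipping $dir (not a folder)" >&2
    fi
  done
}

# largest_items DIR [N]: the N largest entries (default 5) directly inside
# DIR, with sizes in kilobytes, largest first. Hidden entries are not listed.
largest_items() {
  du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Example:
#   usage_report "$HOME" /research/john
#   largest_items /research/john 10
```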

Useful Commands

 
Built-in

Command              Description
qstat -j JOB_ID      More information about a running/waiting job
qhold JOB_ID         Hold/Pause a running/waiting job
qrls JOB_ID          Release a job from hold
qdel JOB_ID          Delete a job from the queue
qlogin               Log into a compute/tile node
qacct -j JOB_ID      Accounting information for a job that has been completed
qacct -o USERNAME    Accounting information for a given user
qhost -u USERNAME    All jobs & queues used by a given user