Motivation

On TACC computers, you get a processor allocation through the LSF queuing system, and each allocation requires a new trip through the queue. Queuesplit lets you divide an allocation into smaller groups of nodes and run a separate job on each group, so a single allocation can serve multiple jobs. If you have more jobs than groups, it acts as an internal queuing system.

It is designed so that the input to queuesplit is the same job files you would normally submit to bsub, which makes it very easy to use.

Example

#BSUB -n 16
#BSUB -J qstest
#BSUB -o qstest.o%J
#BSUB -e qstest.e%J
#BSUB -q development
#BSUB -W 5:00
#BSUB -N
#QSplit --divisions 4
#QSplit --wallclocklimit 4:40

QUEUESPLIT=/home/utexas/cm/rdarst/research/queuesplit/queuesplit.py

python $QUEUESPLIT run <<EOF
bash temperature00/run00/queueScript.bsub ;; chdir=. stdout=temperature00/run00/STDOUT
bash temperature01/run00/queueScript.bsub ;; chdir=. stdout=temperature01/run00/STDOUT
bash temperature02/run00/queueScript.bsub ;; chdir=. stdout=temperature02/run00/STDOUT
bash temperature03/run00/queueScript.bsub ;; chdir=. stdout=temperature03/run00/STDOUT

EOF

Explanation of example

This begins by looking like a normal queue submission script: note the usual bsub options at the top. We have added the Queuesplit (#QSplit) options right below them, which keeps the overall configuration in one place.

Queuesplit works within what LSF provides (the bsub options); this is where we specify how many total processors we have to work with. The LSF stdout and stderr options still apply: they set the default stdout/stderr paths for jobs, so output from all jobs will be intermixed there. (stdout/stderr can also be redirected per individual job; details below.)

Relevant options are:

--divisions

The total number of allocated processors is divided into this many groups. The number of divisions should be chosen carefully. For example, if you allocate 12 processors, dividing them into five groups would be silly; dividing them into four groups would not be smart either, because three processors per group means jobs get split across nodes in a nonideal way. If divisions is 1, each job runs on all allocated processors, one after another.
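The arithmetic above can be sketched in a few lines of Python. (This is an illustration only; the function name group_size and the warning behavior are assumptions, not part of queuesplit's actual code.)

```python
def group_size(total_procs, divisions):
    """Processors per group, warning when the split leaves processors idle.

    Illustrative sketch only -- not queuesplit's real implementation.
    """
    size, remainder = divmod(total_procs, divisions)
    if remainder:
        print("warning: %d processor(s) left over and idle" % remainder)
    return size

print(group_size(16, 4))  # 16 processors in 4 groups: 4 per group, none idle
print(group_size(12, 5))  # 12 processors in 5 groups: 2 per group, 2 idle
```

The second call shows why 12 processors in five groups is a poor choice: two processors sit idle for the life of the allocation.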

--wallclocklimit

Individual jobs (not the entire Queuesplit process) are killed after this amount of time. (NOTE: not implemented yet.)

Then we specify the path to queuesplit (which is still subject to change). Since it is not marked executable, we use python to run it. Queuesplit takes the single command-line argument "run" to indicate that it should actually begin running.

Then we begin a "here document": everything up to the closing EOF is sent to queuesplit on stdin. This is where the main configuration happens.

There is one line per job; each line specifies one job to run. For example,

bash temperature03/run00/queueScript.bsub ;; chdir=. stdout=temperature03/run00/STDOUT

Everything before the double semicolon is the command to run; the double semicolon is required. The remaining options are given as option=arg, with no spaces within an argument; separate options are separated by spaces. The options are documented below.

chdir

If given, chdir to this directory before anything else is done, including before any custom stdout/stderr files are opened.

stdout

If given, stdout for this child job (including some extra status information) is redirected to this file.

stderr

If given, stderr for this child job is redirected to this file. (NOTE: not implemented yet)
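The job-line format described above can be sketched as a small parser. (This is a hypothetical illustration of the format, not queuesplit's actual parsing code; the function name parse_job_line is an assumption.)

```python
def parse_job_line(line):
    """Split a queuesplit job line into its command and option dict.

    Illustrative sketch of the documented format, not the real parser.
    """
    # Everything before ";;" is the command; the rest is option=arg pairs.
    command, _, rest = line.partition(";;")
    options = {}
    for token in rest.split():
        key, _, value = token.partition("=")
        options[key] = value
    return command.strip(), options

cmd, opts = parse_job_line(
    "bash temperature03/run00/queueScript.bsub ;; "
    "chdir=. stdout=temperature03/run00/STDOUT")
print(cmd)   # bash temperature03/run00/queueScript.bsub
print(opts)  # {'chdir': '.', 'stdout': 'temperature03/run00/STDOUT'}
```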

Extra specification

Rossky/queuesplit (last edited 2008-03-10 01:39:23 by localhost)