User's Guide¶
In the following we will refer to a tool called nohup. This name is misleading, as we do not use the actual nohup tool, for several reasons, the most significant being that nohup sometimes hangs up anyway. Instead we start a background job in a subshell, i.e.
$ (./job_name parameters & )
This essentially has the desired effect, namely that the program does not hang up when the shell is closed. Throughout this document we will refer to this as nohup rather than as starting a background job in a subshell. For more information on nohup alternatives, see What to do when nohup hangs up anyway.
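In practice you will often also want to detach the job's output from the terminal. A common variant of the same idiom (a generic shell pattern, not specific to BatchQ) redirects stdout and stderr to a log file:
$ (./job_name parameters > job_name.log 2>&1 &)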
Quick start¶
This section is intended to get you started quickly with BatchQ; consequently, few or no explanations of the commands and scripts will be given. If you would like the full explanation of how BatchQ works, skip this section. If you choose to read the quick start, read all of it, regardless of whether you prefer Python over bash.
Command line¶
First you need to create configurations for the machines you want to access. This is not necessary, but it is convenient (more details are given in the following sections). Open bash and type
$ q configuration my_server_configuration --working_directory="Submission" --command="./script" --input_directory="." --port=22 --server="server.address.com" --global
$ q configuration your_name --username="your_default_user" --global
In the above, change server.address.com to the address of the server you wish to access. Also, change the username in the second line to your default username. Next, create a new directory MyFirstSubmission and download the script sleepy:
$ mkdir MyFirstSubmission
$ cd MyFirstSubmission
$ wget https://raw.github.com/troelsfr/BatchQ/master/scripts/sleepy
$ chmod +x sleepy
The job sleepy sleeps for 100 seconds, and every second it echoes "Hello world". Submit it to server.address.com using the command:
$ q [batch_system] job@my_server_configuration,your_name --command="./sleepy"
Here batch_system should be either nohup, ssh-nohup or lsf.
Check the status of the job with
$ q [batch_system] job@my_server_configuration,your_name --command="./sleepy"
Job is running.
And after 100s you get
$ q [batch_system] job@my_server_configuration,your_name --command="./sleepy"
Job has finished.
Retrieving results.
At this point new files should appear in your current directory:
$ ls
sleepy sleepy.data
To see the logs of the submission, type
$ q [batch_system] stdout@my_server_configuration,your_name --command="./sleepy"
This is the sleepy stdout.
$ q [batch_system] stderr@my_server_configuration,your_name --command="./sleepy"
This is the sleepy stderr.
$ q [batch_system] log@my_server_configuration,your_name --command="./sleepy"
(...)
The last command will differ depending on which submission system you use. Finally, we clean up on the server:
$ q [batch_system] delete@my_server_configuration,your_name --command="./sleepy"
True
Congratulations! You have submitted your first job using the command line tool.
Python¶
Next, open an editor and enter the following Python code:
from batchq.queues import LSFBSub
from batchq.core.batch import DescriptorQ

class ServerDescriptor(DescriptorQ):
    queue = LSFBSub
    username = "default_user"
    server = "server.address.com"
    port = 22
    prior = "module load open_mpi goto2 python hdf5 cmake mkl\nexport PATH=$PATH:$HOME/opt/alps/bin"
    working_directory = "Submission"

desc1 = ServerDescriptor(username="tronnow", command="./sleepy 1", input_directory=".", output_directory=".")
desc2 = ServerDescriptor(desc1, command="./sleepy 2", input_directory=".", output_directory=".")

print "Handling job 1"
desc1.job()
print "Handling job 2"
desc2.job()
and save it as job_submitter.py in MyFirstSubmission. Note that in the above code we pass the first descriptor when constructing the second, in order to reuse the queue defined for desc1 in desc2. Go back to the shell and type:
$ python job_submitter.py
Rerun the code to get the status of the jobs and to pull finished jobs. Your second submission was done with Python.
Note
If you choose the same input and output directory, you will run into problems when running this script several times, as the hash sum of the input directory changes once the results have been pulled. This means that you may accidentally resubmit a finished job.
The above can be overcome either by separating input_directory and output_directory, or by setting the submission id manually:
from batchq.queues import LSFBSub
from batchq.core.batch import DescriptorQ

class ServerDescriptor(DescriptorQ):
    queue = LSFBSub
    username = "default_user"
    server = "server.address.com"
    port = 22
    prior = "module load open_mpi goto2 python hdf5 cmake mkl\nexport PATH=$PATH:$HOME/opt/alps/bin"
    working_directory = "Submission"

desc1 = ServerDescriptor(username="tronnow", command="./sleepy 1", input_directory=".", output_directory=".", overwrite_submission_id="simu1")
desc2 = ServerDescriptor(desc1, command="./sleepy 2", input_directory=".", output_directory=".", overwrite_submission_id="simu2")

print "Handling job 1"
desc1.job()
print "Handling job 2"
desc2.job()
To shorten the above, you may use your previously defined configurations:
from batchq.queues import LSFBSub
from batchq.core.batch import DescriptorQ, load_queue
q = load_queue(LSFBSub, "my_server_configuration,your_name")
desc1 = DescriptorQ(q, command="./sleepy 1", input_directory=".", output_directory=".", overwrite_submission_id="simu1")
desc2 = DescriptorQ(q, command="./sleepy 2", input_directory=".", output_directory=".", overwrite_submission_id="simu2")
print "Handling job 1"
desc1.job()
print "Handling job 2"
desc2.job()
We can now generalise this to arbitrarily many jobs:
from batchq.queues import LSFBSub
from batchq.core.batch import DescriptorQ, load_queue

q = load_queue(LSFBSub, "my_server_configuration,your_name")
for i in range(1, 10):
    desc = DescriptorQ(q, command="./sleepy %d" % i, input_directory=".", output_directory=".", overwrite_submission_id="simu%d" % i)
    print "Handling job %d" % i
    desc.job()
If we know that the output files do not overwrite each other, it is only necessary to keep one copy of the input folder. This can be done by specifying a subdirectory:
from batchq.queues import LSFBSub
from batchq.core.batch import DescriptorQ, load_queue

q = load_queue(LSFBSub, "my_server_configuration,your_name")
for i in range(1, 15):
    desc = DescriptorQ(q, command="./sleepy %d" % i, input_directory=".", output_directory=".", overwrite_submission_id="simu%d" % i, subdirectory="mysimulation")
    print "Handling job %d" % i
    desc.job()
Using the command line tool¶
The following section covers the usage of BatchQ from the command line.
Available modules¶
The modules available to BatchQ will vary from system to system depending on whether custom modules have been installed. Modules are divided into four categories: functions, queues, pipelines and templates. The general syntax of the Q command is:
$ q [function/queue/template] [arguments]
The following functions are available both through the console interface and from Python. They are standard modules included in BatchQ that provide information about other modules.
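For example, the quick-start commands seen earlier combine a queue module (lsf), a function (job) and stored configurations in exactly this syntax:
$ q lsf job@my_server_configuration,your_name --command="./sleepy"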
Submitting jobs¶
The BatchQ command line interface provides you with two predefined submission modules: nohup and lsf. nohup is available on essentially every system with a Unix-like shell, whereas lsf requires the LSF batch system to be installed on the server. To submit a job, type:
$ cd /path/to/input/directory
$ q lsf submit -i --username=user --working_directory="Submission" --command="./script" --input_directory="." --port=22 --server="server.address.com"
The above command will attempt to log on to server.address.com using the username user through port 22. It then creates a working directory called Submission in the entrance folder (usually your home directory on the server) and transfers all the files from your input_directory to this folder. The command is then submitted to lsf and the SSH connection is terminated.
Once you have automated the submission process, you will want to store the configuration parameters in a file in order to shorten the commands needed to operate on your submissions. Using the example from before, this can be done as
$ q configuration brutus -i --username=user --working_directory="Submission" --command="./script" --input_directory="." --port=22 --server="server.address.com"
The above code creates a configuration named “brutus” which contains the instructions for submitting your job on “server.address.com”. Having created a configuration file you can now submit jobs and check status with
$ q lsf submit@brutus
True
$ q lsf pid@brutus
12452
This keeps things short and simple. You will need to create a configuration file for each server to which you want to submit your job. If for one reason or another you temporarily want to change parameters of your configuration, say the working_directory, this can be done by adding a long parameter:
$ q lsf submit@brutus --working_directory="Submission2"
True
You can also configure the BatchQ command line tool with several input configurations at once by separating them with commas, as in the examples below.
Checking the status of a job, retrieving results and deleting the working directory of a simulation are now equally simple:
$ q lsf status@brutus
DONE
$ q lsf recv@brutus
True
$ q lsf delete@brutus
True
The retrieve command only retrieves files that do not exist locally or that differ from those in the input directory.
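Conceptually, the rule can be sketched as follows; this is an illustration of the behaviour, not BatchQ's actual implementation:
import hashlib
import os

def needs_fetch(local_path, remote_md5):
    # Fetch a remote file only if it is missing locally or its
    # content differs from the remote copy.
    if not os.path.exists(local_path):
        return True
    with open(local_path, 'rb') as handle:
        local_md5 = hashlib.md5(handle.read()).hexdigest()
    return local_md5 != remote_md5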
Finally, the Q system implements fully automated job submission, meaning that the system will try to determine the state of your job and take action accordingly. For fast job submission and status checking, write:
$ q lsf job@brutus,config
Uploading input directory.
Submitted job on brutus.ethz.ch
$ q lsf job@brutus,config
Job pending on brutus.ethz.ch
$ q lsf job@brutus,config
Job running on brutus.ethz.ch
$ q lsf job@brutus,config
Job finished on brutus.ethz.ch
Retrieving results.
Do you want to remove the directory 'Submission2' on brutus.ethz.ch (Y/N)? Y
Deleted Submission2 on brutus.ethz.ch
You can equally submit the job on your local machine by using nohup instead of lsf.
A few words on job hashing¶
When submitting a job, BatchQ generates a hash for your job. The hash includes the following:
- An MD5/SHA sum of the input directory
- The name of the server to which the job is submitted
- The submitted command (including parameters)
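Conceptually, the hash is computed along the following lines; this sketch is for illustration only and is not BatchQ's actual implementation:
import hashlib
import os

def job_hash(input_directory, server, command):
    # Illustrative sketch: combine a checksum of the input directory
    # contents with the server name and the submitted command.
    md5 = hashlib.md5()
    for root, dirs, files in os.walk(input_directory):
        dirs.sort()  # walk in a deterministic order
        for name in sorted(files):
            with open(os.path.join(root, name), 'rb') as handle:
                md5.update(handle.read())
    md5.update(server)
    md5.update(command)
    return md5.hexdigest()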
It is not recommended, though possible, to overwrite the hash key. This can be done by adding --overwrite_submission_id="your_custom_id". This can be useful in some cases. For instance, you might want to work on your source code during development; this would change the MD5 of your input directory, and BatchQ would be incapable of recognising your job submission. BatchQ is shipped with a configuration for debugging which can be invoked by
$ q lsf submit@brutus,debug
The debug configuration is only suitable for debugging, as the submission id is debug.
Another scenario where you may want to change the hashing routine is the case where you store your output data in your input directory. Submitting several jobs and pulling results will over time change the hash of the input directory. To overcome this issue, add eio (short for equal input/output directory) to your configuration:
$ q lsf submit@brutus,eio
The eio flag overwrites your output_directory with the value of your input_directory and changes the hashing routine to include only the command and server name.
Example: Submitting ALPS jobs using nohup¶
Example: Submitting ALPS jobs using LSF¶
Example: Submitting multiple jobs from one submission directory¶
In some cases one may not want to copy the same directory to the server several times, as this may take up vast amounts of space. If the simulation output depends only on the command line parameters (as is the case for ALPS spinmc), one can use the eio configuration to submit several commands reusing the same submission directory:
$ q lsf job@brutus,eio --working_directory="Submission" --command="spinmc TODO1"
$ q lsf job@brutus,eio --working_directory="Submission" --command="spinmc TODO2"
$ q lsf job@brutus,eio --working_directory="Submission" --command="spinmc TODO3"
Using Python¶
Submitting jobs¶
Retrieving results¶
Q descriptors, Q holders and Q functions¶
The BatchQ user API is based on three main classes: Q descriptors, Q holders (queues) and Q functions. Usually Q functions are members of instances of Q holder classes, while Q descriptors are reference objects used to ensure that you do not open more SSH connections than necessary. A descriptor links a set of input configuration parameters to a given queue. An example could be:
class ServerDescriptor(DescriptorQ):
    queue = LSFBSub
    username = "user"
    server = "server.address.com"
    port = 22
    options = ""
    prior = "module load open_mpi goto2 python hdf5 cmake mkl\nexport PATH=$PATH:$HOME/opt/alps/bin"
    post = ""
    working_directory = "Submission"
The descriptor ServerDescriptor implements all Q functions and properties defined in the class LSFBSub. However, the descriptor ensures that all queue parameters are set according to those given in the descriptor definition before executing a command on the queue. Therefore, if you have two descriptor instances that share a queue,
queue = LSFBSub()
desc1 = DescriptorQ(queue)
desc1.update_configuration(working_directory = "Submission1")
desc2 = DescriptorQ(queue)
desc2.update_configuration(working_directory = "Submission2")
you are ensured that you are working in the correct directory by using the descriptor instead of the queue directly.
Notice that in order to update the queue properties through the descriptor, one needs to use update_configuration rather than assigning to descriptor.property (i.e. desc2.working_directory in the above example). The reason for this is that any method or property of a descriptor object is a "reference" to queue methods and properties. The only methods that are not redirected are those implemented by the descriptor itself:
desc1 = ServerDescriptor()
desc2 = ServerDescriptor(desc1)
desc2.update_configuration(working_directory = "Submission2")
In general, when copying descriptors, make sure to do a shallow copy as you do not want to make a deep copy of the queue object.
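As a minimal illustration of the distinction (assuming standard Python copy semantics apply to descriptor objects; BatchQ's own idiom is to pass an existing descriptor to the constructor, as above):
import copy

desc3 = copy.copy(desc1)        # shallow copy: desc3 shares desc1's queue and SSH connection
# desc4 = copy.deepcopy(desc1)  # avoid: this would duplicate the queue object and its connection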
Example: Submitting ALPS jobs using nohup¶
The BatchQ package comes with a preprogrammed package for ALPS. This enables easy and fast scripting for submitting background jobs on local and remote machines. Our starting point is the Spin MC example from the ALPS documentation:
import pyalps
import matplotlib.pyplot as plt
import pyalps.plot
import sys
print "Starting"
parms = []
for t in [1.5, 2, 2.5]:
    parms.append(
        {
            'LATTICE' : "square lattice",
            'T' : t,
            'J' : 1,
            'THERMALIZATION' : 1000,
            'SWEEPS' : 100000,
            'UPDATE' : "cluster",
            'MODEL' : "Ising",
            'L' : 8
        }
    )
input_file = pyalps.writeInputFiles('parm1',parms)
desc = pyalps.runApplication('spinmc',input_file,Tmin=5,writexml=True)
result_files = pyalps.getResultFiles(prefix='parm1')
print result_files
print pyalps.loadObservableList(result_files)
data = pyalps.loadMeasurements(result_files,['|Magnetization|','Magnetization^2'])
print data
plotdata = pyalps.collectXY(data,'T','|Magnetization|')
plt.figure()
pyalps.plot.plot(plotdata)
plt.xlim(0,3)
plt.ylim(0,1)
plt.title('Ising model')
plt.show()
print pyalps.plot.convertToText(plotdata)
print pyalps.plot.makeGracePlot(plotdata)
print pyalps.plot.makeGnuplotPlot(plotdata)
binder = pyalps.DataSet()
binder.props = pyalps.dict_intersect([d[0].props for d in data])
binder.x = [d[0].props['T'] for d in data]
binder.y = [d[1].y[0]/(d[0].y[0]*d[0].y[0]) for d in data]
print binder
plt.figure()
pyalps.plot.plot(binder)
plt.xlabel('T')
plt.ylabel('Binder cumulant')
plt.show()
Introducing a few small changes, the script now runs using BatchQ for submission:
from batchq.contrib.alps import runApplicationBackground, LSFBSub, DescriptorQ
import pyalps
import matplotlib.pyplot as plt
import pyalps.plot
import sys
parms = []
for t in [1.5, 2, 2.5]:
    parms.append(
        {
            'LATTICE' : "square lattice",
            'T' : t,
            'J' : 1,
            'THERMALIZATION' : 1000,
            'SWEEPS' : 100000,
            'UPDATE' : "cluster",
            'MODEL' : "Ising",
            'L' : 8
        }
    )
input_file = pyalps.writeInputFiles('parm1',parms)
class Brutus(DescriptorQ):
    queue = LSFBSub
    username = "tronnow"
    server = "brutus.ethz.ch"
    port = 22
    options = ""
    prior = "module load open_mpi goto2 python hdf5 cmake mkl\nexport PATH=$PATH:$HOME/opt/alps/bin"
    post = ""
    working_directory = "Submission"
desc = runApplicationBackground('spinmc',input_file,Tmin=5,writexml=True, descriptor = Brutus(), force_resubmit = False )
if not desc.finished():
    print "Your simulations have not yet ended, please run this command again later."
    sys.exit(0)  # results are not available yet
if desc.failed():
    print "Your submission has failed"
    sys.exit(-1)
result_files = pyalps.getResultFiles(prefix='parm1')
print result_files
print pyalps.loadObservableList(result_files)
data = pyalps.loadMeasurements(result_files,['|Magnetization|','Magnetization^2'])
print data
plotdata = pyalps.collectXY(data,'T','|Magnetization|')
plt.figure()
pyalps.plot.plot(plotdata)
plt.xlim(0,3)
plt.ylim(0,1)
plt.title('Ising model')
plt.show()
print pyalps.plot.convertToText(plotdata)
print pyalps.plot.makeGracePlot(plotdata)
print pyalps.plot.makeGnuplotPlot(plotdata)
binder = pyalps.DataSet()
binder.props = pyalps.dict_intersect([d[0].props for d in data])
binder.x = [d[0].props['T'] for d in data]
binder.y = [d[1].y[0]/(d[0].y[0]*d[0].y[0]) for d in data]
print binder
plt.figure()
pyalps.plot.plot(binder)
plt.xlabel('T')
plt.ylabel('Binder cumulant')
plt.show()
Executing this code submits the program as a background process on the local machine. It can easily be extended to support SSH and LSF by changing the queue object:
# TODO: Give example
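A minimal sketch of such a change, assuming a nohup-over-SSH queue class is available (the class name SSHNoHUP is an assumption for illustration; consult batchq.queues for the queue classes actually shipped with your BatchQ installation):
# Sketch only: changing the submission backend means changing the queue class.
# SSHNoHUP is a hypothetical class name; check batchq.queues for the
# queue classes available in your installation.
from batchq.queues import SSHNoHUP

class BrutusNoHUP(Brutus):
    queue = SSHNoHUP  # same configuration, different submission backend

desc = runApplicationBackground('spinmc', input_file, Tmin=5, writexml=True,
                                descriptor=BrutusNoHUP(), force_resubmit=False)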