clusterjob package¶
Abstraction for job scripts and cluster schedulers, for a variety of scheduling backends (e.g., SLURM, PBS/TORQUE, …)
Note
To see debug messages, set:

    import logging
    logging.basicConfig(level=logging.DEBUG)
class clusterjob.JobScript(body, jobname, aux_scripts=None, **kwargs)[source]¶
Bases: object

Encapsulation of a job script.
Parameters:
- body (str) – Body template for the jobscript, as a multiline string. Will be stored in the body instance attribute, and processed by the render_script() method before execution.
- jobname (str) – Name of the job. Will be stored in the resources['jobname'] instance attribute.
- aux_scripts (dict(str=>str), optional) – Dictionary of auxiliary scripts, to be stored in the aux_scripts attribute.
Keyword arguments (kwargs) that correspond to known attributes set the value of that (instance) attribute. Any other keyword arguments are stored as entries in the resources attribute, to be processed by the backend. The following keyword arguments set resource specifications that should be handled by any backend (or the backend should raise a ResourcesNotSupportedError).

Keyword Arguments:
- queue (str) – Name of the queue/partition to which to submit the job.
- time (str) – Maximum runtime. See time_to_seconds() for acceptable formats.
- nodes (int) – Number of nodes on which to run. Depending on the configuration of the scheduler, if the number of used cores per node is smaller than the number of CPU cores on a physical node, multiple jobs may or may not be placed on the same physical node.
- ppn (int) – (MPI) processes to run per node. The total number of MPI processes will be nodes*ppn. Note that ppn is not the same as the ppn keyword in PBS/TORQUE (which refers to the total number of CPU cores used per node).
- threads (int) – Number of OpenMP threads, or subprocesses, spawned per process. The total number of CPU cores used per node will be ppn*threads.
- mem (int) – Required memory per node, in MB.
- stdout (str) – Name of the file to which to write the job's stdout.
- stderr (str) – Name of the file to which to write the job's stderr.
The above list constitutes the simplified resource model supported by the clusterjob package, as a lowest common denominator of various scheduling systems. Other keyword arguments can be used, but they will be backend-specific, and may or may not be handled correctly. In the default SLURM backend, any keyword arguments not in the above list are transformed directly into arguments for sbatch, where single-letter argument names are prefixed with -, and multi-letter argument names with --. An argument with a boolean value is passed without any value iff the value is True:

    contiguous=True          -> --contiguous
    dependency='after:12454' -> --dependency=after:12454
    F='nodefile.txt'         -> -F nodefile.txt
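The mapping rule just described can be sketched as a small helper function. This is an illustrative stand-in, not clusterjob's actual implementation, and the function name is hypothetical:

```python
def kwargs_to_sbatch_options(kwargs):
    """Translate extra resource keywords into sbatch command line options.

    Single-letter names get a single dash, multi-letter names a double
    dash; a value of True produces a bare flag with no value attached.
    """
    options = []
    for key, value in kwargs.items():
        if len(key) == 1:
            if value is True:
                options.append('-' + key)
            else:
                # short options take their value as a separate argument
                options.extend(['-' + key, str(value)])
        else:
            if value is True:
                options.append('--' + key)
            else:
                # long options use the --key=value form
                options.append('--%s=%s' % (key, value))
    return options
```

For example, `kwargs_to_sbatch_options({'dependency': 'after:12454'})` yields `['--dependency=after:12454']`.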
All backends are encouraged to implement similar behavior, to handle arbitrary resource requirements. Note that an alternative (and preferred) way of setting properties (especially backend-specific ones) is through the read_settings() method.

Class Attributes
The following class attributes cannot be shadowed by instance attributes of the same name (attempting to do so raises an AttributeError):

Class Attributes:
- cache_folder (str or None) – Local folder in which to cache the AsyncResult instances resulting from job submission. If None (default), caching is disabled.
- cache_prefix (str) – Prefix for cache filenames. If caching is enabled, jobs will be stored inside cache_folder in a file cache_prefix.`cache_id`.cache, where cache_id is defined in the submit method.
- resources (OrderedDict) – Dictionary of default resource requirements. Modifying the resources class attribute affects the default resources for all future instantiations.
Note
The preferred way to set these class attributes is through the read_defaults() class method.

Class/Instance Attributes
The following are class attributes, with the expectation that they may be shadowed by instance attributes of the same name.

Attributes:
- backend (str) – Name of the backend; must be an element in JobScript.backends. That is, if backend does not refer to one of the default backends, the register_backend() class method must be used to register the backend before any job may use it. Defaults to 'slurm'.
- shell (str) – Shell that is used to execute the job script. Defaults to /bin/bash.
- remote (str or None) – Remote server on which to execute submit commands. If None (default), submit locally.
- rootdir (str) – Root directory for workdir, locally or remote. Defaults to '.', i.e., the current working directory. The rootdir is guaranteed not to have a trailing slash.
- workdir (str) – Work directory (local or remote) in which the job script file will be placed, and from which the submission command will be called. Relative to rootdir. Defaults to '.' (current working directory). The workdir is guaranteed not to have a trailing slash.
- filename (str or None) – Name of the file to which the job script will be written (inside rootdir/workdir). If None (default), the filename will be set from the job name (resources['jobname'] attribute) together with a backend-specific file extension.
- prologue (str) – Multiline shell script that will be executed locally in the current working directory before submitting the job. Before running, the script will be rendered using the render_script() method.
- epilogue (str) – Multiline shell script that will be executed locally in the current working directory the first time that the job is known to have finished. It will be rendered using the render_script() method at the time that the job is submitted. Its execution will be handled by the AsyncResult object resulting from the job submission. The main purpose of the epilogue script is to move data from a remote cluster upon completion of the job.
- max_sleep_interval (int) – Upper limit for the number of seconds to sleep between polling the status of a submitted job.
- ssh (str) – The executable to use for ssh. If not a full path, must be in the $PATH.
- scp (str) – The executable to use for scp. If not a full path, must be in the $PATH.
This makes it possible to define defaults for all jobs by setting the class attribute, and to override them for specific jobs by setting the instance attribute. For example,
>>> jobscript = JobScript(body='echo "Hello"', jobname='test')
>>> jobscript.shell = '/bin/sh'
sets the shell for only this specific jobscript, whereas
>>> JobScript.shell = '/bin/sh'
sets the class attribute, and thus the default shell for all JobScript instances, both future and existing:
>>> job1 = JobScript(body='echo "Hello"', jobname='test1')
>>> job2 = JobScript(body='echo "Hello"', jobname='test2')
>>> assert job1.shell == job2.shell == '/bin/sh'    # class attribute
>>> JobScript.shell = '/bin/bash'
>>> assert job1.shell == job2.shell == '/bin/bash'  # class attribute
>>> job1.shell = '/bin/sh'
>>> assert job1.shell == '/bin/sh'    # instance attribute
>>> assert job2.shell == '/bin/bash'  # class attribute
Note
The preferred way to set these attributes as class attributes (i.e., to provide defaults for any instance) is through the read_defaults() class method. To set them as instance attributes, or to set values in the resources instance attribute defined below, the read_settings() method should be used.

A common purpose of the prologue and epilogue scripts is to move data to a remote cluster, e.g. via the prologue commands:
    ssh {remote} 'mkdir -p {rootdir}/{workdir}'
    rsync -av {workdir}/ {remote}:{rootdir}/{workdir}
Instance Attributes
The following attributes are local to any JobScript instance, and are set automatically during instantiation.

Attributes:
- body (str) – Multiline string of shell commands. Should not contain backend-specific resource headers. Before submission, it will be rendered using the render_script() method.
- resources (dict) – Dictionary of submission options describing resource requirements. Set on instantiation, based on the default values in the resources class attribute and the keyword arguments passed to the instantiator.
- aux_scripts (dict(str=>str)) – Dictionary mapping filenames to script bodies for any auxiliary scripts. As the main job script (body) is written during submission, any script defined in this dictionary will also be rendered using the render_script() method and written to the same folder as the main script. While generally not needed, auxiliary scripts may be useful in structuring a large job.
Example
>>> body = r'''
... echo "####################################################"
... echo "Job id: $CLUSTERJOB_ID"
... echo "Job name: $CLUSTERJOB_WORKDIR"
... echo "Job started on" `hostname` `date`
... echo "Current directory:" `pwd`
... echo "####################################################"
...
... echo "####################################################"
... echo "Full Environment:"
... printenv
... echo "####################################################"
...
... sleep 90
...
... echo "Job Finished: " `date`
... exit 0
... '''
>>> jobscript = JobScript(body, backend='slurm', jobname='printenv',
...     queue='test', time='00:05:00', nodes=1, threads=1, mem=100,
...     stdout='printenv.out', stderr='printenv.err')
>>> print(jobscript)
#!/bin/bash
#SBATCH --job-name=printenv
#SBATCH --mem=100
#SBATCH --nodes=1
#SBATCH --partition=test
#SBATCH --error=printenv.err
#SBATCH --output=printenv.out
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
echo "####################################################"
echo "Job id: $SLURM_JOB_ID"
echo "Job name: $SLURM_SUBMIT_DIR"
echo "Job started on" `hostname` `date`
echo "Current directory:" `pwd`
echo "####################################################"

echo "####################################################"
echo "Full Environment:"
printenv
echo "####################################################"

sleep 90

echo "Job Finished: " `date`
exit 0
Note
The fact that arbitrary attributes can be added to an existing object can be exploited to define arbitrary template variables in the job script:
>>> body = r'''
... echo {myvar}
... '''
>>> jobscript = JobScript(body, jobname='myvar_test')
>>> jobscript.myvar = 'Hello'
>>> print(jobscript)
#!/bin/bash
#SBATCH --job-name=myvar_test
echo Hello
classmethod register_backend(backend, name=None)[source]¶
Register a new backend.

Parameters:
- backend (clusterjob.backends.ClusterjobBackend) – The backend to register. After registration, the backend attribute of a JobScript instance may then refer to the backend by name.
- name (str) – The name under which to register the backend. If not given, use the name defined in the backend's name attribute. This attribute will be updated with name, if given, to ensure that the name under which the backend is registered and the backend's internal name attribute are the same.

Raises:
- TypeError – if backend is not an instance of ClusterjobBackend, or does not implement the backend interface correctly.
- AttributeError – if backend does not have the attributes name and extension.
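The name-synchronization behavior described above can be sketched with a minimal registry. This is illustrative only; Registry and DummyBackend are hypothetical stand-ins, not clusterjob classes:

```python
class Registry:
    """Hypothetical sketch of a backend registry."""
    _backends = {}

    @classmethod
    def register_backend(cls, backend, name=None):
        # refuse backends that lack the required attributes
        for attr in ('name', 'extension'):
            if not hasattr(backend, attr):
                raise AttributeError("backend must define %r" % attr)
        if name is None:
            name = backend.name   # fall back to the backend's own name
        else:
            backend.name = name   # keep the internal name in sync
        cls._backends[name] = backend


class DummyBackend:
    """Hypothetical stand-in for a ClusterjobBackend implementation."""
    name = 'dummy'
    extension = 'sh'
```

After `Registry.register_backend(b, name='mydummy')`, both the registry key and `b.name` are 'mydummy'.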
classmethod clear_cache_folder()[source]¶
Remove all files in the cache_folder.
resources = {}¶

backends¶
List of names of registered backends.

__setattr__(name, value)[source]¶
Set attributes while preventing the shadowing of the "genuine" class attributes, by raising an AttributeError. Perform some checks on the value, raising a ValueError if necessary.
classmethod read_defaults(filename=None)[source]¶
Set class attributes from the INI file with the given file name.

The file must be in the format specified in https://docs.python.org/3.5/library/configparser.html#supported-ini-file-structure with the default ConfigParser settings, except that all keys are case sensitive. It must contain one or both of the sections "Attributes" and "Resources". The key-value pairs in the "Attributes" section are set as class attributes, whereas the key-value pairs in the "Resources" section are set as keys and values in the resources class attribute.

All keys in the "Attributes" section must start with a letter, and must consist only of letters, numbers, and underscores. Keys in the "Resources" section can be arbitrary strings. The key names 'resources' and 'backends' may not be used. An example of a valid config file is:
    [Attributes]
    remote = login.cluster.edu
    prologue = ssh {remote} 'mkdir -p {rootdir}/{workdir}'
        rsync -av {workdir}/ {remote}:{rootdir}/{workdir}
    epilogue = rsync -av {remote}:{rootdir}/{workdir}/ {workdir}
    rootdir = ~/jobs/
    # the following is a new attribute
    text = Hello World

    [Resources]
    queue = exec
    nodes = 1
    threads = 1
    mem = 10
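Parsing such a file with Python's standard configparser requires overriding optionxform, since ConfigParser lower-cases keys by default while the format above is case sensitive. A minimal sketch (the inline INI text is a shortened version of the example above):

```python
import configparser

parser = configparser.ConfigParser()
parser.optionxform = str  # keep keys case sensitive (default lower-cases them)
parser.read_string("""\
[Attributes]
remote = login.cluster.edu
rootdir = ~/jobs/

[Resources]
queue = exec
nodes = 1
""")
# "Attributes" pairs would become attributes; "Resources" pairs would
# become entries in the resources dictionary
attributes = dict(parser['Attributes'])
resources = dict(parser['Resources'])
```

Note that configparser returns all values as strings, so numeric resources such as nodes arrive as '1', not 1.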
If no filename is given, reset all class attributes to their initial value, and delete any attributes that do not exist by default. This restores the JobScript class to a pristine state.
read_settings(filename)[source]¶
Set instance attributes from the INI file with the given file name.

This method behaves exactly like the read_defaults() class method, but instead of setting class attributes, it sets instance attributes ("Attributes" section in the INI file), and instead of setting values in JobScript.resources, it sets values in the instance's resources dictionary ("Resources" section in the INI file).
render_script(scriptbody, jobscript=False)[source]¶
Render the body of a script. This brings both the main body, as well as the prologue, epilogue, and any auxiliary scripts, into the final form in which they will be executed.

Rendering proceeds in the following steps:
- Add a shebang (e.g. #!/bin/bash) based on the shell attribute, if the scriptbody does not yet have a shebang on the first line (otherwise the existing shebang will remain).
- If rendering the body of a JobScript (jobscript=True), add backend-specific resource headers (based on the resources attribute).
- Map environment variables to their corresponding scheduler-specific versions, using the backend's replace_body_vars() method. Note that the prologue and epilogue will not be run by a scheduler, and thus will not have access to the same environment variables as a job script.
- Format each line with known attributes (see https://docs.python.org/3.5/library/string.html#formatspec). In order of precedence (highest to lowest), the following keys will be replaced:
  - keys in the resources attribute
  - instance attributes
  - class attributes
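The precedence rule in the final formatting step can be sketched by building a single mapping before calling str.format. This is illustrative only; the Template class below is a hypothetical stand-in for JobScript:

```python
class Template:
    """Hypothetical stand-in showing format-key precedence."""
    shell = '/bin/bash'  # class attribute (lowest precedence)

    def __init__(self):
        self.shell = '/bin/sh'  # instance attribute shadows the class attr
        self.resources = {'jobname': 'test'}

    def render_line(self, line):
        """Format `line`; resources keys win over instance and class attrs."""
        mapping = {}
        mapping.update({k: v for k, v in vars(type(self)).items()
                        if not k.startswith('_')})  # class attributes
        mapping.update(vars(self))                  # instance attributes
        mapping.update(self.resources)              # resources win
        return line.format(**mapping)
```

Here `Template().render_line('{shell} runs {jobname}')` produces '/bin/sh runs test': the instance shell shadows the class default, and jobname comes from resources.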
write(filename=None)[source]¶
Write out the fully rendered jobscript to file. If filename is not None, write to the given local file. Otherwise, write to the local or remote file specified in the filename attribute, in the folder specified by the rootdir and workdir attributes. The folder will be created if it does not already exist. A '~' in filename will be expanded to the user's home directory.
submit(block=False, cache_id=None, force=False, retry=True)[source]¶
Run the prologue script (if defined), then submit the job to a local or remote scheduler.

Parameters:
- block (boolean, optional) – If block is True, wait until the job is finished, and return the exit status code (see clusterjob.status). Otherwise, return an AsyncResult object.
- cache_id (str or None, optional) – An ID uniquely defining the submission, used as an identifier for the cached AsyncResult object. If not given, the cache_id is determined internally. If an AsyncResult with a matching cache_id is present in the cache_folder, nothing is submitted to the scheduler, and the cached AsyncResult object is returned. The prologue script is not re-run when recovering a cached result.
- force (boolean, optional) – If True, discard any existing cached AsyncResult object, ensuring that the job is sent to the scheduler.
- retry (boolean, optional) – If True, and the existing cached AsyncResult indicates that the job finished with an error (CANCELLED/FAILED), resubmit the job, discard the cache, and return a fresh AsyncResult object.
backend = 'slurm'¶

cache_folder = None¶

cache_prefix = 'clusterjob'¶

epilogue = ''¶

filename = None¶

max_sleep_interval = 900¶

prologue = ''¶

remote = None¶

rootdir = '.'¶

scp = 'scp'¶

shell = '/bin/bash'¶

ssh = 'ssh'¶

workdir = '.'¶
class clusterjob.AsyncResult(backend)[source]¶
Bases: object

Result of submitting a jobscript.
Parameters:
- backend (clusterjob.backends.ClusterjobBackend) – Value for the backend attribute.

Attributes:
- remote (str or None) – The remote host on which the job is running. Passwordless ssh must be set up to reach the remote. A value of None indicates that the job is running locally.
- cache_file (str or None) – The full path and name of the file to be used to cache the AsyncResult object. The cache file will be written automatically any time a change in status is detected.
- backend (clusterjob.backends.ClusterjobBackend) – A reference to the backend instance under which the job is running.
- max_sleep_interval (int) – Upper limit for the number of seconds to sleep between polls to the cluster scheduling system when waiting for the job to finish.
- job_id (str) – The job ID assigned by the cluster scheduler.
- epilogue (str) – Multiline script to be run once when the status changes from "running" (pending/running) to "not running" (completed, cancelled, failed). The contents of this variable will be written to a temporary file as is, and executed as a script in the current working directory.
- ssh (str) – The executable to use for ssh. If not a full path, must be in the $PATH.
- scp (str) – The executable to use for scp. If not a full path, must be in the $PATH.
status¶
Return the job status as one of the codes defined in the clusterjob.status module. If the job is not yet known to have finished, communicate with the cluster to determine the job's status.
classmethod load(cache_file, backend=None)[source]¶
Instantiate an AsyncResult from a dumped cache_file.

This is the inverse of dump().

Parameters:
- cache_file (str) – Name of the file from which the run should be read.
- backend (clusterjob.backends.ClusterjobBackend or None) – The backend instance for the job. If None, the backend will be determined by the name of the dumped job's backend.
wait(timeout=None)[source]¶
Wait until the result is available, or until roughly timeout seconds pass.
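A polling loop with a sleep interval capped at max_sleep_interval, of the kind wait suggests, might look like this sketch (illustrative only, not clusterjob's actual implementation; poll is a hypothetical callable that returns None while the job is still running):

```python
import time

def wait_for(poll, timeout=None, max_sleep_interval=900):
    """Poll until poll() returns a status, or roughly timeout seconds pass.

    Sleep intervals grow exponentially, but are capped at
    max_sleep_interval seconds; returns None on timeout.
    """
    interval = 1
    waited = 0.0
    while True:
        status = poll()
        if status is not None:  # None means "still running" in this sketch
            return status
        if timeout is not None and waited >= timeout:
            return None
        sleep_time = min(interval, max_sleep_interval)
        time.sleep(sleep_time)
        waited += sleep_time
        interval *= 2  # exponential backoff, capped above
```

The backoff keeps polling cheap for long-running jobs while still reacting quickly to short ones.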
successful()[source]¶
Return True if the job finished with a COMPLETED status, False if it finished with a CANCELLED or FAILED status. Raise an AssertionError if the job has not completed.
cancel()[source]¶
Instruct the cluster to cancel the running job. Has no effect if the job is not running.
run_epilogue()[source]¶
Run the epilogue script in the current working directory.

Raises: subprocess.CalledProcessError – if the script does not finish with exit code zero.