clusterjob package

Abstraction for job scripts and cluster schedulers, for a variety of scheduling backends (e.g., SLURM, PBS/TORQUE, …)

Note

To see debug messages, set:

import logging
logging.basicConfig(level=logging.DEBUG)
class clusterjob.JobScript(body, jobname, aux_scripts=None, **kwargs)[source]

Bases: object

Encapsulation of a job script

Parameters:
  • body (str) – Body template for the jobscript as a multiline string. Will be stored in the body instance attribute, and processed by the render_script() method before execution.
  • jobname (str) – Name of the job. Will be stored in the resources[‘jobname’] instance attribute.
  • aux_scripts (dict(str=>str), optional) – dictionary of auxiliary scripts, to be stored in the aux_scripts attribute.

Keyword arguments (kwargs) that correspond to known attributes set the value of that (instance) attribute. Any other keyword arguments are stored as entries in the resources attribute, to be processed by the backend. The following keyword arguments set resource specifications that should be handled by any backend (or else the backend should raise a ResourcesNotSupportedError).

Keyword Arguments:
 
  • queue (str) – Name of queue/partition to which to submit the job
  • time (str) – Maximum runtime. See time_to_seconds() for acceptable formats.
  • nodes (int) – Number of nodes on which to run. If the number of CPU cores used per node is smaller than the number of CPU cores on a physical node, the scheduler may or may not place multiple jobs on the same physical node, depending on its configuration.
  • ppn (int) – (MPI) processes to run per node. The total number of MPI processes will be nodes*ppn. Note that ppn is not the same as the ppn keyword in PBS/TORQUE (which refers to the total number of CPU cores used per node).
  • threads (int) – Number of OpenMP threads, or subprocesses, spawned per process. The total number of CPU cores used per node will be ppn*threads.
  • mem (int) – Required memory per node, in MB
  • stdout (str) – Name of file to which to write the job's stdout
  • stderr (str) – Name of file to which to write the job's stderr

The above list constitutes the simplified resource model supported by the clusterjob package, as a lowest common denominator of various scheduling systems. Other keyword arguments can be used, but they will be backend-specific, and may or may not be handled correctly. In the default SLURM backend, any keyword arguments not in the above list are transformed directly into arguments for sbatch, where single-letter argument names are prefixed with -, and multi-letter argument names with --. An argument with a boolean value is passed without any value if and only if the value is True:

contiguous=True          -> --contiguous
dependency='after:12454' -> --dependency=after:12454
F='nodefile.txt'         -> -F nodefile.txt

All backends are encouraged to implement a similar behavior, to handle arbitrary resource requirements. Note that an alternative (and preferred) way of setting properties (especially backend-specific ones) is through the read_settings() method.
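
For example, such backend-specific options could be passed directly at instantiation (a sketch; the option values are the placeholders from the listing above):

>>> job = JobScript(body='echo "Hello"', jobname='demo',
...     contiguous=True,            # rendered as --contiguous
...     dependency='after:12454',   # rendered as --dependency=after:12454
...     F='nodefile.txt')           # rendered as -F nodefile.txt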

Class Attributes

The following class attributes cannot be shadowed by instance attributes of the same name (attempting to do so raises an AttributeError).

Class Attributes:
 
  • cache_folder (str or None) – Local folder in which to cache the AsyncResult instances resulting from job submission. If None (default), caching is disabled.
  • cache_prefix (str) – Prefix for cache filenames. If caching is enabled, jobs will be stored inside cache_folder in a file cache_prefix.<cache_id>.cache, where cache_id is defined in the submit() method.
  • resources (OrderedDict) – Dictionary of default resource requirements. Modifying the resources class attribute affects the default resources for all future instantiations.

Note

The preferred way to set these class attributes is through the read_defaults() class method.
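
They may also be set directly on the class, e.g. (a sketch; './cache' and 'myproject' are placeholder values):

>>> JobScript.cache_folder = './cache'    # enable caching of AsyncResult objects
>>> JobScript.cache_prefix = 'myproject'  # cache files: myproject.<cache_id>.cache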

Class/Instance Attributes

The following are class attributes, with the expectation that they may be shadowed by instance attributes of the same name.

Attributes:
  • backend (str) – Name of backend, must be an element in JobScript.backends. That is, if backend does not refer to one of the default backends, the register_backend() class method must be used to register the backend before any job may use it. Defaults to ‘slurm’.
  • shell (str) – Shell that is used to execute the job script. Defaults to /bin/bash.
  • remote (str or None) – Remote server on which to execute submit commands. If None (default), submit locally.
  • rootdir (str) – Root directory for workdir, local or remote. Defaults to '.', i.e., the current working directory. The rootdir is guaranteed not to have a trailing slash.
  • workdir (str) – Work directory (local or remote) in which the job script file will be placed, and from which the submission command will be called. Relative to rootdir. Defaults to '.' (current working directory). The workdir is guaranteed not to have a trailing slash.
  • filename (str or None) – Name of file to which the job script will be written (inside rootdir/workdir). If None (default), the filename will be set from the job name (resources[‘jobname’] attribute) together with a backend-specific file extension.
  • prologue (str) – Multiline shell script that will be executed locally in the current working directory before submitting the job. Before running, the script will be rendered using the render_script() method.
  • epilogue (str) – Multiline shell script that will be executed locally in the current working directory the first time that the job is known to have finished. It will be rendered using the render_script() method at the time that the job is submitted. Its execution will be handled by the AsyncResult object resulting from the job submission. The main purpose of the epilogue script is to move data from a remote cluster upon completion of the job.
  • max_sleep_interval (int) – Upper limit for the number of seconds to sleep between polling the status of a submitted job.
  • ssh (str) – The executable to use for ssh. If not a full path, must be in the $PATH.
  • scp (str) – The executable to use for scp. If not a full path, must be in the $PATH.

This makes it possible to define defaults for all jobs by setting the class attribute, and to override them for specific jobs by setting the corresponding instance attribute. For example,

>>> jobscript = JobScript(body='echo "Hello"', jobname='test')
>>> jobscript.shell = '/bin/sh'

sets the shell for only this specific jobscript, whereas

>>> JobScript.shell = '/bin/sh'

sets the class attribute, and thus the default shell for all JobScript instances, both future and existing:

>>> job1 = JobScript(body='echo "Hello"', jobname='test1')
>>> job2 = JobScript(body='echo "Hello"', jobname='test2')
>>> assert job1.shell == job2.shell == '/bin/sh'   # class attribute
>>> JobScript.shell = '/bin/bash'
>>> assert job1.shell == job2.shell == '/bin/bash' # class attribute
>>> job1.shell = '/bin/sh'
>>> assert job1.shell == '/bin/sh'                 # instance attribute
>>> assert job2.shell == '/bin/bash'               # class attribute

Note

  • The preferred way to set these attributes as class attributes (i.e., to provide defaults for any instance) is through the read_defaults() class method. To set them as instance attributes, or to set values in the resources instance attribute defined below, the read_settings() method should be used.

  • A common purpose of the prologue and epilogue scripts is to move data to a remote cluster, e.g. via the prologue commands:

    ssh {remote} 'mkdir -p {rootdir}/{workdir}'
    rsync -av {workdir}/ {remote}:{rootdir}/{workdir}
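
    The same scripts can also be set directly as attributes, e.g. (a sketch; login.cluster.edu is a placeholder host, with passwordless ssh assumed):

    >>> jobscript.remote = 'login.cluster.edu'
    >>> jobscript.prologue = r'''
    ... ssh {remote} 'mkdir -p {rootdir}/{workdir}'
    ... rsync -av {workdir}/ {remote}:{rootdir}/{workdir}
    ... '''
    >>> jobscript.epilogue = 'rsync -av {remote}:{rootdir}/{workdir}/ {workdir}'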
    

Instance Attributes

The following attributes are local to any JobScript instance, and are set automatically during instantiation.

Attributes:
  • body (str) – Multiline string of shell commands. Should not contain backend-specific resource headers. Before submission, it will be rendered using the render_script() method.
  • resources (dict) – Dictionary of submission options describing resource requirements. Set on instantiation, based on the default values in the resources class attribute and the keyword arguments passed to the instantiator.
  • aux_scripts (dict(str=>str)) – Dictionary mapping filenames to script bodies for any auxiliary scripts. When the main job script (body) is written during submission, any script defined in this dictionary will also be rendered using the render_script() method and written to the same folder as the main script. While generally not needed, auxiliary scripts may be useful in structuring a large job.

Example

>>> body = r'''
... echo "####################################################"
... echo "Job id: $CLUSTERJOB_ID"
... echo "Job name: $CLUSTERJOB_WORKDIR"
... echo "Job started on" `hostname` `date`
... echo "Current directory:" `pwd`
... echo "####################################################"
...
... echo "####################################################"
... echo "Full Environment:"
... printenv
... echo "####################################################"
...
... sleep 90
...
... echo "Job Finished: " `date`
... exit 0
... '''
>>> jobscript = JobScript(body, backend='slurm', jobname='printenv',
... queue='test', time='00:05:00', nodes=1, threads=1, mem=100,
... stdout='printenv.out', stderr='printenv.err')
>>> print(jobscript)
#!/bin/bash
#SBATCH --job-name=printenv
#SBATCH --mem=100
#SBATCH --nodes=1
#SBATCH --partition=test
#SBATCH --error=printenv.err
#SBATCH --output=printenv.out
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00

echo "####################################################"
echo "Job id: $SLURM_JOB_ID"
echo "Job name: $SLURM_SUBMIT_DIR"
echo "Job started on" `hostname` `date`
echo "Current directory:" `pwd`
echo "####################################################"

echo "####################################################"
echo "Full Environment:"
printenv
echo "####################################################"

sleep 90

echo "Job Finished: " `date`
exit 0

Note

The fact that arbitrary attributes can be added to an existing object can be exploited to define arbitrary template variables in the job script:

>>> body = r'''
... echo {myvar}
... '''
>>> jobscript = JobScript(body, jobname='myvar_test')
>>> jobscript.myvar = 'Hello'
>>> print(jobscript)
#!/bin/bash
#SBATCH --job-name=myvar_test

echo Hello
classmethod register_backend(backend, name=None)[source]

Register a new backend.

Parameters:
  • backend (clusterjob.backends.ClusterjobBackend) – The backend to register. After registration, the backend attribute of a JobScript instance may refer to the backend by name.
  • name (str) – The name under which to register the backend. If not given, use the name defined in the backend’s name attribute. This attribute will be updated with name, if given, to ensure that the name under which the backend is registered and the backend’s internal name attribute are the same.
Raises:
  • TypeError – if backend is not an instance of ClusterjobBackend, or does not implement the backend interface correctly
  • AttributeError – if backend does not have the attributes name and extension
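
For example (a sketch; MyBackend stands in for a hypothetical ClusterjobBackend implementation):

>>> backend = MyBackend()  # hypothetical ClusterjobBackend subclass
>>> JobScript.register_backend(backend, name='mycluster')
>>> job = JobScript(body='echo "Hello"', jobname='test', backend='mycluster')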
classmethod clear_cache_folder()[source]

Remove all files in the cache_folder

resources = {}
backends

List of names of registered backends

__setattr__(name, value)[source]

Set attributes, preventing the shadowing of the “genuine” class attributes by raising an AttributeError. Perform some checks on the value, raising a ValueError if necessary.

classmethod read_defaults(filename=None)[source]

Set class attributes from the INI file with the given file name

The file must be in the format specified in https://docs.python.org/3.5/library/configparser.html#supported-ini-file-structure with the default ConfigParser settings, except that all keys are case sensitive. It must contain one or both of the sections “Attributes” and “Resources”. The key-value pairs in the “Attributes” section are set as class attributes, whereas the key-value pairs in the “Resources” section are set as keys and values in the resources class attribute.

All keys in the “Attributes” section must start with a letter, and must consist only of letters, numbers, and underscores. Keys in the “Resources” section can be arbitrary strings. The key names ‘resources’ and ‘backends’ may not be used. An example of a valid config file is:

[Attributes]
remote = login.cluster.edu
prologue =
    ssh {remote} 'mkdir -p {rootdir}/{workdir}'
    rsync -av {workdir}/ {remote}:{rootdir}/{workdir}
epilogue = rsync -av {remote}:{rootdir}/{workdir}/ {workdir}
rootdir = ~/jobs/
# the following is a new attribute
text = Hello World

[Resources]
queue = exec
nodes = 1
threads = 1
mem = 10

If no filename is given, reset all class attributes to their initial value, and delete any attributes that do not exist by default. This restores the JobScript class to a pristine state.
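
For example, assuming the INI content above is saved in a file defaults.ini (a placeholder name):

>>> JobScript.read_defaults('defaults.ini')  # set class attributes from file
>>> JobScript.read_defaults()                # restore the pristine class state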

read_settings(filename)[source]

Set instance attributes from the INI file with the given file name

This method behaves exactly like the read_defaults() class method, but instead of setting class attributes, it sets instance attributes (“Attributes” section in the INI file), and instead of setting values in JobScript.resources, it sets values in the instance’s resources dictionary (“Resources” section in the INI file).
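
For example (a sketch; settings.ini is a placeholder for a file in the format described above):

>>> jobscript = JobScript(body='echo "Hello"', jobname='test')
>>> jobscript.read_settings('settings.ini')  # affects only this instance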

render_script(scriptbody, jobscript=False)[source]

Render the body of a script. This brings the main body, as well as the prologue, epilogue, and any auxiliary scripts, into the final form in which they will be executed.

Rendering proceeds in the following steps:

  • Add a “shebang” (e.g. #!/bin/bash) based on the shell attribute if the scriptbody does not yet have a shebang on the first line (otherwise the existing shebang will remain)
  • If rendering the body of a JobScript (jobscript=True), add backend-specific resource headers (based on the resources attribute)
  • Map environment variables to their corresponding scheduler-specific version, using the backend’s replace_body_vars() method. Note that the prologue and epilogue will not be run by a scheduler, and thus will not have access to the same environment variables as a job script.
  • Format each line with known attributes (see https://docs.python.org/3.5/library/string.html#formatspec). In order of precedence (highest to lowest), the following keys will be replaced:
    • keys in the resources attribute
    • instance attributes
    • class attributes
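
For example, a template referencing the shell attribute would be rendered as follows (a sketch; the output assumes the default class attributes):

>>> body = r'''
... echo "This script is interpreted by {shell}"
... '''
>>> jobscript = JobScript(body, jobname='render_test')
>>> print(jobscript)
#!/bin/bash
#SBATCH --job-name=render_test

echo "This script is interpreted by /bin/bash"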
__str__()[source]

String representation of the job, i.e., the fully rendered jobscript

write(filename=None)[source]

Write out the fully rendered jobscript to file. If filename is not None, write to the given local file. Otherwise, write to the local or remote file specified in the filename attribute, in the folder specified by the rootdir and workdir attributes. The folder will be created if it does not exist already. A ‘~’ in filename will be expanded to the user’s home directory.

submit(block=False, cache_id=None, force=False, retry=True)[source]

Run the prologue script (if defined), then submit the job to a local or remote scheduler.

Parameters:
  • block (boolean, optional) – If block is True, wait until the job is finished, and return the exit status code (see clusterjob.status). Otherwise, return an AsyncResult object.
  • cache_id (str or None, optional) – An ID uniquely defining the submission, used as identifier for the cached AsyncResult object. If not given, the cache_id is determined internally. If an AsyncResult with a matching cache_id is present in the cache_folder, nothing is submitted to the scheduler, and the cached AsyncResult object is returned. The prologue script is not re-run when recovering a cached result.
  • force (boolean, optional) – If True, discard any existing cached AsyncResult object, ensuring that the job is sent to the scheduler.
  • retry (boolean, optional) – If True, and the existing cached AsyncResult indicates that the job finished with an error (CANCELLED/FAILED), resubmit the job, discard the cache, and return a fresh AsyncResult object.
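
A typical non-blocking submission might look like this (a sketch):

>>> ar = jobscript.submit()  # returns an AsyncResult object immediately
>>> ar.wait()                # block until the job has finished
>>> exit_code = ar.get()     # status code, see clusterjob.status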
backend = 'slurm'
cache_folder = None
cache_prefix = 'clusterjob'
epilogue = ''
filename = None
max_sleep_interval = 900
prologue = ''
remote = None
rootdir = '.'
scp = 'scp'
shell = '/bin/bash'
ssh = 'ssh'
workdir = '.'
class clusterjob.AsyncResult(backend)[source]

Bases: object

Result of submitting a jobscript

Parameters:

backend (clusterjob.backends.ClusterjobBackend) – Value for the backend attribute

Attributes:
  • remote (str or None) – The remote host on which the job is running. Passwordless ssh must be set up to reach the remote. A value of None indicates that the job is running locally
  • cache_file (str or None) – The full path and name of the file to be used to cache the AsyncResult object. The cache file will be written automatically anytime a change in status is detected
  • backend (clusterjob.backends.ClusterjobBackend) – A reference to the backend instance under which the job is running
  • max_sleep_interval (int) – Upper limit for the number of seconds to sleep between polls to the cluster scheduling systems when waiting for the Job to finish
  • job_id (str) – The Job ID assigned by the cluster scheduler
  • epilogue (str) – Multiline script to be run once when the status changes from “running” (pending/running) to “not running” (completed, canceled, failed). The contents of this variable will be written to a temporary file as is, and executed as a script in the current working directory.
  • ssh (str) – The executable to use for ssh. If not a full path, must be in the $PATH.
  • scp (str) – The executable to use for scp. If not a full path, must be in the $PATH.
status

Return the job status as one of the codes defined in the clusterjob.status module. If the job is not known to have finished, communicate with the cluster to determine the job’s status.

get(timeout=None)[source]

Return the job status (cf. the status attribute), waiting for the result to become available or until roughly timeout seconds pass.

dump(cache_file=None)[source]

Dump the AsyncResult out to the file cache_file, defaulting to the cache_file attribute.

classmethod load(cache_file, backend=None)[source]

Instantiate AsyncResult from dumped cache_file.

This is the inverse of dump().

Parameters:
  • cache_file (str) – Name of file from which the run should be read.
  • backend (clusterjob.backends.ClusterjobBackend or None) – The backend instance for the job. If None, the backend will be determined by the name of the dumped job’s backend.
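
For example, a round trip through a cache file (a sketch; job.cache is a placeholder filename):

>>> ar.dump('job.cache')                 # serialize the AsyncResult
>>> ar2 = AsyncResult.load('job.cache')  # backend resolved from the dumped name
>>> assert ar2.job_id == ar.job_id       # assuming the job_id is preserved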
wait(timeout=None)[source]

Wait until the result is available or until roughly timeout seconds pass.

ready()[source]

Return whether the job has completed.

successful()[source]

Return True if the job finished with a COMPLETED status, False if it finished with a CANCELLED or FAILED status. Raise an AssertionError if the job has not completed.
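
For example (a sketch):

>>> if ar.ready():           # the job has finished, one way or another
...     if ar.successful():  # COMPLETED, as opposed to CANCELLED/FAILED
...         print("Job completed successfully")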

cancel()[source]

Instruct the cluster to cancel the running job. Has no effect if the job is not running.

run_epilogue()[source]

Run the epilogue script in the current working directory.

Raises:
  • subprocess.CalledProcessError – if the script does not finish with exit code zero.