Cluster Guide

User Manual:

Open the PDF directly: View PDF .
Page Count: 11

Draft version June 11, 2018

Typeset using L

Xmodern style in AASTeX62

Running code on the Wesleyan Cluster

ABSTRACT

This document is intended as a quick reference guide for running code on the Wes-

leyan computing cluster. It is not an introduction to high-performance cluster com-

puting (I would not be qualiﬁed to write such a document) but it does cover the

details of computing power available, and how to get started using it.

1. CLUSTER BASICS

The Wesleyan computing center (located on the 5th ﬂoor of the Excley Science

Tower) contains a range of computing capabilities, all available through the Wesleyan

network. These capabilities include highly parallelized computing, GPU computing,

high memory (∼100 GB) computing, etc. A full list of the hardware setup is avail-

able online1, along with a list of available software2. A full list of all information

about the cluster is also available online3. The cluster is managed by Henk Meij

(hmeij@wesleyan.edu); I have always found Henk to be very responsive and helpful

when setting up my code, and running through any sort of trouble shooting that I

can’t handle on my own. The ﬁrst step in running code on the cluster is to contact

Henk, and to request an account on the cluster.

The ﬁrst point of contact with the computing cluster is through one of the

cottontail2 and swallowtail are also available. These login nodes can be accessed

through ssh

ssh -XY kflaherty@cottontail

You will be deposited into your home directory (/home/kflaherty) on the computing

cluster, where you will store your code and data. Note that this home directory is

not the same as the home directory on you local machine and your data and code

must be copied over from your local machine to the computing clusters home directory

1https://dokuwiki.wesleyan.edu/doku.php?id=cluster:126

2https://dokuwiki.wesleyan.edu/doku.php?id=cluster:73

3https://dokuwiki.wesleyan.edu/doku.php?id=cluster:0

scp mydata.fits kflaherty@cottontail:/home/kflaherty/mycode/

scp mycode.py kflaherty@cottontail:/home/kflaherty/mycode/

There is a total of 10 TB of space allocated for home directories; if more space is

needed for your data, contact Henk to make arrangements.

After connecting to the login node, you can access the computing nodes through the

use of the OpenLava Scheduler. A job, which is one instance of running your code,

is submitted to the computing cluster using the scheduler, which in turn distributes

the computational work according to the requested conﬁguration (e.g. the requested

computing queue, the requested number of cores, etc.) onto any available cores. The

computing cluster is set up with a number of diﬀerent queues, which provide access

to a collection of nodes, which are many up of multiple chips. The number of nodes,

and the number of chips per node depends on the chosen queue. The exact queue

that works best for you will depend on the computing capabilities needed for your

code. The nodes are speciﬁed by nfollowed by a number (e.g. n33 n14 n27). When

multiple cores from a single node are used by a piece of code, you will see the node

name repeated multiple times (e.g. n33 n33 n33 n33 n33 for 5 cores from node 33)

The main queues available on the Wesleyan computing cluster are:

1. hp12: The default queue. Consists of 32 nodes, with dual quad core chips with

12 GB of memory each

2. mw256fd: Contains 8 nodes with dual eight core chips, with 256 GB of memory

each.

3. mwgpu: Contains four GPUs, 2,500 cores/GPU with GPU memory footprint of

5 GB.

The queue hp12 is designed for highly parallelized, but small memory, jobs (tinymem

is a queue that also handles small memory jobs). The queue mw256fd, the primary

queue that I use, is designed for large memory jobs, while mwgpu is designed for

GPU computations. All queues have access to the same software base, and are all

accessible from any login node. As described below your choice of queue is speciﬁed

when submitting a job to the scheduler.

2. SCRIPTS FOR RUNNING JOBS

The submit a job to the desired queue, a submission script must be created. This

script contains a series of commands that setup the computing job, and then call you

code to start the computation. Outlined below are the basic commands used in the

submission script for either a serial or parallel job. Examples of these pieces of code

are available in Henk’s home directory (/home/hmeij/jobs within the serial and

parallel) directories. Henk has also implemented checkpointing, which allows for a

job to be restarted from a saved checkpoint if the cluster crashes. The checkpointing

scripts include the same basic setup, with additional code (that doesn’t need to be

modiﬁed by the user) that handles the checkpointing functionality.

2.1. Basic serial script

The simplest type of job is one that requires only serial computations. Here you

simply need to specify the requested queue, provide a name for your job, setup up

the software environment, copy the data and code to the scratch directory, and run

your code. This is all done within a shell script, with the associated commands for

shell scripting available within this submission script.

The ﬁrst line setups up the bash script

#!/bin/bash

Note that the cluster uses the bash shell, rather than csh or tcsh. Any shell scripts

called by you code must be setup in bash.

Next are a series of parameters for the scheduler:

#BSUB -q mw256fd

#BSUB -J test

#BSUB -o test.stdout

#BSUB -e test.stderr

#BSUB -N

The ﬁrst line speciﬁes the queue to which the code is sent (e.g. hp12, mwd26fd).

The second line provides the name of the job, while the third and fourth lines specify

the names for the output and error ﬁles. These two ﬁles are created once the job is

complete and they contain outputs from you code, as well an error messages generated

while running your code. The test.stdout ﬁle will also contain basic information on

the name of the job, the home directory, the working directory, the job id, a snippet

of the code, plus some other information about the run. The last line requests that

the scheduler send you an email when the code has ﬁnished running.

Next you need to set up the software environment for your code. This can require

deﬁning path variables for various packages (in this case Python and MIRIAD).

export PYTHONHOME=/home/apps/python/2.6.1

export PYTHONPATH=/home/apps/python/2.6.1/site-packages

export PATH=$PYTHONHOME/bin:$PATH

. /home/apps/miriad/MIRRC.sh

export PATH=$MIRBIN:$PATH

The ﬁrst three lines set up the python paths, speciﬁcally referencing python 2.6.1.

Note that multiple versions of python exist on the cluster, but version 2.6.1 contains

numpy, scipy, astropy and emcee which are all used in my own code. The last two

lines set up the MIRIAD installation. First it calls the MIRIAD initialization script,

which sets up the variable $MIRBIN, which is then added to the path. Note that the

MIRIAD installation requires a fortran library that is only available on the mw256fd

queue.

The next step is to copy the data and code over to the working directory. Upon the

start of a job a temporary scratch directory is created, with a name speciﬁed by the

ID number associated with job. This directory is removed once the job is complete,

but is a useful place for running the code in order to reduce the load on the home

directory server. By default this happens in the /sanscratch directory, but if large

amounts of data are being generated, then the /localscratch5tb scratch directory

should be used4.

cp -r mydata.fits /sanscratch/$LSB JOBID

cp -r mycode.py /sanscratch/$LSB JOBID

cd /sanscratch/$LSB JOBID

The system variable $LSB JOBID contains the job ID number that is generated when

a job is sent to the Scheduler. This sequence of commands not only copies the data

(mydata.fits) and code (mycode.py) into the scratch directory, but it also moves into

that directory. This is important so that any read/write commands happen within

the scratch directory. Otherwise these commands will be sent to your home directory,

which will put unnecessary load onto the server that runs the home directory, slowing

down operations on the home directory.

Once the setup is complete, you can call your actual code (test.py), making sure

to copy over any results (myresults.txt) from the scratch directory back into the

home directory.

python test.py

cp myresults.txt /home/kflaherty/mycode/

2.2. Basic Parallel script

Running a parallel code that distributes jobs over multiple cores (which may or

may not be on the same node) follows the same basic ﬂow as the serial script, with

additional parameters for the scheduler, and additional PATH variables to provide

access to the software that does the parallelization. Listed below are the commands

that are included in addition to those already speciﬁed within the basic serial script.

4The /localscratch5tb directory is only available to the mw256fd queue

The ﬁrst addition is in the parameters for the scheduler. Since a parallel job

requires resources from multiple cores you need to specify the necessary computing

resources.

#BSUB -n 8

#BSUB -R ‘‘span[hosts=1]’’

The ﬁrst command speciﬁes the number of cores requested for this job, in this case

8. The second command tells the scheduler that all of the cores must come from one

node. This, in turn, limits the maximum number of cores that can be requested in

the ﬁrst line; in the hp12 queue there are 8 cores per node, while for the mw256fd

queue there are 16 cores per node. Conversely, you can remove the second command

and replace it with

#BSUB -R ‘‘span[ptile=1]’’

which tells the scheduler to use one core per node (e.g. use 8 nodes, instead of 8 cores

within a single node). Here the limit is the number of nodes within a queue (32 for

hp12, and 8 for mw256fd). I have always found that openmpi only works when all of

the cores are within a single node, rather than being spread out over multiple nodes.

The next step is to set up paths to the openmpi code

export PATH=/share/apps/openmpi/1.2+intel-10/bin:$PATH

export LD LIBRARY PATH=/share/apps/openmpi/1.2+intel-10/lib:$LD LIBRARY PATH

From here your code can be called:

./lava.openmpi.wrapper python mpi run models.py

Calling your code is slightly diﬀerent here that in the serial case. The call is done

via a wrapper (lava.openmpi.wrapper) that sets up variables needed for paral-

lel jobs. This wrapper takes the place of a direct call to e.g. mpirun that you

would use on your local machine. The wrapper is available from Henk’s directory

(/home/hmeij/jobs/parallel/)

2.3. Checkpointing

Recently Henk has set up the cluster to take advantage of checkpointing within the

code, based on the Berkeley Laboratory Checkpoint/Restart tool. What this means

is that if the cluster crashes, you can restart from the last checkpoint, rather than

having to restart from the very beginning. This is very useful if you e.g. are running

a week long MCMC chain and the cluster crashes on day four. The details of how this

works are complicated, but Henk has hidden most of this within new execution ﬁles,

having worked closely with me and Jesse Tarnes (’16, within Seth’s group) to make

sure our code works with checkpointing. Information on both serial5and parallel6

jobs can be found online.

Once you have copied over Henk’s checkpointing scripts from his directory

(/home/hmeij/jobs/blcr/blcr wrapper.serial), you will notice some small

diﬀerences from the original sample scripts. The ﬁrst thing the script does is set up

the code to operate from the scratch directory

export MYSANSCRATCH=/sanscratch/$LSB JOBID

cd $MYSANSCRATCH

pre cmd=‘‘scp $HOME/kflaherty/test.py .’’

post cmd=‘‘scp $MYSANSCRATCH/result.txt $HOME/kflaherty/‘‘

These commands ﬁrst set up the $MYSANSCRATCH system variable, move to the tem-

porary directory and set up commands that will copy over your code at the start

of the run, and copy back any results (result.txt) of the run. The operation of

copying your code and data into the scratch directory is not executed right away, this

just sets up the commands to be called later. Moving everything to the temporary

directory, including your code, is important when checkpointing because the BLCR

code has to recover everything that was used by the code in order to properly restart

it. This includes every ﬁle created by the code while it is running. In the example

above, I only have the ﬁle test.py that needs to be copied over at the beginning,

but this can be expanded to include any pieces of code, data sets, support ﬁles, etc.

that are needed.

Next you need to tell the wrapper that you are starting a new job:

mode=start

queue=mwd256fd

cmd=‘‘python test.py’’

where queue speciﬁes the queue to which the code is sent (the same as speciﬁed with

BSUB -q), while the second line speciﬁes command used to call the code. If you are

restarting a job then the code block above is replaced by:

mode=restart

queue=mw256fd

5https://dokuwiki.wesleyan.edu/doku.php?id=cluster:147

6https://dokuwiki.wesleyan.edu/doku.php?id=cluster:148

orgjobid=250

The queue must be the same as the queue used for the original job. The orgjobid

parameter is the job ID number for the original job.

Finally, the time interval over which the checkpointing occurs is speciﬁed:

cpti=15m

This requests that a complete copy of the state of system is saved every 15 min-

utes. This copy includes all of the ﬁles that have been generated, all of the code

used to generate them, and the state of the memory. These states are saved within

/sanscratch/checkpoints/$JOBID where $JOBID is the unique ID number for the

job. For any full run of your code a checkpoint every 15 minutes is overkill; more

reasonable values are 12h (12 hours) or 2d (2 days).

You can check on the status of the script using two ﬁles within the checkpoints

folder titled cr checkpoint.err and cr mpirun.err. The ﬁrst of these keeps a log

of errors associated with the checkpointing process. The second keeps tracks of any

errors/warning output by your code during the course of its operation. This later

ﬁle can be useful for checking on the progress of your code (e.g. if your code crashes

without stopping, you will likely see an error message here).

Also, the blcr wrapper will pipe any text that would normally be written on the

screen into a ﬁle in your home directory. It is located in /home/kflaherty/.lsbatch

and the name will have a format (random number).$JOBID.out. If your code reg-

ularly prints text to the screen, then this is a good place to look for the text as the

code is running.

Once the wrapper has been set up it can be sent to the scheduler in the same manner

as any other script. As soon as a ﬁle titled chk.PID shows up, the code has completed

its ﬁrst checkpoint. From here on the code can be resumed from this point after it

crashes.

3. SUBMITTING A JOB

Regardless of whether a job is serial or parallel, or includes checkpointing, submit-

ting the job to the scheduler is handled in the same way. This is done with the bsub

command (Figure 1):

bsub < myscript

where myscript is the script containing the commands listed above. This pipes the

myscript ﬁle into the bsub command, including all of the bsub parameters speciﬁed

within the script.

Figure 1. Example of the bsub command. A script is submitted to the mw256fd queue and is

assigned a job id number.

Before submitting a job, you can check the status of the queues to e.g. determine

how many other jobs are still pending on your desired queue. This is done with

bqueues (Figure 2).

Figure 2. Example of the bqueues command. This lists all of the available queues, the maximum

number of jobs, the allowable jobs per host, the number of submitted jobs, the number of pending

jobs, the number of running jobs and the number of suspended jobs. Most of the queues are currently

overloaded, although this tends to ﬂuctuate with time.

This lists the name of all available queues, the status of the queue, the number of

jobs per host, the number of total jobs submitted to the queue, the number of pending

jobs, and the number of running jobs.

You can check on any submitted jobs using the bjobs command (Figure 3).

Figure 3. Example of a call to bjobs. I have submitted three jobs from greentail, all of which are

running. Each job is taking up eight cores on diﬀerent nodes.

This command will list any jobs that are running, or have been recently submitted.

It lists the unique job ID number, the user that submitted the job, the status of the

job (e.g. running vs pending), the queue that the job was sent to, the login node

from which the job was submitted, the cores that are executing the job, the name of

the job, and the time that the job that was submitted. A useful variant of bjobs is:

bjobs -u all -q mw256fd

This lists the jobs from all users that have been submitted to a particular queue.

This can be useful for looking in detail at the load on a particular queue. The bjobs

command can also be called with the -m node ﬂag, where node is the node name (e.g.

n44); bjobs will then show the jobs operating on a particular node.

Similarly, the bhist command (Figure 4) lists any submitted jobs, including the

amount of time they have spent in various states (e.g. running vs pending).

Figure 4. Example of a call to bhist. This shows the three jobs currently running, along with

the number of seconds they have spent in various states.

Once a job is running, a number of commands can be used to check its status. The

ﬁrst is lsload (Figure 5), which lists the load on a particular node, in the form of the

number of cores that are currently operating, and the amount of memory being used.

The number of cores should be at least as large as the number of cores requested

by your code (assuming you have requested that all cores be located within a single

node). If not, then the code has likely crashed without stopping (see below).

Figure 5. Example of a call to lsload. The values r15s,r1m,r15m shows the core usage averaged

over the previous 15 seconds, 1 miute and 15 minutes.

Another method for checking on the operation of your code is with the top

command, which can be called as :

ssh <node> top -u <username> -b -n 1

where <node> is the node name (e.g. n42) and <username> is your username. This

will execute top on all of your commands executing on the speciﬁed node (Figure 6).

This contains information on the total memory usage (important to keep track of)

and the fraction of CPU being devoted to a particular command.

If you need to terminate a job for any reason, then this can be done with the bkill

command:

bkill <jobid>

where <jobid> is the unique ID number for the particular job that you want to kill.

This may take a minute or two depending on the complexity of the job that is being

killed.

Figure 6. Example of calling top for a particular node. There are lots of python processes running

right now.

4. MISC.

4.1. Crashing without Stopping

Most of the time if your code crashes the job will be killed, and error messages

will be output to the .stderr and .stdout ﬁles. But I have run into some in-

stances where the code will crash, but the job will not kill itself. This can happen

if emcee runs into a problem with a particular model and isn’t able to proceed any

further along its chains. If you have piped your output into test.out (described

below), then you can see this happen when test.out stops growing. At this point

you have to kill the job and examine the error ﬁle to discern what went wrong.

Similarly, for checkpointed jobs you can look at ~/.lsbatch/[0-9]*.JOBID.out or

/sanscratch/checkpoints/$JOBID/cr mpirun.err to see if it has registered any

errors.

4.2. MIRIAD on the Cluster

Miriad runs the same as on our local machines, with one small exception having

to do with how uvmodel handles polarized data and an unpolarized model. On my

local machine, if the data has dimensions NbasexNchanx3x2, where Nbase is the number

of baselines and Nchan is the number of channels and the last dimension splits the

data into XX and YY, then when the model is run through uvmodel it results in a

visibility ﬁle with dimensions 2NbasexNchanx3x1. The last dimension has been reduced

while the ﬁrst dimension has been doubled, with the intensity copied into twice as

many baseline positions. On the cluster uvmodel does not do this. It maintains

Table 1. Cluster Commands

Command Description

bsub < <myjob> Submit a job (<myjob>) to the scheduler

bqueues Report on the status of the queues within the clus-

ter. The output includes the number of submitted jobs

NJOBS, the number of pending jobs PEND, the number

of running jobs RUN, the number of allowed jobs per

host JL/H, and the status (e.g. Open vs Closed) for

each queue on the cluster.

bjobs List your submitted jobs, including their job id, the

queue they were submitted to, the name of the job,

and, if the job is running, a list of the cores on which

the code is running.

bjobs -u all -q <queue> List all of the jobs currently running on a speciﬁc

queue.

bhist List a history of time spent on your submitted jobs

lsload <node> Lists the number of currently running cores, averaged

over the past 15 seconds (r15s), 1 minute (r1m), and

15 minutes (r15m), on a particular node. It also lists

the memory usage for that particular node.

ssh <node> top -u <username> -b -n 1 Run top for node <node> for all jobs submitted by

user <username>.

bkil <jobid> End/kill a particular job, speciﬁed by its unique ID

number.

Note—A description of common commands for use on Wesleyan’s computing cluster

the original dimensions of the data (NbasexNchan x3x2) and copies intensity into the

XX,YY positions in the array. This requires slightly diﬀerent handling of the model

visibilities when they are loaded into python.

Also note that miriad is very talkative, and likes to tell you all about things that

are going on when it is running. This can become burdensome when hundreds of

thousands of models are run since all of this information will be dumped into the

.stderr and .stdout ﬁles created at the end of each run (described below). To

avoid keeping this useless information, I usually pipe the command into a ﬁle (e.g.

python code.py >test.out. The ﬁle test.out will slowly grow with time, which

a useful way to test that your code has not crashed. If you place this ﬁle in the scratch

directory then it will deleted once the code is done running. When checkpointing is

used but the output is instead placed in ~/.lsbatch/[0-9]*.JOBID.out. When the

job ends, this ﬁle becomes the .out ﬁle.

Cluster Guide

Navigation menu

Versions of this User Manual:

Views

Navigation