Once you have obtained an account on Plato, follow this guide to make your first steps on the cluster.

Receiving important notifications

We strongly recommend that all Plato users subscribe to our Mailing list to stay informed about the status of the machine and receive important notifications.

Accessing the system

Plato is accessible through SSH at plato.usask.ca. Your user name is your NSID and your password is the one associated with your NSID (the one used on paws.usask.ca, for instance). On a UNIX machine (such as Linux or macOS), use the following command in a terminal (replacing abc123 with your NSID):

$ ssh abc123@plato.usask.ca

The $ sign is used throughout this guide to denote the bash prompt. The text that follows is what should be typed in your shell (do not type the $ itself).

To transfer files, you can use standard UNIX tools such as scp or rsync.
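
For example, the following commands copy a hypothetical file and directory from your local machine to your home directory on Plato (replace abc123 with your NSID; results.dat and my_project/ are placeholder names):

$ scp results.dat abc123@plato.usask.ca:
$ rsync -av my_project/ abc123@plato.usask.ca:my_project/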

On Windows machines, you will need an SSH client such as MobaXterm or PuTTY. To transfer files on Windows, we recommend WinSCP.

To access Plato from outside the campus, you will first need to connect to the University’s virtual private network (VPN).

Plato is also accessible via Globus (GridFTP).

Finding and using provided software

On Plato, the module command allows you to search for installed software packages and add them to your environment. Because a large amount of software is available, often in several versions of the same package, packages are not added to your environment by default: you need to load the software you want manually. If you are already familiar with module, you should feel at home on Plato, as the software stack is the same as the one used on Compute Canada machines. Note, however, that not every package and version is guaranteed to be available on Plato. If you are not familiar with module, here are a few examples to get you started.

Basics

To load a particular software package:

$ module load openfoam

To load a specific version of a package:

$ module load openfoam/4.1

To unload a module:

$ module unload openfoam

To list the currently loaded modules:

$ module list

Currently Loaded Modules:
  1) nixpkgs/16.09   (S)      3) StdEnv/2016.4  (S)   5) gcc/5.4.0     (t)
  2) imkl/11.3.4.258 (math)   4) gcccore/.5.4.0 (H)   6) openmpi/2.1.1 (m)

  Where:
   S:     Module is Sticky, requires --force to unload or purge
   m:     MPI implementations
   math:  Mathematical libraries
   t:     Tools for development
   H:     Hidden Module

These are the default modules that are loaded for you when starting a session:

  • NixPkgs: A collection of basic UNIX packages that provides a uniform environment on all Compute Canada machines (and Plato)
  • StdEnv: The module that defines the standard environment for all Compute Canada machines (and Plato), such as the default compilers, MPI and mathematical libraries
  • GCC: The default compilers (GNU Compiler Collection)
  • OpenMPI: The default MPI libraries used
  • Intel MKL: The default mathematical libraries
  • gcccore: This is a hidden infrastructure module that users can ignore

To list all available modules:

$ module avail

-------------------------------- Global Aliases --------------------------------
   allinea-cpu -> ddt-cpu    arm-forge-cpu -> ddt-cpu
   allinea-gpu -> ddt-gpu    arm-forge-gpu -> ddt-gpu

--------------------------- Cluster specific modules ---------------------------
   almabte/1.2              libharu/2.3.0            openbugs/3.2.3
   binfotools/1.0           maple/2018               petsc/3.7.5         (t)
   bowtie/1.2.2   (bio,D)   Mathematica/11.0         pgi-community/18.4  (t)
   bowtie2/2.2.8  (bio)     Mathematica/11.3 (D)     pgi-ompi/2.1.2/2018
   cd-hit/4.6.6   (bio)     matlab/mcr       (t)     phylip/3.697        (bio)
   chapel/1.18.0  (t)       matlab/R2013a    (t)     R/3.4.3
   emboss/6.6.0   (bio)     matlab/R2015a    (t)     raxml/8.2.10        (bio)
   fasta/36.3.5e  (bio)     matlab/R2015b    (t)     trinity/2.4.0       (bio,D)
   gifs/workshop            matlab/R2016b    (t)     usearch/10.0.240
   imagej/1.51k             matlab/R2017b    (t)     vtk/8.0.0           (vis,D)
   infernal/1.1.2 (bio)     muscle/3.8.1551  (bio)

[...]

Be aware that this lists only the modules that are available for your currently loaded toolchain, i.e. the combination of compilers, MPI and mathematical libraries currently loaded. If you load different compilers, MPI or mathematical libraries, you will get a different list. Not all software is available for all toolchains, both for compatibility reasons and because the number of possible combinations is simply too great to install everything!
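
For instance, loading the Intel compilers (intel/2016.4 appears as an alternative toolchain in the module spider output shown later in this guide) typically swaps out GCC and changes the list reported by module avail:

$ module load intel/2016.4
$ module avail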

Searching

To search for a specific package without considering the currently loaded toolchain:

$ module keyword foam

--------------------------------------------------------------------------------------

The following modules match your search criteria: "foam"
--------------------------------------------------------------------------------------

  openfoam: openfoam/2.3.1, openfoam/3.0.1, openfoam/4.1, openfoam/5.0, openfoam/7
    OpenFOAM is a free, open source CFD software package. OpenFOAM has an extensive
    range of features to solve anything from complex fluid flows involving chemical
    reactions, turbulence and heat transfer, to solid dynamics and electromagnetics.

--------------------------------------------------------------------------------------

[...]

To learn more about a package, use:

$ module spider openfoam

--------------------------------------------------------------------------------
  openfoam:
--------------------------------------------------------------------------------
    Description:
      OpenFOAM is a free, open source CFD software package. OpenFOAM has an
      extensive range of features to solve anything from complex fluid flows
      involving chemical reactions, turbulence and heat transfer, to solid
      dynamics and electromagnetics.

     Versions:
        openfoam/2.3.1
        openfoam/3.0.1
        openfoam/4.1
        openfoam/5.0

[...]

You can also get detailed information about a specific version:

$ module spider openfoam/5.0

--------------------------------------------------------------------------------------
  openfoam: openfoam/5.0
--------------------------------------------------------------------------------------
    Description:
      OpenFOAM is a free, open source CFD software package. OpenFOAM has an extensive
      range of features to solve anything from complex fluid flows involving chemical
      reactions, turbulence and heat transfer, to solid dynamics and
      electromagnetics.

    Properties:
      Physics libraries/apps / Logiciels de physique

    You will need to load all module(s) on any one of the lines below before the "openfoam/5.0" module is available to load.

      nixpkgs/16.09  gcc/5.4.0  openmpi/2.1.1
      nixpkgs/16.09  intel/2016.4  openmpi/2.1.1
 
[...]

The output of module spider above tells you how to load the module. Here, we see that OpenFOAM 5.0 is available for two different toolchains: GCC 5.4.0 + OpenMPI 2.1.1, and Intel 2016.4 + OpenMPI 2.1.1. We can therefore load OpenFOAM compiled with GCC using:

$ module load gcc/5.4.0
$ module load openmpi/2.1.1
$ module load openfoam/5.0

Note that it is not necessary to load nixpkgs/16.09 since it is always loaded by default and cannot be replaced.

Automatically loading modules

To automatically add modules to your environment every time you open a session on the cluster, first load the modules you want. Then, save this state as your default collection:

$ module load openfoam/5.0
$ module save
Saved current collection of modules to: "default"

Getting help

For more details, see the extensive help provided by the module command itself:

$ module --help

You will also find many examples in Compute Canada’s module usage page.

Compiling custom software

If the software you require is not available on Plato, you can compile it yourself. First, connect to a login node to perform the compilation there:

$ ssh abc123@plato

Then, select a compiler and load the appropriate module. If in doubt, we suggest using GCC 5.4.0:

$ module load gcc/5.4.0

Compilers and other development tools are indicated by the t category in the output of module avail. Available compilers include GCC (gcc), Intel (intel), MCR for MATLAB (mcr) and CUDA-enabled GCC (gcccuda).

If your program uses MPI, you should also make sure that an MPI library is present in your environment. We recommend OpenMPI 2.1.1:

$ module load openmpi/2.1.1

If your software requires other libraries, you should first check if they are already available on Plato. If they are, there is no need for you to install them again! For example, if your package depends on CMake, you can check that this software is available on Plato and load it with:

$ module spider cmake

----------------------------------------------------------------------------
  cmake: cmake/3.12.3
----------------------------------------------------------------------------
    Description:
      CMake, the cross-platform, open-source build system. CMake is a family
      of tools designed to build, test and package software.

[...]

$ module load cmake

Once you have loaded the appropriate modules, follow your software package’s instructions. Most packages use a build system, such as Autotools or CMake, which provides a script to execute instead of calling the compiler directly.
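
As a rough sketch (the package name my_package and the installation path are placeholders; always follow the package's own build instructions), a typical out-of-source CMake build on Plato might look like this:

$ module load gcc/5.4.0 cmake
$ cd my_package
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/software/my_package ..
$ make -j 4
$ make install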

Performing computations

To perform computations on Plato, you must ask the scheduler to allocate resources for you. Your software then runs on one or more compute nodes that have been granted to you. Computationally intensive software should not be run on the login nodes; these are reserved for compiling software, preparing jobs to submit to the scheduler, and similar light tasks. Nor should you connect directly (e.g. via SSH) to compute nodes that have not been allocated to you; the compute nodes are managed by the scheduler.

The scheduler keeps a list of all requests to run computations on Plato and dispatches these requests to the compute nodes. On a large, multi-user system such as Plato, the use of a scheduler is necessary to ensure an efficient use of resources. The scheduler will avoid running your program on nodes that are already busy, wait for a node to become available, and automatically start your job when a node is ready. The scheduler also keeps track of how much computing time has been allocated to each research group to ensure that resources are shared fairly.

Plato uses the SLURM scheduler. There are two main ways to ask SLURM for resources: batch job scripts and interactive sessions. We will introduce both, starting with batch job scripts since they are more common.

Batch job scripts

The SLURM command sbatch allows you to submit a shell script to the scheduler. This script will be executed on a compute node once your resources have been allocated. Here is a minimal example script:

#!/bin/bash

echo "Beginning job script on the batch host"
hostname

echo "Running hostname on each allocated task host"
srun hostname

echo "End of job script"

Assuming this script is saved to test-job.sh, it can be submitted to the scheduler. In the following example, we request resources for 4 tasks:

$ sbatch --ntasks=4 test-job.sh

SLURM writes the script's output to a file named slurm-<job ID>.out in the working directory. Once the job has completed, the output will look like the following:

$ cat slurm-425784.out
Beginning job script on the batch host
plato344
Running hostname on each allocated task host
plato344
plato344
plato345
plato345
End of job script

Let us break down the script and its resulting output line by line. First, #!/bin/bash tells SLURM that this is a script for the bash shell (the most common Linux shell). One compute node (the batch host) executes the script; here, the output tells us this is node plato344. Then, the srun command runs a program in parallel on all allocated resources. Since we requested resources for 4 tasks, hostname was run four times; the output shows that two of these tasks were dispatched to plato344 and the other two to plato345. srun is SLURM's tool for running parallel programs: if your program is MPI-enabled, we recommend launching it with srun rather than mpirun or mpiexec.

sbatch accepts a vast number of options. A very important one is --time, which allows you to request a specific amount of computational time. For example, sbatch --ntasks=4 --time=2-00:00:00 would request resources for 4 tasks and 2 days. If you omit the time, you are granted 20 minutes, allowing you to perform quick tests but nothing more. If your job does not finish within the requested time frame, it will be stopped by the system; you should therefore always request slightly more time than you require. You can request at most 21 days for any given job.

Using the following syntax, options for sbatch can be given in the script rather than on the command line:

#!/bin/bash

#SBATCH --job-name=my_test
#SBATCH --time=2-00:00:00
#SBATCH --ntasks=4
#SBATCH --mem=8G

module load my_program/2.0

srun my_program

The above example also shows how to request a specific amount of total memory per node, and how to load modules in your job script to add the necessary software to the environment on the compute nodes.

To see a list of the jobs submitted to the scheduler, use squeue. Since this is usually a pretty large list, you can filter it to show only your own jobs:

$ squeue -u abc123
 JOBID     USER       ACCOUNT          NAME  ST   TIME_LEFT NODES CPUS       GRES MIN_MEM NODELIST (REASON)
415174   abc123 hpc_e_gratton        exp001   R  2-23:13:37     4   64     (null)      8G plato[425,428,431,438] (None)
415175   abc123 hpc_e_gratton        exp002   R  9-06:30:55     4   64     (null)      8G plato[433,444-446] (None)
415177   abc123 hpc_e_gratton        exp003  PD 10-00:00:00     4   64     (null)      8G  (Resources)

The output tells us that user abc123, who is part of the hpc_e_gratton SLURM account, has submitted 3 jobs, named exp001, exp002 and exp003. Each job has a unique ID in SLURM (415174 for exp001). Two are running (status R) and one is pending (status PD). exp001 has a little under 3 days of allocated time remaining, exp002 has over 9 days, and exp003 requested 10 days. All jobs requested 64 tasks, resulting in 64 CPUs spread over 4 nodes (16 CPUs per node). None of these jobs requested any special resources (GRES is (null)), and they all requested 8G of memory. If a job is not running, the last column gives the reason; here, only the last job is still pending, because the resources it needs are not available yet (Resources), i.e. the nodes are busy.

A job can be cancelled from its SLURM ID:

$ scancel 415174

Interactive sessions

It is sometimes useful to run commands manually instead of wrapping them in a batch job script to pass to sbatch, such as when performing quick tests or using interactive mathematical tools. SLURM makes this possible through interactive sessions.

To start an interactive session through SLURM, use the salloc command:

$ salloc
salloc: Granted job allocation 425277
salloc: Waiting for resource configuration
salloc: Nodes plato313 are ready for job
$ hostname
plato313

This will allocate one task on a node and open a session on the allocated node (as shown by the output of hostname).

The salloc configuration on Plato matches that of Compute Canada machines: when the allocation is granted, a Bash session is automatically started on the allocated node. This can be overridden by specifying the command for salloc to execute (see man salloc).
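
For example, to run a single hypothetical program my_program on the allocated resources and release them as soon as it finishes, you can pass the command to salloc instead of opening an interactive shell (my_program is a placeholder):

$ salloc --ntasks=1 srun my_program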

In an interactive session, you can run commands directly as you would on a login node (but without the restrictions on computationally-demanding tasks).

When you are finished, use exit to close the session on the compute node and relinquish the resource allocation:

$ exit
Connection to plato313 closed.
salloc: Relinquishing job allocation 425277

By default, salloc allocates resources for a single task. However, you can pass it options just like sbatch:

$ salloc --ntasks=16
salloc: Granted job allocation 425810
salloc: Waiting for resource configuration
salloc: Nodes plato[312-313,344] are ready for job

Getting the most out of the scheduler

Since many research groups use Plato, your jobs are likely to spend some time waiting in the queue before they run! You can ask SLURM for the estimated start time of your upcoming jobs:

$ squeue -u abc123 --start

You can also ask SLURM to estimate the start time of a job without actually submitting it to the queue:

$ sbatch --test-only job-script.sh

To minimise the time spent waiting in the queue, make sure to provide a good estimate of the time you require. If you simply request the maximum possible duration (21 days), your wait time will be longer, since the scheduler favours short jobs by giving them a higher priority. Short jobs requiring less than 4 hours are scheduled fastest (we call these “burst” jobs).
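
For example, if you expect the test-job.sh script from the previous section to finish in about two and a half hours, requesting three hours keeps it within the burst category while leaving a safety margin:

$ sbatch --ntasks=4 --time=0-03:00:00 test-job.sh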

The more computing time you consume, however, the more your overall priority decreases, ensuring that all users can run jobs and that no one can hog the cluster. Your priority returns to normal over a two-week period. This means that the optimal way to work with Plato is to submit jobs regularly over the weeks rather than a large number of jobs once in a while. It also means that you should only request the resources necessary to complete your jobs: requesting resources for 64 tasks when your program is only marginally slower with 32, for instance, would penalise you in the long run. You should therefore carefully assess the efficiency of your program before requesting more resources.

Also note that priorities are managed per group rather than per user: all students and staff in a group share the same priority. Again, this ensures that a single research group cannot hog resources to the detriment of others. You can check your cluster usage with sshare.
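
For example, to display the usage and fair-share information recorded for your NSID (replace abc123 with your own):

$ sshare -u abc123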

Under special circumstances, we can alter job priorities to some extent. However, we expect users who make such requests to have already optimised their workflow and to justify their request.

Further reading

The main Plato documentation page gives an overview of the cluster; it is also the root of the Plato documentation and links to all Plato-related topics (subpages). We offer generic SLURM job script examples that will help you fit your program into the Plato scheduler, be it a trivially parallel job or a fine-grained hybrid MPI/OpenMP program. You may also find software-specific documentation if your program is commonly used on Plato. Be sure to browse through the documentation, and happy computing!

Getting help

If you encounter a problem while working on Plato, or otherwise need help, please read our user support page.
