Once you have obtained an account on Plato, follow this guide to take your first steps on the cluster.
Receiving important notifications
We strongly recommend that all Plato users subscribe to our Mailing list to stay informed about the status of the machine and receive important notifications.
Accessing the system
Plato is accessible through SSH at plato.usask.ca. Your user name is your NSID and your password is the one associated with your NSID (the one used on paws.usask.ca, for instance). On a UNIX-like machine (such as Linux or macOS), use the following command in a terminal (replacing abc123 with your NSID):
$ ssh abc123@plato.usask.ca
The $ sign is used throughout this guide to denote the bash prompt. The text that follows is what should be typed in your shell (do not type the $ itself).
To transfer files, you can use standard UNIX tools such as scp or rsync.
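For example, the following commands copy a local file to your home directory on Plato and synchronise a local directory to the cluster. This is only a sketch: results.dat and my_project/ are placeholder names, so adjust them to your own files and paths.
$ scp results.dat abc123@plato.usask.ca:
$ rsync -av my_project/ abc123@plato.usask.ca:my_project/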
On Windows machines, you will need an SSH client such as MobaXterm or PuTTY. To transfer files on Windows, we recommend WinSCP.
To access Plato from outside the campus, you will first need to connect to the University’s virtual private network (VPN).
Plato is also accessible via Globus (GridFTP).
Linux basics
Once you are connected to Plato, you will be presented with a Linux command prompt. If you are not familiar with the text-based Linux command-line environment, we recommend following our Introduction to the Linux command line workshop first. This workshop is given in an instructor-led format at least once per term, and the material is always available online for self-learning.
Finding and using provided software
On Plato, the module command allows you to search for installed software packages and add them to your environment. Since the amount of available software is large, and several versions of the same package are often offered, packages are not available directly by default. Instead, you need to add software to your environment manually. If you are already familiar with module, you should feel at home on Plato: the software stack is the same as the one used on Compute Canada machines. However, not all versions and all packages are guaranteed to be available on Plato. If you are not familiar with module, here are a few examples to get you started.
Basics
To load a particular software package:
$ module load openfoam
To load a specific version of a package:
$ module load openfoam/4.1
To unload a module:
$ module unload openfoam
To list the currently loaded modules:
$ module list

Currently Loaded Modules:
  1) CCconfig                 5) intel/2020.1.217 (t)     9) StdEnv/2020 (S)
  2) gentoo/2020        (S)   6) ucx/1.8.0               10) mii/1.1.1
  3) gcccore/.9.3.0     (H)   7) libfabric/1.10.1
  4) imkl/2020.1.217 (math)   8) openmpi/4.0.3    (m)

  Where:
   S:     Module is Sticky, requires --force to unload or purge
   m:     MPI implementations / Implémentations MPI
   math:  Mathematical libraries / Bibliothèques mathématiques
   t:     Tools for development / Outils de développement
   H:     Hidden Module
These are the default modules that are loaded for you when starting a session:
- Gentoo: The base Linux layer that provides a uniform environment on all Compute Canada machines (and Plato)
- StdEnv: The module that defines the standard environment for all Compute Canada machines (and Plato), such as the default compilers, MPI and mathematical libraries
- Intel: The default compilers
- OpenMPI: The default MPI library
- Intel MKL: The default mathematical library
- MII: A smart search engine for module environments
- gcccore: A hidden infrastructure module that users can ignore
- ucx and libfabric: Dependencies of OpenMPI that users can ignore
To list all available modules:
$ module spider

--------------------------------------------------------------------------------------
The following is a list of the modules and extensions currently available:
--------------------------------------------------------------------------------------
  abaqus: abaqus/6.14.1, abaqus/2020, abaqus/2021
    Finite Element Analysis software for modeling, visualization and best-in-class
    implicit and explicit dynamics FEA.

  abinit: abinit/8.2.2
    ABINIT is a package whose main program allows one to find the total energy, charge
    density and electronic structure of systems made of electrons and nuclei (molecules
    and periodic solids) within Density Functional Theory (DFT), using pseudopotentials
    and a planewave or wavelet basis.

[...]

  zeromq: zeromq/4.2.5
    ZeroMQ looks like an embeddable networking library but acts like a concurrency
    framework. It gives you sockets that carry atomic messages across various transports
    like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N
    with patterns like fanout, pub-sub, task distribution, and request-reply. It's fast
    enough to be the fabric for clustered products. Its asynchronous I/O model gives you
    scalable multicore applications, built as asynchronous message-processing tasks. It
    has a score of language APIs and runs on most operating systems.

  zipp: zipp/0.6.0 (E), zipp/1.2.0 (E), ...

Names marked by a trailing (E) are extensions provided by another module.

[...]
Searching
To search for a specific package using a keyword or partial word:
$ module spider gen

--------------------------------------------------------------------------------------
  async-generator:
--------------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------------
  eigen:
--------------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------------
  simplegeneric:
--------------------------------------------------------------------------------------
[...]
To learn more about a package, use:
$ module spider gromacs

--------------------------------------------------------------------------------------
  gromacs:
--------------------------------------------------------------------------------------
    Description:
      GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the
      Newtonian equations of motion for systems with hundreds to millions of particles.
      This is a CPU only build, containing both MPI and threadMPI builds.
      - CC-Wiki: GROMACS

     Versions:
        gromacs/4.6.7
        gromacs/5.0.7
        gromacs/5.1.4
        gromacs/2016.3
        gromacs/2018
        gromacs/2018.1
        gromacs/2019.3
        gromacs/2020.4
        gromacs/2021.3
     Other possible modules matches:
        gromacs-plumed

[...]
You can also get detailed information about a specific version:
$ module spider gromacs/2021.4

--------------------------------------------------------------------------------------
  gromacs: gromacs/2021.4
--------------------------------------------------------------------------------------
    Description:
      GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the
      Newtonian equations of motion for systems with hundreds to millions of particles.
      This is a CPU only build, containing both MPI and threadMPI builds.
      - CC-Wiki: GROMACS

    Properties:
      Physics libraries/apps / Logiciels de physique

    You will need to load all module(s) on any one of the lines below before the
    "gromacs/2021.4" module is available to load.

      StdEnv/2020  gcc/9.3.0  cuda/11.4  openmpi/4.0.3
      StdEnv/2020  gcc/9.3.0  openmpi/4.0.3

[...]
The output of module spider above tells you how to load the module. Here, we see that GROMACS 2021.4 is available for two different toolchains: GCC 9.3.0 + OpenMPI 4.0.3, and GCC 9.3.0 + OpenMPI 4.0.3 + CUDA 11.4 (for GPU computing). We can therefore load GROMACS for GPUs using:
$ module load gcc/9.3.0
$ module load openmpi/4.0.3
$ module load cuda/11.4
$ module load gromacs/2021.4
Getting help
For more details, see the extensive help provided by the module command itself:
$ module --help
You will also find many examples in Compute Canada’s module usage page.
Compiling custom software
If the software you require is not available on Plato, you can compile it yourself. First, connect to a login node to perform the compilation there:
$ ssh abc123@plato
Then, select a compiler and load the appropriate module. If in doubt, we suggest using GCC:
$ module load gcc/9.3.0
Compilers and other development tools are indicated by the t category in the output of module avail. Available compilers include GCC (gcc) and Intel.
If your program uses MPI, you should also make sure that an MPI library is present in your environment. We recommend OpenMPI:
$ module load openmpi/4.0.3
If your software requires other libraries, you should first check if they are already available on Plato. If they are, there is no need for you to install them again! For example, if your package depends on HDF5, you can check that this software is available on Plato and load it with:
$ module spider hdf5

--------------------------------------------------------------------------------------
  hdf5:
--------------------------------------------------------------------------------------
    Description:
      HDF5 is a data model, library, and file format for storing and managing data. It
      supports an unlimited variety of datatypes, and is designed for flexible and
      efficient I/O and for high volume and complex data.

[...]

$ module load gcc hdf5
Once you have loaded the appropriate modules, follow your software package’s instructions. Most packages use a build system, such as Autotools or CMake, which provides a script to execute instead of calling the compiler directly.
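As an illustration only, here is what a typical CMake-based build might look like. The package name, version and install prefix are hypothetical, and we assume a cmake module is available in the software stack; always follow the instructions shipped with your software.
$ module load gcc/9.3.0 cmake
$ tar xf my_package-1.0.tar.gz
$ cd my_package-1.0
$ cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/software/my_package
$ cmake --build build
$ cmake --install build
Installing under your home directory (as in this sketch) avoids the need for administrator privileges, which regular users do not have on the cluster.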
Performing computations
To perform computations on Plato, you must ask the scheduler to allocate resources for you. Your software then runs on one or more compute nodes that have been granted to you. Computationally intensive software should not be run on the login nodes; these are reserved for software compilation, preparing jobs to submit to the scheduler, etc. You should not connect to compute nodes that have not been allocated to you either (e.g. using SSH); the compute nodes are managed by the scheduler.
The scheduler keeps a list of all requests to run computations on Plato and dispatches these requests to the compute nodes. On a large, multi-user system such as Plato, the use of a scheduler is necessary to ensure an efficient use of resources. The scheduler will avoid running your program on nodes that are already busy, wait for a node to become available, and automatically start your job when a node is ready. The scheduler also keeps track of how much computing time has been allocated to each research group to ensure that resources are shared fairly.
Plato uses the SLURM scheduler. There are two main ways to ask SLURM for resources: batch job scripts and interactive sessions. We will introduce both, starting with batch job scripts since they are more common.
Batch job scripts
The sbatch SLURM command allows you to submit a shell script to the scheduler. This shell script will be executed on a compute node when your resources have been allocated. Here is a minimal example script:
#!/bin/bash
echo "Beginning job script on the batch host"
hostname
echo "Running hostname on each allocated task host"
srun hostname
echo "End of job script"
Assuming this script is saved to test-job.sh, it can be submitted to the scheduler. In the following example, we request resources for 4 tasks:
$ sbatch --ntasks=4 test-job.sh
SLURM will create an output file for the script in the working directory. Once the job has completed, the output will look like the following:
$ cat slurm-425784.out
Beginning job script on the batch host
plato344
Running hostname on each allocated task host
plato344
plato344
plato345
plato345
End of job script
Let us break down the script and its resulting output line by line. First, #!/bin/bash tells SLURM that this is a script for the bash shell (the most common Linux shell). One compute node (the batch host) will execute the script. Here, the output tells us this is node plato344. Then, the srun command runs a program in parallel on all allocated resources. Since we requested resources for 4 tasks, the program was run four times. The output shows that two of these tasks were dispatched to plato344, and the two others to plato345. The srun command is provided by SLURM to run parallel programs. If your program is MPI-enabled, we recommend running it with srun rather than mpirun or mpiexec.
sbatch accepts a vast number of options. A very important one is --time, which allows you to request a specific amount of computational time. For example, sbatch --ntasks=4 --time=2-00:00:00 would request resources for 4 tasks and 2 days. If you omit the time, you are granted 20 minutes, allowing you to perform quick tests but nothing more. If your job does not finish within the requested time frame, it will be stopped by the system; you should therefore always request slightly more time than you require. You can request at most 21 days for any given job.
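For instance, to submit the test-job.sh script from above with an explicit two-day time limit:
$ sbatch --ntasks=4 --time=2-00:00:00 test-job.sh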
Using the following syntax, options for sbatch can be given in the script rather than on the command line:
#!/bin/bash
#SBATCH --job-name=my_test
#SBATCH --time=2-00:00:00
#SBATCH --ntasks=4
#SBATCH --mem=8G

module load my_program/2.0
srun my_program
The above example also shows how to request a specific amount of total memory per node, and how to load modules in your job script to add the necessary software to the environment on the compute nodes.
To see a list of the jobs submitted to the scheduler, use squeue. Since this is usually a pretty large list, you can filter it to show only your own jobs:
$ squeue -u abc123
  JOBID   USER    ACCOUNT        NAME    ST  TIME_LEFT    NODES  CPUS  GRES    MIN_MEM  NODELIST (REASON)
 415174   abc123  hpc_e_gratton  exp001  R   2-23:13:37   4      64    (null)  8G       plato[425,428,431,438] (None)
 415175   abc123  hpc_e_gratton  exp002  R   9-06:30:55   4      64    (null)  8G       plato[433,444-446] (None)
 415177   abc123  hpc_e_gratton  exp003  PD  10-00:00:00  4      64    (null)  8G       (Resources)
The output tells us that user abc123, who is part of the hpc_e_gratton SLURM account, has submitted 3 jobs, with names exp001, exp002 and exp003. Each job has a unique ID in SLURM (415174 for exp001). Two are running (status R), and one is pending (status PD). exp001 has a little under 3 days of allocated time remaining, exp002 has over 9 days, and exp003 requested 10 days. All jobs requested 64 tasks, resulting in 64 CPU cores being used on 4 nodes (16 cores/node). None of these jobs requested any special resources (GRES is (null)), and they all requested 8G of memory. If a job is not running, the last column gives the reason; here, only the last job is still pending, because (Resources) are not available yet (i.e. nodes are busy).
A job can be cancelled using its SLURM ID:
$ scancel 415174
Interactive sessions
It is sometimes useful to run commands manually instead of wrapping them in a batch job script to pass to sbatch, such as when performing quick tests or using interactive mathematical tools. SLURM makes this possible through interactive sessions.
To start an interactive session through SLURM, use the salloc command:
$ salloc
salloc: Granted job allocation 425277
salloc: Waiting for resource configuration
salloc: Nodes plato313 are ready for job
$ hostname
plato313
This will allocate one task on a node and open a session on the allocated node (as shown by the output of hostname).
The salloc configuration on Plato matches that of Compute Canada machines: when the allocation is granted, a Bash session is automatically started on the allocated node. This can be overridden by specifying the command for salloc to execute (see man salloc).
From there, you can run commands directly as you would on a login node (but without the restrictions on computationally-demanding tasks).
When you are finished, use exit to close the session on the compute node and relinquish the resource allocation:
$ exit
Connection to plato344 closed.
salloc: Relinquishing job allocation 425809
By default, salloc allocates resources for a single task. However, you can pass it options just like sbatch:
$ salloc --ntasks=16
salloc: Granted job allocation 425810
salloc: Waiting for resource configuration
salloc: Nodes plato[312-313,344] are ready for job
Getting the most out of the scheduler
Since many research groups use Plato, your jobs are likely to spend some time waiting in the queue before they run! You can ask SLURM for the estimated start time of your upcoming jobs:
$ squeue -u abc123 --start
You can also ask SLURM to estimate the start time of a job without submitting it to the queue:
$ sbatch --test-only job-script.sh
To minimise the time spent waiting in the queue, make sure to provide a good estimate of the time you require. If you simply request the maximum possible duration (21 days), your wait time will be longer, since the scheduler favours short jobs by giving them a higher priority. Jobs requiring less than 4 hours start fastest (we call these “burst” jobs).
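For example, submitting the earlier test-job.sh with a time limit under the 4-hour threshold (the 3-hour value below is only an illustration) would qualify it as a burst job:
$ sbatch --ntasks=4 --time=3:00:00 test-job.sh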
The more computing time you consume, however, the more your overall priority decreases, to ensure that all users can run jobs and that no one can hog the cluster. Your overall priority returns to normal over a two-week period. This means that the optimal way to work with Plato is to submit jobs regularly over the weeks rather than a large number of jobs once in a while. It also means that you should request only the resources necessary to complete your jobs. Asking for resources for 64 tasks when your program is only marginally slower with 32, for instance, would penalise you in the long run. You should therefore carefully assess the efficiency of your program before requesting more resources.
Also note that priorities are managed per group rather than per user: all students and staff in a group share the same priority. Again, this ensures that a single research group cannot hog resources to the detriment of others. You can check your cluster usage with sshare.
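For instance, the following restricts the output to the usage and share information associated with your own NSID (abc123 is a placeholder; the exact columns shown may differ depending on the SLURM configuration):
$ sshare -u abc123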
Under special circumstances, we can alter job priorities to some extent. However, we expect users who make such requests to have already optimised their workflow and to justify their request.
Training
This Getting started guide only scratches the surface! To help you learn more about Linux, HPC, and the Plato cluster, we offer hands-on workshops every term; see our Training page. In particular, we recommend Introduction to high-performance computing, which is an extended version of the present guide. Even if the workshop you are interested in is not scheduled at this time, all course material is available online for self-learning.
Documentation
The main Plato documentation page gives an overview of the cluster; it is also the root of the Plato documentation and links all Plato-related topics (subpages). We offer generic SLURM job script examples that will help you fit your program into the Plato scheduler, be it a trivial parallel job or a fine-grained hybrid MPI/OpenMP program. You may also be able to find software-specific documentation if your program is commonly used on Plato. Be sure to browse through the documentation, and happy computing!
User support
If you encounter a problem while working on Plato, or otherwise need help, please read our user support page.
References
- PuTTY (Windows SSH client)
- WinSCP (Windows file transfer client)
- Globus (GridFTP transfers)
- DATASTORE (University storage service)
- Lmod documentation (module command)
- SLURM documentation (scheduler)