Usage#
Running Legate Programs#
Python programs using the Legate APIs (directly or through a Legate-based library such as cuPyNumeric) can be run using the standard Python interpreter:
$ python myprog.py <myprog.py options>
By default this invocation will use most of the hardware resources (e.g. CPU cores, RAM and GPUs) available on the current machine, but this can be controlled; see Resource Allocation.
You can also use the custom legate driver script, which makes configuration easier and offers more execution options, particularly for Distributed Launch:
$ legate myprog.py <myprog.py options>
Compiled C++ programs using the Legate API can also be run using the legate
driver:
$ legate ./myprog <myprog options>
or invoked directly:
$ ./myprog <myprog options>
Resource Allocation#
By default Legate will query the available hardware on the current machine, and
reserve for its use all CPU cores, all GPUs and most of the available memory.
You can use LEGATE_SHOW_CONFIG=1
to inspect the exact set of resources that
Legate decided to reserve. You can fine-tune Legate’s default resource
reservation by passing specific flags to the legate
driver script, listed
later in this section.
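For example, to inspect the configuration chosen for a run of myprog.py:
$ LEGATE_SHOW_CONFIG=1 legate myprog.py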
You can also use LEGATE_AUTO_CONFIG=0
to disable Legate’s automatic
configuration. In this mode Legate will only reserve a minimal set of resources
(only 1 CPU core for task execution, no GPUs, minimal system memory allocation),
and any increases must be specified manually.
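For example, a minimal sketch of disabling automatic configuration and requesting resources explicitly (the flag values here are purely illustrative):
$ LEGATE_AUTO_CONFIG=0 legate --cpus 4 --gpus 1 --sysmem 4000 myprog.py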
The following legate
flags control how many processors are used by Legate:
--cpus : how many individual CPU threads are spawned
--omps : how many OpenMP groups are spawned
--ompthreads : how many threads are spawned per OpenMP group
--gpus : how many GPUs are used
The following flags control how much memory is reserved by Legate:
--sysmem : how much DRAM (in MiB) to reserve
--numamem : how much NUMA-specific DRAM (in MiB) to reserve per NUMA node
--fbmem : how much GPU memory (in MiB) to reserve per GPU
See legate --help
for a full list of accepted configuration options.
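For instance, to run tasks on OpenMP processors with NUMA-local memory, you might combine these flags (the values shown are illustrative, not recommendations):
$ legate --omps 2 --ompthreads 8 --numamem 10000 myprog.py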
For example, if you wanted to use only part of the resources on a DGX station, you might run your application as follows:
$ legate --gpus 4 --sysmem 1000 --fbmem 15000 myprog.py
This will make only 4 of the 8 GPUs available for use by Legate. It will also allow Legate to consume up to 1000 MiB of DRAM and 15000 MiB of each GPU’s memory for a total of 60000 MiB of GPU memory.
The same configuration can also be passed through the environment variable LEGATE_CONFIG:
$ LEGATE_CONFIG="--gpus 4 --sysmem 1000 --fbmem 15000" legate myprog.py
including when using the standard Python interpreter:
$ LEGATE_CONFIG="--gpus 4 --sysmem 1000 --fbmem 15000" python myprog.py
or when running a compiled C++ Legate program directly:
$ LEGATE_CONFIG="--gpus 4 --sysmem 1000 --fbmem 15000" ./myprog
To see the full list of arguments accepted in LEGATE_CONFIG, you can pass LEGATE_CONFIG="--help":
$ LEGATE_CONFIG="--help" ./myprog
You can also allocate resources when running in interactive mode (by not passing
any *.py
files on the command line):
$ legate --gpus 4 --sysmem 1000 --fbmem 15000
Python 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Distributed Launch#
You can run your program across multiple nodes by using the --nodes option followed by the number of nodes to be used. When doing a multi-process run, you must specify a launcher program, which will do the actual spawning of the processes. Run a command like the following from the same machine where you would normally invoke mpirun:
$ legate --nodes 2 --launcher mpirun --cpus 4 --gpus 1 myprog.py
In the above invocation the mpirun
launcher will be used to spawn one Legate
process on each of two nodes. Each process will use 4 CPU cores and 1 GPU on its
assigned node.
The default Legate conda packages include networking support based on UCX, but GASNet-based packages are also available.
Note that resource setting flags such as --cpus 4
and --gpus 1
refer to
each process. In the above invocation, each one of the two launched processes
will reserve 4 CPU cores and 1 GPU, for a total of 8 CPU cores and 2 GPUs across
the whole run.
Check the output of legate --help
for the full list of supported launchers.
You can also perform the same launch as above externally to legate:
$ mpirun -n 2 -npernode 1 legate --cpus 4 --gpus 1 myprog.py
or use python
directly:
$ LEGATE_CONFIG="--cpus 4 --gpus 1" mpirun -n 2 -npernode 1 -x LEGATE_CONFIG python myprog.py
Multiple processes (“ranks”) can also be launched on each node, using the
--ranks-per-node
legate
option:
$ legate --ranks-per-node 2 --launcher mpirun myprog.py
The above will launch two processes on the same node (the default value for
--nodes
is 1).
Because Legate’s automatic configuration will not check for other processes sharing the same node, each of these two processes will attempt to use the full set of CPU cores on the node, causing contention. Even worse, each process will try to reserve most of the system memory in the machine, leading to a memory reservation failure at startup.
To work around this, you will want to explicitly reduce the resources requested by each process:
$ legate --ranks-per-node 2 --launcher mpirun --cpus 4 --sysmem 1000 myprog.py
With this change, each process will only reserve 4 CPU cores and 1000 MiB of system memory, so there will be enough resources for both.
Even with the above change, contention remains an issue, as the processes may end
up overlapping on their use of CPU cores. To work around this, you can
explicitly partition CPU cores between the processes running on the same node,
using the --cpu-bind
legate
option:
$ legate --ranks-per-node 2 --launcher mpirun --cpus 4 --sysmem 1000 --cpu-bind 0-15/16-32 myprog.py
The above command will restrict the first process to CPU cores 0-15, and the second to CPU cores 16-32, thus removing any contention. Each process will reserve 4 out of its allocated cores for task execution.
You can similarly restrict processes to specific NUMA domains, GPUs and NICs using --mem-bind, --gpu-bind and --nic-bind, respectively.
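For example, extending the earlier two-rank command so that each process is also bound to its own GPU (this assumes --gpu-bind accepts the same slash-separated per-rank syntax as --cpu-bind; the GPU indices are illustrative):
$ legate --ranks-per-node 2 --launcher mpirun --cpus 4 --sysmem 1000 --cpu-bind 0-15/16-32 --gpus 1 --gpu-bind 0/1 myprog.py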
You can also launch multiple processes per node when doing an external launch, but you then have to manually control the binding of resources:
$ mpirun -n 2 -npernode 2 --bind-to socket legate --cpus 4 --sysmem 1000 myprog.py
The above will launch two processes on one node, and relies on mpirun
to
bind each process to a separate CPU socket, thus partitioning the CPU cores
between them.
Running Legate on Typical SLURM Clusters#
Here is an example showing how to run Legate programs on typical SLURM clusters.
To get started, create a conda environment and install Legate, following the installation guide:
$ conda create -n legate -c conda-forge -c legate legate
For interactive runs, here are the steps:
Use srun
from the login node to allocate compute nodes:
$ srun --exclusive -J <job-name> -p <partition> -A <account> -t <time> -N <nodes> --pty bash
Once the compute nodes are allocated, use the legate
driver script to launch
applications:
$ source "<path-to-conda>/etc/profile.d/conda.sh" # Needed if conda isn't already loaded
$ conda activate legate
$ legate --launcher mpirun --verbose prog.py
You need to ensure the correct launcher is specified for your cluster. Some SLURM clusters support both srun and mpirun, while others only support srun.
The driver script should be able to infer the number of nodes to launch over, by
reading environment variables set by SLURM. Inspect the output of --verbose
,
which lists the full launch command generated by the legate
driver script,
to confirm that this is the case. If the setting is incorrect, set --nodes
and/or --ranks-per-node
explicitly to override it.
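For example, to override the inferred settings explicitly (the values are illustrative; srun is used as the launcher here):
$ legate --launcher srun --nodes 2 --ranks-per-node 1 --verbose prog.py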
Each Legate process should be able to detect the correct hardware configuration automatically; see the Resource Allocation section.
A more common way to run programs on clusters is via a SLURM script. Here is a sample script saved as run_legate.slurm:
#!/bin/bash
#SBATCH --job-name=<job-name> # Job name
#SBATCH --output=legate.out # Output file
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks-per-node=1 # Processes per node
#SBATCH --time=00:10:00 # Time limit hrs:min:sec
#SBATCH --partition=<partition> # Partition name
#SBATCH --account=<account> # Account name
source "<path-to-conda>/etc/profile.d/conda.sh" # Needed if conda isn't already loaded
conda activate legate
legate --launcher mpirun --verbose prog.py
Submit the script with sbatch
:
$ sbatch run_legate.slurm
Debugging and Profiling#
Legate comes with tools that you can use to better understand your program, from both a correctness and a performance standpoint.
For correctness, Legate has facilities for visualizing the dataflow and task graph from a run of the application. First, run your application with the --spy legate option (or pass the same through LEGATE_CONFIG):
legate --spy myprog.py
Legate will collect the necessary logging information in the legate_*.log
files (one per process). By default these files are placed in the same directory
where the program was launched (you can control this with the --logdir
option). To post-process these files, install
GraphViz on your machine, then run:
legion_spy.py --dataflow --event legate_*.log
This command will produce a dataflow visualization in dataflow_[...].pdf, and a task graph visualization in event_[...].pdf. Note that these files can grow to be fairly large for non-trivial programs, so we encourage you to keep your programs small when using these visualizations.
For profiling, first you need to install the Legate profile viewer, available on
the Legate conda channel as legate-profiler
:
conda install -c conda-forge -c legate legate-profiler
Then you need to pass the --profile flag to the legate driver when launching the application (or through LEGATE_CONFIG):
legate --profile myprog.py
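The same flag can also be passed through LEGATE_CONFIG, for example when running a compiled C++ program directly (./myprog as before):
LEGATE_CONFIG="--profile" ./myprog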
At the end of execution you will have a set of legate_*.prof
files (one per
process). By default these files are placed in the same directory where the
program was launched (you can control this with the --logdir
option). These
files can be opened with the profile viewer, to see a timeline of your
program’s execution:
legion_prof view legate_*.prof
We recommend that you do not mix debugging and profiling in the same run, as some of the logging for the debugging features requires significant file I/O that can adversely affect the performance of the application.