Using GPUDirect Storage to perform HDF5 file I/O#
Overview#
This document describes how to enable GPUDirect Storage (GDS) to perform efficient I/O between GPU memory and the underlying storage device for HDF5 files.
GPUDirect Storage enables a direct data path between local or remote storage and GPU memory, avoiding extra copies through a bounce buffer in the CPU’s memory. A direct memory access (DMA) engine near the NIC or storage device moves data directly into or out of GPU memory, without burdening the CPU or GPU.
The HDF5 library provides the flexibility to attach a custom virtual file driver (VFD) for I/O. In Legate, the HDF5 library can use the vfd-gds virtual file driver, which in turn calls GDS-specific (cuFile) APIs for I/O.
Installation#
Refer to https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html for the GPUDirect Storage installation instructions.
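Once GDS is installed, its availability can be checked with the gdscheck utility that ships with GPUDirect Storage (the path below assumes a default CUDA toolkit layout and may differ on your system):
$ /usr/local/cuda/gds/tools/gdscheck -p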
The vfd-gds library is installed as part of the Legate installation.
To enable GDS for the HDF5 library, set the following environment variable before starting the Legate environment:
$ export LEGATE_IO_USE_VFD_GDS=1
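Alternatively, the same behavior can be requested on the command line with the --io-use-vfd-gds option (as in the examples below) or through LEGATE_CONFIG="--io-use-vfd-gds".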
Read Example#
Below is the output of an example Python program that iterates over all the datasets in an HDF5 file and reads the data. The source code for the example can be found in the nv-legate/legate repository.
The program prints throughput numbers based on the total amount of data read divided by the total elapsed time. Note that this is not a standard I/O benchmark; however, it gives an approximate throughput value. A minimal sketch of the measurement pattern is shown below.
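The exact measurement code lives in the example script; purely as an illustration, here is a minimal sketch of the same pattern using plain h5py (an assumption: h5py and NumPy are available in the environment; this is not the Legate example itself):
import time
import h5py  # assumed available; not part of the Legate example

def read_all_datasets(path):
    # Read every dataset in the file and report bytes read / elapsed seconds.
    total_bytes = 0
    start = time.perf_counter()
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            nonlocal total_bytes
            if isinstance(obj, h5py.Dataset):
                total_bytes += obj[...].nbytes  # read the full dataset into host memory
        f.visititems(visit)
    elapsed = time.perf_counter() - start
    print("Total Data Read:", total_bytes)
    print("Throughput (MB/sec):", total_bytes / (1024 * 1024) / elapsed)

read_all_datasets("/path/to/file.h5")  # placeholder path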
Usage Example - 1#
Assuming an HDF5 dataset exists at /path/to/hdf/data, run the example with 1 rank:
$ legate \
--launcher mpirun \
--ranks-per-node 1 \
--gpus 1 \
--gpu-bind 0 \
--cpu-bind 48-63 \
--mem-bind 3 \
--io-use-vfd-gds \
--sysmem 15000 \
--fbmem 80000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/ex1.py /path/to/hdf/data --n_rank 1
IO MODE : GDS
Total Data Read: 17179869184
Total Turnaround Time (seconds): 31.506664514541626
Throughput (MB/sec): 520.0169631551479
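In the command above, --sysmem, --fbmem, and --zcmem size the system-memory, GPU framebuffer, and zero-copy memory pools (in MiB), while --gpu-bind, --cpu-bind, and --mem-bind pin each rank to particular GPUs, CPU cores, and NUMA domains; see the Legate driver documentation for the full option reference.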
Usage Example - 2#
Running the example with 4 ranks:
$ legate \
--launcher mpirun \
--ranks-per-node 4 \
--gpus 1 \
--gpu-bind 0/1/2/3 \
--cpu-bind 48-63/176-191/16-31/144-159 \
--mem-bind 3/3/1/1 \
--sysmem 15000 \
--io-use-vfd-gds \
--fbmem 80000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/ex1.py /path/to/hdf/data --n_rank 4
IO MODE : GDS
IO MODE : GDS
IO MODE : GDS
IO MODE : GDS
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.61114501953125
Throughput (MB/sec): 902.5611699467328
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.72363376617432
Throughput (MB/sec): 901.1650904397261
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.35940861701965
Throughput (MB/sec): 905.7011555589922
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.73599338531494
Throughput (MB/sec): 901.0119605135058
Write Example#
The write example can be found in the nv-legate/legate repository.
The example writes data to an HDF5 file and measures the throughput.
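The actual benchmark script is in the repository; as a rough illustration only, here is a minimal sketch of a write-throughput measurement using plain h5py (an assumption: h5py and NumPy are available; this is not the Legate benchmark itself):
import time
import numpy as np
import h5py  # assumed available; not part of the Legate example

def write_benchmark(path, n_elements, dtype="float32"):
    # Write one dataset and report MB written per second.
    data = np.random.default_rng(0).random(n_elements).astype(dtype)
    start = time.perf_counter()
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data)
    elapsed = time.perf_counter() - start
    mb = data.nbytes / (1024 * 1024)
    print(f"Wrote {mb:.2f} MB in {elapsed:.3f}s -> {mb / elapsed:.2f} MB/s")

write_benchmark("/path/to/output/bench.h5", 1_000_000)  # placeholder output path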
Example Usage#
$ legate \
--launcher mpirun \
--ranks-per-node 1 \
--io-use-vfd-gds \
--gpus 1 \
--gpu-bind 1 \
--sysmem 15000 \
--fbmem 40000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/hdf5_write_benchmark.py --output-dir /path/to/output --sizes 1000000 --dtypes float32 --iterations 3
===============================================================================
HDF5 WRITE BENCHMARK
================================================================================
Output directory: output
Sizes: [1000000]
Data types: ['float32']
Iterations per config: 3
================================================================================
Benchmarking size=1,000,000, dtype=float32
Iteration 1: Wall=0.444s, Legate=0.240s, Throughput=15.88 MB/s
Iteration 2: Wall=0.014s, Legate=0.014s, Throughput=267.70 MB/s
Iteration 3: Wall=0.010s, Legate=0.010s, Throughput=398.15 MB/s
Average: Wall=0.156s, Legate=0.088s, Throughput=227.24 MB/s
================================================================================
BENCHMARK SUMMARY
================================================================================
Size Type MB Wall(s) Legate(s) Throughput
--------------------------------------------------------------------------------
1,000,000 float32 3.81 0.156 0.088 227.24 MB/s
================================================================================
Best throughput: 227.24 MB/s (size=1,000,000, dtype=float32)
GDS Not Available#
If GDS is not available on the system, the user can still use
LEGATE_CONFIG="--io-use-vfd-gds" (or the --io-use-vfd-gds option) with cuFile compatibility mode.
In that case, the user needs to set the following environment variable:
$ export CUFILE_ALLOW_COMPAT_MODE='true'
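Compatibility mode can typically also be enabled persistently via the allow_compat_mode setting in cufile.json (an assumption based on the cuFile configuration reference; verify the key against your installed cufile.json):
"properties": {
    ...
    "allow_compat_mode": true,
    ...
}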
GDS Performance Tuning#
For larger HDF5 datasets, the following tuning helps improve performance.
Edit /etc/cufile.json and add the following settings under the execution and properties sections
(for CUDA release 12.6 and above):
"execution" : {
...
// max number of host threads per gpu to spawn for parallel IO
"max_io_threads" : 8,
// enable support for parallel IO
"parallel_io" : true,
// maximum parallelism for a single request
"max_request_parallelism" : 8
...
},
"properties": {
...
"per_buffer_cache_size_kb" : 16384,
...
}
For more GDS-specific performance tuning, refer to https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html
By default, cufile.json is located at /etc/cufile.json. The user can specify the path to an alternative cufile.json file using the CUFILE_ENV_PATH_JSON environment variable:
$ export CUFILE_ENV_PATH_JSON=/path/to/cufile.json