Using GPUDirect Storage to perform HDF5 file I/O#
Overview#
This document describes how to enable GPUDirect Storage (GDS) to perform I/O efficiently between GPU memory and the underlying storage device for HDF5 files.
GPUDirect Storage enables a direct path between local or remote storage and GPU memory, avoiding extra copies through a bounce buffer in the CPU’s memory. It enables a direct memory access (DMA) engine near the NIC or storage to move data on a direct path into or out of GPU memory, all without burdening the CPU or GPU.
The HDF5 library provides the flexibility to attach a custom virtual file driver (VFD) for I/O. In Legate, the HDF5 library can use the vfd-gds virtual file driver, which in turn calls GDS-specific APIs for the I/O.
Installation#
Please refer to https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html for GPUDirect Storage installation instructions.
The vfd-gds library is installed as part of the Legate installation.
Legate tries to detect whether GDS is available and enables it automatically. To force-enable GDS for the HDF5 library, use the following configuration flag:
$ export LEGATE_CONFIG="--io-use-vfd-gds"
Read Example#
Below is the output of an example Python program which iterates over all the datasets in an HDF5 file and reads the data. The source code for the example may be found at nv-legate/legate.
The program prints throughput numbers based on the total amount of data read over the total elapsed time. Please note that this is not a standard benchmark for measuring I/O throughput; however, it gives an approximate throughput value.
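For orientation, a minimal sketch of such a program is shown below. It walks every dataset in the file with plain h5py, reads each one, and reports total bytes read over elapsed wall time. It is only meant to illustrate how the throughput number is derived; it does not go through Legate or the GDS path, so the shipped example remains the authoritative reference.
import sys
import time
import h5py

def read_all_datasets(path):
    total_bytes = 0
    start = time.time()
    with h5py.File(path, "r") as f:
        # Visit every dataset in the file, including those nested inside groups.
        def visit(name, obj):
            nonlocal total_bytes
            if isinstance(obj, h5py.Dataset):
                data = obj[...]          # read the full dataset
                total_bytes += data.nbytes
        f.visititems(visit)
    elapsed = time.time() - start
    print("Total Data Read:", total_bytes)
    print("Total Turnaround Time (seconds):", elapsed)
    print("Throughput (MB/sec):", total_bytes / (1024 * 1024) / elapsed)

if __name__ == "__main__":
    read_all_datasets(sys.argv[1])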
Usage Example - 1#
Assuming the HDF5 data set at /path/to/hdf/data exists, running the example with 1 rank:
$ legate \
--launcher mpirun \
--ranks-per-node 1 \
--gpus 1 \
--gpu-bind 0 \
--cpu-bind 48-63 \
--mem-bind 3 \
--io-use-vfd-gds \
--sysmem 15000 \
--fbmem 80000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/ex1.py /path/to/hdf/data --n_rank 1
IO MODE : GDS
Total Data Read: 17179869184
Total Turnaround Time (seconds): 31.506664514541626
Throughput (MB/sec): 520.0169631551479
Usage Example - 2#
Running the example with 4 ranks:
$ legate \
--launcher mpirun \
--ranks-per-node 4 \
--gpus 1 \
--gpu-bind 0/1/2/3 \
--cpu-bind 48-63/176-191/16-31/144-159 \
--mem-bind 3/3/1/1 \
--sysmem 15000 \
--io-use-vfd-gds \
--fbmem 80000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/ex1.py /path/to/hdf/data --n_rank 4
IO MODE : GDS
IO MODE : GDS
IO MODE : GDS
IO MODE : GDS
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.61114501953125
Throughput (MB/sec): 902.5611699467328
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.72363376617432
Throughput (MB/sec): 901.1650904397261
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.35940861701965
Throughput (MB/sec): 905.7011555589922
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.73599338531494
Throughput (MB/sec): 901.0119605135058
Write Example#
The write example can be found at nv-legate/legate.
The example writes data to an HDF5 file and measures the throughput.
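A minimal sketch of such a measurement is shown below, using plain h5py and NumPy. It only illustrates how a wall-time throughput figure is derived and is not the Legate benchmark itself, which additionally reports Legate runtime time and supports multiple sizes, data types, and iterations.
import time
import numpy as np
import h5py

def write_once(path, n_elements, dtype="float32"):
    data = np.random.default_rng().random(n_elements).astype(dtype)
    start = time.time()
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data)
    elapsed = time.time() - start
    mb = data.nbytes / (1024 * 1024)
    print(f"Wall={elapsed:.3f}s, Throughput={mb / elapsed:.2f} MB/s")

write_once("/tmp/bench.h5", 1_000_000)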
Example Usage#
$ legate \
--launcher mpirun \
--ranks-per-node 1 \
--io-use-vfd-gds \
--gpus 1 \
--gpu-bind 1 \
--sysmem 15000 \
--fbmem 40000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/hdf5_write_benchmark.py --output-dir /path/to/output --sizes 1000000 --dtypes float32 --iterations 3
================================================================================
HDF5 WRITE BENCHMARK
================================================================================
Output directory: output
Sizes: [1000000]
Data types: ['float32']
Iterations per config: 3
================================================================================
Benchmarking size=1,000,000, dtype=float32
Iteration 1: Wall=0.444s, Legate=0.240s, Throughput=15.88 MB/s
Iteration 2: Wall=0.014s, Legate=0.014s, Throughput=267.70 MB/s
Iteration 3: Wall=0.010s, Legate=0.010s, Throughput=398.15 MB/s
Average: Wall=0.156s, Legate=0.088s, Throughput=227.24 MB/s
================================================================================
BENCHMARK SUMMARY
================================================================================
Size Type MB Wall(s) Legate(s) Throughput
--------------------------------------------------------------------------------
1,000,000 float32 3.81 0.156 0.088 227.24 MB/s
================================================================================
Best throughput: 227.24 MB/s (size=1,000,000, dtype=float32)
GDS Not Available#
If GDS is not available on the system, the user can still use
LEGATE_CONFIG="--io-use-vfd-gds" with cuFile compatibility mode.
In that case, set the following environment variable:
$ export CUFILE_ALLOW_COMPAT_MODE='true'
GDS Performance Tuning#
For larger HDF5 datasets, the following tuning helps improve performance.
Edit /etc/cufile.json and add the following settings under the execution and properties sections
(for CUDA release 12.6 and above):
"execution" : {
...
// max number of host threads per gpu to spawn for parallel IO
"max_io_threads" : 8,
// enable support for parallel IO
"parallel_io" : true,
// maximum parallelism for a single request
"max_request_parallelism" : 8
...
},
"properties": {
...
"per_buffer_cache_size_kb" : 16384,
...
}
For more GDS-specific performance tuning, please refer to https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html
The cufile.json file is usually located at /etc/cufile.json. The user can specify the path to a custom cufile.json file using the CUFILE_ENV_PATH_JSON environment variable.
$ export CUFILE_ENV_PATH_JSON=/path/to/cufile.json
HDF5 File Layout#
Legate optimizes HDF5 file reading by leveraging the underlying file layout. HDF5 supports several layout types, and Legate provides specialized handling for contiguous, chunked, and virtual layouts.
Contiguous Layout
For contiguous datasets, Legate partitions the data by splitting along the slowest-varying dimension while preserving the shape of all other dimensions.
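As a rough illustration of this scheme (an approximation of the behaviour described above, not Legate's actual implementation), the per-partition slices for a 3-D dataset split across four readers could be computed as follows:
def partition_slowest_dim(shape, num_parts):
    # Split along axis 0 (the slowest-varying dimension in C order),
    # keeping all other dimensions intact.
    n = shape[0]
    step = (n + num_parts - 1) // num_parts
    slices = []
    for start in range(0, n, step):
        stop = min(start + step, n)
        slices.append((slice(start, stop),) + tuple(slice(0, d) for d in shape[1:]))
    return slices

# e.g. a (1000, 512, 512) dataset split for 4 readers:
for s in partition_slowest_dim((1000, 512, 512), 4):
    print(s)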
Chunked Layout
For chunked datasets, Legate uses the chunk dimensions to guide partitioning. It splits chunks along the slowest-varying dimension to achieve the desired number of partitions. Reads are issued according to the chunk sizes, so larger chunk sizes help improve performance.
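Chunk sizes are fixed when a dataset is created, so the writer controls this. For example, with h5py a dataset can be created with explicitly larger chunks; the shapes below are illustrative only, and chunk sizes should match your access pattern and memory budget.
import numpy as np
import h5py

data = np.zeros((1024, 256, 256), dtype="float32")
with h5py.File("/tmp/chunked.h5", "w") as f:
    # Each chunk spans 64 full planes of the slowest-varying dimension,
    # so every chunked read moves a reasonably large block (~16 MiB).
    f.create_dataset("data", data=data, chunks=(64, 256, 256))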
Virtual Dataset Layout
For virtual datasets, all source files must use the same layout type.
If the source files use contiguous layout, Legate applies the same partitioning strategy as for regular contiguous datasets. It expects all the source files to have the same block shapes, except for the edge blocks.
If the source files use chunked layout, Legate applies the same partitioning strategy as for regular chunked datasets. It expects all the source files to have the same chunk sizes.
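For reference, a virtual dataset that stitches together identically shaped source files can be built with h5py's virtual dataset API, roughly as follows (the file names, dataset names, and shapes are illustrative):
import h5py

src_files = ["part0.h5", "part1.h5", "part2.h5", "part3.h5"]
src_shape = (1024, 256, 256)          # every source file uses the same shape and layout

layout = h5py.VirtualLayout(shape=(len(src_files) * src_shape[0],) + src_shape[1:],
                            dtype="float32")
for i, fname in enumerate(src_files):
    # Map each source dataset into its slot along the slowest-varying dimension.
    vsource = h5py.VirtualSource(fname, "data", shape=src_shape)
    layout[i * src_shape[0]:(i + 1) * src_shape[0]] = vsource

with h5py.File("combined.h5", "w") as f:
    f.create_virtual_dataset("data", layout)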
Performance Considerations#
Based on Legate performance testing on distributed file systems such as Lustre, the following factors have been shown to impact read performance:
File organization: Virtual datasets with data distributed across multiple source files perform better than a single large file.
Dataset size: Larger datasets generally achieve higher throughput than smaller datasets.
Read parallelism: Increasing parallelism through cuFile threads and Legate processes improves performance.
Further, to extract maximum performance, it is recommended to use file-system-specific settings, such as Lustre file striping, which controls how data is placed on the underlying storage devices.