Using GPUDirect Storage to perform HDF5 file I/O#
Overview#
This document describes how to enable GPUDirect Storage (GDS) to perform efficient I/O between GPU memory and the underlying storage device for HDF5 files.
GPUDirect Storage creates a direct path between local or remote storage and GPU memory, avoiding extra copies through a bounce buffer in the CPU’s memory. A direct memory access (DMA) engine near the NIC or storage moves data directly into or out of GPU memory, without burdening the CPU or GPU.
The HDF5 library allows a custom virtual file driver (VFD) to be attached for I/O. In Legate, the HDF5 library can use the vfd-gds driver, which in turn calls the GDS-specific APIs to perform the I/O.
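The vfd-gds driver sits below the HDF5 API, so application code that reads HDF5 datasets is unchanged; only the data path changes. Purely for illustration, the following sketch shows what GDS-style I/O looks like at the lower level, using the KvikIO Python bindings for cuFile to read a raw (non-HDF5) file straight into GPU memory. The file path is a placeholder, and this is not the HDF5 code path described above.
# Illustration only: a raw cuFile (GDS) read into GPU memory via KvikIO.
# The vfd-gds HDF5 driver issues comparable cuFile calls internally.
import cupy
import kvikio

# Destination buffer in GPU memory (1 MiB of bytes).
gpu_buf = cupy.empty(1 << 20, dtype=cupy.uint8)

# "/path/to/raw.bin" is a placeholder path for this sketch.
f = kvikio.CuFile("/path/to/raw.bin", "r")
nbytes = f.read(gpu_buf)   # data moves from storage directly into GPU memory
f.close()

print(f"read {nbytes} bytes into GPU memory")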
Installation#
Please refer to https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html for the installation instructions of GPUDirect Storage.
The vfd-gds library is installed as part of the Legate installation.
To enable GDS for the HDF5 library, set the following environment variable before starting the Legate environment:
$ export LEGATE_IO_USE_VFD_GDS=1
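To confirm that GDS itself is installed and usable on the system, the gdscheck utility that ships with GDS can be run; the path below is a common default and may differ depending on the CUDA release.
$ /usr/local/cuda/gds/tools/gdscheck.py -p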
Example#
Below is the output of an example Python program that iterates over all the datasets in an HDF5 file and reads their data. The source code for the example can be found at nv-legate/legate.
The program prints a throughput number computed as the total amount of data read divided by the total elapsed time. Note that this is not a standard I/O benchmark, but it gives an approximate throughput value.
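For reference, the measurement pattern the example follows can be sketched with h5py as below. This is a simplified stand-in, not the actual nv-legate/legate example; it reads through host memory rather than through the GDS-enabled Legate I/O path, but it computes the same kind of throughput figure.
# Simplified sketch: iterate the datasets in an HDF5 file, read each one,
# and report total bytes read over elapsed time.
import sys
import time

import h5py

path = sys.argv[1]
total_bytes = 0

start = time.perf_counter()
with h5py.File(path, "r") as f:
    for name, obj in f.items():          # top-level objects only, for brevity
        if isinstance(obj, h5py.Dataset):
            data = obj[...]              # read the full dataset
            total_bytes += data.nbytes
elapsed = time.perf_counter() - start

print(f"Total Data Read: {total_bytes}")
print(f"Total Turnaround Time (seconds): {elapsed}")
print(f"Throughput (MB/sec): {total_bytes / (1 << 20) / elapsed}")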
Usage Example - 1#
Assuming the HDF5 dataset at /path/to/hdf/data exists, running the example with 1 rank:
$ LEGATE_IO_USE_VFD_GDS=1 legate \
--launcher mpirun \
--ranks-per-node 1 \
--gpus 1 \
--gpu-bind 0 \
--cpu-bind 48-63 \
--mem-bind 3 \
--sysmem 15000 \
--fbmem 80000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/ex1.py /path/to/hdf/data --n_rank 1
IO MODE : GDS
Total Data Read: 17179869184
Total Turnaround Time (seconds): 31.506664514541626
Throughput (MB/sec): 520.0169631551479
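The reported "MB/sec" figure corresponds to MiB (2^20 bytes) per second, which can be verified directly from the printed numbers:
# 16 GiB read in ~31.5 seconds.
total_bytes = 17179869184
elapsed = 31.506664514541626
print(total_bytes / 2**20 / elapsed)   # ~520.017, matching the reported throughput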
Usage Example - 2#
Running the example with 4 ranks:
$ LEGATE_IO_USE_VFD_GDS=1 legate \
--launcher mpirun \
--ranks-per-node 4 \
--gpus 1 \
--gpu-bind 0/1/2/3 \
--cpu-bind 48-63/176-191/16-31/144-159 \
--mem-bind 3/3/1/1 \
--sysmem 15000 \
--fbmem 80000 \
--zcmem 5000 \
share/legate/examples/io/hdf5/ex1.py /path/to/hdf/data --n_rank 4
IO MODE : GDS
IO MODE : GDS
IO MODE : GDS
IO MODE : GDS
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.61114501953125
Throughput (MB/sec): 902.5611699467328
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.72363376617432
Throughput (MB/sec): 901.1650904397261
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.35940861701965
Throughput (MB/sec): 905.7011555589922
Total Data Read: 68719476736
Total Turnaround time (seconds): 72.73599338531494
Throughput (MB/sec): 901.0119605135058
GDS Performance Tuning#
If the HDF5 datasets are large, the following tuning can help improve performance.
Edit /etc/cufile.json and add the following line under the "properties" section (for CUDA 12.6 and later releases):
"properties": {
...
"per_buffer_cache_size_kb" : 16384,
...
}
For more GDS-specific performance tuning, please refer to https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html