Settings#

cuPyNumeric has a number of runtime settings that can be configured through environment variables.

`doctor`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_DOCTOR
Default:: False

Attempt to warn about certain usage patterns that are inefficient with cuPyNumeric.

`doctor_format`#

Type:: DoctorFormat (“plain”, “csv”, or “json”)
Env var:: CUPYNUMERIC_DOCTOR_FORMAT
Default:: ‘plain’

Format for cuPyNumeric Doctor output: plain, csv, or json.

`doctor_filename`#

Type:: str
Env var:: CUPYNUMERIC_DOCTOR_FILENAME
Default:: None

The file to write cuPyNumeric Doctor output to. If unset, output goes to stdout.

`doctor_traceback`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_DOCTOR_TRACEBACK
Default:: False

Whether cuPyNumeric Doctor output should include full tracebacks.
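As a minimal sketch of how the four Doctor settings fit together, the snippet below builds an environment mapping for a child process. The helper name `doctor_env` and its parameters are hypothetical; only the variable names and accepted values are taken from the entries above.

```python
import os


def doctor_env(fmt="plain", filename=None, traceback=False, base=None):
    """Return a copy of an environment with cuPyNumeric Doctor settings applied.

    Hypothetical helper: it only assembles the documented variables and
    does not import or call cuPyNumeric itself.
    """
    if fmt not in ("plain", "csv", "json"):
        raise ValueError(f"unsupported doctor format: {fmt!r}")
    # Copy so the caller's mapping (or os.environ) is left untouched.
    env = dict(os.environ if base is None else base)
    env["CUPYNUMERIC_DOCTOR"] = "1"
    env["CUPYNUMERIC_DOCTOR_FORMAT"] = fmt
    if filename is not None:
        # Leaving the variable unset keeps the documented default (stdout).
        env["CUPYNUMERIC_DOCTOR_FILENAME"] = filename
    env["CUPYNUMERIC_DOCTOR_TRACEBACK"] = "1" if traceback else "0"
    return env


# An empty base is used here only to keep the example self-contained.
env = doctor_env(fmt="json", filename="doctor.json", traceback=True, base={})
```

The resulting mapping could then be passed as the `env` argument of `subprocess.run` when launching a script that uses cuPyNumeric.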

`preload_cudalibs`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_PRELOAD_CUDALIBS
Default:: False

Preload and initialize handles of all CUDA libraries (cuBLAS, cuSOLVER, etc.) used in cuPyNumeric.

`warn`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_WARN
Default:: False

Turn on warnings.

`numpy_compat`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_NUMPY_COMPATIBILITY
Default:: False

cuPyNumeric will issue additional tasks to match NumPy’s results and behavior. This is currently used in the following APIs: nanmin, nanmax, nanargmin, nanargmax.

`fallback_stacktrace`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_FALLBACK_STACKTRACE
Default:: False

Whether to dump a full stack trace whenever cuPyNumeric emits a warning about falling back to NumPy routines.

This is a read-only environment variable setting used by the runtime.

`fast_math`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_FAST_MATH
Default:: False

Enable certain optimized execution modes for floating-point math operations that may violate strict IEEE specifications. Currently this flag enables the acceleration of single-precision cuBLAS routines using TF32 tensor cores.

This is a read-only environment variable setting used by the runtime.

`ufunc_native`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_UFUNC_NATIVE
Default:: False

Enable the experimental native C++ ufunc dispatch path.

This is a read-only environment variable setting used by the runtime.

`min_gpu_chunk`#

Type:: int
Env var:: CUPYNUMERIC_MIN_GPU_CHUNK
Default:: 65536 (test-mode default: 2)

Minimum chunk size for GPU operations.

This is a read-only environment variable setting used by the runtime.

`min_cpu_chunk`#

Type:: int
Env var:: CUPYNUMERIC_MIN_CPU_CHUNK
Default:: 1024 (test-mode default: 2)

Minimum chunk size for CPU operations.

This is a read-only environment variable setting used by the runtime.

`min_omp_chunk`#

Type:: int
Env var:: CUPYNUMERIC_MIN_OMP_CHUNK
Default:: 8192 (test-mode default: 2)

Minimum chunk size for OpenMP operations.

This is a read-only environment variable setting used by the runtime.

`matmul_cache_size`#

Type:: int
Env var:: CUPYNUMERIC_MATMUL_CACHE_SIZE
Default:: 134217728 (test-mode default: 4096)

Force cuPyNumeric to keep temporary task slices during matmul computations smaller than this threshold. Whenever the temporary space needed during computation would exceed this value, the task is batched over ‘k’ to fulfill the requirement.

This is a read-only environment variable setting used by the runtime.
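To illustrate what batching a matrix multiplication over ‘k’ means, the following pure-Python sketch accumulates the product over slices of the shared dimension. It is not cuPyNumeric's implementation; the function name `matmul_k_batched` and the `k_batch` parameter are hypothetical, and how the runtime derives a batch size from the cache threshold is not described here.

```python
def matmul_k_batched(a, b, k_batch):
    """Multiply a (m x k) by b (k x n), accumulating over slices of k.

    Each pass only touches columns k0:k1 of `a` and rows k0:k1 of `b`,
    so the working set per pass shrinks as `k_batch` shrinks.
    """
    if k_batch <= 0:
        raise ValueError("k_batch must be positive")
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for k0 in range(0, k, k_batch):
        k1 = min(k0 + k_batch, k)
        for i in range(m):
            for j in range(n):
                # Partial dot product over the current slice of k.
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(k0, k1))
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 2]]
single_pass = matmul_k_batched(a, b, k_batch=4)  # one slice covering all of k
two_passes = matmul_k_batched(a, b, k_batch=2)   # two slices of k
```

Both calls produce the same matrix, because the sum over k can be split into partial sums without changing the result (for floating-point inputs the rounding may differ slightly).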

`test`#

Type:: bool (“0” or “1”)
Env var:: LEGATE_TEST
Default:: False

Enable test mode. This sets alternative defaults for various other settings.

This is a read-only environment variable setting used by the runtime.
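The relationship between test mode and the alternative defaults listed in this section can be sketched as a small lookup. The helper `effective_value` is hypothetical, and the precedence it applies (an explicitly set variable wins over both defaults) is an assumption rather than documented behavior.

```python
import os

# (regular default, test-mode default), copied from the entries in this section.
DEFAULTS = {
    "CUPYNUMERIC_MIN_GPU_CHUNK": (65536, 2),
    "CUPYNUMERIC_MIN_CPU_CHUNK": (1024, 2),
    "CUPYNUMERIC_MIN_OMP_CHUNK": (8192, 2),
    "CUPYNUMERIC_MATMUL_CACHE_SIZE": (134217728, 4096),
}


def effective_value(name, environ=None):
    """Resolve an integer setting from a mapping of environment variables.

    Assumption: an explicit value takes precedence; otherwise the
    test-mode default applies when LEGATE_TEST is "1".
    """
    environ = os.environ if environ is None else environ
    if name in environ:
        return int(environ[name])
    regular, test_mode = DEFAULTS[name]
    return test_mode if environ.get("LEGATE_TEST") == "1" else regular


regular = effective_value("CUPYNUMERIC_MIN_GPU_CHUNK", {})
in_test = effective_value("CUPYNUMERIC_MIN_GPU_CHUNK", {"LEGATE_TEST": "1"})
```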

`take_default`#

Type:: str
Env var:: CUPYNUMERIC_TAKE_DEFAULT
Default:: ‘auto’

Default algorithm for deferred array.take():

‘auto’: let cuPyNumeric decide which algorithm to use
‘index’: use advanced indexing
‘task’: use a task that broadcasts the indices

`disable_bounds_checking`#

Type:: DisableBoundsChecking (“none”, “all”, or comma-separated selectors: indexing, take, take_along_axis, put)
Env var:: CUPYNUMERIC_DISABLE_BOUNDS_CHECKING
Default:: ‘none’

Disables explicit bounds checking for advanced-indexing-related operations.

‘none’: disable no targeted explicit bounds checks

‘all’: disable all targeted explicit bounds checks

comma-separated selectors such as ‘indexing,take,put’: disable checks only for the named operations
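The accepted values can be sketched as a small parser that maps a setting string to the set of operations whose checks are disabled. The function `disabled_checks` is hypothetical; tolerating surrounding whitespace and rejecting unknown names are assumptions, not documented behavior.

```python
# Selector names as listed in the Type line of this setting.
SELECTORS = ("indexing", "take", "take_along_axis", "put")


def disabled_checks(value):
    """Return the set of operations whose bounds checks would be disabled."""
    value = value.strip()
    if value == "none":
        return frozenset()
    if value == "all":
        return frozenset(SELECTORS)
    names = [part.strip() for part in value.split(",")]
    unknown = [name for name in names if name not in SELECTORS]
    if unknown:
        raise ValueError(f"unknown bounds-checking selectors: {unknown}")
    return frozenset(names)


selected = disabled_checks("indexing,take,put")
```

In this sketch, ‘none’ and ‘all’ are only valid on their own; mixing them with selectors raises an error.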

`use_nccl_gather`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_USE_NCCL_GATHER
Default:: True

Enable distributed gather via the NCCL all-to-all implementation when multiple GPUs are available. Set to 0 to fall back to the Legion-based gather path.

`use_nccl_scatter`#

Type:: bool (“0” or “1”)
Env var:: CUPYNUMERIC_USE_NCCL_SCATTER
Default:: True

Enable distributed scatter via the NCCL all-to-all implementation when multiple GPUs are available. Set to 0 to fall back to the Legion-based scatter path.

`all2all_staging_factor`#

Type:: float
Env var:: CUPYNUMERIC_ALL2ALL_STAGING_FACTOR
Default:: 1.1

Per-buffer staging budget for the NCCL all-to-all gather/scatter tasks, expressed as a multiple of the average per-rank request count:

byte budget = factor * (global_index_volume / num_ranks) * elem_size

The exchange runs in roughly ceil(num_ranks / factor) NCCL group rounds. Lower values bound GPU framebuffer (FB) memory use at the cost of more rounds; higher values favor throughput. Recommended: 0.5-1.0 if FB memory is tight, >= num_ranks for a single-round shuffle.
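The two formulas above can be evaluated directly to compare candidate factors. The function names and the sample workload (8 ranks, 8,000,000 indices, 8-byte elements) are hypothetical and chosen only for illustration.

```python
import math


def staging_budget_bytes(factor, global_index_volume, num_ranks, elem_size):
    """Byte budget per staging buffer, per the formula in this entry."""
    return factor * (global_index_volume / num_ranks) * elem_size


def approx_rounds(factor, num_ranks):
    """Approximate number of NCCL group rounds for the exchange."""
    return math.ceil(num_ranks / factor)


num_ranks = 8
# Compare a tight-memory factor, the default, and a single-round factor.
for factor in (0.5, 1.1, float(num_ranks)):
    budget = staging_budget_bytes(factor, 8_000_000, num_ranks, 8)
    rounds = approx_rounds(factor, num_ranks)
    print(f"factor={factor}: ~{budget / 1e6:.1f} MB per buffer, ~{rounds} rounds")
```

For this sample, the estimated round count falls from 16 at a factor of 0.5 to 1 at a factor equal to num_ranks, while the per-buffer budget grows in proportion to the factor.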