torch.cuda

This package adds support for CUDA tensor types, which implement the same functions as CPU tensors but use GPUs for computation.

It is lazily initialized, so you can always import it and use is_available() to determine whether your system supports CUDA.

See CUDA semantics for more details about working with CUDA.
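
Because the package is lazily initialized, a script can import it unconditionally and decide at run time where to place tensors. The following is a minimal sketch of that pattern; the helper name `pick_device` is hypothetical and not part of the library.

```python
import torch


def pick_device() -> torch.device:
    """Return the current CUDA device if CUDA is usable, otherwise the CPU."""
    # is_available() is safe to call on machines without a GPU or driver.
    if torch.cuda.is_available():
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")


device = pick_device()
x = torch.ones(2, 3, device=device)  # allocated on the GPU when one is present
total = (x * 2).sum().item()         # six elements, each equal to 2.0
print(device.type, total)
```

The same code path runs on CPU-only machines, which keeps the rest of a program free of CUDA-specific branches.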

StreamContext

Context-manager that selects a given stream.

can_device_access_peer

Checks if peer access between two devices is possible.

current_blas_handle

Returns a cublasHandle_t pointer to the current cuBLAS handle.

current_device

Returns the index of the currently selected device.

current_stream

Returns the currently selected Stream for a given device.

default_stream

Returns the default Stream for a given device.

device

Context-manager that changes the selected device.

device_count

Returns the number of GPUs available.

device_of

Context-manager that changes the current device to that of the given object.

get_arch_list

Returns the list of CUDA architectures this library was compiled for.

get_device_capability

Gets the CUDA capability of a device.

get_device_name

Gets the name of a device.

get_device_properties

Gets the properties of a device.

get_gencode_flags

Returns NVCC gencode flags this library was compiled with.

get_sync_debug_mode

Returns the current value of the debug mode for CUDA synchronizing operations.

init

Initialize PyTorch's CUDA state.

ipc_collect

Force collects GPU memory after it has been released by CUDA IPC.

is_available

Returns a bool indicating if CUDA is currently available.

is_initialized

Returns whether PyTorch's CUDA state has been initialized.

memory_usage

Returns the percent of time over the past sample period during which global (device) memory was being read or written.

set_device

Sets the current device.

set_stream

Sets the current stream. This is a wrapper API to set the stream.

set_sync_debug_mode

Sets the debug mode for CUDA synchronizing operations.

stream

Wrapper around the context manager StreamContext that selects a given stream.

synchronize

Waits for all kernels in all streams on a CUDA device to complete.

utilization

Returns the percent of time over the past sample period during which one or more kernels were executing on the GPU, as given by nvidia-smi.

temperature

Returns the average temperature of the GPU sensor in degrees Celsius.

power_draw

Returns the average power draw of the GPU sensor in milliwatts (mW).

clock_rate

Returns the clock speed of the GPU SM in hertz (Hz) over the past sample period, as given by nvidia-smi.

OutOfMemoryError

Exception raised when CUDA is out of memory.
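
As an illustration of the device query functions listed above, the sketch below collects the name, compute capability, and total memory of every visible GPU. The helper `describe_devices` and the dictionary keys are hypothetical names chosen for this example; on a machine without CUDA it returns an empty list.

```python
import torch


def describe_devices() -> list:
    """Collect basic facts about each visible CUDA device."""
    infos = []
    if not torch.cuda.is_available():
        return infos  # nothing to report on CPU-only machines
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        major, minor = torch.cuda.get_device_capability(index)
        infos.append(
            {
                "index": index,
                "name": torch.cuda.get_device_name(index),
                "capability": f"{major}.{minor}",
                "total_memory_bytes": props.total_memory,
            }
        )
    return infos


devices = describe_devices()
for info in devices:
    print(info)
```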

Random Number Generator

get_rng_state

Returns the random number generator state of the specified GPU as a ByteTensor.

get_rng_state_all

Returns a list of ByteTensor objects representing the random number states of all devices.

set_rng_state

Sets the random number generator state of the specified GPU.

set_rng_state_all

Sets the random number generator state of all devices.

manual_seed

Sets the seed for generating random numbers for the current GPU.

manual_seed_all

Sets the seed for generating random numbers on all GPUs.

seed

Sets the seed for generating random numbers to a random number for the current GPU.

seed_all

Sets the seed for generating random numbers to a random number on all GPUs.

initial_seed

Returns the current random seed of the current GPU.
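
The seeding and state functions above can be combined to make GPU sampling repeatable. The sketch below is one possible arrangement, not a prescribed workflow: `draw` reseeds before sampling, and `replay` saves and restores the generator state. Both helper names are hypothetical, and both fall back to the CPU generator when CUDA is unavailable.

```python
import torch


def draw(seed: int, n: int = 4) -> torch.Tensor:
    """Seed the generator for the active device, then sample n uniform values."""
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)  # seeds the current GPU only
        device = "cuda"
    else:
        torch.manual_seed(seed)       # CPU fallback
        device = "cpu"
    return torch.rand(n, device=device).cpu()


def replay(n: int = 4):
    """Sample, rewind the generator to its saved state, and sample again."""
    if torch.cuda.is_available():
        state = torch.cuda.get_rng_state()  # ByteTensor snapshot
        first = torch.rand(n, device="cuda").cpu()
        torch.cuda.set_rng_state(state)
        second = torch.rand(n, device="cuda").cpu()
    else:
        state = torch.get_rng_state()
        first = torch.rand(n)
        torch.set_rng_state(state)
        second = torch.rand(n)
    return first, second


seeded_a = draw(1234)
seeded_b = draw(1234)
replay_a, replay_b = replay()
print(torch.equal(seeded_a, seeded_b), torch.equal(replay_a, replay_b))
```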

Communication collectives

comm.broadcast

Broadcasts a tensor to specified GPU devices.

comm.broadcast_coalesced

Broadcast a sequence of tensors to the specified GPUs.

comm.reduce_add

Sum tensors from multiple GPUs.

comm.scatter

Scatters a tensor across multiple GPUs.

comm.gather

Gathers tensors from multiple GPU devices.
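
A small sketch of how `comm.broadcast` and `comm.reduce_add` fit together is shown below. It assumes at least two visible GPUs (device indices 0 and 1 are an assumption of this example) and returns None otherwise; the helper name `broadcast_and_sum` is hypothetical.

```python
import torch
from torch.cuda import comm


def broadcast_and_sum(values):
    """Copy a tensor to GPUs 0 and 1, then sum the copies back on GPU 0."""
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return None  # the collectives need more than one device to be useful
    source = torch.tensor(values, dtype=torch.float32, device="cuda:0")
    copies = comm.broadcast(source, devices=[0, 1])  # one copy per device
    total = comm.reduce_add(copies, destination=0)   # elementwise sum on GPU 0
    return total.cpu()


result = broadcast_and_sum([1.0, 2.0])
print(result)
```

With two copies of the same tensor, the reduced result is twice the input.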

Streams and events

Stream

Wrapper around a CUDA stream.

ExternalStream

Wrapper around an externally allocated CUDA stream.

Event

Wrapper around a CUDA event.
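
Streams and events are typically used together: work is queued on a stream, and events mark points in that queue for synchronization or timing. The sketch below times a matrix multiply on a side stream. The helper name and the matrix size are arbitrary choices for illustration, and the function returns None when CUDA is unavailable.

```python
import torch


def time_matmul_on_side_stream(n: int = 256):
    """Return the elapsed time in milliseconds of a matmul run on a side stream."""
    if not torch.cuda.is_available():
        return None
    side = torch.cuda.Stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    a = torch.randn(n, n, device="cuda")
    # The side stream must not start before `a` has been produced.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        a.record_stream(side)  # tell the allocator that `a` is used on `side`
        start.record()         # recorded on the current stream, which is `side`
        b = a @ a
        end.record()
    end.synchronize()          # block the host until the matmul has finished
    # Later work on the original stream may safely consume `b` after this.
    torch.cuda.current_stream().wait_stream(side)
    return start.elapsed_time(end)


elapsed_ms = time_matmul_on_side_stream()
print(elapsed_ms)
```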

Graphs (beta)

is_current_stream_capturing

Returns True if CUDA graph capture is underway on the current CUDA stream, False otherwise.

graph_pool_handle

Returns an opaque token representing the id of a graph memory pool.

CUDAGraph

Wrapper around a CUDA graph.

graph

Context-manager that captures CUDA work into a torch.cuda.CUDAGraph object for later replay.

make_graphed_callables

Accepts callables (functions or nn.Modules) and returns graphed versions.
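
The capture-and-replay flow behind `CUDAGraph` and `graph` can be sketched as follows. This is a minimal example under simplifying assumptions (a single elementwise expression, a warm-up on a side stream before capture); the helper name `capture_and_replay` is hypothetical, and the function returns None without CUDA.

```python
import torch


def capture_and_replay():
    """Capture `x * 2 + 1` into a CUDA graph, then replay it on new input data."""
    if not torch.cuda.is_available():
        return None
    static_input = torch.zeros(4, device="cuda")

    # Warm up on a side stream so lazy initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            static_input * 2 + 1
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = static_input * 2 + 1  # recorded, not executed eagerly

    # Replays reuse the captured memory, so new data is copied in place.
    static_input.copy_(torch.arange(4.0, device="cuda"))
    graph.replay()
    torch.cuda.synchronize()
    return static_output.cpu()


replayed = capture_and_replay()
print(replayed)
```

The input and output tensors keep fixed addresses across replays, which is why the example writes into `static_input` rather than creating a new tensor.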

Memory management

empty_cache

Release all unoccupied cached memory currently held by the caching allocator so that it can be used by other GPU applications and is visible in nvidia-smi.

list_gpu_processes

Return a human-readable printout of the running processes and their GPU memory use for a given device.

mem_get_info

Return the global free and total GPU memory for a given device using cudaMemGetInfo.

memory_stats

Return a dictionary of CUDA memory allocator statistics for a given device.

memory_summary

Return a human-readable printout of the current memory allocator statistics for a given device.

memory_snapshot

Return a snapshot of the CUDA memory allocator state across all devices.

memory_allocated

Return the current GPU memory occupied by tensors in bytes for a given device.

max_memory_allocated

Return the maximum GPU memory occupied by tensors in bytes for a given device.

reset_max_memory_allocated

Reset the starting point in tracking maximum GPU memory occupied by tensors for a given device.

memory_reserved

Return the current GPU memory managed by the caching allocator in bytes for a given device.

max_memory_reserved

Return the maximum GPU memory managed by the caching allocator in bytes for a given device.

set_per_process_memory_fraction

Set memory fraction for a process.

memory_cached

Deprecated; see memory_reserved().

max_memory_cached

Deprecated; see max_memory_reserved().

reset_max_memory_cached

Reset the starting point in tracking maximum GPU memory managed by the caching allocator for a given device.

reset_peak_memory_stats

Reset the "peak" stats tracked by the CUDA memory allocator.

caching_allocator_alloc

Perform a memory allocation using the CUDA memory allocator.

caching_allocator_delete

Delete memory allocated using the CUDA memory allocator.

get_allocator_backend

Return a string describing the active allocator backend as set by PYTORCH_CUDA_ALLOC_CONF.

CUDAPluggableAllocator

CUDA memory allocator loaded from a shared object (.so) file.

change_current_allocator

Change the currently used memory allocator to be the one provided.
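
To show how the allocator statistics above relate to each other, the sketch below allocates one tensor and reads several counters around the allocation. The helper `measure_allocation` and its dictionary keys are hypothetical names for this example; it returns None when CUDA is unavailable.

```python
import torch


def measure_allocation(rows: int = 1024, cols: int = 1024):
    """Allocate a float32 tensor and report allocator counters in bytes."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    x = torch.empty(rows, cols, device="cuda")  # rows * cols * 4 bytes requested
    after = torch.cuda.memory_allocated()
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    stats = {
        "allocated_delta": after - before,
        "peak_allocated": torch.cuda.max_memory_allocated(),
        "reserved": torch.cuda.memory_reserved(),
        "device_free": free_bytes,
        "device_total": total_bytes,
    }
    del x
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return stats


memory_report = measure_allocation()
print(memory_report)
```

The reserved figure is usually larger than the allocated figure, because the caching allocator keeps freed blocks for reuse until `empty_cache` is called.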

NVIDIA Tools Extension (NVTX)

nvtx.mark

Describe an instantaneous event that occurred at some point.

nvtx.range_push

Push a range onto a stack of nested range spans.

nvtx.range_pop

Pop a range off of a stack of nested range spans.
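
Because `range_push` and `range_pop` must be paired, it can be convenient to wrap them in a context manager. The sketch below does so with a hypothetical helper named `nvtx_range`, which becomes a no-op when CUDA is unavailable so the same code runs everywhere.

```python
import contextlib

import torch


@contextlib.contextmanager
def nvtx_range(label: str):
    """Push an NVTX range on entry and pop it on exit, even if an error occurs."""
    enabled = torch.cuda.is_available()
    if enabled:
        torch.cuda.nvtx.range_push(label)
    try:
        yield
    finally:
        if enabled:
            torch.cuda.nvtx.range_pop()


with nvtx_range("outer"):
    if torch.cuda.is_available():
        torch.cuda.nvtx.mark("starting inner work")  # instantaneous marker
    with nvtx_range("inner"):  # nested ranges form a stack
        total = sum(range(10))

print(total)
```

The ranges and marks only become visible when the program runs under a profiler that collects NVTX annotations.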

Jiterator (beta)

jiterator._create_jit_fn

Create a jiterator-generated CUDA kernel for an elementwise op.

jiterator._create_multi_output_jit_fn

Create a jiterator-generated CUDA kernel for an elementwise op that supports returning one or more outputs.
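
A minimal sketch of `_create_jit_fn` is shown below. The kernel source, its name `scaled_diff`, and the helper `run_jiterator` are illustrative assumptions; the API is private and in beta, so its behavior may change. The function returns None when CUDA is unavailable.

```python
import torch
from torch.cuda.jiterator import _create_jit_fn

# C++ source for an elementwise kernel: out = -x + alpha * y
KERNEL_SOURCE = (
    "template <typename T> T scaled_diff(T x, T y, T alpha) "
    "{ return -x + alpha * y; }"
)


def run_jiterator():
    """Compile the kernel at run time and apply it to two small tensors."""
    if not torch.cuda.is_available():
        return None
    # Extra scalar arguments are declared as keyword defaults at creation time.
    jitted = _create_jit_fn(KERNEL_SOURCE, alpha=1.0)
    a = torch.tensor([1.0, 2.0, 3.0], device="cuda")
    b = torch.tensor([4.0, 5.0, 6.0], device="cuda")
    return jitted(a, b, alpha=2.0).cpu()  # overrides the default alpha


jit_result = run_jiterator()
print(jit_result)
```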

Stream Sanitizer (prototype)

CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch. See the documentation for information on how to use it.
