NOTE: For the latest stable README.md ensure you are on the main branch.
- RMM Reference Documentation: Python and C++ API references, tutorials, and topic guides.
- RAPIDS Installation Guide: Instructions for installing RMM.
- GitHub Repository: Download the RMM source code.
- Issue Tracker: Report issues or request features.
- RAPIDS Community: Get help, contribute, and collaborate.
Achieving optimal performance in GPU-centric workflows frequently requires customizing how host and device memory are allocated. Examples include using "pinned" host memory for asynchronous host <-> device memory transfers, or using a device memory pool sub-allocator to reduce the cost of dynamic device memory allocation.
The goal of the RAPIDS Memory Manager (RMM) is to provide:
- A common interface that allows customizing memory allocation on device and host
- A collection of implementations of the interface
- A collection of data structures that use the interface for memory allocation
For information on the interface RMM provides and how to use RMM in your C++ code, see below.
For a walkthrough about the design of the RAPIDS Memory Manager, read Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager on the NVIDIA Developer Blog.
RMM can be installed with conda. You can get a minimal conda installation with miniforge.
Install RMM with:
conda install -c rapidsai -c conda-forge rmm cuda-version=13.3
We also provide nightly conda packages built from the HEAD of our latest development branch.
Note: The RMM package from conda requires building with GCC 13.3 or later. Otherwise, your application may fail to build.
See the RAPIDS Installation Guide for system requirements.
Compiler requirements:
- gcc version 13.3+
- nvcc version 12.9+
- cmake version 4.0+
CUDA/GPU requirements:
- CUDA 12.2+. You can obtain CUDA from https://developer.nvidia.com/cuda-downloads
GPU Support:
- RMM is tested and supported only on Volta architecture and newer (Compute Capability 7.0+).
Python requirements:
- rapids-build-backend (available from PyPI or the rapidsai conda channel)
- scikit-build-core
- cuda-python
- cython
For more details, see pyproject.toml.
To install RMM from source, ensure the dependencies are met and follow the steps below:
- Clone the repository:
$ git clone https://github.com/rapidsai/rmm.git
$ cd rmm
- Create the conda development environment rmm_dev:
# create the conda environment (assuming in base `rmm` directory)
$ conda env create --name rmm_dev --file conda/environments/all_cuda-133_arch-$(uname -m).yaml
# activate the environment
$ conda activate rmm_dev
- Build and install librmm using cmake & make. CMake depends on the nvcc executable being on your path or defined in the CUDACXX environment variable.
$ mkdir build # make a build directory
$ cd build # enter the build directory
$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path # configure cmake ... use $CONDA_PREFIX if you're using a conda environment
$ make -j # compile the library librmm.so ... '-j' with no number lets GNU make run as many parallel jobs as it can; use e.g. '-j8' to cap the count
$ make install # install the library librmm.so to '/install/path'
- Build and install librmm and rmm using build.sh. build.sh creates a build directory at the root of the git repository. build.sh depends on the nvcc executable being on your path or defined in the CUDACXX environment variable.
$ ./build.sh -h # Display help and exit
$ ./build.sh -n librmm # Build librmm without installing
$ ./build.sh -n rmm # Build rmm without installing
$ ./build.sh -n librmm rmm # Build librmm and rmm without installing
$ ./build.sh librmm rmm # Build and install librmm and rmm
- To run tests (Optional):
$ cd build # if you are not already in the build directory
$ make test
- Build, install, and test the rmm Python package, in the python folder:
# In the root rmm directory
$ python -m pip wheel ./python/librmm
$ python -m pip install --find-links=. -e ./python/rmm
$ pytest -v
Done! You are ready to develop for the RMM project.
RMM uses CPM.cmake to handle third-party dependencies like
CCCL, GoogleTest, GoogleBenchmark. In general you won't have to worry about it. If CMake finds an
appropriate version on your system, it uses it (you can help it along by setting CMAKE_PREFIX_PATH
to point to the installed location). Otherwise those dependencies will be downloaded as part of the
build.
If you frequently start new builds from scratch, consider setting the environment variable
CPM_SOURCE_CACHE to an external download directory to avoid repeated downloads of the third-party
dependencies.
The installed RMM library provides a set of config files that makes it easy to
integrate RMM into your own CMake project. Add the following to CMakeLists.txt:
find_package(rmm [VERSION])
# ...
target_link_libraries(<your-target> (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)
This links librmm, makes RMM headers available, and pulls in transitive dependencies.
If RMM is not installed in a default location, use
CMAKE_PREFIX_PATH or rmm_ROOT to point to its location.
RMM uses CPM.cmake to manage its dependencies, including CCCL, and you can use CPM for your project's dependency on RMM.
There is an issue with using CPM's single-argument compact syntax for
RMM/CCCL as it transitively marks targets as SYSTEM dependencies.
This causes the CCCL headers pulled in through CPM to be of lower priority
to the preprocessor than the (potentially outdated) CCCL headers provided
by the CUDA SDK. To avoid this issue, use CPM's multi-argument syntax
instead:
CPMAddPackage(NAME rmm [VERSION]
GITHUB_REPOSITORY rapidsai/rmm
SYSTEM OFF
SOURCE_SUBDIR cpp)
# ...
target_link_libraries(<your-target> (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)
The first goal of RMM is to provide a common interface for device memory allocation. This allows both users and implementers of custom allocation logic to program to a single interface.
RMM's memory resources use CCCL's memory resource
concepts.
Resource APIs accept either concrete resource objects, non-owning resource refs such as
rmm::device_async_resource_ref, or owning type-erased resources such as
cuda::mr::any_resource<cuda::mr::device_accessible>.
A device memory resource satisfies the CCCL resource concept and provides stream-ordered allocation and deallocation:
void* allocate(cuda::stream_ref stream,
std::size_t bytes,
std::size_t alignment = rmm::CUDA_ALLOCATION_ALIGNMENT);
void deallocate(cuda::stream_ref stream,
void* ptr,
std::size_t bytes,
std::size_t alignment = rmm::CUDA_ALLOCATION_ALIGNMENT) noexcept;
RMM also uses rmm::device_async_resource_ref, an alias for
cuda::mr::resource_ref<cuda::mr::device_accessible>, as a lightweight non-owning reference to a
device resource. RMM uses cuda::mr::any_resource<cuda::mr::device_accessible> as an owning
type-erased resource. RMM resources with non-trivial state are value types with shared ownership of
their internal state, so copying a resource object is inexpensive and keeps the underlying state
alive.
RMM memory resources provide stream-ordered memory allocation. This allows optimizations such as re-using memory deallocated on the same stream without the overhead of synchronization.
A call to resource.allocate(stream_a, bytes) returns a pointer that is valid to use on stream_a.
Using the memory on a different stream (say stream_b) is Undefined Behavior unless the two streams
are first synchronized, for example by using cudaStreamSynchronize(stream_a) or by recording a CUDA
event on stream_a and then calling cudaStreamWaitEvent(stream_b, event).
The stream specified to deallocate should be a stream on which it is valid to use the deallocated
memory immediately for another allocation. Typically this is the stream on which the allocation was
last used before the call to deallocate. The passed stream may be used internally by a memory
resource for managing available memory with minimal synchronization, and it may also be synchronized
at a later time, for example using a call to cudaStreamSynchronize().
For this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to
deallocate. If the stream on which the allocation was last used has been destroyed before calling
deallocate or it is known that it will be destroyed, it is likely better to synchronize the stream
(before destroying it) and then pass a different stream to deallocate (e.g. the default stream).
Note that device memory data structures such as rmm::device_buffer and rmm::device_uvector
follow these stream-ordered memory allocation semantics and rules.
For further information about stream-ordered memory allocation semantics, read Using the NVIDIA CUDA Stream-Ordered Memory Allocator on the NVIDIA Developer Blog.
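To make the reuse rule concrete, here is a toy, pure-Python model of a stream-ordered free list. It is a sketch under stated assumptions and not RMM's implementation: the class ToyStreamOrderedPool, the integer stand-ins for streams, and the cross_stream_syncs counter are all hypothetical. It illustrates that a block freed on a stream can be handed back to the same stream with no synchronization, while handing it to a different stream requires ordering that stream after the one that freed it.

```python
from collections import defaultdict


class ToyStreamOrderedPool:
    """Toy model of stream-ordered reuse (hypothetical, not RMM code)."""

    def __init__(self):
        self._free = defaultdict(list)  # stream id -> free block ids
        self._next_block = 0
        self.cross_stream_syncs = 0     # times another stream's block was reused

    def allocate(self, stream):
        # Fast path: a block freed on the same stream is already ordered
        # after its last use, so it can be reused without synchronizing.
        if self._free[stream]:
            return self._free[stream].pop()
        # Slow path: take a block freed on another stream. A real allocator
        # would make `stream` wait on that stream first; here we only count it.
        for blocks in self._free.values():
            if blocks:
                self.cross_stream_syncs += 1
                return blocks.pop()
        # Nothing to reuse: create a fresh block.
        block = self._next_block
        self._next_block += 1
        return block

    def deallocate(self, stream, block):
        # The block becomes reusable in stream order on `stream`.
        self._free[stream].append(block)


pool = ToyStreamOrderedPool()
a = pool.allocate(stream=0)
pool.deallocate(stream=0, block=a)
b = pool.allocate(stream=0)   # same stream: reused, no synchronization
pool.deallocate(stream=0, block=b)
c = pool.allocate(stream=1)   # different stream: reuse needs ordering
print(a == b, b == c, pool.cross_stream_syncs)
```

In this model the second allocation on stream 0 costs nothing extra, while the allocation on stream 1 is the only one that records a cross-stream synchronization.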
RMM provides several device memory resources to satisfy various user requirements. For more detailed information about these resources, see their respective documentation.
- cuda_memory_resource: Allocates and frees device memory using cudaMalloc and cudaFree.
- managed_memory_resource: Allocates and frees device memory using cudaMallocManaged and cudaFree. Note that NVIDIA Virtual GPU Software (vGPU, for use with virtual machines or hypervisors) does not support managed_memory_resource by default. To support this, Unified Memory must be enabled for vGPU.
- pool_memory_resource: A coalescing, best-fit pool sub-allocator.
- fixed_size_memory_resource: A memory resource that can only allocate a single fixed size. Average allocation and deallocation cost is constant.
- binning_memory_resource: Configurable to use multiple upstream memory resources for allocations that fall within different bin sizes. Often configured with multiple bins backed by fixed_size_memory_resources and a single pool_memory_resource for allocations larger than the largest bin size.
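The bin-based dispatch described above can be sketched in a few lines of plain Python. This is only a conceptual model, not RMM code: the function pick_upstream, the returned labels, and the bin sizes are hypothetical, and the rule that a request goes to the smallest bin that fits it is an assumption made for illustration.

```python
from bisect import bisect_left


def pick_upstream(request_bytes, bin_sizes):
    """Return a label for the toy upstream that would serve a request.

    bin_sizes must be sorted ascending. A request goes to the smallest bin
    that fits it; anything larger than the largest bin falls through to a
    single pool.
    """
    index = bisect_left(bin_sizes, request_bytes)
    if index < len(bin_sizes):
        return f"fixed_size[{bin_sizes[index]}]"
    return "pool"


# Hypothetical bins: powers of two from 256 B to 4 KiB.
bins = [256, 512, 1024, 2048, 4096]
for size in (100, 256, 300, 4096, 5000):
    print(size, "->", pick_upstream(size, bins))
```

Small requests are rounded up to a fixed-size bin, which keeps their cost constant, and only requests above the largest bin reach the general-purpose pool.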
RMM users commonly need to configure a resource object to use for all allocations where another
resource has not explicitly been provided. A common example is configuring a pool_memory_resource
to use for all allocations to get fast dynamic allocation.
To enable this use case, RMM provides the concept of a resource for the currently active CUDA device. This resource is used when another is not explicitly provided.
Accessing and modifying this resource is done through two functions:
- device_async_resource_ref get_current_device_resource_ref()
  - Returns a non-owning reference to the resource for the active CUDA device.
  - The initial resource is an instance of cuda_memory_resource.
  - This function is thread safe with respect to concurrent calls to it and set_current_device_resource().
  - For more explicit control, you can use get_per_device_resource_ref(), which takes a device ID.
- cuda::mr::any_resource<cuda::mr::device_accessible> set_current_device_resource(cuda::mr::any_resource<cuda::mr::device_accessible> new_mr)
  - Updates the resource for the active CUDA device to new_mr.
  - Returns the previous resource as an owning type-erased resource.
  - This function is thread safe with respect to concurrent calls to it and get_current_device_resource_ref().
  - For more explicit control, you can use set_per_device_resource(), which takes a device ID.
  - To reset to the initial resource, call reset_current_device_resource().
rmm::mr::cuda_memory_resource cuda_mr;
// Construct a resource that uses a coalescing best-fit pool allocator
// With the pool initially half of available device memory
auto initial_size = rmm::percent_of_free_device_memory(50);
rmm::mr::pool_memory_resource pool_mr{cuda_mr, initial_size};
auto previous = rmm::mr::set_current_device_resource(pool_mr);
auto mr = rmm::mr::get_current_device_resource_ref();
A memory resource should only be used when the active CUDA device is the same device that was active when the resource was created. Otherwise behavior is undefined.
If a memory resource is used with a stream associated with a different CUDA device than the device for which the memory resource was created, behavior is undefined.
Creating a memory resource for each device requires care to set the current device before creating
each resource. Here is an example loop that creates pool_memory_resource objects for each device
and sets them as the per-device resource for that device.
for (int i = 0; i < N; ++i) {
cudaSetDevice(i); // set device i before creating MR
auto initial_size = rmm::percent_of_free_device_memory(50);
rmm::mr::pool_memory_resource pool{rmm::mr::cuda_memory_resource{}, initial_size};
// Set the per-device resource for device i
rmm::mr::set_per_device_resource(rmm::cuda_device_id{i}, pool);
}
Note that the CUDA device that is current when creating a memory resource must also be current any
time that resource is used to deallocate memory, including in a destructor. The RAII class
rmm::device_buffer and classes that use it as a backing store (rmm::device_scalar and
rmm::device_uvector) handle this by storing the active device when the constructor is called, and
then ensuring that the stored device is active whenever an allocation or deallocation is performed
(including in the destructor). The user must therefore only ensure that the device active during
creation of an rmm::device_buffer matches the active device of the memory resource being used.
Here is an incorrect example that creates a memory resource on device 0 and then uses it to
allocate a device_buffer on device 1:
{
RMM_CUDA_TRY(cudaSetDevice(0));
auto mr = rmm::mr::cuda_memory_resource{};
{
RMM_CUDA_TRY(cudaSetDevice(1));
// Invalid, current device is 1, but MR is only valid for device 0
rmm::device_buffer buf(16, rmm::cuda_stream_default, mr);
}
}
A correct example creates the device buffer with device 0 active. After that it is safe to switch
devices and let the buffer go out of scope and destruct with a different device active. For example,
this code is correct:
{
RMM_CUDA_TRY(cudaSetDevice(0));
auto mr = rmm::mr::cuda_memory_resource{};
rmm::device_buffer buf(16, rmm::cuda_stream_default, mr);
RMM_CUDA_TRY(cudaSetDevice(1));
...
// No need to switch back to device 0 before ~buf runs
}
rmm::device_vector uses an rmm::mr::thrust_allocator to enable thrust::device_vector to
allocate and deallocate memory using RMM. As such, the usual rules for usage of the backing memory
resource apply: the active device must match the active device at resource construction time. To
facilitate use in an RAII setting, rmm::mr::thrust_allocator records the active device at
construction time and ensures that device is active whenever it allocates or deallocates memory.
Usage of rmm::device_vector with multiple devices is therefore the same as rmm::device_buffer.
One must create device_vectors with the correct device active, but it is safe to destroy them
with a different active device.
For example, recapitulating the previous example using rmm::device_vector:
{
RMM_CUDA_TRY(cudaSetDevice(0));
auto mr = rmm::mr::cuda_memory_resource{};
rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, mr));
RMM_CUDA_TRY(cudaSetDevice(1));
...
// No need to switch back to device 0 before ~vec runs
}
Note: Although allocation and deallocation in the thrust_allocator run with the correct active device,
modification of rmm::device_vector might necessitate a kernel launch, and this must run with the
correct device active. For example, .resize() might both allocate and launch a kernel to
initialize new elements: the user must arrange for this kernel launch to occur with the correct
device for the memory resource active.
rmm::cuda_stream_view is a simple non-owning wrapper around a CUDA cudaStream_t. New code should
prefer cuda::stream_ref for compatibility with CCCL. rmm::cuda_stream_view can be converted
to/from cuda::stream_ref.
rmm::cuda_stream is a simple owning wrapper around a CUDA cudaStream_t. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An rmm::cuda_stream
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. rmm::cuda_stream cannot be copied, but can be moved.
rmm::cuda_stream_pool provides fast access to a pool of CUDA streams. This class can be used to
create a set of cuda_stream objects whose lifetime is equal to the cuda_stream_pool. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is configurable.
Depending on this size, multiple calls to cuda_stream_pool::get_stream() may return instances of
rmm::cuda_stream_view that represent identical CUDA streams.
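The remark about identical streams follows from a pool handing out a fixed set of streams in turn. The toy model below illustrates this in plain Python. It is a sketch, not RMM's implementation: ToyStreamPool is hypothetical, integers stand in for CUDA streams, and round-robin hand-out is an assumption made for illustration.

```python
import itertools


class ToyStreamPool:
    """Toy round-robin pool; integers stand in for CUDA streams."""

    def __init__(self, pool_size):
        if pool_size < 1:
            raise ValueError("pool size must be at least 1")
        self._streams = list(range(pool_size))       # created once, up front
        self._cycle = itertools.cycle(self._streams)

    def get_stream(self):
        # Hands out existing streams in turn; never creates a new one.
        return next(self._cycle)


pool = ToyStreamPool(pool_size=2)
handed_out = [pool.get_stream() for _ in range(5)]
print(handed_out)  # streams repeat once requests exceed the pool size
```

Callers that need work to run concurrently should therefore not assume that two calls return distinct streams unless the pool is at least as large as the number of requests.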
All current device memory resources are thread safe unless documented otherwise. More specifically,
calls to memory resource allocate() and deallocate() methods are safe with respect to calls to
either of these functions from other threads. They are not thread safe with respect to
construction and destruction of the memory resource object.
Note that a class thread_safe_resource_adapter is provided which can be used to adapt a memory
resource that is not thread safe to be thread safe (as described above). This adapter is not needed
with any current RMM device memory resources.
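As a rough illustration of what such an adaptor does, the following pure-Python sketch wraps a resource that is not thread safe in a single lock. The classes ToyCountingResource and ToyThreadSafeAdaptor are hypothetical stand-ins and do not reflect RMM's actual implementation. As in the description above, only allocate and deallocate calls are protected; construction and destruction are not.

```python
import threading


class ToyCountingResource:
    """Not thread safe on its own: the read-modify-write below can race."""

    def __init__(self):
        self.outstanding = 0

    def allocate(self, nbytes):
        self.outstanding += nbytes
        return nbytes

    def deallocate(self, nbytes):
        self.outstanding -= nbytes


class ToyThreadSafeAdaptor:
    """Serializes allocate/deallocate on the wrapped resource with one lock."""

    def __init__(self, upstream):
        self._upstream = upstream
        self._lock = threading.Lock()

    def allocate(self, nbytes):
        with self._lock:
            return self._upstream.allocate(nbytes)

    def deallocate(self, nbytes):
        with self._lock:
            self._upstream.deallocate(nbytes)


upstream = ToyCountingResource()
safe = ToyThreadSafeAdaptor(upstream)


def worker():
    # Each thread performs balanced allocate/deallocate pairs.
    for _ in range(1000):
        safe.allocate(64)
        safe.deallocate(64)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(upstream.outstanding)
```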
C++ interfaces commonly allow customizable memory allocation through an Allocator object.
RMM provides several Allocator and Allocator-like classes.
rmm::mr::polymorphic_allocator is a stream-ordered allocator similar to std::pmr::polymorphic_allocator.
Unlike the standard C++ Allocator interface, the allocate and deallocate functions take a cuda_stream_view indicating the stream on which the (de)allocation occurs.
stream_allocator_adaptor can be used to adapt a stream-ordered allocator to present a standard Allocator interface to consumers that may not be designed to work with a stream-ordered interface.
Example:
rmm::cuda_stream stream;
rmm::mr::polymorphic_allocator<int> stream_alloc;
// Constructs an adaptor that forwards all (de)allocations to `stream_alloc` on `stream`.
auto adapted = rmm::mr::stream_allocator_adaptor(stream_alloc, stream);
// Allocates storage for 100 ints using `stream_alloc` on `stream`
auto p = adapted.allocate(100);
...
// Deallocates using `stream_alloc` on `stream`
adapted.deallocate(p, 100);
thrust_allocator is a device memory allocator that uses the strongly typed thrust::device_ptr, making it usable with containers like thrust::device_vector.
See below for more information on using RMM with Thrust.
rmm::device_buffer is an untyped, uninitialized RAII class for stream-ordered device memory allocation.
cuda_stream_view s{...};
// Allocates at least 100 bytes on stream `s` using the *default* resource
rmm::device_buffer b{100, s};
void* p = b.data(); // Raw, untyped pointer to underlying device memory
kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`
rmm::mr::cuda_memory_resource mr;
// Allocates at least 100 bytes on stream `s` using the resource `mr`
rmm::device_buffer b2{100, s, mr};
rmm::device_uvector is a typed, uninitialized RAII class for allocation of a contiguous set of elements in device memory.
Similar to a thrust::device_vector, but as an optimization, does not default initialize the
contained elements. This optimization restricts the types T to trivially copyable types.
cuda_stream_view s{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the
// default resource
rmm::device_uvector<int32_t> v(100, s);
// Initializes the elements to 0
thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});
rmm::mr::cuda_memory_resource mr;
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`
rmm::device_uvector<int32_t> v2{100, s, mr};
rmm::device_scalar is a typed RAII class for allocation of a single element in device memory.
This is similar to a device_uvector with a single element, but provides convenience functions like
modifying the value in device memory from the host, or retrieving the value from device to host.
cuda_stream_view s{...};
// Allocates uninitialized storage for a single `int32_t` in device memory
rmm::device_scalar<int32_t> a{s};
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`
kernel<<<..., s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory
int32_t v = a.value(s); // Retrieves the value from device to host on stream `s`
RAPIDS and other CUDA libraries make heavy use of Thrust. Thrust uses CUDA device memory in two situations:
- As the backing store for thrust::device_vector, and
- As temporary storage inside some algorithms, such as thrust::sort.
RMM provides rmm::mr::thrust_allocator as a conforming Thrust allocator that uses
RMM memory resources.
To instruct a Thrust algorithm to use rmm::mr::thrust_allocator to allocate temporary storage, you
can use the custom Thrust CUDA device execution policy: rmm::exec_policy(stream).
thrust::sort(rmm::exec_policy(stream), ...);
The first stream argument is the stream to use for rmm::mr::thrust_allocator.
The second stream argument is what should be used to execute the Thrust algorithm.
These two arguments must be identical.
RMM includes two forms of logging: memory event logging and debug logging.
Memory event logging writes details of every allocation or deallocation to a CSV (comma-separated
value) file. In C++, Memory Event Logging is enabled by using the logging_resource_adaptor as a
wrapper around any other device memory resource.
Each row in the log represents either an allocation or a deallocation. The columns of the file are "Thread, Time, Action, Pointer, Size, Stream".
The CSV output files of the logging_resource_adaptor can be used as input to REPLAY_BENCHMARK,
which is available when building RMM from source, in the gbenchmarks folder in the build directory.
This log replayer can be useful for profiling and debugging allocator issues.
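Because the log is plain CSV, it can also be inspected with a few lines of Python. The sketch below uses only the standard library and the column names listed above. The sample rows, the timestamp format, and the action strings "allocate" and "free" are assumptions made for illustration, and the helper summarize is hypothetical. It reports the peak number of outstanding bytes and any pointers that were never freed.

```python
import csv
import io

# Hypothetical log contents; real logs come from logging_resource_adaptor.
SAMPLE_LOG = """Thread,Time,Action,Pointer,Size,Stream
1,10:00:00.000001,allocate,0x7f0000000000,256,0x0
1,10:00:00.000002,allocate,0x7f0000000100,512,0x0
1,10:00:00.000003,free,0x7f0000000000,256,0x0
1,10:00:00.000004,allocate,0x7f0000000400,128,0x0
1,10:00:00.000005,free,0x7f0000000100,512,0x0
"""


def summarize(log_text):
    """Return (peak_bytes, leaked_pointers) for a memory event log."""
    live = {}          # pointer -> size of each outstanding allocation
    current = peak = 0
    for row in csv.DictReader(io.StringIO(log_text)):
        size = int(row["Size"])
        if row["Action"] == "allocate":
            live[row["Pointer"]] = size
            current += size
        elif row["Action"] == "free":
            # Fall back to the logged size if the allocation was not seen.
            current -= live.pop(row["Pointer"], size)
        peak = max(peak, current)
    return peak, sorted(live)


peak_bytes, leaked = summarize(SAMPLE_LOG)
print(peak_bytes, leaked)
```

For a real log, replace io.StringIO(log_text) with an open file handle and check the action strings against the file before relying on the result.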
The following C++ example creates a logging version of a cuda_memory_resource that outputs the log
to the file "logs/test1.csv".
std::string filename{"logs/test1.csv"};
rmm::mr::cuda_memory_resource upstream;
rmm::mr::logging_resource_adaptor log_mr{upstream, filename};
If a file name is not specified, the environment variable RMM_LOG_FILE is queried for the file
name. If RMM_LOG_FILE is not set, then an exception is thrown by the logging_resource_adaptor
constructor.
In Python, memory event logging is enabled when the logging parameter of rmm.reinitialize() is
set to True. The log file name can be set using the log_file_name parameter. See
help(rmm.reinitialize) for full details.
RMM leverages rapids-logger to log trace and debug
information to a file. This information can show when errors occur, when additional memory is
allocated from upstream resources, etc. By default output is logged to stderr, but the environment
variable RMM_DEBUG_LOG_FILE can be set to specify a path and file name to dump the logs to
instead.
There is a CMake configuration variable RMM_LOGGING_LEVEL, which can be set to enable compilation
of more detailed logging. The default is INFO. Available levels are TRACE, DEBUG, INFO,
WARN, ERROR, CRITICAL and OFF.
Note that to see logging below the INFO level, the application must also set the logging level at
run time. C++ applications must call rmm::default_logger().set_level(), for example to enable all
levels of logging down to TRACE, call rmm::default_logger().set_level(rapids_logger::level_enum::trace) (and compile
librmm with -DRMM_LOGGING_LEVEL=TRACE). Python applications must call rmm.set_logging_level(),
for example to enable all levels of logging down to TRACE, call rmm.set_logging_level("trace")
(and compile the RMM Python module with -DRMM_LOGGING_LEVEL=TRACE).
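The interaction between the compile-time and run-time levels can be modeled as two filters that a message must both pass. The snippet below is a toy model in plain Python, not RMM or rapids-logger code: the function is_emitted is hypothetical, and the default run-time level of INFO is an assumption based on the description above.

```python
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "CRITICAL", "OFF"]


def is_emitted(message_level, compiled_level="INFO", runtime_level="INFO"):
    """Toy model: a message must pass both the build-time and run-time filter."""
    rank = LEVELS.index
    threshold = max(rank(compiled_level), rank(runtime_level))
    return rank(message_level) >= threshold


# Compiled with TRACE but run-time level left at INFO: TRACE is still hidden.
print(is_emitted("TRACE", compiled_level="TRACE"))
# Both set to TRACE: the message appears.
print(is_emitted("TRACE", compiled_level="TRACE", runtime_level="TRACE"))
# Run-time TRACE cannot recover messages compiled out at INFO.
print(is_emitted("TRACE", compiled_level="INFO", runtime_level="TRACE"))
```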
Note that debug logging is different from the CSV memory allocation logging provided by
rmm::mr::logging_resource_adaptor. The latter is for logging a history of allocation /
deallocation actions which can be useful for replay with RMM's replay benchmark.
Memory allocations taken from a memory resource that allocates a pool of memory (such as
pool_memory_resource and arena_memory_resource) are part of the same low-level CUDA memory
allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to
be detected by CUDA tools such as
CUDA Compute Sanitizer memcheck.
Exceptions to this are cuda_memory_resource, which wraps cudaMalloc, and
cuda_async_memory_resource, which uses cudaMallocAsync with CUDA's built-in memory pool
functionality (introduced in CUDA 11.2). Illegal memory accesses to memory allocated by these
resources are detectable with Compute Sanitizer Memcheck.
It may be possible in the future to add support for memory bounds checking with other memory resources using NVTX APIs.
There are two ways to use RMM in Python code:
- Using the rmm.DeviceBuffer API to explicitly create and manage device memory allocations
- Transparently via external libraries such as CuPy and Numba-CUDA
RMM provides memory resource objects to control how device memory is allocated in both the above uses.
A DeviceBuffer represents an untyped, uninitialized device memory
allocation. DeviceBuffer objects can be created by providing the
size of the allocation in bytes:
>>> import rmm
>>> buf = rmm.DeviceBuffer(size=100)
The size of the allocation and the memory address associated with it
can be accessed via the .size and .ptr attributes respectively:
>>> buf.size
100
>>> buf.ptr
140202544726016
A DeviceBuffer can also be created by copying data from host memory:
>>> import rmm
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='float64')
>>> buf = rmm.DeviceBuffer.to_device(a.tobytes())
>>> buf.size
24
Conversely, the data in a DeviceBuffer can be copied to the host:
>>> np.frombuffer(buf.tobytes())
array([1., 2., 3.])
Memory resource objects are used to configure how device memory allocations are made by RMM.
By default if a memory resource is not set explicitly, RMM uses the CudaMemoryResource, which
uses cudaMalloc for allocating device memory.
rmm.reinitialize() provides an easy way to initialize RMM with specific memory resource options
across multiple devices. See help(rmm.reinitialize) for full details.
For lower-level control, the rmm.mr.set_current_device_resource() function can be
used to set a different resource for the current CUDA device. For
example, enabling the ManagedMemoryResource tells RMM to use
cudaMallocManaged instead of cudaMalloc for allocating memory:
>>> import rmm
>>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
⚠️ The default resource must be set for any device before allocating any device memory on that device. Setting or changing the resource after device allocations have been made can lead to unexpected behavior or crashes. See Multiple Devices.
As another example, PoolMemoryResource allows you to allocate a
large "pool" of device memory up-front. Subsequent allocations will
draw from this pool of already allocated memory. The example
below shows how to construct a PoolMemoryResource with an initial size
of 1 GiB and a maximum size of 4 GiB. The pool uses
CudaMemoryResource as its underlying ("upstream") memory resource:
>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
... rmm.mr.CudaMemoryResource(),
... initial_pool_size="1GiB", # equivalent to initial_pool_size=2**30
... maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)
Other memory resources include:
- FixedSizeMemoryResource for allocating fixed blocks of memory
- BinningMemoryResource for allocating blocks within specified "bin" sizes from different memory resources
Memory resources are highly configurable and can be composed together in different ways.
See help(rmm.mr) for more information.
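The pool example above notes that "1GiB" is equivalent to 2**30 bytes. The arithmetic behind such size strings can be checked with a small standard-library helper. The function to_bytes below is hypothetical and written only for illustration. It is not RMM's parser, which may accept other spellings.

```python
import re

# Binary (power-of-two) units, matching the "GiB" spelling used above.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_bytes(size):
    """Convert an int or a string such as "1GiB" to a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]iB|B)\s*", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


initial = to_bytes("1GiB")
maximum = to_bytes("4GiB")
print(initial == 2**30, maximum // initial)
```

With these values the pool can grow to at most four times its initial size before further allocations fail.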
You can configure CuPy to use RMM for memory
allocations by setting the CuPy CUDA allocator to
rmm_cupy_allocator:
>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> import cupy
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)
Note: This only configures CuPy to use the current RMM resource for allocations. It does not initialize or change the current resource, e.g., by enabling a memory pool. See the memory resource examples above for more information on changing the current memory resource.
You can configure Numba-CUDA to use RMM for memory allocations using the Numba-CUDA EMM Plugin.
This can be done in two ways:
- Setting the environment variable NUMBA_CUDA_MEMORY_MANAGER:
$ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args)
- Using the set_memory_manager() function provided by Numba-CUDA:
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)
Note: This only configures Numba-CUDA to use the current RMM resource for allocations. It does not initialize or change the current resource, e.g., by enabling a memory pool. See the memory resource examples above for more information on changing the current memory resource.
PyTorch can use RMM for memory allocation. For example, to configure PyTorch to use an RMM-managed pool:
import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
PyTorch and RMM will now share the same memory pool.
You can, of course, use a custom memory resource with PyTorch as well:
import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch
# Note that you can configure PyTorch to use RMM either before or
# after changing RMM's memory resource. PyTorch will use whatever
# memory resource is configured to be the "current" memory resource at
# the time of allocation.
torch.cuda.change_current_allocator(rmm_torch_allocator)
# Configure RMM to use a managed memory resource, wrapped with a
# statistics resource adaptor that can report information about the
# amount of memory allocated:
mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)
x = torch.tensor([1, 2]).cuda()
# The memory resource reports information about PyTorch allocations:
print(mr.allocation_counts)
# {'current_bytes': 16,
# 'current_count': 1,
# 'peak_bytes': 16,
# 'peak_count': 1,
# 'total_bytes': 16,
# 'total_count': 1}
When interacting with a C++ library that uses RMM from Python, prefer C++ APIs that accept an explicit memory resource argument for allocations returned to Python. This lets Python callers pass the resource they want to use instead of relying on process-global current-resource state.
For example, a C++ function returning an rmm::device_buffer can accept an owning type-erased
resource:
std::unique_ptr<rmm::device_buffer> allocate(
std::size_t size,
cuda::mr::any_resource<cuda::mr::device_accessible> mr =
rmm::mr::get_current_device_resource_ref())
{
return std::make_unique<rmm::device_buffer>(size, rmm::cuda_stream_default, std::move(mr));
}
The Python DeviceBuffer class has a convenience Cython function, c_from_unique_ptr, to construct
a DeviceBuffer from a unique_ptr<rmm::device_buffer>, taking ownership of it. Pass the same
Python memory resource object when constructing the Python wrapper if the wrapper should expose that
resource association.
