14,733 questions
2 votes, 1 answer, 67 views
Executing a CUDA Graph from a CUDA kernel
I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch).
From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...
1 vote, 0 answers, 49 views
How can a removal of a boundary check introduce a BSYNC instruction in a following memory action?
I have some CUDA kernel code doing the following:
half * restrict output;
half v;
// ... etc ...
int i = whatever();
#ifdef CHECK_X
if (i >= 0 && i <= SOME_CONSTANT)
#endif
{
output[...
0 votes, 1 answer, 29 views
Automatic cmake parameter /Zc:__cplusplus interpreted as a file name by nvcc
I am working on a C++ project on Windows, using CUDA 12.0, CMake 3.31.6, and vcpkg (updated to recent commit a62ce77).
During configuration CMake tries to launch nvcc with some small test program to get ...
-3 votes, 0 answers, 54 views
How to update nvcc to match with my CUDA version?
The CUDA version.json file (/usr/local/cuda/version.json) gives the CUDA version:
"cuda" : {
"name" : "CUDA SDK",
"version" : "12.3.1"
}
...
0 votes, 0 answers, 39 views
TensorRT: enqueueV3 fails when using dynamic shapes and Green Contexts
I am trying to benchmark TensorRT inference using CUDA Green Contexts and splitting SMs. My code runs fine when I generate the .engine with fixed input shapes, but it fails when I build the engine ...
0 votes, 1 answer, 70 views
CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop
I am trying to implement the producer-consumer pattern between the GPU and the CPU, which I need for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...
0 votes, 1 answer, 37 views
Dask-CUDA LocalCUDACluster on WSL2: NVML errors despite enable_nvml=False
I’m trying to set up a LocalCUDACluster on WSL2 (Ubuntu 22.04) from Windows 11 for GPU computations. The cluster starts and runs, but performance is ~10× slower than running directly on the GPU, and ...
-3 votes, 1 answer, 51 views
How to debug cuda kernels in python, using vscode (linux)
I use CuPy to call CUDA kernels, but I don't know how to debug the CUDA code. Here is my wrapper file:
wrapper.py
import math
from pathlib import Path
import cupy as cp
import numpy as np
with open(Path(...
-7 votes, 1 answer, 142 views
Is it possible to use clangd with cuda 13 and c++? [closed]
I have some C++ code that uses CUDA. My editor (which uses clangd) is reporting a lot of spurious errors. For example, the code
#include <string>
void main() {
std::string x = "";...
7 votes, 1 answer, 237 views
How do I get the GPU clock rate in CUDA 13?
I updated CUDA to version 13.
But it seems that cudaGetDeviceProperties has changed.
Instead of returning the cudaDeviceProp struct with clockRate, it returns a mutilated version thereof with ...
0 votes, 1 answer, 93 views
How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)
I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing.
My approach so far:
Compute the theoretical ...
1 vote, 1 answer, 104 views
ILGPU kernel silently not compiling
I am trying to debug a kernel written for ILGPU which does not compile.
My application has two big kernels.
The first (that loads and does the right thing):
/// <summary>
/// Unified GPU kernel ...
1 vote, 1 answer, 143 views
std::complex in cuda kernels
CUDA allows calling constexpr member functions from device code when compiling with --expt-relaxed-constexpr. This makes it possible to use std::complex<double> in CUDA kernels. However, while doing this, I get incorrect ...
3 votes, 0 answers, 79 views
CUDA: Load misaligned float4 vector
I want to load 4 floats per thread. I know they are not 16-byte aligned. What is the best way to do this?
Specific conditions:
I cannot align the array without replicating the data because other ...
0 votes, 0 answers, 41 views
When using NVML, do we need to call nvmlShutdown before exiting the process?
The NVML library has the API calls:
nvmlReturn_t nvmlInit();
nvmlReturn_t nvmlShutdown();
We need to call the first of these before performing any NVML operations on devices and such. And - after we'...