2 votes
1 answer
67 views

Executing a CUDA Graph from a CUDA kernel

I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch). From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...
Mohammad Siavashi
1 vote
0 answers
49 views

How can a removal of a boundary check introduce a BSYNC instruction in a following memory action?

I have some CUDA kernel code doing the following: half * restrict output; half v; // ... etc ... int i = whatever(); #ifdef CHECK_X if (i >= 0 && i <= SOME_CONSTANT) #endif { output[...
einpoklum
  • 137k
0 votes
1 answer
29 views

Automatic cmake parameter /Zc:__cplusplus interpreted as a file name by nvcc

I am working on a C++ project on Windows, using CUDA 12.0, CMake 3.31.6 and vcpkg (updated to recent commit a62ce77). During configuration, CMake tries to launch nvcc with some small test program to get ...
CygnusX1
  • 22.1k
-3 votes
0 answers
54 views

How to update nvcc to match with my CUDA version?

The cuda version.json file (/usr/local/cuda/version.json) gives the CUDA version "cuda" : { "name" : "CUDA SDK", "version" : "12.3.1" } ...
Uwe.Schneider
0 votes
0 answers
39 views

TensorRT: enqueueV3 fails when using dynamic shapes and Green Contexts

I am trying to benchmark TensorRT inference using CUDA Green Contexts and splitting SMs. My code runs fine when I generate the .engine with fixed input shapes, but it fails when I build the engine ...
Gota_12
  • 23
0 votes
1 answer
70 views

CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop

I am trying to implement the producer-consumer problem between the GPU and the CPU, which I need for another project. The GPU requests some data from the CPU via Unified Memory. The CPU copies that data to a specific location in global ...
Chinmaya Bhat K K
0 votes
1 answer
37 views

Dask-CUDA LocalCUDACluster on WSL2: NVML errors despite enable_nvml=False

I’m trying to set up a LocalCUDACluster on WSL2 (Ubuntu 22.04) from Windows 11 for GPU computations. The cluster starts and runs, but performance is ~10× slower than running directly on the GPU, and ...
Marek Majoch
-3 votes
1 answer
51 views

How to debug cuda kernels in python, using vscode (linux)

I use CuPy to call CUDA kernels, but I don't know how to debug the CUDA code. Here is my wrapper file: wrapper.py import math from pathlib import Path import cupy as cp import numpy as np with open(Path(...
S200331082
-7 votes
1 answer
142 views

Is it possible to use clangd with cuda 13 and c++? [closed]

I have some C++ code that uses CUDA. My editor (which uses clangd) is reporting a lot of spurious errors. For example the code #include <string> void main() { std::string x = "";...
drg
  • 9
7 votes
1 answer
237 views

How do I get the GPU clock rate in CUDA 13?

I updated CUDA to version 13. But it seems that cudaGetDeviceProperties has changed. Instead of returning the cudaDeviceProp struct with clockRate, it returns a mutilated version thereof with ...
Johan
  • 77.3k
0 votes
1 answer
93 views

How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...
plznobug
  • 133
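
The excerpt above is cut off, so the asker's actual calculation is not shown. As a rough sketch of one common way to estimate a theoretical peak and a utilization figure, the host-side C++ below multiplies the memory clock by the bus width and by a double-data-rate factor of 2. The function names, the factor of 2, and the choice of units are assumptions made for illustration; the clock and bus width would typically come from something like `cudaDeviceGetAttribute` with `cudaDevAttrMemoryClockRate` (kHz) and `cudaDevAttrGlobalMemoryBusWidth` (bits).

```cpp
#include <cassert>

// Hypothetical helper: peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).
// Assumes double-data-rate memory, i.e. two transfers per clock cycle.
inline double theoretical_bandwidth_gbps(double mem_clock_khz, int bus_width_bits) {
    assert(mem_clock_khz > 0.0 && bus_width_bits > 0 && bus_width_bits % 8 == 0);
    const double bytes_per_transfer = bus_width_bits / 8.0;
    const double transfers_per_second = 2.0 * mem_clock_khz * 1e3;  // kHz -> Hz, x2 for DDR
    return transfers_per_second * bytes_per_transfer / 1e9;
}

// Hypothetical helper: fraction of the peak that a run actually achieved,
// given the bytes it moved and the elapsed time in seconds.
inline double bandwidth_utilization(double bytes_moved, double seconds, double peak_gbps) {
    assert(seconds > 0.0 && peak_gbps > 0.0);
    const double achieved_gbps = bytes_moved / seconds / 1e9;
    return achieved_gbps / peak_gbps;
}
```

A number computed this way covers only the bytes the program itself counts, while a hardware counter such as DCGM's DRAM_ACTIVE measures activity on the memory interface, so the two need not agree.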
1 vote
1 answer
104 views

ILGPU kernel silently not compiling

I am trying to debug a kernel written for ILGPU which does not compile. My application has 2 big kernels. The first (that loads and does the right thing): /// <summary> /// Unified GPU kernel ...
AlessandroParma
1 vote
1 answer
143 views

std::complex in cuda kernels

CUDA allows calling constexpr member functions when compiling with --expt-relaxed-constexpr. This makes it possible to use std::complex<double> in CUDA kernels. However, while doing this, I get incorrect ...
thetwom
  • 57
3 votes
0 answers
79 views

CUDA: Load misaligned float4 vector

I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this? Specific conditions: I cannot align the array without replicating the data because other ...
Homer512
  • 15k
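
For the misaligned float4 question above, one portable baseline is to copy the 16 bytes with `memcpy` instead of casting the pointer to a vector type, since a vector load through a pointer that is not 16-byte aligned is not valid in CUDA. The sketch below is plain host-side C++ so that it runs without a GPU; `Float4` is a hypothetical stand-in for CUDA's `float4`, and `load_float4_unaligned` is an illustrative name. It is not claimed to be the fastest option on any particular GPU.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for CUDA's float4 so the sketch compiles as plain C++.
struct Float4 { float x, y, z, w; };
static_assert(sizeof(Float4) == 4 * sizeof(float), "Float4 must be tightly packed");

// Load four consecutive floats starting at src. src only needs the natural
// 4-byte alignment of float, not the 16-byte alignment a vector load requires.
inline Float4 load_float4_unaligned(const float* src) {
    assert(src != nullptr);
    Float4 out;
    std::memcpy(&out, src, sizeof(out));  // byte-wise copy, no alignment assumption
    return out;
}
```

The same function body can be used in device code, where the compiler is free to lower the copy to narrower loads. Whether that beats alternatives such as aligned loads combined with shuffles would have to be measured.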
0 votes
0 answers
41 views

When using NVML, do we need to call nvmlShutdown before exiting the process?

The NVML library has the API calls: nvmlReturn_t nvmlInit(); nvmlReturn_t nvmlShutdown(); We need to call the first of these before performing any NVML operations on devices and such. And - after we'...
einpoklum
  • 137k
