Newest 'cuda' Questions - Stack Overflow

2 votes

1 answer

67 views

Executing a CUDA Graph from a CUDA kernel

I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch). From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...

Mohammad Siavashi

1,292

asked 16 hours ago

1 vote

0 answers

49 views

How can a removal of a boundary check introduce a BSYNC instruction in a following memory action?

I have some CUDA kernel code doing the following: half * restrict output; half v; // ... etc ... int i = whatever(); #ifdef CHECK_X if (i >= 0 && i <= SOME_CONSTANT) #endif { output[...

einpoklum

137k

asked Oct 8 at 14:27

0 votes

1 answer

29 views

Automatic cmake parameter /Zc:__cplusplus interpreted as a file name by nvcc

I am working on a C++ project on Windows, using CUDA 12.0, CMake 3.31.6, and vcpkg (updated to recent commit a62ce77). During configuration, CMake tries to launch nvcc with a small test program to get ...

CygnusX1

22.1k

asked Oct 3 at 17:07

-3 votes

0 answers

54 views

How to update nvcc to match with my CUDA version?

The cuda version.json file (/usr/local/cuda/version.json) gives the CUDA version "cuda" : { "name" : "CUDA SDK", "version" : "12.3.1" } ...

Uwe.Schneider

1,467

asked Oct 2 at 17:18

0 votes

0 answers

39 views

TensorRT: enqueueV3 fails when using dynamic shapes and Green Contexts

I am trying to benchmark TensorRT inference using CUDA Green Contexts and splitting SMs. My code runs fine when I generate the .engine with fixed input shapes, but it fails when I build the engine ...

Gota_12

23

asked Oct 2 at 14:14

0 votes

1 answer

70 views

CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop

I am trying to implement the producer-consumer pattern between the GPU and the CPU, which is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...

Chinmaya Bhat K K

1

asked Sep 30 at 18:38

0 votes

1 answer

37 views

Dask-CUDA LocalCUDACluster on WSL2: NVML errors despite enable_nvml=False

I’m trying to set up a LocalCUDACluster on WSL2 (Ubuntu 22.04) from Windows 11 for GPU computations. The cluster starts and runs, but performance is ~10× slower than running directly on the GPU, and ...

Marek Majoch

1

asked Sep 29 at 9:37

-3 votes

1 answer

51 views

How to debug cuda kernels in python, using vscode (linux)

I use CuPy to call CUDA kernels, but I don't know how to debug the CUDA code. Here is my wrapper file, wrapper.py: import math from pathlib import Path import cupy as cp import numpy as np with open(Path(...

S200331082

1

asked Sep 25 at 13:08

-7 votes

1 answer

142 views

Is it possible to use clangd with cuda 13 and c++? [closed]

I have some c++ code that uses cuda. My editor (which uses clangd) is reporting a lot of spurious errors. For example the code #include <string> void main() { std::string x = "";...

drg

9

asked Sep 17 at 17:57

7 votes

1 answer

237 views

How do I get the GPU clock rate in CUDA 13?

I updated CUDA to version 13. But it seems that cudaGetDeviceProperties has changed. Instead of returning the cudaDeviceProp struct with clockRate, it returns a mutilated version thereof with ...

Johan

77.3k

asked Sep 12 at 11:45

0 votes

1 answer

93 views

How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...

plznobug

133

asked Sep 5 at 10:48

1 vote

1 answer

104 views

ILGPU kernel silently not compiling

I am trying to debug a kernel written for ILGPU which does not compile. My application has 2 big kernels. The first (that loads and does the right thing): /// <summary> /// Unified GPU kernel ...

AlessandroParma

151

asked Aug 29 at 12:23

1 vote

1 answer

143 views

std::complex in cuda kernels

CUDA allows calling constexpr member functions in device code when compiling with --expt-relaxed-constexpr. This makes it possible to use std::complex<double> in CUDA kernels. However, while doing this, I get incorrect ...

thetwom

57

asked Aug 27 at 15:00

3 votes

0 answers

79 views

CUDA: Load misaligned float4 vector

I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this? Specific conditions: I cannot align the array without replicating the data because other ...

Homer512

15k

asked Aug 22 at 12:53
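
One portable way to approach the misaligned load asked about here is to copy the four floats as plain bytes, which needs only the 4-byte alignment a `float` already has. The sketch below is plain C++ and does not come from the question or its answers: `Float4` and `load_unaligned4` are hypothetical names, and it makes no claim about which approach is fastest on a particular GPU.

```cpp
#include <cstring>

// Hypothetical stand-in for CUDA's float4, without the 16-byte alignment
// requirement that the built-in vector type carries.
struct Float4 {
    float x, y, z, w;
};

// Load four consecutive floats starting at p.
// p only needs to be 4-byte aligned. Copying bytes with std::memcpy avoids
// casting p to an over-aligned vector pointer, which would be undefined
// behaviour when p is not 16-byte aligned.
inline Float4 load_unaligned4(const float* p) {
    Float4 v;
    std::memcpy(&v, p, sizeof(Float4));
    return v;
}
```

Calling `load_unaligned4(data + 1)` on a 16-byte-aligned array `data` reads elements 1 through 4, a span that a direct `float4` load could not legally address.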

0 votes

0 answers

41 views

When using NVML, do we need to call nvmlShutdown before exiting the process?

The NVML library has the API calls: nvmlReturn_t nvmlInit(); nvmlReturn_t nvmlShutdown(); We need to call the first of these before performing any NVML operations on devices and such. And - after we'...

einpoklum

137k

asked Aug 21 at 10:49
