The ROCdebug-agent is a library that can be loaded by ROCm Platform Runtime to provide the following functionality:
- Print the state of wavefronts that report memory violation or upon executing a "s_trap 2" instruction.
- Allows SIGINT (
ctrl c) or SIGTERM (kill -15) to print wavefront state of aborted GPU dispatches. - It is enabled on Vega10 (since ROCm1.9), Vega20 (since ROCm2.0) GPUs.
To use the ROCdebug-agent set the following environment variable:
export HSA_TOOLS_LIB=librocm-debug-agent.soThis will use ROCdebug-agent library installed at /opt/rocm/lib/librocm-debug-agent.so by default since the ROCm installation adds /opt/rocm/lib to the system library path. To use a different version set the LD_LIBRARY_PATH, for example:
export LD_LIBRARY_PATH=/path_to_directory_containing_librocm-debug-agent.soTo display the machine code instructions of wavefronts, together with the source text location, the ROCdebug-agent using the llvm-objdump tool. (1) Compile the code object with "-g" for OpenCL, or "-gline-tables-only" for HIP/hcc. (2) Ensure a llvm-objdump that supports AMD GCN GPUs is on your '$PATH`. For example, for the ROCm1.9:
export PATH=/opt/rocm/opencl/bin/x86_64/:$PATHExecute your application.
If the application encounters a GPU error, it will display the wavefront state of the GPU to stdout. Possible error states include:
- The GPU executes a memory instruction that causes a memory violation. This is reported as an XNACK error state.
- Queue error.
- The GPU executes an
S_TRAPinstruction. The__builtin_trap()language builtin can be used to generate aS_TRAP. - A SIGINT (
ctrl c) or SIGTERM (kill -15) signal is sent to the application while executing GPU code. Enabled by theROCM_DEBUG_ENABLE_LINUX_SIGNALSenvironment variable.
For example, a sample print out for GPU memory fault is:
Memory access fault by GPU agent: AMD gfx900
Node: 1
Address: 0x18DB4xxx (page not present;write access to a read-only page;)
64 wavefront(s) found in XNACK error state @PC: 0x0000001100E01310
printing the first one:
EXEC: 0xFFFFFFFFFFFFFFFF
STATUS: 0x00412460
TRAPSTS: 0x30000000
M0: 0x00001010
s0: 0x00C00000 s1: 0x80000010 s2: 0x10000000 s3: 0x00EA4FAC
s4: 0x17D78400 s5: 0x00000000 s6: 0x01039000 s7: 0x00000000
s8: 0x00000000 s9: 0x00000000 s10: 0x17D78400 s11: 0x04000000
s12: 0x00000000 s13: 0x00000000 s14: 0x00000000 s15: 0x00000000
s16: 0x0103C000 s17: 0x00000000 s18: 0x00000000 s19: 0x00000000
s20: 0x01037060 s21: 0x00000000 s22: 0x00000000 s23: 0x00000011
s24: 0x00004000 s25: 0x00010000 s26: 0x04C00000 s27: 0x00000010
s28: 0xFFFFFFFF s29: 0xFFFFFFFF s30: 0x00000000 s31: 0x00000000
Lane 0x0
v0: 0x00000003 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Lane 0x1
v0: 0x00000004 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Lane 0x2
v0: 0x00000005 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Lane 0x3
v0: 0x00000006 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
.
.
.
Lane 0x3C
v0: 0x0000001F v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Lane 0x3D
v0: 0x00000020 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Lane 0x3E
v0: 0x00000021 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Lane 0x3F
v0: 0x00000022 v1: 0x18DB4400 v2: 0x18DB4400 v3: 0x00000000
v4: 0x00000000 v5: 0x00000000 v6: 0x00700000 v7: 0x00800000
Faulty Code Object:
/tmp/ROCm_Tmp_PID_5764/ROCm_Code_Object_0: file format ELF64-amdgpu-hsacobj
Disassembly of section .text:
the_kernel:
; /home/qingchuan/tests/faulty_test/vector_add_kernel.cl:12
; d[100000000] = ga[gid & 31];
v_mov_b32_e32 v1, v2 // 0000000012F0: 7E020302
v_mov_b32_e32 v4, v3 // 0000000012F4: 7E080303
v_add_i32_e32 v1, vcc, s10, v1 // 0000000012F8: 3202020A
v_mov_b32_e32 v5, s22 // 0000000012FC: 7E0A0216
v_addc_u32_e32 v4, vcc, v4, v5, vcc // 000000001300: 38080B04
v_mov_b32_e32 v2, v1 // 000000001304: 7E040301
v_mov_b32_e32 v3, v4 // 000000001308: 7E060304
s_waitcnt lgkmcnt(0) // 00000000130C: BF8CC07F
flat_store_dword v[2:3], v0 // 000000001310: DC700000 00000002
; /home/qingchuan/tests/faulty_test/vector_add_kernel.cl:13
; }
s_endpgm // 000000001318: BF810000
Faulty PC offset: 1310
Aborted (core dumped)
By default the wavefront dump is sent to stdout.
To save to a file use:
export ROCM_DEBUG_WAVE_STATE_DUMP=fileThis will create a file called ROCm_Wave_State_Dump in code object directory (see below).
To return to the default stdout use either of the following:
export ROCM_DEBUG_WAVE_STATE_DUMP=stdout
unset ROCM_DEBUG_WAVE_STATE_DUMPThe following environment variable can be used to enable dumping wavefront states when SIGINT (ctrl c) or SIGTERM (kill -15) is sent to the application:
export ROCM_DEBUG_ENABLE_LINUX_SIGNALS=1Either of the following will disable this behavior:
export ROCM_DEBUG_ENABLE_LINUX_SIGNALS=0
unset ROCM_DEBUG_ENABLE_LINUX_SIGNALSWhen the ROCdebug-agent is enabled, each GPU code object loaded by the ROCm Platform Runtime will be saved in a file in the code object directory. By default the code object directory is /tmp/ROCm_Tmp_PID_XXXX/ where XXXX is the application process ID. The code object directory can be specified using the following environent variable:
export ROCM_DEBUG_SAVE_CODE_OBJECT=code_object_directoryThis will use the path /code_object_directory.
Loaded code objects will be saved in files named ROCm_Code_Object_N where N is a unique integer starting at 0 of the order in which the code object was loaded.
If the default code object directory is used, then the saved code object file will be deleted when it is unloaded with the ROCm Platform Runtime, and the complete code object directory will be deleted when the application exits normally. If a code object directory path is specified then neither the saved code objects, nor the code object directory will be deleted.
To return to using the default code object directory use:
unset ROCM_DEBUG_SAVE_CODE_OBJECTBy default ROCdebug-agent logging is disabled. It can be enabled to display to stdout using:
export ROCM_DEBUG_ENABLE_AGENTLOG=stdoutOr to a file using:
export ROCM_DEBUG_ENABLE_AGENTLOG=<filename>Which will write to the file <filename>_AgentLog_PID_XXXX.log.
To disable logging use:
unset ROCM_DEBUG_ENABLE_AGENTLOGsrc- Contains the sources for building the ROCdebug-agent. See the
README.mdfor directions.
- Contains the sources for building the ROCdebug-agent. See the
test- Contains the tests for the ROCdebug-agent. See the
README.mdfor directions.
- Contains the tests for the ROCdebug-agent. See the