Debugging Triton

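This tutorial provides guidance for debugging Triton programs. It is written mainly for Triton users. Developers interested in exploring the Triton backend, including MLIR code transformation and LLVM code generation, can also use this section to explore the available debugging options.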

Using Triton's Debugging Operations

Triton includes four debugging operators that let users inspect tensor values:

static_print and static_assert are for compile-time debugging.
device_print and device_assert are for runtime debugging.

device_assert executes only when TRITON_DEBUG is set to 1. The other debugging operators execute regardless of the value of TRITON_DEBUG.
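The four operators can appear together in a single kernel. The following is a minimal sketch rather than code from the Triton sources: the kernel name `checked_copy`, its parameters, and the `HAVE_TRITON` flag are hypothetical names chosen for illustration. The import guard is only there so the snippet loads on a machine without Triton.

```python
import os

# device_assert only fires when TRITON_DEBUG is 1, so set it early,
# before any kernel is compiled.
os.environ["TRITON_DEBUG"] = "1"

try:
    import triton
    import triton.language as tl
    HAVE_TRITON = True
except ImportError:  # illustrative guard: lets the sketch load without Triton
    HAVE_TRITON = False

if HAVE_TRITON:

    @triton.jit
    def checked_copy(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Compile-time operators: evaluated while the kernel is being compiled.
        tl.static_assert(BLOCK_SIZE >= 16, "BLOCK_SIZE must be at least 16")
        tl.static_print("compiling with BLOCK_SIZE =", BLOCK_SIZE)

        pid = tl.program_id(axis=0)
        offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask, other=0.0)

        # Runtime operators: executed on the device by each program instance.
        tl.device_print("x", x)
        tl.device_assert(x == x, "x contains NaN")  # NaN is never equal to itself
        tl.store(y_ptr + offs, x, mask=mask)
```

A launch would look like `checked_copy[(num_blocks,)](x, y, n, BLOCK_SIZE=64)`, where `x`, `y`, `n`, and `num_blocks` are placeholders for the caller's tensors, element count, and grid size.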

Using the Interpreter

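The interpreter is a simple and useful tool for debugging Triton programs. It lets Triton users run Triton programs on the CPU and inspect the intermediate result of each operation. To enable interpreter mode, set the environment variable TRITON_INTERPRET to 1. With this setting, all Triton kernels bypass compilation and are simulated by the interpreter using NumPy equivalents of Triton operations. The interpreter processes the Triton program instances sequentially, executing one operation at a time.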
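The variable is normally set in the shell, but it can also be set from Python, for example in a notebook. The helper below is a sketch under one assumption that is not stated in this tutorial: that the variable should be in place before kernels are defined, so the helper warns if Triton has already been imported. The function name `enable_triton_interpreter` is hypothetical.

```python
import os
import sys
import warnings


def enable_triton_interpreter() -> None:
    """Switch on interpreter mode for kernels defined after this call."""
    # Assumption: kernels defined before the variable is set may already be
    # bound to the compiled path, so flag an earlier Triton import.
    if "triton" in sys.modules:
        warnings.warn(
            "triton is already imported; kernels defined earlier "
            "may not run in the interpreter"
        )
    os.environ["TRITON_INTERPRET"] = "1"


enable_triton_interpreter()
# Import Triton and define kernels only after this point.
```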

There are three main ways to use the interpreter:

Print the intermediate result of each operation with the Python print function. To inspect a whole tensor, use print(tensor). To inspect the single tensor value at idx, use print(tensor.handle.data[idx]).

Attach pdb to step through the Triton program:

TRITON_INTERPRET=1 pdb main.py
b main.py:<line number>
r

Import the pdb package and set a breakpoint inside the Triton program:

import triton
import triton.language as tl
import pdb

@triton.jit
def kernel(x_ptr, y_ptr, BLOCK_SIZE: tl.constexpr):
  pdb.set_trace()
  offs = tl.arange(0, BLOCK_SIZE)
  x = tl.load(x_ptr + offs)
  tl.store(y_ptr + offs, x)
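The print-based approach can be sketched in the same way. The example below is illustrative only: `inspect_copy`, `run_demo`, and `HAVE_TRITON` are hypothetical names, and it assumes PyTorch is installed and that CPU tensors can be passed to a kernel in interpreter mode. The import guard only lets the snippet load where Triton or PyTorch is missing.

```python
import os

# Interpreter mode has to be requested before the kernel is defined.
os.environ["TRITON_INTERPRET"] = "1"

try:
    import torch
    import triton
    import triton.language as tl
    HAVE_TRITON = True
except ImportError:  # illustrative guard, not part of the technique
    HAVE_TRITON = False

if HAVE_TRITON:

    @triton.jit
    def inspect_copy(x_ptr, y_ptr, BLOCK_SIZE: tl.constexpr):
        offs = tl.arange(0, BLOCK_SIZE)
        x = tl.load(x_ptr + offs)
        print(x)                 # the whole tensor
        print(x.handle.data[0])  # the single value at index 0
        tl.store(y_ptr + offs, x)

    def run_demo():
        """Launch one program instance over an 8-element CPU tensor."""
        x = torch.arange(8, dtype=torch.float32)
        y = torch.empty_like(x)
        inspect_copy[(1,)](x, y, BLOCK_SIZE=8)
        return y
```

Calling `run_demo()` runs the kernel on the CPU and prints the loaded values as each `print` line is reached.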

Limitations

The interpreter has several known limitations:

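It does not support operations on the bfloat16 numeric type. To operate on bfloat16 tensors, convert them to float32 first with tl.cast(tensor, tl.float32).
It does not support indirect memory access patterns such as: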
```
ptr = tl.load(ptr)
x = tl.load(ptr)
```

Using Third-Party Tools

For debugging on NVIDIA GPUs, compute-sanitizer is an effective tool for checking data races and memory access issues. To use it, prepend compute-sanitizer to the command that runs your Triton program.
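When runs are driven from a script, the prefixing can be done programmatically. This is a small sketch with hypothetical helper names (`sanitizer_command`, `run_under_sanitizer`); it assumes the program is started as `python main.py` and uses compute-sanitizer's `--tool` option to choose between checks such as `memcheck` and `racecheck`.

```python
import shlex
import shutil
import subprocess
import sys


def sanitizer_command(program_args, tool="memcheck"):
    """Return the command line with compute-sanitizer prepended."""
    return ["compute-sanitizer", "--tool", tool, *program_args]


def run_under_sanitizer(program_args, tool="memcheck"):
    """Run the program under compute-sanitizer and return its exit code."""
    if shutil.which("compute-sanitizer") is None:
        raise FileNotFoundError("compute-sanitizer was not found on PATH")
    return subprocess.run(sanitizer_command(program_args, tool)).returncode


# Build (but do not launch) the command for a race check of main.py.
command = sanitizer_command([sys.executable, "main.py"], tool="racecheck")
print(shlex.join(command))
```

Calling `run_under_sanitizer([sys.executable, "main.py"])` would launch the program with the default `memcheck` tool.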

For debugging on AMD GPUs, you may want to try the LLVM AddressSanitizer for ROCm.

For a detailed visualization of memory accesses in a Triton program, consider the triton-viz tool, which is independent of the underlying GPU.