Note
Go to the end to download the full example code.
Libdevice (tl.extra.libdevice) function¶
Triton can invoke custom functions from external libraries. In this example, we will use the libdevice library to apply asin to a tensor.
Please refer to the CUDA libdevice-users-guide and/or the HIP device-lib source code for the semantics of all the available libdevice functions.
In libdevice.py, we aggregate functions that perform the same computation but on different data types. For example, both __nv_asin and __nv_asinf compute the principal value of the arc sine of the input, but __nv_asin operates on doubles while __nv_asinf operates on floats. Triton automatically selects the correct underlying device function to invoke based on the input and output types.
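The dtype-based overload selection described above can be sketched in plain Python. This is a simplified illustration with a hypothetical lookup table, not Triton's actual implementation; the real mapping lives in libdevice.py and resolves to symbols in the libdevice bitcode library:

```python
# Hypothetical table mapping an element type to the libdevice symbol that
# would be called; the real table in libdevice.py covers many more types
# and functions.
_ASIN_OVERLOADS = {
    "fp32": "__nv_asinf",  # single-precision variant
    "fp64": "__nv_asin",   # double-precision variant
}


def select_asin_symbol(dtype: str) -> str:
    """Pick the device-function symbol matching the given element type."""
    try:
        return _ASIN_OVERLOADS[dtype]
    except KeyError:
        raise TypeError(f"asin has no libdevice overload for {dtype}")


print(select_asin_symbol("fp32"))  # __nv_asinf
print(select_asin_symbol("fp64"))  # __nv_asin
```

From the kernel author's point of view, this dispatch is invisible: the same `libdevice.asin(x)` call works for both float and double tensors.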
asin kernel¶
import torch
import triton
import triton.language as tl
import inspect
import os
from triton.language.extra import libdevice
from pathlib import Path
DEVICE = triton.runtime.driver.active.get_active_torch_device()
@triton.jit
def asin_kernel(
    x_ptr,
    y_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    x = libdevice.asin(x)
    tl.store(y_ptr + offsets, x, mask=mask)
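To make the kernel's block-and-mask logic concrete, here is a CPU-only sketch of the same computation in plain Python. This is an illustration of the access pattern, not how Triton actually executes; each loop iteration plays the role of one program instance:

```python
import math


def asin_blockwise(x, block_size):
    """Emulate the grid of asin_kernel program instances on a Python list."""
    n = len(x)
    y = [0.0] * n
    # Ceiling division, as triton.cdiv computes the grid size.
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):
        block_start = pid * block_size
        # The min(...) bound plays the role of `mask = offsets < n_elements`:
        # out-of-range offsets in the last block are skipped.
        for off in range(block_start, min(block_start + block_size, n)):
            y[off] = math.asin(x[off])
    return y


print(asin_blockwise([0.0, 0.5, 1.0], block_size=2))
```

On the GPU the program instances run in parallel and the masked lanes are inactive, but the element-to-program assignment is the same.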
Using the default libdevice library path¶
We can use the default libdevice library path encoded in triton/language/math.py
torch.manual_seed(0)
size = 98432
x = torch.rand(size, device=DEVICE)
output_triton = torch.zeros(size, device=DEVICE)
output_torch = torch.asin(x)
assert x.is_cuda and output_triton.is_cuda
n_elements = output_torch.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
asin_kernel[grid](x, output_triton, n_elements, BLOCK_SIZE=1024)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')
tensor([0.4105, 0.5430, 0.0249, ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
tensor([0.4105, 0.5430, 0.0249, ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
The maximum difference between torch and triton is 2.384185791015625e-07
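The grid lambda above sizes the launch with triton.cdiv, i.e. ceiling division, so the last partially filled block is still launched and its out-of-range lanes are handled by the mask. The arithmetic can be checked without a GPU (a standalone re-implementation for illustration):

```python
def cdiv(x, y):
    """Ceiling division, matching what triton.cdiv computes."""
    return (x + y - 1) // y


n_elements = 98432
BLOCK_SIZE = 1024
print(cdiv(n_elements, BLOCK_SIZE))  # 97: 96 full blocks plus one partial block of 128 elements
```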
Customizing the libdevice library path¶
We can also customize the libdevice library path by passing the path to the libdevice library to the asin kernel.
def is_cuda():
    return triton.runtime.driver.active.get_current_target().backend == "cuda"


def is_hip():
    return triton.runtime.driver.active.get_current_target().backend == "hip"


current_file = inspect.getfile(inspect.currentframe())
current_dir = Path(os.path.dirname(os.path.abspath(current_file)))

if is_cuda():
    libdir = current_dir.parent.parent / 'third_party/nvidia/backend/lib'
    extern_libs = {'libdevice': str(libdir / 'libdevice.10.bc')}
elif is_hip():
    libdir = current_dir.parent.parent / 'third_party/amd/backend/lib'
    extern_libs = {}
    libs = ["ocml", "ockl"]
    for lib in libs:
        extern_libs[lib] = str(libdir / f'{lib}.bc')
else:
    raise RuntimeError('unknown backend')
output_triton = torch.empty_like(x)
asin_kernel[grid](x, output_triton, n_elements, BLOCK_SIZE=1024, extern_libs=extern_libs)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')
tensor([0.4105, 0.5430, 0.0249, ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
tensor([0.4105, 0.5430, 0.0249, ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
The maximum difference between torch and triton is 2.384185791015625e-07
Total running time of the script: (0 minutes 0.248 seconds)