vps2arch NVIDIA vGPU

vps2arch

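Not much to say here. If the machine has no network after the conversion, fix the systemd-networkd configuration, and install the basics up front: pacman -S vi vim base-devel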
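As a rough illustration of the systemd-networkd fix, the sketch below writes a minimal DHCP unit. The helper name write_dhcp_network, the interface name eth0 and the file name 20-eth0.network are assumptions for the example; a VPS with a static address needs Address= and Gateway= entries instead.

```shell
#!/usr/bin/env bash
# Write a minimal DHCP .network unit for systemd-networkd.
# $1: interface name (hypothetical, e.g. eth0)
# $2: target directory (defaults to /etc/systemd/network)
write_dhcp_network() {
    local iface="$1" dir="${2:-/etc/systemd/network}"
    mkdir -p "$dir"
    cat > "${dir}/20-${iface}.network" <<EOF
[Match]
Name=${iface}

[Network]
DHCP=yes
EOF
}

# Typical use on the freshly converted machine (run as root):
#   write_dhcp_network eth0
#   systemctl enable --now systemd-networkd systemd-resolved
#   pacman -S --needed vi vim base-devel
```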

NVIDIA

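Both nvidia and nvidia-lts ship the latest NVIDIA driver. A new kernel is rarely the problem; more often the driver is too new for the card. The symptoms: nvidia-smi reports that it cannot communicate with the driver, lsmod | grep nvidia prints nothing, and there are no nvidia device nodes under /dev. Only dmesg reveals that the card is not supported.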
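The checks above can be collected into a small script. This is only a sketch: the helper nvidia_module_loaded is a made-up name, and it reads /proc/modules (the file lsmod formats) so that it also works where lsmod is missing.

```shell
#!/usr/bin/env bash
# Return success if a module named exactly "nvidia" appears in a
# /proc/modules-style listing (the first column is the module name).
nvidia_module_loaded() {
    local modules_file="${1:-/proc/modules}"
    [ -r "$modules_file" ] &&
        awk '$1 == "nvidia" { found = 1 } END { exit !found }' "$modules_file"
}

# Manual checks, in the order used above:
#   nvidia-smi
#   lsmod | grep nvidia
#   ls /dev/nvidia*
#   dmesg | grep -i -E 'nvrm|nvidia'
if nvidia_module_loaded; then
    echo "nvidia kernel module is loaded"
else
    echo "nvidia kernel module is NOT loaded; check dmesg for the reason"
fi
```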

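Look up the latest driver for your card model on the NVIDIA website and note the version number; for the Tesla T4, for example, it is 470.82.01. If that version is lower than the one the nvidia package provides, install nvidia-470xx-dkms or nvidia-390xx-dkms from the AUR instead (the AUR has more legacy packages than these, but these two are the ones the Wiki recommends).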
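The "is the repo driver newer than what my card supports" decision is a plain version comparison. A minimal sketch, assuming GNU sort -V is available; version_lt is a hypothetical helper and 495.44 is only a placeholder for whatever pacman -Si nvidia reports on your system.

```shell
#!/usr/bin/env bash
# Succeed if version $1 is strictly older than version $2.
# Relies on GNU sort -V for version ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

card_driver="470.82.01"   # newest driver listed for the card (Tesla T4 in the text)
repo_driver="495.44"      # placeholder for the version of the repo's nvidia package

if version_lt "$card_driver" "$repo_driver"; then
    echo "repo driver is too new; use a legacy package such as nvidia-470xx-dkms"
else
    echo "the repo nvidia package should work"
fi
```

On Arch itself, vercmp from pacman does the same comparison without the sort trick.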

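Both PKGBUILDs download the .run file for their version from https://download.nvidia.com/XFree86/Linux-x86_64/. Running the .run file directly is not The Arch Way (it might break on a rolling upgrade? I never tried), so it is better to keep the NVIDIA driver under the package manager's control. To install an arbitrary driver version, change pkgver in the PKGBUILD and build the package yourself: after pacman -S devtools, run extra-x86_64-build to build from the PKGBUILD in a clean environment, then install with pacman -U *.pkg.tar.zst. If you need to set up your own test environment, systemd-nspawn works.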
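A sketch of that repackaging loop follows. set_pkgver is a hypothetical helper; resetting pkgrel to 1 and refreshing checksums with updpkgsums (from pacman-contrib) are my additions, not steps from the original workflow.

```shell
#!/usr/bin/env bash
# Rewrite the pkgver= line of a PKGBUILD in place and reset pkgrel to 1.
set_pkgver() {
    local pkgbuild="$1" version="$2"
    sed -i -e "s/^pkgver=.*/pkgver=${version}/" \
           -e "s/^pkgrel=.*/pkgrel=1/" "$pkgbuild"
}

# Build-and-install sequence (run inside the package directory):
#   pacman -S devtools pacman-contrib
#   set_pkgver PKGBUILD 470.82.01
#   updpkgsums                  # refresh checksums for the new .run file
#   extra-x86_64-build          # build in a clean chroot
#   pacman -U ./*.pkg.tar.zst
```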

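When packaging the NVIDIA driver, see Listing of Installed Components to learn what each file does; the extracted .run file also contains a .manifest that lists paths and permissions. The AUR offers few versions to use as references, so you can also borrow packages from Manjaro GitLab. diff -qr dir1/ dir2/ shows which files differ between the extracted directories of two driver versions, which makes adapting a package easier.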
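When adapting a PKGBUILD to another driver version, the files that exist in only one tree matter most, because each one means an install line to add or drop. A small sketch; only_in_one_tree is a made-up name and the directory names in the usage comment are illustrative.

```shell
#!/usr/bin/env bash
# Print files that exist in only one of two extracted driver trees.
only_in_one_tree() {
    # diff exits 1 when the trees differ, which is expected here;
    # LC_ALL=C keeps the "Only in" wording stable across locales.
    LC_ALL=C diff -qr "$1" "$2" | grep '^Only in' || true
}

# Example:
#   sh NVIDIA-Linux-x86_64-470.82.01.run --extract-only
#   sh NVIDIA-Linux-x86_64-450.102.04-grid.run --extract-only
#   only_in_one_tree NVIDIA-Linux-x86_64-470.82.01 NVIDIA-Linux-x86_64-450.102.04-grid
```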

vGPU

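From 470xx down to 390xx, dmesg kept reporting the card as unsupported. Then it dawned on me that the machine has a vGPU, not a passed-through card, and needs the GRID driver. Probably for licensing reasons, the AUR has no ready-made package based on the GRID driver. nvidia-merged appears to support vGPU, but installing it complained that this machine is not a vGPU running on KVM, so hand-packaging was the only option. NVIDIA's website offers no public direct link to the GRID driver; fortunately, NVIDIA-Linux-x86_64-${pkgver}-grid.run can be downloaded directly from Google Cloud.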
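The download location follows the pattern used in the PKGBUILD at the end of this post. The sketch below only reproduces that pattern; grid_run_url is a hypothetical helper, and the GRID release directory for any driver version other than 450.102.04 (GRID11.3) has to be looked up separately.

```shell
#!/usr/bin/env bash
# Build the download URL for a GRID guest driver in the Google Cloud bucket.
# $1: GRID release directory (e.g. GRID11.3), $2: driver version.
grid_run_url() {
    local release="$1" pkgver="$2"
    echo "https://storage.googleapis.com/nvidia-drivers-us-public/GRID/${release}/NVIDIA-Linux-x86_64-${pkgver}-grid.run"
}

grid_run_url GRID11.3 450.102.04
# Fetch it with, for example:
#   curl -fLO "$(grid_run_url GRID11.3 450.102.04)"
```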

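After trimming the 470xx PKGBUILD down, I actually got a 470xx GRID package built, and it installed: the module and the device nodes were both there. nvidia-smi no longer failed immediately; instead it hung for a long while and then printed No devices were found. dmesg no longer showed the earlier conspicuous error but NVRM: RmInitAdapter failed!, so something was clearly still wrong, and nvidia-persistenced would not start either.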

Downgrade Kernel

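Internal documentation says the guest driver version is limited by the host, which supports at most 450.102.04. So I hand-built a 450xx package, only to find that the dkms build always failed during installation. Judging by make.log, some definitions in the kernel source had changed. A similar patch exists at https://bbs.archlinux.org/viewtopic.php?id=268421, but fixing one error just exposed more. That was not going to be solved quickly, so I settled for the next best thing: downgrading the kernel.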

According to Table 3 of cuda-toolkit-release-notes, 450.102.04 corresponds to CUDA 11.0.3 Update 1. In cuda-installation-guide-linux v11.0.3, Table 1 (Native Linux Distribution Support in CUDA 11.0) suggests that the highest officially supported kernel is 5.4.0, so I downgraded to linux-lts54 and installed the headers with yay -S linux-lts54-headers.

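After installing the kernel and before rebooting, be sure to run grub-mkconfig, then delete the leftover directories of earlier kernel versions under /usr/lib/modules/; otherwise dkms still tries to build for those versions and fails. For module-not-found errors, go back to the PKGBUILD and see whether the superfluous commands can be removed. Finally the package installed, and after a reboot nvidia-smi showed the output I had been longing for!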
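Finding the leftover module directories can be sketched as below. It rests on an assumption worth checking on your own system: that an installed Arch kernel package keeps its image at /usr/lib/modules/<version>/vmlinuz, so a directory without one is a leftover. stale_module_dirs is a hypothetical helper, and it only lists candidates; it deletes nothing.

```shell
#!/usr/bin/env bash
# List directories under a modules root that contain no kernel image.
# Assumption: an installed Arch kernel keeps its image at <dir>/vmlinuz.
stale_module_dirs() {
    local root="${1:-/usr/lib/modules}" dir
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        [ -e "${dir}vmlinuz" ] || echo "${dir%/}"
    done
}

# Sequence after installing linux-lts54 and its headers:
#   grub-mkconfig -o /boot/grub/grub.cfg
#   stale_module_dirs            # review the list before deleting anything
#   rm -r /usr/lib/modules/<stale-version>
#   dkms status
```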

python-pytorch-cuda can be installed as-is, and CUDA shows up as available without installing an older version, because Table 2 of cuda-toolkit-release-notes shows that up to CUDA 11.5 the Minimum Required Driver Version is still >=450.80.02.

gridd

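But it was not that simple. With the driver installed this way, training models seemed to get no benefit at all, and only then did I remember that vGPU needs a license. Nothing about licensing showed up after installation because I had left nvidia-gridd out of the package entirely. So I added it to the package, put the license server address into /etc/nvidia/gridd.conf, and enabled the service, which then failed with: Error requesting D-Bus name (Connection ":1.14" is not allowed to own the service "nvidia.grid.server" due to security policies in the configuration file)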
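Filling in the license server can be scripted from the template that the package installs to /etc/nvidia/gridd.conf.template. A minimal sketch: write_gridd_conf is a made-up helper, license.example.internal is a placeholder, and the function handles a ServerAddress= line that is empty, commented out, or missing, since I am not asserting what the shipped template looks like.

```shell
#!/usr/bin/env bash
# Create gridd.conf from the shipped template and set ServerAddress.
# $1: license server address, $2: template path, $3: output path.
write_gridd_conf() {
    local server="$1"
    local template="${2:-/etc/nvidia/gridd.conf.template}"
    local conf="${3:-/etc/nvidia/gridd.conf}"
    cp "$template" "$conf"
    if grep -q -E '^#? *ServerAddress=' "$conf"; then
        # Overwrite the existing (possibly commented-out) entry.
        sed -i -E "s|^#? *ServerAddress=.*|ServerAddress=${server}|" "$conf"
    else
        # No entry in the template: append one.
        echo "ServerAddress=${server}" >> "$conf"
    fi
}

# Usage (license.example.internal is a placeholder):
#   write_gridd_conf license.example.internal
#   systemctl enable --now nvidia-gridd
#   journalctl -u nvidia-gridd -e
```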

Success was within reach. The error is obscure, but it clearly points at the D-Bus configuration. Create nvidia.grid.server.conf under /usr/share/dbus-1/system.d with the following content:

<!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-Bus Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <policy context="default">
    <allow own="nvidia.grid.server"/>
  </policy>
</busconfig>

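Sure enough, the error went away. After restarting the service, the log showed that the license had been acquired, and nvidia-smi -q | grep -i license confirmed the Licensed state.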
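For scripting the final check, the status value can be pulled out of the nvidia-smi -q output. This sketch assumes the usual "key : value" layout with a field whose name contains "License Status"; license_status is a hypothetical helper, and the exact field name may differ between driver releases.

```shell
#!/usr/bin/env bash
# Read `nvidia-smi -q` output on stdin and print the value of the first
# field whose name contains "License Status" (layout is an assumption).
license_status() {
    awk 'tolower($0) ~ /license status/ { sub(/^[^:]*: */, ""); print; exit }'
}

# Usage:
#   systemctl restart nvidia-gridd
#   nvidia-smi -q | license_status
```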

Summary

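I did not learn how to train models, but I got another workout in tinkering. At first I just followed the Arch Wiki blindly. The turning point was spotting a leftover driver under /data, which made me realize the machine needs the vGPU-specific GRID driver and forced me to learn how to modify and build packages. The package built and installed, but the errors led me to discover that the machine supports at most 450xx. So I built an older package, and this time the dkms build kept failing, which gave me a deeper understanding of how driver and kernel versions relate. Reading NVIDIA's documentation and tables afterwards, I now roughly understand how the kernel, driver, and CUDA versions fit together.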

Finally, here is my homemade PKGBUILD for nvidia-450xx-utils:

# Maintainer:  Jonathon Fernyhough <jonathon+m2x+dev>
# Contributor: Sven-Hendrik Haase <svenstaro@gmail.com>
# Contributor: Thomas Baechler <thomas@archlinux.org>
# Contributor: James Rayner <iphitus@gmail.com>

pkgbase=nvidia-450xx-utils
pkgname=('nvidia-450xx-utils' 'opencl-nvidia-450xx' 'nvidia-450xx-dkms')
pkgver=450.102.04
pkgrel=2
arch=('x86_64')
url="http://www.nvidia.com/"
license=('custom')
options=('!strip')
_pkg="NVIDIA-Linux-x86_64-${pkgver}-grid"
source=('nvidia-drm-outputclass.conf'
        'nvidia-450xx-utils.sysusers'
        'nvidia-450xx.rules'
        "https://storage.googleapis.com/nvidia-drivers-us-public/GRID/GRID11.3/${_pkg}.run")
sha512sums=('de7116c09f282a27920a1382df84aa86f559e537664bb30689605177ce37dc5067748acf9afd66a3269a6e323461356592fdfc624c86523bf105ff8fe47d3770'
            '4b3ad73f5076ba90fe0b3a2e712ac9cde76f469cd8070280f960c3ce7dc502d1927f525ae18d008075c8f08ea432f7be0a6c3a7a6b49c361126dcf42f97ec499'
            'a0ceb0a6c240cf97b21a2e46c5c212250d3ee24fecef16aca3dffb04b8350c445b9f4398274abccdb745dd0ba5132a17942c9508ce165d4f97f41ece02b0b989'
            '523070e9e458f2da50df0f6dd35445ed824cf3b4ce2c3e191d58718a4ed638cfc644852b8330fb3da0444811431da7bf88f195e9aed1fa8615f92b8d1e941892')


create_links() {
    # create soname links
    find "$pkgdir" -type f -name '*.so*' ! -path '*xorg/*' -print0 | while read -d $'\0' _lib; do
        _soname=$(dirname "${_lib}")/$(readelf -d "${_lib}" | grep -Po 'SONAME.*: \[\K[^]]*' || true)
        _base=$(echo ${_soname} | sed -r 's/(.*)\.so.*/\1.so/')
        [[ -e "${_soname}" ]] || ln -s $(basename "${_lib}") "${_soname}"
        [[ -e "${_base}" ]] || ln -s $(basename "${_soname}") "${_base}"
    done
}

prepare() {
    sh "${_pkg}.run" --extract-only
    cd "${_pkg}"
    bsdtar -xf nvidia-persistenced-init.tar.bz2

    cd kernel
    sed -i "s/__VERSION_STRING/${pkgver}/" dkms.conf
    sed -i 's/__JOBS/`nproc`/' dkms.conf
    sed -i 's/__DKMS_MODULES//' dkms.conf
    sed -i '$iBUILT_MODULE_NAME[0]="nvidia"\
DEST_MODULE_LOCATION[0]="/kernel/drivers/video"\
BUILT_MODULE_NAME[1]="nvidia-uvm"\
DEST_MODULE_LOCATION[1]="/kernel/drivers/video"\
BUILT_MODULE_NAME[2]="nvidia-modeset"\
DEST_MODULE_LOCATION[2]="/kernel/drivers/video"\
BUILT_MODULE_NAME[3]="nvidia-drm"\
DEST_MODULE_LOCATION[3]="/kernel/drivers/video"' dkms.conf

    # Gift for linux-rt guys
    sed -i 's/NV_EXCLUDE_BUILD_MODULES/IGNORE_PREEMPT_RT_PRESENCE=1 NV_EXCLUDE_BUILD_MODULES/' dkms.conf
}

package_opencl-nvidia-450xx() {
    pkgdesc="OpenCL implementation for NVIDIA"
    depends=('zlib')
    optdepends=('opencl-headers: headers necessary for OpenCL development')
    provides=('opencl-driver' 'opencl-nvidia')
    conflicts=('opencl-nvidia')
    cd "${_pkg}"

    # OpenCL
    install -Dm644 nvidia.icd "${pkgdir}/etc/OpenCL/vendors/nvidia.icd"
    install -D "libnvidia-compiler.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-compiler.so.${pkgver}"
    install -D "libnvidia-opencl.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-opencl.so.${pkgver}"

    create_links

    mkdir -p "${pkgdir}/usr/share/licenses"
    ln -s nvidia-utils "${pkgdir}/usr/share/licenses/opencl-nvidia"
}

package_nvidia-450xx-dkms() {
    pkgdesc="NVIDIA drivers - module sources"
    depends=('dkms' "nvidia-450xx-utils=$pkgver" 'libglvnd')
    provides=('NVIDIA-MODULE')

    cd "${_pkg}"

    install -dm 755 "${pkgdir}"/usr/src
    cp -dr --no-preserve='ownership' kernel "${pkgdir}/usr/src/nvidia-${pkgver}"

    install -Dt "${pkgdir}/usr/share/licenses/${pkgname}" -m644 "${srcdir}/${_pkg}/LICENSE"
}

package_nvidia-450xx-utils() {
    pkgdesc="NVIDIA drivers utilities"
    depends=('xorg-server')
    optdepends=('xorg-server-devel: nvidia-xconfig'
                'opencl-nvidia-450xx: OpenCL support')
    conflicts=('nvidia-libgl' 'nvidia-utils')
    provides=('vulkan-driver' 'opengl-driver' 'nvidia-libgl' 'nvidia-utils')
    install="${pkgname}.install"

    cd "${_pkg}"

    # Check http://us.download.nvidia.com/XFree86/Linux-x86_64/${pkgver}/README/installedcomponents.html
    # for hints on what needs to be installed where.

    # X driver
    install -D nvidia_drv.so "${pkgdir}/usr/lib/xorg/modules/drivers/nvidia_drv.so"

    # GLX extension module for X
    install -D "libglxserver_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/nvidia/xorg/libglxserver_nvidia.so.${pkgver}"
    # Ensure that X finds glx
    ln -s "libglxserver_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/nvidia/xorg/libglxserver_nvidia.so.1"
    ln -s "libglxserver_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/nvidia/xorg/libglxserver_nvidia.so"

    install -D "libGLX_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libGLX_nvidia.so.${pkgver}"

    # OpenGL libraries
    install -D     "libEGL_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libEGL_nvidia.so.${pkgver}"
    install -D     "libGLESv1_CM_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libGLESv1_CM_nvidia.so.${pkgver}"
    install -D     "libGLESv2_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libGLESv2_nvidia.so.${pkgver}"
    install -Dm644 "10_nvidia.json" "${pkgdir}/usr/share/glvnd/egl_vendor.d/10_nvidia.json"

    # OpenGL core library
    install -D "libnvidia-glcore.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-glcore.so.${pkgver}"
    install -D "libnvidia-eglcore.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-eglcore.so.${pkgver}"
    install -D "libnvidia-glsi.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-glsi.so.${pkgver}"

    # misc
    install -D "libnvidia-ifr.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ifr.so.${pkgver}"
    install -D "libnvidia-fbc.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-fbc.so.${pkgver}"
    install -D "libnvidia-encode.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-encode.so.${pkgver}"
    install -D "libnvidia-cfg.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-cfg.so.${pkgver}"
    install -D "libnvidia-ml.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ml.so.${pkgver}"
    install -D "libnvidia-glvkspirv.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-glvkspirv.so.${pkgver}"
    install -D "libnvidia-allocator.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-allocator.so.${pkgver}"

    # Vulkan ICD
    install -Dm644 "nvidia_icd.json" "${pkgdir}/usr/share/vulkan/icd.d/nvidia_icd.json"
    install -Dm644 "nvidia_layers.json" "${pkgdir}/usr/share/vulkan/implicit_layer.d/nvidia_layers.json"

    # VDPAU
    install -D "libvdpau_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/vdpau/libvdpau_nvidia.so.${pkgver}"

    # nvidia-tls library
    install -D "libnvidia-tls.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-tls.so.${pkgver}"

    # CUDA
    install -D "libcuda.so.${pkgver}" "${pkgdir}/usr/lib/libcuda.so.${pkgver}"
    install -D "libnvcuvid.so.${pkgver}" "${pkgdir}/usr/lib/libnvcuvid.so.${pkgver}"

    # PTX JIT Compiler (Parallel Thread Execution (PTX) is a pseudo-assembly language for CUDA)
    install -D "libnvidia-ptxjitcompiler.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ptxjitcompiler.so.${pkgver}"

    # raytracing
    install -D "libnvoptix.so.${pkgver}" "${pkgdir}/usr/lib/libnvoptix.so.${pkgver}"
    install -D "libnvidia-rtcore.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-rtcore.so.${pkgver}"
    install -D "libnvidia-cbl.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-cbl.so.${pkgver}"

    # NGX
    install -D "libnvidia-ngx.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ngx.so.${pkgver}"

    # Optical flow
    install -D "libnvidia-opticalflow.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-opticalflow.so.${pkgver}"

    # Only for GRID, maybe useless
    install -D "libFlxCore64.so.2018.02" "${pkgdir}/usr/lib/libFlxCore64.so.2018.02"
    install -D "libFlxComm64.so.2018.02" "${pkgdir}/usr/lib/libFlxComm64.so.2018.02"

    # DEBUG
    install -D nvidia-debugdump "${pkgdir}/usr/bin/nvidia-debugdump"

    # nvidia-xconfig
    install -D     nvidia-xconfig "${pkgdir}/usr/bin/nvidia-xconfig"
    install -Dm644 nvidia-xconfig.1.gz "${pkgdir}/usr/share/man/man1/nvidia-xconfig.1.gz"

    # nvidia-settings
    install -D -m755 nvidia-settings "${pkgdir}/usr/bin/nvidia-settings"
    install -D -m644 nvidia-settings.1.gz "${pkgdir}/usr/share/man/man1/nvidia-settings.1.gz"
    install -D -m644 nvidia-settings.desktop "${pkgdir}/usr/share/applications/nvidia-settings.desktop"
    install -D -m644 nvidia-settings.png "${pkgdir}/usr/share/pixmaps/nvidia-settings.png"
    install -D -m755 "libnvidia-gtk2.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-gtk2.so.${pkgver}"
    install -D -m755 "libnvidia-gtk3.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-gtk3.so.${pkgver}"
    sed -e 's:__UTILS_PATH__:/usr/bin:' -e 's:__PIXMAP_PATH__:/usr/share/pixmaps:' -i "${pkgdir}/usr/share/applications/nvidia-settings.desktop"

    # nvidia-bug-report
    install -D nvidia-bug-report.sh "${pkgdir}/usr/bin/nvidia-bug-report.sh"

    # nvidia-smi
    install -D     nvidia-smi "${pkgdir}/usr/bin/nvidia-smi"
    install -Dm644 nvidia-smi.1.gz "${pkgdir}/usr/share/man/man1/nvidia-smi.1.gz"

    # nvidia-cuda-mps
    install -D     nvidia-cuda-mps-server "${pkgdir}/usr/bin/nvidia-cuda-mps-server"
    install -D     nvidia-cuda-mps-control "${pkgdir}/usr/bin/nvidia-cuda-mps-control"
    install -Dm644 nvidia-cuda-mps-control.1.gz "${pkgdir}/usr/share/man/man1/nvidia-cuda-mps-control.1.gz"

    # nvidia-modprobe
    # This should be removed if nvidia fixed their uvm module!
    install -Dm4755 nvidia-modprobe "${pkgdir}/usr/bin/nvidia-modprobe"
    install -Dm644  nvidia-modprobe.1.gz "${pkgdir}/usr/share/man/man1/nvidia-modprobe.1.gz"

    # nvidia-persistenced
    install -D     nvidia-persistenced "${pkgdir}/usr/bin/nvidia-persistenced"
    install -Dm644 nvidia-persistenced.1.gz "${pkgdir}/usr/share/man/man1/nvidia-persistenced.1.gz"
    install -Dm644 nvidia-persistenced-init/systemd/nvidia-persistenced.service.template "${pkgdir}/usr/lib/systemd/system/nvidia-persistenced.service"
    sed -i 's/__USER__/nvidia-persistenced/' "${pkgdir}/usr/lib/systemd/system/nvidia-persistenced.service"

    # nvidia-gridd
    install -Dm4755 nvidia-gridd "${pkgdir}/usr/bin/nvidia-gridd"
    install -Dm644  nvidia-gridd.1.gz "${pkgdir}/usr/share/man/man1/nvidia-gridd.1.gz"
    install -Dm644 gridd.conf.template "${pkgdir}/etc/nvidia/gridd.conf.template"
    install -Dm644 init-scripts/systemd/nvidia-gridd.service "${pkgdir}/usr/lib/systemd/system/nvidia-gridd.service"

    # application profiles
    install -Dm644 nvidia-application-profiles-${pkgver}-rc "${pkgdir}/usr/share/nvidia/nvidia-application-profiles-${pkgver}-rc"
    install -Dm644 nvidia-application-profiles-${pkgver}-key-documentation "${pkgdir}/usr/share/nvidia/nvidia-application-profiles-${pkgver}-key-documentation"

    install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/nvidia-utils/LICENSE"
    install -Dm644 README.txt "${pkgdir}/usr/share/doc/nvidia/README"
    install -Dm644 NVIDIA_Changelog "${pkgdir}/usr/share/doc/nvidia/NVIDIA_Changelog"
    cp -r html "${pkgdir}/usr/share/doc/nvidia/"
    ln -s nvidia "${pkgdir}/usr/share/doc/nvidia-utils"

    install -Dm644 "${srcdir}/nvidia-450xx-utils.sysusers" "${pkgdir}/usr/lib/sysusers.d/$pkgname.conf"

    install -Dm644 "${srcdir}/nvidia-450xx.rules" "$pkgdir"/usr/lib/udev/rules.d/60-nvidia-450xx.rules

    # distro specific files must be installed in /usr/share/X11/xorg.conf.d
    install -m755 -d "$pkgdir/usr/share/X11/xorg.conf.d"
    install -Dm644 "${srcdir}/nvidia-drm-outputclass.conf" "${pkgdir}/usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf"

    echo "blacklist nouveau" | install -Dm644 /dev/stdin "${pkgdir}/usr/lib/modprobe.d/${pkgname}.conf"
    echo "nvidia-uvm" | install -Dm644 /dev/stdin "${pkgdir}/usr/lib/modules-load.d/${pkgname}.conf"

    create_links
}
