vps2arch NVIDIA vGPU
vps2arch
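Not much to say here. If the machine has no network after the conversion, fix the systemd-networkd configuration. It also helps to run pacman -S vi vim base-devel ahead of time.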
NVIDIA
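Both nvidia and nvidia-lts package the latest NVIDIA driver. A new kernel is usually not the problem; more often the driver is too new for the card. In my case nvidia-smi reported that it could not communicate with the driver, lsmod | grep nvidia printed nothing, there were no nvidia nodes under /dev, and only dmesg revealed that the GPU was not supported.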
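The checks above can be wrapped into a small script. This is only a sketch: the helper names are made up for illustration, and the paths are parameters so the functions can be pointed at something other than the live system.

```shell
# Sketch of the three checks above. Helper names are hypothetical.

# Succeeds if the nvidia kernel module appears in a /proc/modules-style list
# (the same data lsmod reads).
nvidia_module_loaded() {
    grep -q '^nvidia ' "${1:-/proc/modules}"
}

# Succeeds if at least one nvidia* entry exists in the device directory.
nvidia_nodes_present() {
    for node in "${1:-/dev}"/nvidia*; do
        [ -e "$node" ] && return 0
    done
    return 1
}

# Prints a one-line verdict for each check.
nvidia_report() {
    if nvidia_module_loaded "$1"; then
        echo "module: loaded"
    else
        echo "module: missing"
    fi
    if nvidia_nodes_present "$2"; then
        echo "nodes: present"
    else
        echo "nodes: missing"
    fi
}
```

Called with no arguments, nvidia_report looks at /proc/modules and /dev. If both come back missing, dmesg | grep NVRM is the next place to look, since the driver prefixes its kernel messages with NVRM; depending on kernel settings dmesg may need root.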
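Look up the latest driver for your GPU model on the NVIDIA website and note the version number; for the Tesla T4, for example, it is 470.82.01. If the version offered for that model is lower than the one shipped in nvidia, install nvidia-470xx-dkms or nvidia-390xx-dkms from the AUR (the AUR has more than these, but these two are the ones the Wiki recommends).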
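The PKGBUILDs of both packages download the .run file for the matching version from https://download.nvidia.com/XFree86/Linux-x86_64/. Running the .run file directly is not The Arch Way (it might break on a later system upgrade, though I have not tried it), so it is better to keep the NVIDIA driver under the package manager's control. You can change pkgver in the PKGBUILD and build the package yourself to install any driver version: after pacman -S devtools, run extra-x86_64-build, which builds the package from the PKGBUILD in a clean environment, then install it with pacman -U *.pkg.tar.zst. If you need to set up a test environment of your own, systemd-nspawn works.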
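When packaging the NVIDIA driver, see Listing of Installed Components for what each file does; the extracted .run file also contains a .manifest that lists paths and permissions. The AUR has few versions to use as a reference, so you can also borrow packages from Manjaro GitLab. In addition, diff -qr dir1/ dir2/ shows which files differ between two extracted driver trees, which makes adapting a package easier.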
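For adapting a PKGBUILD, the most useful part of the diff -qr output is usually the list of files that exist in only one tree, because each of those is an install line to add or drop. A small sketch with hypothetical helper names:

```shell
# Sketch: summarise how two extracted driver trees differ.
# compare_trees and only_in_one are hypothetical helpers around diff -qr.
compare_trees() {
    # LC_ALL=C keeps diff's messages in English so they can be filtered.
    # diff exits non-zero when the trees differ, which is expected here.
    LC_ALL=C diff -qr "$1" "$2" || true
}

# Only the files that exist in one tree but not in the other.
only_in_one() {
    compare_trees "$1" "$2" | grep '^Only in ' || true
}
```

The two arguments are the directories produced by running each .run file with --extract-only, for example a regular driver tree and a GRID one.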
vGPU
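From 470xx down to 390xx, dmesg kept reporting that the GPU was unsupported. Then it dawned on me that this machine has a vGPU, not a passed-through card, so it needs the GRID driver. Probably because of licensing, the AUR has no ready-made package based on the GRID driver. nvidia-merged appears to support vGPU, but on installation it complained that this machine is not a vGPU running on KVM, so packaging by hand was the only option. NVIDIA's website offers no public direct link to the GRID driver, but fortunately Google Cloud hosts NVIDIA-Linux-x86_64-${pkgver}-grid.run for direct download.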
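The download URL follows a fixed pattern. The sketch below only assembles the string; the bucket layout is copied from the source line of the PKGBUILD at the end of this post, and the only release/version pair I can vouch for is GRID11.3 with 450.102.04. Other pairs are guesses until you check them. grid_driver_url is a hypothetical helper name.

```shell
# Sketch: build the download URL for a GRID guest driver.
# The GRID release directory must match the driver version; only
# GRID11.3 / 450.102.04 is confirmed in this post.
grid_driver_url() {
    grid_release=$1   # e.g. GRID11.3
    pkgver=$2         # e.g. 450.102.04
    printf 'https://storage.googleapis.com/nvidia-drivers-us-public/GRID/%s/NVIDIA-Linux-x86_64-%s-grid.run\n' \
        "$grid_release" "$pkgver"
}
```

For example, curl -fLO "$(grid_driver_url GRID11.3 450.102.04)" fetches the file used by the PKGBUILD.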
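Starting from the 470xx PKGBUILD and trimming it down, I actually managed to build a 470xx GRID package, and it installed: both the kernel module and the device nodes appeared. nvidia-smi no longer failed immediately; instead it hung for a long while and then printed No devices were found. dmesg no longer showed the earlier conspicuous error but NVRM: RmInitAdapter failed! instead, so something was clearly still wrong, and nvidia-persistenced could not start either.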
Downgrade Kernel
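According to internal documentation, the guest driver version is constrained by the host, which supports at most 450.102.04. So I hand-built a 450xx package, but the dkms build kept failing at install time. Judging by make.log, some definitions in the kernel source had changed. There is a patch for a similar problem at https://bbs.archlinux.org/viewtopic.php?id=268421, but fixing one error only surfaced more. That was not going to be solved quickly, so I settled for the fallback: downgrading the kernel.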
According to Table 3 of cuda-toolkit-release-notes, 450.102.04 corresponds to CUDA 11.0.3 Update 1. From Table 1. Native Linux Distribution Support in CUDA 11.0 in cuda-installation-guide-linux v11.0.3, I inferred that the newest officially supported kernel is 5.4.0, so I downgraded to linux-lts54 and also ran yay -S linux-lts54-headers.
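After installing the kernel and before rebooting, be sure to run grub-mkconfig (typically grub-mkconfig -o /boot/grub/grub.cfg), then delete the leftover directories of previous kernel versions under /usr/lib/modules/; otherwise dkms still tries to build for those versions and fails. For module-not-found errors, go back to the PKGBUILD and see whether the superfluous commands can be removed. The install finally succeeded, and after a reboot nvidia-smi showed the long-awaited table!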
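Before deleting anything under /usr/lib/modules/ it helps to see what is there. The sketch below only lists the directories that do not belong to the kernel you want to keep and deletes nothing; list_other_kernels is a hypothetical helper name.

```shell
# Sketch: list kernel module directories other than the one to keep.
# It only prints candidates; nothing is removed.
list_other_kernels() {
    keep=$1                        # release string of the kernel to keep
    root=${2:-/usr/lib/modules}    # where the per-kernel module trees live
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        [ "$name" = "$keep" ] || printf '%s\n' "$name"
    done
}
```

Pass the release string of the newly installed LTS kernel rather than $(uname -r) if you have not rebooted yet, since uname -r still reports the old kernel at that point. Directories that belong to another kernel package you still have installed should of course stay; dkms status shows which kernels dkms thinks it has built for.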
Installing python-pytorch-cuda directly also works: CUDA shows up as available without resorting to an older version, because Table 2 of cuda-toolkit-release-notes shows that up to CUDA 11.5 the Minimum Required Driver Version is still >=450.80.02.
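The comparison is easy to get wrong by eye, because 450.102.04 is newer than 450.80.02 even though 102 sorts before 80 as plain text. A minimal sketch of the check, assuming a sort that supports -V (GNU coreutils does); version_ge is a name I made up:

```shell
# Sketch: check a driver version against a minimum requirement.
# Assumes sort -V (GNU coreutils). version_ge is a hypothetical helper.
version_ge() {
    # True when $1 >= $2 in version order: the smaller of the two,
    # as ranked by sort -V, must be $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

driver=450.102.04    # the driver packaged in this post
minimum=450.80.02    # minimum quoted above for CUDA up to 11.5

if version_ge "$driver" "$minimum"; then
    echo "driver $driver satisfies >= $minimum"
else
    echo "driver $driver is older than $minimum"
fi
```

On a live system the driver value can come from nvidia-smi --query-gpu=driver_version --format=csv,noheader instead of being hard-coded.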
gridd
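But it was not that simple. With the driver installed this way, model training seemed to get no benefit at all, and only then did I remember that vGPU needs a license. Nothing about licensing had shown up after installation because I had not put nvidia-gridd into the package at all. So I added it, filled in the license server address in /etc/nvidia/gridd.conf, and enabled the service, which failed with Error requesting D-Bus name (Connection ":1.14" is not allowed to own the service "nvidia.grid.server" due to security policies in the configuration file)
Success was within reach. The error is obscure, but it clearly points at the D-Bus configuration. Create nvidia.grid.server.conf under /usr/share/dbus-1/system.d with the following content: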
<!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-Bus Bus Configuration 1.0//EN"
"http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
<policy context="default">
<allow own="nvidia.grid.server"/>
</policy>
</busconfig>
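Sure enough, the error went away. After restarting the service the log showed the license had been acquired, and nvidia-smi -q | grep -i license confirmed that the status had changed to Licensed.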
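The gridd.conf step mentioned earlier can be scripted as well. This is a sketch under assumptions: ServerAddress is the key name as I understand the shipped template, so check your own gridd.conf.template before relying on it; set_license_server is a hypothetical helper, and GNU sed is assumed for -i.

```shell
# Sketch: create gridd.conf from the template and set the license server.
# ASSUMPTION: the key is called ServerAddress; verify it in your template.
set_license_server() {
    address=$1
    conf=${2:-/etc/nvidia/gridd.conf}
    template=${3:-/etc/nvidia/gridd.conf.template}

    # Start from the template if no config exists yet.
    [ -f "$conf" ] || cp "$template" "$conf"

    if grep -q '^#\{0,1\}[[:space:]]*ServerAddress=' "$conf"; then
        # Replace the existing (possibly commented-out) line.
        sed -i "s|^#\{0,1\}[[:space:]]*ServerAddress=.*|ServerAddress=${address}|" "$conf"
    else
        # Key not present at all: append it.
        printf 'ServerAddress=%s\n' "$address" >> "$conf"
    fi
}
```

Run it as root with the address of your license server, restart nvidia-gridd, and confirm the result with nvidia-smi -q | grep -i license.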
Summary
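I did not learn any model training, but I did get another workout in tinkering. At first I just followed the Arch Wiki blindly. The turning point was spotting leftover driver files under /data, which made me realize I needed the vGPU-specific GRID driver and forced me to learn how to modify and build packages. The package built and installed, but the errors showed that the machine supports at most 450xx, so I built the older version. That time the dkms install kept failing, which gave me a better understanding of how driver and kernel versions relate. Reading NVIDIA's documentation and tables afterwards, I now roughly understand how kernel, driver, and CUDA versions are tied together.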
Finally, here is my homemade PKGBUILD for nvidia-450xx-utils:
# Maintainer: Jonathon Fernyhough <jonathon+m2x+dev>
# Contributor: Sven-Hendrik Haase <svenstaro@gmail.com>
# Contributor: Thomas Baechler <thomas@archlinux.org>
# Contributor: James Rayner <iphitus@gmail.com>
pkgbase=nvidia-450xx-utils
pkgname=('nvidia-450xx-utils' 'opencl-nvidia-450xx' 'nvidia-450xx-dkms')
pkgver=450.102.04
pkgrel=2
arch=('x86_64')
url="http://www.nvidia.com/"
license=('custom')
options=('!strip')
_pkg="NVIDIA-Linux-x86_64-${pkgver}-grid"
source=('nvidia-drm-outputclass.conf'
'nvidia-450xx-utils.sysusers'
'nvidia-450xx.rules'
"https://storage.googleapis.com/nvidia-drivers-us-public/GRID/GRID11.3/${_pkg}.run")
sha512sums=('de7116c09f282a27920a1382df84aa86f559e537664bb30689605177ce37dc5067748acf9afd66a3269a6e323461356592fdfc624c86523bf105ff8fe47d3770'
'4b3ad73f5076ba90fe0b3a2e712ac9cde76f469cd8070280f960c3ce7dc502d1927f525ae18d008075c8f08ea432f7be0a6c3a7a6b49c361126dcf42f97ec499'
'a0ceb0a6c240cf97b21a2e46c5c212250d3ee24fecef16aca3dffb04b8350c445b9f4398274abccdb745dd0ba5132a17942c9508ce165d4f97f41ece02b0b989'
'523070e9e458f2da50df0f6dd35445ed824cf3b4ce2c3e191d58718a4ed638cfc644852b8330fb3da0444811431da7bf88f195e9aed1fa8615f92b8d1e941892')
create_links() {
# create soname links
find "$pkgdir" -type f -name '*.so*' ! -path '*xorg/*' -print0 | while read -d $'\0' _lib; do
_soname=$(dirname "${_lib}")/$(readelf -d "${_lib}" | grep -Po 'SONAME.*: \[\K[^]]*' || true)
_base=$(echo ${_soname} | sed -r 's/(.*)\.so.*/\1.so/')
[[ -e "${_soname}" ]] || ln -s $(basename "${_lib}") "${_soname}"
[[ -e "${_base}" ]] || ln -s $(basename "${_soname}") "${_base}"
done
}
prepare() {
sh "${_pkg}.run" --extract-only
cd "${_pkg}"
bsdtar -xf nvidia-persistenced-init.tar.bz2
cd kernel
sed -i "s/__VERSION_STRING/${pkgver}/" dkms.conf
sed -i 's/__JOBS/`nproc`/' dkms.conf
sed -i 's/__DKMS_MODULES//' dkms.conf
sed -i '$iBUILT_MODULE_NAME[0]="nvidia"\
DEST_MODULE_LOCATION[0]="/kernel/drivers/video"\
BUILT_MODULE_NAME[1]="nvidia-uvm"\
DEST_MODULE_LOCATION[1]="/kernel/drivers/video"\
BUILT_MODULE_NAME[2]="nvidia-modeset"\
DEST_MODULE_LOCATION[2]="/kernel/drivers/video"\
BUILT_MODULE_NAME[3]="nvidia-drm"\
DEST_MODULE_LOCATION[3]="/kernel/drivers/video"' dkms.conf
# Gift for linux-rt guys
sed -i 's/NV_EXCLUDE_BUILD_MODULES/IGNORE_PREEMPT_RT_PRESENCE=1 NV_EXCLUDE_BUILD_MODULES/' dkms.conf
}
package_opencl-nvidia-450xx() {
pkgdesc="OpenCL implementation for NVIDIA"
depends=('zlib')
optdepends=('opencl-headers: headers necessary for OpenCL development')
provides=('opencl-driver' 'opencl-nvidia')
conflicts=('opencl-nvidia')
cd "${_pkg}"
# OpenCL
install -Dm644 nvidia.icd "${pkgdir}/etc/OpenCL/vendors/nvidia.icd"
install -D "libnvidia-compiler.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-compiler.so.${pkgver}"
install -D "libnvidia-opencl.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-opencl.so.${pkgver}"
create_links
mkdir -p "${pkgdir}/usr/share/licenses"
ln -s nvidia-utils "${pkgdir}/usr/share/licenses/opencl-nvidia"
}
package_nvidia-450xx-dkms() {
pkgdesc="NVIDIA drivers - module sources"
depends=('dkms' "nvidia-450xx-utils=$pkgver" 'libglvnd')
provides=('NVIDIA-MODULE')
cd "${_pkg}"
install -dm 755 "${pkgdir}"/usr/src
cp -dr --no-preserve='ownership' kernel "${pkgdir}/usr/src/nvidia-${pkgver}"
install -Dt "${pkgdir}/usr/share/licenses/${pkgname}" -m644 "${srcdir}/${_pkg}/LICENSE"
}
package_nvidia-450xx-utils() {
pkgdesc="NVIDIA drivers utilities"
depends=('xorg-server')
optdepends=('xorg-server-devel: nvidia-xconfig'
'opencl-nvidia-450xx: OpenCL support')
conflicts=('nvidia-libgl' 'nvidia-utils')
provides=('vulkan-driver' 'opengl-driver' 'nvidia-libgl' 'nvidia-utils')
install="${pkgname}.install"
cd "${_pkg}"
# Check http://us.download.nvidia.com/XFree86/Linux-x86_64/${pkgver}/README/installedcomponents.html
# for hints on what needs to be installed where.
# X driver
install -D nvidia_drv.so "${pkgdir}/usr/lib/xorg/modules/drivers/nvidia_drv.so"
# GLX extension module for X
install -D "libglxserver_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/nvidia/xorg/libglxserver_nvidia.so.${pkgver}"
# Ensure that X finds glx
ln -s "libglxserver_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/nvidia/xorg/libglxserver_nvidia.so.1"
ln -s "libglxserver_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/nvidia/xorg/libglxserver_nvidia.so"
install -D "libGLX_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libGLX_nvidia.so.${pkgver}"
# OpenGL libraries
install -D "libEGL_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libEGL_nvidia.so.${pkgver}"
install -D "libGLESv1_CM_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libGLESv1_CM_nvidia.so.${pkgver}"
install -D "libGLESv2_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/libGLESv2_nvidia.so.${pkgver}"
install -Dm644 "10_nvidia.json" "${pkgdir}/usr/share/glvnd/egl_vendor.d/10_nvidia.json"
# OpenGL core library
install -D "libnvidia-glcore.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-glcore.so.${pkgver}"
install -D "libnvidia-eglcore.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-eglcore.so.${pkgver}"
install -D "libnvidia-glsi.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-glsi.so.${pkgver}"
# misc
install -D "libnvidia-ifr.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ifr.so.${pkgver}"
install -D "libnvidia-fbc.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-fbc.so.${pkgver}"
install -D "libnvidia-encode.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-encode.so.${pkgver}"
install -D "libnvidia-cfg.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-cfg.so.${pkgver}"
install -D "libnvidia-ml.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ml.so.${pkgver}"
install -D "libnvidia-glvkspirv.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-glvkspirv.so.${pkgver}"
install -D "libnvidia-allocator.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-allocator.so.${pkgver}"
# Vulkan ICD
install -Dm644 "nvidia_icd.json" "${pkgdir}/usr/share/vulkan/icd.d/nvidia_icd.json"
install -Dm644 "nvidia_layers.json" "${pkgdir}/usr/share/vulkan/implicit_layer.d/nvidia_layers.json"
# VDPAU
install -D "libvdpau_nvidia.so.${pkgver}" "${pkgdir}/usr/lib/vdpau/libvdpau_nvidia.so.${pkgver}"
# nvidia-tls library
install -D "libnvidia-tls.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-tls.so.${pkgver}"
# CUDA
install -D "libcuda.so.${pkgver}" "${pkgdir}/usr/lib/libcuda.so.${pkgver}"
install -D "libnvcuvid.so.${pkgver}" "${pkgdir}/usr/lib/libnvcuvid.so.${pkgver}"
# PTX JIT Compiler (Parallel Thread Execution (PTX) is a pseudo-assembly language for CUDA)
install -D "libnvidia-ptxjitcompiler.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ptxjitcompiler.so.${pkgver}"
# raytracing
install -D "libnvoptix.so.${pkgver}" "${pkgdir}/usr/lib/libnvoptix.so.${pkgver}"
install -D "libnvidia-rtcore.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-rtcore.so.${pkgver}"
install -D "libnvidia-cbl.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-cbl.so.${pkgver}"
# NGX
install -D "libnvidia-ngx.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-ngx.so.${pkgver}"
# Optical flow
install -D "libnvidia-opticalflow.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-opticalflow.so.${pkgver}"
# Only for GRID, maybe useless
install -D "libFlxCore64.so.2018.02" "${pkgdir}/usr/lib/libFlxCore64.so.2018.02"
install -D "libFlxComm64.so.2018.02" "${pkgdir}/usr/lib/libFlxComm64.so.2018.02"
# DEBUG
install -D nvidia-debugdump "${pkgdir}/usr/bin/nvidia-debugdump"
# nvidia-xconfig
install -D nvidia-xconfig "${pkgdir}/usr/bin/nvidia-xconfig"
install -Dm644 nvidia-xconfig.1.gz "${pkgdir}/usr/share/man/man1/nvidia-xconfig.1.gz"
# nvidia-settings
install -D -m755 nvidia-settings "${pkgdir}/usr/bin/nvidia-settings"
install -D -m644 nvidia-settings.1.gz "${pkgdir}/usr/share/man/man1/nvidia-settings.1.gz"
install -D -m644 nvidia-settings.desktop "${pkgdir}/usr/share/applications/nvidia-settings.desktop"
install -D -m644 nvidia-settings.png "${pkgdir}/usr/share/pixmaps/nvidia-settings.png"
install -D -m755 "libnvidia-gtk2.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-gtk2.so.${pkgver}"
install -D -m755 "libnvidia-gtk3.so.${pkgver}" "${pkgdir}/usr/lib/libnvidia-gtk3.so.${pkgver}"
sed -e 's:__UTILS_PATH__:/usr/bin:' -e 's:__PIXMAP_PATH__:/usr/share/pixmaps:' -i "${pkgdir}/usr/share/applications/nvidia-settings.desktop"
# nvidia-bug-report
install -D nvidia-bug-report.sh "${pkgdir}/usr/bin/nvidia-bug-report.sh"
# nvidia-smi
install -D nvidia-smi "${pkgdir}/usr/bin/nvidia-smi"
install -Dm644 nvidia-smi.1.gz "${pkgdir}/usr/share/man/man1/nvidia-smi.1.gz"
# nvidia-cuda-mps
install -D nvidia-cuda-mps-server "${pkgdir}/usr/bin/nvidia-cuda-mps-server"
install -D nvidia-cuda-mps-control "${pkgdir}/usr/bin/nvidia-cuda-mps-control"
install -Dm644 nvidia-cuda-mps-control.1.gz "${pkgdir}/usr/share/man/man1/nvidia-cuda-mps-control.1.gz"
# nvidia-modprobe
# This should be removed if nvidia fixed their uvm module!
install -Dm4755 nvidia-modprobe "${pkgdir}/usr/bin/nvidia-modprobe"
install -Dm644 nvidia-modprobe.1.gz "${pkgdir}/usr/share/man/man1/nvidia-modprobe.1.gz"
# nvidia-persistenced
install -D nvidia-persistenced "${pkgdir}/usr/bin/nvidia-persistenced"
install -Dm644 nvidia-persistenced.1.gz "${pkgdir}/usr/share/man/man1/nvidia-persistenced.1.gz"
install -Dm644 nvidia-persistenced-init/systemd/nvidia-persistenced.service.template "${pkgdir}/usr/lib/systemd/system/nvidia-persistenced.service"
sed -i 's/__USER__/nvidia-persistenced/' "${pkgdir}/usr/lib/systemd/system/nvidia-persistenced.service"
# nvidia-gridd
install -Dm4755 nvidia-gridd "${pkgdir}/usr/bin/nvidia-gridd"
install -Dm644 nvidia-gridd.1.gz "${pkgdir}/usr/share/man/man1/nvidia-gridd.1.gz"
install -Dm644 gridd.conf.template "${pkgdir}/etc/nvidia/gridd.conf.template"
install -Dm644 init-scripts/systemd/nvidia-gridd.service "${pkgdir}/usr/lib/systemd/system/nvidia-gridd.service"
# application profiles
install -Dm644 nvidia-application-profiles-${pkgver}-rc "${pkgdir}/usr/share/nvidia/nvidia-application-profiles-${pkgver}-rc"
install -Dm644 nvidia-application-profiles-${pkgver}-key-documentation "${pkgdir}/usr/share/nvidia/nvidia-application-profiles-${pkgver}-key-documentation"
install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/nvidia-utils/LICENSE"
install -Dm644 README.txt "${pkgdir}/usr/share/doc/nvidia/README"
install -Dm644 NVIDIA_Changelog "${pkgdir}/usr/share/doc/nvidia/NVIDIA_Changelog"
cp -r html "${pkgdir}/usr/share/doc/nvidia/"
ln -s nvidia "${pkgdir}/usr/share/doc/nvidia-utils"
install -Dm644 "${srcdir}/nvidia-450xx-utils.sysusers" "${pkgdir}/usr/lib/sysusers.d/$pkgname.conf"
install -Dm644 "${srcdir}/nvidia-450xx.rules" "$pkgdir"/usr/lib/udev/rules.d/60-nvidia-450xx.rules
# distro specific files must be installed in /usr/share/X11/xorg.conf.d
install -m755 -d "$pkgdir/usr/share/X11/xorg.conf.d"
install -Dm644 "${srcdir}/nvidia-drm-outputclass.conf" "${pkgdir}/usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf"
echo "blacklist nouveau" | install -Dm644 /dev/stdin "${pkgdir}/usr/lib/modprobe.d/${pkgname}.conf"
echo "nvidia-uvm" | install -Dm644 /dev/stdin "${pkgdir}/usr/lib/modules-load.d/${pkgname}.conf"
create_links
}