Jan 3, 2026 – By Origo

QEMU/KVM and GPU passthrough – VM reboot woes

We are currently implementing GPU passthrough support in Origo OS – another long overdue feature. The passthrough in itself is simple enough – it’s the scheduling, accounting, billing etc. that’s complicated. One thing has been a persistent problem though – properly releasing GPUs when rebooting or shutting down a VM, and we still do not have a real fix. We have however isolated the problem, and felt it appropriate to share, since others must be having the same problem.

Launching an Ubuntu VM with GPU passthrough enabled in Origo OS is simple. This of course requires that the hardware we run Origo OS on has at least one physical GPU. Installing the Nvidia drivers + binaries (apt install nvidia-driver-580) is also simple. Running “nvidia-smi” verifies that the passed-through GPU is correctly identified.
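
Under the hood this is just a libvirt PCI hostdev (Origo OS wires it up automatically). Purely as a sketch – the VM name “ubuntu-gpu” and the PCI address 0000:01:00.0 are placeholders, use whatever lspci reports on your host – attaching a GPU to an existing VM by hand looks roughly like this:

    # On the host. VM name and PCI address are placeholders.
    cat > gpu-hostdev.xml <<'EOF'
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    EOF
    virsh attach-device ubuntu-gpu gpu-hostdev.xml --config

    # Inside the VM, once it is up:
    sudo apt install nvidia-driver-580
    nvidia-smi    # should list the passed-through GPU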

The Problem

The problem rears its head when we reboot the VM. After a reboot, running “nvidia-smi” simply returns “No devices found”. Performing all the suggested incantations scattered around Reddit has done nothing to alleviate the problem. Shutting the VM down completely and starting it again – same problem – no devices found. The weird thing is that running “lspci” in the VM still correctly lists the GPU. “virsh nodedev-detach pci_xxxx…” and “virsh nodedev-reset pci_xxxx_…” run without issues on the host, but do nothing. In other words: after booting a VM with GPU passthrough a second time, the GPU is not usable. Removing and re-inserting kernel modules (“rmmod vfio_pci” etc.) on the host to try to release the GPU does not help either. Neither the Nvidia nor the nouveau drivers are loaded on the host. In short – really annoying. Having to reboot the host with potentially dozens of VMs running to regain access to the GPU is not really workable.
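
For the record, the host-side attempts looked roughly like this (sketch only – the pci_… node name is a placeholder, and none of it brought the GPU back for the guest):

    # On the host, after the VM has been shut down:
    virsh nodedev-detach pci_0000_01_00_0   # completes without error, no effect
    virsh nodedev-reset  pci_0000_01_00_0   # same
    lsmod | grep -E 'nvidia|nouveau'        # nothing – no GPU driver loaded on the host
    rmmod vfio_pci && modprobe vfio_pci     # re-cycling the vfio driver does not help either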

The “Fix”

After spending a couple of evenings trying to debug this, it finally dawned on me: the problem is probably not on the host side. Once the Nvidia drivers grab hold of the GPU, they apparently hold on, even after the VM is terminated. So I tried unloading the Nvidia drivers (“rmmod nvidia_drm”, etc.) inside the VM before shutting it down. And behold, upon reboot, the GPU was accessible. So, at least a workaround: if I remember to unload the Nvidia drivers before shutting down the VM, I do not have to reboot the entire host to regain access to the GPU.
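
Concretely, the workaround amounts to something like the following inside the VM before every shutdown or reboot (sketch only – anything still using the GPU, such as a display manager, CUDA jobs or nvidia-persistenced, has to be stopped first, or the rmmod will fail):

    # Inside the VM, just before shutting it down:
    sudo systemctl stop nvidia-persistenced 2>/dev/null || true
    sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia   # unload order matters: nvidia last
    sudo shutdown -h now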

One other thing I tried was adding “nvidia_drm.modeset=0” to “GRUB_CMDLINE_LINUX_DEFAULT” in “/etc/default/grub” in the VM and running “update-grub”. This also worked – after that, the GPU is also properly released upon VM shutdown or reboot. This would suggest that the root problem lies with the “nvidia_drm” driver loaded by the VM. If anyone knows more about this issue, please chime in. It would be great to be able to properly release a GPU from the host side, no matter what shenanigans a VM is up to.
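
The grub variant, for completeness (sketch only – this assumes the stock Ubuntu “quiet splash” default inside the VM; adjust the line to whatever is already there):

    # Inside the VM: add nvidia_drm.modeset=0 to the kernel command line.
    sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia_drm.modeset=0"/' /etc/default/grub
    sudo update-grub
    sudo reboot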
