Azure - Kernel crashes when removing gpu from pci

Bug #2042568 reported by Ioanna Alifieraki
This bug affects 1 person
Affects               Status   Importance  Assigned to        Milestone
linux-azure (Ubuntu)  New      Medium      Unassigned
  Jammy               New      Medium      Ioanna Alifieraki
  Lunar               Invalid  Medium      Ioanna Alifieraki

Bug Description

[Description]

On an Azure VM with a Tesla GPU, removing the GPU from the PCI bus crashes the VM. If the NVIDIA drivers are loaded, the machine does not crash immediately; instead, the removal hangs and the machine crashes on reboot.

This is related to bug [1].
The bug reported in [1] concerns a different driver, but the root cause is the same.
It is still under investigation whether this is a bug in the PCI subsystem or in how various drivers use it.

For this case we have identified that reverting commit [2] prevents the kernel crashes.

Azure has requested that this commit be reverted, at least for the time being.
The commit is not upstream, so it only needs to be reverted from the Ubuntu kernels.

[Test Case]

On an Azure VM with a GPU:

# echo '1' > /sys/bus/pci/devices/0001:00:00.0/remove

where '0001:00:00.0' is the PCI address of the GPU.
The VM will crash.
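
To find the GPU's PCI address before running the test, something like the following sketch may help. The function names and the SYSFS_PCI_ROOT override are hypothetical, introduced here only for illustration; the sketch assumes the GPU reports the NVIDIA PCI vendor ID (0x10de) in sysfs. It only prints addresses and paths and does not remove anything by itself.

```shell
#!/bin/sh
# Sketch only: helper names and SYSFS_PCI_ROOT are hypothetical, not part
# of any Ubuntu or NVIDIA tooling.

# Print the PCI addresses of all devices whose sysfs vendor ID is NVIDIA's.
list_nvidia_devices() {
    root="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            basename "$dev"
        fi
    done
}

# Print the sysfs "remove" path for a PCI address such as 0001:00:00.0.
# Rejects anything that is not in domain:bus:device.function form.
pci_remove_path() {
    addr="$1"
    root="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}"
    case "$addr" in
        [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-7]) ;;
        *) echo "invalid PCI address: $addr" >&2; return 1 ;;
    esac
    printf '%s/%s/remove\n' "$root" "$addr"
}
```

With an address from list_nvidia_devices, running `echo 1 > "$(pci_remove_path 0001:00:00.0)"` as root is equivalent to the command above and is expected to crash an affected VM.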

[Where things could go wrong]

The commit to be reverted was included in a patchset addressing LP bugs https://bugs.launchpad.net/bugs/2023071 and https://bugs.launchpad.net/bugs/2023594

However, this commit only reduces boot time, so reverting it should not introduce any functional regressions.
The expected side effect is an increase in boot time.

[Other]

Only Ubuntu Azure kernels are affected:

- Jammy 5.15

Focal is also affected, since it uses the 5.15 kernel.
This commit does not appear in the Mantic 6.5 kernel.
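
As a rough way to tell whether a machine runs a kernel from the series listed above, a check along these lines could be used. The function name is hypothetical, and the sketch assumes Azure kernel release strings of the form `5.15.0-1043-azure`; it only matches the series and cannot tell whether a particular build still carries commit [2].

```shell
#!/bin/sh
# Sketch only: is_azure_5_15 is a hypothetical helper name.
# Succeeds if the given kernel release string is a 5.15 Azure kernel.
is_azure_5_15() {
    case "$1" in
        5.15.*-azure*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_azure_5_15 "$(uname -r)" && echo "5.15 Azure kernel"` prints the message only on a kernel from that series.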

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215515
[2] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/commit/?h=Ubuntu-azure-5.15.0-1043.50&id=75af0c10b3703400890d314d1d91d25294234a81

Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Changed in linux (Ubuntu Jammy):
assignee: nobody → Ioanna Alifieraki (joalif)
Changed in linux (Ubuntu Lunar):
assignee: nobody → Ioanna Alifieraki (joalif)
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Changed in linux (Ubuntu Lunar):
importance: Undecided → Medium
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu):
status: Confirmed → New
description: updated
description: updated
Ioanna Alifieraki (joalif) wrote :

Upon further testing, the Lunar 6.2 kernel does not seem to be affected. I'll investigate further to find out why.

Changed in linux-azure (Ubuntu Lunar):
status: New → Invalid
description: updated