VM boots slowly with large-BAR GPU Passthrough due to pci/probe.c redundancy

Bug #2097389 reported by Mitchell Augustin
This bug affects 1 person
Affects          Status         Importance  Assigned to        Milestone
linux (Ubuntu)   Invalid        Undecided   Unassigned
  Noble          Fix Committed  Medium      Mitchell Augustin
  Oracular       Fix Committed  Medium      Mitchell Augustin

Bug Description

SRU Justification:

[ Impact ]

Without this patch, VM guests that have large-BAR GPUs passed through to them take roughly twice as long to initialize all device BARs.

[ Test Plan ]

I verified that this patch applies cleanly to the Noble kernel
and resolves the bug on DGX H100 and DGX A100. I observed no regressions.
This can be verified on any machine with a sufficiently large BAR and the
capability to pass through to a VM using vfio.

To verify no regressions, I applied this patch to the guest kernel, then
rebooted and confirmed that:
1. The measured PCI initialization time on boot was ~50% of the unmodified kernel
2. Relevant parts of /proc/iomem mappings, the PCI init section of dmesg output, and lspci -vv output remained unchanged between the system with the unmodified kernel and with the patched kernel
3. The Nvidia driver still successfully loaded and was shown via nvidia-smi after the patch was applied

[ Fix ]

Roughly half of the time-consuming device configuration operations invoked during
the PCI probe function can be eliminated by rearranging the memory and I/O disable/enable
calls so that they occur once per device rather than once per BAR. This is what the upstream
patch does, and it reliably eliminates roughly half of the excess initialization time
during VM boot.

[ Where problems could occur ]

I do not expect any regressions. The only callers of ABIs changed by this patch are also adjusted within this patch, and the functional change only removes entirely redundant calls to disable/enable PCI memory/IO.

[ Additional Context ]

Upstream patch: https://<email address hidden>/
Upstream bug report: https://lore.kernel<email address hidden>/

Changed in linux (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
status: New → In Progress
Changed in linux (Ubuntu Noble):
status: New → In Progress
Changed in linux (Ubuntu):
status: In Progress → Invalid
Changed in linux (Ubuntu Noble):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in linux (Ubuntu):
assignee: Mitchell Augustin (mitchellaugustin) → nobody
Changed in linux (Ubuntu Oracular):
status: New → In Progress
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Mitchell Augustin (mitchellaugustin) wrote :

Upstream patch submitted to kernel-team list with subject [SRU][N/O][PATCH 0/1] PCI: Batch BAR sizing operations

Changed in linux (Ubuntu Noble):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Oracular):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Noble):
importance: Undecided → Medium
Changed in linux (Ubuntu Oracular):
importance: Undecided → Medium
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.8.0-56.58 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux' to 'verification-done-noble-linux'. If the problem still exists, change the tag 'verification-needed-noble-linux' to 'verification-failed-noble-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-v2 verification-needed-noble-linux
Mitchell Augustin (mitchellaugustin) wrote :

-proposed kernel verified to fix bug on DGX H100

tags: added: verification-done-noble-linux
removed: verification-needed-noble-linux
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.11.0-21.21 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-oracular-linux' to 'verification-done-oracular-linux'. If the problem still exists, change the tag 'verification-needed-oracular-linux' to 'verification-failed-oracular-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-oracular-linux-v2 verification-needed-oracular-linux
Mitchell Augustin (mitchellaugustin) wrote :

oracular-proposed kernel verified to fix bug on DGX H100

tags: added: verification-done-oracular-linux
removed: verification-needed-oracular-linux