thunderbolt combined with ethernet does not work

Bug #1878020 reported by gnomed
This bug affects 6 people
Affects: linux-signed-hwe (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Originally submitted to askubuntu.com with no responses (even after a bounty).

Original question here: https://askubuntu.com/questions/1235285/usb-ethernet-thunderbolt-errors

Original question text:

I am currently having major issues with a Thunderbolt Ethernet adapter.

The device is a "Razer Core X" Thunderbolt eGPU enclosure with a built-in USB3 hub and gigabit Ethernet port. It is essentially a very powerful dock with its own GPU (using the 4 lanes of PCIe on the Thunderbolt port).

The same issue occurs whether I use the built-in Ethernet port or plug my own USB3 Ethernet adapter into the USB3 hub on the dock. The issue does not occur if I bypass the Thunderbolt dock and plug an Ethernet adapter directly into the laptop (although this rather defeats the purpose of the dock).

The same issue also occurs on two different laptops that both use this dock. My work laptop is on 18.04 and my new personal laptop is on 20.04, and both versions of Ubuntu show exactly the same behavior.

When the ethernet fails it is no longer able to resolve DNS nor can I even ping anything on my local network, but the device still appears "connected" in the NetworkManager UI. When I toggle the connection off/on then it shows "Connecting" and stays stuck there permanently.

When I check dmesg/syslog I see the same set of errors every single time. Below is an example. In the logs you can see that I reload the driver via modprobe, which makes the connection work again briefly. Then the same error and symptoms occur moments later:

[SAME ERROR REPEATED ~20 times]
[ 543.814402] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000425e6e3f0 trb-start 0000000425e6e2e0 trb-end 0000000425e6e2e0 seg-start 0000000425e6e000 seg-end 0000000425e6eff0
[ 543.815185] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 543.815186] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000425e6e400 trb-start 0000000425e6e2e0 trb-end 0000000425e6e2e0 seg-start 0000000425e6e000 seg-end 0000000425e6eff0
[ 543.815271] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 543.815272] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000425e6e410 trb-start 0000000425e6e2e0 trb-end 0000000425e6e2e0 seg-start 0000000425e6e000 seg-end 0000000425e6eff0
[ 543.815356] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 543.815357] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000425e6e420 trb-start 0000000425e6e2e0 trb-end 0000000425e6e2e0 seg-start 0000000425e6e000 seg-end 0000000425e6eff0
[ 543.815822] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 543.815823] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000425e6e430 trb-start 0000000425e6e2e0 trb-end 0000000425e6e2e0 seg-start 0000000425e6e000 seg-end 0000000425e6eff0
[ 609.127969] usbcore: deregistering interface driver ax88179_178a
[ 609.128443] ax88179_178a 10-1:1.0 enx90203a1c2b65: unregister 'ax88179_178a' usb-0000:0c:00.0-1, ASIX AX88179 USB 3.0 Gigabit Ethernet
[ 609.643604] ax88179_178a 10-1:1.0 eth0: register 'ax88179_178a' at usb-0000:0c:00.0-1, ASIX AX88179 USB 3.0 Gigabit Ethernet, 90:20:3a:1c:2b:65
[ 609.645089] usbcore: registered new interface driver ax88179_178a
[ 609.646672] ax88179_178a 10-1:1.0 enx90203a1c2b65: renamed from eth0
[ 612.705215] ax88179_178a 10-1:1.0 enx90203a1c2b65: ax88179 - Link status is: 1
[ 612.735735] IPv6: ADDRCONF(NETDEV_CHANGE): enx90203a1c2b65: link becomes ready
[ 652.991795] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 652.991799] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000384cb4280 trb-start 0000000384cb4260 trb-end 0000000384cb4260 seg-start 0000000384cb4000 seg-end 0000000384cb4ff0
[ 652.991859] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 652.991862] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000384cb4290 trb-start 0000000384cb4260 trb-end 0000000384cb4260 seg-start 0000000384cb4000 seg-end 0000000384cb4ff0
[ 652.991944] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 652.991945] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000384cb42a0 trb-start 0000000384cb4260 trb-end 0000000384cb4260 seg-start 0000000384cb4000 seg-end 0000000384cb4ff0
[ 652.992030] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 652.992032] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000384cb42b0 trb-start 0000000384cb4260 trb-end 0000000384cb4260 seg-start 0000000384cb4000 seg-end 0000000384cb4ff0
[ 652.992121] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
[ 652.992122] xhci_hcd 0000:0c:00.0: Looking for event-dma 0000000384cb42c0 trb-start 0000000384cb4260 trb-end 0000000384cb4260 seg-start 0000000384cb4000 seg-end 0000000384cb4ff0
[SAME ERROR REPEATED ~20 times]

That "transfer event" error is in the logs every time I check them after noticing the symptoms, and its timestamp in syslog appears to correspond precisely to the moment the network stops working. As you can see from this run, there is no other log output between the link coming up after the driver reload and the first error.
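For anyone comparing their own logs against the excerpt above, a minimal Python sketch along these lines could count the xhci_hcd errors and measure how long the link stayed up before the first one. The regular expressions are assumptions based only on the log lines quoted in this report, and the function name `summarize` is hypothetical.

```python
import re

# Assumption: patterns are derived only from the dmesg lines quoted in this report.
ERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+xhci_hcd (?P<pci>[0-9a-f:.]+): "
    r"ERROR Transfer event TRB DMA ptr not part of current TD "
    r"ep_index (?P<ep>\d+) comp_code (?P<code>\d+)"
)
LINK_READY_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\].*link becomes ready")


def summarize(lines):
    """Return (error_count, intervals).

    intervals holds, for each "link becomes ready" line, the number of
    seconds until the first TRB error that follows it.
    """
    errors = 0
    link_up = None
    intervals = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            errors += 1
            if link_up is not None:
                intervals.append(float(match.group("ts")) - link_up)
                link_up = None  # only time the first error after each link-up
            continue
        match = LINK_READY_RE.search(line)
        if match:
            link_up = float(match.group("ts"))
    return errors, intervals


# Three lines copied from the log excerpt in this report.
SAMPLE = [
    "[ 612.735735] IPv6: ADDRCONF(NETDEV_CHANGE): enx90203a1c2b65: link becomes ready",
    "[ 652.991795] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13",
    "[ 652.991859] xhci_hcd 0000:0c:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13",
]

error_count, uptimes = summarize(SAMPLE)
print(error_count, [round(seconds, 1) for seconds in uptimes])
```

In the excerpt above, the gap between the "link becomes ready" line at 612.735735 and the first error at 652.991795 is roughly 40 seconds. Running the same function over a full `dmesg` capture would show whether that interval shrinks under heavier network load.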

When I google that error message, I find multiple articles about Thunderbolt bandwidth management issues, but every post I can find is years old and simply says the problem is now fixed in the kernel. It does not seem to be fixed for me (5.3.0-51-generic). The error occurs much more quickly when I am doing something that needs more network bandwidth (like streaming a movie), which does seem to point to Thunderbolt bandwidth issues, but I cannot find a fix for something that was supposedly resolved years ago.

Linked here is my output from the networking diagnostic script: https://paste.ubuntu.com/p/FmG8hGzkXV/

I've tried disabling autosuspend (using powertop) for both the Ethernet device and the USB bus that the device is attached to, but it does not help. Neither laptop-mode-tools nor tlp appears to be installed on my laptops.
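To double-check that the autosuspend change actually took effect, one option is to read the kernel's runtime power management setting for each USB device from sysfs. The sketch below assumes the standard `power/control` attribute under `/sys/bus/usb/devices`, where `on` keeps the device powered and `auto` allows autosuspend. The function names are hypothetical.

```python
from pathlib import Path


def usb_power_control(sysfs_root="/sys/bus/usb/devices"):
    """Map each USB device directory name to its power/control value."""
    settings = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return settings  # not Linux, or sysfs is not mounted here
    for dev in sorted(root.iterdir()):
        control = dev / "power" / "control"
        if control.is_file():
            settings[dev.name] = control.read_text().strip()
    return settings


def autosuspend_enabled(settings):
    """Return the devices that are still allowed to autosuspend."""
    return [name for name, value in settings.items() if value == "auto"]


current = usb_power_control()
for name, value in current.items():
    print(f"{name}: {value}")
print("autosuspend still enabled on:", autosuspend_enabled(current))
```

In the logs above the Ethernet adapter appears as USB device `10-1`, so that entry and its parent bus (which would presumably be listed as `usb10`) are the ones to look for in the output.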

Can someone please help me figure out what is going on? I feel like a Thunderbolt device like this should be able to work on Ubuntu. If anything, I expected issues with the GPU part of the eGPU, but that is working fine: I am currently typing on dual 4K monitors driven by the dock. The only issue is with the Ethernet (keyboard, mouse, and USB sound card all work fine when plugged into the dock).

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-5.3.0-51-generic 5.3.0-51.44~18.04.2
ProcVersionSignature: Ubuntu 5.3.0-51.44~18.04.2-generic 5.3.18
Uname: Linux 5.3.0-51-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.9-0ubuntu7.14
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Mon May 11 09:35:37 2020
InstallationDate: Installed on 2019-02-26 (439 days ago)
InstallationMedia: Ubuntu 18.04.2 LTS "Bionic Beaver" - Release amd64 (20190210)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
gnomed (the-gnomed) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe (Ubuntu):
status: New → Confirmed
Nick Klein (sledgebullet) wrote :

I am able to reproduce this on demand on a fully patched Ubuntu 20.04 with the same enclosure family, a Razer CoreX Chroma. Both the built-in ASIX AX88179-based GigE adapter and my Realtek RTL8153-based USB-to-Ethernet adapter exhibit this same problem when connected through the CoreX's USB hub.

The same Realtek adapter plugged straight into my laptop has no problems after extended use. Likewise, other high-traffic USB devices (and PCIe) work fine through this enclosure. From what I've seen, it is strictly an issue with USB-to-GigE. This appears to be the USB controller in question:

3a:00.0 USB controller: Intel Corporation JHL6240 Thunderbolt 3 USB 3.1 Controller (Low Power) [Alpine Ridge LP 2016] (rev 01) (prog-if 30 [XHCI])
        Subsystem: Lenovo JHL6240 Thunderbolt 3 USB 3.1 Controller (Low Power) [Alpine Ridge LP 2016]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 128 bytes
        Interrupt: pin A routed to IRQ 159
        Region 0: Memory at c5f00000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [88] MSI: Enable+ Count=1/8 Maskable- 64bit+
                Address: 00000000fee00738 Data: 0000
        Capabilities: [c0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <8us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Rang...


Nick Klein (sledgebullet) wrote :

Confirmed still present with kernel 5.8rc4.

Nick Klein (sledgebullet) wrote :

See 1878020

Nick Klein (sledgebullet) wrote :

Razer tells me they are working to get the new firmware out to their customers.

Nick Klein (sledgebullet) wrote :

Razer now tells me they have yet to find a solution.

Viktor (lamalas) wrote :

This also happens with a Lenovo Thunderbolt Dock Gen 2.

Kevin Hester (kevinh) wrote :

I can confirm @lamalas's report: the "Lenovo Thunderbolt Dock Gen 2" shows these same problems with a high-speed USB disk (Samsung SSD T7). There is no need to involve the GigE port to see the problem.
