Transaction ID collisions cause slow DNS lookups in getaddrinfo

Bug #1961697 reported by KJ Tsanaktsidis
This bug affects 2 people
Affects         Status        Importance  Assigned to  Milestone
GLibC           Fix Released  Medium
glibc (Ubuntu)  Confirmed     Medium      Unassigned
Focal           New           Medium      Unassigned

Bug Description

[impact]
When resolving DNS names with getaddrinfo(), I have seen it hang for 5 seconds, then retry and succeed. The issue is that glibc issues both an A and an AAAA query on the same socket, and in some circumstances they can be sent with the same DNS transaction ID as well.

[test case]
TBD

[regression potential]
TBD.

[original description]
I verified this with a packet capture: the A and AAAA queries for a name were sent with the same DNS transaction ID and both received responses, then nothing happened for five seconds, and then the same DNS queries were sent again. On the glibc side, I confirmed by interrupting it with gdb that it was blocked waiting for the DNS response, even though the packet capture shows the response had well and truly arrived. I've attached a packet capture & a backtrace of the glibc hang.
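
For anyone checking their own capture for the same pattern, here is a minimal sketch that scans raw DNS query payloads for an A and an AAAA query sharing a transaction ID. It assumes the UDP payloads have already been extracted from the capture by some other tool; the helper names (parse_query, colliding_queries, build_query) are made up for illustration, and build_query exists only to produce sample input.

```python
import struct
from collections import defaultdict


def parse_query(payload: bytes):
    """Return (txid, qname, qtype) for a DNS query carrying one question."""
    txid, _flags, _qdcount = struct.unpack("!HHH", payload[:6])
    offset = 12  # the DNS header is always 12 bytes
    labels = []
    # Queries do not use name compression, so labels are simply
    # length-prefixed and terminated by a zero byte.
    while payload[offset] != 0:
        length = payload[offset]
        labels.append(payload[offset + 1 : offset + 1 + length].decode("ascii"))
        offset += 1 + length
    qtype, _qclass = struct.unpack("!HH", payload[offset + 1 : offset + 5])
    return txid, ".".join(labels), qtype


def colliding_queries(payloads):
    """Map (txid, qname) to the set of query types that reused that ID."""
    seen = defaultdict(set)
    for payload in payloads:
        txid, qname, qtype = parse_query(payload)
        seen[(txid, qname)].add(qtype)
    # Retransmits of one query collapse into a single set entry, so only
    # genuinely different query types sharing an ID are reported.
    return {key: types for key, types in seen.items() if len(types) > 1}


def build_query(txid, name, qtype):
    """Build a sample query payload (illustration only)."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)
```

Query type 1 is A and 28 is AAAA, so a capture showing this bug would yield an entry whose set is {1, 28}.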

I believe this is the same issue reported in these places:
    * In RHEL: https://bugzilla.redhat.com/show_bug.cgi?id=1904153
    * Also RHEL: https://bugzilla.redhat.com/show_bug.cgi?id=1903880
    * Upstream: https://sourceware.org/bugzilla/show_bug.cgi?id=26600

The environment I noticed this bug in was:
    * Docker for Mac on an arm64 M1 MacBook
    * Docker for Mac Linux kernel version is 5.10.76-linuxkit
    * Linux is also arm64, not emulated
    * Container with the buggy DNS environment is Ubuntu bionic (also arm64, not emulated)
    * Glibc 2.27-3ubuntu1.4

However, one of the Red Hat reporters noticed this issue on m6-series EC2 instances in AWS.

A patch has been provided upstream for this issue: https://sourceware.org/pipermail/libc-alpha/2020-September/117547.html

I applied the upstream patch to glibc 2.27-3ubuntu1.4 and rebuilt the package, and the problem went away. I've attached the exact patch I applied, since I had to work through some conflicts.

So, I think that patch just needs to be backported to Bionic and (I think) Focal as well. Is that reasonable?

Thanks!

Tags: patch fr-2102
In , Florian Weimer (fweimer) wrote :

If the A and AAAA queries have equal transaction IDs, the initial AAAA response is not recognized as valid, resulting in timeouts and retransmits.

In , Florian Weimer (fweimer) wrote :

This bug is distinct from bug 19691 in the sense that it is possible to fix it without reworking the buffer management.

In , Florian Weimer (fweimer) wrote :
In , Cvs-commit (cvs-commit) wrote :

The master branch has been updated by Florian Weimer <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f1f00c072138af90ae6da180f260111f09afe7a3

commit f1f00c072138af90ae6da180f260111f09afe7a3
Author: Florian Weimer <email address hidden>
Date: Wed Oct 14 10:54:39 2020 +0200

    resolv: Handle transaction ID collisions in parallel queries (bug 26600)

    If the transaction IDs are equal, the old check attributed both
    responses to the first query, not recognizing the second response.
    This fixes bug 26600.
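
    A toy model of the failure described in this commit message, as a
    hedged sketch only. This is not glibc's code: the Query class, the
    two matcher functions, and deliver() are hypothetical names, and the
    assumption that the question type is what disambiguates the two
    responses is a simplification for illustration.

```python
from dataclasses import dataclass

A, AAAA = 1, 28  # DNS query type numbers


@dataclass
class Query:
    txid: int
    qtype: int
    answered: bool = False


def match_by_id_only(queries, txid, qtype):
    """Simplified old-style check: the first query with this ID wins."""
    for query in queries:
        if query.txid == txid:
            return query
    return None


def match_by_id_and_question(queries, txid, qtype):
    """Simplified fixed check: the ID and the question type must both match."""
    for query in queries:
        if query.txid == txid and query.qtype == qtype:
            return query
    return None


def deliver(queries, responses, matcher):
    """Feed (txid, qtype) responses to the matcher; return unanswered types."""
    for txid, qtype in responses:
        query = matcher(queries, txid, qtype)
        if query is not None and not query.answered:
            query.answered = True
        # Otherwise the response is dropped, and the caller would keep
        # waiting until the resolver timeout expires.
    return [query.qtype for query in queries if not query.answered]
```

    With both queries sharing one ID, match_by_id_only hands the AAAA
    response to the already-answered A query, so the AAAA query is left
    waiting. With distinct IDs, either matcher answers both queries.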

In , Florian Weimer (fweimer) wrote :

Fixed for glibc 2.33.

In , Cvs-commit (cvs-commit) wrote :

The release/2.32/master branch has been updated by Florian Weimer <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2dfa659a66f20facc4082207884c20e986ddecee

commit 2dfa659a66f20facc4082207884c20e986ddecee
Author: Florian Weimer <email address hidden>
Date: Wed Oct 14 10:54:39 2020 +0200

    resolv: Handle transaction ID collisions in parallel queries (bug 26600)

    If the transaction IDs are equal, the old check attributed both
    responses to the first query, not recognizing the second response.
    This fixes bug 26600.

    (cherry picked from commit f1f00c072138af90ae6da180f260111f09afe7a3)

KJ Tsanaktsidis (kjtsanaktsidis) wrote :
KJ Tsanaktsidis (kjtsanaktsidis) wrote :
KJ Tsanaktsidis (kjtsanaktsidis) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glibc (Ubuntu):
status: New → Confirmed
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "upstream patch with conflicts resolved for 2.27" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
tags: added: rls-ff-incoming
Changed in glibc (Ubuntu):
importance: Undecided → Medium
Changed in glibc:
importance: Unknown → Medium
status: Unknown → Fix Released
KJ Tsanaktsidis (kjtsanaktsidis) wrote :

Just wondering if there's a plan or desire to SRU this into bionic/focal (correct me if I'm using the wrong terminology here)?

The criteria for an SRU, from what I read, seem to be "critical bugs" (which this maybe is not), "regressions" (which this isn't), and "hardware enablement" (which I think does apply here: this bug seems to be triggered much more often on new hardware like M1 MacBooks and m6 EC2 instances).

Let me know if I can help with this at all by providing a merge proposal (although I don't think there's much more to it than applying the attached patch & writing up the changelog?)

tags: added: fr-2102
Changed in glibc (Ubuntu Focal):
importance: Undecided → Medium
tags: removed: rls-ff-incoming
Michael Hudson-Doyle (mwhudson) wrote :

So for SRU we ideally want a nice, self-contained, ubuntu-based test case. Is that possible here? It reads to me as if it's a bit non-deterministic, is that true?

@kjtsanaktsidis, do you think you can write up reproduction instructions? If not, would you be able to test the proposed glibc in your environment? We'll be patching focal first fwiw.

description: updated
KJ Tsanaktsidis (kjtsanaktsidis) wrote :

It's definitely non-deterministic, unfortunately. I do have a reliable reproduction for Bionic and Focal I can trigger on my laptop, but it's a huge pile of proprietary Ruby code that just happens to hit all the right timings on my machine. I can validate a -proposed package if you need though.

The reproduction instructions basically boil down to "Have IPv6, call getaddrinfo(), and if you're unlucky, it will take > 5 seconds and make four DNS queries instead of two".

The upstream glibc patch also includes a test case that could be applied:

https://sourceware.org/git/?p=glibc.git;a=blob;f=resolv/tst-resolv-txnid-collision.c;h=611d37362f3e5e89b92766f0790459340cc071b3;hb=2dfa659a66f20facc4082207884c20e986ddecee
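
The "call getaddrinfo() and see if you're unlucky" approach above can be sketched as a small timing loop. This is a rough illustration, not a deterministic reproducer: find_slow_lookups is a made-up helper, the 5.0 second threshold is an assumption based on the delay described in this report, and the resolve and clock parameters exist only so the loop can be exercised without a network.

```python
import socket
import time


def find_slow_lookups(host, attempts=100, threshold=5.0,
                      resolve=socket.getaddrinfo, clock=time.monotonic):
    """Resolve host repeatedly and return (attempt, seconds) for slow lookups."""
    slow = []
    for attempt in range(attempts):
        start = clock()
        try:
            # The default address family is unspecified, so on glibc this
            # asks for both A and AAAA records.
            resolve(host, 443, type=socket.SOCK_STREAM)
        except socket.gaierror:
            pass  # failed lookups are still timed, just not treated as fatal
        elapsed = clock() - start
        if elapsed >= threshold:
            slow.append((attempt, elapsed))
    return slow
```

Calling find_slow_lookups("example.com") (the hostname is a placeholder) on an affected system may return a few entries of roughly five seconds each. An empty result does not prove the bug is absent, since hitting it depends on timing, and a local caching resolver can hide it.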

Michael Hudson-Doyle (mwhudson) wrote :

I think given that verification is going to be slightly fiddly and the update is already badly overdue, I will not attempt to fix this bug in the next focal update. But also I do not want to leave it so long again until the next update.

KJ Tsanaktsidis (kjtsanaktsidis) wrote :

I ran into this issue in production today on Graviton 3 AWS instances (with Bionic) :(

Did this ever get backported to Focal, at least? It definitely still doesn't seem to be fixed in Bionic.

This program seems to trip the problem semi-reliably, but again it's very non-deterministic. I don't know if this helps with a reproduction? https://gist.github.com/KJTsanaktsidis/668247fd898cec57f9d280b0222438ae
