HTMLParser handling of <![CDATA[...]]> changed w/ libxml2 2.9.11+
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | New | Undecided | Unassigned | | |
Bug Description
Python : sys.version_
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
It seems that the handling of <![CDATA[...]]> inside HTMLParser has changed when lxml is built against libxml2 2.9.11+. I'm currently trying to figure out whether it's a regression/behavior change in libxml2 itself or a bug in lxml. However, I wasn't able to reproduce it easily using the C API, and the Cython code in lxml is above my pay grade.
I'm attaching a trivial reproducer using lxml.etree.
b"<html>
With older libxml2, the result is:
start html
start body
end body
end html
(i.e. CDATA is ignored). With newer libxml2, the result is:
start html
start body
data <
data ![CDATA[test]]>
end body
end html
(i.e. CDATA is reported raw as data() method calls)
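Since the attached reproducer is cut off above, here is a minimal sketch of what such a reproducer might look like. The `EventRecorder` class, the `record_events` helper, and the markup string used in the note below are assumptions made for illustration, not the original attachment. It relies only on lxml's parser target interface (`etree.HTMLParser(target=...)` with `feed()` and `close()`).

```python
class EventRecorder:
    """Parser target that records every callback as a tuple (hypothetical name)."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        self.events.append(("data", text))

    def close(self):
        # lxml returns this value from parser.close().
        return self.events


def record_events(markup):
    """Feed markup to lxml's HTMLParser and return the recorded events."""
    # Imported here so that EventRecorder can be exercised without lxml.
    from lxml import etree

    parser = etree.HTMLParser(target=EventRecorder())
    parser.feed(markup)
    return parser.close()
```

Calling `record_events(b"<html><body><![CDATA[test]]></body></html>")` (an assumed input) and printing each tuple should show which of the two event sequences above the installed libxml2 produces.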
This breaks the assumptions made by beautifulsoup4 and soupsieve. I've reported the problem there previously to get some pointers:
https:/
https:/
I've also bisected libxml2 and found out that the following commit causes the behavior change:
commit 173a0830dcec769
Author: Nick Wellnhofer <email address hidden>
Date: 2020-07-22 23:15:35 +0200
Fix quadratic runtime when push parsing HTML start tags
Make sure that htmlParseStartTag doesn't terminate on characters for
which IS_CHAR_CH is false like control chars.
In htmlParseTryOrF
starts a valid name. Otherwise, htmlParseStartTag might return without
consuming all characters up to the final '>'.
Found by OSS-Fuzz.
I can also file a bug against libxml2, but I'm going to need help getting a trivial reproducer there. I've tried using htmlSAXParseDoc(), but I can't reproduce the new behavior there (i.e. CDATA is not reported at all, neither via the cDataBlock callback nor via the characters callback).
As the author of Beautiful Soup, let me say that I would probably prefer the new behavior. I haven't been able to get CDATA sections from lxml the way I can from html.parser and html5lib.
I've been using the strip_cdata=False argument mentioned here:
https://lxml.de/api.html#cdata
But in the context in which I'm using it, it's never worked:
https://bugs.launchpad.net/beautifulsoup/+bug/1275085
I say I'd _probably_ prefer the new behavior because the way in which the CDATA section is being sent over -- as chunked data blocks -- means I don't think I can recognize it as CDATA and create a special CData object on my side. But I'd definitely rather have the data than not.
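One conceivable way to cope with the chunked delivery is to buffer consecutive data() calls and inspect the joined text before passing it on. The sketch below illustrates that idea only. The `CoalescingTarget` class and the `"cdata"` event name are hypothetical and not part of Beautiful Soup or lxml. It also cannot tell a real CDATA section from text that merely looks like one after entity decoding (for example `&lt;![CDATA[x]]&gt;`), so it is a heuristic at best.

```python
import re

# Matches a complete CDATA section inside already-joined character data.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


class CoalescingTarget:
    """Parser target that joins adjacent data() chunks (hypothetical name)."""

    def __init__(self):
        self.events = []
        self._chunks = []

    def _flush(self):
        # Join the buffered chunks and split out anything shaped like CDATA.
        if not self._chunks:
            return
        text = "".join(self._chunks)
        self._chunks = []
        pos = 0
        for match in CDATA_RE.finditer(text):
            if match.start() > pos:
                self.events.append(("data", text[pos:match.start()]))
            self.events.append(("cdata", match.group(1)))
            pos = match.end()
        if pos < len(text):
            self.events.append(("data", text[pos:]))

    def start(self, tag, attrib):
        self._flush()
        self.events.append(("start", tag))

    def end(self, tag):
        self._flush()
        self.events.append(("end", tag))

    def data(self, text):
        # Only buffer here; the decision is made once the run of chunks ends.
        self._chunks.append(text)

    def close(self):
        self._flush()
        return self.events
```

Replaying the callbacks from the newer libxml2 trace above (`data("<")` followed by `data("![CDATA[test]]>")` between the body start and end) yields a single `("cdata", "test")` event instead of two raw data events.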