Misparsing of XML files with very long attributes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | New | Undecided | Unassigned |
Bug Description
BeautifulSoup misparses XML files with multiple very long attributes; reproducer below. The requirements seem to be that the file has at least two elements that each have an attribute that is almost 10 MB long. The symptom is that bs4 silently swallows subsequent elements in the file, without raising any exceptions or printing any warnings, even at tree levels above the elements with the long attributes.
I ticked the "security vulnerability" box because this seems like it might be caused by some buffer overflowing somewhere. If this was due to an intentional length check, I'd expect it to happen (a) at a more "round" number of bytes and (b) with some visible error or warning.
Reproducer:
```
#!/usr/bin/env python
from xml.etree import ElementTree
from bs4 import BeautifulSoup
# The number below is the exact point where this issue shows up in my testing.
points = 'A'*9999825
input_svg = f'''<?xml version="1.0" encoding="utf-8"?>
<svg xmlns="http://www.w3.org/2000/svg">
<g id="one"/>
<g id="two">
<polygon points="{points}"/>
</g>
<g id="three"/>
<g id="four">
<polygon points="{points}"/>
</g>
<g id="five"/>
</svg>
'''
print(f'Length of file is {len(input_svg)} bytes, length of each points attr is {len(points)} bytes')
soup = BeautifulSoup(input_svg, 'xml')
print('Beautifulsoup:', [g.get('id') for g in soup.find_all('g')])
root = ElementTree.fromstring(input_svg)
print('Python ElementTree:', [e.get('id') for e in root.iterfind('{http://www.w3.org/2000/svg}g')])
```
Output on my machine w/ Python v3.12.4 and bs4 v4.12.3 from the arch repos:
```
Length of file is 19999904 bytes, length of each points attr is 9999839 bytes
Beautifulsoup: ['one', 'two', 'three', 'four']
Python ElementTree: ['one', 'two', 'three', 'four', 'five']
```
Notice that the entry "five" is missing from the bs4 output, but present in both the input XML and the output from Python's ElementTree.
After digging into this I believe the magic number is 10,000,000 bytes and it is a limitation of libxml2 (lxml depends on libxml2, and Beautiful Soup uses lxml to parse XML). Specifically it is the XML_MAX_TEXT_LENGTH constant defined here:
https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/include/libxml/parserInternals.h?ref_type=heads#L37
You can raise this limit by setting the XML_PARSE_HUGE parser option (a parse-time option rather than something you compile in; lxml exposes it as huge_tree=True on its parsers), but it's off by default, and you can't get rid of the limitation altogether. Since this is (according to the libxml2 comments) a safety boundary feature rather than a bug, I'm removing the 'security vulnerability' flag, though I do appreciate your caution.
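If you control the lxml parsing yourself, the option is reachable from Python. Here's a minimal sketch, assuming the `input_svg` string from the reproducer above and a libxml2 that honours the option for attributes this size:

```python
from lxml import etree

# huge_tree=True turns on libxml2's XML_PARSE_HUGE option, relaxing the
# 10,000,000-byte text/attribute safety limit for this parser instance.
parser = etree.XMLParser(huge_tree=True)
root = etree.fromstring(input_svg.encode('utf-8'), parser=parser)

# With the limit relaxed, all five <g> elements should survive parsing.
print([g.get('id') for g in root.iter('{http://www.w3.org/2000/svg}g')])
```

Note that this goes through lxml directly; it doesn't change what Beautiful Soup's lxml-based XML builder does.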
The attached script, based on your script, demonstrates the issue without using any Beautiful Soup code. It also shows the drastic difference between an attribute value 10,000,000 bytes long and one that is a single byte longer; that difference is what convinced me XML_MAX_TEXT_LENGTH was at play here.
So, what to do?
First, this is an argument for someone completing the work described in bug #1651251. There's a market for a Beautiful Soup TreeBuilder which parses XML using one of Python's built-in XML parsing libraries, either xml.sax or xml.etree. Originally I thought this was purely a matter of reducing the number of dependencies, but it looks like xml.etree has capabilities that lxml doesn't have.
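As a rough sanity check of that idea, here's a sketch (my own code, not Beautiful Soup's) that runs the reproducer's `input_svg` through Python's built-in SAX parser; since xml.sax is backed by the same expat parser as xml.etree, it should report all five `<g>` elements:

```python
import xml.sax

class GIdCollector(xml.sax.ContentHandler):
    """Records the id attribute of every <g> element the parser reports."""
    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == 'g':
            self.ids.append(attrs.get('id'))

handler = GIdCollector()
xml.sax.parseString(input_svg.encode('utf-8'), handler)
print('xml.sax:', handler.ids)  # expected: ['one', 'two', 'three', 'four', 'five']
```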
Second, I may file a bug against libxml2. For a single text node larger than XML_MAX_TEXT_LENGTH, I can understand the undefined behavior, but this is happening even when individual text nodes are smaller than XML_MAX_TEXT_LENGTH, as long as the _total_ size is greater than XML_MAX_TEXT_LENGTH. It looks like the buffer is filling up across multiple text nodes or something.
Output of my script:
```
Length of file is 19999926 bytes, length of each points attr is 9999824 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 9999824), ('p2', 9999824), ('p2', 3)]
lxml ElementTree: ['one', 'two', 'three'] [('p1', 9999824)]
Length of file is 19999928 bytes, length of each points attr is 9999825 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 9999825), ('p2', 9999825), ('p2', 3)]
lxml ElementTree: ['one', 'two', 'three'] [('p1', 9999825)]
Length of file is 20000278 bytes, length of each points attr is 10000000 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 10000000), ('p2', 10000000), ('p2', 3)]
lxml ElementTree: ['one', 'two', 'three'] [('p1', 10000000)]
Length of file is 20000280 bytes, length of each points attr is 10000001 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 10000001), ('p2', 10000001), ('p2', 3)]
lxml ElementTree: ['one', 'three'] [('p1', 0)]
```
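For completeness, here is a rough sketch of what a comparison script along those lines might look like. The attachment isn't reproduced here, so the element and attribute names (p1, p2), the recover=True choice, and the overall structure are my assumptions rather than a copy of the attached script; only the byte counts come from the output above.

```python
#!/usr/bin/env python
from xml.etree import ElementTree
from lxml import etree as lxml_etree

SVG_NS = '{http://www.w3.org/2000/svg}'

def build_svg(attr_len):
    """Two huge attributes plus one tiny one, spread across several <g> elements."""
    big = 'A' * attr_len
    return f'''<?xml version="1.0" encoding="utf-8"?>
<svg xmlns="http://www.w3.org/2000/svg">
<g id="one"/>
<g id="two"><polygon p1="{big}"/></g>
<g id="three"/>
<g id="four"><polygon p2="{big}"/><polygon p2="123"/></g>
<g id="five"/>
</svg>
'''

def summarize(root):
    """Return the <g> ids seen plus (attribute name, value length) for every <polygon>."""
    ids = [g.get('id') for g in root.iter(SVG_NS + 'g')]
    attrs = [(name, len(value))
             for poly in root.iter(SVG_NS + 'polygon')
             for name, value in poly.attrib.items()]
    return ids, attrs

for attr_len in (9999824, 9999825, 10000000, 10000001):
    svg = build_svg(attr_len)
    print(f'Length of file is {len(svg)} bytes, length of each points attr is {attr_len} bytes')
    print('Python ElementTree:', *summarize(ElementTree.fromstring(svg)))
    # recover=True keeps lxml from raising, which appears to match the silent
    # element loss described in this report; without it you may get an exception.
    lxml_parser = lxml_etree.XMLParser(recover=True)
    lxml_root = lxml_etree.fromstring(svg.encode('utf-8'), parser=lxml_parser)
    print('lxml ElementTree:', *summarize(lxml_root))
```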