Misparsing of XML files with very long attributes

Bug #2072424 reported by jaseg
Affects: Beautiful Soup
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

BeautifulSoup misparses XML files with multiple very long attributes; a reproducer is below. The requirement seems to be that the file has at least two elements that each have an attribute that is almost 10 MB long. The symptom is that bs4, without raising any exception or printing any warning, silently swallows subsequent elements in the file, even at tree levels above the elements with the long attributes.

I ticked the "security vulnerability" box because this seems like it might be caused by some buffer overflowing somewhere. If this were due to an intentional length check, I'd expect it to happen (a) at a more "round" number of bytes and (b) with some visible error or warning.

Reproducer:
```
#!/usr/bin/env python

from xml.etree import ElementTree
from bs4 import BeautifulSoup

# The number below is the exact point where this issue shows up in my testing.
points = 'A'*9999825
input_svg = f'''<?xml version="1.0" encoding="utf-8"?>
<svg xmlns="http://www.w3.org/2000/svg">
 <g id="one"/>
 <g id="two">
    <polygon points="{points}"/>
 </g>
 <g id="three"/>
 <g id="four">
    <polygon points="{points}"/>
 </g>
 <g id="five"/>
</svg>
'''

print(f'Length of file is {len(input_svg)} bytes, length of each points attr is {len(points)} bytes')

soup = BeautifulSoup(input_svg, features='lxml-xml')
print('Beautifulsoup:', [e.get('id') for e in soup.find_all('g', recursive=True)])

root = ElementTree.fromstring(input_svg)
print('Python ElementTree:', [e.get('id') for e in root.iterfind('svg:g', {'svg': 'http://www.w3.org/2000/svg'})])
```

Output on my machine w/ Python v3.12.4 and bs4 v4.12.3 from the Arch repos:
```
Length of file is 19999904 bytes, length of each points attr is 9999839 bytes
Beautifulsoup: ['one', 'two', 'three', 'four']
Python ElementTree: ['one', 'two', 'three', 'four', 'five']
```

Notice that the entry "five" is missing from the bs4 output but present in both the input XML and the output of Python's ElementTree.

Revision history for this message
Leonard Richardson (leonardr) wrote :

After digging into this, I believe the magic number is 10,000,000 bytes, and that it is a limitation of libxml2 (lxml depends on libxml2, and Beautiful Soup uses lxml to parse XML). Specifically, it is the XML_MAX_TEXT_LENGTH constant defined here:

https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/include/libxml/parserInternals.h?ref_type=heads#L37

You can raise this limit (libxml2 then enforces the larger XML_MAX_HUGE_LENGTH bound instead) by setting the XML_PARSE_HUGE option at parse time, but Beautiful Soup doesn't currently expose that option, and you can't get rid of the limitation altogether. Since this is (according to the libxml2 comments) a safety boundary feature, I'm removing the 'security vulnerability' flag, though I do appreciate your caution.
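For anyone who needs to parse trusted documents past that boundary today, lxml exposes XML_PARSE_HUGE as the huge_tree option on its parser. A minimal sketch, parsing with lxml directly rather than through Beautiful Soup:

```
# Sketch: lifting libxml2's XML_MAX_TEXT_LENGTH limit via lxml's
# huge_tree option, which sets XML_PARSE_HUGE. Only do this for
# trusted input; the limit is a safety boundary against hostile files.
from lxml import etree

points = 'A' * 10000001  # one byte past the 10,000,000-byte limit
input_svg = f'''<?xml version="1.0" encoding="utf-8"?>
<svg xmlns="http://www.w3.org/2000/svg">
 <g id="one"/>
 <g id="two"><polygon points="{points}"/></g>
 <g id="three"/>
 <g id="four"><polygon points="{points}"/></g>
 <g id="five"/>
</svg>
'''

parser = etree.XMLParser(huge_tree=True)
# lxml rejects str input that carries an encoding declaration, so encode first.
root = etree.fromstring(input_svg.encode('utf-8'), parser=parser)
print([e.get('id') for e in root.iter('{http://www.w3.org/2000/svg}g')])
# Expected: ['one', 'two', 'three', 'four', 'five']
```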

The attached script, based on your script, demonstrates the issue without using any Beautiful Soup code. It also showcases the drastic difference between an attribute value of exactly 10,000,000 bytes and one a single byte larger; that contrast is what convinced me XML_MAX_TEXT_LENGTH was at play here. A sketch along the same lines follows.
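(The attachment itself isn't reproduced on this page; the sketch below is a reconstruction, with the element ids and the third, 3-byte points attribute inferred from the output quoted further down, so the details are guesses.)

```
# Sketch of a comparison script in the spirit of the attached one.
from xml.etree import ElementTree
from lxml import etree as lxml_etree

SVG_NS = 'http://www.w3.org/2000/svg'

def make_svg(attr_len):
    points = 'A' * attr_len
    return f'''<?xml version="1.0" encoding="utf-8"?>
<svg xmlns="{SVG_NS}">
 <g id="one"/>
 <g id="two"><polygon id="p1" points="{points}"/></g>
 <g id="three"/>
 <g id="four"><polygon id="p2" points="{points}"/><polygon id="p2" points="123"/></g>
 <g id="five"/>
</svg>
'''

def summarize(root):
    gs = [e.get('id') for e in root.iter(f'{{{SVG_NS}}}g')]
    polys = [(e.get('id'), len(e.get('points') or ''))
             for e in root.iter(f'{{{SVG_NS}}}polygon')]
    return gs, polys

for attr_len in (9999824, 9999825, 10000000, 10000001):
    doc = make_svg(attr_len)
    print(f'Length of file is {len(doc)} bytes, '
          f'length of each points attr is {attr_len} bytes')
    print('Python ElementTree:', *summarize(ElementTree.fromstring(doc)))
    # recover=True mirrors what Beautiful Soup's lxml builder uses, and
    # lets lxml return a truncated tree instead of raising.
    parser = lxml_etree.XMLParser(recover=True)
    print('lxml ElementTree:', *summarize(
        lxml_etree.fromstring(doc.encode('utf-8'), parser=parser)))
    print()
```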

So, what to do?

First, this is an argument for someone completing the work described in bug #1651251. There's a market for a Beautiful Soup TreeBuilder which parses XML using one of Python's built-in XML parsing libraries, either xml.sax or xml.etree. Originally I thought this was purely a matter of reducing the number of dependencies, but it looks like xml.etree has capabilities that lxml doesn't have.
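As a quick sanity check that the stdlib parsers cope with this input, here is a small sketch that streams the reproducer document through xml.sax (expat-based, and expat handled the oversized attributes in the tests here); input_svg is the document built in the reproducer above:

```
# Sketch: streaming the reproducer document through xml.sax (expat).
import io
from xml import sax

class GCollector(sax.ContentHandler):
    """Collects the id attribute of every <g> element seen."""
    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == 'g':
            self.ids.append(attrs.get('id'))

handler = GCollector()
# input_svg is the document built in the reproducer above.
sax.parse(io.BytesIO(input_svg.encode('utf-8')), handler)
print(handler.ids)  # expected: ['one', 'two', 'three', 'four', 'five']
```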

Second, I may file a bug against libxml2. For a single text node larger than XML_MAX_TEXT_LENGTH, I can understand the undefined behavior, but this is happening even when individual text nodes are smaller than XML_MAX_TEXT_LENGTH, if the _total_ size is greater than XML_MAX_TEXT_LENGTH. It looks like the buffer is filling up across multiple text nodes or something.

Output of my script:
```
Length of file is 19999926 bytes, length of each points attr is 9999824 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 9999824), ('p2', 9999824), ('p2', 3)]
lxml ElementTree: ['one', 'two', 'three'] [('p1', 9999824)]

Length of file is 19999928 bytes, length of each points attr is 9999825 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 9999825), ('p2', 9999825), ('p2', 3)]
lxml ElementTree: ['one', 'two', 'three'] [('p1', 9999825)]

Length of file is 20000278 bytes, length of each points attr is 10000000 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 10000000), ('p2', 10000000), ('p2', 3)]
lxml ElementTree: ['one', 'two', 'three'] [('p1', 10000000)]

Length of file is 20000280 bytes, length of each points attr is 10000001 bytes
Python ElementTree: ['one', 'two', 'three', 'four', 'five'] [('p1', 10000001), ('p2', 10000001), ('p2', 3)]
lxml ElementTree: ['one', 'three'] [('p1', 0)]
```

information type: Private Security → Public
Revision history for this message
jaseg (jaseg) wrote :

That limit sure looks like the culprit. The behavior where it only swallows later elements after two instances is really odd, though: in a test file with only one large element, it will happily parse a 60 MB attr correctly, but two 10 MB attrs cause it to misparse.

Apart from your two suggestions, I think a good interim measure would be for bs4(?) to bubble up the error from lxml/libxml2, to alert the user to that parser limitation.
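For what it's worth, even in recovery mode lxml records the libxml2 errors on the parser's error_log, so they could plausibly be surfaced as warnings. A rough sketch (parse_with_warnings is a hypothetical helper, not an existing bs4 or lxml API):

```
# Sketch: surfacing libxml2's recorded errors as Python warnings.
# parse_with_warnings is a hypothetical helper, not part of bs4 or lxml.
import warnings
from lxml import etree

def parse_with_warnings(data: bytes):
    parser = etree.XMLParser(recover=True)  # same mode bs4's builder uses
    root = etree.fromstring(data, parser=parser)
    for entry in parser.error_log:
        # Each entry carries the libxml2 message plus its position.
        warnings.warn(f'libxml2: {entry.message} (line {entry.line})')
    return root
```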

