Parsing empty bytes as HTML returns "b''"

Bug #2110492 reported by Martin Burchell
This bug affects 1 person
Affects         Status          Importance   Assigned to   Milestone
Beautiful Soup  Fix Committed   Undecided    Unassigned

Bug Description

Perhaps a bit of an edge case, but if you parse b"" with the HTML parser, soup.get_text() returns "b''" (the repr of the empty bytes object) instead of "". With 4.12.3, the returned text is "".

#!/usr/bin/env python
import bs4
soup = bs4.BeautifulSoup(b"", "html.parser")
assert soup.get_text() == ""

Running this script with git-bisect:

5bf3787aa660ed87ecb40859639b3fa7e6497f5c is the first bad commit
commit 5bf3787aa660ed87ecb40859639b3fa7e6497f5c
Author: Leonard Richardson <email address hidden>
Date: Fri May 12 14:21:14 2023 -0400

    Went through dammit.py with mypy strict.

 bs4/dammit.py | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

The line that broke this is in the __init__ of UnicodeDammit, line 800 at the time of writing on the master branch:

        # Short-circuit if the data is in Unicode to begin with.
        if isinstance(markup, str) or markup == b"":
            self.markup = markup
            self.unicode_markup = str(markup)
            self.original_encoding = None
            return

This used to read:

        if isinstance(markup, str) or markup == "":
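
As far as I can tell, the underlying cause is that str() on a bytes object returns its repr rather than decoding it, so once the comparison matches b"" the short-circuit stores "b''" as the Unicode markup. A minimal sketch of the behaviour and of one possible guard follows. The helper short_circuit_markup is hypothetical: it is not part of bs4 and is not necessarily how the eventual fix was written.

```python
from typing import Optional, Union


def short_circuit_markup(markup: Union[str, bytes]) -> Optional[str]:
    """Hypothetical stand-in for the short-circuit branch quoted above.

    Returns the Unicode text when no encoding detection is needed,
    or None when the caller should go on to detect an encoding.
    """
    if isinstance(markup, str):
        # Already Unicode: use it as-is.
        return markup
    if markup == b"":
        # Empty bytes contain no characters, so the text is the empty
        # string. Calling str(markup) here would give "b''" instead.
        return ""
    return None


# str() on bytes produces the repr, which is what the bad commit exposed.
buggy = str(b"")

# Decoding (with any encoding) gives the empty string that was expected.
decoded = b"".decode("utf-8")

fixed = short_circuit_markup(b"")
```

Here buggy holds the four-character string "b''", while decoded and fixed are both empty strings.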

Tested with Python 3.9 and 3.10, with neither html5lib nor lxml installed.

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to pinpoint the issue. I've committed a fix and test in revision 63d18c9.

Changed in beautifulsoup:
status: New → Fix Committed