Parsing empty bytes as HTML returns "b''"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Committed | Undecided | Unassigned |
Bug Description
Perhaps a bit of an edge case, but if you parse b"" with the HTML parser, the returned text is "b''". With 4.12.3, the returned text is "".
#!/usr/bin/env python
import bs4
soup = bs4.BeautifulSo
assert soup.get_text() == ""
Running this script with git-bisect points to the following commit:
5bf3787aa660ed8
commit 5bf3787aa660ed8
Author: Leonard Richardson <email address hidden>
Date: Fri May 12 14:21:14 2023 -0400
Went through dammit.py with mypy strict.
bs4/dammit.py | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
The line that broke this is in the __init__ of UnicodeDammit, line 800 at the time of writing on the master branch:
# Short-circuit if the data is in Unicode to begin with.
if isinstance(markup, str) or markup == b"":
    return
This used to read:
if isinstance(markup, str) or markup == "":
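For anyone reading along, here is a small standalone sketch of why changing the literal changes behaviour. It does not import bs4. The excerpt above shows only the condition and the return, so it is an assumption that the short-circuit branch ends up calling str() on markup. That would be consistent with the "b''" output, but it is not confirmed here. The helper to_text_if_trivial is hypothetical and is not the fix committed in 63d18c9.

```python
# In Python 3 a bytes object never compares equal to a str,
# even when both are empty.
empty_bytes = b""

# Old condition: false for b"", so the short-circuit is skipped.
old_check = isinstance(empty_bytes, str) or empty_bytes == ""
# New condition: true for b"", so the short-circuit is taken.
new_check = isinstance(empty_bytes, str) or empty_bytes == b""
print(old_check, new_check)  # False True

# str() on bytes (without an encoding) returns the repr, not decoded text.
print(repr(str(empty_bytes)))  # "b''"


def to_text_if_trivial(markup):
    """Hypothetical helper (not part of bs4): return text for the
    trivial cases, or None when real encoding detection is needed."""
    if isinstance(markup, str):
        return markup   # already Unicode
    if markup == b"":
        return ""       # empty bytes become empty text, never "b''"
    return None         # caller must detect the encoding


print(repr(to_text_if_trivial(b"")))  # ''
```

The point of the sketch is only that the two empty-input cases need different handling: a str can be passed through as-is, while empty bytes have to be mapped to an empty str explicitly rather than run through str().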
Tested with Python 3.9 and 3.10
Neither html5lib nor lxml installed
Thanks for taking the time to pinpoint the issue. I've committed a fix and test in revision 63d18c9.