SoupStrainer behaves differently as constructor argument versus find* method argument
Bug #2111651 reported by Sergey Dudanov
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup | Triaged | Wishlist | Unassigned | | |
Bug Description
SoupStrainer does not work correctly with multi-valued attributes.
The following code reproduces this behaviour.
from bs4 import BeautifulSoup
from bs4.filter import SoupStrainer
only_test_classes = SoupStrainer(
html_doc = """
<p class="test">One class value</p>
<p class="test bug">Multi-value class</p>
"""
print(Beautiful
description: updated
summary: changed from "bug in SoupStrainer with multi-valued attributes" to "SoupStrainer not work correctly with multi-value attributes"
Thanks for taking the time to file this bug. Adding a line to your test code demonstrates the issue a bit more clearly. A SoupStrainer can behave differently as a constructor argument and an argument to find_all:
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_test_classes))
print(BeautifulSoup(html_doc, "html.parser").find_all(only_test_classes))
The reason is that on the first line, the document hasn't been parsed yet. The Tag object is in charge of deciding how to process the input from the parser, such as changing certain attribute values from strings into lists. In the first line, the Tag object doesn't exist yet, and the SoupStrainer is being given the same values the Tag _would_ get if its constructor were to be called. In the second line, the Tag object already exists and it parsed the string into a list when its constructor was called.
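The difference described above can be illustrated with a toy model that does not use Beautiful Soup at all. Every name in it (raw_attrs, processed_attrs, naive_match, MULTI_VALUED) is hypothetical and invented for this sketch; it is not how bs4 is implemented. It only shows why the same filter can see a whole string in one case and a list of values in the other.

```python
# Toy model only: these helpers are hypothetical and are NOT bs4 internals.

MULTI_VALUED = {"class"}  # assumption: only "class" is multi-valued in this sketch


def raw_attrs(pairs):
    """Attributes as a parser might hand them over: every value is a plain string."""
    return dict(pairs)


def processed_attrs(pairs):
    """Attributes after Tag-like preprocessing: multi-valued ones become lists."""
    return {
        name: value.split() if name in MULTI_VALUED else value
        for name, value in pairs
    }


def naive_match(value, wanted):
    """Compare a wanted string against a plain string or a list of strings."""
    if isinstance(value, list):
        return wanted in value
    return value == wanted


pairs = [("class", "test bug")]

# Before the Tag exists: the filter sees the raw string "test bug".
before_tag_exists = naive_match(raw_attrs(pairs)["class"], "test")

# After the Tag exists: the filter sees the list ["test", "bug"].
after_tag_exists = naive_match(processed_attrs(pairs)["class"], "test")

print(before_tag_exists)  # False
print(after_tag_exists)   # True
```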
Now that you know what's happening, you can process the attribute value yourself and get the behavior you want:
only_test_classes = SoupStrainer(class_=lambda x: "test" in x.split())
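A slightly more defensive variant of that lambda might look like the sketch below. The helper name has_class_token is hypothetical, and the None and list branches are assumptions about what a filter function could be handed (for example, a tag with no class attribute at all); the source only shows the string case. The function is plain Python, so it can be checked without parsing a document.

```python
def has_class_token(value, token="test"):
    """Return True if `token` is one of the whitespace-separated class values.

    Hypothetical helper, not part of bs4. It accepts:
      - None (assumed to mean the attribute is absent)
      - a raw string such as "test bug"
      - an already-split list such as ["test", "bug"]
    """
    if value is None:
        return False
    if isinstance(value, str):
        value = value.split()
    return token in value


# The two class values from the bug report, as raw strings.
print(has_class_token("test"))       # True
print(has_class_token("test bug"))   # True

# Substrings do not count as a match, only whole tokens.
print(has_class_token("testing"))    # False
```

A function like this could presumably be passed as the class_ argument in place of the lambda; that usage is untested here.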
I've added an explanation of this to the "behavior of SoupStrainer" section of the documentation. I am going to leave this bug open for a while because I'm not happy with any difference in behavior of this sort. But right now I don't see a change to the code base that would be a better solution.
This is the most obvious example of the changes made by Tag to the raw incoming data, but it's not the only such change. And I don't want to run all of the preprocessing code before checking the parse_only SoupStrainer, because the point of parse_only is to avoid running code on tags that don't need it.