SoupStrainer behaves differently as constructor argument versus find* method argument

Bug #2111651 reported by Sergey Dudanov
This bug affects 1 person
Affects: Beautiful Soup
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

SoupStrainer does not work correctly with multi-valued attributes.
The following code reproduces this behaviour.

from bs4 import BeautifulSoup
from bs4.filter import SoupStrainer

only_test_classes = SoupStrainer(class_="test")

html_doc = """
<p class="test">One class value</p>
<p class="test bug">Multi-value class</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_test_classes))

Sergey Dudanov (dudanov)
description: updated
Sergey Dudanov (dudanov)
summary: - bug in SoupStrainer with multi-valued attributes
+ SoupStrainer not work correctly with multi-value attributes
Leonard Richardson (leonardr) wrote : Re: SoupStrainer not work correctly with multi-value attributes

Thanks for taking the time to file this bug. Adding a line to your test code demonstrates the issue a bit more clearly. A SoupStrainer can behave differently as a constructor argument and an argument to find_all:

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_test_classes))
print(BeautifulSoup(html_doc, "html.parser").find_all(only_test_classes))

The reason is that on the first line, the document hasn't been parsed yet. The Tag object is in charge of deciding how to process the input from the parser, such as changing certain attribute values from strings into lists. In the first line, the Tag object doesn't exist yet, and the SoupStrainer is being given the same values the Tag _would_ get if its constructor were to be called. In the second line, the Tag object already exists and it parsed the string into a list when its constructor was called.
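The difference can be sketched without Beautiful Soup at all. The following is a simplified model of the two code paths, not the library's actual implementation; the function names `raw_value_matches` and `parsed_value_matches` are hypothetical and exist only for illustration. It assumes the parse_only path compares the filter against the raw attribute string, while the find_all path compares it against the list that Tag produced from that string.

```python
# Simplified model of the two code paths described above.
# This is NOT Beautiful Soup's real code; both function names are made up.


def raw_value_matches(raw_value, wanted):
    """parse_only path: the strainer sees the attribute as one raw string."""
    return raw_value == wanted


def parsed_value_matches(raw_value, wanted):
    """find_all path: Tag has already split the class string into a list."""
    return wanted in raw_value.split()


for raw in ("test", "test bug"):
    print(
        repr(raw),
        "raw:", raw_value_matches(raw, "test"),
        "parsed:", parsed_value_matches(raw, "test"),
    )
```

Under this model, class="test bug" fails the raw comparison but passes the parsed one, which is the discrepancy the two print calls in the test code expose.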

Now that you know what's happening, you can process the attribute value yourself and get the behavior you want:

only_test_classes = SoupStrainer(class_=lambda x: "test" in x.split())

I've added an explanation of this to the "behavior of SoupStrainer" section of the documentation. I am going to leave this bug open for a while because I'm not happy with any difference in behavior of this sort. But right now I don't see a change to the code base that would be a better solution.

This is the most obvious example of the changes made by Tag to the raw incoming data, but it's not the only such change. And I don't want to run all of the preprocessing code before checking the parse_only SoupStrainer, because the point of parse_only is to avoid running code on tags that don't need it.

summary: - SoupStrainer not work correctly with multi-value attributes
+ SoupStrainer behaves differently as constructor argument versus find*
+ method argument
Changed in beautifulsoup:
status: New → Triaged
importance: Undecided → Wishlist
Sergey Dudanov (dudanov) wrote :

Got it! Thanks for explaining this behavior of SoupStrainer. The description in the documentation is exactly what is needed in this situation! Thanks again.

Sergey Dudanov (dudanov) wrote :

Your suggestion does not work as expected:
only_test_classes = SoupStrainer(class_=lambda x: "test" in x.split())

If the tag does not have a class attribute, the lambda is called with None as its argument. This is strange: why call the attribute filter function when the attribute itself is missing?

For your suggestion to work, you need to write code like this:
only_test_classes = SoupStrainer(class_=lambda x: x is not None and "test" in x.split())

Leonard Richardson (leonardr) wrote :

Thanks for the correction. The attribute filter function is always called because the function might have logic that depends on the absence of the attribute, similar to find(class_=False).
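Putting the points from this thread together, a filter function could be written defensively so that it does not depend on which form the attribute value arrives in. This is only a sketch: `has_class` is a hypothetical helper, not part of Beautiful Soup, and it assumes the value may be None (attribute missing), a raw string (the parse_only case), or a list of already-split values.

```python
def has_class(value, wanted="test"):
    """Return True if `wanted` is one of the class names in `value`.

    Hypothetical helper; `value` may be None, a raw string such as
    "test bug", or a list of already-split class names.
    """
    if value is None:
        # The attribute is missing entirely, so there is nothing to match.
        return False
    if isinstance(value, str):
        # Raw, unsplit attribute value: split on whitespace ourselves.
        value = value.split()
    return wanted in value


print(has_class("test bug"))
print(has_class(["test", "bug"]))
print(has_class(None))
```

Such a function could then be passed in place of the lambda, for example as SoupStrainer(class_=has_class).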
