Does SoupStrainer have two arguments?

Question


OK, this started out as a question, but halfway through I figured it out. I couldn't find this question on Stack Overflow or Google, so I'll post it anyway to help anyone who stumbles upon it.

I wanted to use SoupStrainer from BeautifulSoup to parse two tags instead of one in an HTML document.

I knew I could do this:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))

This will get the <p> tags. I also wanted to get the <h3> tags, so I tried this:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('h3', 'p'))

But that won't work: SoupStrainer's second positional argument is attrs (an attribute filter), not a second tag name.
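A minimal sketch of what the two-argument call does in practice. It assumes the constructor signature quoted in the third answer (name first, attrs second) and that bs4 treats a string passed as attrs as a CSS class filter. The sample markup is made up, and 'html.parser' is swapped in for 'lxml' only so the snippet needs no extra install:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for content.text.
html = (
    '<h3>Plain heading</h3>'
    '<h3 class="p">Heading with class p</h3>'
    '<p>Paragraph</p>'
)

# 'p' lands in the attrs slot, so this asks for <h3> tags whose class is "p".
strainer = SoupStrainer('h3', 'p')
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)

kept = [(tag.name, tag.get_text()) for tag in soup.find_all(True)]
print(kept)
```

So the call runs without an error, but it keeps neither the plain <h3> nor the <p>.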

The answer is below.


python python-3.x beautifulsoup

GreenRaccoon23 Dec 30. 15 at 22:59


3 answers

GreenRaccoon23 · Answer 1 · 2014-12-31T17:11:37+0000

To get SoupStrainer to parse multiple tags, you need to put them in a list. Like this:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(['h3', 'p']))

This parses both the <h3> and <p> tags in content.text, even if they are siblings (i.e. one tag is not inside the other).

You can do this with more than two tags if you pass them as one list to SoupStrainer.

One tag:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))

Several tags:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(['h1', 'h3', 'p', 'h4']))
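Here is a small self-contained sketch of the list form. The sample HTML is invented for illustration, and 'html.parser' replaces 'lxml' only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for content.text.
html = """
<html><body>
<h1>Page title</h1>
<h3>Section</h3>
<p>First paragraph.</p>
<div><p>Nested paragraph.</p></div>
<span>Ignored</span>
</body></html>
"""

# A list of names keeps a tag if its name equals any item in the list.
only_h3_and_p = SoupStrainer(['h3', 'p'])
soup = BeautifulSoup(html, 'html.parser', parse_only=only_h3_and_p)

kept = [tag.name for tag in soup.find_all(True)]
text = soup.get_text()
print(kept)
```

The <h3> and both <p> tags survive, whether they sit next to each other or inside a tag that was itself filtered out. Everything else is dropped at parse time.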

John strood · Answer 2 · 2018-12-11T12:12:13+0000

There are many more possibilities with regex. Use the re module.

Quoting the question: "This will get the tags <p>. I also wanted to get the tags <h3>."

import re
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(re.compile(r"^(p|h3)$")))
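One thing to watch with regex name filters: Beautiful Soup matches a compiled pattern using its search() method, so an unanchored pattern matches any tag name that merely contains the text. A rough sketch comparing the two forms (the helper kept_names and the sample markup are made up for illustration, and 'html.parser' is used in place of 'lxml'):

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup; <span> and <pre> both contain the letter "p".
html = '<h3>Heading</h3><p>Para</p><span>Span</span><pre>Pre</pre>'

def kept_names(pattern):
    """Parse html with a regex-based strainer and list the tag names kept."""
    strainer = SoupStrainer(re.compile(pattern))
    soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)
    return [tag.name for tag in soup.find_all(True)]

loose = kept_names(r"p|h3")        # search() finds "p" inside "span" and "pre"
strict = kept_names(r"^(p|h3)$")   # anchored: only the exact names match

print(loose)
print(strict)
```

Anchoring with ^ and $ keeps the filter to exactly <p> and <h3>.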

@Martijn For attributes, you can use attrs.

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(re.compile(r"^(p|h3)$"), class_="foo"))
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(re.compile(r"^(p|h3)$"), attrs={"class": "foo"}))

But you obviously can't apply a class filter to only some of the HTML tags this way; it applies to every tag the strainer matches. You can get around this with CSS selectors.

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(["h1", "h2", "h3", "p"])).select("h1.foo, h2, h3, p")
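A quick sketch of that two-step approach on made-up markup: the strainer does the coarse filtering by tag name at parse time, then select() applies the per-tag class rule. The sample HTML is hypothetical and 'html.parser' stands in for 'lxml':

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup: only one of the two <h1> tags carries class "foo".
html = (
    '<h1 class="foo">Keep</h1>'
    '<h1>Drop</h1>'
    '<h2>Sub</h2>'
    '<p>Text</p>'
    '<div>Other</div>'
)

# Step 1: keep only the tag names of interest while parsing.
strainer = SoupStrainer(["h1", "h2", "h3", "p"])
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)

# Step 2: require class "foo" on <h1> only; h2, h3 and p pass unconditionally.
selected = soup.select("h1.foo, h2, h3, p")
texts = [tag.get_text() for tag in selected]
print(texts)
```

The <h1> without the class is parsed but filtered out by the selector, while the <div> never makes it into the tree at all.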

kollwitz · Answer 3 · 2019-01-12T18:22:18+0000

This is how the SoupStrainer constructor is defined in bs4:

class SoupStrainer(object):
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):

...

So, adding to @JohnStrood's answer, you can use the attrs argument (a dictionary) to constrain matches to one or more attributes:

attrs_dict = { "class": "foo", "other_attr": ["value1", "value2"] }
strainer = SoupStrainer(["h3", "p"], attrs=attrs_dict)
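To make the effect concrete, here is a minimal sketch on invented markup. other_attr is just the placeholder attribute name from the snippet above, and 'html.parser' is used so nothing beyond bs4 is needed. A tag is kept only if every key in the dictionary matches, and a list value matches if any one of its items does:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup exercising each combination of the two attributes.
html = (
    '<h3 class="foo" other_attr="value1">A</h3>'
    '<p class="foo" other_attr="value2">B</p>'
    '<p class="foo" other_attr="value3">C</p>'
    '<p class="bar" other_attr="value1">D</p>'
    '<h3>E</h3>'
)

attrs_dict = {"class": "foo", "other_attr": ["value1", "value2"]}
strainer = SoupStrainer(["h3", "p"], attrs=attrs_dict)
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)

texts = [tag.get_text() for tag in soup.find_all(True)]
print(texts)
```

C is dropped because value3 is not in the list, D because its class is not foo, and E because it has neither attribute.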
