Does SoupStrainer have two arguments?

OK, this started out as a question, but halfway through I figured it out. I couldn't find this question on Stack Overflow or Google, so I'll post it anyway to help anyone who stumbles upon it.

I wanted to use SoupStrainer from BeautifulSoup to parse two tags instead of one in an HTML document.

I knew I could do this:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))  

      

This gets the <p> tags. I also wanted the <h3> tags, so I tried this:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('h3', 'p'))

      

But that won't work, because SoupStrainer takes only one positional argument for the tag name, so 'p' is not treated as a second tag.
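As far as I can tell, the call does not raise an error. The second positional argument lands in attrs, and a non-dict attrs value is treated as a CSS class, so SoupStrainer('h3', 'p') keeps only <h3> tags whose class is "p". A minimal sketch, assuming a made-up HTML snippet and html.parser in place of lxml so it runs without extra dependencies:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, not from the original question.
html = (
    '<h3>plain heading</h3>'
    '<h3 class="p">heading with class p</h3>'
    '<p>paragraph</p>'
)

# 'h3' is the tag name; 'p' is taken as attrs, i.e. as a class to match.
strainer = SoupStrainer('h3', 'p')
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)

# Collect every tag that survived the strainer.
kept = [str(tag) for tag in soup.find_all(True)]
print(kept)
```

Neither the plain <h3> nor the <p> should survive, which is why the result looks like "it doesn't work".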

The answer is below.



3 answers


To get SoupStrainer to parse multiple tags, you need to put them in a list. Like this:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(['h3', 'p']))

      

This parses both the <h3> and <p> tags in content.text, even if they are siblings (i.e. one tag is not inside the other).

You can do this with more than two tags if you pass them as one list to SoupStrainer.



One tag:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))

      

Several tags:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(['h1', 'h3', 'p', 'h4']))
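Here is a self-contained sketch of the list form that you can run without fetching a page. The HTML string is made up for illustration, and html.parser replaces lxml only to avoid the extra dependency:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document with sibling and nested tags.
html = """
<html><body>
<h1>Title</h1>
<h3>Section</h3>
<p>First paragraph.</p>
<div><p>Nested paragraph.</p><span>Ignored</span></div>
</body></html>
"""

# One list argument: keep only <h3> and <p> while parsing.
only_h3_and_p = SoupStrainer(['h3', 'p'])
soup = BeautifulSoup(html, 'html.parser', parse_only=only_h3_and_p)

# Names of all tags that ended up in the parsed tree.
names = [tag.name for tag in soup.find_all(True)]
print(names)
```

The <h1>, <div> and <span> tags should be missing from names, while the <p> nested in the <div> is still picked up.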

      



Regular expressions give you many more possibilities. Use the re module.

Beautiful Soup applies a compiled pattern with search(), so anchor it to match the full tag name. This gets the <p> and <h3> tags:

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(re.compile(r"^(p|h3)$")))
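A small sketch of how a compiled pattern behaves as a tag filter. Beautiful Soup calls the pattern's search() method on each tag name, so an unanchored p|h3 also matches names that merely contain a p, such as span or pre. The sample HTML and the kept_tags helper are made up for illustration, and html.parser stands in for lxml:

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup.
html = "<h3>Heading</h3><p>Text</p><span>Inline</span><pre>Code</pre>"

# Unanchored: matches any tag name containing "p" or "h3".
loose = SoupStrainer(re.compile(r"p|h3"))
# Anchored: matches only the exact names "p" and "h3".
strict = SoupStrainer(re.compile(r"^(p|h3)$"))

def kept_tags(strainer):
    """Parse the sample with the given strainer and list surviving tag names."""
    soup = BeautifulSoup(html, "html.parser", parse_only=strainer)
    return [tag.name for tag in soup.find_all(True)]

print(kept_tags(loose))
print(kept_tags(strict))
```

If you only want a fixed set of tag names, a plain list is usually simpler than a pattern.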

      

@Martijn For attributes, you can pass class_ or attrs to SoupStrainer. Either of these works:



soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(re.compile(r"^(p|h3)$"), class_="foo"))
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(re.compile(r"^(p|h3)$"), attrs={"class": "foo"}))

      

But you obviously can't specify a class for each tag separately this way. You can get around this with CSS selectors.

soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer(["h1", "h2", "h3", "p"])).select("h1.foo, h2, h3, p")
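To make the two-step idea concrete, here is a rough sketch: the strainer limits what gets parsed, then select() applies the per-tag conditions. The HTML is made up, html.parser replaces lxml, and select() relies on the soupsieve package that normally installs alongside bs4:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup.
html = (
    '<h1 class="foo">A</h1>'
    '<h1>B</h1>'
    '<h2>C</h2>'
    '<p>D</p>'
    '<div>E</div>'
)

# Step 1: parse only the tag names we care about.
strainer = SoupStrainer(["h1", "h2", "h3", "p"])
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

# Step 2: require class "foo" on <h1> only; other tags match by name.
matches = soup.select("h1.foo, h2, h3, p")
texts = [tag.get_text() for tag in matches]
print(texts)
```

Note that select() returns a list of tags, so it is clearer to keep the parsed soup and the selected matches in separate variables.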

      



This is how the SoupStrainer constructor is defined in bs4:

class SoupStrainer(object):
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):

...

      

So, adding to @JohnStrood's answer, you can use the attrs argument (a dictionary) to constrain matches to one or more attributes:

attrs_dict = { "class": "foo", "other_attr": ["value1", "value2"] }
strainer = SoupStrainer(['h3', 'p'], attrs=attrs_dict)
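A runnable sketch of the same idea. The data-kind attribute and the sample HTML are invented for illustration, and html.parser is used instead of lxml. Each element carries a single class value here, to keep the match unambiguous:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup; "data-kind" is a made-up attribute.
html = (
    '<h3 class="foo" data-kind="a">keep</h3>'
    '<p class="foo" data-kind="b">keep too</p>'
    '<p class="foo" data-kind="c">wrong data-kind</p>'
    '<p class="bar" data-kind="a">wrong class</p>'
    '<div class="foo" data-kind="a">wrong tag</div>'
)

# All attribute conditions must hold; a list means "any of these values".
attrs_dict = {"class": "foo", "data-kind": ["a", "b"]}
strainer = SoupStrainer(["h3", "p"], attrs=attrs_dict)
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

texts = [tag.get_text() for tag in soup.find_all(True)]
print(texts)
```

Only the first two elements satisfy the tag name, the class, and the data-kind conditions together.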

      
