Beautiful Soup: unable to use get_text() after using extract()

I am working on a web scraper and I only want the text from any site, so I am using Beautiful Soup. Initially I found that the get_text() method was also returning JavaScript code, so to avoid that I had to use the extract() method. Now I have a strange problem: after extracting the script and style tags, Beautiful Soup does not recognize the body, even though it is present in the new HTML.

Let me first explain what I was doing:

from bs4 import BeautifulSoup

soup = BeautifulSoup(rawData, 'html.parser')  # rawData: HTML fetched from the site
print(soup.body)

      

All of the HTML data was printed here, but when I do this:

soup = BeautifulSoup(rawData, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()    # rip it out
print(soup.body)

      

Now it prints None, as if the element were missing. For debugging, I then called soup.prettify(), and it prints the whole HTML, including the body tag as well as the script and style tags. :( Now I am very confused about why this is happening: if body is present, why does it say None?

Please help, thanks.

I am using Python 3 and bs4, and rawData is HTML extracted from a website.



3 answers


Problem: Using this HTML example:

<html>
<style>just style</style>
<span>Main text.</span>
</html>

      

After extracting the style tag and calling get_text(), it returns only the text that should have been removed. This is due to the double newline left in the HTML after using extract(). Inspect soup.contents before and after extract() and you will see the problem.

Before extract():

[<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>]

      

After extract():

[<html>\n\n<span>Main text.</span>\n</html>]

      

You can see the double newline between the html tag and the span. For an unknown reason, this breaks get_text(). To verify this, remove the newlines from the example and it will work correctly.
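The before/after check described here can be sketched as follows. This is only an illustration: the helper name contents_around_extract is hypothetical, and whether get_text() misbehaves afterwards depends on the installed bs4 version, so no output is claimed.

```python
from bs4 import BeautifulSoup

HTML = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"


def contents_around_extract(raw_html):
    """Return repr(soup.contents) before and after removing <style> tags.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(raw_html, "html.parser")
    before = repr(soup.contents)
    for tag in soup.find_all("style"):
        tag.extract()  # detach the tag from the tree
    after = repr(soup.contents)
    return before, after


before, after = contents_around_extract(HTML)
print(before)
print(after)
```

Comparing the two printed lists should make the leftover whitespace nodes around the removed tag visible.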



Solutions:

1.- Parse the soup again after calling extract().

BeautifulSoup(str(soup), 'html.parser')

      

2.- Use a different parser.

BeautifulSoup(raw, 'html5lib')

      

Note. Solution #2 doesn't work if you extract two or more adjacent tags, because you get a double newline again.

Note. You may need to install this parser. Just run:

pip install html5lib
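Solution 1 can be wrapped in a small helper. This is a minimal sketch, assuming html.parser is acceptable for both passes; the function name text_without_scripts is made up for this example, and re-parsing costs a second pass over the document.

```python
from bs4 import BeautifulSoup


def text_without_scripts(raw_html):
    """Strip <script>/<style>, re-parse the cleaned markup, return its text.

    Hypothetical helper name; the re-parse step is solution 1 above.
    """
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):  # same as soup.find_all([...])
        tag.extract()
    # Serialise and parse again so the tree is rebuilt from scratch.
    clean = BeautifulSoup(str(soup), "html.parser")
    return clean.get_text()


sample = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"
print(text_without_scripts(sample))
```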

      



Can you include the content of rawData? If your rawData looks something like this:

<script>...</script>
<script>...</script>
<style>...</style>

      



then the result makes sense: X.extract() removes that tag from the DOM tree.

Without all the content and all the code, it will be difficult to help you.
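To illustrate this point, here is a small sketch in which the fragment below is an assumed stand-in for rawData (the real content was never posted). With html.parser, no body tag is added to markup that lacks one, so soup.body is None both before and after extraction.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for rawData: only script and style tags, no <body>.
fragment = "<script>var x = 1;</script>\n<style>p { color: red; }</style>"

soup = BeautifulSoup(fragment, "html.parser")
body_before = soup.body  # html.parser does not add a missing <body>

# extract() detaches each tag from the tree and returns it.
removed = [tag.extract() for tag in soup(["script", "style"])]
body_after = soup.body

print(body_before, body_after)
print([tag.name for tag in removed])
```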



This seems to be a bug in the latest release, 4.4.0. I had an almost identical problem: after decomposing (or extracting) a tag, I couldn't access the next tag.

The first solution in Miguel Sánchez's answer works, but it is very slow!

Going back to BeautifulSoup 4.3.2 solved the problem for me:

pip uninstall beautifulsoup4
pip install -Iv http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz
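Before downgrading, it may be worth confirming which release is actually installed. A minimal sketch: bs4 exposes its version string as bs4.__version__; the helper name is hypothetical, and the assumption that only 4.4.0 is affected comes from this answer, not from any changelog.

```python
import bs4


def is_suspect_version(version):
    """True for the release this answer blames (4.4.0); an assumption."""
    return version.strip() == "4.4.0"


print(bs4.__version__, is_suspect_version(bs4.__version__))
```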

      
