Beautiful Soup: unable to use get_text() after using extract()
I am working on a web scraper and only want the text from any site, so I am using Beautiful Soup. Initially I found that get_text() was returning JavaScript code as well, so to avoid that I had to use the extract() method. But now I have a strange problem: after extracting the script and style tags, Beautiful Soup does not recognize the body, even though it is present in the new HTML.
Let me first explain what I was doing:
soup = BeautifulSoup(HTMLRawData, 'html.parser')
print(soup.body)
This printed all of the HTML. But when I do:
soup = BeautifulSoup(rawData, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()  # rip it out
print(soup.body)
Now it prints None, as if the element were missing. For debugging, I then called soup.prettify(), and it prints the whole HTML, including the body tag as well as the script and style tags. :( Now I am very confused about why this is happening: if body is present, why does it say None? Please help, thanks.
I am using Python 3 and bs4, and rawData is HTML extracted from a website.
Problem: Using this html example:
<html>
<style>just style</style>
<span>Main text.</span>
</html>
After extracting the style tag and calling get_text(), it returns only the text that should have been removed. This is due to the double newline left in the HTML after using extract(). Inspect soup.contents before and after .extract() and you will see the problem.
Before extract():
[<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>]
After extract():
[<html>\n\n<span>Main text.</span>\n</html>]
You can see the double newline between the html and the span. This appears to break get_text() for an unknown reason. To test this, remove the newlines from the example and it will work correctly.
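The before/after inspection described above can be reproduced with a short script. This is only a sketch: the variable names (raw, before, after, text) are chosen here for illustration, and what get_text() returns at the end depends on the installed bs4 version, so no expected output is shown.

```python
from bs4 import BeautifulSoup

# The example HTML from this answer, with its newlines kept.
raw = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"

soup = BeautifulSoup(raw, "html.parser")
before = repr(soup.contents)  # snapshot of the tree before extraction

# Remove every <script> and <style> tag from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

after = repr(soup.contents)   # snapshot after extraction
text = soup.get_text()        # result may differ between bs4 versions

print(before)
print(after)
print(repr(text))
```

Comparing the two snapshots shows the style tag gone and the two newline strings left next to each other.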
Solutions:
1.- Parse the soup again after calling extract().
BeautifulSoup(str(soup), 'html.parser')
2.- Use a different parser.
BeautifulSoup(raw, 'html5lib')
Note. Solution #2 doesn't work if you extract two or more adjacent tags, because you get a double newline again.
Note. You may need to install this parser. Just do:
pip install html5lib
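Solution 1 could be wrapped in a small helper along these lines. This is a minimal sketch, not a definitive fix: the function name visible_text and its parameter are hypothetical, and it assumes that re-parsing the serialized soup is enough to work around the problem, as described above.

```python
from bs4 import BeautifulSoup


def visible_text(raw_html):
    """Return the text of raw_html without <script> and <style> content.

    Hypothetical helper illustrating solution 1: extract the tags, then
    parse the serialized tree again before calling get_text().
    """
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    # Re-parse so the text nodes are rebuilt from the cleaned markup.
    clean = BeautifulSoup(str(soup), "html.parser")
    return clean.get_text()


example = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"
result = visible_text(example)
print(repr(result))
```

The extra parse costs time on large pages, so it is worth measuring before using it in a crawler.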
Can you include rawData content? If your rawData looks something like this:
<script>...</script>
<script>...</script>
<style>...</style>
then it makes sense: X.extract() removes that tag from the DOM tree.
Without all the content and all the code, it will be difficult to help you.
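To illustrate that scenario, here is a small sketch with made-up input. The contents of the script and style tags are placeholders invented for this example, not the asker's data. With html.parser, Beautiful Soup does not add a missing body tag, so soup.body is None for such a fragment even before anything is extracted.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only script and style tags, no <html> or <body>.
fragment = (
    "<script>var a = 1;</script>\n"
    "<script>var b = 2;</script>\n"
    "<style>p { color: red; }</style>"
)

soup = BeautifulSoup(fragment, "html.parser")
body_before = soup.body  # html.parser does not invent a <body> tag
print(body_before)

# Remove every <script> and <style> tag from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

leftover_tag = soup.find(["script", "style"])  # None once all are removed
remaining = soup.get_text().strip()
print(repr(remaining))
```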
This seems to be a bug in the latest release, 4.4.0. I had an almost identical problem: after decomposing (or extracting) a tag, I couldn't access the next tag.
Miguel Sánchez's first solution works, but it is very slow!
Going back to BeautifulSoup 4.3.2 solved the problem for me.
pip uninstall beautifulsoup4
pip install -Iv http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz
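Before downgrading, it may help to check which version is actually installed. The sketch below assumes, based only on this answer, that 4.4.0 is the affected release. The names installed and possibly_affected are illustrative.

```python
import bs4

# Version string of the installed beautifulsoup4 package.
installed = bs4.__version__

# Assumption: only the 4.4.0 release shows the problem, as suggested above.
possibly_affected = installed.startswith("4.4.0")

print(installed, possibly_affected)
```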