Insert HTML string into BeautifulSoup object

I am trying to insert an HTML string into a BeautifulSoup object. If I insert the string directly, bs4 escapes the HTML. If I parse the string into a new soup and insert that, I run into problems using the find function. This thread on SO suggests that inserting BeautifulSoup objects can cause problems. I use the solution from that post and recreate the soup every time I insert.

But of course the best way would be to insert the HTML string into the soup directly.

EDIT: I'll add some code as an example of what the problem is

from bs4 import BeautifulSoup

mainSoup = BeautifulSoup("""
<html>
    <div class='first'></div>
    <div class='second'></div>
</html>
""")

extraSoup = BeautifulSoup('<span class="first-content"></span>')

tag = mainSoup.find(class_='first')
tag.insert(1, extraSoup)

print(mainSoup.find(class_='second'))
# prints None
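
For comparison, the first symptom mentioned above (a raw string being escaped) can be reproduced with a short sketch like the one below. It assumes the 'html.parser' backend and a cut-down document; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='first'></div>", "html.parser")

# Appending a plain str creates a text node, not a tag,
# so the angle brackets are escaped when the soup is serialized.
soup.div.append('<span class="first-content"></span>')

print(soup)               # the markup shows up as &lt;span ...&gt; text
print(soup.find("span"))  # None: no real <span> tag was created
```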

      



2 answers


The simplest way, if you already have an HTML string, is to parse it into another BeautifulSoup object and insert that.

from bs4 import BeautifulSoup

doc = '''
<div>
 test1
</div>
'''

soup = BeautifulSoup(doc, 'html.parser')

soup.div.append(BeautifulSoup('<div>insert1</div>', 'html.parser'))

print(soup.prettify())

      

Output:

<div>
 test1
<div>
 insert1
</div>
</div>

      

Update 1



How about this? The idea is to let BeautifulSoup parse the string and then insert only the resulting node (the span tag) rather than the whole BeautifulSoup object. This seems to fix the None problem.

import bs4
from bs4 import BeautifulSoup

mainSoup = BeautifulSoup("""
<html>
    <div class='first'></div>
    <div class='second'></div>
</html>
""", 'html.parser')

extraSoup = BeautifulSoup('<span class="first-content"></span>', 'html.parser')
tag = mainSoup.find(class_='first')
tag.insert(1, extraSoup.span)

print(mainSoup.find(class_='second'))

      

Output:

<div class="second"></div>
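
If the HTML string may contain more than one top-level element, the same idea can be wrapped in a small helper. This is only a sketch: insert_html is a hypothetical name, not part of bs4, and it assumes the 'html.parser' backend.

```python
from bs4 import BeautifulSoup


def insert_html(parent, html):
    """Parse html and append each top-level node to parent.

    Hypothetical helper for illustration; not a bs4 API.
    """
    fragment = BeautifulSoup(html, "html.parser")
    # Copy the list first: appending a node elsewhere removes it
    # from fragment.contents while we iterate.
    nodes = list(fragment.contents)
    for node in nodes:
        parent.append(node)
    return nodes


main_soup = BeautifulSoup(
    "<html><div class='first'></div><div class='second'></div></html>",
    "html.parser",
)
first = main_soup.find(class_="first")
insert_html(first, '<span class="first-content"></span><b>more</b>')

print(first)
print(main_soup.find(class_="second"))
```

Only the parsed tags are moved into the main tree, so the throwaway BeautifulSoup object never ends up inside it.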

      



The best way to do this is to create a new span tag and insert it into your mainSoup. The .new_tag method is meant for exactly this.



In [34]: from bs4 import BeautifulSoup

In [35]: mainSoup = BeautifulSoup("""
   ....: <html>
   ....:     <div class='first'></div>
   ....:     <div class='second'></div>
   ....: </html>
   ....: """)

In [36]: tag = mainSoup.new_tag('span')

In [37]: tag.attrs['class'] = 'first-content'

In [38]: mainSoup.insert(1, tag)

In [39]: print(mainSoup.find(class_='second'))
<div class="second"></div>
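
The session above inserts the new tag at the top level of mainSoup. If the span should end up inside the first div, as in the question, a variant along these lines might work. It is a sketch assuming the 'html.parser' backend; the "hello" text is made up for illustration.

```python
from bs4 import BeautifulSoup

main_soup = BeautifulSoup(
    "<html><div class='first'></div><div class='second'></div></html>",
    "html.parser",
)

# Build the tag through the soup that will own it.
span = main_soup.new_tag("span")
span["class"] = "first-content"
span.string = "hello"  # illustrative content, not from the question

# Append it to the target div instead of the document root.
first = main_soup.find(class_="first")
first.append(span)

print(first)
print(main_soup.find(class_="second"))
```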

      
