Extracting text outside of <div> tag BeautifulSoup

Question

So, I was practicing my scraping and I came across something like this:

<div class="profileDetail">
    <div class="profileLabel">Mobile : </div>
     021 427 399 
</div>

and I need the number that sits outside the inner <div> tag.

My code:

num = soup.find("div",{"class":"profileLabel"}).text

but the output of this is "Mobile : ", which only contains the text inside the <div> tag, not the text outside of it.

So how do we extract the text outside of the <div> tag?

+3

python html html-parsing beautifulsoup

Zion 30 jul. 15 at 18:18

2 answers

Try using soup.find("div",{"class":"profileLabel"}).next_sibling. This grabs the next node, which can be either a bs4.Tag or a bs4.NavigableString. A bs4.NavigableString is what you are trying to get in this case.

elem = soup.find("div",{"class":"profileLabel"}).next_sibling
print(type(elem))

# Should print
# <class 'bs4.element.NavigableString'>

Example:

In [4]: s = bs4.BeautifulSoup('<div> Hello </div>HiThere<p>next_items</p>', 'html5lib')

In [5]: s
Out[5]: <html><head></head><body><div> Hello </div>HiThere<p>next_items</p></body></html>

In [6]: s.div
Out[6]: <div> Hello </div>

In [7]: s.div.next_sibling
Out[7]: u'HiThere'

In [8]: type(s.div.next_sibling)
Out[8]: bs4.element.NavigableString
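
Applied to the markup from the question, the same idea might look like the sketch below. It assumes Python 3 and the built-in "html.parser" backend (the session above uses html5lib); the variable names are only illustrative. Note that the sibling string carries the surrounding whitespace, so it needs a strip().

```python
from bs4 import BeautifulSoup, NavigableString

# Markup copied from the question
html = """
<div class="profileDetail">
    <div class="profileLabel">Mobile : </div>
     021 427 399 
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label <div>, then step to the node that follows it
label = soup.find("div", {"class": "profileLabel"})
raw = label.next_sibling

# The text after the label is a NavigableString padded with whitespace
is_string = isinstance(raw, NavigableString)
number = raw.strip()

print(is_string)
print(number)
```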

+1

Crispy 30 jul. 15 at 18:20

alecxe · Accepted Answer · 2015-07-30T18:25:08+0000

I would make a reusable function to get the value by label, by finding the label text and getting the following sibling:

import re

def find_by_label(soup, label):
    return soup.find("div", text=re.compile(label)).next_sibling

Usage:

find_by_label(soup, "Mobile").strip()  # returns "021 427 399"
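
For completeness, here is a slightly more defensive variant of the same idea, as a sketch rather than a drop-in replacement. It assumes Beautiful Soup 4.4.0 or later, where the search argument is named string (text is the older name for it), and it returns None when the label is missing or is not followed by plain text. The re.escape call and the None handling are additions for illustration, not part of the answer above.

```python
import re
from bs4 import BeautifulSoup, NavigableString

def find_by_label(soup, label):
    """Return the stripped text that follows the first <div> whose
    own text contains `label`, or None if there is no such text."""
    # string= is the newer name for the text= argument (bs4 >= 4.4.0)
    label_div = soup.find("div", string=re.compile(re.escape(label)))
    if label_div is None:
        return None
    sibling = label_div.next_sibling
    # Only plain text nodes count; a following tag or nothing gives None
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return None

# Markup copied from the question
html = """
<div class="profileDetail">
    <div class="profileLabel">Mobile : </div>
     021 427 399 
</div>
"""
soup = BeautifulSoup(html, "html.parser")

mobile = find_by_label(soup, "Mobile")
missing = find_by_label(soup, "Fax")  # "Fax" does not appear in the markup
print(mobile)
print(missing)
```

The outer profileDetail div is not matched by the regex search because it has more than one child, so its .string is None; only the inner label div qualifies.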
