How to pull <b> text </b> under the href tag using BeautifulSoup?

Question

I am trying to find a way to pull both some links and their associated text with BeautifulSoup. The HTML looks like this:

<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm">
                <b>Text I Want</b>
            </a>
        </font>
     </td>

<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm2">
                <b>Text I Want2</b>
            </a>
        </font>
     </td>

I can pull the link without issue:

soup.find_all('a', href=re.compile('link/I/Want'))

However, I would also like to get the text and associate it with its link, either by storing it alongside the link in the same list, or by adding links and texts to separate lists in the same order so that I can use the zip() function.

python python-3.x web-scraping beautifulsoup

Chace mcguyer March 25 17 at 22:59

3 answers

Zroq · Answer 1 · 2017-03-25T23:07:10+0000

You can try this:

links = []
for link in soup.find_all('a', href=re.compile('link/I/Want')):
    links.append({"link" : link["href"],  "text": link.find_all("b")[-1].get_text(strip=True)})
print(links)

Output:

[{'link': 'link/I/Want.htm', 'text': 'Text I Want'}, {'link': 'link/I/Want.htm2', 'text': 'Text I Want2'}]
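
The question also mentions keeping links and texts in two lists of the same order and combining them with zip(). Here is a minimal sketch of that variant, assuming the HTML from the question is held in a string. The names html, hrefs, texts and pairs are illustrative, and the built-in "html.parser" is an assumed parser choice:

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; `html` is an illustrative name.
html = """
<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm">
                <b>Text I Want</b>
            </a>
        </font>
     </td>
<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm2">
                <b>Text I Want2</b>
            </a>
        </font>
     </td>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", href=re.compile("link/I/Want"))

# Two lists built from the same iteration order, so index i in one
# corresponds to index i in the other.
hrefs = [a["href"] for a in anchors]
texts = [a.b.get_text(strip=True) for a in anchors]

# zip() pairs them up element by element.
pairs = list(zip(hrefs, texts))
print(pairs)
```

Because both lists come from the same find_all() result, they stay aligned. Note that a.b assumes every matched <a> contains a <b>; if some do not, a.b is None and the call to get_text() raises AttributeError.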

宏杰李 · Answer 2 · 2017-03-27T06:14:45+0000

Use a dict comprehension to pull the data out of the soup object. get_text() concatenates all of the text nested inside a tag, so calling it on the <a> tag returns the text of the <b> inside it.

links = soup.find_all('a', href=re.compile('link/I/Want'))
data = {link.get_text(strip=True): link['href'] for link in links}

Output:

{'Text I Want': 'link/I/Want.htm', 'Text I Want2': 'link/I/Want.htm2'}
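
One caveat with a dict keyed by link text: if two links share the same text, the later entry overwrites the earlier one. The sketch below shows the difference using a list of tuples instead. The markup here is made up for illustration and is not from the question, and as_dict and as_pairs are hypothetical names:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two different links with identical visible text.
html = """
<a href="link/I/Want.htm"><b>Same Text</b></a>
<a href="link/I/Want.htm2"><b>Same Text</b></a>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=re.compile("link/I/Want"))

# Dict keyed by text: the second link replaces the first.
as_dict = {link.get_text(strip=True): link["href"] for link in links}

# List of (text, href) tuples: both links are kept, in document order.
as_pairs = [(link.get_text(strip=True), link["href"]) for link in links]

print(len(as_dict), len(as_pairs))
```

If the link texts are known to be unique, the dict is fine and gives direct lookup by text; otherwise the list of tuples is the safer structure.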

ayush mathur · Answer 3 · 2017-03-25T23:12:13+0000

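Here s.html is the saved HTML file.

We can pull out the href and the text content of every <a> tag like this:

from bs4 import BeautifulSoup

# Read the saved page
with open('s.html') as fh:
    html = fh.read()

soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')  # shorthand for soup.find_all('a')

for tag in tags:
    # Look up the <b> inside this particular <a>, not the first one on the page
    b = tag.find('b')
    print(tag.get('href'), b.get_text(strip=True) if b else None)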
