How to pull the <b> text inside an <a href> tag using BeautifulSoup?

I am trying to find a way to pull both some links and their associated text with Beautiful Soup. The HTML looks like this:

<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm">
                <b>Text I Want</b>
            </a>
        </font>
     </td>

<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm2">
                <b>Text I Want2</b>
            </a>
        </font>
     </td>

      

I can pull the link without issue:

soup.find_all('a', href=re.compile('link/I/Want'))

      

However, I would also like to get the text and associate it with its link, either by storing both together in one list, or by adding them to separate lists in the same order so that I can use the zip() function.

+3




3 answers


You can try this:

import re

# Pair each matching link's href with the text of its last nested <b> tag.
links = []
for link in soup.find_all('a', href=re.compile('link/I/Want')):
    links.append({"link": link["href"], "text": link.find_all("b")[-1].get_text(strip=True)})
print(links)

      



Outputs:

[{'link': 'link/I/Want.htm', 'text': 'Text I Want'}, {'link': 'link/I/Want.htm2', 'text': 'Text I Want2'}]
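
The question also mentions building separate lists and combining them with zip(). A minimal self-contained sketch of that variant follows, assuming the bs4 package with Python's built-in html.parser. The names anchors, hrefs, texts and pairs are illustrative and not from the original posts.

```python
import re
from bs4 import BeautifulSoup

# HTML from the question (the unclosed <tr> tags are kept as posted).
html = """
<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm">
                <b>Text I Want</b>
            </a>
        </font>
     </td>
<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="link/I/Want.htm2">
                <b>Text I Want2</b>
            </a>
        </font>
     </td>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", href=re.compile("link/I/Want"))

# Two parallel lists in document order, then paired up with zip().
hrefs = [a["href"] for a in anchors]
texts = [a.get_text(strip=True) for a in anchors]
pairs = list(zip(hrefs, texts))
print(pairs)
```

Because both lists are built from the same anchors result, they stay in the same order and zip() lines each link up with its own text.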

+5




Use a dict comprehension to get the data from the soup object. get_text() concatenates all the text nested inside a tag, so calling it on the <a> tag returns the text of the <b> tag within it.

links = soup.find_all('a', href=re.compile('link/I/Want'))
data = {link.get_text(strip=True): link['href'] for link in links}

      



Output:

{'Text I Want': 'link/I/Want.htm', 'Text I Want2': 'link/I/Want.htm2'}
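
One caveat with keying the dict on the link text: if two links happen to share the same text, the later one overwrites the earlier one. The sketch below uses made-up HTML (not the asker's) and assumes the bs4 package. It compares the dict with a list of tuples that keeps every pair.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML in which two different links carry the same text.
html = """
<a href="link/I/Want.htm"><b>Same Text</b></a>
<a href="link/I/Want.htm2"><b>Same Text</b></a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=re.compile("link/I/Want"))

# Dict keyed on text: duplicate keys collapse into one entry.
as_dict = {link.get_text(strip=True): link["href"] for link in links}
# List of (text, href) tuples: every link is kept, in document order.
as_pairs = [(link.get_text(strip=True), link["href"]) for link in links]

print(as_dict)
print(as_pairs)
```

If the link texts are known to be unique, the dict is fine. Otherwise a list of pairs, or a dict keyed on the href, may be the safer choice.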

      

0




Here, s.html is the HTML file.

We can pull out the href and the <b> text of every <a> tag like this:

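from bs4 import BeautifulSoup

# Read the HTML file and parse it.
with open('s.html') as fh:
    soup = BeautifulSoup(fh.read(), 'html.parser')

# Print each link's href followed by the text of the <b> tag nested inside it.
for tag in soup('a'):
    print(tag.get('href'), tag.find('b').get_text(strip=True))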

      

-2








