How to pull the <b> text inside an <a href> tag using BeautifulSoup?
I am trying to find a way to pull both some links and their associated text with BeautifulSoup. The HTML looks like this:
<tr>
<td align="left" bgcolor="#ffff99">
<font size="2">
<a href="link/I/Want.htm">
<b>Text I Want</b>
</a>
</font>
</td>
<tr>
<td align="left" bgcolor="#ffff99">
<font size="2">
<a href="link/I/Want.htm2">
<b>Text I Want2</b>
</a>
</font>
</td>
I can pull the link without issue:
soup.find_all('a', href=re.compile('link/I/Want'))
However, I would also like to get the text and associate it with its link, either by keeping each pair together in one list or by appending links and texts to separate lists in the same order so that I can use the zip() function.
3 answers
You can try this:
links = []
for link in soup.find_all('a', href=re.compile('link/I/Want')):
    # Store the href together with the text of the last <b> inside the anchor
    links.append({"link": link["href"], "text": link.find_all("b")[-1].get_text(strip=True)})
print(links)
Output:
[{'link': 'link/I/Want.htm', 'text': 'Text I Want'}, {'link': 'link/I/Want.htm2', 'text': 'Text I Want2'}]
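If two parallel lists are preferred, as the question suggests, the same anchors can be split into hrefs and texts and paired again with zip(). This is only a minimal sketch: the markup is the question's sample with closing tags added, and the names html, anchors, hrefs, texts and pairs are chosen here for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup based on the question (closing tags added so it is well-formed).
html = """
<table>
<tr><td align="left" bgcolor="#ffff99"><font size="2">
<a href="link/I/Want.htm"><b>Text I Want</b></a>
</font></td></tr>
<tr><td align="left" bgcolor="#ffff99"><font size="2">
<a href="link/I/Want.htm2"><b>Text I Want2</b></a>
</font></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", href=re.compile("link/I/Want"))

# Both lists are built from the same anchors, so their order matches.
hrefs = [a["href"] for a in anchors]
texts = [a.get_text(strip=True) for a in anchors]

pairs = list(zip(hrefs, texts))
print(pairs)
```

Because both lists come from one find_all() result, each text stays aligned with its link.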
Use a dict comprehension to get the data from the soup object. get_text() concatenates all the text nested inside the tag, so there is no need to look up the <b> element separately.
links = soup.find_all('a', href=re.compile('link/I/Want'))
data = {link.get_text(strip=True): link['href'] for link in links}
Output:
{'Text I Want': 'link/I/Want.htm', 'Text I Want2': 'link/I/Want.htm2'}
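One caveat with keying the dict on the text: if two links happen to share the same text, the later one silently replaces the earlier one. The sketch below uses made-up markup (not from the question) to show the effect, and keys on the href instead as one possible workaround, assuming the hrefs are unique.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two different links that share the same text.
html = """
<a href="link/I/Want.htm"><b>Same Text</b></a>
<a href="link/I/Want.htm2"><b>Same Text</b></a>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=re.compile("link/I/Want"))

# Keyed on text: the second link overwrites the first.
by_text = {link.get_text(strip=True): link["href"] for link in links}

# Keyed on href: both entries survive as long as the hrefs differ.
by_href = {link["href"]: link.get_text(strip=True) for link in links}

print(len(by_text), len(by_href))
```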