Extracting text inside HTML paragraph with BeautifulSoup in Python

Question

<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

This is the paragraph that I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside tags using .children and .string. But I cannot get the text "Several new Point of Sale malware ...", which sits directly inside the paragraph without any enclosing tag. I've tried soup.p.text, .get_text(), etc., but none of them worked.

python html web-scraping beautifulsoup

remis haroon · Dec 24 '14 at 5:28

1 answer

alecxe · Accepted Answer · 2014-12-24T05:38:42+0000

Use find_all() with text=True to find all text nodes and recursive=False to search only among the direct children of the parent p tag:

from bs4 import BeautifulSoup

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

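# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
# text=True matches only text nodes (newer bs4 releases also accept string=True);
# recursive=False restricts the search to the direct children of <p>.
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))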

This prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
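
Since the question mentions .children, the same result can probably also be reached by walking the direct children of the p tag and keeping only the bare text nodes. The following is a minimal sketch of that alternative, not part of the original answer: the helper name direct_text and the shortened sample markup are made up for illustration, and it assumes bs4 is installed with the built-in html.parser.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the paragraph from the question (illustrative only).
html = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
</p>
"""


def direct_text(tag):
    """Hypothetical helper: join the text nodes that are direct children of `tag`."""
    pieces = []
    for child in tag.children:
        # Bare text is a NavigableString; child tags (strong, em, br, a) are skipped,
        # so the text nested inside them is not included.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:  # drop whitespace-only nodes between tags
                pieces.append(stripped)
    return " ".join(pieces)


soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.p)
print(result)
```

Unlike the ''.join(...) version, this joins the pieces with a space, which may read better when a paragraph contains several untagged text runs.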
