Make BeautifulSoup handle line breaks the way a browser does

I am using BeautifulSoup (version 4.3.2 with Python 3.4) to convert HTML documents to text. The problem I'm running into is that web pages sometimes contain "\n" newline characters that a browser would not render as line breaks, but when BeautifulSoup converts the page to text, it keeps the "\n".

Example:

Your browser probably renders the following on one line (even though there is a newline in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I entered it without any newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it produces are the literal newlines in the source, and it always keeps those:

from bs4 import BeautifulSoup

doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[181]: 'This is a\nparagraph.'

doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'

      

Does anyone know how to make BeautifulSoup extract text in a prettier way (or really just get all newlines right)? Are there any other easy ways to solve the problem?


2 answers


get_text might be helpful here:



>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
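
Note that separator="\n" only addresses the second case: a literal newline inside a single paragraph is still kept. As one of the "other easy ways" the question asks about, here is a minimal sketch that approximates the browser's behaviour using only the standard library's html.parser rather than BeautifulSoup. The names BLOCK_TAGS, BrowserLikeText and html_to_text are made up for this illustration, and the tag list is an assumption, not a complete list of block-level elements.

```python
import re
from html.parser import HTMLParser

# Assumption: a hand-picked subset of block-level tags; extend as needed.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class BrowserLikeText(HTMLParser):
    """Collect text, breaking lines only at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Runs of source whitespace (including "\n") become a single space.
        self.parts.append(re.sub(r"[ \t\r\n\f]+", " ", data))


def html_to_text(html):
    parser = BrowserLikeText()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Squeeze repeated spaces, trim each line, and drop empty lines.
    cleaned = (re.sub(r" {2,}", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)


print(html_to_text("<p>This is a\nparagraph.</p>"))
print(html_to_text("<p>This is a paragraph.</p><p>This is another paragraph.</p>"))
```

With the two documents from the question, the first call prints the paragraph on one line and the second prints the two paragraphs on separate lines. This sketch does not special-case pre, script or style content, so those would need extra handling on real pages.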

      


I would take a look at python-markdownify. It converts HTML into fairly readable Markdown-formatted text.

It's available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0

and on GitHub: https://github.com/matthewwithanm/python-markdownify
