Parsing HTML to Formatted Plaintext using jsoup

Question

I am working on a Maven project that parses HTML data from a website. I was able to parse it using the following code:

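import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public void parseData() {
    String url = "http://stackoverflow.com/help/on-topic";
    try {
        // Fetch and parse the page.
        Document doc = Jsoup.connect(url).get();
        // first() returns null when nothing matches the selector.
        Element essay = doc.select("div.col-section").first();
        if (essay != null) {
            // text() collapses all descendant text into a single string.
            jTextAreaAdem.setText(essay.text());
        }
    } catch (IOException ex) {
        Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
    }
}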

So far this works: I can parse the HTML. I use jsoup's select method with "div.col-section", which matches a div element with class col-section, and I want to show the result in a text area. The result is one huge paragraph, although the content on the website spans several paragraphs. How can I keep the text formatted the way it is on the website?

java maven jsoup

GoGo 13 oct. 14 at 18:38

1 answer

Jonathan Hedley · Accepted Answer · 2014-10-13T22:52:33+0000

The text is not formatted because the formatting lives in the HTML itself, in tags such as <p> and <ol>. Calling .text() on a block element loses this formatting.

jsoup ships an example HTML-to-plain-text converter (HtmlToPlainText, in its examples package) that you can tailor to your needs, for instance by passing it the div element as the focus.
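A converter of that kind can be sketched as a node visitor that emits text nodes as-is and adds a line break when a block-level element closes. This is not jsoup's bundled example, only a simplified illustration of the idea; the class name BlockAwareText, the method render, and the chosen set of block tags are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class BlockAwareText {
    // Hypothetical choice of tags that should end a line; extend as needed.
    private static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("p", "br", "li", "h1", "h2", "h3"));

    // Walks the subtree under root and builds plain text with line breaks.
    static String render(Element root) {
        final StringBuilder sb = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Text nodes carry the visible text of the page.
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Closing a block-level element ends the current line.
                if (BLOCK_TAGS.contains(node.nodeName())) {
                    sb.append('\n');
                }
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"col-section\"><p>One</p><ol><li>a</li><li>b</li></ol></div>";
        Element focus = Jsoup.parse(html).select("div.col-section").first();
        System.out.print(render(focus));
    }
}
```

In the question's code, the result of render(essay) could be passed to jTextAreaAdem.setText instead of essay.text().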

Alternatively, you can simply select "div.col-section > *", iterate over each element, and print its text followed by a new line.
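A minimal sketch of this second approach, using an inline HTML string instead of a live page so it runs without network access. The class name ChildTextExample and the helper toPlainText are made up for illustration; only Jsoup.parse, select, and text are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ChildTextExample {
    // Joins the text of each direct child of div.col-section with a newline.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        for (Element child : doc.select("div.col-section > *")) {
            sb.append(child.text()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"col-section\"><p>First paragraph.</p>"
                + "<p>Second paragraph.</p></div>";
        // Prints each paragraph on its own line.
        System.out.print(toPlainText(html));
    }
}
```

Note that the selector only matches direct children, so a nested block such as an <ol> still comes back as a single line containing all of its items. Selecting deeper elements (for example the li elements) would be needed to split those as well.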
