Parsing HTML to Formatted Plaintext using jsoup
I am working on a Maven project that parses HTML data from a website. I was able to parse it using the following code:
public void parseData() {
    String url = "http://stackoverflow.com/help/on-topic";
    try {
        Document doc = Jsoup.connect(url).get();
        Element essay = doc.select("div.col-section").first();
        String essayText = essay.text();
        jTextAreaAdem.setText(essayText);
    } catch (IOException ex) {
        Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
    }
}
So far this works and I can parse the HTML. I use jsoup's select method with "div.col-section", which means I am looking for a div element with the class col-section, and I print the result in a text area. The problem is that the result is one huge paragraph, although the content on the website is split into several paragraphs. How can I get the text with the same paragraph structure as on the website?
The text is not formatted because the formatting lives in the HTML itself, in tags such as <p> and <ol>. Calling .text() on a block element discards those tags and returns the content as a single run of text.
jsoup ships an example HTML-to-plain-text converter (HtmlToPlainText, in the org.jsoup.examples package) that you can tailor to your needs, for example by passing it the div element you want to focus on.
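A rough sketch of the converter approach, assuming the example class org.jsoup.examples.HtmlToPlainText and its getPlainText(Element) method are available in the jsoup jar on your classpath. The class name PlainTextDemo, the helper toPlainText, and the inline HTML are made up for illustration; in the question's code the Document would come from Jsoup.connect(url).get() instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.examples.HtmlToPlainText;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PlainTextDemo {

    // Converts the first element matching the selector to lightly formatted
    // plain text. Returns an empty string if nothing matches.
    static String toPlainText(Document doc, String selector) {
        Element focus = doc.select(selector).first();
        if (focus == null) {
            return "";
        }
        // HtmlToPlainText is jsoup's bundled example converter (assumed present).
        return new HtmlToPlainText().getPlainText(focus);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the downloaded page.
        String html = "<div class=\"col-section\">"
                + "<p>First paragraph.</p>"
                + "<p>Second paragraph.</p>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        System.out.println(toPlainText(doc, "div.col-section"));
    }
}
```

The returned string could then be passed to jTextAreaAdem.setText(...) in place of essay.text().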
Alternatively, you can simply select "div.col-section > *", iterate over each element, and print its text followed by a new line.
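The iteration idea could look roughly like this. It is only a sketch: the class name ChildLines, the helper childrenToLines, and the inline HTML are invented for the example, and skipping empty children is my own choice rather than something the answer prescribes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ChildLines {

    // Collects the text of every direct child of the elements matching
    // containerSelector, writing one child per line.
    static String childrenToLines(Document doc, String containerSelector) {
        StringBuilder sb = new StringBuilder();
        // "> *" limits the match to direct children of the container.
        for (Element child : doc.select(containerSelector + " > *")) {
            String text = child.text();
            if (!text.isEmpty()) { // skip children with no visible text
                sb.append(text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the downloaded page.
        String html = "<div class=\"col-section\">"
                + "<p>First paragraph.</p>"
                + "<p>Second paragraph.</p>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        System.out.print(childrenToLines(doc, "div.col-section"));
    }
}
```

Because each child is still flattened with .text(), a nested structure such as an <ol> ends up on a single line; selecting the li elements as well would be one way to split those further.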