Removing text nodes and checking for alternative text nodes in html: Jsoup
I am trying to parse an HTML string using jsoup:
<div class="test">
<br>From: <b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<<a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>>
<br>Cc: Ram <<a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>>
<br>
<br>
<br>
<div dir="ltr">Hi,</div>
</div>
Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
Elements childElements = elem.get( 0 ).;
int brCount = 0;
for( Element childElement : childElements )
{
    totalElements++;
    if( childElement.tagName().equalsIgnoreCase( "br" ) )
    {
        brCount++;
        if( brCount == 3 )
            break;
    }
    else
        brCount = 0;
}
for( int i = 1; i <= totalElements; i++ )
{
    childElements.get( i ).remove();
}
I want to get rid of all content up to three consecutive br tags, where there must be no text node between those br tags.
That is, in the above case, it should remove everything before that point, both HTML elements and text nodes.
- How can I check whether there is a text node between two br tags?
- The above code removes only the HTML elements; the text nodes are left behind. How can I remove them as well?
1 answer
The HTML structure seems to be constant, so you can try the following CSS selector:
div.test br + br + br + div
DEMO
http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus
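One caveat about this selector: the adjacent sibling combinator (+) compares element siblings only, so text sitting between two br tags does not stop a match. The sketch below illustrates that with three small fragments. The class name SiblingSelectorDemo, the helper countMatches, and the fragments themselves are made up for illustration and are not part of the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SiblingSelectorDemo {

    /** Counts how many elements in the fragment match the selector from the answer. */
    static int countMatches(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.select("div.test br + br + br + div").size();
    }

    public static void main(String[] args) {
        // Three br elements with nothing in between, followed by a div.
        String plain = "<div class=\"test\"><br><br><br><div>Hi</div></div>";
        // Same elements, but with text nodes between the br elements.
        String withText = "<div class=\"test\"><br>x<br>y<br>z<div>Hi</div></div>";
        // An element (b) interrupts the run of br elements.
        String withElement = "<div class=\"test\"><br><br><b>x</b><br><div>Hi</div></div>";

        System.out.println(countMatches(plain));       // 1
        System.out.println(countMatches(withText));    // 1, text nodes are ignored by +
        System.out.println(countMatches(withElement)); // 0, the b element breaks the chain
    }
}
```

If the br tags must have no text between them, the selector alone cannot enforce that; the sibling nodes have to be inspected directly.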
SAMPLE CODE
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String html = "<div class=\"test\">\n <br>From: <b class=\"sendername\">Divya</b> \n <span dir=\"ltr\"><<a href=\"mailto:divya@abc.net\" target=\"_blank\">divya@abc.net</a>></span>\n <br>Date: Wed, May 27, 2015 at 11:10 AM\n <br>Subject: Plan for the day 27/05/2015\n <br>To: Abhishek<<a href=\"mailto:abhishek.sharma@abc.com\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>>, \n <a href=\"mailto:xyz@abc.com\" target=\"_blank\">xyz@abc.com</a>>\n <br>Cc: Ram <<a href=\"mailto:Ram@abc.net\" target=\"_blank\">Ram@abc.net</a>>\n <br>\n <br>\n <br>\n <div dir=\"ltr\">Hi,</div>\n </div>";
Document doc = Jsoup.parse(html);
// Select the div that directly follows three sibling br elements inside div.test.
Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
    throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);
// Build a fresh div.test that contains only the mail body...
Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
// ...and swap it in for the original div.test.
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);
OUTPUT
** BEFORE:
<html>
<head></head>
<body>
<div class="test">
<br>From:
<b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<
<a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>>
<br>Cc: Ram <
<a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>>
<br>
<br>
<br>
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
** AFTER:
<html>
<head></head>
<body>
<div class="test">
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
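If the structure is not guaranteed to be constant, an alternative is to walk the child nodes (not just the child elements) of div.test. This is only a sketch under a few assumptions: the class name LeadingContentTrimmer and its helper methods are hypothetical, whitespace-only text between br tags is treated as "no text", and the three br tags themselves are removed too, as in the output above. Element.childNodes() returns text nodes as well as elements, which is what lets the code both detect text between br tags and remove it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class LeadingContentTrimmer {

    /** True if the node is a br element. */
    static boolean isBr(Node node) {
        return node instanceof Element && ((Element) node).tagName().equals("br");
    }

    /** True if the node is a text node containing only whitespace. */
    static boolean isBlankText(Node node) {
        return node instanceof TextNode && ((TextNode) node).isBlank();
    }

    /**
     * Removes every child node of the container up to and including the first run
     * of three br elements that are separated by nothing but whitespace.
     * Returns false and leaves the container untouched if there is no such run.
     */
    static boolean trimThroughBrRun(Element container) {
        // Copy the list: childNodes() is a read-only view of the live children.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int brRun = 0;
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (isBr(node)) {
                brRun++;
                if (brRun == 3) {
                    // Remove elements and text nodes alike, from the start through the third br.
                    for (int j = 0; j <= i; j++) {
                        nodes.get(j).remove();
                    }
                    return true;
                }
            } else if (!isBlankText(node)) {
                // Non-blank text or any other node breaks the run.
                brRun = 0;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\"><br>From: <b>Divya</b><br>Date: Wed"
                + "<br>\n<br>\n<br>\n<div dir=\"ltr\">Hi,</div></div>";
        Document doc = Jsoup.parse(html);
        Element container = doc.select("div.test").first();
        if (container != null && trimThroughBrRun(container)) {
            System.out.println(container.outerHtml());
        }
    }
}
```

To check only whether text sits between two particular br tags, the same isBlankText test can be applied to the nodes reached by calling nextSibling() on the first br until the second one is found.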