Removing text nodes and checking for text nodes between br tags in HTML: Jsoup

I am trying to parse an HTML string using jsoup:

<div class="test">
  <br>From: <b class="sendername">Divya</b> 
  <span dir="ltr">&lt;<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>&gt;</span>
  <br>Date: Wed, May 27, 2015 at 11:10 AM
  <br>Subject: Plan for the day 27/05/2015
  <br>To: Abhishek&lt;<a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;, 
    <a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>&gt;
  <br>Cc: Ram &lt;<a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>&gt;
  <br>
  <br>
  <br>
  <div dir="ltr">Hi,</div>
 </div>
      


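Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
// Assumed call: children() returns only element children, not text nodes.
Elements childElements = elem.get( 0 ).children();
int brCount = 0;
// Count the elements up to and including the third consecutive <br>.
for( Element childElement : childElements )
{
    totalElements++;
    if( childElement.tagName().equalsIgnoreCase( "br" ) )
    {
        brCount++;
        if( brCount == 3 )
            break;
    }
    else
        brCount = 0;
}
// Remove the counted elements; indices run from 0 to totalElements - 1.
for( int i = 0; i < totalElements; i++ )
{
    childElements.get( i ).remove();
}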

      

I want to remove all content up to and including three consecutive br tags, where "consecutive" means that no text node sits between them.
In the example above, that removes every node (elements and text nodes) before the inner div, so the output should be:

<div class="test">
  <div dir="ltr">Hi,</div>
 </div>
      


  • How can I check whether there is a text node between two br tags?
  • The code above removes only the HTML elements, not the text nodes. How can I remove the text nodes as well?

1 answer


The HTML structure seems to be constant, so you can use the following CSS selector to locate the div that follows three consecutive br elements:

div.test br + br + br + div
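
One caveat about this selector: the + combinator compares element siblings only, so text nodes between the br tags do not prevent a match. The sketch below illustrates this with two made-up fragments. The class name SiblingSelectorCheck and the helper countMatches are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SiblingSelectorCheck {

    // Counts how many elements the selector matches in the given fragment.
    static int countMatches(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("div.test br + br + br + div").size();
    }

    public static void main(String[] args) {
        // Three <br> elements separated by whitespace only.
        String clean = "<div class=\"test\"><br> <br> <br><div>Hi,</div></div>";
        // Three <br> elements separated by real text.
        String withText = "<div class=\"test\"><br>From: x<br>Date: y<br><div>Hi,</div></div>";

        System.out.println(countMatches(clean));    // prints 1
        System.out.println(countMatches(withText)); // prints 1
    }
}
```

Both fragments match, so the selector alone is only safe when the header layout is as regular as in the question.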

      

DEMO

http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus



SAMPLE CODE

String html = "<div class=\"test\">\n  <br>From: <b class=\"sendername\">Divya</b> \n  <span dir=\"ltr\">&lt;<a href=\"mailto:divya@abc.net\" target=\"_blank\">divya@abc.net</a>&gt;</span>\n  <br>Date: Wed, May 27, 2015 at 11:10 AM\n  <br>Subject: Plan for the day 27/05/2015\n  <br>To: Abhishek&lt;<a href=\"mailto:abhishek.sharma@abc.com\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>&gt;, \n    <a href=\"mailto:xyz@abc.com\" target=\"_blank\">xyz@abc.com</a>&gt;\n  <br>Cc: Ram &lt;<a href=\"mailto:Ram@abc.net\" target=\"_blank\">Ram@abc.net</a>&gt;\n  <br>\n  <br>\n  <br>\n  <div dir=\"ltr\">Hi,</div>\n </div>";

Document doc = Jsoup.parse(html);

Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
    throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);

Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);

      

OUTPUT

** BEFORE:
<html>
 <head></head>
 <body>
  <div class="test"> 
   <br>From: 
   <b class="sendername">Divya</b> 
   <span dir="ltr">&lt;<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>&gt;</span> 
   <br>Date: Wed, May 27, 2015 at 11:10 AM 
   <br>Subject: Plan for the day 27/05/2015 
   <br>To: Abhishek&lt;
   <a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;, 
   <a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>&gt; 
   <br>Cc: Ram &lt;
   <a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>&gt; 
   <br> 
   <br> 
   <br> 
   <div dir="ltr">
    Hi,
   </div> 
  </div>
 </body>
</html>

** AFTER:
<html>
 <head></head>
 <body>
  <div class="test">
   <div dir="ltr">
     Hi, 
   </div>
  </div>
 </body>
</html>
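
If the header has to be removed in place, and text between the br tags must break the run, an alternative is to walk childNodes() instead of the child elements, since childNodes() includes text nodes. The following is a minimal sketch under that assumption, not a tested drop-in. QuotedHeaderStripper, stripHeader, isBr and isBlankText are hypothetical names, and the sample HTML is a shortened version of the one in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class QuotedHeaderStripper {

    // True if the node is a <br> element.
    static boolean isBr(Node node) {
        return node instanceof Element
                && ((Element) node).tagName().equalsIgnoreCase("br");
    }

    // True if the node is a text node containing only whitespace.
    static boolean isBlankText(Node node) {
        return node instanceof TextNode && ((TextNode) node).isBlank();
    }

    // Removes every child node up to and including the first run of three
    // <br> elements that have nothing but whitespace between them.
    // Returns false and leaves the container untouched if no such run exists.
    static boolean stripHeader(Element container) {
        // Copy the list because the DOM is modified further down.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int brCount = 0;
        int end = -1;
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (isBr(node)) {
                brCount++;
                if (brCount == 3) {
                    end = i;
                    break;
                }
            } else if (!isBlankText(node)) {
                // Any other element, or text that is not blank, breaks the run.
                brCount = 0;
            }
        }
        if (end < 0) {
            return false;
        }
        for (int i = 0; i <= end; i++) {
            nodes.get(i).remove(); // removes elements and text nodes alike
        }
        return true;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\"><br>From: <b>Divya</b><br>Date: Wed"
                + "<br>\n<br>\n<br><div dir=\"ltr\">Hi,</div></div>";
        Document doc = Jsoup.parseBodyFragment(html);
        Element container = doc.select("div.test").first();
        stripHeader(container);
        System.out.println(container.outerHtml());
    }
}
```

The check for a text node between two br tags is the isBlankText test inside the loop: a TextNode sibling whose isBlank() returns false resets the counter.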

      
