How to select leaf tags of an HTML document using jsoup

I am using jsoup to parse an HTML document. I need to extract all leaf div elements, that is, div tags that contain no nested div tags. I used the following Java code to extract them:

Elements bodyTag = document.select("div:not(div>div)"); 

      

Here's an example:

<div id="header">
     <div class="container">
         <div id="header-logo"> 
         <a href="/" title="mekay.com">
             <div id="logo">
             </div> </a>
        </div>
        <div id="header-banner">
            <div data-type="ad" data-publisher="lqm.j2ee.site" data-zone="ron">
            </div>
        </div>
     </div>
</div>

      

I only need to extract the following:

 <div id="logo">
 </div>
 <div data-type="ad" data-publisher="lqm.j2ee.site" data-zone="ron">
 </div>

      

Instead, the code snippet above returns all div tags. Could you please help me figure out what is wrong with this selector?

+3




2 answers


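This one works. :has(div) matches any element that contains a descendant div, so negating it keeps only the divs with no div nested inside them: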

Elements innerMostDivs = doc.select("div:not(:has(div))");
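
As a minimal sketch of the selector above applied to the markup from the question (the class name LeafDivs, the HTML constant, and the leafDivs helper are illustrative names, not part of jsoup; the jsoup library is assumed to be on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class LeafDivs {
    // Sample markup copied from the question.
    static final String HTML =
          "<div id=\"header\">"
        + "  <div class=\"container\">"
        + "    <div id=\"header-logo\">"
        + "      <a href=\"/\" title=\"mekay.com\">"
        + "        <div id=\"logo\"></div>"
        + "      </a>"
        + "    </div>"
        + "    <div id=\"header-banner\">"
        + "      <div data-type=\"ad\" data-publisher=\"lqm.j2ee.site\" data-zone=\"ron\"></div>"
        + "    </div>"
        + "  </div>"
        + "</div>";

    // Hypothetical helper: returns the divs that have no div among their descendants.
    static Elements leafDivs(Document doc) {
        return doc.select("div:not(:has(div))");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // Print each innermost div found in the sample markup.
        for (Element div : leafDivs(doc)) {
            System.out.println(div.outerHtml());
        }
    }
}
```

For the question's markup this should select only the div with id="logo" and the div with data-type="ad". The logo div is matched even though its direct parent is an a tag, because :has only looks at what is inside an element, not at what surrounds it.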

      



Try online

  • add your HTML file
  • add a CSS query such as div:not(:has(div))
  • check the listed items
+1




If you only want div tags that don't have any children, then use this:

Elements emptyDivs = document.select("div:empty");

      



The selector you are using now means "fetch me all the divs that are not direct children of another div". It is expected that it returns the outermost div, because div id="header" is not a direct child of a div. Most likely its parent is body.

0








