How to select leaf tags of an HTML document using jsoup

Question

I am using jsoup to parse an HTML document. I need to extract all leaf div elements, that is, div tags that contain no nested div tags. I used the following Java code to select them:

Elements bodyTag = document.select("div:not(div>div)");

Here's an example:

<div id="header">
     <div class="container">
         <div id="header-logo"> 
         <a href="/" title="mekay.com">
             <div id="logo">
             </div> </a>
        </div>
        <div id="header-banner">
            <div data-type="ad" data-publisher="lqm.j2ee.site" data-zone="ron">
            </div>
        </div>
     </div>
</div>

I only need to extract the following:

 <div id="logo">
 </div>
 <div data-type="ad" data-publisher="lqm.j2ee.site" data-zone="ron">
 </div>

Instead, the snippet above returns all div tags. Could you please help me figure out what is wrong with this selector?

javascript html jsoup

mintra Dec 16 14 at 4:25 am

2 answers

If you only want div tags that don't have any children, then use this:

Elements emptyDivs = document.select("div:empty");

The selector you are using now means "fetch me all the divs that are not direct children of another div". It's expected that it returns the very first parent div, because div id="header" is not a direct child of a div. Most likely its parent is body.

alkis Dec 16 14 at 4:40 am

Burusothman · Accepted Answer · 2014-12-16T04:35:52+0000

This one works fine:

Elements innerMostDivs = doc.select("div:not(:has(div))");
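
As a rough illustration, the selector can be run against the sample markup from the question. This is only a sketch: the class name LeafDivs, the SAMPLE_HTML constant and the leafDivs helper are made up for the example, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LeafDivs {
    // Sample markup copied from the question (hypothetical constant name).
    static final String SAMPLE_HTML =
          "<div id=\"header\">"
        + "  <div class=\"container\">"
        + "    <div id=\"header-logo\">"
        + "      <a href=\"/\" title=\"mekay.com\"><div id=\"logo\"></div></a>"
        + "    </div>"
        + "    <div id=\"header-banner\">"
        + "      <div data-type=\"ad\" data-publisher=\"lqm.j2ee.site\" data-zone=\"ron\"></div>"
        + "    </div>"
        + "  </div>"
        + "</div>";

    // Hypothetical helper: returns every div that has no div among its descendants.
    static Elements leafDivs(String html) {
        Document doc = Jsoup.parse(html);
        // :has(div) matches elements with a div descendant; :not(...) inverts that.
        return doc.select("div:not(:has(div))");
    }

    public static void main(String[] args) {
        // Print each innermost div in document order.
        for (Element div : leafDivs(SAMPLE_HTML)) {
            System.out.println(div.outerHtml());
        }
    }
}
```

Because :has(div) looks at all descendants rather than only direct children, a div that wraps another div through an intermediate element (such as the a tag around div id="logo" in the sample) is excluded as well, so only the two innermost divs remain.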

Try it online:

1. Add your HTML file.
2. Add a CSS query such as div:not(:has(div)).
3. Check the listed items.
