How to select elements that contain only whitespace with Jsoup?

I am having trouble selecting elements that contain only whitespace.

Given the HTML: <html><body><p> </p></body></html>

Using :empty does not select the p element, which I assume is because it contains a text node.

However, :matchesOwn(^\\s+$) won't select it either, because Jsoup seems to trim() the text before testing it against the regex pattern.

:matchesOwn(^$) will select it, but it will also select elements that are not empty yet have no text nodes.

Maybe I'm missing something? The text tested by :matchesOwn shouldn't be trimmed at all: the selector takes a regex, so the whole text should be evaluated.
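
The trimming hypothesis above can be illustrated without Jsoup. The following is only a sketch built on java.util.regex: it does not reproduce Jsoup's internals, it simply assumes the text is trimmed before the pattern is applied, as suspected in the question. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class TrimBeforeMatchDemo {

    // Returns true if the regex is found in the text.
    // When trimFirst is set, the text is trimmed beforehand. This mimics the
    // ASSUMED behaviour described in the question, not confirmed Jsoup code.
    public static boolean found(String regex, String text, boolean trimFirst) {
        String candidate = trimFirst ? text.trim() : text;
        return Pattern.compile(regex).matcher(candidate).find();
    }

    public static void main(String[] args) {
        String ownText = " "; // the text inside <p> </p>

        // Untrimmed, the whitespace-only pattern matches.
        System.out.println(found("^\\s+$", ownText, false)); // true

        // Once trimmed, the text is empty, so \s+ has nothing left to match...
        System.out.println(found("^\\s+$", ownText, true));  // false

        // ...while ^$ matches the now-empty string,
        System.out.println(found("^$", ownText, true));      // true

        // and it equally matches text that was empty to begin with.
        System.out.println(found("^$", "", false));          // true
    }
}
```

Under that assumption, ^\s+$ can never succeed after trimming, and ^$ cannot tell a whitespace-only text apart from a missing one, which is consistent with the behaviour reported here.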


1 answer


CSS selectors can only match one kind of node: elements. They cannot find comments or text nodes, so we must rely on the Jsoup API to find elements that contain only whitespace.

We will look for nodes whose only child is a single text node. That text node must match the regex ^\s+$. To get the untrimmed text, we call the method TextNode#getWholeText.

Here's how to do it:



import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class WhitespaceOnlyElements {

    public static void main(String[] args) {
        String html = "<html><body><div><p> </p><p> </p><span>\n\t\n   </span></div><span></span></body></html>";
        Document doc = Jsoup.parse(html);

        // Reusable matcher; reset(...) swaps in the text to test
        final Matcher onlyWhitespaceMatcher = Pattern.compile("^\\s+$").matcher("");

        // Depth-first traversal of doc and all of its descendants
        doc.traverse(new NodeVisitor() {

            @Override
            public void head(Node node, int depth) {
                List<Node> childNodes = node.childNodes();
                // We're looking for nodes with exactly one child; otherwise we move on
                if (childNodes.size() != 1) {
                    return;
                }

                // This unique child node must be a TextNode
                Node uniqueChildNode = childNodes.get(0);
                if (!(uniqueChildNode instanceof TextNode)) {
                    return;
                }

                // This unique TextNode must be whitespace only
                String wholeText = ((TextNode) uniqueChildNode).getWholeText();
                if (onlyWhitespaceMatcher.reset(wholeText).matches()) {
                    System.out.println(node.nodeName());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        });
        // Instead of traversing the whole document,
        // we could narrow the search to its body with doc.body().traverse(...)
    }
}

      

OUTPUT

p
p
span

      
