How to select elements that contain only whitespace with Jsoup?
I have a problem selecting elements that contain only whitespace.
Given the html: <html><body><p> </p></body></html>
Using :empty does not select the p, which I assume is because it contains a text node.
However, :matchesOwn(^\\s+$)
won't select it either, because Jsoup seems to trim() the text before testing it against the regex pattern.
:matchesOwn(^$)
will select it, but it will also select elements that have no text nodes yet are not empty.
Maybe I'm missing something?
:matchesOwn
shouldn't trim the text at all, since it uses a regex and the whole text should be evaluated.
CSS selectors can only match one specific type of node: elements. Selectors cannot find comments or text nodes. We must rely on the Jsoup API to find elements that contain only whitespace.
We will look for nodes whose only child is a single text node. That text node must match the regex ^\s+$. To get the untrimmed text, we call the method TextNode#getWholeText.
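As a quick illustration of why the untrimmed text matters, here is a minimal sketch that uses only java.util.regex and no Jsoup at all. The class WhitespaceOnlyCheck and the helper isWhitespaceOnly are hypothetical names made up for this example, and the trim() call only stands in for the trimming that the question suspects happens inside :matchesOwn.

```java
import java.util.regex.Pattern;

class WhitespaceOnlyCheck {
    // Same pattern as in the answer: one or more whitespace characters and nothing else.
    private static final Pattern ONLY_WHITESPACE = Pattern.compile("^\\s+$");

    // Hypothetical helper (not part of Jsoup): true if the text is non-empty and all whitespace.
    static boolean isWhitespaceOnly(String text) {
        return ONLY_WHITESPACE.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Stands in for the raw text a whitespace-only text node would hold.
        String raw = " \n\t ";
        // Stands in for the same text after it has been trimmed (assumption from the question).
        String trimmed = raw.trim();

        System.out.println(isWhitespaceOnly(raw));      // true
        System.out.println(isWhitespaceOnly(trimmed));  // false, because trimmed is ""
        System.out.println(trimmed.isEmpty());          // true, which is why ^$ matches instead
    }
}
```

Once the text has been trimmed, a whitespace-only node and a node with no text at all both look like the empty string, so the regex has to be applied to the untrimmed text.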
Here's how to do it:
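import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

String html = "<html><body><div><p> </p><p> </p><span>\n\t\n </span></div><span></span></body></html>";
Document doc = Jsoup.parse(html);
// One reusable matcher; reset() feeds it new input for each candidate node
final Matcher onlyWhitespaceMatcher = Pattern.compile("^\\s+$").matcher("");
new NodeTraversor(new NodeVisitor() {
    @Override
    public void head(Node node, int depth) {
        List<Node> childNodes = node.childNodes();
        // We're looking for nodes with exactly one child; otherwise we move on
        if (childNodes.size() != 1) {
            return;
        }
        // This only child must be a TextNode
        Node uniqueChildNode = childNodes.get(0);
        if (!(uniqueChildNode instanceof TextNode)) {
            return;
        }
        // This TextNode must contain whitespace only (tested on the untrimmed text)
        if (onlyWhitespaceMatcher.reset(((TextNode) uniqueChildNode).getWholeText()).matches()) {
            System.out.println(node.nodeName());
        }
    }

    @Override
    public void tail(Node node, int depth) {
        // Nothing to do when leaving a node
    }
}).traverse(doc);
// Instead of traversing the whole document,
// we could narrow the search down to its body only with doc.body().
// Note: newer Jsoup releases replace this constructor with the static
// NodeTraversor.traverse(visitor, root); doc.traverse(visitor) is an alternative.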
OUTPUT
p
p
span