How to use a pattern to get a value between two known strings

Let me first explain where I'm coming from. I have a string containing the HTML of a web page, which I fetched using jsoup. The HTML is all on one line, and I can print it to a text file. I am trying to get the songs out of this code, and every song has the same "tags".

This is a snippet from the text file I printed it to:

          <div class="title" itemprop="name">
           Wrath
          </div> </td> 

      

It looks like a single line in Notepad, but when you copy and paste it, it looks like this. What I want is the Wrath in the middle, so I tried to create a pattern to find it, using help from this other Stack Overflow post: Java regex to extract text between tags

This is the relevant part of my code:

    Pattern p = Pattern.compile( "<div class=\"title\" itemprop=\"name\">(.+?)</div> </td>");
    Matcher m = p.matcher( html );
    while( m.find()) {
       quote.add( m.group( 1 ));
    }

      

When it runs, the ArrayList quote turns out to be empty. It may not be working because of the whitespace between the tags and the text. Any ideas?



2 answers


You can use jsoup to parse the HTML document as well as load it:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String site = "http://example.com/";
Document doc = Jsoup.connect(site).get();
String text = doc.select("div.title").first().text();

      


Or just use XPath if that doesn't work. Regular expressions are great for collecting data from unstructured text. However, if you have a structured document such as HTML, you can leave all the heavy lifting to a purpose-built parser. Java comes with the javax.xml.xpath package, with which you can search the node tree of your document.



Let's say your document looks like this:

<html>
  <body>
    <div class="title">Wrath</div>
  </body>
</html>

      

You can do this to find the text in that div:

import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "/html/body/div[@class='title']/text()";
InputSource inputSource = new InputSource("myDocument.html");
NodeList nodes = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
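
To actually read the matched text you still have to walk the returned NodeList. Here is a minimal sketch of how that might look, assuming the markup is well-formed XML (real-world HTML often is not, in which case jsoup is the more forgiving choice). The class name XPathTitleSketch, the method name titles, and the relaxed expression //div[@class='title'] are made up for illustration and are not part of the original snippet:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathTitleSketch {

    // Returns the trimmed text of every <div class="title"> in the document.
    // Assumes the input is well-formed XML.
    static List<String> titles(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "//div" matches the div at any depth, not only directly under body.
        String expression = "//div[@class='title']/text()";
        InputSource source = new InputSource(new StringReader(xml));
        try {
            NodeList nodes = (NodeList) xpath.evaluate(expression, source, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Each node is a text node; trim the surrounding whitespace.
                result.add(nodes.item(i).getNodeValue().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><div class=\"title\">Wrath</div></body></html>";
        System.out.println(titles(xml)); // prints [Wrath]
    }
}
```

Reading from a StringReader instead of a file name fits the situation in the question, where the HTML is already held in a string.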

      



If it is parsed the way Perl does it, you might need to double the backslashes:

Pattern p = Pattern.compile("<div class=\"title\" itemprop=\"name\">(.*?)<\\/div>");

      

would then be:



Pattern p = Pattern.compile("<div class=\"title\" itemprop=\"name\">(.*?)<\\\\/div>");

      

But regex is the wrong tool for this kind of task.
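
If a regex is used anyway, here is a hedged sketch of one way it might be written for the snippet in the question. In Java the forward slash has no special meaning in a pattern, so </div> can be written without any backslash. The more likely cause of the empty list is that . does not match line terminators by default, while the title sits on its own line between the tags. The sketch assumes the attributes always appear in exactly this order. The class name TitleRegexSketch and the method name extractTitles are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TitleRegexSketch {

    // DOTALL lets '.' match line terminators as well;
    // the \s* on both sides keeps surrounding whitespace out of group 1.
    private static final Pattern TITLE = Pattern.compile(
            "<div class=\"title\" itemprop=\"name\">\\s*(.+?)\\s*</div>",
            Pattern.DOTALL);

    // Collects the text of every matching title div, in document order.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String html = "<td> <div class=\"title\" itemprop=\"name\">\n"
                + "           Wrath\n"
                + "          </div> </td>";
        System.out.println(extractTitles(html)); // prints [Wrath]
    }
}
```

The pattern stops at </div> rather than </div> </td>, so it does not depend on the exact spacing between the closing tags.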
