Retrieving information from websites

Not every site exposes its data in a convenient form such as XML feeds or APIs.

How can I extract the information from such a site? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

      

I come from a background of Java programming and coding with Apache XMLBeans. Is there something similar for HTML parsing, where I know the structure and the data sits between known tags?

Thanks.



3 answers


There are several open source HTML parsers for Java.



I've used JTidy in the past and had good luck with it. It will give you a DOM of the HTML page, and you can grab the tags you need.
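
Roughly, a minimal sketch (untested; it assumes the JTidy jar is on the classpath, and the html string stands in for whatever page you actually fetched):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyExample {
    public static void main(String[] args) {
        // Stand-in for the page you downloaded, using the markup from the question.
        String html = "<html><body><div><div>"
                + "<span id=\"important-data\">information here</span>"
                + "</div></div></body></html>";

        Tidy tidy = new Tidy();
        tidy.setQuiet(true);           // don't spam stderr about messy markup
        tidy.setShowWarnings(false);

        // JTidy cleans up the HTML and hands back a standard W3C DOM.
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);

        // Walk the span elements and pick out the one whose id we know.
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if ("important-data".equals(span.getAttribute("id"))) {
                System.out.println(span.getFirstChild().getNodeValue());
            }
        }
    }
}

From there it is the same kind of tree navigation you are used to with XMLBeans, just against org.w3c.dom types.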



Here's an article that covers several screen scraping tools written in Java.

In general, it sounds like you want to take a look at regular expressions that match the pattern you're looking for.
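
For a snippet as predictable as the one in the question, something like this would do (a rough sketch with java.util.regex; the id and markup come from the question, and a real parser is the safer choice once the markup starts to vary):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrapeExample {
    public static void main(String[] args) {
        // Stand-in for the fetched page.
        String html = "<div><div><span id=\"important-data\">information here</span></div></div>";

        // Non-greedy match for the text inside the span whose id we already know.
        Pattern p = Pattern.compile("<span id=\"important-data\">(.*?)</span>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints "information here"
        }
    }
}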



Hope it helps!



Java seems like a fairly awkward constraint for a task like this. Is it a hard requirement? Scripting languages are ideal for what is really last-mile code.

If you're open to it, Ruby + Hpricot makes this completely trivial. You can use CSS or XPath selectors (or both) to find (and manipulate) content in the HTML. Fetching a document, parsing it, and extracting the text in your example is literally one line of code.
