Retrieving information from websites

Question

Not every site exposes its data cleanly through XML feeds, APIs, etc.

How can I extract information from such a site? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a Java programming background and have worked with Apache XMLBeans. Is there something similar for parsing HTML, where I know the structure and the data sits inside a known tag?

Thanks

0

java html html-content-extraction

Mark sailes 25 Nov '08 at 19:23

3 answers

Here's an article that lists several screen scraping tools written in Java.

In general, it sounds like you want to take a look at regular expressions that match the pattern you are looking for.
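As a rough illustration of the regular-expression approach, here is a minimal sketch that pulls the text out of the `span` from the question. The class name `SpanRegexExtractor` and the pattern are assumptions made for this example only; a pattern like this works for simple, predictable markup and will break on nested spans or unusual attribute quoting.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class SpanRegexExtractor {
    // Matches <span ... id="important-data" ...>TEXT</span>; group 1 captures TEXT.
    private static final Pattern IMPORTANT_DATA = Pattern.compile(
            "<span\\b[^>]*\\bid\\s*=\\s*\"important-data\"[^>]*>(.*?)</span>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first matching span, if there is one.
    static Optional<String> extract(String html) {
        Matcher m = IMPORTANT_DATA.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div>\n  <div>\n"
                + "    <span id=\"important-data\">information here</span>\n"
                + "  </div>\n</div>";
        System.out.println(extract(html).orElse("(not found)"));
    }
}
```

Returning an `Optional` keeps the "tag not found" case explicit instead of handing back `null`.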

Hope it helps!

+4

Zachary yates 25 Nov '08 at 19:26

Java seems like a pretty restrictive constraint for a task like this. Is it a hard requirement? Scripting languages are ideal for writing what is really a lot of last-mile code.

If you're open to it, Ruby + Hpricot makes this completely trivial. You can use CSS or XPath selectors (or both) to find (and manipulate) content in HTML. Grabbing a document, parsing it, and extracting the text in your example is literally one line of code.

0

Dustin 25 Nov '08 at 19:45

James van huis · Accepted Answer · 2008-11-25T19:26:49+0000

There are several open source HTML parsers for Java.

I've used JTidy in the past and had good luck with it. It gives you the DOM of the HTML page, and you can grab the tags you need.
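To show what "grabbing the tags you need" from a DOM can look like, here is a minimal sketch that uses only the JDK's built-in XML and XPath APIs, not JTidy's own classes. It assumes the page has already been cleaned into well-formed markup (the job a tool like JTidy does); the class name `DomExtractor` is hypothetical. Real-world HTML will usually fail to parse here without that cleanup step.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes well-formed (tidied) markup.
class DomExtractor {
    // Returns the text of the span with id "important-data", or "" if absent.
    static String extract(String wellFormedHtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new InputSource(new StringReader(wellFormedHtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) yields the string value of the first match.
            return xpath.evaluate("//span[@id='important-data']", dom);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query markup", e);
        }
    }

    public static void main(String[] args) {
        String html = "<div><div>"
                + "<span id=\"important-data\">information here</span>"
                + "</div></div>";
        System.out.println(extract(html));
    }
}
```

Selecting by `id` with XPath means the lookup keeps working if the surrounding `div` nesting changes, as long as the `span` keeps its identifier.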
