Retrieving information from websites
Not every site exposes its data cleanly through XML feeds, APIs, etc.
How can I extract information from a site that doesn't? For example:
...
<div>
<div>
<span id="important-data">information here</span>
</div>
</div>
...
I come from a background of Java programming and coding with Apache XMLBeans. Is there something similar for HTML parsing, where I know the structure and the data sits inside a known tag?
Thanks.
Here's an article that covers several screen-scraping tools written in Java.
In general, it sounds like you want to take a look at regular expressions that match the pattern you are looking for.
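As a rough sketch of the regular-expression route in Java, using only `java.util.regex`: the class name `ImportantDataRegex` and the method `extract` are made up for illustration, and the pattern assumes the markup looks exactly like the example in the question (a `span` whose only attribute is a double-quoted `id="important-data"`, with no nested tags inside it). Regular expressions tend to break when attributes are reordered or tags are nested, so treat this as a starting point rather than a robust parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class ImportantDataRegex {

    // Captures the text between <span id="important-data"> and the next </span>.
    // Assumption: id is the span's only attribute and is double-quoted.
    private static final Pattern SPAN = Pattern.compile(
            "<span\\s+id=\"important-data\"\\s*>(.*?)</span>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first matching span, if there is one.
    static Optional<String> extract(String html) {
        Matcher m = SPAN.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div>\n<div>\n"
                + "<span id=\"important-data\">information here</span>\n"
                + "</div>\n</div>";
        System.out.println(extract(html).orElse("(not found)"));
    }
}
```

Returning an `Optional` makes the "tag not found" case explicit instead of handing back `null` when the page layout changes.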
Hope it helps!
Java seems like a pretty awkward constraint for this kind of task. Is it a hard requirement? Scripting languages are ideal for writing what is really a lot of last-mile glue code.
If you're open to it, Ruby + Hpricot makes this completely trivial. You can use CSS or XPath selectors (or both) to find (and manipulate) content in HTML. Fetching a document, parsing it, and extracting the text in your example is literally one line of code.
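If Java does turn out to be a hard requirement, the same XPath-selector idea can be sketched with the JDK's built-in `javax.xml.xpath`. This is not Hpricot and not a real HTML parser: it assumes the page, or the fragment you feed it, is well-formed XML, which real-world HTML often is not. The class name `ImportantDataXPath` and the method `extract` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, named for illustration only.
class ImportantDataXPath {

    // Selects the span by its id attribute and returns its text content.
    // Assumption: the input is well-formed XML (for example, XHTML).
    static String extract(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(
                    "//span[@id='important-data']",
                    new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            // Raised for an invalid expression or input that cannot be parsed.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div>\n<div>\n"
                + "<span id=\"important-data\">information here</span>\n"
                + "</div>\n</div>";
        System.out.println(extract(xhtml));
    }
}
```

When no element matches, `evaluate` returns an empty string rather than `null`, so callers should check for that. For pages that are not well-formed, a lenient HTML parser would have to sit in front of this step.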