How can I extract addresses and phone numbers from HTML?

Is there a library that specializes in analyzing such data?

+2




3 answers


You can use something like Google Maps. Geocode the address and, if successful, the Google API will return an XML representation of the address with all of its components identified (and corrected or completed).

EDIT:

I was voted down and don't know why. Parsing addresses can be a little tricky. Here's an example that uses Google to do this:



http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin

I am not saying that this is the only way, or necessarily the best way. It is just one way to parse addresses on a website.
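As a rough illustration of what "an XML representation of the address" gives you, here is a minimal Python sketch that pulls the labeled components out of a geocoding response. The sample XML is hand-written for this example and only modeled on the `GeocodeResponse` / `address_component` layout of Google's XML output; the address is made up, the helper name `address_parts` is hypothetical, and the exact response shape should be checked against the current API documentation. No network call is made.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not a real API response).
SAMPLE_RESPONSE = """<GeocodeResponse>
  <status>OK</status>
  <result>
    <formatted_address>123 Main St, Springfield, IL 62704, USA</formatted_address>
    <address_component>
      <long_name>123</long_name>
      <short_name>123</short_name>
      <type>street_number</type>
    </address_component>
    <address_component>
      <long_name>Main Street</long_name>
      <short_name>Main St</short_name>
      <type>route</type>
    </address_component>
    <address_component>
      <long_name>Springfield</long_name>
      <short_name>Springfield</short_name>
      <type>locality</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>62704</long_name>
      <short_name>62704</short_name>
      <type>postal_code</type>
    </address_component>
  </result>
</GeocodeResponse>"""


def address_parts(xml_text):
    """Map each component type (e.g. 'postal_code') to its long name.

    Only the first <result> is used; returns {} if the status is not OK.
    """
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "OK":
        return {}
    first = root.find("result")
    if first is None:
        return {}
    parts = {}
    for comp in first.findall("address_component"):
        name = comp.findtext("long_name")
        # A component can carry several <type> labels.
        for type_el in comp.findall("type"):
            parts[type_el.text] = name
    return parts


parts = address_parts(SAMPLE_RESPONSE)
```

With the components keyed by type, storing the street number, city, and postal code in separate DB columns becomes a dictionary lookup.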

+6




There are two parts to this: extract the full address from the page, and parse that address into something you can use (for example, store various parts in a DB).

For the first part, you will need a heuristic, most likely a country-specific one. For US addresses, the regex [A-Z][A-Z],?\s*\d\d\d\d\d

should give you the end of the address, provided the state is written as a 2-letter abbreviation. Finding the beginning of the address is left as an exercise.
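A minimal Python sketch of that heuristic, for illustration only. The pattern is the one quoted above, with capture groups and `\b` word boundaries added (the boundaries are an addition, not part of the original pattern) so that it does not fire in the middle of a longer word or number. The sample text and the function name `find_address_ends` are made up.

```python
import re

# Two capital letters, optional comma, optional whitespace, five digits.
STATE_ZIP = re.compile(r"\b([A-Z][A-Z]),?\s*(\d\d\d\d\d)\b")


def find_address_ends(text):
    """Return (state, zip_code, end_offset) for each state+ZIP match."""
    return [(m.group(1), m.group(2), m.end()) for m in STATE_ZIP.finditer(text)]


sample = "Visit us at 123 Main St, Springfield, IL 62704 or call 555-0100."
matches = find_address_ends(sample)
```

Each `end_offset` marks where a candidate address stops; a second heuristic (for example, scanning backwards for a street number) would still be needed to find where it starts.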



The second part can be done either by calling Google Maps or, as is usual in Perl, with a CPAN module: Lingua::EN::AddressParse (check it against your data to make sure it works well enough for you).

This is a tricky task anyway, and you will most likely never get it 100% right, so plan to check addresses manually before using them.

+2




You don't need regular expressions (yet) or a generic parser like pyparsing (in general). Look at something like Beautiful Soup, which will parse even bad HTML into a tag tree. From there, you can look at the source of the page and see which tags you need to drill down through to get to the data. Then you can search the Beautiful Soup tree for those nodes (with CSS selectors in recent versions; Beautiful Soup itself does not support XPath, though lxml does) and iterate directly over the tags you are interested in, which makes it easy to get to the actual data. From there, you can parse the data with a quick regex or whatever. This will be more flexible and more robust to future changes, and arguably less bloated, than trying to do it all in pure regexes.
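The "tag tree first, quick regex second" idea can be sketched as follows. To keep the example runnable without third-party packages it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, so it is not Beautiful Soup's API. The class name `TextByClass`, the `contact` CSS class, the sample page, and the phone pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextByClass(HTMLParser):
    """Collect the text inside every element that has a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a wanted element
        self.chunks = []    # one text string per wanted element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void tags have no end tag; treat <br> as a line break.
            if self.depth and tag == "br":
                self.chunks[-1] += "\n"
            return
        if self.depth:
            self.depth += 1
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data


page = """
<html><body>
<div class="contact"><p>123 Main St<br>Springfield, IL 62704</p>
<p>Phone: (555) 555-0100</p></div>
<div class="footer">Copyright</div>
</body></html>
"""

# Loose US-style phone pattern, applied only to the extracted text.
PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

parser = TextByClass("contact")
parser.feed(page)
phones = [p for chunk in parser.chunks for p in PHONE.findall(chunk)]
```

Because the regex only ever sees the text of the selected elements, it can stay short and loose; markup, scripts, and unrelated parts of the page never reach it.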

0

