Using natural language processing to extract the address from a tweet

Question

I am creating a Twitter bot that will listen for tweets like the following:

Hey @twitterbot, I'm looking for restaurants around 123 Main Street, New York

or, another example:

@twitterbot, what's near Yonge & Dundas, Toronto? I'm hungry!

It will then respond with the data you would expect these questions to return. I've got most of the problem solved, but I'm stuck on something that shouldn't be that hard: extracting an address from a tweet.

I will be passing the address to a geocoding service to get the lat / lng, so I don't need to format or prepare the address in any way; I just need to isolate it from unrelated text like "I'm looking for restaurants around" or "I'm hungry!"

Are there any NLP tools that will perform this address identification in a block of text? Any suggestions for a different path? Since Google's geocoder handles such a wide array of address formats (even a point of interest such as "Survey Center, Toronto" counts as an address), I can't use a regular expression to pluck out the address.

Put another way, I just want to remove any text that is not part of the address.

The addresses I'm looking for must work in the US / Canada.

There are a couple of similar questions on StackOverflow, but none that I could find tackle this exact problem. Since Google's geocoder is so forgiving, the solution doesn't have to be perfect; it just needs to get rid of enough fluff for Google to know what I'm trying to say.

I am very new to NLP, so I would appreciate any guidance on this.

google-maps machine-learning nlp street-address

Joshua Comeau 11 Jul. 15 at 17:52

2 answers

Here you go: http://geocoder.ca/?locate=Hey+%40twitterbot%2C+I%27m+looking+for+restaurants+around+123+Main+Street%2C+New+York&geoit=xml&parse=1

<geodata>
<latt>40.5119365</latt>
<longt>-74.2493562</longt>
<AreaCode>347,718</AreaCode>
<TimeZone>America/New_York</TimeZone>
<standard>
     <stnumber>123</stnumber>
     <staddress>Main ST</staddress>
     <city>STATEN ISLAND</city>
     <prov>NY</prov>
     <postal>11385</postal>
     <confidence>0.9</confidence>
  </standard>
</geodata>

or http://geocoder.ca/?locate=Hey+%40twitterbot%2C+I%27m+looking+for+restaurants+around+123+Main+Street%2C+New+York+
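As a rough sketch of how the lookup above could be scripted: the snippet below builds the same request URL from a raw tweet and reads the coordinates out of a response shaped like the XML shown. The parameter names (`locate`, `geoit`, `parse`) and the tag names are copied from the example above; the function names `build_geocoder_url` and `read_location` are hypothetical, and nothing here actually calls the service.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_geocoder_url(text):
    """Build a lookup URL using the query parameters seen in the example above."""
    query = urlencode({"locate": text, "geoit": "xml", "parse": "1"})
    return "http://geocoder.ca/?" + query


def read_location(xml_text):
    """Return (lat, lng, confidence) from a response shaped like the sample above."""
    root = ET.fromstring(xml_text)
    lat = float(root.findtext("latt"))
    lng = float(root.findtext("longt"))
    confidence = float(root.findtext("standard/confidence"))
    return lat, lng, confidence


# Trimmed copy of the response shown above, used here instead of a live request.
SAMPLE_RESPONSE = """<geodata>
<latt>40.5119365</latt>
<longt>-74.2493562</longt>
<standard>
<city>STATEN ISLAND</city>
<prov>NY</prov>
<confidence>0.9</confidence>
</standard>
</geodata>"""

tweet = "Hey @twitterbot, I'm looking for restaurants around 123 Main Street, New York"
print(build_geocoder_url(tweet))
print(read_location(SAMPLE_RESPONSE))
```

A real bot would fetch the URL and pass the response body to `read_location`; how missing tags or error responses look is not shown in the example above, so that handling is left out.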

Ervin Ruci 12 Dec. 15 at 17:55

Gabriel · Accepted Answer · 2015-07-12T13:23:07+0000

The question "How to parse a freeform street / postal address from text and into components" asks "Is there a way to isolate the address from the text around it and break it into pieces?", which is essentially the same question as yours (except that you don't care about breaking it apart, just isolating it from the rest of the text).

SmartyStreets also has a nice demo at https://smartystreets.com/demo?mode=extract, but unfortunately it is not a free solution.

One more consideration: since tweets are limited to 140 characters and tend to contain only a few words (your two examples are just 9 and 12 words long), you could simply brute-force it. For example, to find the location in "@twitterbot, what's near Yonge and Dundas, Toronto? I'm hungry!", you can submit all of the following strings to the Google geocoder:

what's near Yonge and Dundas, Toronto? I'm hungry!

what's near Yonge and Dundas, Toronto? I'm

what's near Yonge and Dundas, Toronto?

what's near Yonge and Dundas,

and so on, for all possible whole-word substrings.
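A minimal sketch of generating those candidates, assuming the tweet is split on whitespace and the @mention has already been stripped. The function name `word_substrings` is hypothetical; sending each candidate to the geocoder and choosing among the results is left out.

```python
def word_substrings(text):
    """Yield every contiguous run of whole words in text, longest runs first."""
    words = text.split()
    n = len(words)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            yield " ".join(words[start:start + length])


tweet = "what's near Yonge and Dundas, Toronto? I'm hungry!"
candidates = list(word_substrings(tweet))
print(len(candidates))
for candidate in candidates[:5]:
    print(candidate)
```

A tweet of n words yields n(n+1)/2 candidates, so this 8-word example produces 36 geocoder requests. That is manageable for a single tweet, but it is worth checking against the geocoder's rate limits.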
