How to determine whether a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I've read a book or two on the subject and done some general research on various libraries to figure out how to approach this, but I'm not yet confident that I know what to do.

I'm playing around with an idea for an app, and part of it involves finding product mentions in unstructured text (tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:

  • "Starting tomorrow, we have 5 boxes of @hersheys snickers available at $5 each - 1 pp limit" (snickers is a product from Hershey [referred to as "@hersheys"])
  • "Big news: 12 ounce bottles of coke and pepsi on sale from Fri." (coke is a product from Coca-Cola [an alias for "coca-cola"], and Pepsi is a PepsiCo product)
  • "#OMG, I just bought my dream car. A Mustang!!!!" (Mustang is a Ford product)

So basically, given a piece of text, I want to query it to see whether it mentions a product and get back some indication (a boolean or a confidence score) that it does.

Some concerns I have:

  • Missing products because of misspellings. I thought I could use a string similarity check to catch these.
  • False positives, since product names that are also ordinary English words will be caught. For example, a mustang the horse vs. Mustang the car.
  • A list of alternate names for each product would have to be maintained (e.g. "coke" for "coca-cola").

I don't really know where to start, so any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really see how to do this with them. If you know of examples or documentation that explain this, links would be helpful. I'm not tied to any particular language at this point. Java is preferred, but Python and Scala are acceptable.

+3


2 answers


It sounds like your goal is to classify linguistic forms in a given text as references to semantic entities (each of which can be referred to by many different linguistic forms). You describe a number of subtasks that need to be addressed to get good results, but each of them is an independent task in its own right.

Misspellings

To deal with potential misspellings, you need to associate each possible misspelling with its canonical (i.e. correct) form.

  • Phonetic similarity: Many misspellings come from the opaque relationship between a word's phonetic form (how it sounds) and its orthography (how it is spelled). A good way to address this is therefore to index terms phonetically, so that, for example, innovashun is associated with innovation.
  • Form similarity: Alternatively, you can run a string similarity check, but this can introduce a lot of noise into your results that you then have to deal with, because many different words are in fact very similar (e.g. chic vs. chicken). You can make this a bit smarter by first parsing the word morphologically and then using a tree kernel.
  • Manual mappings: You can also simply compile a list of common misspelling → canonical_form mappings. This works well for "exceptions" that the methods above do not handle.
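As a rough sketch of how the three options above might be chained, the following Python snippet tries an exact match, then a manual mapping, then a phonetic key, then a string similarity check. Everything here is an assumption made for illustration: the `PRODUCTS` and `MANUAL` tables are hypothetical, and `phonetic_key` is a toy set of rewrite rules rather than a real algorithm such as Soundex or Metaphone. The similarity check uses `difflib` from the standard library.

```python
import difflib
import re

# Hypothetical catalogue: canonical product name -> brand
# (brand attributions follow the examples in the question).
PRODUCTS = {
    "snickers": "Hershey",
    "coca-cola": "Coca-Cola",
    "pepsi": "PepsiCo",
    "mustang": "Ford",
}

# Hypothetical manual mappings for cases the other methods miss.
MANUAL = {"coke": "coca-cola", "coco-cola": "coca-cola"}


def phonetic_key(word):
    """Toy phonetic key built from a few hand-picked rewrite rules."""
    key = word.lower()
    for pattern, replacement in (("tion", "shun"), ("ph", "f"), ("ck", "k")):
        key = key.replace(pattern, replacement)
    # Collapse runs of the same character ("peppsi" -> "pepsi").
    return re.sub(r"(.)\1+", r"\1", key)


# Index every canonical name by its phonetic key.
PHONETIC_INDEX = {phonetic_key(name): name for name in PRODUCTS}


def canonical_form(token, cutoff=0.8):
    """Return the canonical product name for a token, or None."""
    token = token.lower()
    if token in PRODUCTS:
        return token
    if token in MANUAL:
        return MANUAL[token]
    key = phonetic_key(token)
    if key in PHONETIC_INDEX:
        return PHONETIC_INDEX[key]
    # Last resort: string similarity, which is the noisiest option.
    close = difflib.get_close_matches(token, PRODUCTS, n=1, cutoff=cutoff)
    return close[0] if close else None


for word in ("snikers", "peppsi", "coke", "mustnag", "chicken"):
    print(word, "->", canonical_form(word))
```

The `cutoff` value is a guess; lowering it catches more misspellings at the price of more noise of the chic vs. chicken kind.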

Ambiguous word senses



Mustang the car and mustang the horse have the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans cannot tell which one is meant unless we also know the word's context. One widely used way of modeling this context is distributional lexical semantics: defining the semantic similarity of one word to another as the similarity of their lexical contexts, i.e. the words preceding and following them in a text.
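A minimal sketch of that idea, assuming a handful of hand-written example sentences per sense: collect the words within a small window around "mustang", then compare the resulting bag of words with each sense's profile using cosine similarity. The `SENSE_EXAMPLES` sentences, the window size, and the function names are all hypothetical; a real system would learn the profiles from a large corpus and would probably drop stop words.

```python
import math
import re
from collections import Counter


def context_vector(text, target, window=4):
    """Count the words within `window` positions of each `target` token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    context = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            context.update(tokens[max(0, i - window):i])
            context.update(tokens[i + 1:i + 1 + window])
    return context


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical labelled examples for each sense of "mustang".
SENSE_EXAMPLES = {
    "car": [
        "I just bought a new mustang car from the ford dealer",
        "the mustang engine is loud and the car is fast",
    ],
    "horse": [
        "a wild mustang horse ran across the plains",
        "the rancher tamed the mustang and rode the horse",
    ],
}

# One context profile per sense, summed over its example sentences.
SENSE_PROFILES = {
    sense: sum((context_vector(s, "mustang") for s in sentences), Counter())
    for sense, sentences in SENSE_EXAMPLES.items()
}


def best_sense(text, target="mustang"):
    """Return the sense whose profile is most similar to the text's context."""
    context = context_vector(text, target)
    scores = {s: cosine(context, p) for s, p in SENSE_PROFILES.items()}
    return max(scores, key=scores.get)


print(best_sense("OMG, I just bought my dream car. A mustang!"))
print(best_sense("we saw a wild mustang galloping across the plains"))
```

The cosine scores themselves could double as the confidence value the question asks for, although with profiles this small they are only indicative.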

Linguistic aliases (synonyms)

As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, toilet, boys'/girls' room, throne room, etc. Simple references of this kind can often be treated the same way as common misspellings and mapped to a "canonical" form. For ambiguous references such as throne room, you can also bring in other metrics (such as lexical-distributional methods) to disambiguate the meaning, so that you can tell apart, for example, "I'm in the throne room now!" and "The throne room at Buckingham Palace is beautiful."
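One way this could look in practice is an alias table that is inverted into a lookup index, with the ambiguous aliases flagged so that a context check can be run on them afterwards. This is only a sketch: the `ALIASES` and `AMBIGUOUS` contents and the `resolve` helper are made up for illustration.

```python
# Hypothetical alias table: canonical name -> surface forms seen in the wild.
ALIASES = {
    "coca-cola": ["coke", "coca cola"],
    "snickers": ["snickers bar"],
    "mustang": ["ford mustang", "stang"],
}

# Forms that are also ordinary words and therefore need a context check.
AMBIGUOUS = {"coke", "mustang"}


def build_alias_index(aliases):
    """Invert the alias table into a surface form -> canonical name index."""
    index = {}
    for canonical, forms in aliases.items():
        index[canonical] = canonical
        for form in forms:
            index[form.lower()] = canonical
    return index


ALIAS_INDEX = build_alias_index(ALIASES)


def resolve(surface):
    """Return (canonical_name, needs_context_check), or None if unknown."""
    # Lowercase and normalise whitespace before the lookup.
    form = " ".join(surface.lower().split())
    canonical = ALIAS_INDEX.get(form)
    if canonical is None:
        return None
    return canonical, form in AMBIGUOUS


print(resolve("Coke"))
print(resolve("Ford  Mustang"))
print(resolve("throne room"))
```

Keeping the ambiguity flag next to the alias means the more expensive context check only runs for the forms that need it.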

Conclusion

You have a lot of work to do to get where you want to go, but it's all fun, and there are already good libraries out there for most of these tasks.

+1


The answer you chose doesn't really answer your question.

The best approach you can take is to use a Named Entity Recognizer (NER) together with a POS tagger (grab the NNP/NNPS tags, i.e. proper nouns). Its database may be missing some newer brands such as Lyft (Uber's rival), but short of building your own proprietary database, the Stanford tagger will cover about half of your immediate needs.



If you have the time, I would build a dictionary containing the name of every brand and simply extract those names from the tweets. http://www.namedevelopment.com/brand-names.html If you know how to crawl, it's not a hard problem to solve.
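A dictionary-based extractor along those lines might look like the sketch below. The `BRANDS` table is a hypothetical stand-in for a crawled brand list (its brand attributions follow the examples in the question), and `find_mentions` is an illustrative name. Matching is case-insensitive and anchored on word boundaries, so "pepsis" or "mustangs" would not match.

```python
import re

# Hypothetical brand dictionary: surface form -> (product, brand).
BRANDS = {
    "snickers": ("Snickers", "Hershey"),
    "coke": ("Coca-Cola", "Coca-Cola"),
    "coca-cola": ("Coca-Cola", "Coca-Cola"),
    "pepsi": ("Pepsi", "PepsiCo"),
    "mustang": ("Mustang", "Ford"),
}

# Longest names first, so a longer entry wins over a shorter one
# that happens to be its prefix.
_names = sorted(BRANDS, key=len, reverse=True)
PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(name) for name in _names) + r")(?!\w)",
    re.IGNORECASE,
)


def find_mentions(text):
    """Return a list of (matched_text, product, brand) tuples."""
    mentions = []
    for match in PATTERN.finditer(text):
        product, brand = BRANDS[match.group(1).lower()]
        mentions.append((match.group(1), product, brand))
    return mentions


print(find_mentions("Big news: 12 ounce bottles of coke and pepsi on sale from Fri."))
print(find_mentions("#OMG, I just bought my dream car. A Mustang!!!!"))
```

On its own this only gives a boolean answer (the list is empty or not), and it will still flag mustang the horse.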

+3

