Python NLTK named entity recognition depends on (upper) case of first letter?
I am planning to use Python NLTK for academic research. Specifically, I need a way to vet Twitter users and tease apart those who don't seem to be using their "real name" on their profile.
I am considering using NLTK's default named entity recognition to separate Twitter users who seem to be using their real name from those who don't. Do you think it's worth a try, or should I train a classifier myself?
    import nltk
    import time

    ##contentArray0 = ['Health Alerts', "Kenna Hill"]
    contentArray = ['ICU nurse toronto']

    ##let the fun begin!##
    def processLanguage():
        try:
            for item in contentArray:
                tokenized = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(tokenized)
                print(tagged)
                namedEnt = nltk.ne_chunk(tagged)
                ##namedEnt.draw()
                time.sleep(1)
        except Exception as e:
            print(str(e))

    processLanguage()
Edit: I did a little experimenting. It seems that the first thing NLTK checks when recognizing a named entity is whether the word starts with a capital letter. For example, "ICU Nurse Toronto" is tagged NNP, but "ICU nurse toronto" is not. This seems overly simplistic and not very useful for my purpose (Twitter), since many Twitter users with real names write them in lowercase, while some commercial accounts capitalize the first letter of every word.
Definitely train your own. The NLTK NE recognizer is trained to recognize named entities embedded in complete sentences. But don't just retrain the NLTK NE recognizer on new data: it is a "sequential classifier", meaning it takes into account the surrounding words, their POS tags, and the named-entity classification of the preceding words. Since all you have are isolated usernames, that surrounding context won't be available or relevant for your purposes.
I suggest you train a regular classifier (e.g., Naive Bayes), feed it whatever custom features you think might be relevant, and have it make the "this is a real name" decision. To train it, you need a corpus containing examples of names and examples of non-names. Ideally, the corpus should consist of the kind of thing you are trying to classify: Twitter profile names.
To repeat the point from your comment: don't use whole words as features. Your classifier can only reason with features it knows about, so a list of census names cannot help you with unseen names unless your features capture parts of each name. Useful features are usually endings (last letter, final bigram, final trigram), but you can also try other things like length and, of course, capitalization. The NLTK book's chapter on classification discusses the problem of recognizing the gender of first names and gives many examples of suffix features.
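As a sketch, such character-level features might look like this (the feature names and the tiny labeled list are my own illustration, modeled on the gender example in the NLTK book):

```python
def name_features(word):
    """Character-level features for a single word, in the style of
    the NLTK book's gender-classification example."""
    return {
        "last_letter": word[-1:].lower(),
        "last_bigram": word[-2:].lower(),
        "last_trigram": word[-3:].lower(),
        "length": len(word),
        "capitalized": word[:1].isupper(),
    }

# A hypothetical, tiny labeled corpus of name / non-name words;
# a real one should come from actual Twitter profiles.
train = [("Kenna", True), ("Hill", True), ("Health", False), ("Alerts", False)]

# This (features, label) list is the input format that
# nltk.NaiveBayesClassifier.train() expects.
featuresets = [(name_features(w), label) for w, label in train]
```

With NLTK installed, `classifier = nltk.NaiveBayesClassifier.train(featuresets)` followed by `classifier.classify(name_features("Toronto"))` would then give a (toy) decision.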
The twist in your case is that you have multiple words, so your classifier needs to be told somehow that some words are recognized as names and some are not. You have to define your features in a way that preserves this information. For example, you could set a "known names" feature to "none", "one", "multiple", or "all". (Note that the NLTK implementation treats feature values as categories: they are just distinct labels. You could use 3 and 4 as feature values, but as far as the classifier is concerned, you might as well use "green" and "lift".)
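A minimal sketch of such a categorical feature, assuming `known_names` is some set of lowercased words taken from, say, census lists (the function name and category labels are illustrative):

```python
def known_names_feature(profile_name, known_names):
    """Map a multi-word profile name to one categorical value, based on
    how many of its words appear in a set of known names:
    'none', 'one', 'multiple', or 'all'."""
    words = profile_name.split()
    hits = sum(1 for w in words if w.lower() in known_names)
    if hits == 0:
        return "none"
    if hits == len(words):
        return "all"
    return "one" if hits == 1 else "multiple"

known = {"kenna", "hill", "tom"}
known_names_feature("Kenna Hill", known)        # "all"
known_names_feature("ICU nurse toronto", known) # "none"
```

The classifier never sees the words themselves, only the category, so the feature generalizes to names it has never encountered.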
And don't forget to add a constant-valued "offset" feature (see the NLTK chapter).
You will definitely have to train the classifier yourself. Since you are working with names, you might want to take a look at the classification chapter of the NLTK book. The simple Naive Bayes classifier described there, which tests whether a first name is "masculine" or "feminine", gives a good idea of the kinds of features to use. The question of which features to choose is really a problem- and domain-specific one: besides the common features that all information-extraction researchers use, there may be others, but that depends entirely on your data. Work through that chapter; it gives you all the basic tools to build your own classifier.
As an aside, since you mentioned Twitter usernames, I would also suggest running a normalizer first, since most real names contain only letters. For example, instead of "Tom", the username could be "T0m". If you are already doing this, my apologies for mentioning it again.
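A minimal normalizer sketch; the substitution table below is my own guess at common digit/symbol swaps, not an exhaustive list:

```python
# Common "leetspeak" substitutions; extend the table as needed.
LEET_TABLE = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def normalize_username(username):
    """Undo common digit/symbol substitutions and lowercase the result."""
    return username.translate(LEET_TABLE).lower()

normalize_username("T0m")  # "tom"
```

Running this before feature extraction means "T0m" and "Tom" produce the same features.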