Extract more similar words from a word list

So, I have a list of words that describe a specific group. For example, one group is based on pets.

The words for the example group "pets" are as follows:

[pets, pet, kitten, cat, cats, kitten, puppies, puppy, dog, dogs, dog walking, begging, catnip, lol, catshit, thug life, poop, lead, leads, bones, garden, mouse, bird, hamster, hamsters, rabbits, rabbit, german shepherd, moggie, mongrel, tomcat, lolcatz, bitch, icanhazcheeseburger, bichon frise, toy dog, poodle, terrier, russell, collie, lab, labrador, persian, siamese, rescue, Celia Hammond, RSPCA, battersea dogs home, rescue home, battersea cats home, animal rescue, vets, vet, supervet, Steve Irwin, pugs, collar, worming, fleas, ginger, maine coon, smelly cat, cat people, dog person, Calvin and Hobbes, Calvin & Hobbes, cat litter, catflap, cat flap, scratching post, chew toy, squeaky toy, pets at home, cruft's, crufts, corgi, best in show, animals, Manchester dogs' home, manchester dogs home, cocker spaniel, labradoodle, spaniel, sheepdog, Himalayan, chinchilla, tabby, bobcat, ragdoll, short hair, long hair, tabby cat, calico, tabbies, looking for a good home, neutring, missing, spayed, neutered, declawing, deworming, declawed, pet insurance, pet plan, guinea pig, guinea pigs, ferret, hedgehogs, minipigs, mastiff, leonburger, great dane, four-legged friend, walkies, goldfish, terrapin, whiskas, mr dog, sheba, iams]

Now I am planning to enrich this list using NLTK.

I can start by looking up the synsets for every word. If we take cats as an example, we get:

Synset('cat.n.01')
Synset('guy.n.01')
Synset('cat.n.03')
Synset('kat.n.01')
Synset('cat-o'-nine-tails.n.01')
Synset('caterpillar.n.02')
Synset('big_cat.n.01')
Synset('computerized_tomography.n.01')
Synset('cat.v.01')
Synset('vomit.v.01')

      

For this I use NLTK's WordNet interface: from nltk.corpus import wordnet as wn

Then we can get the lemmas for each synset. By simply adding all of these lemmas I add quite a bit of noise, although I also pick up some interesting words.

What I would like is to reduce that noise, and I would appreciate any suggestions or alternatives to the method above.

One idea I am testing is to check whether the word "cats" appears in the name or definition of a synset, and to include or exclude that synset's lemmas accordingly.
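A minimal sketch of that idea, not a tested solution. It assumes each synset object offers name(), definition() and lemma_names(), as NLTK 3 synsets do; the function name lemmas_from_matching_synsets and the keywords argument are made up for illustration. Matching is done on whole tokens, so "caterpillar" does not count as a hit for "cat".

```python
import re


def _tokens(text):
    # Lowercase alphabetic tokens; 'big_cat' becomes {'big', 'cat'}.
    return set(re.findall(r"[a-z]+", text.lower()))


def lemmas_from_matching_synsets(synsets, keywords):
    """Collect lemmas only from synsets whose name or definition mentions a keyword.

    `synsets` is any iterable of objects with name(), definition() and
    lemma_names() methods (assumption: NLTK 3 style synsets).
    """
    keywords = {k.lower() for k in keywords}
    kept = set()
    for synset in synsets:
        # Synset names look like 'big_cat.n.01'; drop the '.n.01' suffix.
        head = synset.name().rsplit(".", 2)[0]
        if keywords & (_tokens(head) | _tokens(synset.definition())):
            # WordNet joins multiword lemmas with underscores.
            kept.update(name.replace("_", " ") for name in synset.lemma_names())
    return kept
```

With NLTK and the WordNet corpus installed, the intended call would be something like lemmas_from_matching_synsets(wn.synsets('cat'), {'cat', 'cats'}). The check is crude: a synset such as cat-o'-nine-tails still contains the token "cat", so some noise will get through.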



1 answer


I would suggest using semantic similarity with a kNN-style filter: for each candidate word, compute its pairwise semantic similarity with every gold-standard word (the words already in your list), keep only the k most similar gold-standard words (try different values of k, from 5 to 100), and take the mean (or sum) of the similarities to those k words. Then use that value to discard noisy candidates, either by sorting and keeping only the best n, or by cutting off at an experimentally determined threshold.
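The filtering step could be sketched roughly as follows. The names knn_score and filter_candidates are made up for illustration, and similarity is assumed to be any function of two words that returns a number (or None when no score is available); nothing here depends on a particular similarity measure.

```python
def knn_score(candidate, gold_words, similarity, k=5):
    """Mean similarity between `candidate` and its k most similar gold words."""
    sims = sorted(
        # Treat a missing score (None) as 0.0.
        (similarity(candidate, g) or 0.0 for g in gold_words if g != candidate),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


def filter_candidates(candidates, gold_words, similarity,
                      k=5, top_n=None, threshold=None):
    """Rank candidates by kNN score, then keep the best n and/or those above a threshold."""
    scored = sorted(
        ((knn_score(c, gold_words, similarity, k), c) for c in candidates),
        reverse=True,
    )
    if threshold is not None:
        scored = [(s, c) for s, c in scored if s >= threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [c for _, c in scored]
```

Both k and the threshold would need tuning on your own data; the defaults above are placeholders.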

Semantic similarity can be calculated based on WordNet (see related question), or based on vector models learned by word2vec or similar techniques (see related question).
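For the vector-model route, similarity is usually taken to be the cosine of the angle between two word vectors. A small sketch, assuming you already have a plain dict mapping words to vectors (for example exported from a trained word2vec model); make_vector_similarity is a hypothetical helper name, and scoring unknown words as 0.0 is a choice made here, not a rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def make_vector_similarity(vectors):
    """Wrap a {word: vector} dict into a similarity(word_a, word_b) function."""
    def similarity(a, b):
        # Assumption: words missing from the model get similarity 0.0.
        if a not in vectors or b not in vectors:
            return 0.0
        return cosine_similarity(vectors[a], vectors[b])
    return similarity
```

The returned function takes two words and gives back a number, so it can be passed to whatever routine scores the candidates.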



In fact, you can apply this technique with all words as candidates, or with all or some of the words found in domain-specific texts. In the latter case the task is called automatic term recognition, and its methods can be used for your problem directly or as a source of candidates; search for them on Google Scholar. For an example with a short description of existing approaches and links to surveys, see this paper:

Fedorenko D., Astrakhantsev N., Turdakov D. (2013). Automatic recognition of domain-specific terms: an experimental evaluation. In SYRCoDIS (pp. 15-23).
