Search database of entity names (colleges, cities, individuals, countries ...)

For an enterprise app research project that another person and I are working on, we want to remove certain content from the page in order to keep the posted messages generic (which means non-offensive and inherently anonymous). Right now, we want to take the message a user posts to the message board and remove any personal names, college or institution names, and profanity (and, if possible later, company names as well).

Is there some kind of database we can connect to, so that we can clean up our messages by checking their words against the values in that database?



1 answer


The question seems to imply an online database that would be queried as messages are processed. Operational problems (reliability of such services, lag in response time, etc.) and the problem of completeness (several databases would need to be queried, since none of them covers 100% of the lexical needs of the project) make this online, real-time approach impractical. However, there are many databases available for download, which let you build your own local database of "hot words".

WordNet might be a good place to start: you would probably treat all "instance" words as words that should normally be removed from messages as you anonymize / sanitize them. (You might also want to keep the "non-instance" words in a separate table / word list of words "more likely to be fine".) This list alone will probably support a "version 0.9" of your application.

Ultimately you will want to expand this lexical database of "bad words", for example to include all university acronyms (CMU, UCSD, DU, MIT, UNC, etc.), sports team names (Celtics, Bruins, Red Sox ...) and, depending on the domain of your posts, additional names of public figures (WordNet has a few, like George W. Bush or Robert De Niro, but it lacks lesser-known people and people who came to fame more recently, e.g. Barack Obama).

To complement WordNet, two different types of sources come to mind:

  • traditional online databases
  • ontologies and folksonomies

An example of the former is, say, the "Cities / States by ZIP Code" data from the USPS. Examples of the latter are the various "lists" compiled by scholars, organizations or individuals. It is not possible to give an exhaustive list of either type of source, but the following should help:



  • DAML.ORG Ontology Catalog
  • Regions and states of the USA: an example of the DAML ontology format
  • Open Directory Project: an open-source directory (beware, it quickly gets messy)
  • SourceWatch.org: an example of a "list of lists" (people in journalism / politics)
  • Search engine keywords: "List of Lists", plus the three or four words that you expect to find in the list you are looking for

In the simpler cases you can just download the lists, or even cut and paste them. Ontologies will be "burdened" with additional attributes that you need to parse out (in the future you may actually want these attributes and use the ontologies in a more traditional way; for now, capturing the lexical entries is all that is needed).
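As a rough illustration of "capturing the lexical entries" only, the sketch below pulls the label text out of an ontology file and ignores every other attribute. It assumes the ontology is serialized as RDF/XML and that names sit in `rdfs:label` elements; real files may use other properties. The `SAMPLE` fragment and the function name `extract_labels` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag name of rdfs:label in ElementTree notation.
RDFS_LABEL = "{http://www.w3.org/2000/01/rdf-schema#}label"

# Hypothetical RDF/XML fragment, made up for illustration only.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:ID="Ohio">
    <rdfs:label>Ohio</rdfs:label>
    <rdfs:comment>A state in the Midwest region</rdfs:comment>
  </rdf:Description>
  <rdf:Description rdf:ID="NewYork">
    <rdfs:label>New York</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""

def extract_labels(rdf_xml):
    """Return the sorted, lowercased label texts found in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    labels = set()
    for element in root.iter(RDFS_LABEL):
        # Skip empty labels; everything except the label text is ignored.
        if element.text and element.text.strip():
            labels.add(element.text.strip().lower())
    return sorted(labels)

print(extract_labels(SAMPLE))
```

The resulting strings can then be appended to the local hot-word list.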

The task of compiling this lexical database can seem daunting. But the 80-20 rule suggests that 20% of the hot words will account for 80% of the occurrences in posts, so with relatively little effort you should be able to produce a system that covers 90%+ of your use cases.
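A minimal sketch of how a local hot-word list could be applied to a message, assuming the list fits in memory. The entries in `HOT_WORDS`, the `[removed]` placeholder and the function names are hypothetical; in practice the list would be loaded from the database described above.

```python
import re

# Hypothetical hot-word list (lowercase); normally loaded from the local database.
HOT_WORDS = {"mit", "ucsd", "red sox", "barack obama"}

def build_hot_word_pattern(hot_words):
    """Compile one case-insensitive regex matching any hot word or phrase."""
    # Longest entries first, so multi-word phrases win over shorter entries.
    ordered = sorted(hot_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    # \b keeps "mit" from matching inside a longer word such as "summit".
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def redact(message, pattern, placeholder="[removed]"):
    """Replace every hot-word match in the message with a placeholder."""
    return pattern.sub(placeholder, message)

PATTERN = build_hot_word_pattern(HOT_WORDS)
print(redact("I met Barack Obama at MIT before the Red Sox game.", PATTERN))
```

A single compiled alternation is only one option; for very large lists, tokenizing the message and looking tokens up in a set or a database table may scale better.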

Looking Ahead: Beyond the Hot Word Database
There are many ways to approach this task using various techniques and concepts from Natural Language Processing (NLP). As your project grows in sophistication, you may learn about some of these concepts and possibly implement them. For example, a simple POS tagger comes to mind, since it can help [partially] distinguish between different uses of a token such as "SCREW" when your application is discarding offensive words ("The board of directors wants to screw the pupils" versus "The board must be fixed with a minimum of 4 screws per yard").

Before you need these formal NLP techniques, you can use a few pattern-based rules to handle the common cases specific to the domain(s) and message types of your project. For example, you might consider the following:
- (word) State University
- Senator (Word_starting_with_a_capital_letter)
- Words that mix letters and numbers (these are often used to disguise words and bypass the kind of filters your project wants to implement)
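The three rules above could be sketched as regular expressions along these lines. The exact patterns, the placeholder strings and the name `apply_rules` are assumptions made for illustration; real rules would need tuning against actual messages.

```python
import re

# Illustrative rules only: (compiled pattern, placeholder) pairs.
RULES = [
    # "(word) State University", e.g. "Ohio State University"
    (re.compile(r"\b[A-Z]\w+ State University\b"), "[institution]"),
    # "Senator" followed by a word starting with a capital letter
    (re.compile(r"\bSenator [A-Z]\w+\b"), "[person]"),
    # Tokens that mix letters and digits, e.g. "l33t"
    (re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b"), "[removed]"),
]

def apply_rules(message):
    """Apply each rule in order and return the rewritten message."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message

print(apply_rules("Senator Smith visited Ohio State University, what a l33t day"))
```

Pure numbers such as "4" are left alone by the third rule, because it requires at least one letter and one digit in the same token.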

Another tool that can be useful, particularly at the beginning, would be a system that collects statistical information about the message corpus: word frequency, most common words, most common bigrams (two consecutive words), etc.
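Such statistics can be gathered with very little code. The sketch below counts words and bigrams with `collections.Counter`; the tokenizer is deliberately naive and the sample messages are made up.

```python
import re
from collections import Counter

def tokenize(message):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())

def corpus_stats(messages):
    """Return (word counts, bigram counts) for an iterable of messages."""
    word_counts = Counter()
    bigram_counts = Counter()
    for message in messages:
        tokens = tokenize(message)
        word_counts.update(tokens)
        # Pair each token with its successor; bigrams never span two messages.
        bigram_counts.update(zip(tokens, tokens[1:]))
    return word_counts, bigram_counts

# Made-up sample corpus.
messages = ["Go Red Sox", "The Red Sox won", "I go to MIT"]
words, bigrams = corpus_stats(messages)
print(words.most_common(3))
print(bigrams.most_common(1))
```

Frequent words and bigrams that are not yet in the hot-word list are good candidates for manual review.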
