Current state of anti-spam protection

What is the current state of anti-spam technology?

I've already read Paul Graham's articles on Bayesian filtering (A Plan for Spam and Better Bayesian Filtering)

and would like to know whether there are other relevant articles (preferably AI related).

+3




4 answers


If you are trying to block gibberish words and sentences such as "fasdhusdhfi" rather than anything else, you can keep a database of words and their synonyms. You can then check the input and raise a flag if fewer than 50% of its words are in the database. You could build a standalone database, which I would not recommend, or you could use an online one. For a list of words, I would suggest

http://thesaurus.com/

For a list of synonyms for these words, I would suggest

http://www.synonyms.net/

I think these two would probably be the best for the stated purpose, as they both have an API (the one for synonyms.net is on this page ) that you can use, so you don't need to parse the returned pages for words.

In turn, you could combine this with other techniques as stated earlier, such as Bayesian filtering.
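A minimal sketch of the known-word check described above, assuming a small in-memory set stands in for the word database. KNOWN_WORDS, known_word_ratio and looks_like_spam are hypothetical names for illustration, not part of any of the services mentioned.

```python
import re

# Hypothetical stand-in for a real word database or online word list.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over",
               "lazy", "dog", "hello", "world"}

def known_word_ratio(text, known_words=KNOWN_WORDS):
    """Return the fraction of words in `text` that appear in `known_words`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        # Assumption: input without any words counts as entirely unknown.
        return 0.0
    known = sum(1 for word in words if word in known_words)
    return known / len(words)

def looks_like_spam(text, known_words=KNOWN_WORDS, threshold=0.5):
    """Raise the flag when fewer than `threshold` of the words are known."""
    return known_word_ratio(text, known_words) < threshold
```

With a real database the set would be far larger, and the 50% threshold is only the starting point suggested above.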

While this does not cover the AI side of your question, it does block a number of messages.

You can probably adapt ALICE's Spam.aiml to fit the "AI" part of your question. It is in AIML format and contains many permutations of 4-character spam. The problem is that it is slow.

A possible alternative to Spam.aiml would be to use English language rules for spam detection and filtering. The following rules can be used:

Each word must have at least one vowel. For this, the letter "Y" is considered a vowel.

No word may contain more than three consonants in a row. For this, "TH" is counted as one letter (so as not to flag words such as "streNGTH").

No word may be longer than 34 letters. Exceptions to this may be the words listed here .

Some letter combinations practically never occur. An example of this would be that the letters "J" and "K" almost never appear directly next to each other in normal, non-slang conversation.

You can keep a database of impossible combinations. I made a small one by checking every 2-letter permutation against a 6578-word database and came up with the following results:

df bf kf gf jk kj sj fj gj hj lj sl

      

These are all impossible combinations. Of course, doubled letters like 'zz' are left out of that list. They are:

aa bb cc dd ee ff gg hh ii jj kk ll mm nn pp qq rr ss tt uu vv ww xx yy zz

      



'oo' is omitted as it appears in many words, such as "look".
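The way such a list can be derived might be sketched as follows: collect every 2-letter sequence that occurs in a word list and subtract that from all 676 possible pairs. This is only an illustration under the assumption that the words are plain lowercase English; unseen_letter_pairs and contains_unseen_pair are hypothetical helper names, and the result depends entirely on the word list used.

```python
from itertools import product
from string import ascii_lowercase

def unseen_letter_pairs(words):
    """Return the 2-letter combinations that occur in none of `words`."""
    seen = set()
    for word in words:
        lowered = word.lower()
        # Every pair of adjacent characters in the word.
        seen.update(a + b for a, b in zip(lowered, lowered[1:]))
    all_pairs = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    return all_pairs - seen

def contains_unseen_pair(word, unseen):
    """True if `word` contains any pair from the `unseen` set."""
    lowered = word.lower()
    return any(lowered[i:i + 2] in unseen for i in range(len(lowered) - 1))
```

A small word list will report many pairs as "impossible" that are perfectly normal, so the larger the database, the more trustworthy the result.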

Segments of 2 or more characters that are repeated in succession will be marked as spam. The string "lololololol" has a repeated "lo" segment and is marked as spam.

More than 3 identical vowels in one word will be marked as spam. For example: "oooouuuu" will be flagged as spam since "o" and "u" are vowels that have been repeated for more than 3 times.

No word longer than 1 character may consist of vowels only. In this case, "Y" is not considered a vowel, to prevent false positives on "you".

Any input in which 15% or more of the words break these rules (the margin allows for spelling errors) will be treated as spam.
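The rules above could be sketched roughly like this. It is not the program mentioned in this answer, only an illustration. Assumptions: the pair list is copied from the list above, a segment counts as "repeated" only when it appears three or more times in a row (two copies would flag ordinary words like "banana"), and the vowel rule is read as a run of more than 3 identical vowels. rule_violation and is_spam are hypothetical names.

```python
import re

VOWELS = set("aeiouy")  # "y" counts as a vowel for the first rule
MAX_WORD_LENGTH = 34
# Copied from the list above; a real list should come from a larger database.
IMPOSSIBLE_PAIRS = {"df", "bf", "kf", "gf", "jk", "kj",
                    "sj", "fj", "gj", "hj", "lj", "sl"}

def rule_violation(word):
    """Return the reason `word` breaks a rule, or None if it passes."""
    w = word.lower()
    if not any(c in VOWELS for c in w):
        return "no vowel"
    if len(w) > MAX_WORD_LENGTH:
        return "too long"
    # Count "th" as a single consonant before looking for consonant runs.
    if re.search(r"[b-df-hj-np-tv-xz]{4,}", w.replace("th", "t")):
        return "more than 3 consonants in a row"
    if any(w[i:i + 2] in IMPOSSIBLE_PAIRS for i in range(len(w) - 1)):
        return "impossible letter pair"
    # Assumption: three or more consecutive copies of a 2+ character segment.
    if re.search(r"(.{2,}?)\1\1", w):
        return "repeated segment"
    # Assumption: a run of more than 3 identical vowels.
    if re.search(r"([aeiou])\1{3,}", w):
        return "same vowel more than 3 times in a row"
    # "y" is not treated as a vowel here, so "you" passes.
    if len(w) > 1 and all(c in "aeiou" for c in w):
        return "only vowels"
    return None

def is_spam(text, threshold=0.15):
    """Flag `text` when `threshold` or more of its words break a rule."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    bad = sum(1 for word in words if rule_violation(word) is not None)
    return bad / len(words) >= threshold
```

The order of the checks only decides which reason gets reported first; any single violation marks the word as invalid.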

If you decide to change the ALICE files, you can get the ones from here . A newer version can be found on the ALICE Google Code page .

You can also use a spell checker to help detect spam. You can run the input through a spell checker such as PyEnchant (for Python) and read the suggestions. If there are no suggestions for a word in the input, then in most cases it can be assumed to be spam.
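A rough sketch of the spell-checker idea, assuming PyEnchant is installed and an "en_US" dictionary is available. The check and suggest callbacks are passed in so the core function can be tried with any spell checker; words_without_suggestions and enchant_callbacks are hypothetical helper names, while enchant.Dict, check and suggest are PyEnchant's own calls.

```python
import re

def words_without_suggestions(text, check, suggest):
    """Return the words that fail the spell check and get no suggestions."""
    flagged = []
    for word in re.findall(r"[A-Za-z]+", text):
        # A misspelled word with no suggestions at all is probably gibberish.
        if not check(word) and not suggest(word):
            flagged.append(word)
    return flagged

def enchant_callbacks(language="en_US"):
    """Build check/suggest callbacks from PyEnchant (needs pyenchant installed)."""
    import enchant  # imported lazily so the function above works without it
    dictionary = enchant.Dict(language)
    return dictionary.check, dictionary.suggest
```

Usage would look like `check, suggest = enchant_callbacks()` followed by `words_without_suggestions(message, check, suggest)`; which words get flagged depends on the installed dictionary.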

It's not ideal, but it should work to a limited extent. I made a small program to demonstrate this kind of spam filtering and how it behaves. This is the result:

>>> fdsahjfsd
'fdsahjfsd' is spam since more than 3 consonants appear in a row
>>> fhsdjhfksd
'fhsdjhfksd' is spam since it has no vowel
>>> jfsdkjl
'jfsdkjl' is spam since it has no vowel
>>> dk
'dk' is spam since it has no vowel
>>> ddds
'ddds' is spam since it has no vowel
>>> uxxs
'uxxs' is not spam
>>> kd
'kd' is spam since it has no vowel
>>> ukd
'ukd' is not spam
>>> asdjaskljlaskjldkasjkljdklas
'asdjaskljlaskjldkasjkljdklas' is spam since it is too long
>>> hdjaskj
'hdjaskj' is spam since invalid sequences detected

      

As I said, it is not perfect, as it returns false negatives (e.g. "uxxs"), but this can be fixed with a spell check.

The drawback of a spell-check implementation is that your spam detection depends on how many words the dictionary has. Most spell checkers only include the 10,000 or so most common words, so some unusual words might be blocked as spam. However, flagging the input only when more than 15% of its words are invalid can mitigate the problem.

If you think this might help you, you can get a little program I made from here . This is written in Python.

Also, as other answers have said, the "state of the art" spam filter requires a combination of techniques.

You can use SpamAssassin, Pyzor, Reverend, and Orange, but your best bet would probably be to try and combine everything together.

If you want to use Lisp for this, a good article on Bayesian filtering in Lisp is here .

If you want to do it with a neural network, then this CodeProject article might be helpful. It uses a simple and easy-to-use DLL, and the sample code can be used almost directly for a spam filtering task.

Hope it helped!

+4




The state of the art is not so much about the particular algorithm as about the quality and quantity of the input data. To reach the state of the art, you need hundreds of thousands of active users and millions of messages per day. In other words, either you are Gmail, Yahoo or Hotmail, or you have the means to get equally large amounts of data in real time.

Delay your verdict until the last moment; be prepared to pull a message out of the user's inbox right before they request the list of messages. Find out which users to trust and apply their verdicts to all other users' messages. Collect as many external inputs as possible (user verdicts, sender reputation, URL analysis, whatever you have) and feed them to your machine learning.
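One simple way to apply trusted users' verdicts to a message could look like the sketch below. It is purely illustrative, not how any large provider does it. Assumptions: each user has a trust weight, unknown users get a small default weight, and the resulting score would be just one of the inputs fed to the classifier. trust_weighted_spam_score is a hypothetical name.

```python
def trust_weighted_spam_score(verdicts, trust, default_trust=0.1):
    """Combine user verdicts on one message into a score between 0 and 1.

    verdicts: iterable of (user_id, said_spam) pairs.
    trust: mapping of user_id to a weight; unknown users get `default_trust`.
    """
    spam_weight = 0.0
    total_weight = 0.0
    for user_id, said_spam in verdicts:
        weight = trust.get(user_id, default_trust)
        total_weight += weight
        if said_spam:
            spam_weight += weight
    if total_weight == 0:
        # No usable verdicts yet, so there is no evidence either way.
        return 0.0
    return spam_weight / total_weight
```

Because the score can be recomputed whenever a new verdict arrives, it fits the idea of postponing the final decision until the user actually asks for the message list.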

Attempting to filter spam based on message content alone is a losing game; spammers know how to mutate their messages to the point where a Bayesian classifier can barely see anything but noise. But you can use that against them. SpamAssassin has many tests for this, but then again, you need real-time dynamic data analysis to really pull it off. I would even argue that once you have enough relevant input data, the exact method you use to reach your verdict is of secondary importance.

+3




I was (out of sheer laziness) running a default SpamAssassin installation for a while and it performed pretty poorly.

A few months ago I added the Vipul's Razor and Pyzor collaborative filtering systems to my arsenal, with SpamAssassin in control and raising the spam score for messages they flag. I report spam to both systems on a regular basis. It's still not perfect, but now my phone goes off much less often.

It seems that "state of the art" is a combination of effective techniques.

+1








