Best choice of language for spam detection service

I have about 20 active blogs that receive quite a lot of spam. Since I hate CAPTCHAs, the alternative is very clever spam filtering. I want to create a simple REST API, a spam-checking service that I will use from all of my blogs. This way I can consolidate IP blocks and third-party spam detection such as Akismet, Mollom, and Defensio, and sometime in the future write my own spam detection so I can really get my head into some very interesting spam detection algorithms.

My language of choice is PHP. I consider myself quite experienced in it, and I can really dig in and come out with a solution. Still, I believe this project would be a good exercise for learning another language. The big two that come to mind are Python and Ruby on Rails, since everyone talks about them as if they were the second coming. Since it is mostly just an API with no admin or public-facing interface, basic Python running on a plain HTTP server seems like the right fit. Did I miss something? What would you, a great community, recommend? I would love to hear your recommendations for a language, books, and guidelines.

It needs to scale, and I want to write it with that in mind. I can probably rely on the free plans of the third-party services for now, but soon enough I'll have to deploy my own detection rather than depend on them. For now, I think I'll just keep everything in a MySQL database until I can do some real analysis. Thanks!

+1




4 answers


Python has some advantages.



  • There are several HTTP server frameworks in Python. Check out the WSGI reference implementation and learn how to use the WSGI standard to handle web requests. It is very clean and extensible. It takes a bit of study to see that WSGI is all about adding details to the request until you reach the stage in processing where it's time to formulate the reply.

  • Parsing MIME email is pretty simple.

  • You will then use site blacklisting and content filtering to detect spam.

    • The site blacklist can be a large, fancy RDBMS, or it can be a simple pickled Python set of domain names and IP addresses. I recommend a simple pickled set that lives in memory. It's fast. You can make the RESTful service reload this set from the source file whenever it receives a particular GET request that triggers an update.

    • Text filtering is simply hard. I would start with SpamBayes.
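The WSGI and in-memory blacklist points above can be sketched together. This is only a minimal illustration, not a finished service: the file name `blacklist.pickle`, the `/check` and `/reload` paths, and the `ip` and `domain` query parameters are all assumptions made up for the example. It uses only the standard library (`wsgiref`, `pickle`).

```python
import json
import pickle
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

BLACKLIST_PATH = "blacklist.pickle"  # hypothetical file holding a pickled set
blacklist = set()  # in-memory set of blocked domain names and IP addresses


def load_blacklist(path=None):
    """Replace the in-memory set with the pickled set stored at `path`.

    Only unpickle files you produced yourself; pickle is unsafe on
    untrusted input.
    """
    global blacklist
    with open(path or BLACKLIST_PATH, "rb") as f:
        blacklist = set(pickle.load(f))
    return len(blacklist)


def app(environ, start_response):
    """WSGI entry point: GET /check?ip=...&domain=... and GET /reload."""
    path = environ.get("PATH_INFO", "")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    status = "200 OK"
    if path == "/check":
        candidates = query.get("ip", []) + query.get("domain", [])
        body = {"spam": any(c in blacklist for c in candidates)}
    elif path == "/reload":
        body = {"entries": load_blacklist()}
    else:
        status, body = "404 Not Found", {"error": "unknown path"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def serve(port=8000):
    """Run the app on the reference WSGI server (fine for experiments)."""
    make_server("", port, app).serve_forever()
```

Calling `serve()` starts the reference server; because `app` is a plain WSGI callable, the same function should also run unchanged under a production WSGI server.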

+2




My first question is: why don't you just use one of the three services you listed? They seem to do exactly what you want. Sorry for being cynical, but I doubt that you, working alone, could beat the software engineers developing the algorithms behind those services in a reasonable amount of time, especially given that their income depends on how well they do it.

Then again, you may be smarter than they are =P. I don't judge. In any case, I would recommend Python for the reasons you stated: you don't need a fancy public interface, so Python's weakness in this area doesn't matter. Python is also good for text processing, and it has great built-in bindings for databases (e.g. SQLite; you can of course install MySQL bindings if you deem it necessary).



Disadvantages: It can get a little slow depending on how complex your algorithms are.
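As a rough illustration of the built-in database bindings mentioned above, here is a small sketch using the standard library's `sqlite3` module to record verdicts for later analysis. The `checks` table, its columns, and the helper names are assumptions invented for this example; a MySQL driver following the same DB-API style would look similar.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS checks (
               id INTEGER PRIMARY KEY,
               blog TEXT NOT NULL,
               ip TEXT NOT NULL,
               content TEXT NOT NULL,
               is_spam INTEGER NOT NULL,
               checked_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def record_check(conn, blog, ip, content, is_spam):
    """Store one verdict; values are bound as parameters, not interpolated."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO checks (blog, ip, content, is_spam) VALUES (?, ?, ?, ?)",
            (blog, ip, content, int(is_spam)),
        )


def spam_counts_by_ip(conn):
    """Return (ip, spam_count) rows, most frequent offenders first."""
    return conn.execute(
        "SELECT ip, COUNT(*) AS n FROM checks WHERE is_spam = 1 "
        "GROUP BY ip ORDER BY n DESC, ip"
    ).fetchall()
```

A query like `spam_counts_by_ip` is the kind of simple analysis that could later feed an IP blacklist shared across all the blogs.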

+9




I humbly recommend Lua, not only because it is a great, fast language that is already integrated with web servers, but also because you can then use OSBF-Lua, an existing spam filter that has won spam-filtering competitions for several years in a row. Fidelis Assis and I have worked hard to generalize the model beyond email, and we'd love to work with you to integrate it with your application, which is exactly the kind of embedding Lua was designed for.

In terms of scaling, in learning mode we are processing hundreds of emails per second on a 2006 machine, so this should work well even for a busy website.

We would need to work with you on classifying material that has no mail headers, but I've already made some headway in that direction. For more information, email nr@cs.tufts.edu. (Yes, I want people to send me spam. This is for research!)

+1




I would recommend Akismet for its ease of use and high accuracy. With only a WordPress.com API key and a single API call, you can determine whether a given piece of user-submitted text is spam. I am using the Akismet WordPress plugin, which uses the same API, and have had stellar results with it for the past year or so.
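To give a feel for what "an API key and an API call" might look like from a standalone service, here is a hedged Python sketch using only the standard library. The endpoint URL and the form field names (`api_key`, `blog`, `user_ip`, `user_agent`, `comment_type`, `comment_content`) are written from memory of Akismet's comment-check call and should be treated as assumptions to verify against the current Akismet API documentation; the helper names are made up for this example.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; confirm against Akismet's current API documentation.
AKISMET_URL = "https://rest.akismet.com/1.1/comment-check"


def build_comment_check(api_key, blog_url, user_ip, user_agent, content):
    """Build (but do not send) a form-encoded POST request for one comment."""
    fields = {
        "api_key": api_key,          # assumed field names, see lead-in
        "blog": blog_url,
        "user_ip": user_ip,
        "user_agent": user_agent,
        "comment_type": "comment",
        "comment_content": content,
    }
    data = urlencode(fields).encode("utf-8")
    return Request(AKISMET_URL, data=data, method="POST")


def is_spam(request, timeout=10):
    """Send the request; the body is assumed to be the text 'true' or 'false'."""
    with urlopen(request, timeout=timeout) as response:
        verdict = response.read().decode("utf-8").strip()
    if verdict not in ("true", "false"):
        raise ValueError(f"unexpected response: {verdict!r}")
    return verdict == "true"
```

Keeping request construction separate from sending makes it easy to log or unit-test what would be submitted without touching the network or spending API quota.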

Zend Framework has a great Akismet PHP class that you can use independently of the rest of the framework, which should make integration pretty easy. The documentation is also fairly complete.

+1








