SpamAssassin REGEX to catch long url

Question

I am tightening my SpamAssassin filter on CentOS. After blocking the .link and .eu domains, I would like to flag very long URL strings of more than 100 characters.

Requirements:

Starts with http or https
May or may not contain www
Ends at end of line, a line break, a space, or one of ", ', <

I came up with this one:

body     LONG_URL    (https?:\/\/)[^,;\"\'<\s$]{100,}
describe LONG_URL    URL with over 100 characters
score    LONG_URL    0.5

It works in a regex tester, but it doesn't work in SpamAssassin.

regex spamassassin

yello · Oct 21 at 4:19 am

2 answers

200_success · Answer 1 · 2014-10-21T05:56:50+0000

You want to write a uri test, not a body test.
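
A uri test is matched against each URI that SpamAssassin extracts from the message rather than against the rendered body text, so the pattern can anchor at the start of the URI. A minimal sketch follows; the rule name LONG_URI is hypothetical, the score and the 100-character threshold are carried over from the question, and counting the 100 characters after the scheme is an assumption.

```
# Hypothetical rule name; threshold and score mirror the question.
# Fires when at least 100 characters follow the http:// or https:// scheme.
uri      LONG_URI    /^https?:\/\/.{100}/i
describe LONG_URI    URI with over 100 characters
score    LONG_URI    0.5
```

Running spamassassin --lint after editing the configuration should report syntax errors in the rule, if there are any.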

Adam Katz · Answer 2 · 2014-11-13T22:05:04+0000

To work around the problem with new TLDs, you do indeed need a body rule. As written above, your rule has some syntax problems and some unnecessary computational overhead. Try this instead:

body     YELLO_LONG_BODY_URL  m@\bhttps?://[^\"\'<\s$]{100}@i
describe YELLO_LONG_BODY_URL  100+ char URL, https://stackoverflow.com/a/26919318
score    YELLO_LONG_BODY_URL  0.1

This will technically work, although I'm sure you will find that it hits a LOT of non-spam, marketing mail in particular, especially if you keep that 100-character bound (that's not enough!). I took out the comma and semicolon because they can be part of URLs, and being off by a character at the end of a long URL hardly matters for a length check, so you're probably fine with just m@\bhttps?://\S{100}@i

Caveat: I fight spam for a living and have a lot of data at my fingertips. Below 128 characters you will hit more non-spam ("ham") than spam. No size range gives a terribly good spam:ham ratio; an S/O (aka precision) of 0.900 is probably acceptable, but you really want to be closer to 1.000. In my tests, the best range is 192-256 characters, but it is too weak (S/O = 0.862) to be terribly useful. Spam using links with more than 1024 characters (S/O = 0.057) is almost non-existent.
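
If you want to experiment with the 192-256 range anyway, one way to sketch it is to bound the run of non-whitespace characters and require that the run ends there. The rule name and score below are hypothetical, and counting the characters after the scheme (rather than over the whole URL) is an assumption; adjust the bounds if you measure the full URL.

```
# Hypothetical rule: fires when 192 to 256 non-whitespace characters follow
# the scheme and the next character is whitespace or the end of the text.
body     YELLO_URL_192_256  m@\bhttps?://\S{192,256}(?!\S)@i
describe YELLO_URL_192_256  URL with 192-256 characters after the scheme
score    YELLO_URL_192_256  0.1
```

The negative lookahead (?!\S) keeps the rule from firing on longer URLs, which a plain \S{192} would also match. Note that a long URL with another http:// embedded in it (a redirect parameter, for example) can still match from that inner position.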

I also changed the rule name. It's good to put your name on your rules so that they can be easily identified as yours (rather than upstream SpamAssassin's) when something goes wrong and "credit" becomes "blame" 😉. I even linked this answer in the rule's description so that your users can learn more.
