I've got a huge dataset of spam from the past few years of running my site. Unfortunately the Movable Type interface I've got available at the moment won't let me upload the whole Excel spreadsheet, but I'll be happy to email it to anyone who wants to see it. Here are some highlights.
Top ten filter strings that have blocked the most spams:
51175 - <h1>
32347 - texas-hold-em
20657 - texas-holdem
17727 - qualitypornlinks4u.info
14717 - free-online-poker
13310 - payday-loan
11606 - hey.com
9204 - free--online--poker
6601 - 00120.com
6304 - pornlink4u.info
Five filter strings that were created on 1/1/2005 and caught a spam today, 8/24/2006 (a useful span of 600 days):
2944 - viagra
2216 - (diet|penis)[\w\-_.]*(pills|enlargement)[\w\-_....
700 - hydrocodone
382 - freewebs.com
175 - xenical
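For anyone who wants to check the 600-day figure, here is a minimal Python sketch. It assumes "useful span" means the number of days between a filter's creation date and its most recent catch; the function name `useful_span` is made up for illustration and is not part of any tool I used.

```python
from datetime import date


def useful_span(created, last_hit):
    """Days between a filter's creation and its most recent catch (assumed definition)."""
    return (last_hit - created).days


# Filters created on 1/1/2005 that caught a spam on 8/24/2006.
print(useful_span(date(2005, 1, 1), date(2006, 8, 24)))  # 600
```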
Total filter strings: 8144
Total spams caught: 417,385
Filter strings that never caught a spam: 63%
The top 14 filters caught 50% of the spams -- that is, the top 0.17% of filters caught half the spam.
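The concentration figures can be checked with a short Python sketch. It only uses numbers quoted in this post (the top-ten counts and the two totals); the helper `filters_needed` is a hypothetical name, and since the full spreadsheet isn't included here, the sketch can confirm the top-ten share and the 0.17% figure but not the exact count of 14.

```python
def filters_needed(counts, share=0.5, total=None):
    """Smallest number of top filters whose hits reach `share` of `total`.

    Returns None if the supplied counts never reach that share.
    """
    total = sum(counts) if total is None else total
    running = 0
    for n, hits in enumerate(sorted(counts, reverse=True), start=1):
        running += hits
        if running >= share * total:
            return n
    return None


# Counts from the top-ten list in this post.
TOP_TEN = [51175, 32347, 20657, 17727, 14717, 13310, 11606, 9204, 6601, 6304]
TOTAL_SPAMS = 417385
TOTAL_FILTERS = 8144

top_ten_share = sum(TOP_TEN) / TOTAL_SPAMS
top_14_fraction = 14 / TOTAL_FILTERS

print(f"{top_ten_share:.1%}")    # 44.0%
print(f"{top_14_fraction:.2%}")  # 0.17%
# The top ten alone fall short of half of all spam, so this prints None.
print(filters_needed(TOP_TEN, 0.5, TOTAL_SPAMS))
```

Run against the full list of per-filter counts, `filters_needed` would be one way to reproduce the "top 14" number.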
Median number of spams caught by a filter: 11
Mean number of spams caught by a filter: 131
Standard deviation of number of spams caught by a filter: 1307
Defining rate as spams caught divided by useful span, the filter with the highest catch rate is "free--online--poker" with 9204 hits in 4 days, for 2301 spams per day. The filter with the highest rate with a useful span of over 50 days is "qualitypornlinks4u.info" with 17727 hits in 62 days, for a rate of 281 spams per day.
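The rate definition above can be sketched in a few lines of Python. The function names and the `(name, hits, span_days)` tuple layout are assumptions for illustration; the sample rows are taken from figures quoted in this post, not from the full spreadsheet.

```python
def catch_rate(hits, span_days):
    """Spams caught per day of useful span."""
    if span_days <= 0:
        raise ValueError("useful span must be at least one day")
    return hits / span_days


def highest_rate(filters, min_span=0):
    """Name of the filter with the highest rate among those with span > min_span."""
    eligible = [f for f in filters if f[2] > min_span]
    return max(eligible, key=lambda f: catch_rate(f[1], f[2]))[0]


# (filter string, hits, useful span in days), all taken from this post.
FILTERS = [
    ("free--online--poker", 9204, 4),
    ("qualitypornlinks4u.info", 17727, 62),
    ("hydrocodone", 700, 600),
    ("xenical", 175, 600),
]

print(catch_rate(9204, 4))                 # 2301.0
print(highest_rate(FILTERS))               # free--online--poker
print(highest_rate(FILTERS, min_span=50))  # qualitypornlinks4u.info
```

The `min_span` cutoff is there because very short-lived filters can post huge rates from a single burst of spam, which is why the post reports a second winner for spans over 50 days.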






