I've got a huge dataset of spam from the past few years of running my site. Unfortunately the Movable Type interface available to me at the moment won't let me upload the whole Excel spreadsheet, but I'll be happy to email it to anyone who wants to see it. Here are some highlights.

Top ten filter strings that have blocked the most spams:

51175 - <h1>

32347 - texas-hold-em

20657 - texas-holdem

17727 - qualitypornlinks4u.info

14717 - free-online-poker

13310 - payday-loan

11606 - hey.com

9204 - free--online--poker

6601 - 00120.com

6304 - pornlink4u.info

Five filter strings that were created on 1/1/2005 and caught a spam today, 8/24/2006, giving each a useful span (creation date to most recent catch) of 600 days:

2944 - viagra

2216 - (diet|penis)[\w\-_.]*(pills|enlargement)[\w\-_....

700 - hydrocodone

382 - freewebs.com

175 - xenical
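For anyone who wants to check the 600-day figure, here is a minimal Python sketch. It assumes "useful span" means the number of days between a filter's creation date and its most recent catch; the function name `useful_span_days` is made up for illustration and is not part of the spreadsheet.

```python
from datetime import date


def useful_span_days(created: date, last_hit: date) -> int:
    """Days between a filter's creation and its most recent catch (assumed definition)."""
    return (last_hit - created).days


# Dates quoted in the post: created 1/1/2005, last catch 8/24/2006.
span = useful_span_days(date(2005, 1, 1), date(2006, 8, 24))
print(span)  # 600
```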

Total filter strings: 8144

Total spams caught: 417,385

Filter strings that never caught a spam: 63%

The top 14 filters caught 50% of the spams -- that is, the top 0.17% of filters caught half the spam.
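As a rough cross-check of how concentrated the catches are, the sketch below uses only numbers quoted in this post: the top-ten counts, the total spams caught, and the total number of filters. The helper `filters_needed` is a hypothetical name of my own. Since only the top ten counts are listed here, it cannot reproduce the "top 14" result itself, only show that the top ten alone fall a little short of half.

```python
# Counts for the top ten filter strings, as listed above.
TOP_TEN = [51175, 32347, 20657, 17727, 14717, 13310, 11606, 9204, 6601, 6304]
TOTAL_SPAMS = 417_385
TOTAL_FILTERS = 8144


def filters_needed(counts, total, share=0.5):
    """Smallest number of top filters whose catches reach `share` of `total`.

    Returns None if the supplied counts never reach that share.
    """
    running = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return n
    return None


top_ten_share = sum(TOP_TEN) / TOTAL_SPAMS
top_14_fraction = 14 / TOTAL_FILTERS

print(f"{top_ten_share:.1%}")    # 44.0%
print(f"{top_14_fraction:.2%}")  # 0.17%
print(filters_needed(TOP_TEN, TOTAL_SPAMS))  # None: ten filters are not quite enough
```

So the top ten filters account for about 44% of everything caught, which fits with needing only a few more to reach half.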

Median number of spams caught by a filter: 11

Mean number of spams caught by a filter: 131

Standard deviation of number of spams caught by a filter: 1307

Defining rate as spams caught divided by useful span in days, the filter with the highest catch rate is "free--online--poker" with 9204 hits in 4 days, for 2301 spams per day. The filter with the highest rate among those with a useful span of over 50 days is "qualitypornlinks4u.info" with 17727 hits in 62 days, for a rate of about 286 spams per day.
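The rate calculation is simple enough to sketch in a few lines of Python. The function name `catch_rate` is hypothetical, and the inputs are figures quoted elsewhere in this post: the "free--online--poker" filter (9204 hits in 4 days) and the "viagra" filter (2944 hits over its 600-day useful span).

```python
def catch_rate(hits: int, span_days: int) -> float:
    """Spams caught per day of useful span."""
    if span_days <= 0:
        raise ValueError("useful span must be at least one day")
    return hits / span_days


# Short-lived burst filter versus a long-lived one.
print(catch_rate(9204, 4))               # 2301.0
print(round(catch_rate(2944, 600), 1))   # 4.9
```

The contrast shows why rate and total hits rank filters differently: a filter that only lived for a few days during a flood can out-rate one that has been quietly catching spam for nearly two years.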



