filter.pl
a simple context based spam filter
by Stunt Pope <markjr@shmOOze.net>
copyright 1997 Mark Jeftovic/Private World Communications

Current version at: http://AntiSpam.shmOOze.net/filter/
This version: 1.02d

In this file:
	1.0 Disclaimer
		1.1 Copyright
		1.2 Contributors
	2.0 Preamble
	3.0 Requirements
	4.0 Installation
		4.1 dbm database setup
		4.2 expire_spam_db.pl
	5.0 Usage
		5.1 Usenet Harvesters and $POST_ADDR
		5.2 POP Client support
		5.3 Tripwyres
		5.4 Additional mail patterns
	6.0 Todo
	7.0 Contact
		7.1 Developers' mailing list
	
1.0 Disclaimer

	The usual: this is provided "as is", with no promises and no guarantees.
	I have used this program for upwards of 6 months, and in that
	time it has never corrupted my mbox or (to my knowledge) misplaced
	a message in limbo. Neither I, Private World Communications, nor anyone
	contributing to this project is responsible for anything
	good or bad that results from your use of this software.

	The status of filter.pl is: EXPERIMENTAL

1.1 Copyright

	This was coded by Mark Jeftovic (a.k.a. Stunt Pope), whose entire
	intellectual portfolio, including but not limited to ideas,
	thoughts, and brainwaves, is owned completely and in its entirety
	by Private World Communications. You cannot bundle this software
	into a commercial package, or resell it, or in any way derive
	income from it without obtaining a commercial license from PWC.
	You are free to modify it, give it away for free, or (in the
	immortal words of Dr. Bubonic) "print off hard copy and hand them 
	out to the homeless". Please keep the distribution intact if you do 
	this, and please send us your improvements and additions if you 
	feel energetic enough to code any.

1.2 Contributors	

	Aside from the original coder the development of this software 
	has been aided by:

		-Michael Mattes <mattes@angel.com>
		-Justin Mason 	<jmason@iona.com>
		-Karl Anderson	<kra@pobox.com>

2.0 Preamble

	This is my first run at a context-based spam filter, coded in perl.
	Many of the other spam filters out there screen incoming email by
	address masks, etc. What I am attempting here is to make a filter
	that screens mail based on its content.

	Surprisingly, it's not very difficult. Between this program and
	blocking spam domains with sendmail, I'm finding that upwards of 90% of
	the spam that does get through to my account is flagged and filed by
	filter.pl. Legit messages do occasionally get flagged as spam,
	but 90% of those are posts to an antispam list I moderate, where the
	spam quoted in the posts trips the filter. (I'm also finding that the
	number of false positives has dropped since the original version.)

	The status of this is PURELY EXPERIMENTAL. It works fine for me, and
	I think the antispam community is better served by releasing it
	now, so that others who feel so inclined can help out.

	See contact and todo sections below.

3.0 Requirements

	perl 5 (because it uses symbolic references in function calls), and
	a .forward file.

	This has only been tested on hosts running sendmail 8.7.x and 8.8.x.
	It uses flock() file locking. If you run it with a different mail
	server and it works, please let me know.
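
	For anyone checking whether their system's locking will cooperate,
	here is a minimal sketch of flock()-style locking around an append to
	an mbox-style file. The sub name append_locked is hypothetical and is
	not taken from filter.pl; it only illustrates the locking pattern.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);

# Hypothetical helper (not from filter.pl): append one message to a
# file while holding an exclusive flock() on it.
sub append_locked {
    my ($path, $message) = @_;
    open(my $fh, '>>', $path) or die "cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "cannot lock $path: $!";
    # Another writer may have extended the file while we waited.
    seek($fh, 0, SEEK_END)    or die "cannot seek $path: $!";
    print {$fh} $message      or die "cannot write $path: $!";
    close($fh)                or die "cannot close $path: $!";  # releases the lock
    return 1;
}
```

	If flock() is not available on a given platform, perl will die at
	the flock() call, which is a quick way to find out.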

4.0 Installation

	* Edit the line near the top of filter.pl that reads:

	push(@INC, "/path/to/your/filter");

	to match where on the system you have installed filter.pl

	* Edit config.h to suit your system. Most if not all of the
	variables should be self-evident. Having done that, my advice
	is to run perl -c on config.h to double-check that you haven't
	made any typos.

	* Once it looks ok, run it from the command line using
	TESTSPAM.txt as its input:

	markjr@bofh~> ./filter.pl < TESTSPAM.txt

	The contents of TESTSPAM.txt should then turn up in whatever file you
	designated as the $BIT_BUCKET (NOTE: I would *NOT* set this to
	/dev/null), OR if you've set $JUST_TAG the subject line will be
	rewritten appropriately, OR if your $MAIL_READER is set to $POPPER
	the X-Spam-Filter header line will be added.

	If all this works ok, you can then put the following in a .forward
	file in your homedir:

	"| /path/to/your/home_dir/filter/filter.pl -u yourusername"

	(NOTE: I've seen different syntaxes. Some o/s'es apparently don't
	require the double quotes, others do. The form above works on slackware
	2.0.27. See jmason's note below regarding the -u switch.)

	[jmason note -- Mark, you may want to change that to:
	"| /path/to/your/home_dir/filter/filter.pl -user yourusername"
	as older versions of sendmail at least will strip out multiple
	invocations of a .forward script for the same message, even if
	they're being run by different users! My patch will ensure that
	filter.pl will ignore the -user arguments.]

	[markjr note -- agreed, although I changed it from -user to -u
	on the chance that this eventually uses command line switches
	we can then process them with getopt() or getopts().]
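
	As a rough sketch of what that might look like, here is the -u
	switch being picked up with getopts() from the standard Getopt::Std
	module. The sub name parse_switches and the usage message are
	hypothetical; filter.pl itself currently just ignores the argument.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: pull a "-u username" switch out of an argument list.
sub parse_switches {
    my @args = @_;
    local @ARGV = @args;      # getopts() always works on @ARGV
    my %opt;
    getopts('u:', \%opt) or die "usage: filter.pl -u username\n";
    return ($opt{u}, @ARGV);  # the username, then any leftover arguments
}
```

	Called as parse_switches(@ARGV) near the top of a script, this
	leaves room for more single-letter switches later by extending the
	'u:' spec string.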

4.1 dbm database setup

	1.02 onwards supports dbm database logging and lookups. If an
	inbound email gets flagged as spam, its subject, message-id and
	from line are added to a dbm file. If it's already in the
	database, its count gets incremented and its last modified date
	updated. The first rule to fire is now a check against this database
	(if $DBASE_TYPE is set).

	The database is in dbm or db format (for now), see config.h for usage.
	The format is:

	subject = message-id from-line count created modified

	subject, message-id and from-line are the header lines with
	all non-alphanumeric characters and whitespace taken out,
	and folded to lower case ("MAKE $$ FAST" becomes "makefast"),
	null terminated.

	count is how many times that record has been flagged as spam.

	created & modified are in seconds as returned by the time() call
	in perl.
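
	The record handling described above can be sketched roughly as
	follows. This uses a plain hash where the real thing would use a
	tied dbm file. The sub names are hypothetical, and joining the
	fields with NUL bytes is my reading of "null terminated", so treat
	that part as an assumption.

```perl
use strict;
use warnings;

# Strip everything except letters and digits, then fold to lower case.
sub normalize_header {
    my ($text) = @_;
    (my $key = lc $text) =~ s/[^a-z0-9]//g;
    return $key;
}

# Assumption: the value is "message-id, from-line, count, created, modified"
# joined with NUL bytes. The real on-disk layout may differ.
sub update_record {
    my ($db, $subject, $msgid, $from, $now) = @_;
    my $key = normalize_header($subject);
    my ($count, $created) = (0, $now);
    if (defined $db->{$key}) {
        # Existing record: keep its count and creation time.
        (undef, undef, $count, $created) = split /\0/, $db->{$key};
    }
    $db->{$key} = join "\0", normalize_header($msgid), normalize_header($from),
                             $count + 1, $created, $now;
    return $db->{$key};
}
```

	Note the single quotes needed around 'MAKE $$ FAST' when trying
	this out, since perl would otherwise interpolate $$ as the
	process id.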

	If you opt to use this, the database should go somewhere on the
	system common to everyone who will use it (e.g. mine is
	/usr/local/etc/filter.db) and be writable by whoever your local
	delivery agent runs as. Mode 0665, uid mail, gid mail works on
	my linux box just fine.

4.2 expire_spam_db.pl

	To guard against winding up with a huge spam hit database, 1.02c
	and up ship with a script called expire_spam_db.pl, which can
	be run out of a crontab to expire old items out of the database.

	example: expire_spam_db.pl -m -d5

	This would expire any items that have not been modified in 5 days
	(meaning: no email with that subject has come in during the last 5 days).

	Run without args for further usage.
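
	The idea behind expiring on the modified time can be sketched like
	this. It is not the code from expire_spam_db.pl: the sub name is
	hypothetical, a plain hash stands in for the dbm file, and the
	NUL-joined field layout (message-id, from-line, count, created,
	modified) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every record whose "modified" time is more
# than $days days before $now. Returns the number of records removed.
sub expire_by_modified {
    my ($db, $days, $now) = @_;
    my $cutoff = $now - $days * 24 * 60 * 60;
    # Field 4 (zero-based) is the modified time in the assumed layout.
    my @stale = grep { (split /\0/, $db->{$_})[4] < $cutoff } keys %$db;
    delete @{$db}{@stale};
    return scalar @stale;
}
```

	Collecting the stale keys first and deleting afterwards avoids
	modifying the hash while it is being walked.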

5.0 Usage

	You should be able to figure out how to tweak its performance by
	reading the config file. If too much legit mail is getting flagged,
	for example, set $STRIKE_OUT higher, or if a particular rule doesn't
	suit you, edit rules.hdr.pl (header rules) or rules.bod.pl (body
	rules) and set the variable $active for the rule to 0.
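
	To make the strike idea concrete, here is a toy sketch of counting
	strikes against a threshold. Everything in it is hypothetical: the
	rule names, the rule table layout, the value of $STRIKE_OUT and the
	"greater than or equal" comparison are all assumptions, and the
	real rules live in rules.hdr.pl and rules.bod.pl.

```perl
use strict;
use warnings;

our $STRIKE_OUT = 3;   # assumed value; the real one is set in config.h

# Hypothetical rule table. Each rule has an on/off switch (like $active)
# and a test that returns true when the message trips it.
my @rules = (
    { name => 'all_caps_subject', active => 1,
      test => sub { $_[0] =~ /^Subject: [^a-z]+$/m } },
    { name => 'money_talk',       active => 1,
      test => sub { $_[0] =~ /\$\$\$/ } },
    { name => 'remove_me',        active => 0,   # switched off, never fires
      test => sub { $_[0] =~ /to be removed/i } },
    { name => 'bulk_precedence',  active => 1,
      test => sub { $_[0] =~ /^Precedence: bulk$/mi } },
);

# Count how many active rules fire on the message.
sub count_strikes {
    my ($message) = @_;
    return scalar grep { $_->{active} && $_->{test}->($message) } @rules;
}

# Assumption: a message is spam once it reaches $STRIKE_OUT strikes.
sub is_spam { return count_strikes($_[0]) >= $STRIKE_OUT ? 1 : 0 }
```

	Raising $STRIKE_OUT means more rules have to agree before a
	message is flagged, which is why it cuts down on false positives.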

5.1 Usenet Harvesters and $POST_ADDR	

	$POST_ADDR is an optional var in config.h. It works like this:
	I created an additional A record for my mail server, so
	mail.privateworld.com is the same box as shmOOze.net. In all of
	my news readers I set my email address to markjr@mail.privateworld.com,
	and I set $POST_ADDR to this as well. If mail comes in for that address
	and the subject line does not begin with "re:" (case insensitive),
	then it's a strike.
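
	A minimal sketch of that test, assuming the recipient header and
	subject have already been pulled out of the message. The sub name
	and its arguments are hypothetical, and allowing leading whitespace
	before "re:" is my own assumption.

```perl
use strict;
use warnings;

# Returns 1 (a strike) when mail addressed to $post_addr has a subject
# that does not look like a reply, and 0 otherwise.
sub post_addr_strike {
    my ($post_addr, $to, $subject) = @_;
    # The rule is optional: with no $POST_ADDR set it never fires.
    return 0 unless defined $post_addr && length $post_addr;
    # Only mail sent to the usenet-only address is of interest.
    return 0 unless index(lc $to, lc $post_addr) >= 0;
    return $subject =~ /^\s*re:/i ? 0 : 1;
}
```

	The reasoning is that a real person answering a usenet post will
	almost always send a reply, while a harvester's mailing will not.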

5.2 POP Client support

	POP clients are supported as of 1.01c. You still need to know what the
	path to perl is on your POP host, and be able to at least get in there
	to install the package and .forward file. (This is all do-able via ftp
	if you don't have shell access on the POP server.)

	In config.h set $MAIL_READER to $POPPER. Have a look at $X_HDR_FLAG and
	$POP_FLAG. Unless you change these values, you will get an extra
	X-Header in the emails flagged as spam, a la:

	X-Spam-Filter: filter.pl-1.01c [SMELLS LIKE SPAM]

	Depending on your popper, you could filter either for the specific
	header or for the string in it.
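
	For illustration, tagging a message with such a header and then
	checking for it might look like the sketch below. Both sub names
	are hypothetical, and this is not how filter.pl itself is written.
	It only shows the header-block handling involved.

```perl
use strict;
use warnings;

# Hypothetical helper: add one header line at the end of the header block.
sub add_spam_header {
    my ($message, $header) = @_;
    # Headers end at the first blank line, so insert just before it.
    unless ($message =~ s/\n\n/\n$header\n\n/) {
        # No blank line: assume the text is all headers ending in a newline.
        $message .= "$header\n";
    }
    return $message;
}

# Client-side check: is the flag header present in the header block?
# Only the headers are searched, so spam quoted in a body is ignored.
sub has_spam_header {
    my ($message) = @_;
    my ($headers) = split /\n\n/, $message, 2;
    return 0 unless defined $headers;
    return $headers =~ /^X-Spam-Filter:/mi ? 1 : 0;
}
```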

	Alternatively, as of 1.01e you can also set $JUST_TAG and your
	subject line will be rewritten (see config.h). 

	[Note: there was a reported bug with 1.02a in which some POP
	users found their spool files were missing the
	blank lines between the emails. 1.02b was meant to fix this, but
	if you encounter problems, please let me know.]

5.3 Tripwyres

	The format has changed as of 1.01d, and is now of the form:
	[number]:[flags]:regex

	-number is the minimum number of matches that makes a strike.
	-flags are optional pattern matching flags; the only one supported
	thus far is 'i' for case insensitivity.
	-regex is the actual regular expression.
	(There is a note about these changes in the Todo section.)
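
	One way to read and apply a line in that format is sketched below.
	This is not the tripwyre code from filter.pl. The sub names are
	hypothetical, and the sketch assumes the number is always present
	while the flags field may be empty.

```perl
use strict;
use warnings;

# Parse one "number:flags:regex" line into a threshold and a compiled regex.
sub parse_tripwyre {
    my ($line) = @_;
    chomp $line;
    # Limit the split to 3 fields so colons inside the regex survive.
    my ($number, $flags, $regex) = split /:/, $line, 3;
    die "malformed tripwyre: $line\n"
        unless defined $regex && $number =~ /^\d+$/;
    my $re = $flags =~ /i/ ? qr/$regex/i : qr/$regex/;
    return ($number, $re);
}

# Score a strike when the regex matches at least $number times in $text.
sub tripwyre_strike {
    my ($line, $text) = @_;
    my ($number, $re) = parse_tripwyre($line);
    my $matches = 0;
    $matches++ while $text =~ /$re/g;   # count matches, not capture groups
    return $matches >= $number ? 1 : 0;
}
```

	For example, the line "2:i:free money" would score a strike on any
	text containing "free money" twice or more in any mix of case.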

5.4 Additional mail patterns

	If you have additional aliases or other addresses that resolve
	to your address, you can add them, one address (or regex) per line,
	to the "mail_patterns" file.

	You should also be able to enter any listserv addresses you
	find getting false positives (e.g. Computer Underground Digest
	always gets flagged for me). This should at least stop them
	from firing hdr_1. I'm still testing the accuracy here.
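
	A rough sketch of how a one-pattern-per-line file like this could
	be loaded and matched against a recipient header. The sub names are
	hypothetical, and matching case-insensitively and skipping blank
	lines are assumptions of mine, not documented behaviour.

```perl
use strict;
use warnings;

# Read one pattern per line from an open filehandle, skipping blank lines.
sub load_patterns {
    my ($fh) = @_;
    my @patterns;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;
        push @patterns, qr/$line/i;   # each line is treated as a regex
    }
    return @patterns;
}

# True when any of the patterns matches the recipient header.
sub addressed_to_me {
    my ($to_header, @patterns) = @_;
    return (grep { $to_header =~ $_ } @patterns) ? 1 : 0;
}
```

	Because each line is compiled as a regex, a plain address works as
	is, though an unescaped "." in it will match any character.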

6.0 Todo

	* make an installer script (e.g. ./config; make test; make install)

	* better docs

	* better pop support?

	* Add a rule to check for embedded html, again, trivial.

	* a certain perl guru (I won't name names but his IRC nick is merlyn) 
	thinks the new method employed by tripwyres is deeply flawed. If
	anyone can think of a better way to do it, by all means clue me in.

	* binary attachments support. Basically, make it not scan any binary 
	attachments, as they often create spurious strikes.

	* some way to figure out whether to use AnyDBM_File,
	DB_File, GDBM_File or whatever (if the dbase is used).

7.0 Contact

	My email is markjr@shmOOze.net. The most recent distribution of this
	software will reside at http://antispam.shmOOze.net/filter. If you
	need my mailing address for some reason, my office address can
	be obtained using my NIC handle: MJ177

7.1 Developers' mailing list

	In order that those interested in furthering the development of
	filter.pl have a place to trade ideas and make sure we aren't
	re-inventing somebody else's wheel, the filter mailing list
	has been set up.

	To subscribe send a message with "subscribe" in the message body
	to: <filter-request@shmOOze.net>

	To unsubscribe send a message with "unsubscribe" in the message body
	to: <filter-request@shmOOze.net>

	Send a message to the list via <filter@shmOOze.net>

----
Mark Jeftovic           aka: mark jeff or vic, stunt pope.
markjr@shmOOze.net      http://www.shmOOze.net/~markjr
PWC's BOFH              http://www.PrivateWorld.com
irc: L-bOMb             Keep `em Guessing

