SpamAssassin Settings

Spam Filtering Overview

The SpamAssassin program. SpamAssassin is a powerful spam filter that was installed on the MUSes network in September 2002. At the time spam was beginning to emerge as more than a minor nuisance, and I had been looking into ways to effectively deal with spam on our local system. Of all the alternatives, SpamAssassin - a free program in the best of the Unix/Linux traditions - was almost universally regarded as the best option, at least for Unix/Linux platforms. My own testing, using a private copy of SpamAssassin that I had downloaded from the web and installed under my home directory, confirmed this: after a short period of tweaking the settings, it proved to be nearly 100% accurate in correctly identifying spam, while not blocking legitimate email.

Approaches to spam filtering. Spam filtering can be done in many ways. For example, the filtering can be performed systemwide, at the entry level to a network, or on a per-user basis, and messages identified as spam can be dealt with in various ways, including automatic deletion. The two guiding principles behind the spam filtering set-up on our systems are the following:

  • Spam filtering must be done on an individual basis and activated by each user. There is no systemwide spam filtering. In other words, all email sent to an account at math.uiuc.edu reaches that account unfiltered; it is up to the user to set up filtering of incoming email, and to decide to what extent such filtering should be performed. This is an important principle for several reasons:
    • It is consistent with the "opt-in" philosophy. Just as many people object to being put on mailing lists without their explicit approval ("opt in"), people should not have their email filtered without their explicit consent.
    • There is no universal agreement on what constitutes spam, and there are many grey areas, where unsolicited email (e.g., a book announcement by a publisher) might be considered spam by some, but welcome and legitimate email by others.
    • While highly accurate, SpamAssassin is occasionally prone to "false positives", and by completely blocking such email at the entry level to math.uiuc.edu, some legitimate messages would likely get blocked.
    • By performing the filtering at the user level, users can customize the filter to their own purposes, for example, by adding addresses to a "whitelist".

    What SpamAssassin does is make it easy for the average user to set up their account for spam filtering. SpamAssassin is highly customizable, and the customization can be done at the user level. Other approaches, such as creating individual procmail recipes based on spam received, require a high level of expertise which few users possess, and are more time consuming and less effective than SpamAssassin.

     

  • No email is deleted without explicit user request. SpamAssassin itself does not delete email; all it does is perform certain tests on email messages that are passed through the program, and based on the outcome of such tests, assign a numerical score ("hits") to the message. If the number of "hits" exceeds a predetermined threshold, SpamAssassin adds a spam tag to the subject line of the message (plus details about the tests that caused the "hits"), and then passes the message on for further processing. The user can then, via a ".procmailrc" file as described below, process such email further. The two sample procmail files shown below cause any spam-tagged email to be moved to a special "spam folder" rather than being deleted. This is an important safeguard against "false positives", as any message tagged as spam can be retrieved from the spam folder until the user explicitly deletes the message from the spam folder, or deletes the spam folder.

To use SpamAssassin most effectively, you may want to customize your set-up. This can be done at any time, and you may want to use SpamAssassin as is, in its default configuration, for a few days, before trying any customization.

Customization

Customizations are done through a file called "user_prefs" located in a subdirectory ".spamassassin" (note the period at the beginning of the file name) under your home directory. The SpamAssassin program should have created the subdirectory and a default version of the user_prefs file automatically. (If not, create the directory and an empty file named user_prefs inside this directory.)
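
If the directory or the file is missing, the following commands are one way to create them by hand. This is a minimal sketch that assumes the standard location under your home directory described above; the commands use only syntax that works in both csh and sh-style shells.

```shell
# Create the per-user SpamAssassin directory; -p makes this a no-op if it already exists.
mkdir -p "$HOME/.spamassassin"
# Create an empty user_prefs file; touch leaves the contents of an existing file alone.
touch "$HOME/.spamassassin/user_prefs"
# Confirm that the file is now in place.
ls -l "$HOME/.spamassassin/user_prefs"
```

Running these commands again later is harmless, since neither of them overwrites an existing user_prefs file.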

The user_prefs file is pretty much self-explanatory. Note that all lines beginning with a hash mark (#) are interpreted as comment lines; the default version of user_prefs has all lines commented out in this manner, so in effect it is equivalent to an empty file. To make an instruction active, "uncomment" the line by removing the hashmark at the beginning.

The user_prefs file can be used to set or adjust dozens of variables. However, for most users only a few variables are worth adjusting, with "whitelist_from" (which specifies a "whitelist" of addresses) being by far the most useful one (and for many the only variable that should be explicitly set). Here is how to set or change these variables.

  • The spam threshold. This is specified by a line of the form
    required_hits 5

    The "hits" represent a numerical score that needs to be reached in order for a message to be marked as spam. The higher the value, the less sensitive the filter is. The default value is 5, which in my experience works pretty well. With this value, in conjunction with the use of a "whitelist" (see below), false positives are extremely rare, so there is no reason to raise the spam threshold above 5. If you are using the "enhanced" version of the procmail sample file, which sends (copies of) messages with scores between 3 and 5 to a "probablespam" folder, you might want to take a look at this folder after a period of time (say, a few weeks), to check how many legitimate email messages fall into that range. If all or almost all of the "probablespam" messages are indeed spam, then lowering the spam threshold to 3 might make sense.

     

  • The whitelist. The single most effective tool in preventing false positives is the whitelist, set via the "whitelist_from" command. Here you can put addresses that you do not want to be filtered, as in the following example:

    whitelist_from  *.ac.uk *.org  *amazon.com johndoe@aol.com 

    The syntax is simple: just separate the addresses by spaces. Addresses can either be specific email addresses, such as johndoe@aol.com, or address patterns formed with wildcards, such as *amazon.com. [Note that the procmail sample files provided above already whitelist *.edu addresses, so including the pattern *.edu in the whitelist_from command is not necessary (though there is no harm in having it there).]

    You can have multiple whitelist_from commands in your user_prefs file. Thus, if you want to whitelist additional addresses, you can do so by simply appending an appropriate "whitelist_from" line at the end of your user_prefs file, without having to worry about accidentally messing up existing whitelist addresses, or running into problems associated with overlong lines.

    Addresses you might want to whitelist include the following:

    • Academic addresses. It is extremely rare for spam to originate from an academic address, so whitelisting such addresses is unlikely to have a noticeable impact on the amount of spam you receive. As mentioned, email from *.edu addresses of colleges and universities in the U.S. is already whitelisted via the procmail file (assuming you are using the sample files), so there is no need to whitelist *.edu via a whitelist_from command. On the other hand, if you frequently correspond with people at academic institutions in other countries, you might want to add the corresponding patterns to the whitelist, for example, *.ac.uk for universities in the U.K., *.edu.cn for universities in China, *.uni*.de for German universities, etc.
    • Commercial addresses of companies you have transactions with. If you use your account for commercial transactions such as an order or a travel reservation, it is a good idea to add appropriate patterns to the whitelist, e.g.:

      whitelist_from *amazon.com *orbitz.com *expedia.com

      This is because email from commercial outfits is more likely to display the characteristics of a spam message, even if it is a legitimate piece of email. For example, confirmations of orders or reservations are usually sent as non-personal autoreplies, which may cause the messages to be flagged as spam.

    • Mailing lists. Another possible source of false positives is messages sent to large mailing lists (e.g., from a school, or a church). A long list of recipients in the "To:" field may, in the presence of other factors, cause SpamAssassin to flag the email as spam. Assuming you have whitelisted *.edu addresses, this is only a problem for email originating from a non-edu account. In that case, simply add the address from which the mailing list gets sent to the whitelist.
    • Specific addresses of regular correspondents. Ordinary, non-automated correspondence from individuals should normally score well below the spam threshold, so there is no need to whitelist such addresses. Exceptions are correspondents with addresses at hosts often used by spammers, such as yahoo.com or hotmail.com, or correspondents from countries like Korea, China, or Japan that use non-Western character sets. The use of non-Western character sets, especially in the presence of other "spam triggers", may be enough to cause the message to be marked as spam. (See the ok_locales setting below for an alternate way to deal with this problem.)

       

     

  • The blacklist. SpamAssassin also has a blacklist feature that allows one to list email addresses to be "blacklisted" so that all email from these addresses is treated as spam and redirected to the spam folder. The blacklist is entered via the "blacklist_from" command, which has the same syntax as the "whitelist_from" command. For example:

    blacklist_from johndoe@aol.com *@hotmail.com

    However, while whitelists are very effective in preventing false positives, blacklists are of little use in preventing false negatives, and there is normally no need to use blacklists. For one, SpamAssassin is so uncannily accurate in properly identifying spam that the number of spam messages that are missed by SpamAssassin is very small.

    More importantly, however, as a spam prevention measure, blacklisting specific addresses is largely ineffective, since spammers change their Internet providers and email addresses frequently, and return email addresses are often forged.

    Blacklisting entire domains of ISPs providing free email accounts, such as hotmail.com or yahoo.com, would be somewhat more effective as such free accounts are often used by spammers as throw-away accounts. However, such a strategy would also block out legitimate email from these providers. I do get a fair amount of professional email from such accounts, and I would not want that to be shut off. Some students use accounts at yahoo.com and similar services rather than their university account, as do people in countries or at universities with inadequate internet connections.

    While blacklists are of little value as a generic spam fighting tool, there are special situations where using a blacklist is appropriate and effective. An obvious example would be a case of continued targeted email harassment by a particular party. A blacklist can also be useful in connection with mailing lists. If the volume of mail coming in through the list becomes overwhelming, an easy way to put a temporary stop to such mail is by adding the originating address for the list to the blacklist. This causes all list mail to be redirected to the spam folder where it can be examined at leisure, or simply deleted. Also, getting off mailing lists is not always easy, and following removal instructions doesn't always work. In those cases, blacklisting the originating address is an effective way to eliminate undesired email.

     

  • [Added 1/20/04] The "ok_locales" and "ok_languages" settings. Email messages from countries such as Korea that use non-Western character sets are far more likely to be spam than messages using the usual Western character set, and SpamAssassin accordingly assigns higher hit scores to such messages. If you regularly correspond with people from such countries, you might want to explicitly "okay" email from these countries via the "ok_locales" and "ok_languages" variables. The default settings for both variables are "en" (for English/Western character sets and languages). By adding appropriate codes, you can "okay" additional character sets and/or languages. For example, to "okay" mail using Korean character sets and/or in Korean language (code "ko"), add the following:

    ok_locales en ko 
    ok_languages en ko

    The codes for Chinese and Japanese languages/character sets are "zh" and "ja", respectively. You can use "all" to "okay" all languages and character sets:

    ok_locales all 
    ok_languages all

    With this setting, a foreign language, or a foreign character set will have no effect on the spam score. (Of course, all other spam tests are still performed, and the message can still end up being flagged as spam, if its total hit score exceeds the spam threshold.) The trade-off for allowing foreign languages or character sets is likely a slight increase in the spam you receive.

     

  • Further information. Complete documentation of the customizations you can make in your user_prefs file can be found in the manual pages for the SpamAssassin configuration, which can be accessed with the command "man Mail::SpamAssassin::Conf". (Note the two double colons.)
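
The settings described above can also be appended to user_prefs from the command line instead of with an editor. The following is a sketch only: the particular whitelist patterns and language codes are sample values taken from the examples above, and you should substitute your own. The commands use only syntax that works in both csh and sh-style shells.

```shell
# Make sure the preferences file exists (harmless if it already does).
mkdir -p "$HOME/.spamassassin"
touch "$HOME/.spamassassin/user_prefs"

# Append one setting per line. The single quotes keep the shell from
# expanding the * wildcards before they reach the file.
echo 'whitelist_from *.ac.uk *amazon.com' >> "$HOME/.spamassassin/user_prefs"
echo 'ok_locales en ko' >> "$HOME/.spamassassin/user_prefs"
echo 'ok_languages en ko' >> "$HOME/.spamassassin/user_prefs"

# Show the active lines (everything that is neither a comment nor empty).
grep -v '^#' "$HOME/.spamassassin/user_prefs" | grep -v '^$'
```

Because ">>" appends to the file rather than replacing it, existing settings and comments are left untouched. Each run adds the lines again, so run the echo commands only once per setting.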

Checking The Spam Folder

Once you have set up your account as shown above, all email identified as spam will be rerouted to a spam folder. If you have been receiving a lot of spam, you will notice a substantial reduction in your email volume as all spam-tagged email gets rerouted to that folder.

You should get into the habit of periodically checking that folder, to see if any legitimate email has made it into the spam folder and, if necessary, update your whitelist as shown above. Most common email programs (including pine, elm, mail, mailx, mutt, dtmail) allow you to specify a folder rather than the default folder ("inbox") with the "-f" option, e.g., "mail -f spam" or "mail -f $HOME/mail/spam". The former command requires you to be in the directory in which the "spam" folder is located; the latter command works from any directory and assumes the spam folder is in the directory "mail" off your home directory. (This is the setting in the procmail sample files above; if you have changed that setting, you will need to make a corresponding change in the "-f" specification.)

Quick access to the spam folder. A convenient way to access your spam folder is via an alias (shortcut) "checkspam" that does the equivalent of the "mail -f" command above. To set up such an alias, add the following line at the end of the .cshrc file in your home directory:

alias checkspam 'pine  -f $HOME/mail/spam'

(Note the two single quotes (forward ticks, not backticks) enclosing the argument of "checkspam".) This assumes that your spam folder is in the default location provided by the sample files above, and that you use pine as your mail reader. If you use another mail reader, replace "pine" by the name of the program you normally use to access your mail. The same syntax works for elm, mail, mailx, mutt, and dtmail. (Dtmail is the default mail reader in the CDE environment - it's the one you get if you click on the envelope icon at the bottom of the screen.) (Other mail readers, such as Netscape mail, probably have similar functionality, but you may have to do things slightly differently.)

Since the ".cshrc" file is only being read at the beginning of a login session, you need to log out and log back in to make the alias active. (Alternatively, you can explicitly read in the file using the command "source $HOME/.cshrc".) Once you have done that, the command "checkspam" will display an index of all messages accumulated in the spam folder, and you can deal with them in whatever way you want.

I check my spam folder in this manner every couple of days. The process is very quick and takes only minutes. Since these messages are separate from legitimate email messages, I don't even bother to delete messages - I just let the folder accumulate and periodically skim newly arrived messages to check for any false positives.

If you want to get rid of the accumulated spam messages (after having dealt with any "false positives" in the spam folder), simply delete the entire spam folder, using the Unix "rm" command. The next time a spam message comes in, a new folder will be created.

Spam Statistics

Here are some statistics and conclusions from an analysis of the spam I received during the period 9/30/2002 - 12/9/2002.

 

  • Spam volume. My spam folder collected 200 messages over this 10-week period, which works out to about 20 messages per week. This is probably in the middle of the range; some may receive a lot more, while others receive less spam. I have been careful in guarding my email address (for example, by not posting on Usenet or in chat rooms, the two most common sources of email addresses for spammers), but my address is listed on a number of webpages, some of which I maintain myself, while others are out of my control. (To foil harvesting of my email address from web pages, I never use a "mailto" link in conjunction with my email address. I found this to be an effective way to reduce the chance of my email address being harvested, as the software used for that typically looks only for addresses given in mailto links and will not catch addresses given in plain text.)

     

  • Hit scores. My spam threshold during the period covered was set at 5. The average hit score was 19.2, the maximal hit score was 40.4, and the lowest score was 5.0 (coinciding with my spam threshold). 80 percent of the messages had a hit score exceeding 10, 90 percent had a hit score exceeding 7.5. Thus, if you raise your spam threshold by a few points from the default level of 5, the overwhelming majority of the spam will still be caught by SpamAssassin. More importantly, a glance at the subject lines shows that the most egregious and offensive spam tends to have the highest hit scores. In fact, none of the spam with sexual content (as far as can be inferred from the subject line) had a hit score lower than 10.

     

  • Unique email addresses and subject lines. As mentioned above, trying to prevent or reduce spam by filtering specific email addresses or subject lines is largely a futile effort. The data clearly confirms this. Among the 200 spam emails, only a handful of email addresses occurred more than once, and the greatest multiplicity among those was 4. Most of those addresses are forged, so filtering out these addresses would not do any good. Subject-based filtering would be slightly more effective, but still eliminate only a small proportion of the spam received. About half a dozen subject lines occurred more than 3 times, with 10 as the largest multiplicity. Filtering out those emails would eliminate only about 25 percent of the spam messages, and this proportion will likely decrease in the long run, as spammers often change subject lines to foil subject-based filtering.

     

  • False positives. Out of 200 messages received during this 10 week period, five were false positives. Three of those were sent from Korean academic addresses and triggered SpamAssassin because of a foreign character set used. (I have since eliminated this problem by adding the pattern *.ac.kr to my whitelist.) One was a confirmation of a reservation from a commercial site; since I did expect that email and kept an eye on the spam folder, this was not a problem for me. All false positives had hit scores of less than 10, so increasing the threshold setting to 10 would have eliminated those false positives. However, a threshold of 5 in conjunction with a whitelist works well enough for me, so I plan to leave it for now.

     

  • Borderline spam. A handful of messages tagged as spam by SpamAssassin were of the borderline variety: messages that are unsolicited and mass-mailed to a (usually) limited number of addresses, but with legitimate content that may be of interest to a substantial number of addressees. About half a dozen of the 200 messages were in this category. They included things like book buy-back offers, information from textbook publishers, and other sales information. All of these messages had hit scores below 10, so raising the spam threshold to 10 would have prevented these messages from being marked as spam. Whether you want to do this is a matter of personal preference. Personally, I am quite happy with my current setting of 5. I check my spam folder every few days, and I'll catch anything of interest that way.

     

  • False negatives. The data covers only messages from my spam folder, i.e., those messages that received a "hit score" of 5 or higher from SpamAssassin and which were consequently tagged as spam and redirected to the spam folder. I have received some spam messages that SpamAssassin missed, but those were few and far between. I have not kept an accurate count of those misses, but there were probably fewer than a dozen during the 10-week period covered by this analysis. The reasons why these messages were missed vary. In a few cases, it was because the "From " line was forged to appear as if the email had been sent from a local (math.uiuc.edu) machine, and since I had all edu addresses in my whitelist, it did not get tagged as spam. In other cases, SpamAssassin did not flag the message as its entire content was a single sentence referencing a webpage. During my first few weeks of using SpamAssassin, not a single spam message passed the SpamAssassin filter. The frequency of false negatives has since been slowly increasing, but it is still nowhere near the point where it becomes a serious problem. Also, SpamAssassin is being constantly improved, and it will eventually catch up with any new tricks spammers concoct.