NAME

Mail::SpamAssassin::Plugin::TxRep - Normalize scores with sender reputation records


SYNOPSIS

The TxRep (Reputation) plugin is designed as an improved replacement for the AWL (Auto-Whitelist) plugin. It adjusts the final message spam score by looking up the sender's reputation and taking it into consideration.

To try TxRep out, you have to disable the AWL plugin (if present), back up its database and add a line loading this module in init.pre (AWL may be enabled in v310.pre):

 # loadplugin   Mail::SpamAssassin::Plugin::AWL
   loadplugin   Mail::SpamAssassin::Plugin::TxRep

When AWL is not disabled, TxRep will refuse to run.

Use the supplied 60_txreputation.cf file or add these lines to a .cf file:

 header         TXREP   eval:check_senders_reputation()
 describe       TXREP   Score normalizing based on sender's reputation
 tflags         TXREP   userconf noautolearn
 priority       TXREP   1000


DESCRIPTION

This plugin is intended to replace the former AWL - AutoWhiteList. Although the concept and the scope differ, the purpose remains the same - normalizing spam scores based on the sender's previous history. The name was intentionally changed from ``whitelist'' to ``reputation'' to avoid any confusion, since the resulting score can be adjusted in both directions.

The TxRep plugin keeps track of the average SpamAssassin score for senders. Senders are tracked using multiple identifiers, or their combinations: the From: email address, the originating IP and/or an originating block of IPs, the sender's domain name, the DKIM signature, and the HELO name. TxRep then uses the average score to reduce the variability in scoring from message to message, and modifies the final score by pushing the result towards the historical average. This improves the accuracy of filtering for most email.

In comparison with the original AWL plugin, several conceptual changes were implemented in TxRep:

1. Scoring - although AWL tracks the number of messages received from each sender, it does not take that count into account in any way when calculating the corrective score for a new message. For example, if a sender previously sent a single ham message with a score of -5 and then sends a second one with a score of +10, AWL will issue a corrective score pulling the result towards -5. With the default auto_whitelist_factor of 0.5, the resulting score would be only 2.5. It would be exactly the same even if the sender had previously sent 1,000 messages with an average of -5. TxRep tries to take maximal advantage of the collected data, and adjusts the final score not only by the mean reputation score stored in the database, but also by the number of messages already seen from the sender. You can see the exact formula in the section txrep_factor.

2. Learning - AWL ignores any spam/ham learning. In fact it acts against it, which often leads to a frustrating situation where a user repeatedly tags all messages from a given sender as spam (or ham), yet for every new message from that sender AWL adjusts the score back towards the historical average, which does not include the learned scores. TxRep changes this: every spam/ham learning is recorded in the reputation database, and hence taken into consideration for future email from the respective sender. See the section LEARNING SPAM / HAM for more details.

3. Auto-Learning - in certain situations SpamAssassin may declare a message obvious spam or ham, and launch the auto-learning process, so that the message can be re-evaluated. AWL, by design, did not perform any auto-learning adjustments. This plugin readjusts the stored reputation by the value defined by txrep_learn_penalty or txrep_learn_bonus, respectively. Auto-learning score thresholds may be tuned, or auto-learning disabled completely, through the setting txrep_autolearn.

4. Relearning - messages that were wrongly learned or auto-learned can be relearned. Old reputations are removed from the database, and new ones are added in their place. Relearning works better when message tracking is enabled through the txrep_track_messages option. Without it, the relearned score is simply added to the reputation, without removing the old one.

5. Aging - with AWL, every historical record of a given sender has the same weight. This means that changes in a sender's behavior, or modified SA rules, may take a long time to have an effect, or be virtually negated by the AWL normalization, especially for senders with a high count of past messages and a low recent frequency. It turns out to be particularly counterproductive when the administrator detects new patterns in certain messages, and applies new rules to better tag such messages as spam or ham. AWL will practically eliminate the effect of the new rules by adjusting the score back towards the (wrong) historical average. Only lowering the auto_whitelist_factor would help, but at the same time it would also reduce the overall impact of AWL and cast doubt on its purpose. Besides the txrep_factor (the replacement for the auto_whitelist_factor), TxRep also introduces the txrep_dilution_factor to help cope with this issue by progressively reducing the impact of past records. More details can be found in the description of the factor below.

6. Blacklisting and Whitelisting - when whitelisting or blacklisting is requested through SpamAssassin's API, AWL adjusts the historical total score by a fixed value, regardless of the number of messages recorded for the given sender. This makes it practically impossible to blacklist or whitelist any sender with a higher number of recorded scores. Even for senders with few messages, the impact of the whitelisting or blacklisting is minimal, and new messages can still be tagged incorrectly. TxRep handles black/whitelisting differently, so that it has the desired effect. It is explained in detail in the section BLACKLISTING / WHITELISTING.

7. Sender Identification - AWL identifies a sender on the basis of the email address used and the originating IP address (more precisely, the part of it defined by the mask setting). The main purpose of this measure is to avoid assigning false good scores to spammers who spoof known email addresses. The disadvantage shows with senders who send from frequently changing locations, or who connect through dynamic IP addresses that are not within the block defined by the mask setting. Their score is difficult or sometimes impossible to track. Another disadvantage shows, for example, with a spammer persistently sending spam from the same IP address, just under different email addresses. AWL will not find the spammer's previous scores unless the same email address is reused. TxRep uses several identifiers, and creates separate database entries for each of them. It tracks not only the email/IP address combination like AWL, but also the standalone email address (regardless of the originating IP), the standalone IP (regardless of the email address used), the domain name of the email address, the DKIM signature, and the HELO name of the connecting computer. The influence of each individual identifier may be tuned with the help of the weight factors described in the section REPUTATION WEIGHTS.

8. Message Tracking - TxRep (optionally) keeps track of the IDs of messages already scanned and/or learned. This prevents the reputation score from being strengthened simply by rescanning or relearning the same message multiple times. At the same time it allows proper relearning of messages that were once wrongly learned, or relearning them after the learn penalty or bonus has changed. See the option txrep_track_messages.

9. User and Global Storages - it is usually recommended to use the per-user setup of SpamAssassin, because each user may have quite different requirements, and may receive quite different sorts of email. Especially when using the Bayesian and AWL plugins, the efficiency is much better when SpamAssassin is trained on spam and ham separately for each user. The disadvantage, however, is that senders and emails already learned many times by different users need to be relearned without any recognized history whenever they arrive at another user. TxRep uses the advantages of both systems. It can use dual storages: the global common storage, where all email processed by SpamAssassin is recorded, and a local storage separate for each user, with reputation data from that user's email only. See more details at the setting txrep_user2global_ratio.

10. Outbound Whitelisting - when a local user sends messages to an email address, we assume that the user needs to see any answer too, hence the recipient's address should be whitelisted. When SpamAssassin is also used for scanning outgoing email, when local users send email through the SMTP server where SA is installed, and when internal networks are defined, TxRep will improve the reputation of all 'To:' and 'CC' addresses from messages originating in the internal networks. Details can be found at the setting txrep_whitelist_out.
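
The AWL arithmetic quoted in item 1 can be checked with a minimal sketch. It assumes the usual AWL behavior of moving the current score towards the stored mean by auto_whitelist_factor; the function name awl_adjusted_score is hypothetical and is not part of SpamAssassin. Note that the message count does not appear anywhere in the calculation, which is exactly the limitation described in item 1.

```python
def awl_adjusted_score(current: float, historical_mean: float,
                       factor: float = 0.5) -> float:
    """Move the current score towards the historical mean by `factor`.

    Illustrative only: the number of past messages is deliberately not
    a parameter, because AWL ignores it when computing the correction.
    """
    return current + (historical_mean - current) * factor


# One earlier ham message at -5, new message scoring +10, default factor 0.5.
print(awl_adjusted_score(10.0, -5.0))  # 2.5
```

The result is the same whether the mean of -5 was built from one message or from 1,000.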

The two plugins (AWL and TxRep) cannot coexist. AWL must be disabled to allow TxRep to run. TxRep reuses the database handling of the original AWL module, and some of its parameters bound to the database handler modules. By default, TxRep creates its own database, but the original auto-whitelist can be reused as a starting point: the AWL database can be renamed to the name defined in the TxRep settings, and TxRep will start using it. The original auto-whitelist database should be backed up first, to allow switching back to the original state.

The spamassassin/Plugin/TxRep.pm file replaces both spamassassin/Plugin/AWL.pm and spamassassin/AutoWhitelist.pm. Two other AWL files, spamassassin/DBBasedAddrList.pm and spamassassin/SQLBasedAddrList.pm, are still needed.


TEMPLATE TAGS

This plugin module adds the following tags that can be used as placeholders in certain options. See the Mail::SpamAssassin::Conf manpage for more information on TEMPLATE TAGS.

 _TXREP_XXX_Y_          TXREP modifier
 _TXREP_XXX_Y_MEAN_     Mean score on which TXREP modification is based
 _TXREP_XXX_Y_COUNT_    Number of messages on which TXREP modification is based
 _TXREP_XXX_Y_PRESCORE_ Score before TXREP
 _TXREP_XXX_Y_UNKNOW_   New sender (not found in the TXREP list)

The XXX part of the tag takes the form of one of the following IDs, depending on the reputation checked: EMAIL, EMAIL_IP, IP, DOMAIN, or HELO. The _Y appendix ID is used only in the case of dual storage, and takes the form of either _U (for user storage reputations), or _G (for global storage reputations).
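
The naming scheme can be sketched as a small helper that assembles a tag name from its parts. This is an illustration of the pattern described above, not code from the plugin: the function txrep_tag and its parameters are hypothetical, and it assumes that the _Y part is simply left out when dual storage is not in use.

```python
from typing import Optional

# IDs and suffixes as listed in the TEMPLATE TAGS section.
REPUTATION_IDS = ("EMAIL", "EMAIL_IP", "IP", "DOMAIN", "HELO")
FIELDS = ("", "MEAN", "COUNT", "PRESCORE", "UNKNOW")
STORAGES = (None, "U", "G")  # None = single storage (assumed: no _Y part)


def txrep_tag(rep_id: str, field: str = "",
              storage: Optional[str] = None) -> str:
    """Build a TxRep template tag name such as _TXREP_EMAIL_IP_U_MEAN_."""
    if rep_id not in REPUTATION_IDS:
        raise ValueError(f"unknown reputation ID: {rep_id!r}")
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field!r}")
    if storage not in STORAGES:
        raise ValueError(f"unknown storage: {storage!r}")
    # Skip the optional parts (storage, field) when they are empty.
    parts = ["TXREP", rep_id]
    if storage:
        parts.append(storage)
    if field:
        parts.append(field)
    return "_" + "_".join(parts) + "_"


print(txrep_tag("EMAIL_IP", "MEAN", "U"))  # _TXREP_EMAIL_IP_U_MEAN_
print(txrep_tag("DOMAIN"))                 # _TXREP_DOMAIN_
```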

use_txrep
  0 | 1                 (default: 0)

Whether to use the TxRep reputation system. TxRep tracks the long-term average score for each sender and then shifts the score of new messages toward that long-term average. This can increase or decrease the score for messages, depending on the long-term behavior of the particular correspondent.

Note that certain tests are ignored when determining the final message score:

 - rules with tflags set to 'noautolearn'

txrep_spf
  0 | 1                 (default: 1)

When enabled, TxRep will treat any IP address using a given email address as the same authorized identity, and will not associate any IP address with it. (The same happens with valid DKIM signatures. No option available for DKIM).

Note: for domains that define the useless SPF +all (pass all), no IP would ever be associated with the email address, and all addresses (incl. the forged ones) would be treated as coming from the authorized source. However, such domains are hopefully rare, and ask for this kind of treatment anyway.


REPUTATION WEIGHTS

The overall reputation of the sender comprises several elements:

  1.) The reputation of the 'From' email address bound to the originating IP address fraction (see the mask parameters for details)
  2.) The reputation of the 'From' email address alone (regardless of the IP address currently being used)
  3.) The reputation of the domain name of the 'From' email address
  4.) The reputation of the originating IP address, regardless of the sender's email address
  5.) The reputation of the HELO name of the originating computer (if available)

Each of these partial reputations is weighted with the help of these parameters, and the overall reputation is calculated as the weighted sum of the individual reputations divided by the sum of all their weights:

 sender_reputation = ( weight_email    * rep_email    +
                       weight_email_ip * rep_email_ip +
                       weight_domain   * rep_domain   +
                       weight_ip       * rep_ip       +
                       weight_helo     * rep_helo     ) /
                     ( weight_email + weight_email_ip + weight_domain +
                       weight_ip    + weight_helo )

You can disable the individual partial reputations by setting their respective weight to zero. This will also reduce the size of the database, since each partial reputation requires a separate entry in the database table. Disabling some of the partial reputations in this way may also help with the performance on busy servers, because the respective database lookups and processing will be skipped too.
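
A minimal sketch of this weighting, following the description above (weighted sum of the partial reputations divided by the sum of the weights). The function name combined_reputation, the dictionary keys, and every weight value except the documented txrep_weight_email default of 3 are assumptions chosen for illustration; the plugin's actual implementation may differ in details.

```python
from typing import Dict, Optional


def combined_reputation(reputations: Dict[str, float],
                        weights: Dict[str, float]) -> Optional[float]:
    """Weighted average of the partial reputations.

    A partial reputation whose weight is zero (or missing) is skipped
    entirely, mirroring the effect of disabling it via its weight.
    Returns None when no partial reputation contributes.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for key, reputation in reputations.items():
        weight = weights.get(key, 0.0)
        if weight == 0:
            continue  # disabled: no lookup, no contribution
        weighted_sum += weight * reputation
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical weights (only 'email' = 3 is a documented default);
# 'ip' and 'helo' are disabled by a zero weight.
weights = {"email": 3, "email_ip": 5, "domain": 2, "ip": 0, "helo": 0}
reputations = {"email": -2.0, "email_ip": -4.0, "domain": 1.0,
               "ip": 6.0, "helo": 8.0}
print(combined_reputation(reputations, weights))
```

With these numbers the weighted sum is -24 and the total weight is 10, so the combined reputation is about -2.4. The 'ip' and 'helo' values have no influence because their weights are zero.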

txrep_weight_email
 range [0..10]          (default: 3)

This weight factor controls the influence of the reputation of the standalone email address, regardless of the originating IP address. When adjusting the weight, keep in mind that an email address can be easily spoofed, and hence spammers can use 'from' email addresses belonging to senders with a good reputation. From this point of view, the email address bound to the originating IP address is a more reliable indicator of the overall reputation.

On the other hand, some reputable senders may be sending from a large number of IP addresses, so looking up the reputation of the standalone email address without regard to the originating IP makes some sense too.

We recommend using a relatively low value for this partial reputation.


ADMINISTRATOR SETTINGS

These settings differ from the ones above, in that they are considered 'more privileged' -- even more than the ones in the PRIVILEGED SETTINGS section. No matter what allow_user_rules is set to, these can never be set from a user's user_prefs file.

txrep_factory module
 (default: Mail::SpamAssassin::DBBasedAddrList)

Select an alternative database factory module for the TxRep database.