Mail::SpamAssassin::Plugin::TxRep - Normalize scores with sender reputation records
The TxRep (Reputation) plugin is designed as an improved replacement for the AWL (Auto-Whitelist) plugin. It adjusts the final message spam score by looking up the reputation of the sender and taking it into consideration.
To try TxRep out, you have to disable the AWL plugin (if present), back up its database and add a line loading this module in init.pre (AWL may be enabled in v310.pre):
    # loadplugin Mail::SpamAssassin::Plugin::AWL
    loadplugin Mail::SpamAssassin::Plugin::TxRep
When AWL is not disabled, TxRep will refuse to run.
Use the supplied 60_txreputation.cf file or add these lines to a .cf file:
    header   TXREP eval:check_senders_reputation()
    describe TXREP Score normalizing based on sender's reputation
    tflags   TXREP userconf noautolearn
    priority TXREP 1000
This plugin is intended to replace the former AWL - AutoWhiteList. Although the concept and the scope differ, the purpose remains the same - normalizing the spam score results based on the sender's previous history. The name was intentionally changed from "whitelist" to "reputation" to avoid any confusion, since the resulting score can be adjusted in both directions.
The TxRep plugin keeps track of the average SpamAssassin score for senders. Senders are tracked using multiple identifiers, or their combinations: the From: email address, the originating IP and/or an originating block of IPs, the sender's domain name, the DKIM signature, and the HELO name. TxRep then uses the average score to reduce the variability in scoring from message to message, and modifies the final score by pushing the result towards the historical average. This improves the accuracy of filtering for most email.
In comparison with the original AWL plugin, several conceptual changes were implemented in TxRep:
1. Scoring - although AWL tracks the number of messages received from each
respective sender, it does not take that count into account in any way when
calculating the corrective score for a new message. For example, if a sender
previously sent a single ham message with a score of -5, and then sends a second
one with a score of +10, AWL will issue a corrective score pulling the result
towards -5. With the default auto_whitelist_factor of 0.5, the resulting score
would be only 2.5. And it would be exactly the same even if the sender had
previously sent 1,000 messages with an average of -5. TxRep tries to take
maximal advantage of the collected data, and adjusts the final score not only
with the mean reputation score stored in the database, but also with respect to
the number of messages already seen from the sender. You can see the exact
formula in the section
2. Learning - AWL ignores any spam/ham learning. In fact it acts against it, which often leads to a frustrating situation where a user repeatedly tags all messages from a given sender as spam (or ham), but for every new message from that sender, AWL adjusts the score back to the historical average, which does not include the learned scores. TxRep changes this: every spam/ham learning is recorded in the reputation database, and hence taken into consideration for future email from the respective sender. See the section LEARNING SPAM / HAM for more details.
3. Auto-Learning - in certain situations SpamAssassin may declare a message to be
obvious spam or ham, and launch the auto-learning process, so that the message can be
re-evaluated. AWL, by design, did not perform any auto-learning adjustments. This plugin
will readjust the stored reputation by the value defined by
txrep_learn_bonus. Auto-learning score thresholds may be tuned, or the
auto-learning completely disabled, through the setting
4. Relearning - messages that were wrongly learned or auto-learned can be relearned.
The old reputations are removed from the database, and new ones are added in their place. The
relearning works better when message tracking is enabled through the
txrep_track_messages option. Without it, the relearned score is simply added to
the reputation, without removing the old ones.
5. Aging - with AWL, every historical record of a given sender has the same weight. This
means that changes in a sender's behavior, or modified SA rules, may take a long time to show, or
be virtually negated by the AWL normalization, especially for senders with a high count
of past messages and a low recent frequency. It also turns out to be particularly
counterproductive when the administrator detects new patterns in certain messages, and
applies new rules to better tag such messages as spam or ham. AWL will practically
eliminate the effect of the new rules, by adjusting the score back towards the (wrong)
historical average. Only setting the
auto_whitelist_factor lower would help, but at
the same time it would also reduce the overall impact of AWL, and cast doubt on its
purpose. TxRep, besides the
txrep_factor (replacement of the
introduces also the
txrep_dilution_factor to help cope with this issue by
progressively reducing the impact of past records. More details can be found in the
description of the factor below.
6. Blacklisting and Whitelisting - when whitelisting or blacklisting is requested through SpamAssassin's API, AWL adjusts the historical total score by a fixed value, regardless of the number of messages recorded for the given sender. This makes it practically impossible to blacklist or whitelist any sender with a higher number of recorded scores. Even for senders with few messages, the impact of the whitelisting or blacklisting is minimal, and new messages can still be tagged incorrectly. TxRep handles black/whitelisting differently, so that it has the desired effect. It is explained in detail in the section BLACKLISTING / WHITELISTING.
7. Sender Identification - AWL identifies a sender on the basis of the email address used and the originating IP address (more precisely, the part of it defined by the mask setting). The main purpose of this measure is to avoid assigning false good scores to spammers who spoof known email addresses. The disadvantage shows up with senders who send from frequently changing locations, or who connect through dynamic IP addresses that are not within the block defined by the mask setting. Their score is difficult or sometimes impossible to track. Another disadvantage arises, for example, with a spammer persistently sending spam from the same IP address, just under different email addresses. AWL will not find the spammer's previous scores unless the same email address is reused. TxRep uses several identifiers, and creates separate database entries for each of them. It tracks not only the email/IP address combination like AWL, but also the standalone email address (regardless of the originating IP), the standalone IP (regardless of the email address used), the domain name of the email address, the DKIM signature, and the HELO name of the connecting host. The influence of each individual identifier may be tuned with the help of the weight factors described in the section REPUTATION WEIGHTS.
8. Message Tracking - TxRep (optionally) keeps track of already scanned and/or learned
message IDs. This prevents the reputation score from being strengthened simply by
rescanning or relearning the same message multiple times. At the same time it also allows
the proper relearning of messages that were once wrongly learned, or relearning them after the
learn penalty or bonus were changed. See the option
9. User and Global Storages - it is usually recommended to use the per-user setup
of SpamAssassin, because each user may have quite different requirements, and may receive
quite different sorts of email. Especially when using the Bayesian and AWL plugins,
the efficiency is much better when SpamAssassin is trained on spam and ham separately
for each user. However, the disadvantage is that senders and emails already learned
many times by different users will need to be relearned without any recognized history
whenever they arrive for another user. TxRep uses the advantages of both systems. It can
use dual storages: the global common storage, where all email processed by SpamAssassin
is recorded, and a local storage separate for each user, with reputation data from their
email only. See more details at the setting
10. Outbound Whitelisting - when a local user sends messages to an email address, we
assume that they need to see any reply too, hence the recipient's address should
be whitelisted. When SpamAssassin is also used for scanning outgoing email, when local
users send email through the SMTP server where SA is installed, and when internal
networks are defined, TxRep will improve the reputation of all 'To:' and 'CC' addresses
from messages originating in the internal networks. Details can be found at the setting
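The arithmetic in item 1 (Scoring) can be reproduced with a few lines of Python. This is only a sketch: the function name and the formula score + (mean - score) * factor are assumptions inferred from the worked example in that item (a mean of -5, a new score of +10 and a factor of 0.5 giving 2.5), not code taken from AWL itself. It illustrates that the number of past messages plays no role in the AWL correction.

```python
def awl_adjusted_score(score, mean, factor=0.5):
    """Hypothetical helper: pull `score` towards the sender's historical `mean`.

    `factor` plays the role of auto_whitelist_factor (0 = no correction,
    1 = replace the score with the mean). The message count is deliberately
    not a parameter, which is the limitation described in item 1.
    """
    return score + (mean - score) * factor


# One earlier ham message at -5, or 1,000 of them averaging -5:
# the corrected score of a new +10 message is the same either way.
print(awl_adjusted_score(10.0, -5.0))  # 2.5
```

Because only the mean enters the calculation, a long and consistent history carries no more weight than a single earlier message.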
The two plugins (AWL and TxRep) cannot coexist. It is necessary to disable AWL to allow TxRep to run. TxRep reuses the database handling of the original AWL module, and some of its parameters bound to the database handler modules. By default, TxRep creates its own database, but the original auto-whitelist can be reused as a starting point. The AWL database can be renamed to the name defined in the TxRep settings, and TxRep will start using it. The original auto-whitelist database has to be backed up first, to allow switching back to the original state.
The spamassassin/Plugin/TxRep.pm file replaces both spamassassin/Plugin/AWL.pm and spamassassin/AutoWhitelist.pm. Two other AWL files, spamassassin/DBBasedAddrList.pm and spamassassin/SQLBasedAddrList.pm, are still needed.
This plugin module adds the following tags that can be used as placeholders in certain options. See the Mail::SpamAssassin::Conf manpage for more information on TEMPLATE TAGS.
    _TXREP_XXX_Y_           TXREP modifier
    _TXREP_XXX_Y_MEAN_      Mean score on which TXREP modification is based
    _TXREP_XXX_Y_COUNT_     Number of messages on which TXREP modification is based
    _TXREP_XXX_Y_PRESCORE_  Score before TXREP
    _TXREP_XXX_Y_UNKNOW_    New sender (not found in the TXREP list)
The XXX part of the tag takes the form of one of the following IDs, depending on the reputation checked: EMAIL, EMAIL_IP, IP, DOMAIN, or HELO. The _Y appendix ID is used only in the case of dual storage, and takes the form of either _U (for user storage reputations), or _G (for global storage reputations).
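The naming scheme above can be spelled out with a small Python sketch. The helper txrep_tag and its parameter names are hypothetical and exist only for illustration; the IDs, suffixes and storage letters are the ones listed in this section.

```python
# Reputation IDs and tag suffixes as listed in this section.
REPUTATION_IDS = ("EMAIL", "EMAIL_IP", "IP", "DOMAIN", "HELO")
TAG_FIELDS = ("", "MEAN", "COUNT", "PRESCORE", "UNKNOW")


def txrep_tag(rep_id, field="", storage=None):
    """Hypothetical helper: build a template tag name.

    rep_id  -- the XXX part (one of REPUTATION_IDS)
    field   -- "" for the modifier itself, or MEAN, COUNT, PRESCORE, UNKNOW
    storage -- None for single storage, "U" (user) or "G" (global)
               when dual storage is in use
    """
    if rep_id not in REPUTATION_IDS:
        raise ValueError("unknown reputation ID: %r" % rep_id)
    if field not in TAG_FIELDS:
        raise ValueError("unknown tag field: %r" % field)
    if storage not in (None, "U", "G"):
        raise ValueError("storage must be None, 'U' or 'G'")
    parts = ["TXREP", rep_id]
    if storage:
        parts.append(storage)
    if field:
        parts.append(field)
    return "_" + "_".join(parts) + "_"


print(txrep_tag("EMAIL_IP", "MEAN"))        # _TXREP_EMAIL_IP_MEAN_
print(txrep_tag("IP", storage="G"))         # _TXREP_IP_G_
print(txrep_tag("DOMAIN", "COUNT", "U"))    # _TXREP_DOMAIN_U_COUNT_
```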
0 | 1 (default: 0)
Whether to use the TxRep reputation system. TxRep tracks the long-term average score for each sender and then shifts the score of new messages toward that long-term average. This can increase or decrease the score for messages, depending on the long-term behavior of the particular correspondent.
Note that certain tests are ignored when determining the final message score:
- rules with tflags set to 'noautolearn'
0 | 1 (default: 1)
When enabled, TxRep will treat any IP address using a given email address as the same authorized identity, and will not associate any IP address with it. (The same happens with valid DKIM signatures. No option available for DKIM).
Note: for domains that define the useless SPF +all (pass all), no IP would ever be associated with the email address, and all addresses (including the forged ones) would be treated as coming from the authorized source. However, such domains are hopefully rare, and ask for this kind of treatment anyway.
Each of these partial reputations is weighted with the help of these parameters, and the overall reputation is calculated as the weighted sum of the individual reputations divided by the sum of all their weights:

    sender_reputation = ( weight_email    * rep_email    +
                          weight_email_ip * rep_email_ip +
                          weight_domain   * rep_domain   +
                          weight_ip       * rep_ip       +
                          weight_helo     * rep_helo )
                      / ( weight_email + weight_email_ip + weight_domain +
                          weight_ip + weight_helo )
You can disable the individual partial reputations by setting their respective weight to zero. This will also reduce the size of the database, since each partial reputation requires a separate entry in the database table. Disabling some of the partial reputations in this way may also help with the performance on busy servers, because the respective database lookups and processing will be skipped too.
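The weighting described above (a weighted sum divided by the sum of the weights) can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's code: the function name is hypothetical, and apart from the email weight of 3 (the documented default) all weights and reputation values are made up for the example. Returning None when every weight is zero is also an assumption.

```python
def combined_reputation(reputations, weights):
    """Hypothetical helper: weighted mean of the partial reputations.

    A partial reputation whose weight is zero (or missing) is skipped
    entirely, mirroring the way a zero weight disables it.
    """
    total = 0.0
    weight_sum = 0.0
    for name, rep in reputations.items():
        weight = weights.get(name, 0.0)
        if weight == 0:
            continue  # disabled partial reputation: no lookup, no effect
        total += weight * rep
        weight_sum += weight
    if weight_sum == 0:
        return None  # assumption: nothing enabled, so no reputation
    return total / weight_sum


# Illustrative values only; helo is disabled by its zero weight.
weights = {"email": 3, "email_ip": 10, "domain": 2, "ip": 5, "helo": 0}
reputations = {"email": -2.0, "email_ip": -4.0, "domain": 3.0,
               "ip": 0.0, "helo": 9.0}

print(combined_reputation(reputations, weights))  # -2.0
```

Here the strongly weighted email/IP reputation dominates the result, while the HELO reputation of 9.0 is ignored because its weight is zero.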
range [0..10] (default: 3)
This weight factor controls the influence of the reputation of the standalone email address, regardless of the originating IP address. When adjusting the weight, you need to keep in mind that an email address can be easily spoofed, and hence spammers can use 'from' email addresses belonging to senders with a good reputation. From this point of view, the email address bound to the originating IP address is a more reliable indicator of the overall reputation.
On the other hand, some reputable senders may be sending from a larger number of IP addresses, so looking up the reputation of the standalone email address without regard to the originating IP makes some sense too.
We recommend using a relatively low value for this partial reputation.
These settings differ from the ones above, in that they are considered 'more
privileged' -- even more than the ones in the PRIVILEGED SETTINGS section.
No matter what
allow_user_rules is set to, these can never be set from a
Select an alternative database factory module for the TxRep database.