Bayesian spam filtering with Exchange Server 2007
Written by Paul Cunningham on January 1, 2009
Bayesian spam filtering is a technique used to classify email as spam based on the contents of the email message. This is similar to other forms of Exchange content filtering with one important distinction: standard content filtering uses a database of spam “signatures”, whereas Bayesian spam filtering uses a mathematical probability calculation that is based on what the filter has learned about an organisation’s email.
Why isn’t signature-based content filtering enough?
As a real-world example, I once deployed an anti-spam solution for a client in the tourism industry. The chosen product was of good quality and had performed well at other client installations, but it ran into many problems at the tourism client.
The biggest problem was that the database of spam signatures treated email with certain characteristics as spam even though these emails were quite legitimate. The characteristics included:
- Email sources being in an Asian or European country
- Emails containing “offers” and “deals” with heavy marketing language
- Emails regarding hotels and travel insurance
For many organisations outside of the tourism industry these types of emails would very likely be spam, however for this client these were legitimate business emails getting blocked as spam! In order to let these emails pass through the spam filter, an extensive whitelist of keywords and sender addresses was built and the overall sensitivity of the spam filter was lowered.
The end result was a massive administrative overhead in developing and maintaining the whitelist, investigating rejected emails, and releasing quarantined items. In addition to these costs the end users developed a perception that the email system was unreliable, and also complained loudly about the amount of real spam that was slipping through the less sensitive spam filter.
How does Bayesian spam filtering solve this problem?
For a Bayesian filter to be effective it must first learn about your organisation’s email content. This is achieved by “training” the Bayesian filter with a sample of your regular business emails (usually those sent by the organisation).
The Bayesian filter uses this training process to learn about words, phrases, or names that indicate that a message is less likely to be spam.
As an example, many signature-based spam filters will treat words such as “Viagra” or “Rolex” as indicating a high probability that the email is spam. But if those words appear in an email message alongside other words, phrases or names that the Bayesian filter has learned are legitimate, then it will consider the email to have a lower probability of being spam. So while a “Viagra” email might be spam, it probably isn’t spam if your company manufactures or distributes the product legitimately.
In other words, Bayesian filtering solves the problem of the “one size fits all” approach of signature-based content filtering.
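To make the idea concrete, here is a minimal sketch of the kind of calculation a Bayesian filter performs. It is an illustration only and not how any particular product works: the class name TinyBayesFilter, the tokenizer, the add-one smoothing and the sample messages are all assumptions made for this example. The training messages loosely mirror the tourism client described earlier.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayesFilter:
    """Hypothetical, minimal naive Bayes spam filter (illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message known to be 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words) with add-one (Laplace) smoothing."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Prior: how common this class was in the training sample.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Smoothing keeps unseen words from zeroing out the score.
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            log_scores[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_scores["ham"] - log_scores["spam"]))
        return 1 / (1 + math.exp(diff))


# Made-up training sample for a tourism business ("ham" = legitimate email).
bayes = TinyBayesFilter()
bayes.train("cheap hotel deals special offer click now", "spam")
bayes.train("win a free holiday offer click here", "spam")
bayes.train("hotel booking confirmation for your tour group", "ham")
bayes.train("travel insurance quote for the tour group booking", "ham")
bayes.train("special hotel deals for our tour group partners", "ham")

legit = bayes.spam_probability("hotel deals and travel insurance for the tour group")
junk = bayes.spam_probability("click now to win a free offer")
print(f"business email: {legit:.3f}  junk email: {junk:.3f}")
```

Because the filter has seen “hotel” and “deals” in legitimate mail alongside words like “tour” and “group”, the business email scores well below 0.5 while the junk email scores well above it. A production filter would train on far more mail and handle tokens such as headers and URLs more carefully.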
How can Bayesian spam filtering protect Exchange Server 2007?
Exchange Server 2007 ships with anti-spam features included in the product. Among these is a Content Filter agent that filters spam based on email contents. The Content Filter agent uses signature-based spam filtering, which is based on a database of spam submissions from Microsoft customers and partners.
Although the Content Filter agent can be effective, it often requires constant attention and fine tuning, and unlike Bayesian filtering it has no ability to learn the characteristics of your organisation’s typical email content.
Deploying a Bayesian filter in an Exchange Server 2007 environment can be done in a few different ways:
Client based solution
Bayesian filtering can be provided by installing a client-based Bayesian filter on each end user computer. This approach carries several disadvantages:
- Large administrative effort deploying the client software to all computers
- End user education required on how to “train” the Bayesian filter, as well as the productivity lost in performing the training
- Spam emails are delivered to the end user mailbox before the filtering is applied, wasting server and bandwidth resources
Server based solution
By installing a dedicated server-based Bayesian filter solution in front of the Exchange servers, the Bayesian filtering can be performed on email messages before they arrive on the Exchange servers. However, despite that advantage over client-based solutions, there are still several disadvantages:
- Spam emails are fully downloaded to the filtering server before they can be checked for spam content, wasting bandwidth resources
- Earlier spam checks such as Connection Filtering, which can block likely spam based on the sending IP address, are not applied first as they should be
- Many of the dedicated Bayesian filter solutions have no features such as reporting or end user quarantine management
Approaching Bayesian filtering for Exchange Server environments
Many organisations that attempt to solve their spam problems with the built-in Exchange Server 2007 spam filters will be dissatisfied with the results and look for more effective solutions such as Bayesian filtering.
When considering a server-based Bayesian filtering solution the disadvantages listed above should be taken into account. To get the best improvement over the Exchange Server 2007 anti-spam features organisations should look for a dedicated email security solution that includes a range of protective measures (including Bayesian filtering), as well as advanced features such as end user self service for quarantined items and advanced reporting.