Bayesian Spam Filtering

Written by Giselle Borg Olivier on September 11, 2008

Bayesian filtering is one of the most effective and intelligent solutions to combat spam email nowadays. Spam is a problem faced by all email users and it shows no sign of slowing down anytime soon; in fact, the number of spam emails is increasing daily. Added to this, spammers are becoming more sophisticated and are constantly managing to outsmart ‘static’ methods of fighting spam.

The techniques currently used by most anti-spam software are static, meaning that spammers can simply examine the latest filtering techniques and find ways to dodge them, usually by tweaking the message slightly.

This gave anti-spam developers a new challenge: to come up with a technique that keeps up with spammers’ tactics as they change over time and that adapts to the particular organization it is protecting from spam. The answer lay in Bayesian mathematics, leading to a technique known as Bayesian filtering.

How does Bayesian filtering work?

Bayesian spam filtering is the process of using a naive Bayes classifier to identify spam email. It is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from previous occurrences of that event. The same technique can be used to classify spam: if some piece of text occurs often in spam but not in legitimate mail, then it is reasonable to assume that an email containing that text is probably spam.
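The inference described above can be written down with Bayes' theorem for a single word. The following is a minimal sketch rather than any particular product's implementation; the function name, the equal prior of 0.5 and the example figures are assumptions made for illustration.

```python
def spam_probability_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Return P(spam | word) using Bayes' theorem.

    p_word_given_spam: fraction of spam messages containing the word
    p_word_given_ham:  fraction of legitimate messages containing the word
    p_spam:            prior probability that any message is spam (assumed 0.5)
    """
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    evidence = numerator + p_word_given_ham * p_ham
    if evidence == 0:
        # The word was never seen in either class: fall back to the prior.
        return p_spam
    return numerator / evidence


# Hypothetical figures: the word appears in 40% of spam and 1% of legitimate mail.
p = spam_probability_given_word(0.40, 0.01)
print(round(p, 3))  # 0.976
```

With these made-up figures, a message containing the word is very likely spam, simply because the word is forty times more common in spam than in legitimate mail.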

Bayesian spam filtering has become a popular mechanism for distinguishing spam from legitimate email, and many mail clients now implement it.

Bayesian filters must be ‘trained’ to work effectively. Particular words have particular probabilities (also known as likelihood functions) of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word Viagra in spam email, but will seldom see it in other email. Before mail can be filtered using this method, the user needs to generate a database of words and tokens (such as the $ sign, IP addresses and domains, and so on), collected from a sample of spam mail and valid mail (referred to as ‘ham’). For every word in each training email, the filter adjusts the probabilities in its database that the word will appear in spam or in legitimate email.
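The training step can be sketched as follows. This is an illustrative sketch only: the `TokenDatabase` class, the tokenizer pattern and the add-one smoothing are assumptions chosen for the example, not details taken from any specific anti-spam product.

```python
import re
from collections import Counter

# Assumed tokenizer: words, numbers and $ signs, allowing internal '.', '-' or "'"
# so that IP addresses, domains and hyphenated words stay in one piece.
TOKEN_RE = re.compile(r"[a-z0-9$]+(?:[.'-][a-z0-9$]+)*")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class TokenDatabase:
    """Hypothetical token database built from sample spam and ham."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each token
        self.ham_counts = Counter()   # number of ham messages containing each token
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_messages += 1

    def spamminess(self, token):
        """Estimate P(spam | token), assuming equal priors and add-one smoothing."""
        p_in_spam = (self.spam_counts[token] + 1) / (self.spam_messages + 2)
        p_in_ham = (self.ham_counts[token] + 1) / (self.ham_messages + 2)
        return p_in_spam / (p_in_spam + p_in_ham)


db = TokenDatabase()
db.train("Buy Viagra now, $$$ off", is_spam=True)
db.train("Cheap Viagra, free shipping", is_spam=True)
db.train("Meeting notes attached for Monday", is_spam=False)
db.train("Lunch on Monday?", is_spam=False)

print(db.spamminess("viagra"), db.spamminess("monday"), db.spamminess("unseen"))
```

The smoothing keeps a token that has never been seen at a neutral 0.5 instead of dividing by zero, which is one common design choice among several.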

After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category. If the combined probability exceeds a certain threshold, the filter will mark the email as spam. Users can then decide whether to move email marked as spam to their spam folder or simply delete it.
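One common way to turn per-word probabilities into a single score is to multiply their odds under the naive independence assumption. The sketch below uses that rule; it is an assumption about how the combination might be done, and the function names and the 0.9 threshold are hypothetical.

```python
import math


def combined_spam_probability(word_probs):
    """Combine per-word P(spam | word) values, assuming independence and equal priors.

    Works in log-odds space to avoid numeric underflow on long messages.
    """
    log_odds = 0.0
    for p in word_probs:
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep away from exactly 0 or 1
        log_odds += math.log(p) - math.log(1.0 - p)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def is_spam(word_probs, threshold=0.9):
    """Mark the message as spam when the combined probability exceeds the threshold."""
    return combined_spam_probability(word_probs) > threshold


# Hypothetical per-word values: two spammy words and one mildly hammy word.
print(is_spam([0.99, 0.95, 0.2]))
# Two spammy words outweighed by two strongly legitimate ones.
print(is_spam([0.9, 0.9, 0.05, 0.05]))
```

In the second call the strongly legitimate words pull the combined score well below the threshold, so evidence in both directions is weighed rather than a single word deciding the outcome.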

How Bayesian filtering adapts itself to your industry

It is important to note that the Bayesian filter analyzes the organization’s own legitimate mail and is therefore tailored to that particular organization. For example, a financial institution might use the word ‘mortgage’ many times over and would get a lot of false positives if it used a general anti-spam rule set. A Bayesian filter that has been tailored to your company through an initial training period, on the other hand, takes note of the company’s valid outbound mail (and recognizes ‘mortgage’ as being frequently used in legitimate messages). It therefore has a much better spam detection rate and a far lower false positive rate (where legitimate email is incorrectly classified as spam).
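The ‘mortgage’ example can be made concrete with a small calculation. The message counts below are invented purely for illustration, and the smoothed estimate is an assumed formula, not one taken from a specific product.

```python
def word_spamminess(spam_with_word, spam_total, ham_with_word, ham_total):
    """Estimate P(spam | word) from message counts (equal priors, add-one smoothing)."""
    p_in_spam = (spam_with_word + 1) / (spam_total + 2)
    p_in_ham = (ham_with_word + 1) / (ham_total + 2)
    return p_in_spam / (p_in_spam + p_in_ham)


# Hypothetical counts: 'mortgage' appears in 300 of 1000 spam messages in both cases.
# Generic ham sample: the word appears in only 5 of 1000 legitimate messages.
generic = word_spamminess(300, 1000, 5, 1000)
# Financial institution's own ham: the word appears in 400 of 1000 legitimate messages.
bank = word_spamminess(300, 1000, 400, 1000)

print(round(generic, 3), round(bank, 3))
```

Under these assumed counts the same word looks strongly spammy against the generic sample but slightly legitimate against the institution's own mail, which is why tailoring reduces false positives.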

Bayesian filtering is an intelligent approach to classifying mail because it examines all aspects of a message, as opposed to keyword checking, which classifies a message as spam on the basis of a single word. For example, not every email that contains the words ‘free’ and ‘cash’ is spam. The Bayesian method would find the words ‘cash’ and ‘free’ interesting, but it would also recognize the name of the business contact who sent the message and thus may classify the message as legitimate; it allows words to ‘balance’ each other out.

Spam vs. Ham Data Files

Note that some anti-spam software with very basic Bayesian capabilities, such as the Outlook spam filter or the Internet Message Filter in Exchange Server, does not create a tailored ham data file for your company, but ships a standard ham data file with the installation. Although this method does not require an initial learning period, it has two major flaws:

  1. The ham data file is publicly available and can thus be hacked by professional spammers and bypassed. If the ham data file is unique to your company, hacking a standard ham data file is useless. For example, there are hacks available to bypass the Microsoft Outlook 2003 or Exchange Server spam filter.
  2. Such a ham data file is a general one and thus not tailored to your company. It cannot be as effective, and you will suffer from a noticeably higher rate of false positives.

For Bayesian filtering to work effectively, it must also be kept up to date with the latest spam techniques, by means of a spam data file that the anti-spam software keeps updated. This ensures that the Bayesian filter is aware of the latest spam tricks, resulting in a high spam detection rate (note: this is achieved once the required initial two-week learning period is over). For example, when spammers started using “f-r-e-e” instead of “free”, they succeeded in evading keyword checking until “f-r-e-e” was also included in the keyword database. The Bayesian filter, on the other hand, automatically notices such tactics and classifies such messages as spam.
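The contrast between a static keyword list and a filter that keeps learning can be sketched as follows. The keyword list, the `AdaptiveCounts` class and the sample messages are all hypothetical, and the sketch assumes that newly identified spam is fed back into the filter's counts.

```python
from collections import Counter

STATIC_KEYWORDS = {"free", "viagra"}  # hypothetical fixed keyword list


def keyword_match(tokens):
    """Static check: flags a message only if it contains a listed keyword."""
    return any(token in STATIC_KEYWORDS for token in tokens)


class AdaptiveCounts:
    """Hypothetical spam/ham token counts that are updated as new mail is classified."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.spam.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham.update(set(tokens))
            self.n_ham += 1

    def spamminess(self, token):
        # Equal priors and add-one smoothing (assumed for this sketch).
        p_in_spam = (self.spam[token] + 1) / (self.n_spam + 2)
        p_in_ham = (self.ham[token] + 1) / (self.n_ham + 2)
        return p_in_spam / (p_in_spam + p_in_ham)


obfuscated = "claim your f-r-e-e gift".split()
print(keyword_match(obfuscated))  # False: 'f-r-e-e' is not on the static list

counts = AdaptiveCounts()
before = counts.spamminess("f-r-e-e")  # neutral, the token has never been seen
for _ in range(3):
    counts.learn(obfuscated, is_spam=True)
    counts.learn("see you at the meeting".split(), is_spam=False)
after = counts.spamminess("f-r-e-e")
print(before, round(after, 3))
```

Nobody had to add ‘f-r-e-e’ to a list by hand: once a few messages containing it have been learned as spam, the token itself becomes evidence of spam.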

Clever in any language

The Bayesian method is multilingual and international, unlike most keyword lists, which are only available in English and are therefore quite useless in non-English-speaking countries. The Bayesian filter also takes into account certain language deviations or the diverse usage of certain words in different areas, even if the same language is spoken. This intelligence enables such a filter to catch more spam.

A Bayesian filter is difficult to fool, unlike a keyword filter. An advanced spammer who wants to trick a Bayesian filter can either use fewer words that usually indicate spam (such as free, Viagra, etc.) or more words that generally indicate valid mail (such as a valid contact name). However, the latter method is practically impossible when targeting a large group of people, as collating personal data for each recipient would be extremely time-consuming and tiresome. The former method would probably see spammers using tricks such as writing the word ‘mortgage’ as ‘m-o-r-t-g-a-g-e’, which the Bayesian filter would still pick up.

Whilst some types of anti-spam software regularly download new keyword files to update their anti-spam systems, the method is still flawed compared to a Bayesian filter.

Why Bayesian Filtering is the answer to your spam problems

Bayesian filtering, if implemented the right way and tailored to your company, is by far the most effective technology to combat spam. The only catch is that you will have to wait two weeks after installation for the software to learn your company’s email habits (or alternatively train it yourself, which can be quite time-consuming). Once the two-week period is up, the filter will be able to distinguish between ham and spam and classify accordingly; meanwhile, it will keep itself updated with any new spam techniques or email habits that your company introduces.

Therefore, when evaluating and comparing anti-spam software, one must keep the long-term effects in mind: whilst anti-spam software based on keyword lists is likely to perform better in the first month, Bayesian filtering will soon catch up and then outperform conventional anti-spam filters.