What are Bayesian filters anyway?
Written by Ed Fisher on February 22, 2011
We mention them all the time. We look for them as a feature in our anti-spam products. But do we know what they are, or are they just another black box in our infrastructure? For many an experienced admin, Bayesian filters may be old hat, but for others, it is a term easily used but not fully understood. This article will crack open the box for those who are curious about just what the heck a Bayesian filter actually is, what it does, and how it works.
Let’s start with a little vocabulary that is used when we discuss Bayesian filters and spam in general.
Spam: Unsolicited Commercial Email (UCE), or messages that were neither requested nor welcomed, and that generally attempt to sell you something.
Ham: Email that the intended recipient would like to receive. The term usually comes up when such mail is wrongly identified as spam. See false positive.
False positive: Legitimate email (ham) that is identified as spam.
False negative: Spam that is classified as legitimate email and passed to the user’s inbox.
Rev. Thomas Bayes
Bayesian filters’ namesake, Reverend Thomas Bayes, was born over two hundred years before the technology that uses his theorem was created. He was a mathematician who lived in England in the 1700s, studied mathematics and theology at the University of Edinburgh, and became a Presbyterian minister. He wrote a respected theological text, as well as a mathematical treatise that defended Sir Isaac Newton’s calculus.
However, Bayes is best known for his theorem on probability, which was published posthumously. Bayes’ theorem is also called the theorem of probability of causes.
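In short, it considers events $A_1, A_2, \dots, A_n$ that are all mutually exclusive and any of which could have caused an event $B$. Together they make up the whole sample space, $S = \bigcup_{k=1}^{n} A_k$, i.e., one of these events has to occur. Bayes’ rule gives us the probability that a particular cause $A_k$ occurred given that event $B$ was observed, and is expressed as:

$$P(A_k \mid B) = \frac{P(B \mid A_k)\,P(A_k)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}$$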
The probability of event A given event B (e.g. the probability that an email is spam given that it contains one or more keywords associated with spam) depends not only on the relationship between events A and B but also on the marginal probability of the occurrence of each event.
Bayes’ theorem is used by Bayesian filters to calculate the probability that an email is spam based on the likelihood that any individual email is spam, the likelihood that a certain word appears in spam, the likelihood that the same word appears in ham, and other traits such as links to sites from other domains or known spam domains. If that makes your head spin (and it does mine), then let’s simplify this with a practical example.
This example just uses round numbers to illustrate the point; the percentages are arbitrary. Consider an email that contains the phrase ‘bank account.’ Suppose that, taking all emails collectively, 80% of them are spam and 20% are legitimate, and that the phrase ‘bank account’ appears in 20% of spam messages and 10% of legitimate messages. Then 16% of all email is spam containing the phrase (80% × 20%), while 2% is legitimate mail containing it (20% × 10%). An email containing the phrase ‘bank account’ is therefore eight times more likely to be spam than legitimate (16% versus 2%), which works out to a probability of roughly 89% (16 ÷ 18) that it is spam. This is factored in with the probabilistic analysis of other phrases in the email, any links, the source domain, and other attributes to come up with a total probability that a specific email is spam. If the probability exceeds the threshold, the email is filtered. If it is below the threshold, it is passed on.
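The arithmetic in the ‘bank account’ example can be checked with a few lines of Python. This is only a sketch of Bayes’ theorem applied to a single phrase: the function name `spam_posterior` and the 0.90 threshold are made up for illustration, and the percentages are the same arbitrary round numbers used above.

```python
def spam_posterior(p_spam, p_phrase_given_spam, p_phrase_given_ham):
    """Bayes' theorem with two mutually exclusive causes: spam or ham."""
    p_ham = 1.0 - p_spam
    spam_joint = p_phrase_given_spam * p_spam  # share of all mail: spam with the phrase
    ham_joint = p_phrase_given_ham * p_ham     # share of all mail: ham with the phrase
    return spam_joint / (spam_joint + ham_joint)


# Round numbers from the article: 80% of mail is spam, and the phrase
# appears in 20% of spam and 10% of legitimate mail.
posterior = spam_posterior(p_spam=0.80,
                           p_phrase_given_spam=0.20,
                           p_phrase_given_ham=0.10)

THRESHOLD = 0.90  # hypothetical cut-off; real products choose their own
is_filtered = posterior > THRESHOLD

print(f"P(spam | 'bank account') = {posterior:.3f}")
print("filtered" if is_filtered else "passed on")
```

With these numbers the phrase alone does not push the message over the assumed threshold, which is why a filter looks at many attributes of a message rather than just one.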
Bayesian filters need to be ‘trained’, as the attributes that can identify spam are not consistent across all organisations. You can imagine that the percentage of emails sent to a bank that include the phrase ‘bank account’ would be much higher than at most other companies.
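To make the idea of training more concrete, here is a minimal sketch that counts how often a word shows up in mail already labelled as spam or ham, and turns those counts into likelihoods. The two tiny mailboxes are invented for illustration, the helper names `train` and `word_likelihood` are hypothetical, and the add-one smoothing is just one common way to avoid zero probabilities, not necessarily what any given product does.

```python
from collections import Counter


def train(messages):
    """Count labelled messages and, per class, how many messages contain each word.

    `messages` is an iterable of (text, is_spam) pairs.
    """
    doc_counts = {True: 0, False: 0}
    word_counts = {True: Counter(), False: Counter()}
    for text, is_spam in messages:
        doc_counts[is_spam] += 1
        # set() so a word is counted once per message, not once per occurrence
        word_counts[is_spam].update(set(text.lower().split()))
    return doc_counts, word_counts


def word_likelihood(word, is_spam, doc_counts, word_counts):
    """Estimate P(word | class) with add-one smoothing."""
    return (word_counts[is_spam][word] + 1) / (doc_counts[is_spam] + 2)


# Invented training data: a bank's mailbox and another company's mailbox.
bank_mail = [
    ("please update my bank account details", False),
    ("statement for your bank account attached", False),
    ("bank account closure request", False),
    ("verify your bank account now to win", True),
]
other_mail = [
    ("your order has shipped", False),
    ("invoice for last month attached", False),
    ("meeting moved to friday", False),
    ("verify your bank account now to win", True),
]

bank_ham = word_likelihood("account", False, *train(bank_mail))
other_ham = word_likelihood("account", False, *train(other_mail))

print(f"P('account' | ham) at the bank:     {bank_ham:.2f}")
print(f"P('account' | ham) at the other co: {other_ham:.2f}")
```

The same word ends up far more likely in legitimate mail at the bank than at the other company, so a filter trained on one organisation's mail would weigh it very differently from a filter trained on the other's.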
Spammers try to fool Bayesian filters using several techniques. You have probably seen paragraphs of seemingly random text at the end of a spam message, or words that are broken up with nonsense characters or soft hyphens. These are ways to game the system, either by escaping detection or by throwing the total calculation off with words or phrases that are more likely to be found in legitimate mail than in spam.
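The effect of these tricks can be illustrated with a toy scorer. The sketch below assumes the filter combines words as if they were independent (the "naive Bayes" simplification), which is a common approach but not something this article specifies. The word list and every likelihood in `LIKELIHOODS` are invented for illustration, and words the filter has never seen are simply ignored.

```python
import math

# Hypothetical likelihoods learned in training: word -> (P(word | spam), P(word | ham))
LIKELIHOODS = {
    "viagra": (0.30, 0.001),
    "free": (0.40, 0.05),
    "meeting": (0.01, 0.20),
    "agenda": (0.005, 0.15),
    "quarterly": (0.005, 0.10),
}


def spam_probability(words, p_spam=0.8):
    """Combine per-word evidence, treating words as independent (naive Bayes)."""
    # Work in log space so long messages do not underflow to zero.
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in words:
        if word in LIKELIHOODS:  # unknown words contribute nothing
            p_word_spam, p_word_ham = LIKELIHOODS[word]
            log_spam += math.log(p_word_spam)
            log_ham += math.log(p_word_ham)
    # Turn the two log scores back into P(spam | words).
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


plain = spam_probability(["viagra", "free"])
obfuscated = spam_probability(["v-i-a-g-r-a", "free"])  # broken-up word is not recognised
padded = spam_probability(["viagra", "free", "meeting", "agenda", "quarterly"])

print(f"plain spam:            {plain:.3f}")
print(f"obfuscated keyword:    {obfuscated:.3f}")
print(f"padded with ham words: {padded:.3f}")
```

Breaking up the keyword removes its evidence from the calculation, and padding the message with words that are common in legitimate mail drags the combined score down, which is exactly what the spammer is hoping for.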
While Bayesian filtering is an important part of most anti-spam systems, it is only one part and should be used in combination with other methods like whitelists, blacklists, and other filtering technologies. Fighting spam, just like any other security initiative, should take a layered approach, often called defense in depth.