Anti-spam technique: Bayesian filters
Date of first use: 2002
Difficulty of implementation: Medium
Bayesian filters were first proposed in Paul Graham's 2002 essay "A Plan for Spam". Starting with two corpora of messages, one of spam and one of legitimate mail (ham), the software extracts words or other features from the messages and determines how common each feature is in ham and in spam. When subsequent messages arrive, a message with many features common to spam is probably spam, and a message with many features common to legitimate messages is probably legitimate.
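The train-then-score scheme described above can be sketched as follows. This is a minimal illustration, not Graham's exact algorithm: the tokenizer, the Laplace smoothing constants, and the class and method names are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def features(message):
    """Extract lowercase word tokens as features. Real filters often
    also use headers, HTML markup, and other message attributes."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))

class BayesianFilter:
    """Minimal sketch of a Bayesian spam filter trained on two corpora."""

    def __init__(self):
        self.spam_counts = Counter()  # feature -> number of spam messages containing it
        self.ham_counts = Counter()   # feature -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, message, is_spam):
        """Add one classified message to the appropriate corpus statistics."""
        if is_spam:
            self.n_spam += 1
            self.spam_counts.update(features(message))
        else:
            self.n_ham += 1
            self.ham_counts.update(features(message))

    def spam_probability(self, message):
        """Combine per-feature evidence naive-Bayes style, summing
        log-odds to avoid floating-point underflow on long messages."""
        log_odds = 0.0
        for f in features(message):
            # Laplace-smoothed estimates of P(feature | spam) and P(feature | ham)
            p_spam = (self.spam_counts[f] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_counts[f] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam) - math.log(p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))
```

After training on a handful of spam and ham messages, a message sharing features with the spam corpus scores above 0.5 and one sharing features with the ham corpus scores below it.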
When the corpora accurately reflect a user's message stream, Bayesian filters can be very accurate. As more messages arrive, the filters can adapt to the changing characteristics of messages. (See Sources of Corpora, below.)
It can be difficult to collect enough messages, particularly legitimate messages, to train a Bayesian filter accurately. Both filter training and filtering are slow, since they require significant processing of messages to extract features, to calculate probabilities, and to match message features against the filters.
When a filter is self-training, that is, it adds messages that it has classified to its own corpora, errors tend to cascade: each misclassified message skews the statistics, making similar misclassifications more likely.
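A common way to limit cascading errors is to self-train only on messages the filter classifies decisively. The sketch below assumes a filter object exposing `train()` and `spam_probability()` methods; that interface and the threshold values are illustrative assumptions, not a standard.

```python
# Illustrative thresholds: only decisive scores feed back into the corpora.
SPAM_THRESHOLD = 0.99  # add to spam corpus only at or above this score
HAM_THRESHOLD = 0.01   # add to ham corpus only at or below this score

def self_train(spam_filter, message):
    """Classify a message and, only when the score is decisive,
    add it back to the filter's training data. Messages scoring in
    the uncertain middle band are classified but not trained on."""
    score = spam_filter.spam_probability(message)
    if score >= SPAM_THRESHOLD:
        spam_filter.train(message, is_spam=True)
    elif score <= HAM_THRESHOLD:
        spam_filter.train(message, is_spam=False)
    return score
```

This trades adaptation speed for safety: the filter still drifts with the message stream, but borderline messages, the ones most likely to be misclassified, never reinforce their own classification.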
Sources of Corpora
In many environments, it is difficult to find and classify enough messages to create an adequate corpus to train filters. While there are many available archives of spam, there are few archives of legitimate mail, for privacy reasons. Some filters are initially trained from a static corpus, then updated as messages are delivered: automatically, as messages are classified by the filter, or manually, as users classify mail by explicit tagging or implicitly as they move messages between the MUA's inbox and spam folders. Some systems use a shared filter for a group of users, while others keep separate statistics for each user or group.
A significant source of corpora is Collaborative Filtering, q.v.