INTRODUCTION:
SpamBlocker is an email filter that attempts to identify spam mail
using text analysis and by maintaining blacklists and whitelists. SpamBlocker is based on the Naive Bayes algorithm. Naive Bayes classifiers
are among the most successful known algorithms for learning to classify text documents.
Along with Naive Bayes, I've also used Gzip, which uses the Lempel-Ziv coding compression
algorithm, to identify whether an email is spam or ham.
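As a rough illustration of the compression idea (a minimal sketch, not SpamBlocker's actual code): a message that resembles a corpus adds few bytes when it is compressed together with that corpus. The function names and the tiny corpora below are hypothetical, and Python's standard gzip module stands in for whatever Gzip interface SpamBlocker really uses.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of the gzip-compressed data."""
    return len(gzip.compress(data))


def extra_bytes(corpus: str, message: str) -> int:
    """Bytes added to the compressed corpus when the message is appended."""
    base = compressed_size(corpus.encode("utf-8"))
    combined = compressed_size((corpus + "\n" + message).encode("utf-8"))
    return combined - base


def classify_by_compression(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the class whose corpus compresses the message more cheaply.

    Ties go to ham, on the assumption that a false alarm is worse
    than a missed spam mail.
    """
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Hypothetical toy corpora; a real filter would use many stored mails.
SPAM_CORPUS = "win free money now claim your free prize today limited offer click here"
HAM_CORPUS = "the meeting is scheduled for monday please bring the quarterly report to the office"

if __name__ == "__main__":
    print(classify_by_compression("claim your free prize today", SPAM_CORPUS, HAM_CORPUS))
    print(classify_by_compression("please bring the quarterly report", SPAM_CORPUS, HAM_CORPUS))
```

With corpora this small the result is only indicative; the approach needs a reasonably large set of known spam and ham mails to be useful.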
Using its rule base, SpamBlocker runs heuristic tests on mail
headers and body text to identify spam.
SpamBlocker typically differentiates successfully between spam and
non-spam (ham) in 95% or more of cases, depending on what kind
of mail you get.
Note:
This code and data are supported only under the Unix and Linux operating systems.
To reconstruct the original files from a downloaded archive such as xxx.tar.gz,
type the following two commands at the shell prompt:
gunzip xxx.tar.gz
tar -xf xxx.tar
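Description of the Bayesian Algorithm for Identifying Spam
Bayes' rule relates the probability of a class given a word to the probability of the word given the class:
P(C|W) * P(W) = P(W|C) * P(C)
which gives
P(C|W) = P(W|C) * P(C) / P(W)
Thus the probability of a particular class (by class I mean Spam / Ham) for a particular
document/email containing the words W1 ... Wi is estimated, up to a normalizing factor, by the product
P(C|W1,...,Wi) ~ P(C|W1) * P(C|W2) * P(C|W3) * ... * P(C|Wi)
This treats the words as independent of one another, which is the "naive" assumption.
The class identifies the category the document/email belongs to: either Spam or Ham.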
Abbreviations used
C : Class of the document (here there are two classes, Spam and Ham)
W : Word
P : Probability
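
The formulas above can be sketched in Python as follows. This is a minimal illustration under my own assumptions, not the SpamBlocker source: the function names, the whitespace tokenizer, the add-one smoothing and the toy training mails are all hypothetical. Logarithms are summed instead of multiplying the probabilities directly, so that long mails do not underflow to zero.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (text, label) pairs, where label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        # Count each word once per mail (document frequency).
        word_counts[label].update(set(text.lower().split()))
    return word_counts, doc_counts


def p_class_given_word(word, cls, word_counts, doc_counts):
    """P(C|W) = P(W|C) * P(C) / P(W), with add-one smoothing (an assumption)."""
    total_docs = sum(doc_counts.values())
    p_c = {c: doc_counts[c] / total_docs for c in doc_counts}
    p_w_given_c = {
        c: (word_counts[c][word] + 1) / (doc_counts[c] + 2) for c in doc_counts
    }
    # P(W) is the sum over both classes of P(W|C) * P(C).
    p_w = sum(p_w_given_c[c] * p_c[c] for c in doc_counts)
    return p_w_given_c[cls] * p_c[cls] / p_w


def classify(text, word_counts, doc_counts):
    """Return the class with the larger product P(C|W1) * ... * P(C|Wi)."""
    words = set(text.lower().split())
    scores = {
        cls: sum(
            math.log(p_class_given_word(w, cls, word_counts, doc_counts))
            for w in words
        )
        for cls in doc_counts
    }
    return max(scores, key=scores.get)


# Hypothetical training mails, for illustration only.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("lunch on monday with team", "ham"),
]

if __name__ == "__main__":
    counts, totals = train(TRAINING)
    print(classify("free money prize", counts, totals))
    print(classify("monday meeting with team", counts, totals))
```

The smoothing keeps a word that was never seen in one class from driving that class's product to zero.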