“Spam and the ongoing battle for the Inbox”
Definitions:
Spam - Unsolicited email, often advertising a product or service. Spam can occasionally “flood” an individual or ISP to the point that it significantly slows down the data flow.
Phishing - In a computing context, phishing is the impersonation of a corporation or other trusted institution. The goal of the impersonation is to extract passwords or other sensitive information from the victim. It is a form of criminal activity that uses social engineering techniques. Phishing is typically done using email or an instant messaging program. The message is designed to appear to come from an authentic source so that the victim will either respond directly or open a URL link to a fake web site run by the criminals.
Spam Filter - Software that uses various techniques to redirect unwanted email away from a user’s inbox. These filters can be based on a variety of criteria, including the sender's email address and specific words in the subject or message body, and can be implemented by end users as well as ISPs.
Unique Clicks - The number of different individuals who click on an ad link within a specific period of time.
Human interaction proof (CAPTCHA) - HIPs (also known as "completely automated public Turing Tests to tell computers and humans apart," or CAPTCHAs, or just plain Turing Tests) are a key component in preventing abuse. The most common type of HIP is an image of a sequence of letters and digits that has been automatically distorted. For example, before signing up for most free email accounts, users are required to solve one by correctly entering the sequence of letters and numbers in the image. Without HIPs, spammers would use these services to produce a torrent of spam. Several products (such as MailBlocks and Matador) have used HIP challenges for suspected spam as a kind of economic approach. HIPs also prevent, for instance, the automated harvesting of Web site data and automated attempts to guess or steal passwords.
What percentage of overall mail volume is spam?
“Spam has increased from approximately 10% of overall mail volume in 1998, constituting an annoyance, to as much as 80% today”
Give two examples of the escalation of technology taking place between spammers and spam filters.
Training Machine Method
The use of a training machine to filter spam has been widely accepted. The Naïve Bayes method, among other methods of spam filtering, uses a learning algorithm. “The Naïve Bayes method is used to find the characteristics of the spam mail versus those of the good mail. Future messages can be automatically categorized as highly likely to be spam, highly likely to be good, or somewhere in between. The earliest learning approaches were fairly simple, using the Naive Bayes algorithm to count how often each word or other feature occurs in spam messages and in good messages. To be effective these methods need training data (known spam and known good mail) to train the system.”
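The word-counting idea in the quotation can be sketched in a few lines of Python. This is only an illustration, not the filter described in the source: the class name `NaiveBayesFilter`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase words split on whitespace.
    return text.lower().split()


class NaiveBayesFilter:
    """Toy Naive Bayes filter (hypothetical helper, not a real library)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    def train(self, text, label):
        # Count how often each word occurs in spam and in good mail.
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        # Requires at least one training message for each label.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["good"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "good"):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.message_counts[label] / total_messages)
            for word in tokenize(text):
                # Add-one smoothing so unseen words do not zero the product.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into a probability between 0 and 1.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        good = math.exp(log_scores["good"] - top)
        return spam / (spam + good)


nb_filter = NaiveBayesFilter()
nb_filter.train("free money now", "spam")
nb_filter.train("win free prize", "spam")
nb_filter.train("meeting at noon", "good")
nb_filter.train("lunch tomorrow at noon", "good")
```

A value near 1 corresponds to "highly likely to be spam", a value near 0 to "highly likely to be good", and anything around 0.5 falls "somewhere in between".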
Other Training Algorithm Methods
Much more sophisticated algorithms have since emerged that put “weights” on words. These algorithms "learn" a weight for each word in a message. If the weights of the words in an email message total more than a specified threshold, the message is flagged as spam. The weights are carefully adjusted so that the results derived from the training examples of both spam and good email are as accurate as possible. The learning process may require repeatedly adjusting tens of thousands or even hundreds of thousands of weights, a potentially time-consuming process. Fortunately, progress in machine learning over the past few years has made such computation possible.
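As a rough illustration of how per-word weights can be "learned", the sketch below uses a simple perceptron-style update. The source does not name a specific algorithm, so the update rule, the function names, and the threshold of 0 are assumptions chosen for brevity.

```python
def tokenize(text):
    # Simplifying assumption: lowercase words split on whitespace.
    return text.lower().split()


def score(weights, text):
    # Sum the weights of the distinct words in the message.
    return sum(weights.get(word, 0.0) for word in set(tokenize(text)))


def is_spam(weights, text, threshold=0.0):
    # Flag the message when its total weight exceeds the threshold.
    return score(weights, text) > threshold


def train(examples, epochs=20, step=1.0, threshold=0.0):
    """Learn one weight per word from (text, is_spam) training pairs."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            if is_spam(weights, text, threshold) == label:
                continue  # already classified correctly, leave weights alone
            # Nudge every word in a misclassified message toward its label.
            delta = step if label else -step
            for word in set(tokenize(text)):
                weights[word] = weights.get(word, 0.0) + delta
    return weights


training_examples = [
    ("free money now", True),
    ("win a free prize", True),
    ("meeting at noon", False),
    ("free lunch at noon", False),
]
learned_weights = train(training_examples)
```

Real filters adjust far more weights over far more messages, which is why the passage above calls the process potentially time-consuming.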
Compression Technique
“Compression-based technique is more
effective for spam filtering than traditional machine learning systems.
Compression-based systems build a model of spam and a model of good email. A
new message is compressed using both the spam model and the good-email model.
If the message compresses better with the spam model, the message is likely
spam; if it compresses better with the good-email model, the message is more
likely legitimate.”
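The compare-two-models idea can be approximated with a general-purpose compressor. The sketch below uses Python's `zlib` and measures how many extra bytes a message costs when appended to a spam corpus versus a good-mail corpus. This is a stand-in for illustration only: the source does not say which compressor is used, and the function names and tiny corpora are assumptions.

```python
import zlib


def extra_bytes(corpus: bytes, message: bytes) -> int:
    # Extra compressed size needed to encode the message after the corpus.
    with_message = len(zlib.compress(corpus + message, 9))
    without_message = len(zlib.compress(corpus, 9))
    return with_message - without_message


def classify(message: str, spam_corpus: str, good_corpus: str) -> str:
    data = message.encode("utf-8")
    spam_cost = extra_bytes(spam_corpus.encode("utf-8"), data)
    good_cost = extra_bytes(good_corpus.encode("utf-8"), data)
    # The model under which the message compresses better wins.
    return "spam" if spam_cost < good_cost else "good"


# Tiny illustrative corpora (assumptions, not real training data).
SPAM_CORPUS = (
    "buy cheap pills now. click here for a free offer. "
    "win money fast with this free offer. "
)
GOOD_CORPUS = (
    "the project meeting is moved to tuesday at noon. "
    "please review the attached report before the meeting. "
)
```

Because `zlib` only looks back over a 32 KB window, this shortcut only makes sense for small corpora; it shows the principle rather than a deployable filter.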
IP Address Filtering Method
IP address filtering has been used to block spam, but spammers circumvent it by taking over other machines to create a horde of zombies, or “botnet,” and using those machines as drones to do the dirty work of sending their spam messages from unblocked IP addresses.
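A minimal sketch of an IP blocklist check, using Python's standard `ipaddress` module, shows both the method and its weakness. The blocked ranges below are reserved documentation addresses chosen for illustration, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical blocklist: known spam sources, as networks or single hosts.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_blocked(sender_ip, blocked=BLOCKED_NETWORKS):
    # Reject mail whose sending address falls inside any blocked network.
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in blocked)
```

A zombie machine sending from an address that is not on the list, such as 203.0.113.9 here, passes the check, which is exactly the gap that botnets exploit.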
Certificates & Security ID Method
Certificates or Security ID methods have
been robust to most attacks but too difficult to deploy for practical reasons.
They typically focus on the identity of a person rather than the identity of an
email address, thus requiring a certifying agency of some sort. Some proposals
would require all Internet users to go to their local Post Office and pay a fee
to get a certificate. In addition, these proposals usually require some form of
attachment or inclusion in the email message itself, confusing some users.
Similarity Matching Method
Similarity-matching solutions are among the most widely deployed spam filtering techniques. They attempt to find examples of known spam; for example, email that has gone to a special trap account that should receive no legitimate email, or email that users have complained about. They then try to match new examples to this known spam. Spammers actively randomize their email in an attempt to defeat these matching systems. In some cases (such as spam where the primary content is an image meant to defeat both matching-based and machine-learning-based text-oriented filters), spammers even randomize the image to defeat image-matching technologies.
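One simple way to match a new message against known spam is to compare overlapping word sequences ("shingles") with the Jaccard similarity. The source does not say which matching technique deployed systems use, so this choice, the function names, and the 0.5 threshold are assumptions for illustration.

```python
def shingles(text, k=2):
    # Overlapping runs of k consecutive words, as a set.
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    # Shared shingles divided by all distinct shingles in either message.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matches_known_spam(message, known_spam, threshold=0.5, k=2):
    # True when the message is close enough to any known spam example.
    message_shingles = shingles(message, k)
    return any(
        jaccard(message_shingles, shingles(example, k)) >= threshold
        for example in known_spam
    )


# Hypothetical trap-account message used as the known spam example.
KNOWN_SPAM = ["claim your free prize now by clicking the link below"]
```

Changing a single word still leaves most shingles intact, so light edits are caught; heavier randomization lowers the score below the threshold, which is why spammers randomize their messages.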
What have spammers done to circumvent anti-spam methods?
Spammers have not sat idle while algorithms have been ramped up. “Traditional machine learning for spam filtering has many weaknesses. Initially, spammers sought to overcome these filters by making sure that words with large (spammy) weights, like "free," did not appear verbatim in their messages. For instance, they might break the word into multiple pieces using an HTML comment (fr<!--><-->ee) or encode it with HTML ASCII codes (fr&#101;e). When displayed to a user, both these examples look like "free," but for spam-filtering software, especially on servers, any sort of complex HTML processing is too computationally expensive, so the systems do not detect the word "free."”
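To see why these tricks hide the word from a filter that reads raw text, and what undoing them involves, here is a minimal normalization sketch using Python's standard `re` and `html` modules. It is not a full HTML parser, and the function name `normalize` is hypothetical. The numeric character reference in the second example is an illustrative encoding of the letter "e".

```python
import html
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments
TAG_RE = re.compile(r"<[^>]+>")                    # any remaining tags


def normalize(raw_html):
    # Drop comments, then tags, then decode character references,
    # approximating the text a reader would see on screen.
    text = COMMENT_RE.sub("", raw_html)
    text = TAG_RE.sub("", text)
    return html.unescape(text)


obfuscated = ["fr<!--><-->ee", "fr&#101;e"]
for sample in obfuscated:
    # The raw text never contains the word; the normalized text does.
    print("free" in sample, "free" in normalize(sample))
```

Running this step on every incoming message is the extra processing cost that the quotation says server-side filters tried to avoid.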
Anti-Spam Links & Tips for you!
FTC Spam website - http://www.ftc.gov/spam/
Sign up your email address to not receive junk email here.
Make A Difference! - http://spam.abuse.net/bits/makeadifference.shtml
How you can help out in the fight against spammers.
How to Prevent Spam - http://www.spamlaws.com/prevent-spam.html
List of guidelines to help you remove and decrease the amount of spam you receive.