
Obviously, fun should catapult ions

No doubt by now you've encountered the fun spammer trick of obfuscation.  Spammers go to great lengths to hide their content from filters (while leaving it at least barely understandable to humans).  The most obvious tactics are misspellings and splitting words apart with whitespace or punctuation, but it gets much, much more bizarre than that.  Some of the tactics used by spammers are downright surreal.  In the sections below I'll attempt to outline several of the more popular techniques, and possible ways to combat them.  This is not an exhaustive list by any stretch, and new methods of obfuscation are being devised every day.

Misspellings and non-alpha characters
This is the oldest and most common form of obfuscation, invented to get around one of the oldest and most unreliable anti-spam mechanisms:  keyword matching.  As keyword and phrase matching technology advanced, gaining substring searches and other types of wildcard matching, spammers got more clever in their word representations.  It doesn't take a clever person to realize that there are more ways to creatively spell a word than there is reasonable time to anticipate and block them all.  Worse, some of the words that most commonly appear in spam also appear in "wanted" mail.  Consider "free", "mortgage", "investment", "sex" (think gender), "breast" (think health care), etc.  What can you possibly do to fight such wily spam?

The first thing to realize is that pure dictionary-based solutions are hopelessly obsolete and, by their very nature, reactive.  Don't get sold on a "lexical" solution as a primary way to block spam.  A step in the right direction is wildcard matching (* for 0 or more characters, ? for a single character, etc.) and, better, regular expression filtering (usually called REGEX).  For the Perl mongers and old-school sed and awk users this will be natural, but for many novices REGEX is difficult to grasp.  For Microsoft-centric shops you'll need to rely on rules created either by a product vendor, some type of professional services consultant, or by the Open Source community.  REGEX is extremely powerful, but that can easily be a destructive power in the hands of someone who doesn't fully understand what they're doing.
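To make the REGEX idea concrete, here is a minimal sketch in Python of a pattern that tolerates look-alike characters and separators between letters.  The LOOKALIKES table, the two-separator limit and the function name are my own assumptions for illustration, not rules taken from any product.

```python
import re

# Hypothetical substitution table: characters commonly swapped in for letters.
LOOKALIKES = {
    "a": "a@4",
    "e": "e3",
    "i": "i1!|",
    "o": "o0",
    "s": "s$5",
    "t": "t7+",
}

def obfuscation_pattern(word):
    """Build a regex matching `word` with look-alike characters and
    up to two non-alphanumeric separators between letters."""
    parts = []
    for ch in word.lower():
        choices = LOOKALIKES.get(ch, ch)
        # re.escape keeps characters such as | and + literal inside the class.
        parts.append("[" + re.escape(choices) + "]")
    # [\W_]{0,2} allows spaces, dots, dashes or underscores between letters.
    return re.compile(r"[\W_]{0,2}".join(parts), re.IGNORECASE)

pattern = obfuscation_pattern("mortgage")
for sample in ["m o r t g a g e", "m0rtg@ge", "M.O.R.T.G.A.G.E", "mortar"]:
    print(sample, "->", bool(pattern.search(sample)))
```

Note that every character added to the table widens the net for false positives as well as for spam, which is exactly the destructive power mentioned above.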

Some newer techniques exist for discovering obfuscated content, such as content analysis (tries to determine if a high percentage of the message is non-alpha characters), Bayesian filtering (which relies on being able to recognize content the more times it has seen it, and take direction from a "trainer"), and even language analysis.  Bayesian is very interesting, but also reactive, and language analysis is a relatively new and immature method (although improving).  This area will continue to be an arms race for the foreseeable future, but the spammers are likely to always have the upper hand here, since most of the above methods rely on being able to see new unwanted content several times before being able to conclusively block it (and at significant risk of blocking "good" messages, in many cases).
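The content analysis check mentioned above can be sketched in a few lines of Python.  The 0.3 threshold and the sample strings are assumptions chosen for illustration; a real filter would tune the cut-off against its own mail.

```python
def non_alpha_ratio(text):
    """Fraction of non-whitespace characters that are not letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    non_alpha = sum(1 for ch in visible if not ch.isalpha())
    return non_alpha / len(visible)

THRESHOLD = 0.3  # hypothetical cut-off, not a recommended value

for sample in ["Lunch at noon on Friday?", "F.R.E.E m0r+g@ge r@tes!!!"]:
    ratio = non_alpha_ratio(sample)
    verdict = "suspicious" if ratio > THRESHOLD else "ok"
    print(f"{ratio:.2f} {verdict}: {sample}")
```

Messages that legitimately contain many digits or symbols (invoices, code snippets) will also score high, so a ratio like this is probably best used as one input to a score rather than as a verdict on its own.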

"Write a novel" Bayesian avoidance
Many people are now familiar with Bayesian analysis as a method of fighting spam.  It should come as no surprise that spammers are also well aware of it.  Many attempts have been made to circumvent Bayesian filters, with varying degrees of success.  One rather interesting approach has been to flood a message with an overwhelming amount of "good" text in an attempt to drown out the "bad" tokens and skew the probability towards "ham".
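A small worked example shows why the flood works.  The sketch below assumes a simple naive-Bayes style combination of per-token probabilities; actual filters differ in the formula they use, and the token probabilities here are invented for illustration.

```python
from math import prod

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, naive-Bayes style."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)

# Hypothetical probabilities a trained filter might hold for each token.
spam_tokens = [0.99, 0.95, 0.90]  # the actual solicitation
padding = [0.20] * 10             # innocuous words added by the spammer

print(combined_spam_probability(spam_tokens))            # close to 1
print(combined_spam_probability(spam_tokens + padding))  # pulled well below 0.5
```

Under these assumptions, ten mildly "hammy" tokens are enough to outweigh three strongly "spammy" ones, which is why the padding tends to be so long.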

So as not to raise the reader's suspicion, this text is often hidden.  One of the most common methods is "invisible ink", which involves setting both the font color and background color to match (in HTML e-mail).  Another popular trick is to hide the "good text" (which is often a news article, or a few pages from a popular book) in the text/plain section of a multipart/alternative message.  Since most e-mail clients will display text/html if it's available, the spammer can place their solicitation in the text/html part and whatever goes in the text/plain section will be safely ignored by the vast majority of their audience.

Several obvious methods are effective against this sort of obfuscation.  First, "invisible ink" is relatively easy to detect:  simply check for font and background colors being identical.  Of course, spammers caught on to that fact and started making the font color vary ever so slightly from the background, just enough that the human eye still can't tell the difference.  In one of the most novel applications ever seen, some very clever people realized that you can use the Pythagorean theorem to determine the "closeness" of two colors (treating each color as a point in red/green/blue space) and thus still detect the "invisible ink".  As for text that hides in the text/plain section of a multipart/alternative message, detecting substantial differences between the text rendered in HTML and the text when displayed plain is easy.  Since some text/plain parts contain only an advisory that the message should be read with HTML enabled, this check should be applied only when the text/plain section also exceeds a certain number of bytes.
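The color "closeness" check can be sketched as a straight-line distance in RGB space.  The threshold of 40 is an assumption of mine, plain RGB distance is only a rough proxy for what the eye perceives, and this sketch only handles six-digit hex colors such as #ffffff.

```python
import math

def parse_hex_color(value):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))

def color_distance(c1, c2):
    """Euclidean (Pythagorean) distance between two colors in RGB space."""
    return math.dist(parse_hex_color(c1), parse_hex_color(c2))

MIN_VISIBLE_DISTANCE = 40.0  # hypothetical threshold

def is_invisible_ink(font_color, background_color):
    """True when the two colors are too close for text to be readable."""
    return color_distance(font_color, background_color) < MIN_VISIBLE_DISTANCE

print(is_invisible_ink("#ffffff", "#fefefd"))  # near-white on white
print(is_invisible_ink("#000000", "#ffffff"))  # black on white
```

A filter would still have to extract the effective font and background colors from the HTML (attributes, inline styles, nested tables), which is the harder part and is not attempted here.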

Alternative Character sets, Encoding, and Escape characters
Another popular method of evading content filtering is text encoding.  Since content filters generally look for ASCII text, the spammer simply writes their message in a different character set--one that contains the characters necessary to write English words, but that is unlikely to be recognized by the filter.  While these methods can be effective, a counter-measure is to assign a higher spam probability to messages sent in unusual character sets.

A slightly more interesting approach is to actually encode the text before placing it in the message.  UUEncode and especially Base64 encoding are used for this.  They're two methods of encoding data from any format into ASCII so that it can be handled by e-mail MTAs.  Usually these methods are used to attach binary files to messages, but there's nothing to stop them being used on normal text as well.  Once encoded, text is rendered unreadable and thus will not match keyword filters.  For example, the phrase:
Refinance your debt now!
will become:
UmVmaW5hbmNlIHlvdXIgZGVidCBub3chCg==
in base64 encoding.  Obviously, this presents a problem, but fortunately it's not insurmountable.  Simply decoding text type MIME parts that have been encoded will allow them to be scanned as normal.  Additionally, you can feel confident in assigning a higher spam probability to a message that has encoded text parts, since encoding adds a significant amount of overhead and is generally not done without a good reason.  Simply put, there's no reason to encode plain text unless you're trying to hide it.
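As a minimal sketch of that decoding step, Python's standard email package will undo the transfer encoding for you.  The sample message, its addresses and the function name are made up for illustration; how much extra spam probability an encoded text part deserves is left to the filter.

```python
import email
from email import policy

RAW_MESSAGE = """\
From: sender@example.com
To: recipient@example.com
Subject: hello
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: base64

UmVmaW5hbmNlIHlvdXIgZGVidCBub3chCg==
"""

def decoded_text_parts(raw):
    """Yield (was_encoded, text) for every text/* part of a message."""
    msg = email.message_from_string(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).lower()
        # Flag parts whose plain text was wrapped in a binary-style encoding.
        was_encoded = cte in ("base64", "x-uuencode", "uuencode")
        # get_content() reverses the transfer encoding and the charset.
        yield was_encoded, part.get_content()

for was_encoded, text in decoded_text_parts(RAW_MESSAGE):
    print(was_encoded, repr(text))
```

The decoded text can then be passed to the same keyword or Bayesian checks as any other message body, with the was_encoded flag available as an extra scoring signal.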

The last trick in this section is the use of escape characters.  Good old HTML once again comes and bites us (notice how many spam tricks are allowed to happen because of HTML?).  The Hyper Text Markup Language has the handy inclusion of escape sequences to generate plain ASCII text.  I'm not sure why you would normally want to use 6 characters in code just to display one on the page, but there must have been some reason.  In any case, this makes another great way to obfuscate spam.  Consider the following text:
&#65;&#110;&#111;&#116;&#104;&#101;&#114;&#32;&#72;&#84;&#77;&#76;&#32;&#116;&#114;&#105;&#99;&#107;&#33;
which translates as:

Another HTML trick!

Don't believe me?  Just view the source from this page.  Of course, this trick is also easy to detect, and in fact makes some filtering much easier.  Normal e-mail clients do not generate messages composed entirely of escape characters, so a simple comparison of the ratio of HTML escapes to normal characters is a good start.  For content filtering, you can safely add such words as "free" in their appropriate HTML escape sequence (&#102;&#114;&#101;&#101;) even though you couldn't normally filter on a word that could appear in so many ambiguous contexts.
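Both ideas, the escape ratio and decoding before filtering, fit in a short Python sketch.  The regular expression and function names are my own, and the input is assumed to be text content with the HTML tags already stripped.

```python
import html
import re

# Matches numeric (&#110; or &#x6e;) and named (&amp;) character references.
ENTITY_RE = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def escape_ratio(source):
    """Fraction of displayed characters that were written as HTML escapes."""
    entities = ENTITY_RE.findall(source)
    displayed = len(html.unescape(source))
    return len(entities) / displayed if displayed else 0.0

def normalize(source):
    """Decode escapes so keyword filters see what the reader sees."""
    return html.unescape(source)

obfuscated = "&#102;&#114;&#101;&#101; offer"
print(normalize(obfuscated))     # the text as a mail client would render it
print(escape_ratio(obfuscated))  # 4 escapes out of 10 displayed characters
```

A legitimate message will usually contain a handful of escapes such as &amp; or &nbsp;, so the ratio is more telling than the mere presence of escape sequences.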

Encryption
All of the above methods of obfuscation have some type of counter-measure, but what if the content being sent is encrypted?  A few years ago a possibility like this would have been laughed off as impractical, since encrypting a message requires some type of pre-arrangement by both parties in order for the recipient to then decrypt it.  That fundamental fact hasn't changed, but what has changed is the adoption of integrated encryption plug-ins for e-mail clients, which are now being rolled out en masse by corporations concerned about the confidentiality of their messages.  This means that not only is there a fast-growing install base of clients with encryption capability, but also that many unsavvy users now have encryption facilities in their client, where previously it was only the most technically advanced users who installed such software.

So just where am I going with all this?  Simple.  Who is responsible for causing mass-worms to propagate?  Unsavvy users.  What is the first thing that mass e-mail worms do when they infect a client?  They scan the address book and e-mail a copy of themselves to all the entries.  Now combine that with the fact that many desktop users now hold the public keys of many other users in order to send them encrypted messages, and you have a disaster waiting to happen.  If a mass-worm could scan a key ring for public keys, then encrypt itself and send itself to all those e-mail users, the worm would arrive as an encrypted payload and would pass right through all known virus and content scanners.

The only current way to prevent such an attack would be to block all encrypted messages, certainly not a cheerful thought.  An alternative would be a mechanism whereby the gateway SMTP server at each organization has the ability to decrypt every message sent by a user in that organization, or any message sent to a user in that organization.  This would require massive duplication of all the keys to the SMTP gateway, or some PKI that made those keys available to the gateway as well as desktop clients.  Unless the scanning servers are able to decrypt messages sent to and from their users, encrypted worms and viruses are a time bomb waiting to go off.




This site © copyright 2003-2011 Brian Keefer.  Unauthorized republication is forbidden.