Ending Spam, by Jonathan Zdziarski, ISBN 1-59327-052-6
Reviewed by: Howard Carson, July 2005
Published by: No Starch Press, distributed by O'Reilly Media
MSRP: US$39.95, CDN$53.95
If I never see another spam e-mail, it won't be too soon. Unfortunately, I most definitely will see more spam e-mail, and lots more of it. The ongoing quest to rid our Inboxes of the spam scourge is fraught with difficulties and riddled with choices based on technology that is unfamiliar to most SOHO and small business people. Worst of all, the choices we make are often insufficient for our needs (the spam keeps coming). As well, whatever spam filtering software we choose is bound to require manual intervention on our part in order to continuously fine-tune, update, upgrade and on occasion actually replace the software with another product. Ending Spam author Jonathan Zdziarski (last name pronounced: zid-jarski), longtime spam fighter, DSPAM evangelist, antispam lecturer and programmer, has written a book which attempts to explain the various methods and mathematical approaches which underpin current spam filtering technologies.
Zdziarski is a huge fan of Bayesian Content Filtering and Statistical Language Classification—also referred to as statistical, adaptive spam filtering. Basically, the methodologies are components of Intelligent Machine Learning which is essentially defined as programming capable of dynamically adapting to a particular user's behavior and figuring out what that user believes is and isn't spam. In the simplest terms, statistical, adaptive spam filtering appears to be the most effective method currently available to stop most spam in its tracks—that is, to prevent the spam from reaching your Inbox. Before we go any farther, let's address the foundational impetus behind spam and why it's so universally reviled at the same time as it's increasing in both volume and intensity.
First and foremost, spam is inveigling and in many cases almost irresistibly attractive to about 5% of the general population in the industrialized world. The reasons are reasonably straightforward if you're a student of psychology. If you're just plain folks, the explanation is a bit more complex but still clearly understandable.
Topping the list of reasons is greed in all its nasty glory. The prospect of a great deal (Microsoft Office for $39.95 anyone?) presented in the context of a typical HTML e-mail complete with professional graphics, links to click and so on, is deliberately designed to dull our normally suspicious reaction to the sale of a product for one fifteenth of its regular price. We want to believe that we encountered a fabulous bargain. We want to believe that the disclaimers in the e-mail entreaty which state that "This is a SPECIAL promotion and is NOT supported by Microsoft" somehow lend honest credence to what is patently an impossible deal. We know that Microsoft is not going out of business and that legitimate retailers do not have to sell its products at fire sale prices. I won't bore you with endless stories about the number of people who have handed credit card numbers to these crooks and received nothing—NOTHING—in return. I talked to a victim of this sort of con last month (June 2005). When I asked him point-blank why he fell for such an obvious rip-off he replied "I know it's a longshot, but you never know. One of these deals could be legit. Besides, it was only forty bucks." I'll leave you to your own conclusions.
I got the same reaction last year (during research interviews in March and November 2004) from people who had been ripped off to the tune of about $9,000.00 each by a couple of the Nigerian (Ivory Coast, South Africa, Zimbabwe, Uganda, etc., etc.) money laundering scams (you know the one: "I've got twenty million stolen by my late Uncle that needs to be cleared and I'll pay you a huge percentage if you'll take care of the transactions for me.") Of course the scam always ends up costing you lots of money for purported bribes, transportation clearances, fees and sundry other nonsense. The Nigerians must think we're a nation of absolute morons. Where greed is concerned, they're not completely wrong—at least for the aforementioned 5% of the population.
Pornography spam is another matter altogether. Like it or not, there is a demonstrable prurience factor which eludes the rational control of a shockingly large percentage of the male population. Susceptibility estimates range as high as 15% of all males online over the age of 14. That's a lot of potential business to tap. For the sex photo and video purveyors it represents a market worth billions. All they have to do in order to tap into the market is to find a way to get those males to click the links. Methods abound.
The cost of all this in terms of monetary transactions is staggering. The numbers range into the billions of dollars every year, worldwide, with year over year growth.
The second level of spam is related to a simpler and legitimate exchange of money for clicks. There are thousands of anonymous web sites which claim to be the best sources for one thing and another. When you receive an e-mail enticing you to click a link that takes you to one of these web sites, you'll be greeted with a browser window full of product and sales links, pop-ups and scripts (bits of programming code which silently install tracking software and other programs on your computer). The spammers are taking advantage of all the pay-per-click and commission-based Internet advertising out there by aggregating large selections of ads on a single web site. Ads are sometimes included without the permission of the product manufacturer or retailer. The sites also act as vehicles for so-called AdWare and SpyWare which manage to install themselves on your computer, adding unsolicited links to your Bookmarks or Favorites list, redirecting your home page, tracking everything you do in some cases and otherwise performing more nefarious exploits. The whole point is to get you to look at things you might not otherwise have time for, entice you into clicking links, and without your direct permission control some portion of your online attention. It's creepy.
The third level of problems resulting from e-mail spam is more insidious and physically destructive. Virus writers of all descriptions are just as aware of our susceptibility as the spammers. As a consequence, virus writers regularly and anonymously broadcast e-mail containing virus attachments disguised as documents, images and even attached e-mails. They include cleverly written subject lines using the language of social engineering (based on old, simple and commonly understood principles of psychology) to attract your attention and distract you just long enough to get you to click a link in the message body or open the file attachment. Do it and you're done. If your antivirus software is malfunctioning or not up to date, the virus can do irreversible damage to your computer.
In a nutshell, the fundaments of all of that are what Ending Spam is designed to address and defeat. The future is not bleak, according to Zdziarski, and the technology to aggressively blockade spam in all its forms exists today.
The book breaks the analysis of spam into bite-size chunks, which are easy to digest for both programmers and non-programmers. It's the "non-programmer" bit that should be of interest to SOHO, small business and home computer users. For normal people (non-programmers), Zdziarski clearly and accurately explains Tokenization, one of the most important processes used to identify spam. At the same time he presents the technological and mathematical theories, devices and practical applications currently being employed against spam.
The book also clearly explains, and repeats, the differences between older, rule-based spam filtering and the much more effective content-based filtering. Rule-based filtering usually makes reference to a preset database of objectionable keywords. This kind of filtering is somewhat dumb in that it can't learn any new rules unless users intervene and establish them. Rule-based filtering can also benefit from regular updates to its database, the unfortunate side effect being that not everyone has the same tolerance (or intolerance) for certain kinds of e-mail.
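To make the limitation of rule-based filtering concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not code from the book or from any real product: the keyword list and the function names `is_spam_by_rules` and `add_rule` are hypothetical.

```python
# Hypothetical preset keyword database; a real product would ship a far larger one.
BLOCKED_KEYWORDS = {"viagra", "lottery", "refinance"}


def is_spam_by_rules(message, keywords=BLOCKED_KEYWORDS):
    """Flag a message when any of its words appears in the preset keyword list."""
    # Split on whitespace, strip common punctuation, and compare case-insensitively.
    words = {w.strip(".,!?:;\"'()").lower() for w in message.split()}
    return not words.isdisjoint(keywords)


def add_rule(keyword, keywords=BLOCKED_KEYWORDS):
    """Manual intervention: the filter only changes when someone adds a rule."""
    keywords.add(keyword.lower())
```

In this sketch a message containing "lottery" is caught, but a trivially disguised spelling such as "v1agra" slips through until someone calls `add_rule("v1agra")`. That is the "somewhat dumb" behavior described above: nothing is learned without a person stepping in.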
Content-based filtering, on the other hand, begins with a tokenization process: literally, the reduction of an e-mail to its component parts. The more effective and accurate the tokenization, the more accurate and effective the spam filtering. Since the fundamental goal of tokenization is to separate and identify specific features of text, anything which arrives in some sort of encoded format (Base64, ZIP, etc.) has to be decoded, at least in part, in order for the tokenization process to be truly effective and produce something other than gibberish to hand off to the analysis engine.
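The decode-then-tokenize step described above can be sketched in a few lines of Python. This is a simplified illustration, not the tokenizer from the book or from DSPAM: the token pattern, the function names and the `transfer_encoding` parameter are assumptions made for the example, and only Base64 is handled.

```python
import base64
import re

# Assumed token definition: runs of letters, digits, "$", apostrophes and hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9$'-]+")


def decode_body(body, transfer_encoding=""):
    """Undo Base64 encoding so the tokenizer sees readable text."""
    if transfer_encoding.lower() == "base64":
        return base64.b64decode(body).decode("utf-8", errors="replace")
    return body


def tokenize(body, transfer_encoding=""):
    """Reduce a message body to a list of lowercase tokens."""
    text = decode_body(body, transfer_encoding)
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]
```

If a Base64-encoded body is passed without naming its encoding, the tokenizer sees only the encoded characters and hands meaningless tokens to the analysis engine, which is why decoding has to come first.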
The point of the comparative differences in spam filtering techniques is that any discussion of them by someone as obviously knowledgeable as Jonathan Zdziarski is bound to be interesting. For programmers and IS/IT managers battling spam on a daily basis, Ending Spam should serve as the clearest possible reference. If you're working on spam filtering code, or implementing various filtering methods at home, in the office or on a network, the detailed analyses and concise theory on analytical weighting (among a hundred other things) will serve as an excellent guideline. The author covers fundamentals of statistical filtering including decoding, tokenization, data storage, scaling and of course the technical tricks used by spammers. Advanced concepts of statistical filtering are also covered, beginning with testing theory, concept identification, discrimination, feature set reduction, collaborative algorithms and a range of examples of effective filtering.
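For readers who want a feel for what statistical, adaptive filtering means in practice, the following Python sketch shows one common textbook approach: a naive Bayes style score built from per-token counts. It is an assumption-laden illustration, not the algorithm from the book or from DSPAM. The class name `TinyBayesFilter`, the add-one smoothing and the equal weighting of spam and legitimate mail are all choices made for this example.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy statistical filter that learns from messages the user has classified."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each token
        self.ham_counts = Counter()   # number of legitimate messages containing each token
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, tokens, is_spam):
        """Record one classified message; this is the adaptive part."""
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(set(tokens))
        if is_spam:
            self.spam_messages += 1
        else:
            self.ham_messages += 1

    def token_probability(self, token):
        """Estimate how strongly a token points to spam (0.5 means neutral)."""
        # Add-one smoothing keeps unseen tokens away from exactly 0 or 1.
        in_spam = (self.spam_counts[token] + 1) / (self.spam_messages + 2)
        in_ham = (self.ham_counts[token] + 1) / (self.ham_messages + 2)
        return in_spam / (in_spam + in_ham)

    def spam_score(self, tokens):
        """Combine token probabilities into one score between 0 and 1."""
        log_odds = 0.0
        for token in set(tokens):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Each call to `train` shifts the counts, so the same filter ends up behaving differently for two users who disagree about what counts as spam. A score above 0.5 leans toward spam in this sketch; a production filter would choose its threshold and weighting far more carefully.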
Cons: There are none. However, the book is aimed at two distinctly different groups: programmers and non-programmers. As a result, non-programmers will get the most out of the first half of the book, plus a score of pages in later chapters.
Pros: The clearest and most erudite treatise on spam filtering to date. If you're a home, SOHO or small business computer user dealing with mountains of e-mail spam, you owe yourself some education. Ending Spam is just the ticket, with readable definitions, and explanations of spam techniques that are a revelation for their clarity and accuracy. For programmers and IS/IT managers, Ending Spam should become one of your regular reference texts. Well done. Highly recommended.