INTRODUCTION AND DEFINITIONS
Before we begin to discuss methods of operation, classifications and technical workings of spambots, let us first define what a spambot actually is. Wikipedia defines a spambot as such: “A spambot is an automated computer program designed to assist in the sending of spam. Spambots usually create fake accounts and send spam using them.”
This is obviously a very elementary, surface-level definition of a spambot, but it is essentially true. In more detail, a spambot is an automated script, usually a subclass of the common web crawler/scraper, that is used to facilitate the sending of spam through a plethora of means, which vary depending on the type of applications it specializes in probing, harvesting information from and sending spam to. Due to their nature, spambots ignore web meta conventions such as robots.txt and nofollow attributes.
Spambots can range from basic crawler scripts specialized for navigating and posting spam to a particular website, to vast and highly advanced black hat search engine optimization suites employed by amateur and professional spammers and spamdexers alike for sending spam messages on a massive scale across a multitude of platforms and protocols. An example of the latter is the popular proprietary program known as Xrumer, which has been in continuous development since its inception in 2006 and remains a standard favorite among spammers to this day.
The reasons for writing and using spambots differ from person to person, but the practice is firmly associated with black hat search engine optimization (SEO), also known as spamdexing: the use of immoral, obtrusive and sometimes illegal techniques to generate ad revenue and higher search engine rankings for the personal benefit of the spammer. These techniques include link farming, duplicate scraper sites, meta tag stuffing, hidden links, spam blogs, excessive link building and mass spam messaging. Many of these techniques can be automated with advanced spamware, but the latter two are of particular note and prevalence. Indeed, “link building” is often erroneously used by spamdexers as a sly euphemism for their true actions, even though the actual practice is legitimate. Spambots are designed to automatically send out large quantities of spam, but can also be programmed to place links in strategic places (forum signatures, blog comments, etc.), which is a form of black hat link building.
Spambots are not limited to email and web applications. They are also encountered in online games, where they are used for griefing or to put strain on a server as part of a denial-of-service attack, and in IRC networks to crapflood channels; these latter two uses are closely related to the more specific and openly malicious category of denial-of-service (DoS) bots.
Spambots should be differentiated from zombie computers hooked to a botnet: spambots are stand-alone programs, and malicious botnets of this type also place heavy emphasis on distributed denial-of-service (DDoS) attacks, malware and recruiting more zombies, which is distinct from the predominantly black hat SEO bent of spambots. Nonetheless, zombies are frequently used to send email spam in particular.
Depending on the complexity of the spambot, the method to generate messages may simply involve hardcoded text, cycling through several values of hardcoded text (from an array, some other data container or a text file), or in more advanced cases, through the use of Markov chains. Some spambots may even scrape text from web pages for later use.
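The Markov-chain approach can be sketched in a few lines. This is a minimal word-level illustration, not taken from any particular spambot: the chain maps each word to the words observed after it, and a random walk over it produces varied, pseudo-natural text.

```python
import random

def build_chain(text, order=1):
    """Map each tuple of `order` words to the list of words seen after it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def generate(chain, length=10, seed=None):
    """Walk the chain from a random starting key to produce new text."""
    rng = random.Random(seed)
    key = rng.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length - len(key)):
        choices = chain.get(key)
        if not choices:           # dead end: no word ever followed this key
            break
        nxt = rng.choice(choices)
        out.append(nxt)
        key = key[1:] + (nxt,)
    return ' '.join(out)
```

Training it on text scraped from web pages, as mentioned above, is then just a matter of feeding that text to build_chain().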
The creation of spam and anti-spam techniques has proven to be a perpetual arms race. New anti-spam solutions keep appearing and older ones are in the constant process of refining themselves, whereas on the other side experienced and dedicated spammers are always analyzing the inner workings and mechanics of these techniques, and implementing methods to recognize and bypass them. Therefore, this paper is by no means a complete overview, but it still serves to fill a void in research of spambots and provide useful information to an interested reader.
The code examples given here are in Python 2.7, using the standard library and the third-party Splinter web browser abstraction layer. Not all examples are necessarily suited as stand-alone programs, and may only be demonstrative of the underlying concepts.
Email spambots
The most basic and prevalent class of spambots. They typically work by scraping websites for email addresses (through regular expressions or other means), compiling these addresses into a list and then sending hardcoded or user-specified spam messages to the recipients via built-in SMTP client session functionality, which is included in the standard library of most high-level programming languages. They may be rudimentary scripts that use a hardcoded email address specified by the spammer, or contain more advanced features such as creating addresses on the fly, spoofing email and MIME headers, sending attachments, trying to crack email accounts and hijack them for use in spamming activity, or retrieving already-cracked accounts from websites using advanced search operator strings, commonly known as dorks.
Forum spambots
By far the most sophisticated spambot class, and lately also the most popular. These spambots have evolved quickly and significantly from their early incarnations. What started off as simple tools for interacting with newsgroups or basic message boards that had no effective anti-spam measures has evolved into powerful software that can navigate across and register on many types of forum software, make use of private messaging and user control panel settings, and recognize and get past most forms of anti-spam: hidden trap forms through HTML/CSS parsing, and CAPTCHA through optical character recognition (OCR) for images and pattern analysis or parsing of natural language (basic computational linguistics) for text.
Advanced forum spambots may also include utilities such as search engine parsers to retrieve large amounts of links by user-specified keywords or instructions, compiling them into lists or databases and then proceeding to try and attack them all in a multi-threaded fashion. An example of this is Hrefer, a complementary tool to Xrumer.
These spambots either use hardcoded email addresses in their primitive forms, or can be coded to register email addresses on the fly and navigate through webmail interfaces to validate forum registrations. They can also be trained to post in a fashion that evades flood control, and more complex ones have predefined behaviors for different forum software, which can either be set manually or selected automatically: the bot forensically examines the page for traces of the software type (e.g. looking for the word “vBulletin” in the index page, most likely in the footer) and switches to subroutines optimized for it.
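That fingerprinting step amounts to a substring check over the index page's source. The signature strings below are illustrative guesses; a real bot would carry far larger and more robust rulesets per package.

```python
# Illustrative footer signatures; insertion order doubles as check priority.
SIGNATURES = {
    'vBulletin': ['Powered by vBulletin'],
    'phpBB':     ['Powered by phpBB'],
    'PunBB':     ['Powered by PunBB'],
    'SMF':       ['Powered by SMF', 'Simple Machines'],
}

def detect_forum_software(html):
    """Return the first forum package whose signature appears in the source,
    or None if nothing matches (case-insensitive comparison)."""
    lowered = html.lower()
    for name, markers in SIGNATURES.items():
        if any(m.lower() in lowered for m in markers):
            return name
    return None
```

A bot would then dispatch to the subroutine set keyed by the returned name.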
Forum spambots are divided into two subclasses: playback bots, which replay captured strings of POST parameters against the form submission URL, and form-filling bots, which actually read, recognize and submit filled-out HTML forms.
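The core of a playback bot is little more than re-encoding a previously captured parameter set with a few values swapped out. A minimal sketch, with all field names hypothetical:

```python
try:                                    # Python 2
    from urllib import urlencode
except ImportError:                     # Python 3
    from urllib.parse import urlencode

def build_playback_request(form_url, captured_fields, overrides):
    """Recreate a captured form submission, swapping in fresh values.

    captured_fields: name -> value pairs recorded from a legitimate POST.
    overrides:       the fields the bot varies per run (message body, etc.).
    Returns the target URL and the x-www-form-urlencoded body.
    """
    fields = dict(captured_fields)
    fields.update(overrides)
    # sorted() only to make the body deterministic for inspection
    return form_url, urlencode(sorted(fields.items()))
```

A form-filling bot, by contrast, parses the live HTML and discovers the field names at runtime rather than replaying recorded ones.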
Blog and guestbook spambots
Spambots that target blogs and guestbooks tend to be hybrids of the first two classes of spambots, and are often form-filling. Functionality to target blogs and guestbooks is bundled in addition to message boards for more advanced forum spambots, or bots can be specifically written to target these platforms.
Blog and guestbook spambots can have an amalgamation of features, but in general they are much simpler to write than forum spambots due to how commenting systems for such sites are implemented in most applications: plain forms with no anti-spam methods, most often just a screen name, email address, optional website and the comment itself.
Social network spambots
These bots are specifically coded to interface within the boundaries of social networking websites. The most common example is that of Twitterbots, which as the name implies, operate within the Twitter microblogging platform.
Generally most large social networks provide APIs or have third-party APIs written for interfacing with the application, making these bots relatively trivial to write. Twitterbots, for instance, are often written as a recreational activity and not for deliberate spamming. They can arbitrarily follow users, reply to tweets containing certain keywords and post tweets of their own.
Social network spambots are perhaps the least threatening class of spambots, although they’re still a nuisance and can potentially be used to generate large interest in the spammer’s links if implemented well enough. Depending on the moderation effort of the given social network, they can be removed fairly quickly or remain for prolonged periods of time in the user base, churning out messages.
IRC spambots/automated flooders
These are essentially regular IRC bots, but regimented for the sole and minimalistic purpose of flooding channels with spam, usually regular crapflooding (repeatedly sending large and vapid messages).
Most IRC flooders aren’t typically designed to aid in black hat SEO, but rather just to launch denial-of-service and bandwidth exhaustion attacks for amusement or out of spite. The common methods of CTCP PING, DCC request, nick and connect floods have no real use for black hat SEO and spamdexing.
Nevertheless, IRC bots are used to send advertising spam as well. In addition, they are trivial to write in any language that supports Berkeley sockets out of the box, plus there is the added advantage of easily being able to connect using multiple program-controlled clients (clones) to carry out the spam, as well as tunneling through open proxies (SOCKS, HTTP, etc.) for extra effectiveness.
IRC spambots and flooders are still perhaps the easiest class of spambots to mitigate and control.
Online game spambots
These bots are most often trivial applications used to send spam via the instant/team messaging features of cooperative and/or multiplayer online games. Such bots are less frequently used for spamdexing and advertising; they mostly serve the malicious purpose of putting strain on the server, or griefing for personal amusement. They therefore tend to overlap with the more pronounced category of DoS bots.
Once again, they are easy to write in any language that has support for Berkeley sockets, but they may also make use of operating system and hooking APIs to be able to trigger events such as keystrokes and mouse clicks or ones related to the game itself for native interaction with its features.
Note that this paper will primarily focus on the first two classes of spambots: email and forum bots.
Now that we have covered the introduction and basic taxonomy, we shall examine the technical workings of two rudimentary spambots: an email harvester/mass mailer and a form-filling forum spambot.
1. Email spambot overview
Below is the source code of a basic proof-of-concept email spambot, which is made using only the Python standard library and harvests email addresses into an array by looking for matches from a regular expression in the page source (page specified via user argument), archives these addresses into a text file and then connects to the Gmail SMTP server to send a message, relying on hardcoded login credentials and user-specified subject and message as arguments.
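A minimal sketch consistent with that description, assuming Gmail's smtp.gmail.com:587 submission endpoint; the credentials, file name and the deliberately loose regex are placeholders and simplifications.

```python
import re
import smtplib
from email.mime.text import MIMEText

# Deliberately loose pattern: it misses exotic RFC-compliant addresses but
# covers the naming conventions the vast majority of users actually follow.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def harvest(page_source):
    """Extract unique addresses from raw page source, preserving order."""
    seen, found = set(), []
    for addr in EMAIL_RE.findall(page_source):
        if addr not in seen:
            seen.add(addr)
            found.append(addr)
    return found

def archive(addresses, path='harvested.txt'):
    """Append the haul to a plain text file, one address per line."""
    with open(path, 'a') as fh:
        for addr in addresses:
            fh.write(addr + '\n')

def send_spam(login, password, recipients, subject, body):
    """Deliver one copy per recipient through Gmail's SMTP submission port,
    using hardcoded login credentials supplied by the spammer."""
    session = smtplib.SMTP('smtp.gmail.com', 587)
    session.starttls()
    session.login(login, password)
    for rcpt in recipients:
        msg = MIMEText(body)
        msg['Subject'], msg['From'], msg['To'] = subject, login, rcpt
        session.sendmail(login, [rcpt], msg.as_string())
    session.quit()
```

A driver script would fetch the user-specified page, pass its source to harvest(), call archive(), and finally send_spam() with the subject and message taken as arguments.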
The design was partially influenced by the spambot described in this academic resource.
Note that regular expressions need not be able to unearth every RFC-compliant email address. Not only would it be unnecessary, since the vast majority of users stick to reasonable boundaries when naming the local part of their address, but matching every single compliant result would require a regex of this magnitude. The regex engine will likely go insane trying to parse that (although it seems Perl handles it well, which is no surprise there), resulting in a phenomenon known as catastrophic backtracking. The IETF are laughing their asses off as we speak.
2. Forum spambot overview
Below is the source code of a basic proof-of-concept form-filling forum spambot using a browser automation and web testing framework called Splinter. Splinter is an abstraction layer over the well-known Selenium, with the aim of making a crystal-clear, simple and powerful API for developers to conduct web tests in virtually no time. In fact, it is so abstract it could almost be considered pseudocode.
Yet for our purposes it is sufficient, as we want to demonstrate how a common spambot could operate. This particular script is specifically optimized for a vanilla PunBB installation, assuming no additional defense features.
It makes use of an external configuration file that maps HTML input form names so that they can be easily reused. The configuration file is merely an associative array (dictionary) on the basis of key-value pairs, without using special features such as Python’s native ConfigParser.
Splinter makes use of web drivers. I have used the Firefox driver, so every time the script is run, a new instance of Firefox opens and the scripted instructions play out. While this is impractical for a real-life spambot, I have chosen this approach so as to show the bot’s actions visually and in real time. In addition, most bots probably won’t follow such elaborate browsing patterns, but the benefit of doing so is emulating human user behavior, something smart bots will do.
It should be noted that Splinter and other testing frameworks also allow the use of headless drivers, meaning activity can go on without opening a visible browser instance. This goes to show how benign technology meant to ease the lives of people conducting software tests can also make spammers’ jobs much easier.
The bot itself works from a few hardcoded global variables (username, password, email, topic subject) and takes the message as an argument. It navigates through the PunBB forum, clicks the registration link, fills out the forms with help from the key-value pairs that map the input names, waits 60 seconds to imitate a human user taking their time, and submits. It then pauses on a raw_input() call so the spammer can validate the account (or simply continue), logs in with its credentials, directly visits the topic submission page for the forum with an ID of 1, posts the user-specified message that was taken as an argument, submits, destroys all cookies and terminates the process.
Without further ado, the bot:
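What follows is a compact sketch in that spirit, using Splinter's Firefox driver. The credentials, URLs and field-map keys are placeholders, and the Splinter calls reflect its Python 2-era API; nothing network-facing runs at import time.

```python
import ast
import time

# Hardcoded globals, as described above (placeholder values).
USERNAME, PASSWORD = 'spambot01', 'hunter2'
EMAIL, SUBJECT = 'spambot01@example.com', 'Great deals inside'

def load_field_map(text):
    """The configuration file is a plain Python dictionary literal mapping
    logical names to the HTML input names of the target forum."""
    return ast.literal_eval(text)

def run(forum_url, message, config_path='punbb_fields.cfg'):
    from splinter import Browser          # third-party; drives real Firefox

    with open(config_path) as fh:
        fields = load_field_map(fh.read())

    browser = Browser('firefox')
    browser.visit(forum_url)
    browser.click_link_by_text('Register')
    browser.fill(fields['username'], USERNAME)
    browser.fill(fields['password1'], PASSWORD)
    browser.fill(fields['password2'], PASSWORD)
    browser.fill(fields['email'], EMAIL)
    time.sleep(60)                        # imitate a human taking their time
    browser.find_by_name(fields['register_submit']).first.click()

    raw_input('Validate the account, then press Enter to continue...')

    browser.visit(forum_url + '/login.php')
    browser.fill(fields['login_username'], USERNAME)
    browser.fill(fields['login_password'], PASSWORD)
    browser.find_by_name(fields['login_submit']).first.click()

    browser.visit(forum_url + '/post.php?fid=1')   # topic page, forum ID 1
    browser.fill(fields['topic_subject'], SUBJECT)
    browser.fill(fields['topic_message'], message)
    browser.find_by_name(fields['post_submit']).first.click()

    browser.cookies.delete()              # destroy all cookies
    browser.quit()

# usage (Python 2.7): run('http://forum.example.com', 'spam message here')
```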
The configuration file:
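Since the configuration file is just a dictionary literal, a plausible field map for a vanilla PunBB installation might look like the following; the req_*-style input names are assumptions that must be verified against the target's actual HTML.

```python
{
    # logical name     -> HTML input name (assumed PunBB defaults; check
    #                     the page source of the target before use)
    'username':         'req_username',
    'password1':        'req_password1',
    'password2':        'req_password2',
    'email':            'req_email1',
    'register_submit':  'register',
    'login_username':   'req_username',
    'login_password':   'req_password',
    'login_submit':     'login',
    'topic_subject':    'req_subject',
    'topic_message':    'req_message',
    'post_submit':      'submit',
}
```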
(AND THEIR OWN RESPECTIVE COUNTERMEASURES)
1. Address munging and URL obfuscation
An often-employed countermeasure is one that has come to be known as “address munging”. This involves obfuscating an email address in such a way that a human user can clearly interpret how it is actually meant to be spelled out, while bots whose harvesting engines and rulesets cannot parse it will either mangle it or skip it completely.
There are countless ways to munge an email address. Consequently, sophisticated bots need large sets of coded patterns to be able to detect most.
Here is the hypothetical address “sendmespam@example.com” munged using several common techniques:
sendmespam (at) example.com
sendmespam (at) example (dot) com
s e n d m e s p a m @ e x a m p l e . c o m
sendzespaz@example.com (replace ‘z’ with ‘m’)
sendmespam&#64;example&#46;com (HTML character entity references for ‘@’ and ‘.’)
The main disadvantage of address munging lies in UX (user experience), particularly for users of screen readers and text-based web browsers. Some of the more “creative” munging that starts to cross into the realm of cryptography may even ultimately dissuade regular users from sending an email to the given address.
URL obfuscation is the even more paranoid measure of obscuring regular links so as to avoid them being followed or harvested by crawlers. Generally, this also tends to lure away legitimate agents, such as search engine crawlers.
The most common method is the iconic URI scheme hxxp://, with its variations _ttp:// and h**p://. Others include writing the host in octal or hexadecimal notation, or abusing HTTP basic authentication.
For example, a URL of the form http://www.example.com@evilzone.org will lead you to Evilzone, not example.com: everything before the ‘@’ is treated as basic-authentication userinfo rather than as the destination host. We can obscure any part of the URL with the aforementioned techniques to make it virtually unreadable to the average user. This was commonly used by phishers back in the day, but most modern browsers now detect it and give out a warning.
In the end, these two techniques are not considered to be of real use in fighting spam, as advanced spambot solutions have been trained to spot most common patterns (with the possible exception of client-side scripting) and fix them through text processing procedures, whereas end users will likely be annoyed and abandon their intentions.
Here are a few code snippets that reverse some basic URL obfuscation and address munging techniques, the general logic of which could be employed by spambots:
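The sketch below reverses a handful of the patterns shown earlier. The heuristics are intentionally naive (the word-boundary “at”/“dot” substitutions would misfire on ordinary prose), and real spamware would layer many more rules.

```python
import re
import socket
import struct

def demunge_address(text):
    """Undo common '(at)'/'(dot)' style munging. Heuristic: also rewrites
    bare ' at ' / ' dot ' tokens, which can misfire on normal sentences."""
    text = re.sub(r'\s*\(\s*at\s*\)\s*|\s+at\s+', '@', text, flags=re.I)
    text = re.sub(r'\s*\(\s*dot\s*\)\s*|\s+dot\s+', '.', text, flags=re.I)
    return text

def collapse_spacing(text):
    """Undo 'o n e - c h a r a c t e r - p e r - t o k e n' spacing."""
    parts = text.split()
    if parts and all(len(p) == 1 for p in parts):
        return ''.join(parts)
    return text

def deobfuscate_url(url):
    """Restore hxxp:// / h**p:// / _ttp:// scheme mangling."""
    return re.sub(r'^(hxxp|h\*\*p|_ttp)(s?)://', r'http\2://', url, flags=re.I)

def decode_numeric_host(host):
    """Convert a single-number host (decimal, 0x... hex or 0o... octal)
    back to dotted-quad form; returns the input unchanged otherwise."""
    try:
        return socket.inet_ntoa(struct.pack('!I', int(host, 0)))
    except (ValueError, struct.error):
        return host
```

Chaining these text-processing passes over harvested page text is exactly the kind of normalization step an advanced harvesting engine performs before applying its address regex.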