Malware can be camouflaged in plain English

Written by John P Mello Jr on December 3, 2009
Anatomy of automatically generated English encoding.

The fractured English in spam messages can be amusing, but in the future it could carry a malicious subtext.

That’s what a quartet of researchers demonstrated recently at the 16th ACM Conference on Computer and Communications Security held in Chicago.

The foursome–Joshua Mason and Sam Small from Johns Hopkins University, Fabian Monrose from the University of North Carolina and Greg MacManus from iSight Partners in Washington, D.C.–outlined in a paper presented at the conference how they created an engine that produces malware disguised as plain English text.

The researchers were able to transform arbitrary shellcode into a representation that is superficially similar to English prose.

        “The shellcode is completely self-contained–i.e., it does not require an external loader and executes as valid IA32 code–and can typically be generated in under an hour on commodity hardware,” they wrote.

Shellcode is a small piece of machine code that crackers inject into a vulnerable program to compromise a computer. A common way to get it there is to exploit a buffer overflow in the program.

Buffer overflows result when a program accepts more input than the memory it set aside can hold, making it behave in a way that’s unintended by its writers. For example, an application may ask for a password that’s limited to 10 characters. If the application doesn’t check the length, giving it 20 characters might cause a buffer overflow.

Using the overflow as a door into the program, the intruder often tries to gain control of the application’s program counter–the register that tells the processor which instruction to execute next. Instead of letting the program do what it’s supposed to do, the cracker redirects it to execute nefarious code that’s been planted on the system. In many cases, that code creates a command shell used by the miscreant to control the computer. Hence the term “shellcode.”
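The password example and the hijacked program counter can be imitated in a toy model. The sketch below is only a simulation in Python, not an exploit: the "stack frame" is a byte array holding a 10-byte buffer followed by a 4-byte saved return address, and the names and addresses (`make_frame`, `unsafe_copy`, `LEGIT_RETURN`, `0xDEADBEEF`) are made up for illustration.

```python
import struct

BUF_SIZE = 10               # the 10-character password buffer from the example
LEGIT_RETURN = 0x08048000   # hypothetical address the program should resume at


def make_frame():
    """Toy stack frame: the buffer, then the saved return address right after it."""
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<I", LEGIT_RETURN))


def unsafe_copy(frame, data):
    """Copy input without checking it against BUF_SIZE (the bug)."""
    n = min(len(data), len(frame))  # the toy frame is all the memory we model
    frame[:n] = data[:n]


def safe_copy(frame, data):
    """Same copy, but refuse input that does not fit in the buffer."""
    if len(data) > BUF_SIZE:
        raise ValueError("input longer than buffer")
    frame[:len(data)] = data


def saved_return(frame):
    """Read back the address the program would jump to next."""
    return struct.unpack("<I", frame[BUF_SIZE:BUF_SIZE + 4])[0]


frame = make_frame()
unsafe_copy(frame, b"secret")
print(hex(saved_return(frame)))   # still the legitimate address

frame = make_frame()
attack = b"A" * BUF_SIZE + struct.pack("<I", 0xDEADBEEF)
unsafe_copy(frame, attack)
print(hex(saved_return(frame)))   # now the attacker's chosen address
```

With the length check in `safe_copy`, the same oversized input is rejected before it can reach the saved address.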

When fighting shellcode attacks, White Hats use the tried-and-true technique of divide and conquer. They identify an attack’s essential and inalienable components and then develop detection and prevention techniques to target one or more of those components.

Up to now, security experts considered it impossible to completely hide the components of polymorphic shellcode, which changes its appearance from one attack to the next. A key component of any such scheme is the decoder. Since that component has a signature that distinguishes it from the benign traffic on a network, the thinking was that no matter what crackers did to disguise their shellcode, the decoder would always stick out like a sore thumb.

What the four researchers proved, however, was that they could create shellcode, including the decoder, that appears to be English text. That will allow it to blend in with the benign traffic on a network and make it very difficult, if not impossible, to detect.

         “Shellcode, like other compiled code, is simply an ordered list of machine instructions. At the lowest level of representation, each instruction is stored as a series of bytes signifying a pattern of signals that instruct the CPU to manipulate data as desired,” the researchers explained. “Like machine instructions, non-executable data is represented in byte form.”

          “Coincidentally,” they continued, “some character strings from the ASCII character set and native machine instructions have identical byte representations. Moreover, it is even possible to find examples of this phenomenon that parse as grammatically correct English sentences.”
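The overlap the researchers describe is easy to see for capital letters. In 32-bit IA32 code, the single bytes 0x40 through 0x5F are one-byte `inc`, `dec`, `push` and `pop` instructions on the eight general-purpose registers, and that same range holds the ASCII codes for `A` through `Z`. The sketch below covers only that small slice of the instruction set, and the function names (`as_instruction`, `read_as_code`) are hypothetical; it is not the researchers' tool.

```python
REGISTERS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
OPERATIONS = ["inc", "dec", "push", "pop"]  # opcodes start at 0x40, 0x48, 0x50, 0x58


def as_instruction(byte):
    """Return the one-byte IA32 instruction for 0x40-0x5F, else None."""
    if 0x40 <= byte <= 0x5F:
        return f"{OPERATIONS[(byte - 0x40) // 8]} {REGISTERS[byte % 8]}"
    return None  # outside the slice of the opcode map modeled here


def read_as_code(text):
    """Pair each character with the instruction its byte value encodes."""
    return [(ch, as_instruction(ord(ch))) for ch in text]


for ch, insn in read_as_code("PAX"):
    print(f"{ch!r} = 0x{ord(ch):02X} -> {insn}")
# 'P' = 0x50 -> push eax
# 'A' = 0x41 -> inc ecx
# 'X' = 0x58 -> pop eax
```

Lowercase letters, spaces and punctuation also map to instructions or instruction prefixes, but many of those take operands, so decoding them needs a real disassembler rather than a lookup like this one.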

To create their project, the researchers began by making a list of instructions that could be rendered in English. Then they fashioned a decoder, built only from those instructions, capable of decoding generic payloads. Finally, using text from Wikipedia articles and Project Gutenberg books, they discovered text strings that mimicked the executable behavior of the decoder.

Once the experimenters had their decoder, they were confronted with two problems. First, because the decoder must reside in memory as executable code, it is vulnerable to detection by malware-fighting software. Second, their English instruction set was limited. To solve both problems, they turned to a technique called self-modification.

          “Self-modification is often used to address this problem whereby permissible code modifies portions of the payload such that non-compliant instructions are patched in at runtime, thereby passing any input filters,” they explained. “These additional instructions provide an attacker with more versatility and may make an otherwise impotent attack quite powerful.”

The self-modifying decoder written by the team took the form: initialization, decoder, encoded payload.

          “Intuitively,” they explained, “the first component builds an initial decoder in memory (through self-modification) which when executed, expands the working instruction set, providing the decoder with IA32 operations beyond those provided by English prose. The decoder then decodes the next segment (the encoded payload), again via self-modification.”
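The idea of shipping a payload in a restricted alphabet and patching the real bytes in at runtime can be shown with a much simpler scheme than the researchers' English encoding. The sketch below is an assumption-laden stand-in: it spells each payload byte as two letters from `A` to `P` (one per 4-bit half), so the encoded form passes a letters-only filter, and then overwrites the same memory with the decoded bytes. In real shellcode the decoder is itself machine code that must pass the filter; here it is ordinary Python, and all names (`encode`, `passes_filter`, `decode_in_place`) are hypothetical.

```python
def encode(payload: bytes) -> bytes:
    """Spell each byte as two letters in 'A'..'P', one per 4-bit half."""
    out = bytearray()
    for b in payload:
        out.append(ord("A") + (b >> 4))
        out.append(ord("A") + (b & 0x0F))
    return bytes(out)


def passes_filter(data: bytes) -> bool:
    """Stand-in input filter: accept only capital ASCII letters."""
    return all(ord("A") <= b <= ord("Z") for b in data)


def decode_in_place(memory: bytearray, start: int, length: int) -> int:
    """Overwrite the encoded region with the original bytes; return their count."""
    encoded = bytes(memory[start:start + length])
    decoded = bytes(
        ((encoded[i] - ord("A")) << 4) | (encoded[i + 1] - ord("A"))
        for i in range(0, length, 2)
    )
    memory[start:start + len(decoded)] = decoded
    return len(decoded)


payload = bytes([0x00, 0x90, 0xCD, 0xFF])   # arbitrary bytes a text filter would reject
memory = bytearray(encode(payload))
print(bytes(memory), passes_filter(memory))  # b'AAJAMNPP' True

count = decode_in_place(memory, 0, len(memory))
print(bytes(memory[:count]) == payload)      # True
```

The trade-off is size: this toy encoding doubles the payload's length.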

Although the researchers proved that English shellcode was possible, one of them doubted that it would ever be seen in the wild.

          “I’d be astounded if anyone is using this method maliciously in the real world, due to the amount of engineering it took to pull off,” Johns Hopkins’ Mason told New Scientist.
