================================================================================
----------------------{ Architecture Spanning Shellcode }-----------------------
================================================================================
prole && harm                                                         06.17.2000
                          prole: prole@redgeek.net
                          harm:  harmful@xxx.xxxx

====-{ Introduction (prole's perspective)

So there I was, mux'ing between a talk session with bind and the whatever the 
hip channels were at the time, when bind said something that prompted me to say,
"yeah, that reminds me of one time when I got into a discussion about 
architecture-independent shellcode."   Of course, that was totally the wrong 
word for it - shellcode is bytecode, and bytecode is by it's very nature is
architecture dependent (traditionally speaking, ignore Crusoe, Java, etc., for 
this paper - wait for the future article on the impact meta-ISAs have on 
shellcode exploits.)

Anyway, bind asked if I was serious, and I recalled that the conversation was
along the theoretical (drunken) lines, but none of the participants including
me really attempted to follow up on it.  This time, however, we started 
bandying about ideas and general usefulness.  The first order of business was to
define exactly what we wanted: architecture-independent code seemed to be
completely out of reach in the software only realm with traditional chips, like
x86, SPARC, MIPS, PPC, or Alpha.  What we really wanted to do was find string
of bits that would run on more than one chip.  This would more accurately be
called "architecture-spanning" shellcode (hereafter called ASSC for brevity).
(Note: ASSC(2) implies shellcode that spans two architectures, ASSC(3) spans
three, etc.)

====-{ Uses

Why would you want ASSC?  After all, couldn't you just use siphon, nmap, or
queso to determine what OS and/or architecture you target machine is running,
and cut-n-paste you favorite shellcode into your exploit-to-go proggie?

Yeah, you could.  But what if you don't want all the lights and sirens to go
off on the other end with an active OS fingerprinter?  Or what if the target
machine is behind a tight stateful firewall that's redirecting only your
desired port to the target, and dropping all your funky fingerprinting packets?
Or, maybe you're not initiating the attack manually - maybe the attack is self 
propagating?  

Of course, this is all under the conventional idea that the target is a single 
machine - ASSC would be very useful against distributed targets, especially when
a successful 'compromise' entails elevating privileges on all of the target's 
machines within a certain time frame.  For example, some clustered systems today
will do content-management between heterogenous architectures, automatically 
[re]storing to/from a backup periodically.  It may be that a compromise of one 
or a few of the clusters would allow you to add/change/delete the content of 
some of those servers, but that change may be flushed after the next data-
consistency check.  Immediately obvious targets that fit this profile include 
distributed databases, webservers, and LDAP servers.   In this case, it would be
desirable to be able to compromise all machines relatively quickly, and be able 
to control the data-consistency mechanisms on all of the servers.  The ability 
to do this may not only increase your chances of achieving your goal, but 
would also decrease your chances of exposing your presence, as no alarms would 
fire saying "database servers #3 and #4 has a different grant tables than all 
the rest!"

====-{ Basic Concepts - ASSC(2)

Since it seemed highly unlikely we'd be lucky enough to find a string of bits
that translated into valid instructions that had similar logical results on two 
completely different chips, we decided the best thing to do was split the
shellcode into two portions: architecture-common, and architecture-specific.
All we had to do was to create a bitstring that comprised the architecture-
common code that would jump us to the correct architecture-specific code (hybrid
pseudo-code follows):

   <0xFFFFFbdcb>: if( arch == SPARC )
   <0xFFFFFbdc8>: jmp 0xEF
   <0xFFFFFbdc4>: else
   <0xFFFFFbdc0>: jmp 0x60
    ...<0x60 later>...
   <0xFFFFFbe20>: <x86 shellcode>
    ...<0x8b later>...
   <0xFFFFFbeb7>: <SPARC shellcode>

Those versed in asm will immediate notice we don't need the second jump, as
we will just fall through, and therefore only require to find a jmp in one
architecture instead of two.  The second thing to notice here is that there 
will be some SPARC instructions that are invalid x86 instructions, and vice 
versa, but that SIGILL (or equivalent) will only be generated at runtime, so as 
long as our architecture-specific sections are clean and we exit/return before 
we hit the fall-through block, we're safe.  We do need to take into account any 
branch delay slots, however:

   <0xFFFFFbdcb>: if( arch == SPARC )
   <0xFFFFFbdc8>: jmp 0xEF
   <0xFFFFFbdc4>: <SPARC and x86 NOP>
   <0xFFFFFbdc0>: <x86 shellcode>
    ...<0xEF later>...
   <0xFFFFFbeb7>: <SPARC shellcode>

Of course, we're still faced with the challenge of finding out both what 
architecture we're on.  This is when harm reminded me I'm an complete idiot.
We *don't need to know* what architecture we're running on, as we have one
string of bits that will be interpreted in *one* way by the chip, so code flow
will take use exactly where we want.  The challenge is really changed to this:
find a string of bits for the architecture-common code that will be interpreted
as NOPs or NOP-equivalents in one language, and as a jump or branch in the 
other.  For example:

                                      x86 interpretation   SPARC interpretation
   <0xFFFFFbdcb>: 0x1234FDCB          jmp  0xEF            NOP
   <0xFFFFFbdc4>: <SPARC shellcode>   <not seen>           SPARC exec("/bin/sh")
    ...<0xEF later>...                <not seen>           SPARC exit()/return
   <0xFFFFFbeb7>: <x86 shellcode>     x86 exec("/bin/sh")  <not seen>

Things began to seem possible, but we still needed to generate these strings,
and exhaustive search was out of the question - it would require too much
space and time with today 32- and 64- bit architectures, in addition to 
complications we discuss next.  Furthermore, an exhaustive search wouldn't 
provide a general enough framework, since there is a variable amount of space 
available for overflows depending on exploit, an just flipping bits affects the
interpretation on all architectures involved.

====-{ 6th Grade

Remember all those "compare and contrast" papers your sixth grade teacher made
you write about random articles and books?  That's exactly what we need to do
here.  In order to efficiently generate the desired strings, we have to know
exactly what similarities and differences exists between the architectures
we're trying to exploit.  The three primary differences differences we utilized
were the following:

   1) Endian-ness
   2) Alignment
   2) Instruction Length

The difficult part in all of the following sections is the interaction between
architectures - in one, you may need to flip a certain bit in one architecture,
but that invalidates an instruction in the other, or changes the meaning, or
jump landing ... etc.

We'll using x86 and SPARC throughout, as they illustrate these differences 
explicitly.  And the more variable the languages, the more likely you'll get a 
successful ASSC sample.  For example, both MIPS and SPARC have 32-bit 
instructions aligned on 32-bit boundaries, and are big-endian. Hence, we have
to rely on chance that they were influenced from a similar ancestor and have 
the right bits in the right places.  (Not as fun - it's more fun to take into 
account all the differences.)

-----{ Endian-ness

Since we've been using x86 and SPARC as examples so far, I'll continue to use
them.  Remember, while the bitstring
   0x12345678
may correspond to an valid instruction in SPARC, it may not in x86.  However,
x86 will interpret the bitstring as
   0x78563412
which may be a valid instruction.  The most useful aspect of characteristic is 
that most ISAs have their opcodes in the most significant bits of the string,
and user-definable data near the end.  This is to speed instruction decoding
within the chip itself.  So, while a SPARC addcc immediate format is
   10ddddd010000rrrrr1iiiiiiiiiiiii
      key:  d = destination register
            r = source register
            i = sign-extended immediate value
a valid x86 instruction to match this would actually match the string:
   iiiiiiiirr1iiiii10000rrr10ddddd0

Since the string of "i"s is an immediate value, we can assign it whatever value
we want, which is also the first byte of an x86 instruction - giving us a lot
of flexibility as to the instruction we wish to execute in x86 land.  If we can
jump to the x86-specific shellcode, we're almost done.  There are two caveats, 
however:
   i)  We now have a good portion of non user-definable data in the x86
       instruction at the end, where x86 will define things such as the width
       of the instruction, addressing mode, and register addresses.
   ii) With this particular format, we are storing data to a SPARC register,
       (ddddd) which may have side effects that are undesirable.  The addcc
       will set any of SPARC's condition code register (N,V,C,Z), which will
       affect SPARC branching.  Furthermore, we may be destroying data in
       the destination register that we previously put there and don't want
       munged.  If the SPARC interpretation is supposed to be a NOP-equivalent,
       it shouldn't have such side effects.

We  have to work around (i) if we want to use SPARC addcc, but (ii) is easily
fixed.  Use add immediate instead of addcc immediate, with the destination
register set to %g0, the SPARC sink register (although this aggravates (i)):
   SPARC add:    1000000000000rrrrr1iiiiiiiiiiiii
   byte-swapped: iiiiiiiirr1iiiii00000rrr10000000

Note:  hereafter, we will be calling NOPs and NOP-equivalents (any instruction
that doesn't affect future instructions or code flow, such as arithmetic to
a sink register) NULL instructions.

-----{ Alignment

Alignment and Instruction Length are very closely related.  If the two 
architectures have different alignment constraints, it gives you more 
flexibility in designing your bitstring.  For example, if arch #1 (A1) requires
16-bit alignment, and arch #2 (A2) requires 32-bit alignment, you can overlap
your NULL placements (remember that the last part of NULL-ops is probably very
flexible if source register specifiers or immediate data):

     Interpretations of the same bitstring for architectures A1 and A2

     0x00             0x02             0x04             0x08
     +----------------+----------------+----------------+----------------+
 A1: | NULL-op        | <branch instr> | <never seen>   | <never seen>   |
     +----------------+----------------+----------------+----------------+

     +---------------------------------+---------------------------------+
 A2: | NULL-op                         | <branch instr>                  |
     +---------------------------------+---------------------------------+

Alignment is most useful when combined with Instruction Length variances, where
the modulus of the lengths isn't equal to the length of the smaller 
instructions.

-----{ Instruction Length

If the two architecture have difference instruction lengths, as above, we can
really start to explore.  It's most useful if one or both architectures have
some variable length instructions, as they will certainly have fine-grained
alignment restrictions:

     Interpretations of the same bitstring for architectures A1 and A2

     0x00             0x02             0x04             0x08
     +----------------+----------------+----------------+----------------+
 A1: | NULL-op        | <branch instr>                  | <never seen> ...
     +----------------+----------------+----------------+----------------+

     +---------------------------------+---------------------------------+
 A2: | NULL-op                         | NULL-op                         |
     +---------------------------------+---------------------------------+

We can overlap the (typically) flexible data portions of each instruction
bitstring with the opcode portion of bitstring, allowing a greater chance at
finding a branch in one that' NULL operations in the other.

====-{ Composite and ASSC(N): N > 2

Take all three of the above, and you've really got a lot of flexibility to work
with.  One of the more difficult aspects, as noted above, is the jump offset
in one architecture rendering the companion NULL-op in the other invalid or
makes it produce side-effects.  This is particularly difficult when there is
limited space to work with (e.g., small buffers).  The shellcode can get very 
large, if you're not careful and/or lucky.

Generating ASSC(N): N > 2, requires a lot of patience and luck.  You need to
create a branch for architecture 1 that's a NULL-op in the  N-1 remaining 
architectures, and then a branch in architecture 2 that's a NULL-op the N-2 
remaining architectures, etc.  Furthermore, if the distance from the initial
jump is too small to contain the N-1 branch/NULL-ops and architecture-specific
shellcode for the remaining architectures, you'll need to go back to 
architecture 1, increase the jump length, and start again.

Another (complex) option is to relax the condition that the N > M architectures
must have NULL-ops when architecture M is branching (or where M is setting up 
for a branch by inserting NULL-ops or critical address calculations).  
Architectures N > M could be doing useful calculations, such as address 
calculations, or executing instructions that have side-effects, as long as those
side-effects can be guaranteed to be reset or corrected prior to architecture 
N's branch.  This requires quite a bit of backtracking when creating instruction
sequences, as each bit becomes more important - each bit has more 
"responsibility".  It is not guaranteed that there is a bitstring that can 
execute on N: N > 1 given architectures, a bit of luck is required.

====-{ assc-gen

As proof of concept, we've started work on 'assc-gen'.  Note that we won't be
releasing it at this time as it's ugly as hell and doesn't work that well yet.
Regardless, assc-gen works by selecting the languages to generate shellcode for, 
and iteratively building the bitstrings interactively.  For example, you could
start by choosing SPARC and x86, and begin with a SPARC NOP instruction.  It 
would then attempt to find valid x86 matches to the SPARC instruction, 
preferably jumps or branches.  You can continue to build the bitstring by either
switching to matching a SPARC NULL-op to a x86 instruction, or keep generating
SPARC instructions.

Instruction sets are defined by general instruction specifications, particular
to that set.  These are generated through a generalization of Karnaugh mapping.
From assc-gen.c

   /* 
      SPARC core instruction set
      r  = source 1 register address bit
      s  = source 2 register address bit
      d  = destination register address bit
      x  = variable bit (don't care)
      i  = immediate data bit
      p  = pointer (relative address) bit
      a  = annul next instruction (branches)
   */
   ...
   "add",          "10ddddd000000rrrrr0xxxxxxxxsssss",

Instructions can be collapsed, to reduce application running time, via the
Karnaugh mapping method mentioned above.  In short, Karnaugh mapping specifies 
that the logical operation on two bitstring can be simplified by collapsing
"OR states" into "don't care" states.  An example illustrates it the best:

   SPARC "add" and the "add immediate" variant
      add           = 10ddddd000000rrrrr0xxxxxxxxsssss
      add immediate = 10ddddd000000rrrrr1iiiiiiiiiiiii
         key:  d = destination register
               r = source register 1
               s = source register 2
               i = sign-extended immediate value
               x = don't care (ignored)

In SPARC land, "add %g0, %g0, %g0" is frequently used as a NOP, since register
%g0 is a global register that always contains the value 0 (i.e., it's a sink
register).  The bitstring for that would be:
      add %g0, %g0, %g0 = 1000000000000000000xxxxxxxx00000
We can generalize this, since we don't care what registers we read from - they
won't be modified, and will still be shoved into the sink.
      add %g0, %X,  %X  = 10ddddd000000xxxxx0xxxxxxxxxxxxx
The last 0 in that string indicates that bits 0-4 should be treated a register 
specifier.  If we change it to a 1, it indicated that the bits 0-12 should be
treated as a sign-extended 13-bit immediate value that should be added to
register (rrrrr).  Again, we don't care whether or not we're adding a register
or immediate, since we're discarding the results, so we have this:
      add %g0, %X,  %X    = 1000000000000xxxxx0xxxxxxxxxxxxx
      add %g0, %X, imm.   = 1000000000000rrrrr1iiiiiiiiiiiii
   => 
      add %g0, <anything> = 1000000000000xxxxxxxxxxxxxxxxxxx
In general, if two strings are exactly the same, but differ in one bit and both
strings are valid (e.g., this OR that), that bit can be specified as "don't 
care," bit signified by an "x".  Repeat until no more bit's can be specified
as "don't care" bits, and you have your minimal set of cardinal bitstrings.

Since we're dealing with more than just "0", "1", and "x", however, we have to
generalize this and decide when the OR condition is met.  We don't have a 
strict logical constraint on pseudo-variable values, as they have interactions
beyond just the bit they represent. For example, it's obvious that:
   
   +----+---+---+---+   
   | OR | 0 | 1 | x |   (If you're not familiar with truth tables, the 
   +----+---+---+---+    intersection of a row and column is the result
   |  0 | 0 | x | x |    of the the OR of the two values in that row and
   +----+---+---+---+    column.)
   |  1 | x | 1 | x |
   +----+---+---+---+
   |  x | x | x | x |
   +----+---+---+---+

But what about:

   +----+---+---+---+---+---+---+
   | OR | 0 | 1 | x | w | d | i | w = <generic asm> width specifier
   +----+---+---+---+---+---+---+
   |  0 | 0 | x | x | ? | ? | ? | d = <generic asm> displacement bit specifier
   +----+---+---+---+---+---+---+
   |  1 | x | 1 | x | ? | ? | ? | i = <generic asm> value bit
   +----+---+---+---+---+---+---+
   |  x | x | x | x | ? | ? | ? |
   +----+---+---+---+---+---+---+
   |  w | ? | ? | ? | w | ? | ? |
   +----+---+---+---+---+---+---+
   |  d | ? | ? | ? | ? | d | ? |
   +----+---+---+---+---+---+---+
   |  i | ? | ? | ? | ? | ? | i |
   +----+---+---+---+---+---+---+

   If we have the following two bitstrings:
      1010101iiiiiiiii
      1010101wdddddddd
   can we collapse it to this?:
      1010101xxxxxxxxx
   It depends on the meaning provided by the ISA, especially since w, d, and i 
   may very well affect bits in positions other than their own.  This needs to
   be coded into both assc-gen's architecture specification and comparison 
   functions when a new ISA is added.

Regardless, we're currently adding support for x86 and core SPARC, enabled us 
to find this (binary listing in little-endian format:

                      ;  x86                  SPARC
   0000000110101000   ;  test $0x1, %al      / 
   0111111111101011   ;  jmp 0x7f            \ add %g7, 0xb7f, %l4

     ...
   <room for 31 SPARC instructions (127 bytes, 4 bytes/instr)>
   <to adjust offset, change the jump offset - SPARC w/ still be nop>
     /* Shellcode */
     /* 0xSC - SPARC asm byte */
     /* 0xXC - x86 asm byte   */
     /* 0xPP - padding        */
                                   x86
     0xa8, 0x01, 0xeb, 0x7f,  --- jumps -+
     0xSC, 0xSC, 0xSC, 0xSC,       to    |
   ...<29 SPARC instr. cut>...           |
     0xSC, 0xSC, 0xSC, 0xSC,             |
     0xPP, 0xPP, 0xPP, 0xXC   <---here --+

====-{ Contact

   We're always open to feedback, comments, and questions - especially if you
   found this interesting. :)

   prole: prole@redgeek.net
          http://www.redgeek.net/~prole
   harm:  harm@xxxxxxx.xxx
          http://xxx.xxx.xxx/