================================================================================ ----------------------{ Architecture Spanning Shellcode }----------------------- ================================================================================ prole && harm 06.17.2000 prole: prole@redgeek.net harm: harmful@xxx.xxxx ====-{ Introduction (prole's perspective) So there I was, mux'ing between a talk session with bind and the whatever the hip channels were at the time, when bind said something that prompted me to say, "yeah, that reminds me of one time when I got into a discussion about architecture-independent shellcode." Of course, that was totally the wrong word for it - shellcode is bytecode, and bytecode is by it's very nature is architecture dependent (traditionally speaking, ignore Crusoe, Java, etc., for this paper - wait for the future article on the impact meta-ISAs have on shellcode exploits.) Anyway, bind asked if I was serious, and I recalled that the conversation was along the theoretical (drunken) lines, but none of the participants including me really attempted to follow up on it. This time, however, we started bandying about ideas and general usefulness. The first order of business was to define exactly what we wanted: architecture-independent code seemed to be completely out of reach in the software only realm with traditional chips, like x86, SPARC, MIPS, PPC, or Alpha. What we really wanted to do was find string of bits that would run on more than one chip. This would more accurately be called "architecture-spanning" shellcode (hereafter called ASSC for brevity). (Note: ASSC(2) implies shellcode that spans two architectures, ASSC(3) spans three, etc.) ====-{ Uses Why would you want ASSC? After all, couldn't you just use siphon, nmap, or queso to determine what OS and/or architecture you target machine is running, and cut-n-paste you favorite shellcode into your exploit-to-go proggie? Yeah, you could. But what if you don't want all the lights and sirens to go off on the other end with an active OS fingerprinter? Or what if the target machine is behind a tight stateful firewall that's redirecting only your desired port to the target, and dropping all your funky fingerprinting packets? Or, maybe you're not initiating the attack manually - maybe the attack is self propagating? Of course, this is all under the conventional idea that the target is a single machine - ASSC would be very useful against distributed targets, especially when a successful 'compromise' entails elevating privileges on all of the target's machines within a certain time frame. For example, some clustered systems today will do content-management between heterogenous architectures, automatically [re]storing to/from a backup periodically. It may be that a compromise of one or a few of the clusters would allow you to add/change/delete the content of some of those servers, but that change may be flushed after the next data- consistency check. Immediately obvious targets that fit this profile include distributed databases, webservers, and LDAP servers. In this case, it would be desirable to be able to compromise all machines relatively quickly, and be able to control the data-consistency mechanisms on all of the servers. The ability to do this may not only increase your chances of achieving your goal, but would also decrease your chances of exposing your presence, as no alarms would fire saying "database servers #3 and #4 has a different grant tables than all the rest!" ====-{ Basic Concepts - ASSC(2) Since it seemed highly unlikely we'd be lucky enough to find a string of bits that translated into valid instructions that had similar logical results on two completely different chips, we decided the best thing to do was split the shellcode into two portions: architecture-common, and architecture-specific. All we had to do was to create a bitstring that comprised the architecture- common code that would jump us to the correct architecture-specific code (hybrid pseudo-code follows): <0xFFFFFbdcb>: if( arch == SPARC ) <0xFFFFFbdc8>: jmp 0xEF <0xFFFFFbdc4>: else <0xFFFFFbdc0>: jmp 0x60 ...<0x60 later>... <0xFFFFFbe20>: ...<0x8b later>... <0xFFFFFbeb7>: Those versed in asm will immediate notice we don't need the second jump, as we will just fall through, and therefore only require to find a jmp in one architecture instead of two. The second thing to notice here is that there will be some SPARC instructions that are invalid x86 instructions, and vice versa, but that SIGILL (or equivalent) will only be generated at runtime, so as long as our architecture-specific sections are clean and we exit/return before we hit the fall-through block, we're safe. We do need to take into account any branch delay slots, however: <0xFFFFFbdcb>: if( arch == SPARC ) <0xFFFFFbdc8>: jmp 0xEF <0xFFFFFbdc4>: <0xFFFFFbdc0>: ...<0xEF later>... <0xFFFFFbeb7>: Of course, we're still faced with the challenge of finding out both what architecture we're on. This is when harm reminded me I'm an complete idiot. We *don't need to know* what architecture we're running on, as we have one string of bits that will be interpreted in *one* way by the chip, so code flow will take use exactly where we want. The challenge is really changed to this: find a string of bits for the architecture-common code that will be interpreted as NOPs or NOP-equivalents in one language, and as a jump or branch in the other. For example: x86 interpretation SPARC interpretation <0xFFFFFbdcb>: 0x1234FDCB jmp 0xEF NOP <0xFFFFFbdc4>: SPARC exec("/bin/sh") ...<0xEF later>... SPARC exit()/return <0xFFFFFbeb7>: x86 exec("/bin/sh") Things began to seem possible, but we still needed to generate these strings, and exhaustive search was out of the question - it would require too much space and time with today 32- and 64- bit architectures, in addition to complications we discuss next. Furthermore, an exhaustive search wouldn't provide a general enough framework, since there is a variable amount of space available for overflows depending on exploit, an just flipping bits affects the interpretation on all architectures involved. ====-{ 6th Grade Remember all those "compare and contrast" papers your sixth grade teacher made you write about random articles and books? That's exactly what we need to do here. In order to efficiently generate the desired strings, we have to know exactly what similarities and differences exists between the architectures we're trying to exploit. The three primary differences differences we utilized were the following: 1) Endian-ness 2) Alignment 2) Instruction Length The difficult part in all of the following sections is the interaction between architectures - in one, you may need to flip a certain bit in one architecture, but that invalidates an instruction in the other, or changes the meaning, or jump landing ... etc. We'll using x86 and SPARC throughout, as they illustrate these differences explicitly. And the more variable the languages, the more likely you'll get a successful ASSC sample. For example, both MIPS and SPARC have 32-bit instructions aligned on 32-bit boundaries, and are big-endian. Hence, we have to rely on chance that they were influenced from a similar ancestor and have the right bits in the right places. (Not as fun - it's more fun to take into account all the differences.) -----{ Endian-ness Since we've been using x86 and SPARC as examples so far, I'll continue to use them. Remember, while the bitstring 0x12345678 may correspond to an valid instruction in SPARC, it may not in x86. However, x86 will interpret the bitstring as 0x78563412 which may be a valid instruction. The most useful aspect of characteristic is that most ISAs have their opcodes in the most significant bits of the string, and user-definable data near the end. This is to speed instruction decoding within the chip itself. So, while a SPARC addcc immediate format is 10ddddd010000rrrrr1iiiiiiiiiiiii key: d = destination register r = source register i = sign-extended immediate value a valid x86 instruction to match this would actually match the string: iiiiiiiirr1iiiii10000rrr10ddddd0 Since the string of "i"s is an immediate value, we can assign it whatever value we want, which is also the first byte of an x86 instruction - giving us a lot of flexibility as to the instruction we wish to execute in x86 land. If we can jump to the x86-specific shellcode, we're almost done. There are two caveats, however: i) We now have a good portion of non user-definable data in the x86 instruction at the end, where x86 will define things such as the width of the instruction, addressing mode, and register addresses. ii) With this particular format, we are storing data to a SPARC register, (ddddd) which may have side effects that are undesirable. The addcc will set any of SPARC's condition code register (N,V,C,Z), which will affect SPARC branching. Furthermore, we may be destroying data in the destination register that we previously put there and don't want munged. If the SPARC interpretation is supposed to be a NOP-equivalent, it shouldn't have such side effects. We have to work around (i) if we want to use SPARC addcc, but (ii) is easily fixed. Use add immediate instead of addcc immediate, with the destination register set to %g0, the SPARC sink register (although this aggravates (i)): SPARC add: 1000000000000rrrrr1iiiiiiiiiiiii byte-swapped: iiiiiiiirr1iiiii00000rrr10000000 Note: hereafter, we will be calling NOPs and NOP-equivalents (any instruction that doesn't affect future instructions or code flow, such as arithmetic to a sink register) NULL instructions. -----{ Alignment Alignment and Instruction Length are very closely related. If the two architectures have different alignment constraints, it gives you more flexibility in designing your bitstring. For example, if arch #1 (A1) requires 16-bit alignment, and arch #2 (A2) requires 32-bit alignment, you can overlap your NULL placements (remember that the last part of NULL-ops is probably very flexible if source register specifiers or immediate data): Interpretations of the same bitstring for architectures A1 and A2 0x00 0x02 0x04 0x08 +----------------+----------------+----------------+----------------+ A1: | NULL-op | | | | +----------------+----------------+----------------+----------------+ +---------------------------------+---------------------------------+ A2: | NULL-op | | +---------------------------------+---------------------------------+ Alignment is most useful when combined with Instruction Length variances, where the modulus of the lengths isn't equal to the length of the smaller instructions. -----{ Instruction Length If the two architecture have difference instruction lengths, as above, we can really start to explore. It's most useful if one or both architectures have some variable length instructions, as they will certainly have fine-grained alignment restrictions: Interpretations of the same bitstring for architectures A1 and A2 0x00 0x02 0x04 0x08 +----------------+----------------+----------------+----------------+ A1: | NULL-op | | ... +----------------+----------------+----------------+----------------+ +---------------------------------+---------------------------------+ A2: | NULL-op | NULL-op | +---------------------------------+---------------------------------+ We can overlap the (typically) flexible data portions of each instruction bitstring with the opcode portion of bitstring, allowing a greater chance at finding a branch in one that' NULL operations in the other. ====-{ Composite and ASSC(N): N > 2 Take all three of the above, and you've really got a lot of flexibility to work with. One of the more difficult aspects, as noted above, is the jump offset in one architecture rendering the companion NULL-op in the other invalid or makes it produce side-effects. This is particularly difficult when there is limited space to work with (e.g., small buffers). The shellcode can get very large, if you're not careful and/or lucky. Generating ASSC(N): N > 2, requires a lot of patience and luck. You need to create a branch for architecture 1 that's a NULL-op in the N-1 remaining architectures, and then a branch in architecture 2 that's a NULL-op the N-2 remaining architectures, etc. Furthermore, if the distance from the initial jump is too small to contain the N-1 branch/NULL-ops and architecture-specific shellcode for the remaining architectures, you'll need to go back to architecture 1, increase the jump length, and start again. Another (complex) option is to relax the condition that the N > M architectures must have NULL-ops when architecture M is branching (or where M is setting up for a branch by inserting NULL-ops or critical address calculations). Architectures N > M could be doing useful calculations, such as address calculations, or executing instructions that have side-effects, as long as those side-effects can be guaranteed to be reset or corrected prior to architecture N's branch. This requires quite a bit of backtracking when creating instruction sequences, as each bit becomes more important - each bit has more "responsibility". It is not guaranteed that there is a bitstring that can execute on N: N > 1 given architectures, a bit of luck is required. ====-{ assc-gen As proof of concept, we've started work on 'assc-gen'. Note that we won't be releasing it at this time as it's ugly as hell and doesn't work that well yet. Regardless, assc-gen works by selecting the languages to generate shellcode for, and iteratively building the bitstrings interactively. For example, you could start by choosing SPARC and x86, and begin with a SPARC NOP instruction. It would then attempt to find valid x86 matches to the SPARC instruction, preferably jumps or branches. You can continue to build the bitstring by either switching to matching a SPARC NULL-op to a x86 instruction, or keep generating SPARC instructions. Instruction sets are defined by general instruction specifications, particular to that set. These are generated through a generalization of Karnaugh mapping. From assc-gen.c /* SPARC core instruction set r = source 1 register address bit s = source 2 register address bit d = destination register address bit x = variable bit (don't care) i = immediate data bit p = pointer (relative address) bit a = annul next instruction (branches) */ ... "add", "10ddddd000000rrrrr0xxxxxxxxsssss", Instructions can be collapsed, to reduce application running time, via the Karnaugh mapping method mentioned above. In short, Karnaugh mapping specifies that the logical operation on two bitstring can be simplified by collapsing "OR states" into "don't care" states. An example illustrates it the best: SPARC "add" and the "add immediate" variant add = 10ddddd000000rrrrr0xxxxxxxxsssss add immediate = 10ddddd000000rrrrr1iiiiiiiiiiiii key: d = destination register r = source register 1 s = source register 2 i = sign-extended immediate value x = don't care (ignored) In SPARC land, "add %g0, %g0, %g0" is frequently used as a NOP, since register %g0 is a global register that always contains the value 0 (i.e., it's a sink register). The bitstring for that would be: add %g0, %g0, %g0 = 1000000000000000000xxxxxxxx00000 We can generalize this, since we don't care what registers we read from - they won't be modified, and will still be shoved into the sink. add %g0, %X, %X = 10ddddd000000xxxxx0xxxxxxxxxxxxx The last 0 in that string indicates that bits 0-4 should be treated a register specifier. If we change it to a 1, it indicated that the bits 0-12 should be treated as a sign-extended 13-bit immediate value that should be added to register (rrrrr). Again, we don't care whether or not we're adding a register or immediate, since we're discarding the results, so we have this: add %g0, %X, %X = 1000000000000xxxxx0xxxxxxxxxxxxx add %g0, %X, imm. = 1000000000000rrrrr1iiiiiiiiiiiii => add %g0, = 1000000000000xxxxxxxxxxxxxxxxxxx In general, if two strings are exactly the same, but differ in one bit and both strings are valid (e.g., this OR that), that bit can be specified as "don't care," bit signified by an "x". Repeat until no more bit's can be specified as "don't care" bits, and you have your minimal set of cardinal bitstrings. Since we're dealing with more than just "0", "1", and "x", however, we have to generalize this and decide when the OR condition is met. We don't have a strict logical constraint on pseudo-variable values, as they have interactions beyond just the bit they represent. For example, it's obvious that: +----+---+---+---+ | OR | 0 | 1 | x | (If you're not familiar with truth tables, the +----+---+---+---+ intersection of a row and column is the result | 0 | 0 | x | x | of the the OR of the two values in that row and +----+---+---+---+ column.) | 1 | x | 1 | x | +----+---+---+---+ | x | x | x | x | +----+---+---+---+ But what about: +----+---+---+---+---+---+---+ | OR | 0 | 1 | x | w | d | i | w = width specifier +----+---+---+---+---+---+---+ | 0 | 0 | x | x | ? | ? | ? | d = displacement bit specifier +----+---+---+---+---+---+---+ | 1 | x | 1 | x | ? | ? | ? | i = value bit +----+---+---+---+---+---+---+ | x | x | x | x | ? | ? | ? | +----+---+---+---+---+---+---+ | w | ? | ? | ? | w | ? | ? | +----+---+---+---+---+---+---+ | d | ? | ? | ? | ? | d | ? | +----+---+---+---+---+---+---+ | i | ? | ? | ? | ? | ? | i | +----+---+---+---+---+---+---+ If we have the following two bitstrings: 1010101iiiiiiiii 1010101wdddddddd can we collapse it to this?: 1010101xxxxxxxxx It depends on the meaning provided by the ISA, especially since w, d, and i may very well affect bits in positions other than their own. This needs to be coded into both assc-gen's architecture specification and comparison functions when a new ISA is added. Regardless, we're currently adding support for x86 and core SPARC, enabled us to find this (binary listing in little-endian format: ; x86 SPARC 0000000110101000 ; test $0x1, %al / 0111111111101011 ; jmp 0x7f \ add %g7, 0xb7f, %l4 ... /* Shellcode */ /* 0xSC - SPARC asm byte */ /* 0xXC - x86 asm byte */ /* 0xPP - padding */ x86 0xa8, 0x01, 0xeb, 0x7f, --- jumps -+ 0xSC, 0xSC, 0xSC, 0xSC, to | ...<29 SPARC instr. cut>... | 0xSC, 0xSC, 0xSC, 0xSC, | 0xPP, 0xPP, 0xPP, 0xXC <---here --+ ====-{ Contact We're always open to feedback, comments, and questions - especially if you found this interesting. :) prole: prole@redgeek.net http://www.redgeek.net/~prole harm: harm@xxxxxxx.xxx http://xxx.xxx.xxx/