Updated: July 19th, 2001


This paper can also be downloaded as a single PDF file, which is more suitable for offline viewing.

We recommend using the PDF version, which is the primary one.

by The Last Stage of Delirium Research Group http://lsd-pl.net

Version: 1.0.2

Updated: July 4th, 2001

Introduction

This technical document describes the specifics of writing assembly components for proof of concept codes on different operating systems and architectures. Specifically, it focuses on commercial UNIX systems: IRIX/MIPS, HP-UX/PA-RISC, AIX/PowerPC/POWER and Solaris/x86/SPARC. It is meant neither as a complete guide to the aforementioned computer architectures nor as an assembly language tutorial. It was written as a side effect of our security research into the development of proof of concept codes that illustrate security vulnerabilities. It is intended for code developers specializing in (or looking to gain experience with) buffer overflow and format string vulnerabilities; however, it is limited to the assembly parts of such codes. For information regarding general proof of concept code development, please refer to other papers.

This paper is divided into several inter-related parts. It begins with some basic information about various processor architectures and their important characteristics. Next, the system call invocation mechanisms, which are crucial for the later parts, are discussed in detail in the context of different operating systems. This is followed by an introduction to coding requirements, such as writing position independent and zero free assembly codes. Finally, several assembly routines are discussed in detail, with special emphasis on their functionality. In the appendices of this paper you will also find the source code of every routine for all discussed operating systems and architectures, along with sample code showing their usage (all source code from this paper can also be downloaded as a single tar file from our website).

Because of our ongoing research in this area, this document will be updated in the future to cover other processor architectures and operating systems. Always refer to the most recent version, which can be downloaded directly from our website: lsd-pl.net.

Feel free to send any questions or comments by email to: contact@lsd-pl.net

Table of Contents

Introduction
1 Processor architectures
1.1 IRIX and MIPS basics
1.2 Solaris and SPARC basics
1.3 HP-UX and PA-RISC basics
1.4 AIX and POWER/PowerPC basics
1.5 Ultrix and ALPHA basics
1.6 Solaris/Linux/SCO{OpenServer,Unixware}/{Free,Net,Open}BSD/BeOS and x86 basics
2 System call interface invocation
2.1 IRIX/MIPS
2.2 Solaris/SPARC
2.3 HP-UX/PA-RISC
2.4 AIX/POWER/PowerPC
2.5 Ultrix/ALPHA
2.6 Solaris/SCO{OpenServer,Unixware}/x86
2.7 {Free,Net,Open}BSD/x86
2.8 Linux/x86
2.9 BeOS/x86
3 Code specifics
3.1 Short code length
3.2 Position independence
3.2.1 IRIX/MIPS
3.2.2 Solaris/SPARC
3.2.3 HP-UX/PA-RISC
3.2.4 AIX/POWER/PowerPC
3.2.5 Ultrix/ALPHA
3.2.6 Solaris/SCO{OpenServer,Unixware}/Linux/{Free,Net,Open}BSD/BeOS/x86
3.3 "Zero free" code
3.3.1 IRIX/MIPS
3.3.2 HP-UX/PA-RISC
3.3.3 AIX/POWER/PowerPC
3.3.4 Ultrix/ALPHA
3.3.5 Solaris/SCO{OpenServer,Unixware}/x86
4 Assembly codes functionality
4.1 Shell execution (shellcode)
4.2 Single command execution (cmdshellcode)
4.3 Privileges restoration (set{uid,euid,reuid,resuid}code)
4.4 Chroot limited environment escape (chrootcode)
4.5 Find socket code (findsckcode)
4.6 Network server code (bindsckcode)
4.7 Stack pointer retrieval (jump)
4.8 No-operation instruction (nop)

Processor architectures

Modern operating systems run atop two main processor architectures: CISC and RISC. The CISC architecture, which stands for Complex Instruction Set Computer, represents microprocessor families with extremely complex instruction sets that are implemented with a microcode mechanism (the Intel x86 family of microprocessors, from the 8086, 80286 and 80386 through Pentium to Pentium IV, is a good example of CISC CPUs). A typical CISC microprocessor implements instructions that differ in format, encoding length and execution time. The complexity of the CPU's architecture is mainly due to the need to support high-level languages and operating systems.

From the assembly code developer's point of view, what matters most is the instruction set, register set and addressing modes that a given CPU offers. The developer is usually not interested in the CPU's design and its internal features, as they seem to influence only overall chip performance. For example, in the case of Intel x86 microprocessors no attention must be paid to data and code memory alignment: 32/16/8 bit memory chunks can be accessed freely and the instruction stream can start at any valid memory address.

However, there are also some features of the Intel x86 CPUs that must usually be considered when writing proof of concept codes for buffer overflow/format string vulnerabilities. These are CPU caches and pipelines. Caches become problematic on processors equipped with separately maintained data and instruction caches. The problems usually occur due to abnormal jumps to code residing in program data space (code that already resides in data memory, but not necessarily in the CPU code cache). Although we have only encountered cache problems on Solaris 2.5 running atop an Intel 80486, such cache incoherence should always be taken into consideration. CPU pipeline problems occur very rarely and only when explicit changes are made to the instruction stream that is just about to be executed. In such cases, the changed instructions are not taken into account, as they are usually already residing in the pipeline (they have been decoded by the CPU). This imposes some constraints on writing self-modifying code for x86, for example the code used to construct the long call instruction (used as a system call gate).

The Reduced Instruction Set Computer (RISC) architecture is a concept that emerged in recent years as a result of statistical analysis of the way in which software generated by optimizing compilers actually uses the processor instruction set. It turned out that the simplest instructions were used most often, even in code compiled for CISC machines. Thus, RISC microprocessors have been designed with simplicity in mind. They do not make use of microcode. Instead, instructions are executed by chip logic that is implemented as a kind of finite automaton with a few general states reflecting the instruction decode, argument load, instruction execution and argument store phases. This, along with a uniform instruction format, equal instruction length (usually 32 bits) and equal execution times, provides a perfect ground for pipelined instruction execution. This results in more or less parallel instruction execution and in most cases (R10000, SPARC, UltraSPARC, PowerPC 6xx etc.) leads to a modern superscalar architecture with multiple pipelines, separate execution units, advanced branch prediction and out-of-order execution mechanisms.
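
The benefit of pipelined execution can be illustrated with simple arithmetic. The sketch below is only an idealized model: it assumes that every stage takes exactly one clock cycle and that there are no stalls, branches or cache misses. The function names are ours and do not come from any CPU documentation.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Once the pipeline is filled, one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    # Four stages: instruction decode, argument load, execution, argument store.
    for count in (1, 10, 100):
        print(count, cycles_unpipelined(count, 4), cycles_pipelined(count, 4))
```

Under these assumptions, 100 instructions on a 4-stage machine take 400 cycles without pipelining and 103 cycles with it. Real processors fall short of this ideal figure.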

The number of general purpose registers available on RISC microprocessors usually exceeds that of CISC microprocessors. Register specialization is also less significant: any given general purpose register can usually be used in place of another. RISC CPUs usually have only a few classes of instructions, dedicated to register-register, branch, memory access and special CPU control operations. Although the instruction set usually seems very simple, it is in fact very functional. It is also worth mentioning that RISC microprocessors are load-store machines: their instruction set is mostly focused on register-register operations and the only instructions that access memory directly are the "load to register" and "store from register" ones. As RISC microprocessors usually have no dedicated instructions for stack-like operations, the notion of a stack is also different. It is just an ordinary memory area addressed by one of the general purpose registers (designated as the stack pointer), where the classic push and pop stack operations are implemented with explicit loads and stores.
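
To make the load-store view of the stack concrete, here is a minimal sketch of a toy machine in which push and pop are nothing more than a stack pointer adjustment combined with an ordinary store or load. The class, its register names and the 4-byte big endian word size are assumptions made purely for illustration; they do not model any specific CPU.

```python
class LoadStoreMachine:
    """Toy model: memory is a byte array, registers live in a dictionary."""

    def __init__(self, mem_size: int = 64, word: int = 4) -> None:
        self.mem = bytearray(mem_size)
        self.word = word
        # The stack pointer is an ordinary register; the stack grows downwards.
        self.regs = {"sp": mem_size, "t0": 0, "t1": 0}

    def store_word(self, reg: str, base: str, offset: int) -> None:
        addr = self.regs[base] + offset
        self.mem[addr:addr + self.word] = self.regs[reg].to_bytes(self.word, "big")

    def load_word(self, reg: str, base: str, offset: int) -> None:
        addr = self.regs[base] + offset
        self.regs[reg] = int.from_bytes(self.mem[addr:addr + self.word], "big")

    def add_imm(self, reg: str, value: int) -> None:
        self.regs[reg] += value

    def push(self, reg: str) -> None:
        # "push" = decrement the stack pointer, then store through it.
        self.add_imm("sp", -self.word)
        self.store_word(reg, "sp", 0)

    def pop(self, reg: str) -> None:
        # "pop" = load through the stack pointer, then increment it.
        self.load_word(reg, "sp", 0)
        self.add_imm("sp", self.word)


if __name__ == "__main__":
    m = LoadStoreMachine()
    m.regs["t0"] = 0x11223344
    m.push("t0")
    m.pop("t1")
    print(hex(m.regs["t1"]), m.regs["sp"])
```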

A RISC microprocessor is usually equipped with internal data and instruction caches, but that depends solely on the given CPU model. As has already been mentioned, data and instruction caches are maintained separately, therefore some incoherencies must usually be dealt with. This is the main reason why special care must usually be taken while writing proof of concept codes for buffer overflow and format string vulnerabilities. This is especially important in the case of the MIPS and PowerPC/POWER architectures, where appropriate techniques must usually be applied in order to avoid the illegal instruction exception (resulting from abnormal jumps to code residing in program data space).

In the following sections of this chapter we provide brief characteristics of several computer architectures for which assembly codes are discussed further in this document.

IRIX and MIPS basics

Microprocessors from the MIPS R4000-R12000 line are all designed in RISC architecture. A MIPS CPU contains 32 general purpose 64-bit wide registers along with a set of 32 bit, ANSI/IEEE-754 standard compliant floating point registers. Special purpose registers (coprocessor and debug registers), most of which are accessible only from within the CPU supervisor mode, allow setting up the microprocessor operating environment (its mode, memory management, interrupts etc.).

A MIPS processor can operate in little or big endian mode, and the latter is used by default in the IRIX operating system. The 32/64 bit mode of CPU operation can also be configured by appropriately setting the value of one of the special purpose registers. The proper choice is usually made by the operating system itself and depends on the MIPS ABI (Application Binary Interface) of the ELF binary that is to be executed (O32, N32, 64).
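
The difference between the two byte orders can be shown with a short sketch. It uses only the standard Python struct module; the helper name word_in_memory and the sample value are ours and serve only as an illustration.

```python
import struct


def word_in_memory(value: int, big_endian: bool) -> bytes:
    """Return the bytes of a 32 bit word in the order they would occupy memory."""
    return struct.pack(">I" if big_endian else "<I", value)


if __name__ == "__main__":
    value = 0x12345678
    # Big endian (the IRIX default): most significant byte at the lowest address.
    print(word_in_memory(value, big_endian=True).hex())
    # Little endian: least significant byte at the lowest address.
    print(word_in_memory(value, big_endian=False).hex())
```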

The MIPS instruction set can be divided into the following classes of instructions:

- load and store instructions,
- computational instructions,
- jump and branch instructions,
- coprocessor instructions,
- special instructions.

CPU instructions have a uniform length of 32 bits and three major instruction formats can be distinguished (for immediate, branch and register operations respectively). MIPS microprocessors can be equipped with internal data and instruction caches, but this also depends on the given CPU model. Because the MIPS instruction stream is pipelined, a maximum of 8 instructions (R4000) can be executed concurrently. Due to the pipeline mechanism an attempt to execute the branch delay slot instruction is always made, but its results are canceled if the jump from the preceding branch instruction is taken. The CPU execution unit throws a bus error exception whenever an attempt is made to pass execution to an instruction that is not aligned on a word boundary. The same exception is thrown whenever 16/32/64 bit wide memory data portions are accessed in a not appropriately aligned manner.
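
Because every instruction is a single 32 bit word, its fields can be extracted with shifts and masks. The sketch below assumes the standard MIPS field layout (a 6 bit opcode in the top bits, followed by 5 bit register fields and either a 16 bit immediate, a 26 bit target or the shamt/funct pair). The helper names are ours, and the sample word is the usual encoding of addiu sp,sp,-32.

```python
def decode_fields(word: int) -> dict:
    """Split a 32 bit MIPS instruction word into its possible fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # present in all formats
        "rs": (word >> 21) & 0x1F,       # register and immediate formats
        "rt": (word >> 16) & 0x1F,       # register and immediate formats
        "rd": (word >> 11) & 0x1F,       # register format only
        "shamt": (word >> 6) & 0x1F,     # register format only
        "funct": word & 0x3F,            # register format only
        "imm16": word & 0xFFFF,          # immediate format only
        "target26": word & 0x03FFFFFF,   # jump format only
    }


def sign_extend16(value: int) -> int:
    """Interpret a 16 bit field as a signed quantity."""
    return value - 0x10000 if value & 0x8000 else value


def is_word_aligned(address: int) -> bool:
    """Instructions must start on a 4 byte boundary."""
    return address % 4 == 0


if __name__ == "__main__":
    fields = decode_fields(0x27BDFFE0)  # addiu sp,sp,-32
    print(fields["opcode"], fields["rs"], fields["rt"], sign_extend16(fields["imm16"]))
```

Which of the returned fields are meaningful depends on the format selected by the opcode, so a real disassembler would dispatch on the opcode first.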

There are three MIPS ABIs available across different IRIX operating environments:

- O32 - programs are executed in a 32 bit environment - processor mode is set up to 32, all registers and memory pointers are 32 bits wide,
- N32 - programs are executed in a semi-64 bit environment - processor mode is set up to 64, registers and memory pointers are 64 bits wide, although they reflect 32 bit entities (this is the 64 bit environment for 32 bit programs),
- 64 - programs are executed in a 64 bit environment - processor mode is set up to 64, all registers and memory pointers are 64 bits wide.

All MIPS ABIs share a common definition of general registers and their specialization:

r0 (zero) always contains the value of 0,
r29 (sp) stack pointer (stack grows downwards),
r31 (ra) contains the return address from subroutine (this explains why the jalr ra,reg instruction is always used to pass execution to subroutines),
r28 (gp) global pointer, used for accessing program global data (regardless of the MIPS ABI, the lw reg,-offset(gp) instruction is always used for that purpose),
r4-r7 (a0-a3) first 4 arguments (integers or pointers) to subroutine/system calls,
r8-r15 (t0-t7) temporary registers, not saved across subroutine calls,
r16-r23 (s0-s7) temporary registers, saved across subroutine calls,
r2 (v0) upon the system call entry it contains the system call number, upon its exit it holds return value from syscall.

Solaris and SPARC basics

Microprocessors from the SPARC (Scalable Processor ARChitecture) line are also all designed in RISC architecture. The V8 (Sparc, SuperSparc) family consists of 32 bit models while the newer V9 (Ultra Sparc I, II, III) family is made up of 64 bit models. SPARC microprocessors designed according to the V9 specification are fully superscalar microprocessors, which can operate in big or little endian mode. They can execute up to 3 instructions in parallel with the support of a 9-stage pipeline mechanism. Due to the unique register windows mechanism, SPARC microprocessors can be equipped with a large set of general purpose registers, whose number can vary from 64 to 528 (internal registers). By default, only a basic set of 32 general purpose registers, divided into 4 register subsets, can be accessed directly:

- global registers - g0-g7 (r0-r7),
- output registers - o0-o7 (r8-r15),
- local registers - l0-l7 (r16-r23),
- input registers - i0-i7 (r24-r31).
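
The correspondence between the windowed register names and the flat r0-r31 numbering listed above is purely arithmetic. A small sketch (the function name sparc_regnum is ours, introduced only for illustration):

```python
# First r-number of each register subset, as given in the list above.
SUBSET_BASE = {"g": 0, "o": 8, "l": 16, "i": 24}


def sparc_regnum(name: str) -> int:
    """Translate a name such as 'o7' into its r-number (here 15)."""
    subset, index = name[0], int(name[1:])
    if subset not in SUBSET_BASE or not 0 <= index <= 7:
        raise ValueError("unknown SPARC register: " + name)
    return SUBSET_BASE[subset] + index


if __name__ == "__main__":
    for reg in ("g0", "o0", "o7", "l0", "i6", "i7"):
        print(reg, "= r%d" % sparc_regnum(reg))
```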

SPARC microprocessors implement a dedicated subroutine call mechanism with the use of the call and ret instructions. The call instruction stores the current value of the pc register (program counter) in o7 (return address) and passes program execution to the address given as the instruction operand. The ret instruction resumes program execution at the address denoted by the value of register o7 increased by 8 (the length of a call instruction and its delay slot). On a Solaris/SPARC system, the stack grows towards lower addresses. Its notion is similar to that on other operating systems. The only noticeable difference is the additional area that is always reserved on the stack for saving general registers.

In order to allocate space on the stack, an appropriate save instruction is always executed at the beginning of each subroutine. Upon return from a subroutine call, the ret/restore instruction sequence is usually executed. The ret instruction transfers program flow to the address denoted by the sum of the constant 8 and the value of register o7. The restore instruction is executed in the delay slot of the ret instruction and its purpose is to restore the previous values of registers from the stack.
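
The effect of save and restore on register naming can be sketched with a toy model of overlapping register windows. This is a deliberate simplification and not a description of real hardware: it assumes 8 windows, ignores window overflow/underflow traps (the spilling of registers to the stack mentioned above) and omits the stack pointer arithmetic performed by save. All class and method names are hypothetical.

```python
class RegisterWindows:
    """Toy model: the 'out' registers of a caller are the 'in' registers of its callee."""

    def __init__(self, nwindows: int = 8) -> None:
        self.nwindows = nwindows
        self.depth = 0  # index of the current window
        self.globals = [0] * 8
        self.locals = [[0] * 8 for _ in range(nwindows)]
        self.outs = [[0] * 8 for _ in range(nwindows)]

    def _bank(self, subset: str) -> list:
        if subset == "g":
            return self.globals
        if subset == "l":
            return self.locals[self.depth]
        if subset == "o":
            return self.outs[self.depth]
        if subset == "i":
            # The ins of the current window alias the outs of the previous one.
            return self.outs[(self.depth - 1) % self.nwindows]
        raise ValueError("unknown register subset: " + subset)

    def read(self, name: str) -> int:
        return self._bank(name[0])[int(name[1:])]

    def write(self, name: str, value: int) -> None:
        if name == "g0":
            return  # g0 is hardwired to zero
        self._bank(name[0])[int(name[1:])] = value

    def save(self) -> None:
        self.depth = (self.depth + 1) % self.nwindows

    def restore(self) -> None:
        self.depth = (self.depth - 1) % self.nwindows


if __name__ == "__main__":
    w = RegisterWindows()
    w.write("o0", 5)        # caller prepares an argument
    w.write("o7", 0x1000)   # call stores the return address in o7
    w.save()                # callee prologue
    print(w.read("i0"), hex(w.read("i7")))
    w.restore()             # callee epilogue
```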

Note:
The way in which the stack is handled on a Solaris/SPARC system has a great influence on the exploitability of its buffer overrun vulnerabilities. This class of errors cannot always be exploited on Solaris/SPARC, as there must exist at least one level of subroutine call nesting, so that two consecutive ret/restore sequences are executed by a vulnerable program after its stack gets overrun (the first ret/restore loads user supplied data from the stack, the second one makes a return to the address taken from the overrun stack).

According to the ABI specification, general registers and their specialization are defined as follows:

g0 (r0) zero,
o7 (r15) return address (stored by a call instruction),
o0-o5 (r8-r13) input arguments to the next subroutine to be called (after execution of the save instruction they will be in registers i0-i5),
o6 (r14) stack pointer (after save o6->i6),
i6 (r30) frame pointer,
pc program counter,
npc next instruction.

HP-UX and PA-RISC basics

Microprocessors from the PA-RISC 7200-8400 line are similarly all designed in RISC architecture. The 7xxx family consists of 32 bit models while the newer 8xxx family is made up of 64 bit models. There are 32 general purpose registers (32 or 64 bits wide depending on the model) and 32 floating point registers (64 bits wide) available on a PA-RISC CPU. Additionally, there are also 8 segment registers (the so-called space registers), 32 control registers and 7 shadow registers. The latter are used for holding the values of some of the general purpose registers during interrupt processing.

A PA-RISC CPU can also operate in big or little endian mode. The choice is made by setting the E-bit in the processor status word register (PSW), which is done by the operating system.

The PA-RISC instruction set can be divided into the following 6 classes of instructions:

- memory reference instructions,
- branch instructions,
- long immediate instructions,
- computational instructions,
- system control instructions,
- assist instructions.

The CPU instructions have a uniform length of 32 bits. They always consist of a 6-bit opcode field identifying the instruction itself and an accompanying number of instruction parameter fields (source/target registers, memory addresses etc.). When compared to other RISC CPU families, the PA-RISC instruction set is rather big and complex. It contains a large set of two-in-one instructions, of which the conditional adds and subs are just one example. The PA-RISC instruction stream is pipelined, so a maximum of 8 instructions can be executed concurrently. The pipeline mechanism implies the use of delay slot filling instructions. Because PA-RISC lacks a dedicated call/ret mechanism, subroutine calls must be implemented with the use of inter-segment jump calls.

There are 3 different ABIs used across HP-UX operating system environments:

- s800 (32-bit),
- parisc 1.1 (32-bit),
- parisc 2.0 (32/64-bit).

On HP-UX 10.20 the s800/PA-RISC 1.1 ABI specification is used. On HP-UX 11.x the PA-RISC 1.1/2.0 specification is more common. Under all of the ABI specifications the processor is operating in a big-endian mode.

The PA-RISC ABI specification contains the definition of general registers and their specialization:

gr0 zero value register - always contains the zero value,
gr2 (rp) return pointer register - contains the return address from subroutine (the subroutine return call is usually done with the use of the bv,n r0(rp) instruction - the inter-segment jump call through the rp register),
gr19 shared library linkage register (used for Data Linkage Table),
gr23-gr26 (arg3-arg0) argument registers (they contain first 4 integer/pointer arguments to subroutine/system calls),
gr27 (dp) data pointer (it usually holds a pointer to global program data from the $private_space data segment).

AIX and POWER/PowerPC basics

The PowerPC architecture defines a software model for microprocessor implementations. It is derived from the IBM POWER architecture (Performance Optimized with Enhanced RISC architecture) and is optimized for single chip implementations. There are many similarities between the two architectures: they have almost fully compatible register and instruction sets as well as similar programming models. Although the instruction encoding is the same for POWER and PowerPC, each of them introduces a different instruction mnemonic notation (throughout this document we will use the POWER mnemonic notation).

There are two PowerPC architecture definitions, one for 32 bit and one for 64 bit microprocessor implementations. The PowerPC line of 6xx family microprocessors (601, 603, 603e and 604) represents 32 bit implementations of the PowerPC architecture, whereas the 620 model is a 64 bit one. All PowerPC 6xx models are designed in RISC architecture. PowerPC microprocessors can operate in either little or big endian mode, depending on the value of the LE bit of the MSR special register. The proper choice is usually made by the operating system and for AIX 4.x, the big endian operation mode is the default.

PowerPC family microprocessors have 32 general purpose registers, which are 32 or 64 bits wide depending on the architecture that the processor actually implements. There are also 32 IEEE-754 standard compliant floating point registers, each 64 bits wide, and some special registers, like LR, CTR, XER and CR.

PowerPC instructions have a uniform length of 32 bits and there are almost 12 instruction formats, which reflect 5 primary classes of instructions:

- branch instructions,
- fixed-point instructions,
- floating-point instructions,
- load and store instructions,
- processor control instructions.

For a RISC microprocessor, PowerPC has rather complex addressing modes (immediate, register indirect, register indirect with index) and some specialized instructions (integer rotate and shift instructions, integer load and store string/multiple instructions). On PowerPC microprocessors usually no attention must be paid to memory alignment: 64/32/16/8 bit memory chunks can be accessed freely, as unaligned memory addresses do not raise exceptions and only influence the performance of code execution. There are however some subtle exceptions to this rule concerning floating point instructions and integer load/store multiple instructions, which require aligned operands.

The POWER/PowerPC architecture specification does not define the pipeline model, but all of the PowerPC 6xx microprocessors have a pipelined instruction stream (usually with 4 pipeline stages). Self-modifying code can be implemented by issuing a proper sequence of cache synchronizing instructions (dcbst, sync, icbi, sync, isync), but that usually works only for systems that do not implement unified L2 caches.

With regard to the linkage model, both 32 and 64 bit PowerPC architectures define subroutine linkage convention and general purpose registers specialization as follows:

r0 it is used in function prologs, as an operand of some instructions it can indicate the value of zero,
r1 (stkp) stack pointer,
r2 (toc) table of contents (toc) pointer - denotes the program execution context - points to the program's global data and is used whenever program global symbols are accessed and during dynamic linking of external symbols; for system calls it contains the syscall number,
r3-r10 (arg0-arg7) first 8 arguments to function/system calls,
r11 used in calls by pointer and as an environment pointer for some languages,
r12 it is used in exception handling and in glink (dynamic linker) code.

The linkage convention regarding the usage of special registers is presented below:

lr (link) it is used as a branch target address or holds a subroutine return address,
ctr it is used as a loop count or as a target of some branch calls,
xer fixed-point exception register - indicates overflows or carries for integer operations,
fpscr floating-point exception register,
msr machine status register, used for configuring microprocessor settings,
cr condition register, it is divided into eight 4 bit fields, cr0-cr7, that reflect the results of certain arithmetic operations and provide a mechanism for conditional branching.
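
The division of the condition register into eight 4 bit fields can be illustrated with a short sketch. It assumes the usual PowerPC convention in which cr0 occupies the most significant 4 bits of the 32 bit register and the bits of a field set by an integer comparison mean, from most to least significant, "less than", "greater than", "equal" and "summary overflow". The function names are ours.

```python
def cr_field(cr: int, n: int) -> int:
    """Extract the 4 bit field crN (n = 0..7) from a 32 bit cr value."""
    if not 0 <= n <= 7:
        raise ValueError("field index must be in the range 0..7")
    return (cr >> (28 - 4 * n)) & 0xF


def describe_field(field: int) -> dict:
    """Name the bits of a field as set by an integer compare instruction."""
    return {
        "lt": bool(field & 0x8),
        "gt": bool(field & 0x4),
        "eq": bool(field & 0x2),
        "so": bool(field & 0x1),
    }


if __name__ == "__main__":
    cr = 0x20000008  # cr0 has its "equal" bit set, cr7 its "less than" bit
    print(describe_field(cr_field(cr, 0)))
    print(describe_field(cr_field(cr, 7)))
```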

The iar register denotes the next instruction to be executed by the processor (it reflects the current value of a program counter). According to the linkage convention, stack pointer, toc and registers r13 to r31 must be preserved across subroutine calls.

The AIX dynamic linking mechanism is a bit different from what is known from other operating systems. In AIX, all external references are dynamically resolved during program execution and are handled by the dynamic linker (glink) prolog and epilog routines. AIX also uses a different concept for a global symbol pointer. A pointer to an external symbol datum is in fact a reference to a 2-pointer structure consisting of:

- the toc pointer of the module containing the datum object (it specifies the code execution context),
- the pointer to the datum object itself.

Ultrix and ALPHA basics

Microprocessors from the ALPHA family are fully 64-bit models. They are all designed in RISC architecture (load/store RISC) with special emphasis on a short clock cycle, parallel instruction execution and multiprocessor support. ALPHA microprocessors can operate in both big and little endian modes. The latter is used by default in the Ultrix operating system environment.

The ALPHA instruction set can be divided into the following classes of instructions:

- PALcode instructions,
- conditional branch instructions,
- load and store instructions,
- operate instructions.

Although the processor's instructions have a uniform length of 32 bits, it accesses memory with the use of 64 bit wide addresses. The instruction encoding always consists of a 6-bit opcode field identifying the instruction itself and an accompanying number of instruction parameter fields (registers, immediate values etc.).

For a RISC microprocessor, ALPHA's instruction set is rather simple and very limited. Among its characteristic features, the lack of a delay slot mechanism (neither load nor branch delay slots) seems worth mentioning.

ALPHA processors do not support stack operations, nor do they have any dedicated call/ret mechanism (subroutine jumps are usually implemented with the use of a jsr instruction). Contrary to other RISC microprocessors, ALPHA is equipped with instructions that allow it to access unaligned memory data portions (16, 32, 64 bits). As in the case of other RISC processors, there are 32 general purpose registers and 32 floating point, IEEE-754 standard compliant registers.

General registers and their specialization are defined as follows:

r0 (v0) upon the system call entry it contains the system call number, upon its exit it holds return value from syscall,
r1-r8,r22-r25 (t0-t11) temporary registers, not saved across subroutine calls,
r9-r15 (s0-s6) temporary registers, saved across subroutine calls,
r16-r21 (a0-a5) first 6 arguments (integers or pointers) to subroutine/system calls,
r26 (ra) contains the return address from subroutine,
r28 (at) reserved by the assembler,
r29 (gp) global pointer, used for accessing program global data,
r30 (sp) stack pointer (stack grows downwards),
r31 (zero) always contains the value of 0.

Solaris/Linux/SCO{OpenServer,Unixware}/{Free,Net,Open}BSD/BeOS and x86 basics

Operating systems that run atop the x86 architecture naturally share all common features of the underlying microprocessor architecture. Due to the variety of the x86 family microprocessors, we will only focus on one of its models, the Intel 80386. This does not limit our discussion of the x86 architecture, as the 80386 is the base model of all 32 bit x86 family microprocessors. In particular, its operation with regard to protected mode, implemented instruction set and available register set is the same for all of its successors up to the Pentium IV microprocessor models.

The Intel 80386 is a 32 bit microprocessor which can only operate in little endian mode. Under Solaris, Linux, *BSD and BeOS it is set to operate in protected mode, to which we will also limit our further discussion. Contrary to all previous architectures, the 80386 processor is designed in CISC architecture, thus its instruction set is rather large and complex when compared to RISC architectures. 80386 instructions are divided into three major groups: integer, floating-point (for 80386DX and above), and system instructions. The first group is the largest one and it contains many specialized instructions for data transfer, binary and decimal arithmetic, string, flags and program control flow operations. Apart from that, there are also dedicated I/O and stack operation instructions as well as interrupt processing ones.

On Intel microprocessors no attention must be paid to memory alignment: 32/16/8 bit memory chunks can be accessed freely in an aligned or unaligned manner. This also applies to the instruction stream, which can start at any valid memory location. The microprocessor stack supports the classical push and pop operations and it grows towards lower addresses.

On Intel 80386 microprocessor, memory operands can be specified through an address computation made up of one or more of the following components:

- displacement: an 8-, 16-, or 32-bit value,
- base: the value in a general-purpose register,
- index: the value in a general-purpose register,
- scale factor: a value of 2, 4, or 8 that is multiplied by the index value.
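
The address computation built from these components can be written down as a one-line formula: base + index * scale + displacement, truncated to 32 bits. The sketch below is only an illustration; the function name is ours, and a scale of 1 is accepted to represent an unscaled index.

```python
def effective_address(base: int = 0, index: int = 0, scale: int = 1,
                      displacement: int = 0) -> int:
    """Compute base + index * scale + displacement modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


if __name__ == "__main__":
    # Fourth element (index 3) of an array of 4 byte items that starts
    # 8 bytes past the address held in the base register.
    print(hex(effective_address(base=0x1000, index=3, scale=4, displacement=8)))
```

In assembly notation this corresponds to an operand such as 8(%ebx,%esi,4) in AT&T syntax or [ebx+esi*4+8] in Intel syntax, with ebx holding the base and esi the index.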

The processor provides 16 registers for use in general system and application programming. These registers can be grouped as follows:

- eax, ebx, ecx, edx, esi, edi, ebp, esp general-purpose data registers - 32 bits wide registers that are available for storing instruction operands and memory pointers,
- cs, ds, ss, es, fs, and gs segment registers - 16 bits wide registers that hold up to six segment selectors (special pointers that identify a segment in memory),
- eflags status and control register - it reports and allows modification of the state of the processor and of the program being executed,
- eip instruction pointer register - it holds a 32 bit pointer to the next instruction to be executed.

In most cases, any general purpose register can be used as an instruction operand or in memory address calculation. There are however several instructions that require specific registers as operands. Register specialization with regard to these instructions is presented below:

eax accumulator for operands and results data,
ebx pointer to data in the ds segment,
ecx counter for string and loop operations,
edx I/O pointer,
esi pointer to data in the segment pointed to by the ds register; source pointer for string operations,
edi pointer to data in the segment pointed to by the es register; destination pointer for string operations,
ebp pointer to data on the stack (in the ss segment),
esp stack pointer.

When considering the linkage convention used on x86 based operating systems, the following rules are usually applied:

- eax register usually contains the result of a subroutine/system call,
- ebp register contains the value of a current frame's pointer,
- arguments to subroutine/system calls are passed to them through general purpose registers or stack.

System call interface invocation

Proper functionality of every assembly code discussed in this document is obtained by invoking the underlying operating system services. These services are implemented in the kernel code and are available to user programs through the system call interface. Because the operating system kernel is privileged code, it usually operates at a level that is not accessible to common user applications. In most cases the kernel/user space code separation is implemented with some help from hardware. Modern microprocessors support the idea of different modes of operation, separate for user applications and the operating system itself. These are the supervisor/user modes in RISC microprocessors and the protected layered modes (rings) of the x86 CISC microprocessor.

While executing user applications the microprocessor runs in the least privileged mode, which naturally protects the operating system and other users' applications from any external interference. The operating system, as privileged code, is executed in supervisor mode and therefore can fully control the microprocessor operation, including interrupts, memory management and task execution. The only way a user application can call operating system services is through a system call instruction. Different computer architectures have different system call instructions, but they all operate in a similar way: upon their execution the microprocessor switches its operating mode from user to supervisor and passes execution to the appropriate kernel system call handling routine. Upon its completion, execution is returned to the user process at the instruction following the system call invocation instruction (not always, see the AIX discussion). Simultaneously, the microprocessor mode of operation is switched back to the one used for user space applications.

Below we provide detailed information on the mechanism of system call interface invocation used on every computer architecture discussed throughout this document. In every case, all syscalls used in the codes contained further in this document are presented in table form. Please note that for the clarity of this notation several simplifications have been made.
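
The general sequence described above (mode switch, dispatch by system call number, return to user mode) can be modelled with a toy simulation. Everything in this sketch is hypothetical: the class, the service numbers 1 and 2, the returned uid and the -1 error value are made up for illustration and do not correspond to any real operating system.

```python
class ToyCpu:
    """Toy model of a system call: switch mode, dispatch, switch back."""

    def __init__(self) -> None:
        self.mode = "user"
        self.modes_seen = []  # mode observed at the moment each handler is chosen
        # Hypothetical system call table: number -> kernel handler.
        self.handlers = {1: self._sys_getuid, 2: self._sys_add}

    def _sys_getuid(self) -> int:
        return 1000  # made-up user id

    def _sys_add(self, a: int, b: int) -> int:
        return a + b  # stand-in for a service that takes arguments

    def syscall(self, number: int, *args: int) -> int:
        self.mode = "supervisor"           # entering the kernel
        self.modes_seen.append(self.mode)
        try:
            handler = self.handlers.get(number)
            if handler is None:
                return -1                  # unknown service number
            return handler(*args)
        finally:
            self.mode = "user"             # back to the least privileged mode


if __name__ == "__main__":
    cpu = ToyCpu()
    print(cpu.syscall(1), cpu.syscall(2, 3, 4), cpu.syscall(99), cpu.mode)
```

On real hardware the number and the arguments are passed in registers or on the stack, as the following sections describe for each platform.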

IRIX/MIPS

On IRIX/MIPS the syscall special instruction is used for calling the operating system services. The v0 register denotes the system call number and registers a0-a3 are appropriately filled with a given system call arguments.

The table below contains detailed information about the system call services (their numbers and parameters) we use in our IRIX/MIPS assembly codes presented further in this document.

syscall %v0 %a0, %a1, %a2, %a3
execv x3f3 ->path="/bin/sh",->[->a0=path,0]
execv x3f3 ->path="/bin/sh",->[->a0=path,->a1="-c",->a2=cmd,0]
getuid x400
setreuid x464 ruid,euid=0
mkdir x438 ->path="a..",mode= (each value is valid)
chroot x425 ->path="a..","."
chdir x3f4 ->path=".."
getpeername x445 sfd,->sadr=[],->[len=605028752]
socket x453 AF_INET=2,SOCK_STREAM=2,prot=0
bind x442 sfd,->sadr=[0x30,2,hi,lo,0,0,0,0],len=0x10
listen x448 sfd,backlog=5
accept x441 sfd,0,0
close x3ee fd=0,1,2
dup x411 sfd

Solaris/SPARC

On Solaris/SPARC the ta 8 trap instruction is used for calling the operating system services. The g1 register denotes the system call number and registers o0-o4 are appropriately filled with a given system call arguments.

The table below contains detailed information about the system call services (their numbers and parameters) we use in our Solaris/SPARC assembly codes presented further in this document.

syscall %g1 %o0, %o1, %o2, %o3, %o4
exec x00b ->path="/bin/ksh",->[->a0=path,0]
exec x00b ->path="/bin/ksh",->[->a0=path,->a1="-c",->a2=cmd,0]
setuid x017 uid=0
mkdir x050 ->path="b..",mode= (each value is valid)
chroot x03d ->path="b..","."
chdir x00c ->path=".."
ioctl x036 sfd,TI_GETPEERNAME=0x5491,->[mlen=0x54,len=0x54,->sadr=[]]
so_socket x0e6 AF_INET=2,SOCK_STREAM=2,prot=0,devpath=0,SOV_DEFAULT=1
bind x0e8 sfd,->sadr=[0x33,2,hi,lo,0,0,0,0],len=0x10,SOV_SOCKSTREAM=2
listen x0e9 sfd,backlog=5,vers= (not required in this syscall)
accept x0ea sfd,0,0,vers= (not required in this syscall)
fcntl x03e sfd,F_DUP2FD=0x09,fd=0,1,2

HP-UX/PA-RISC

On HP-UX the inter-segment jump call instruction is used for calling the operating system services:

        ldil     L'-0x40000000,%r1
        be,l     4(%sr7,%r1)

The r22 register denotes the system call number and registers r26-r23 are appropriately filled with a given system call arguments. The inter-segment jump is made through register sr7 which reflects shared memory area in which kernel code resides.

The table below contains detailed information about the system call services (their numbers and parameters) we use in our HP-UX/PA-RISC assembly codes presented further in this document.

syscall %r22 %r26,%r25,%r24,%r23
execv x00b ->path="/bin/sh",0
execv x00b ->path="/bin/sh",->[->a0=path,->a1="-c",->a2=cmd,0]
setresuid x07e 0,0,0
mkdir x088 ->path="a..",mode= (each value is valid)
chroot x03d ->path="a..","."
chdir x00c ->path=".."
getpeername x116 sfd,->sadr=[],->[0x10]
socket x122 AF_INET=2,SOCK_STREAM=1,prot=0
bind x114 sfd,->sadr=[0x61,2,hi,lo,0,0,0,0],len=0x10
listen x119 sfd,backlog=5
accept x113 sfd,0,0
dup2 x05a sfd,fd=0,1,2

AIX/POWER/PowerPC

On AIX the svca (sc in a mnemonic notation of PowerPC) instruction is used whenever the operating system services are to be called. The r2 register denotes the system call number and registers r3-r10 are appropriately filled with a given system call arguments. There are two additional prerequisites that must be fulfilled before executing the system call instruction: the LR register must be filled with the return from syscall address value and the crorc cr6, cr6, cr6 instruction must be issued just before the system call.

Because different AIX 4.x versions use different system call numbers for the same service, we use a syscall number lookup table inside our assembly routines, selected according to the given operating system version. The table below contains detailed information about the system call services (their numbers and parameters) we use in our AIX/POWER/PowerPC assembly codes presented further in this document.

syscall %r2(v4.1) %r2(v4.2) %r2(v4.3) %r3, %r4, %r5
execve x003 x002 x004 ->path="/bin/sh",->[->a0=path,0],0
execve x003 x002 x004 ->path="/bin/sh",->[->a0=path,->a1="-c",->a2=cmd,0],0
seteuid x068 x071 x082 euid=0
mkdir x07f x08e x0a0 ->path="t..",mode= (each value is valid)
chroot x06f x078 x089 ->path="t..","."
chdir x06d x076 x087 ->path=".."
getpeername x041 x046 x053 sfd,->sadr=[],->[len=0x2c]
socket x057 x05b x069 AF_INET=2,SOCK_STREAM=1,prot=0
bind x056 x05a x068 sfd,->sadr=[0x2c,0x02,hi,lo,0,0,0,0],len=0x10
listen x055 x059 x067 sfd,backlog=5
accept x053 x058 x065 sfd,0,0
close x05e x062 x071 fd=0,1,2
kfcntl x0d6 x0e7 x0fc sfd,F_DUPFD=0,fd=0,1,2

Ultrix/ALPHA

On ULTRIX/ALPHA the call_pal special instruction is used for calling the operating system services. The v0 register denotes the system call number and registers a0-a5 are appropriately filled with a given system call arguments.

The table below gives detailed information about the system call services (their numbers and parameters) we use in our Ultrix/Alpha assembly codes presented further in this document.

syscall %v0 %a0, %a1
execve x03b ->path="/bin/sh",->[->a0=path,0]
execve x03b ->path="/bin/sh",->[->a0=path,->a1="-c",->a2=cmd,0]
setreuid x07e ruid,euid=0

Solaris/SCO{OpenServer,Unixware}/x86

On Solaris/x86 and SCO{OpenServer,Unixware}/x86 the lcall $0x7,$0x0 instruction (far call through system call call gate selector) is used for calling the operating system services. The eax register denotes the system call number and system call arguments are passed to the appropriate service routine through stack (they are pushed on it in reverse order - the first system call argument is pushed as the last value).

As a prerequisite to the system call invocation, one additional value must be pushed on the stack just before issuing the lcall instruction - the dummy library return address, whose value is unimportant to the call itself.

The table below contains detailed information about the system call services (their numbers and the stack order of their parameters) we use in our assembly codes presented further in this document.

syscall %eax stack
exec x00b ret,->path="/bin/ksh",->[->a0=path,0]
exec x00b ret,->path="/bin/ksh",->[->a0=path,->a1="-c",->a2=cmd,0]
setuid x017 ret,uid=0
mkdir x050 ret,->path="b..",mode= (each value is valid)
chroot x03d ret,->path="b..","."
chdir x00c ret,->path=".."
ioctl x036 ret,sfd,TI_GETPEERNAME=0x5491,->[mlen=0x91,len=0x91,->sadr=[]]
#ifdef SOLARIS
so_socket x0e6 ret,AF_INET=2,SOCK_STREAM=2,prot=0,devpath=0,SOV_DEFAULT=1
bind x0e8 ret,sfd,->sadr=[0xff,2,hi,lo,0,0,0,0],len=0x10,SOV_SOCKSTREAM=2
listen x0e9 ret,sfd,backlog=5,vers= (not required in this syscall)
accept x0ea ret,sfd,0,0,vers= (not required in this syscall)
fcntl x03e ret,sfd,F_DUP2FD=0x09,fd=0,1,2
#endif
#ifdef SCO
close x006 ret,fd=0,1,2
dup x029 ret,sfd
#endif

{Free,Net,Open}BSD/x86

*BSD/x86 uses exactly the same mechanism for invoking the operating system services as Solaris/x86. Additionally, system services can be invoked with the use of the int 0x80 software interrupt instruction.

The table below contains detailed information about the system call services (their numbers, parameters and their stack order) we use in our *BSD/x86 assembly codes presented further in this document.

syscall %eax stack
execve x03b ret,->path="/bin//sh",->[->a0=0],0
execve x03b ret,->path="/bin//sh",->[->a0=path,->a1="-c",->a2=cmd,0],0
setuid x017 ret,uid=0
mkdir x088 ret,->path="b..",mode= (each value is valid)
chroot x03d ret,->path="b..","."
chdir x00c ret,->path=".."
getpeername x01f ret,sfd,->sadr=[],->[len=0x10]
socket x061 ret,AF_INET=2,SOCK_STREAM=1,prot=0
bind x068 ret,sfd,->sadr=[0xff,2,hi,lo,0,0,0,0],->[0x10]
listen x06a ret,sfd,backlog=5
accept x01e ret,sfd,0,0
dup2 x05a ret,sfd,fd=0,1,2

Linux/x86

On Linux/x86 the int 0x80 instruction is used for calling the operating system services. The eax register denotes the system call number and registers ebx, ecx, edx are appropriately filled with a given system call arguments.

The table below contains detailed information about system call services (their numbers and parameters) we use in our Linux/x86 assembly codes presented further in this document.

syscall %eax %ebx, %ecx, %edx
exec x00b ->path="/bin//sh",->[->a0=path,0]
exec x00b ->path="/bin//sh",->[->a0=path,->a1="-c",->a2=cmd,0]
setuid x017 uid=0
mkdir x027 ->path="b..",mode=0 (each value is valid)
chroot x03d ->path="b..","."
chdir x00c ->path=".."
socketcall x066 getpeername=7,->[sfd,->sadr=[],->[len=0x10]]
socketcall x066 socket=1,->[AF_INET=2,SOCK_STREAM=1,prot=0]
socketcall x066 bind=2,->[sfd,->sadr=[0xff,2,hi,lo,0,0,0,0],len=0x10]
socketcall x066 listen=4,->[sfd,backlog=102]
socketcall x066 accept=5,->[sfd,0,0]
dup2 x03f sfd,fd=2,1,0

BeOS/x86

On BeOS/x86 the int 0x25 interrupt invocation instruction is used for calling the operating system services. The eax register denotes the system call number and system call arguments are passed to the appropriate service routine through stack (they are pushed on it in reverse order - the first system call argument is pushed as the last value).

As a prerequisite to the system call invocation there must be two additional values pushed on the stack just before issuing the int 0x25 instruction - the dummy library return address and the value indicating the number of arguments passed to the system call routine.

The table below contains detailed information about the system call services (their numbers and the stack order of their parameters) we use in our BeOS/x86 assembly codes presented further in this document. It contains only the execv system call description, as we have not yet managed to develop codes for BeOS with the same functionality as for other operating systems (this is mainly due to the lack of available information about the way network operations can be implemented through the system call layer on BeOS).

syscall %eax stack
execv x03f ret,anum=1,->[->path="/bin//sh"],0
execv x03f ret,anum=3,->[->path="/bin//sh",->a1="-c",->a2=cmd],0

Code specifics

The successful application of assembly components in real-life proof of concept codes often requires adopting specific assumptions during their development and actual application. Every piece of assembly code presented in this document is written so that several such assumptions are preserved. First of all, significant emphasis was put on code length - we have made our best efforts to write the shortest possible codes. The next assumption is position independence (PIC). We wrote the codes so that they are position independent and no memory/register constraints are imposed on their usage. The last critical assumption is making all codes zero free, which means that the code instruction sequences do not contain a 0 byte value. Additionally, no error handling routines are applied in any code sample of this work, as we silently assume that system calls return without errors and do not check for them unless this is needed for further proper code execution.

Below each of these critical assumptions is discussed in a more detailed way.

Short code length

Short code length is not usually a requirement that must be fulfilled in order to write a proof of concept code for a given security vulnerability. However in some cases, only specially crafted and really short codes can guarantee success. For us, short code length is a feature much more related with the code art than its real life usability (small is beautiful). In order to write the possibly shortest assembly codes we applied the following rules in their development process:

- if a specific register was to be loaded with a given constant value, that value was usually obtained by a combination of register-register operations,
- if a given memory address was to be loaded with a given value, that value was usually already residing in it.

By applying the rules above, we usually avoided the use of additional memory bytes for code data (there is no need to keep in memory values which in fact are only needed in registers) and store instructions (there is no need to store a given value into memory if it could be already there). This is why in some cases we intentionally introduce some dummy instructions to the code as their encoding bytes are needed for constructing some program data structures in memory. This is especially the case for chroot, bindsck and findsck codes.

If the microprocessor architecture supports a delay slot execution mechanism, we always make use of it. This is especially the case for SPARC and MIPS codes. On PA-RISC, the branch delay slots following the system call invocation instructions are never wasted - instead of a no-operation instruction (nop) they always contain some useful code.

Whenever possible we make use of register specialization (zero registers, loop count registers, etc.) and instruction complexity. The latter mainly concerns CISC microprocessors, which have many instructions performing several operations at once (like x86's LODSx, STOSx and LOOPx instructions). An attempt is always made to load constant values into registers with a single instruction. Repeating code fragments are implemented as subroutine calls or loops whenever such an implementation saves some bytes of code size. A separate subroutine call is also used when the system call invocation requires several instructions (AIX/POWER/PowerPC and HP-UX/PA-RISC) or when the system call instruction must be explicitly constructed in order to avoid zero bytes (Solaris/x86).

For x86 based architectures, instead of the rather lengthy mov instruction, we often make use of the 1 byte PUSH and POP stack instructions whenever there is a need to temporarily store register values in memory. The same applies to the 1 byte long inc reg and dec reg instructions, which are always used in favor of their add reg,1 and sub reg,1 equivalents. An attempt is always made to use the 1 byte instead of the 2 byte register exchange instruction (xchg) by properly selecting its destination parameter. As a short equivalent of the following two instruction sequence:

        mov     reg1,reg2
        add     reg1,offset

we often make use of the lea reg1,[reg2+offset] instruction.
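
The size differences mentioned above can be checked by comparing instruction lengths. The Python sketch below is only an illustration: the byte sequences are hand-assembled 32-bit encodings, and the register choices (eax, ebx, ecx) and the offset value 8 are assumptions made for the example, not values taken from the codes in the appendices.

```python
# Hand-assembled IA-32 encodings (32-bit mode). Registers and the
# constant 8 are illustrative assumptions.
ENC = {
    "mov eax,ebx":     bytes([0x89, 0xD8]),
    "add eax,8":       bytes([0x83, 0xC0, 0x08]),
    "lea eax,[ebx+8]": bytes([0x8D, 0x43, 0x08]),
    "add eax,1":       bytes([0x83, 0xC0, 0x01]),
    "inc eax":         bytes([0x40]),
    "push eax":        bytes([0x50]),
    "pop ebx":         bytes([0x5B]),
    "xchg eax,ebx":    bytes([0x93]),        # short form, eax is one operand
    "xchg ecx,ebx":    bytes([0x87, 0xD9]),  # general form with ModR/M byte
}

def size(*names):
    """Total length in bytes of the named instructions."""
    return sum(len(ENC[n]) for n in names)

mov_add = size("mov eax,ebx", "add eax,8")   # 5 bytes
lea     = size("lea eax,[ebx+8]")            # 3 bytes

if __name__ == "__main__":
    for name, enc in ENC.items():
        print(f"{name:18s} {len(enc)} byte(s)  {enc.hex()}")
    print("mov+add:", mov_add, "bytes, lea:", lea, "bytes")
```

In this example the lea form saves two bytes over the mov/add pair, and the xchg form involving eax is one byte shorter than the general one.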

Position independence

Position independence is a feature that allows the code to access its own data regardless of its initial memory location. The same code can usually also be written in a position dependent way, as was the case for early assembly routines included in proof of concept codes. However, position independence usually makes the code shorter and frees it of any constraints imposed on the knowledge or even validity of the initial register values that are used for proper reconstruction of a given code's data.

On systems where arguments to system calls are passed through the stack, there exist, however, two subtle constraints with regard to the initial value (the value upon entering our code) of the stack pointer register. On these systems, the stack pointer must usually point to a valid memory area (one that is mapped in the process address space and has write access) before execution enters our assembly routines, so that the appropriate push operations can be issued. Apart from that, the stack pointer cannot point to the area in which our assembly code resides, as successive push operations could modify the code itself and destroy it. There usually exists a possibility to set the value of the stack pointer register so that the two aforementioned constraints are fulfilled. This can be achieved relative to the code base address, whose value can be obtained by applying the appropriate instruction sequences discussed in the following subsections. In our assembly routines we never set the initial value of the stack pointer register, as such an operation would in most cases unnecessarily (a claim based on the experience we obtained by writing proof of concept codes) increase the code length of the assembly routines.

In the following subsections, the additional details of writing proof of concept codes for all discussed operating systems will be presented.

IRIX/MIPS

In order to write position independent code, at the beginning of each code block the following instruction is used:

        label:     bltzal  $zero,<label>

The bltzal instruction branches if its register operand is less than zero and makes a link and is in fact a conditional subroutine call instruction. Making a link is equivalent to saving the return address from subroutine to the ra register. For this particular instance of instruction, the operand register is set to zero, therefore the condition is never fulfilled and the branch is never taken. However, the link is done and ra register is filled with the branch return address of <label+8> instruction. This is not the address of <label+4> instruction as on MIPS every branch is supposed to be followed by a no-operation branch delay slot instruction. This is the reason why the additional instruction length was accumulated in the result address.
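
The address arithmetic described above can be sketched as follows. This is a minimal Python model, assuming the standard MIPS REGIMM instruction format; the helper names and the example address 0x1000 are hypothetical and used only for illustration.

```python
def bltzal(rs, offset_words):
    """Encode bltzal: REGIMM opcode 1 | rs | rt=0x10 (BLTZAL) | 16-bit word offset."""
    return (0x01 << 26) | (rs << 21) | (0x10 << 16) | (offset_words & 0xFFFF)

def branch_target(addr, offset_words):
    """Branch offsets are counted in words from the delay slot (addr + 4)."""
    return addr + 4 + 4 * offset_words

def link_address(addr):
    """The link skips the branch itself and its delay slot."""
    return addr + 8

ZERO = 0
label = 0x1000                       # hypothetical address of the bltzal
word = bltzal(ZERO, -1)              # an offset of -1 word points back at <label>

if __name__ == "__main__":
    print(hex(word))                          # encoding of bltzal $zero,<label>
    print(hex(branch_target(label, -1)))      # the (never taken) branch target
    print(hex(link_address(label)))           # value left in ra
```

With an offset of -1 word the 16 bit offset field is filled with one bits, so the instruction encoding itself contains no zero bytes.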

Solaris/SPARC

We obtain the base address of our code by executing the following instruction sequence:

        label:     bn,a    <label-4>
                   bn,a    <label>
                   call    <label+4>

Because the first bn,a (branch never) instruction does not make a branch and, due to its ,a (annul) suffix, annuls the instruction that follows it, the second bn,a instruction does not get executed. As a result the call instruction is executed next, which stores the current value of the program counter (the address of the call instruction itself) in register o7 and transfers program control to the second bn,a instruction. Similarly to the first one, the second bn,a instruction annuls the execution of the next instruction, so the call does not get executed for the second time. As a result of the above sequence of instructions, register o7 is loaded with the address of label+8.
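
A side effect of this sequence is that each of the three instructions refers to the word just before itself, so every displacement equals -1. The Python sketch below, assuming the standard SPARC branch and call instruction formats (the helper names are hypothetical), shows that the resulting encodings contain no zero bytes.

```python
import struct

def bn_a(disp_words):
    """Encode 'bn,a': op=00, annul bit set, cond=0000 (never), op2=010, disp22."""
    return (1 << 29) | (0b010 << 22) | (disp_words & 0x3FFFFF)

def sparc_call(disp_words):
    """Encode 'call': op=01 followed by a 30-bit word displacement."""
    return (0b01 << 30) | (disp_words & 0x3FFFFFFF)

# bn,a <label-4> ; bn,a <label> ; call <label+4>
# Every target lies one word before the instruction that names it.
seq = [bn_a(-1), bn_a(-1), sparc_call(-1)]
code = b"".join(struct.pack(">I", w) for w in seq)

if __name__ == "__main__":
    print([hex(w) for w in seq])
    print("zero free:", 0 not in code)
```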

On SPARC v8+ and above architectures there exists an instruction that allows the value of the pc register to be obtained in a more direct way:

        rd     %pc,%o7  

Although it has a zero byte in its encoding, by appropriately setting one unused bit you can get a zero-free instruction.

HP-UX/PA-RISC

We obtain the base address of our code by executing the following branch and link instruction:

        bl     .+4,reg

The bl instruction provides the functionality of a subroutine call. It makes a call to the address specified by a 17-bit relative offset instruction operand and saves the subroutine return address in register reg. In this specific case, the value of 4 is used as the relative jump offset, so the jump is made forward to the instruction immediately following the branching instruction. Simultaneously, register reg is loaded with the address of the next instruction - the address of the jump target in this case.

AIX/POWER/PowerPC

We issue the following instructions sequence in order to obtain the base address of our code:

        label:     xor.     reg1,reg1,reg1
                   bnel      <label>
                   mflr     reg2

The first instruction sets the EQ bit of CR0 conditional register field, which reflects the zero or equality status of the instruction execution result. The bnel instruction is a conditional branch instruction which makes a jump if the EQ bit of CR0 field is not set (it denotes the non-equality state), which in our case is always false thus the branch is never taken. However, as a result of executing the bnel instruction the link is done and the link register (LR) is loaded with the branch return address of <label+8> instruction. Because link register is a special register which can not be used as an operand of memory access operations, we move its value to the general purpose register reg2 with the use of mflr (move from link register) instruction.

Ultrix/ALPHA

In the case of ALPHA processors, the base address of the code is obtained by executing the following sequence of instructions:

                   ldah    a3, 27643(zero)  
                   lda     a3, -32767(a3)   
                   stl     a3, 320(sp)      
                   lda     a4, 320(sp)      
         jump:     jsr     ra, (a4),0x10 

On Alpha, again there is no easy way of reading the value of the PC register with a zero-free instruction sequence. In our codes we make use of the jump to subroutine and return from subroutine instructions, although not in a direct way. The first two instructions, ldah and lda, load register a3 with the opcode value of a ret zero,(ra),1 instruction. Next, the stl (store longword) instruction stores the value of register a3 at a given stack location. As a result we have a ret zero,(ra),1 instruction at some stack location. The address of that location is put into register a4 by the next lda (load address) instruction. Finally, a jump to subroutine is made through register a4 to the just built return from subroutine instruction. As a result of executing the presented code block, the address of the instruction at offset <jump+4> is available in register ra.
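
The value built in register a3 can be verified with a few lines of arithmetic. The following Python sketch models only the effect of the ldah and lda instructions on a register and compares the result with a ret instruction assembled from its fields; it assumes the standard Alpha memory-format jump encoding, and the helper names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def ldah(base, disp):
    """ldah: base + (16-bit displacement * 65536)."""
    return (base + disp * 65536) & MASK64

def lda(base, disp):
    """lda: base + sign-extended 16-bit displacement."""
    return (base + disp) & MASK64

def ret(ra, rb, hint):
    """Jump format: opcode 0x1a | ra | rb | function 2 (ret) | 14-bit hint."""
    return (0x1A << 26) | (ra << 21) | (rb << 16) | (2 << 14) | (hint & 0x3FFF)

ZERO, RA = 31, 26                    # Alpha register numbers of zero and ra

a3 = ldah(0, 27643)                  # ldah a3, 27643(zero)
a3 = lda(a3, -32767)                 # lda  a3, -32767(a3)

if __name__ == "__main__":
    print(hex(a3))                   # value stored on the stack by stl
    print(hex(ret(ZERO, RA, 1)))     # encoding of ret zero,(ra),1
```

Both values agree, which is why the longword stored by stl can be executed as a return instruction.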

Solaris/SCO{OpenServer,Unixware}/Linux/{Free,Net,Open}BSD/BeOS/x86

On x86 architectures the following instruction sequence is issued at the beginning of each code block in order to obtain its base address:

                  jmp near ptr  <label>
        back:     pop reg
                  ...
        label:    call near ptr <back>

First, a forward near jump is made to the call instruction within the relative offset covered by the 8 bit value. Then a near backward call to the pop instruction is made. Upon its execution the address of the instruction following the call one (the offset value of <label+5> in this case) is pushed onto the stack. That value is next obtained in register reg by executing an appropriate pop reg operation.
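
The reason for jumping forward first and calling backwards afterwards becomes visible when the displacement bytes are written out. The Python sketch below is an illustration only; the distance of 0x20 bytes is an arbitrary assumption and the helper names are hypothetical.

```python
import struct

def call_rel32(disp):
    """Encode 'call rel32' (opcode 0xE8) with a signed 32-bit displacement."""
    return b"\xe8" + struct.pack("<i", disp)

def jmp_rel8(disp):
    """Encode 'jmp rel8' (opcode 0xEB) with a signed 8-bit displacement."""
    return b"\xeb" + struct.pack("<b", disp)

forward_call  = call_rel32(0x20)     # small positive value: upper bytes are 0x00
backward_call = call_rel32(-0x20)    # small negative value: upper bytes are 0xff
forward_jmp   = jmp_rel8(0x20)       # a single displacement byte, no padding

if __name__ == "__main__":
    for name, enc in [("call +0x20", forward_call),
                      ("call -0x20", backward_call),
                      ("jmp  +0x20", forward_jmp)]:
        print(name, enc.hex(), "zero free:", 0 not in enc)
```

A short forward call would carry zero bytes in its 32 bit displacement, while the forward jmp with an 8 bit offset and the backward call do not.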

"Zero free" code

The need for zero free code is a result of the requirement that must be fulfilled when writing proof of concept codes for most buffer overflow and format string vulnerabilities. These classes of errors are based on improper handling of user supplied string data - in most cases this concerns string lengths and their format. In the C language, strings are represented as a contiguous sequence of bytes with a null (zero) byte at the end. If user supplied data containing assembly code is to be properly interpreted as a string argument, it must conform to the way strings are constructed and treated in a UNIX system. This basically explains the need for zero free code. In practice, however, the zero free requirement places many limitations on the way the assembly code can be constructed and sometimes makes it a bit more difficult.
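
The effect of a zero byte can be illustrated with a small model. The Python sketch below is not part of our codes: the helper names c_string_copy and is_zero_free are hypothetical, and the two x86 byte sequences are merely two illustrative ways of loading the value 0x0b into eax.

```python
def c_string_copy(data: bytes) -> bytes:
    """Model what a C string routine such as strcpy() would copy:
    everything up to, but not including, the first zero byte."""
    return data.split(b"\x00", 1)[0]

def is_zero_free(code: bytes) -> bool:
    """True if the byte sequence contains no zero byte."""
    return 0 not in code

# mov eax,0x0000000b: the 32-bit immediate carries three zero bytes
with_zero = bytes([0xB8, 0x0B, 0x00, 0x00, 0x00])
# xor eax,eax ; mov al,0x0b: same result in eax, no zero bytes
zero_free = bytes([0x31, 0xC0, 0xB0, 0x0B])

if __name__ == "__main__":
    print(c_string_copy(with_zero).hex())   # the copy is cut short
    print(c_string_copy(zero_free).hex())   # the copy is complete
```

Only the second sequence survives the copy intact, although it is even one byte shorter than the first.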

On most RISC architectures we cannot use registers with lower numbers (including r0 - zero register) as instruction operands because they usually generate zero byte opcode in the instruction encoding. This is why we try to focus on registers with higher numbers and make appropriate use of the xor reg, reg, reg instruction whenever there is a need for a 0 value as the operand.

On x86, SPARC and PA-RISC architectures 8 and 16 bit constants can be freely loaded into registers without any fear of a zero byte opcode problem. This is due to the fact that x86 microprocessors support loading 8/16 bits constants as SPARC and PA-RISC does for 11/22 bits ones.

We cannot use forward branch instructions in codes unless the architecture supports relative jumps made within the area covered by an 8 bit offset value. This is only the case for Intel x86 and PA-RISC. In all other cases branches can only be made backwards, as we must avoid zero byte opcodes in the instruction encoding (16 bit relative branch offset field encoding). Most memory access operations are made with the use of the 16bit_immediate(reg) addressing mode (register indirect with 16 bit immediate offset). Using negative offsets in such memory references allows us to avoid zero byte opcodes in the 16 bit memory offset encoding.
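
The restriction to backward branches and negative memory offsets follows directly from the two's complement form of a 16 bit field. A minimal Python sketch (the helper names are hypothetical, and only the offset field is modelled, not a complete instruction):

```python
import struct

def offset16(n):
    """Big-endian bytes of a signed 16-bit offset field."""
    return struct.pack(">h", n)

def zero_free_offsets(lo, hi):
    """All offsets in range(lo, hi) whose 16-bit field has no zero byte."""
    return [n for n in range(lo, hi) if 0 not in offset16(n)]

small_forward  = zero_free_offsets(0, 256)     # high byte is always 0x00
small_backward = zero_free_offsets(-256, 0)    # high byte is always 0xff

if __name__ == "__main__":
    print(offset16(4).hex(), offset16(-4).hex())
    print(len(small_forward), "usable forward offsets below 256")
    print(len(small_backward), "usable backward offsets down to -256")
```

Every small positive offset has a zero high byte, while small negative offsets are unusable only when their low byte happens to be zero (-256 in this range).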

IRIX/MIPS

The MIPS syscall instruction is a special instruction which usually has the 0x0000000c encoding. By making closer look at its format we can notice that it has a code field available for use as software parameters:

31 16 0
0 0 0 0 0 0 c c c c c c c c c c c c c c c c c c c c 0 0 1 1 0 0

where:
- c bits - denote code operand field.

We can get rid of zero bytes in the syscall instruction encoding by making use of its code field which may be filled with some arbitrary values other than 0. In most cases we use 0x03ffffcc value for system call instruction across all our codes.
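
The value above can be reproduced from the instruction format. The following Python sketch assumes the field layout shown in the figure (6 opcode bits, 20 code bits, 6 function bits); the function name mips_syscall is hypothetical.

```python
def mips_syscall(code=0):
    """SPECIAL opcode (0) | 20-bit code field | function 0x0c (syscall)."""
    return ((code & 0xFFFFF) << 6) | 0x0C

def be_bytes(word):
    """Instruction word as four big-endian bytes."""
    return word.to_bytes(4, "big")

plain  = mips_syscall()            # code field left at 0
filled = mips_syscall(0xFFFFF)     # code field filled with one bits

if __name__ == "__main__":
    print(hex(plain), be_bytes(plain).hex())
    print(hex(filled), be_bytes(filled).hex())
```

Filling the whole code field with ones is only one possible choice; any code value that leaves none of the four bytes equal to zero would serve the same purpose.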

The other problem is in the NOP instruction (0x00000000 encoding) which has zero byte opcodes in its encoding. We usually solve it by using NOP equivalent dummy instruction of which operation does not influence the assembly code operation. In most cases we use the li t7,4660 instruction with the 0x240f1234 encoding.

In order to avoid zero byte in the load immediate (li reg,16_bit_constant_value) instruction encoding we first use:

        li     reg, const_value

somewhere in the beginning of the code and:

        addi     reg,-(const_value-target_8bit_value)

whenever we need to load register reg with target_8bit_value.

The li/addi instruction sequence is also used for loading 0 into registers. Direct use of the li reg,0 instruction would yield a zero byte in its encoding if the a0-a4 registers were to be zeroed in this way. This is caused by the specifics of the MIPS instruction encoding. In general, we cannot use only low registers as instruction operands; they must be mixed with some higher registers in order to avoid zero bytes. This is the reason why we heavily exploit the s0-s4 registers (r16-r20) in the codes. Another advantage of using them is that they are saved across system calls, so whenever a loop with a system call inside it is needed, it can be easily implemented.
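
The li/addi technique can be sketched as follows. The Python model below assumes the standard MIPS I-type format, with li assembled as addiu from the zero register (which matches the 0x240f1234 encoding quoted earlier). The constant 0x1234, the target value 2 and the choice of s0 and a0 are assumptions made only for this example.

```python
ADDI, ADDIU = 0x08, 0x09
ZERO, A0, T7, S0 = 0, 4, 15, 16      # MIPS register numbers

def itype(op, rs, rt, imm):
    """I-type instruction: opcode | rs | rt | 16-bit immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def zero_free(word):
    """True if none of the four instruction bytes is zero."""
    return 0 not in word.to_bytes(4, "big")

CONST, TARGET = 0x1234, 2            # illustrative values

direct = itype(ADDIU, ZERO, A0, TARGET)            # li   a0,2
setup  = itype(ADDIU, ZERO, S0, CONST)             # li   s0,0x1234
adjust = itype(ADDI, S0, A0, -(CONST - TARGET))    # addi a0,s0,-(0x1234-2)

if __name__ == "__main__":
    for name, w in [("direct", direct), ("setup", setup), ("adjust", adjust)]:
        print(f"{name:6s} {w:#010x} zero free: {zero_free(w)}")
```

The direct load carries a zero byte in its immediate field, while the two-instruction variant does not, at the cost of keeping the constant in a spare high register.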

HP-UX/PA-RISC

The HP-UX system call invocation is usually composed of the following sequence of instructions:

        ldil     L'-0x40000000,%r1
        be,l     4(%sr7,%r1)
        ldi      syscall_number,%r22

The last ldi instruction from the sequence above has the encoding as presented below:

31 16 0
0 0 1 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 s s s s s s s s s s s s s s

where:
- s bits - denote syscall_number operand

It can be clearly seen that, for a small range of the syscall_number parameter values (less than 256), the instruction yields zero byte opcode in its encoding. Fortunately, this is all about loading system call number to register r22, which can be also done in other way. We use the following instructions sequence as an equivalent of the code presented above:

        ldil     L'-0x40000000,%r1
        be,l     4(%sr7,%r1)
        addi,>   syscall_number,%r0,%r22

The other problem occurs whenever we need to add a constant to a given register. We cannot simply use add immediate instruction, as it usually has zero byte in the instruction encoding (especially for low immediate operand values). Instead, we make use of the conditional representation of the add immediate instruction. It is presented in the figure below:

        addi,cond    immediate,s_reg,t_reg

31 16 0
1 0 1 1 0 1 s s s s s t t t t t c c c c i i i i i i i i i i i i

where:
- s bits - denote source register operand (s_reg),
- t bits - denote target register operand (t_reg),
- c bits - denote condition field (cond),
- i bits - denote immediate operand value (immediate).

By appropriately filling the condition field (bits 17-20) with binary value of 0111 we make the instruction encoding independent of the immediate argument's value and avoid zero byte opcodes in it.
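
The effect of the condition field can be checked against the simplified field layout from the figure above. In the Python sketch below the immediate is treated as an opaque 12 bit field value (the way the assembler maps the operand into that field is not modelled), and the helper names are hypothetical.

```python
R0, R22 = 0, 22

def addi_cond(imm_field, s_reg, t_reg, cond):
    """Pack the fields as drawn in the figure:
    opcode 101101 | s (5 bits) | t (5 bits) | c (4 bits) | i (12 bits)."""
    return ((0b101101 << 26) | (s_reg << 21) | (t_reg << 16)
            | (cond << 12) | (imm_field & 0xFFF))

def third_byte(word):
    """Bits 15-8 of the word: the condition field plus the top immediate bits."""
    return (word >> 8) & 0xFF

without_cond = addi_cond(0x016, R0, R22, 0b0000)
with_cond    = addi_cond(0x016, R0, R22, 0b0111)

if __name__ == "__main__":
    print(f"{without_cond:#010x} third byte {third_byte(without_cond):#04x}")
    print(f"{with_cond:#010x} third byte {third_byte(with_cond):#04x}")
```

With the condition field set to 0111 the third byte of the instruction can never be zero, whatever the immediate field holds; with an empty condition field it is zero for every small immediate.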

On HP-UX forward jumps are possible with the use of comparative branch instructions. For that purpose we usually use the comb,= instruction.

AIX/POWER/PowerPC

The POWER/PowerPC system call instruction (sc in PowerPC mnemonic, svca in POWER mnemonic) contains a zero byte opcode in its encoding. This is due to the fact that, according to the documentation, the reserved bits of the system call instruction encoding must be set to 0, as presented below:

31 16 0
0 1 0 0 0 1 r r r r r r r r r r r r r r r r r r r r r r r r 1 r

where:
- r bits - denote reserved bits.

As shown above, the r bits do not influence the instruction opcode field (bits 31-26). A quick lookup in the microprocessor documentation also reveals that there is only one instruction with such an opcode. So, the microprocessor instruction decoding unit should properly recognize the system call instruction regardless of its r-bits values. And this is in fact the case. This allows us to set arbitrary r-bits values in the svca instruction encoding and get rid of the zero byte in it. In most cases we use the 0x44ffff02 value for the system call instruction across all our codes.
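
The bit layout in the figure can be turned into a small consistency check. The Python sketch below takes 0x44ffff02 as the filled-in encoding, since that is the value the figure implies (opcode bits 010001 at the top, the fixed 1 bit next to the lowest bit); the mask names are hypothetical.

```python
OPCODE_MASK   = 0xFC000000                      # bits 31-26
FIXED_BIT     = 0x00000002                      # the single fixed 1 bit
RESERVED_MASK = 0xFFFFFFFF & ~(OPCODE_MASK | FIXED_BIT)

sc_plain  = (0b010001 << 26) | FIXED_BIT        # all reserved bits cleared
sc_filled = 0x44FFFF02                          # some reserved bits set

def zero_free(word):
    """True if none of the four instruction bytes is zero."""
    return 0 not in word.to_bytes(4, "big")

changed_bits = sc_plain ^ sc_filled             # bits that differ

if __name__ == "__main__":
    print(hex(sc_plain), zero_free(sc_plain))
    print(hex(sc_filled), zero_free(sc_filled))
    print("only reserved bits changed:", changed_bits & ~RESERVED_MASK == 0)
```

The two words differ only in reserved bits, so the opcode field seen by the instruction decoder is the same in both cases.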

For POWER/PowerPC, the preferred instruction for a nop operation is oril r0,r0,0x0 (0x60000000 encoding). However, we use the mr r31, r31 instruction with 0x7ffffb78 encoding in order to avoid zero bytes.

As in the case of IRIX/MIPS, an appropriate lil/cal instruction sequence is used for loading 8 bit constants into registers across our AIX/POWER/PowerPC codes.

Ultrix/ALPHA

The ALPHA call_pal instruction is a special instruction, which usually has the 0x00000083 opcode encoding, presented below in a binary notation.

31 16 0
0 0 0 0 0 0 c c c c c c c c c c c c c c c c c c c c c c c c c c

where:
- c bits - denote the function number (call_pal=0x83).

Unfortunately, contrary to other RISC microprocessors (like MIPS for example), we cannot easily get rid of zeros in the opcode of this instruction. This is the reason why we construct the call_pal instruction on the stack independently with the use of the following code block:

        bis     zero, 0x83, a3   
        stl     a3, 8320(sp)     
        lda     a4, 8320(sp)     
        lda     a5, 699(zero)    
        lda     v0, -640(a5)     
        jsr     ra, (a4),0x10    

First, the opcode value of the call_pal instruction (0x00000083) is loaded into register a3 with the use of a bis operation. Then, the stl (store longword) instruction stores the value of register a3 at a given stack location. The address of that location is put into register a4 by the next lda (load address) instruction. The syscall number of the system service to be called is calculated and placed in register v0. This is done in two steps in order to avoid zero bytes in the instruction encoding. Finally, a jump to subroutine is made through register a4 to the just built call_pal instruction, which results in a system call invocation.
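
The two-step calculation of the syscall number can be checked numerically. The Python sketch below assumes the standard Alpha memory instruction format for lda; the helper names are hypothetical, and the direct one-instruction load is shown only for comparison.

```python
V0, A5, ZERO = 0, 21, 31             # Alpha register numbers

def lda(ra, rb, disp):
    """Memory format: opcode 0x08 | ra | rb | 16-bit displacement."""
    return (0x08 << 26) | (ra << 21) | (rb << 16) | (disp & 0xFFFF)

def zero_free(word):
    """True if none of the four instruction bytes is zero."""
    return 0 not in word.to_bytes(4, "little")

syscall_number = 699 - 640           # value left in v0 by the two lda steps

direct = lda(V0, ZERO, syscall_number)   # lda v0, 59(zero), for comparison
step1  = lda(A5, ZERO, 699)              # lda a5, 699(zero)
step2  = lda(V0, A5, -640)               # lda v0, -640(a5)

if __name__ == "__main__":
    print("v0 =", syscall_number, hex(syscall_number))
    for name, w in [("direct", direct), ("step1", step1), ("step2", step2)]:
        print(f"{name:6s} {w:#010x} zero free: {zero_free(w)}")
```

The direct load would place a zero byte in the displacement field, whereas both instructions of the two-step variant are zero free.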

Solaris/SCO{OpenServer,Unixware}/x86

On Solaris and SCO systems, the far lcall instruction that is used for invoking operating system services has the following encoding:

        lcall     $0x7,$0x0   0x9a,0x00,0x00,0x00,0x00,0x07,0x00

Because there does not exist a far call equivalent instruction or a sequence of instructions providing the same functionality (That is the case for Solaris, FreeBSD provides int 0x80 as another way of invoking operating system services.), in order to get rid of zero bytes in the lcall instruction encoding we must construct it in the code itself. And that lcall construction is done with the use of the following code fragment:

        syscallcode:     xorl    %eax,%eax
                         jmp     <syscallcode+13>
                         popl    %edi
                         pushl   %edi
                         incl    %edi
                         stosl   %eax,%es:(%edi)
                         incl    %edi
                         stosb   %al,%es:(%edi)
                         ...
                         call    <syscallcode+4>
                         "\x9a\xff\xff\xff\xff"
                         "\x07\xff"
                         ret

In the code above, the far call instruction is constructed in memory just after the relative call to <syscallcode+4>. First, the absolute address of the program data following the call instruction is obtained in register edi. Then two successive store string operations (long and byte) are performed, each preceded by an incl that skips one byte that must be preserved (0x9a and 0x07 respectively). The stores put zero byte values (the contents of the eax register) in place of the 0xff ones, which finally results in the proper 0x9a, 0x00, 0x00, 0x00, 0x00, 0x07, 0x00 lcall instruction encoding sequence.
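The effect of the patching sequence can be modelled on an ordinary byte buffer. The following C sketch is only an emulation of the incl/stosl/incl/stosb steps with eax equal to zero; it assumes the direction flag is clear, so that the store string operations advance edi. The function name patch_lcall is hypothetical.

```c
#include <string.h>

/* p points at the 0x9a byte, as edi does right after "popl %edi". */
void patch_lcall(unsigned char *p)
{
    p++;               /* incl %edi: skip the 0x9a opcode byte   */
    memset(p, 0, 4);   /* stosl: store four zero bytes           */
    p += 4;            /* stosl also advances edi by four        */
    p++;               /* incl %edi: skip the 0x07 byte          */
    *p = 0;            /* stosb: store one zero byte             */
}
```

Applied to the seven bytes 0x9a, 0xff, 0xff, 0xff, 0xff, 0x07, 0xff, this leaves the lcall encoding given at the beginning of this section.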

Assembly codes functionality

This section discusses in more detail the functionality of the assembly components used in the proof of concept codes we wrote for different operating systems. We have distinguished several types of such assembly routines, which differ in their actual functionality and in the possible impact of their practical application. The operation of each type is explained with a short C-like program. This section may also be considered a practical introduction to the appendices at the end of this paper, which contain the various functional types of assembly routines discussed in this document, written for different operating systems. All codes have been developed in accordance with the previously presented assumptions. Special effort has been made to make these code blocks position and register independent, so that they can be almost freely combined to obtain a given functionality (for example, a chroot-breaking bind shellcode can be easily built by appending together an optional syscallcode, chrootcode, bindsckcode and shellcode tables). The appropriate appendix also includes a sample program illustrating the usage of all codes.

Shell execution (shellcode)

The simplest and most common assembly routine seen in proof of concept codes is shellcode. It is equivalent to the following C language statement:

        execl("/bin/sh","/bin/sh",0);

It simply executes the /bin/sh program.

Single command execution (cmdshellcode)

The cmdshellcode routine is more or less equivalent to the following C language statement:

        execl("/bin/sh","/bin/sh","-c",cmd,0);

It executes the commands given in the cmd string with the /bin/sh shell program. As a prerequisite, a null-terminated cmd string must be appended to the end of the cmdshellcode.

Privileges restoration (set{uid,euid,reuid,resuid}code)

Privilege restoration routines restore a process' root user privileges whenever the process still possesses them but they are temporarily unavailable for security reasons. These routines are especially useful for exploiting vulnerabilities in certain setuid binaries: those that revert but do not completely drop their elevated privileges. Because the privilege restoration mechanism is implemented differently on various operating systems, we use several routines for this purpose.

[setuidcode] On Solaris, privilege restoration is done by the setuidcode routine, which is equivalent to the following C language statement:

        setuid(0);

It sets privileges of a given process to the privileges of a root user.

[seteuidcode] On AIX systems, it is done by the seteuidcode routine, which is equivalent to the following C language statement:

        seteuid(0);

It sets a given process' effective privileges to the privileges of a root user.

[setreuidcode] On IRIX and Ultrix systems, it is done by the setreuidcode routine, which is equivalent to the following C language statements respectively:

        setreuid(getuid(),0); // (IRIX)
        setreuid(0,0);        // (Ultrix)

It restores a process' saved root user privileges whenever the process possesses them but they are temporarily unavailable due to a previous setreuid call.

[setresuidcode] On HP-UX systems, it is done by the setresuidcode routine, which is equivalent to the following C language statement:

        setresuid(0,0,0);

The setresuid function does the same as setreuid if its third argument is equal to -1. In our codes we invoke setresuid with the third argument set to 0, which directly sets the process' root user privileges, provided that the process possessed them before.

Chroot limited environment escape (chrootcode)

The chrootcode breaks the chroot jail. It is more or less equivalent to the following C language statements:

        mkdir("a..",mode);
        chroot("a..");
        for(i=257;i>0;i--) chdir("..");
        chroot(".");

This piece of code breaks the chroot jail if the process in whose context it executes has the euid of the root user (this is the prerequisite for the chroot call to succeed). At the start of the chrootcode, a helper directory named "a.." is created. At this point, the kernel structures of the process hold the vnodes of its current and root directories, and these are either the same or the current directory lies below the root. When the chroot("a..") system call is executed, the root directory vnode moves below the current one in the directory tree hierarchy. As a result, every chdir("..") system call executed in the loop completes successfully, because the root directory vnode is never encountered while moving up the directory tree. The last chroot(".") system call completes the chroot jail break: it resets the process root directory vnode to that of its current directory, which by then is the real / directory of the filesystem.

In order to minimize the code length, we usually place a dummy instruction at the beginning of the chrootcode routine. This is usually an instruction that has the "a.." string in its opcode and does not influence the operation of the code itself (it uses only register or immediate operands). With such an instruction in the code, we do not have to construct the dirname ("a.."), current_dir (".") and parent_dir ("..") system call parameters separately, as they are all substrings of the "a.." string.
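The substring trick can be illustrated in plain C, independently of any instruction encoding. In this minimal sketch the three accessor names are hypothetical; the point is only that one null-terminated "a.." string yields all three path arguments through pointer offsets.

```c
/* One 4-byte string (including the terminator) supplies all paths. */
static const char name[] = "a..";

const char *dirname_arg(void)     { return name;     }  /* "a.." */
const char *parent_dir_arg(void)  { return name + 1; }  /* ".."  */
const char *current_dir_arg(void) { return name + 2; }  /* "."   */
```

All three strings share the same terminating zero byte, so only a single terminator has to be produced at run time.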

Find socket code (findsckcode)

The findsckcode routine is more or less equivalent to the following C language statements:

        j=sizeof(struct sockaddr_in);
        for(i=256;i>=0;i--){
            if(getpeername(i,&adr,&j)==-1) continue;
            if(*((unsigned short*)&(adr[2]))==htons(port)) break;
        }
        for(j=2;j>=0;j--) dup2(i,j);

It allows existing TCP connections of a given process to be reused, so that an interactive command shell can usually be spawned over them.

The code above walks the process descriptor table in search of the socket descriptor connected to the remote TCP endpoint identified by the port number stored at offset FINDSCKPORTOFS of the findsckcode routine. If such an endpoint is found, the loop terminates and the found TCP socket descriptor is duplicated onto the stdin, stdout and stderr of the process.

Prior to executing the findsckcode routine, the client software should establish a TCP connection with the process in whose context the code is to be executed. The code data at offset FINDSCKPORTOFS should also be set appropriately to ensure proper identification of the client's connection.
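The port comparison performed inside the loop can be sketched as a small C helper. It assumes that the port field lies at byte offset 2 of the address structure, as in the C fragment above; the name peer_port_matches is hypothetical, and memcpy is used instead of a pointer cast only to keep the example free of alignment issues.

```c
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Returns 1 if the 16 bits at byte offset 2 of *adr equal the given
   port expressed in network byte order, 0 otherwise. */
int peer_port_matches(const struct sockaddr_in *adr, unsigned short port)
{
    unsigned short raw;
    memcpy(&raw, (const unsigned char *)adr + 2, sizeof raw);
    return raw == htons(port);
}
```

The client's source port is therefore the only value that has to be known in advance to recognise its connection.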

Network server code (bindsckcode)

The bindsckcode is more or less equivalent to the following C language statements:

        sck=socket(AF_INET,SOCK_STREAM,0);
        bind(sck,addr,sizeof(addr));
        listen(sck,5);
        clt=accept(sck,NULL,0);
        for(i=2;i>=0;i--) dup2(clt,i);

The code above creates a listening TCP socket on a given port. Upon accepting a connection, it duplicates the socket descriptor of the connected remote party to the process stdio descriptors (0, 1 and 2). The port number to which the socket is bound is defined at offset BINDSCKPORTOFS of the bindsckcode (its value is set to 0x1234 by default).
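The descriptor duplication step can be written as a small C helper. This is only a sketch with a hypothetical name (dup_onto); it relies on the standard dup2(oldfd, newfd) argument order, in which the descriptor to be copied comes first.

```c
#include <unistd.h>

/* Makes each descriptor in targets[0..n-1] refer to the same open
   file as src. Returns 0 on success, -1 if any dup2 call fails. */
int dup_onto(int src, const int *targets, int n)
{
    int i;
    for (i = n - 1; i >= 0; i--)
        if (dup2(src, targets[i]) == -1)
            return -1;
    return 0;
}
```

Called with the accepted socket as src and the targets 0, 1 and 2, it has the same effect as the last loop of the fragment above.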

In order to minimize the code length, we usually place a dummy instruction at the beginning of the bindsckcode routine. Part of its opcode value is used to construct the sockaddr_in structure, which is passed as an argument to the bind system call and has the following layout:

        struct sockaddr_in {
               uchar sin_len = xx (does not matter for AF_INET)
               uchar sin_family = 02 (AF_INET)
               ushort sin_port = contains the port value
               uint sin_addr.s_addr = 00 (INADDR_ANY)
        }

In our bindsckcode codes, we never set the sin_len field of the sockaddr_in structure, as its value is not important for AF_INET domain sockets (ours is in AF_INET).

The dummy instruction does not influence the operation of the code itself (it uses only register or immediate operands), and it is selected so that its encoding contains the sin_port or sin_family values.
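The byte image that the dummy instruction has to supply can be written out explicitly. The sketch below follows the field layout listed above; the 16-byte total size and the function name build_sockaddr_image are assumptions made for illustration, and on little-endian systems where sin_family is a 16-bit field the first two bytes would be laid out differently.

```c
#include <string.h>

/* Fills out[0..15] with the byte image described in the text:
   out[0]    = filler (sin_len position, value irrelevant here),
   out[1]    = 2 (AF_INET),
   out[2..3] = port in network (big-endian) byte order,
   out[4..7] = 0 (INADDR_ANY), remaining bytes zeroed. */
void build_sockaddr_image(unsigned char out[16], unsigned char filler,
                          unsigned short port)
{
    memset(out, 0, 16);
    out[0] = filler;
    out[1] = 2;
    out[2] = (unsigned char)(port >> 8);
    out[3] = (unsigned char)(port & 0xff);
}
```

For the default port 0x1234 the first four bytes are xx, 0x02, 0x12, 0x34, so an instruction whose encoding contains 0x02 followed by the port bytes already provides the sin_family and sin_port fields.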

Stack pointer retrieval (jump)

The jump routine obtains the current value of a process' stack pointer register. It is usually implemented as a subroutine call and a sequence of two instructions:

- the one transferring the contents of a stack pointer register to the return value register, as specified in a linkage convention/ABI for a given architecture,
- the branch instruction that makes an actual return from the subroutine.

On most UNIX systems, the invocation of a jump code from within a C language program can be done with the use of an appropriate cast operator:

        int     sp=(*(int(*)())jump)();

However, on AIX, due to a different linkage convention for global symbols, the call to the jump code must be made in a special way:

        int     buf[2]={(int)&jump,*((int*)&main+1)};
        int     sp=(*(int(*)())buf)();

It is also worth mentioning that on HP-UX/PA-RISC a special inter-segment branch instruction is used to return from the jump subroutine:

        be     0x0(%sr0,%rp)

Such an inter-segment call instruction is required whenever program execution is to be passed between data and code segments.

No-operation instruction (nop)

The nop (no operation) instruction is a helper instruction used in proof of concept codes whenever a heuristic jump into the user supplied code data must be made within a vulnerable program.

Although every microprocessor architecture supports the concept of a nop instruction, not all of these instructions can be used in proof of concept codes, because of the zero byte avoidance problem. In such cases nop-equivalent instructions are used; these are usually instructions that use only register and immediate operands and do not reference memory in any way.
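Whether a candidate nop encoding is usable can be checked mechanically. The following C sketch scans an instruction encoding for zero bytes; the function name contains_zero_byte is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if any of the len bytes at code is zero, 0 otherwise. */
int contains_zero_byte(const unsigned char *code, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (code[i] == 0)
            return 1;
    return 0;
}
```

For example, the canonical MIPS nop is encoded as 0x00000000 and fails this check, whereas the one-byte x86 nop (0x90) passes it.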

Source code and references

The source code of all assembly components for all the mentioned processor architectures can be found in a single tar.gz file, which can be downloaded from our projects page. The codes can also be viewed in the appendices of the PDF version of this paper.

The references for this paper can be found in the architectures section of our general references page.

Final Notes

The authors reserve the right not to be responsible for the topicality, correctness, completeness or quality of the information provided in this document. Liability claims regarding damage caused by the use of any information provided, including any kind of information which is incomplete or incorrect, will therefore be rejected.

The Last Stage of Delirium Research Group reserves the right to change or discontinue this document without notice.

Copyright © 2001 The Last Stage of Delirium Research Group, Poland