System calls. Man syscalls (2): Linux system calls Theory

VLADIMIR MESHKOV

Intercepting system calls in Linux OS

In recent years, Linux has firmly taken a leading position as a server platform, ahead of many commercial products. Nevertheless, protecting information systems built on this OS remains a pressing issue. Many technical means, both software and hardware, help secure a system: encryption of data and network traffic, access control for information resources, protection of email and web servers, anti-virus protection, and so on. The list, as you understand, is quite long. In this article we consider a protection mechanism based on intercepting Linux system calls. This mechanism lets you take control of any application's operation and thereby prevent destructive actions it might perform.

System calls

Let's start with a definition. System calls are a set of functions implemented in the OS kernel. Any request from a user application is eventually translated into a system call that performs the requested action. A complete list of Linux system calls is located in the file /usr/include/asm/unistd.h. Let's look at the general mechanism for executing system calls using an example. Suppose the source code of your application calls the creat() function to create a new file. After compilation, this call becomes machine code that loads the system call number and the function's parameters into the processor registers and then raises interrupt 0x80. The following values are loaded into the processor registers:

EAX register - the system call number. In our case the number is 8 (see __NR_creat);
EBX register - the first parameter of the function (for creat this is a pointer to a string containing the name of the file being created);
ECX register - the second parameter (file access rights).

The third parameter is loaded into the EDX register; in this case there is none. System calls in Linux are serviced by the system_call function, which is defined in the file /usr/src/linux/arch/i386/kernel/entry.S. This function is the entry point for all system calls: the kernel responds to interrupt 0x80 by calling system_call, which is in effect the handler for interrupt 0x80.

To make sure we are on the right track, let's write a small test fragment in assembly language. In it we will see what the creat() function turns into after compilation. Let's call the file test.S. Here is its content:

.globl _start

.text

_start:

We load the system call number into the EAX register:

movl $8, %eax

In the EBX register – the first parameter, a pointer to a string with the file name:

movl $filename, %ebx

In the ECX register – the second parameter, access rights:

movl $0, %ecx

Call an interrupt:

int $0x80

We exit the program. To do this, call the exit(0) function:

movl $1, %eax
movl $0, %ebx
int $0x80

In the data segment we indicate the name of the file to be created:

.data

filename: .string "file.txt"

Compiling:

gcc -c test.S

ld -s -o test test.o

The executable file test will appear in the current directory. By running it, we will create a new file called file.txt.

Now let's get back to the system call mechanism. So, the kernel invokes the handler for interrupt 0x80, the system_call function. Using the SAVE_ALL macro, system_call pushes copies of the registers containing the call parameters onto the stack and then invokes the required system function with a call instruction. The table of pointers to the kernel functions that implement system calls is the sys_call_table array (see the file arch/i386/kernel/entry.S). The system call number in the EAX register is an index into this array. Thus, if EAX contains 8, the kernel function sys_creat() is called. Why is the SAVE_ALL macro needed? The explanation is simple. Since almost all kernel system functions are written in C, they look for their parameters on the stack, and SAVE_ALL is what puts them there. The return value of the system call is stored in the EAX register.

Now let's figure out how to intercept a system call. The mechanism of loadable kernel modules will help us with this. Although we have previously considered the development and use of kernel modules, in the interests of consistency in presenting the material, we will briefly consider what a kernel module is, what it consists of, and how it interacts with the system.

Loadable kernel module

A loadable kernel module (let's denote it LKM - Loadable Kernel Module) is program code executed in kernel space. The main feature of LKM is the ability to dynamically load and unload without the need to reboot the entire system or recompile the kernel.

Each LKM consists of two main functions (minimum):

module initialization function. Called when LKM is loaded into memory:

int init_module(void) { ... }

module unload function:

void cleanup_module(void) { ... }

Here's an example of a simple module:

#define MODULE
#include <linux/module.h>

int init_module(void)
{
    printk("Hello World\n");
    return 0;
}

void cleanup_module(void)
{
    printk("Bye\n");
}

Compile and load the module. The module is loaded into memory using the insmod command:

gcc -c -O3 helloworld.c

insmod helloworld.o

Information about all modules currently loaded into the system is located in the /proc/modules file. To verify that the module is loaded, enter the command cat /proc/modules or lsmod. The rmmod command unloads a module:

rmmod helloworld

System call interception algorithm

To implement a module that intercepts a system call, it is necessary to define an interception algorithm. The algorithm is as follows:

save a pointer to the original call so that it can be restored;
create a function that implements a new system call;
in the system call table sys_call_table, replace calls, i.e. set up an appropriate pointer to the new system call;
upon completion of work (when unloading the module), restore the original system call using the previously saved pointer.

Tracing allows you to find out which system calls are used when the user's application is running. By tracing, you can determine which system call should be intercepted to take control of the application. An example of using the tracing program will be discussed below.

Now we have enough information to begin studying examples of implementations of modules that intercept system calls.

Examples of intercepting system calls

Prohibition of directory creation

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/unistd.h>

Declare the system call table exported by the kernel:

extern void *sys_call_table[];

Let's define a pointer to save the original system call:

int (*orig_mkdir)(const char *path);

Let's create our own system call. It does nothing and simply returns zero:

int own_mkdir(const char *path)
{
    return 0;
}

During module initialization, we save a pointer to the original call and replace the system call:

int init_module()
{
    orig_mkdir=sys_call_table[__NR_mkdir];
    sys_call_table[__NR_mkdir]=own_mkdir;
    return 0;
}

When unloading, we restore the original call:

void cleanup_module()
{
    sys_call_table[__NR_mkdir]=orig_mkdir;
}

We will save the code in the file sys_mkdir_call.c. To obtain an object module, create a Makefile with the following content:

CC = gcc
CFLAGS = -O3 -Wall -fomit-frame-pointer
MODFLAGS = -D__KERNEL__ -DMODULE -I/usr/src/linux/include

sys_mkdir_call.o: sys_mkdir_call.c
	$(CC) -c $(CFLAGS) $(MODFLAGS) sys_mkdir_call.c

Use the make command to build the kernel module. Having loaded it, let's try to create a directory with the mkdir command. As you can see, nothing happens: the command no longer works. To restore it, simply unload the module.

Prevent reading a file

In order to read a file, it must first be opened using the open function. It is easy to guess that this function corresponds to the sys_open system call. By intercepting it, we can protect the file from being read. Let's look at the implementation of an interceptor module.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/unistd.h>
#include <asm/uaccess.h>

extern void *sys_call_table[];

Pointer to save the original system call:

int (*orig_open)(const char *pathname, int flag, int mode);

The first parameter of the open function is the name of the file to open. The new system call must compare this parameter with the name of the file we want to protect. If the names match, a file opening error is simulated. Our new system call looks like this:

int own_open(const char *pathname, int flag, int mode)
{

Let's put the name of the file to open here:

    char *kernel_path;

The name of the file we want to protect:

    char hide[]="test.txt";

Let's allocate memory and copy the name of the file to be opened there:

    kernel_path=(char *)kmalloc(255,GFP_KERNEL);
    if(kernel_path == NULL)
        return -ENOMEM;
    copy_from_user(kernel_path, pathname, 255);
    kernel_path[254]='\0'; /* make sure the copied name is terminated */

Let's compare:

    if(strstr(kernel_path,(char *)&hide) != NULL) {

We free memory and return an error code if the names match:

        kfree(kernel_path);
        return -ENOENT;
    }
    else {

If the names do not match, we call the original system call to perform the standard procedure for opening a file:

        kfree(kernel_path);
        return orig_open(pathname, flag, mode);
    }
}

int init_module()
{
    orig_open=sys_call_table[__NR_open];
    sys_call_table[__NR_open]=own_open;
    return 0;
}

void cleanup_module()
{
    sys_call_table[__NR_open]=orig_open;
}

Let's save the code in the file sys_open_call.c and create a Makefile to obtain the object module:

CC = gcc
CFLAGS = -O2 -Wall -fomit-frame-pointer
MODFLAGS = -D__KERNEL__ -DMODULE -I/usr/src/linux/include

sys_open_call.o: sys_open_call.c
	$(CC) -c $(CFLAGS) $(MODFLAGS) sys_open_call.c

In the current directory, create a file called test.txt, load the module and enter the command cat test.txt. The system will report that there is no file with that name.

To be honest, such protection is easy to bypass. It is enough to use the mv command to rename the file and then read its contents.

Hiding a file entry in a directory

Let's determine which system call is responsible for reading the contents of the directory. To do this, let's write another test fragment that reads the current directory:

/* File dir.c */
#include <dirent.h>

int main()
{
    DIR *d;
    struct dirent *dp;

    d = opendir(".");
    dp = readdir(d);

    return 0;
}

Let's get the executable module:

gcc -o dir dir.c

and trace it:

strace ./dir

Let's pay attention to the penultimate line:

getdents(6, /* 4 entries*/, 3933) = 72;

The contents of the directory are read by the getdents function. The result is stored as a list of structures of type struct dirent. The second parameter of this function is a pointer to this list. The function returns the length of all entries in the directory. In our example, the getdents function determined the presence of four entries in the current directory - “.”, “..” and our two files, the executable module and the source text. The length of all entries in the directory is 72 bytes. Information about each record is stored, as we have already said, in the struct dirent structure. Of interest to us are two fields of this structure:

d_reclen - record size;
d_name - file name.

In order to hide an entry about a file (in other words, make it invisible), you need to intercept the sys_getdents system call, find the corresponding entry in the list of received structures, and delete it. Let's look at the code that performs this operation (the author of the original code is Michal Zalewski):

extern void *sys_call_table[];

int (*orig_getdents)(u_int, struct dirent *, u_int);

Let's define our system call.

int own_getdents(u_int fd, struct dirent *dirp, u_int count)
{

    int tmp, t;
    unsigned int n;

The purpose of the variables will be shown below. Additionally, we need structures:

    struct dirent *dirp2, *dirp3;

The name of the file we want to hide:

    char hide[]="our.file";

Let's determine the length of entries in the directory:

    tmp=(*orig_getdents)(fd,dirp,count);

    if(tmp>0) {

Let's allocate memory for the structure in kernel space and copy the contents of the directory into it:

        dirp2=(struct dirent *)kmalloc(tmp,GFP_KERNEL);
        if(dirp2 == NULL)
            return tmp;
        copy_from_user(dirp2,dirp,tmp);

Let's use the second structure and save the length of entries in the directory:

        dirp3=dirp2;
        t=tmp;

Let's start looking for our file:

        while(t>0) {

We read the length of the current entry and determine the remaining length of entries in the directory:

            n=dirp3->d_reclen;
            t-=n;

We check whether the file name from the current entry matches the one we are looking for:

            if(strstr((char *)&(dirp3->d_name),(char *)&hide) != NULL) {

If so, overwrite the entry with the ones that follow it and calculate the new length of the entries in the directory:

                memmove(dirp3,(char *)dirp3+n,t);
                tmp-=n;

Otherwise, we position the pointer to the next entry and continue the search:

            } else {
                dirp3=(struct dirent *)((char *)dirp3+n);
            }
        }

We return the result and free the memory:

        copy_to_user(dirp,dirp2,tmp);
        kfree(dirp2);
    }

Return the length of entries in the directory:

    return tmp;
}

The functions for initializing and unloading a module have a standard form:

int init_module(void)
{
    orig_getdents=sys_call_table[__NR_getdents];
    sys_call_table[__NR_getdents]=own_getdents;
    return 0;
}

void cleanup_module()
{
    sys_call_table[__NR_getdents]=orig_getdents;
}

Let's save the source text in the file sys_call_getd.c and create a Makefile with the following content:

CC = gcc
module = sys_call_getd.o
CFLAGS = -O3 -Wall
LINUX = /usr/src/linux
MODFLAGS = -D__KERNEL__ -DMODULE -I$(LINUX)/include

sys_call_getd.o: sys_call_getd.c
	$(CC) -c $(CFLAGS) $(MODFLAGS) sys_call_getd.c

In the current directory, create a file our.file and load the module. The file disappears, which was what needed to be proven.

As you understand, it is not possible to consider an example of intercepting every system call in one article. Therefore, for those who are interested in this issue, I recommend visiting the sites:

There you can find more complex and interesting examples of system call interception. Please write about all comments and suggestions on the magazine forum.

When preparing the article, materials from the site were used

"The time has come," the Walrus said, "to talk of many things."
L. Carroll (quoted from the book by B. Stroustrup)

Instead of an introduction.

Much has been written and rewritten about the internal structure of the Linux kernel in general, and about its various subsystems and system calls in particular. Probably every self-respecting author should write about this at least once, just as every self-respecting programmer should definitely write his own file manager :) I am not a professional IT writer, and I make my notes mainly for myself, so as not to forget too quickly what I have learned. But if my travel notes prove useful to someone, I will of course be only too glad. And since you can't spoil porridge with butter, maybe I will even manage to describe something that no one else bothered to mention.

Theory. What are system calls?

When explaining to the uninitiated what software (or an OS) is, people usually say something like this: the computer itself is just a piece of hardware, and software is what lets you get some benefit out of it. Crude, of course, but broadly true. I would probably say the same about the OS and system calls. In different operating systems, system calls may be implemented differently and their number may vary, but in one form or another the system call mechanism is present in any OS. Every day the user works with files, explicitly or implicitly. He can explicitly open a file for editing in his favorite MS Word or Notepad, or he can simply launch a game whose executable image is, by the way, also stored in a file, which in turn must be opened and read by the executable loader. The game itself may open and read dozens of files while it runs. Naturally, files can be not only read but also written (not always, but we are not talking here about access rights and discretionary access control :)). All of this is managed by the kernel (in microkernel operating systems the situation may differ, but we are steadily moving towards the object of our discussion, Linux, so we will ignore this point). Creating a new process is also a service provided by the OS kernel. All this is great, as is the fact that modern processors run at gigahertz frequencies and consist of many millions of transistors, but what then?
If there were no mechanism through which user applications could perform such mundane yet necessary actions (in fact, these trivial actions are always performed by the OS kernel, not by the user application - author), then the OS would be a thing in itself, absolutely useless, or else every user application would have to become an operating system of its own in order to serve all its needs. Cute, isn't it?

Thus we arrive at a first-approximation definition: a system call is a service that the OS kernel provides to a user application at the application's request. Such a service may be the already mentioned opening of a file, creating it, reading, writing, creating a new process, obtaining a process identifier (pid), mounting a file system or, finally, halting the system. In real life there are many more system calls than are listed here.

What does a system call look like, and what is it? From the above it is clear that a system call is a kernel subroutine of a particular form. Those who have programmed under Win9x/DOS probably remember interrupt int 0x21 with all (or at least some) of its many functions. However, there is one small quirk that applies to all Unix system calls. By convention, a function implementing a system call may take N arguments or none at all, but it must return an int value. Any non-negative value is interpreted as successful execution of the function and therefore of the system call itself. A value less than zero signals an error and at the same time carries the error code (error codes are defined in the headers include/asm-generic/errno-base.h and include/asm-generic/errno.h). In Linux, the gateway for system calls was until recently interrupt int 0x80, while in Windows (up to XP Service Pack 2, if I'm not mistaken) the gateway is interrupt 0x2e. Likewise, until recently all system calls in the Linux kernel were handled by the system_call() function. However, as it turned out, the classic mechanism of handling system calls through gate 0x80 causes a significant performance drop on Intel Pentium 4 processors. The classic mechanism was therefore replaced by the virtual dynamic shared object method, VDSO (DSO stands for dynamic shared object; I can't vouch for the translation, but a DSO is what Windows users know as a DLL, a dynamically loaded and linked library). What is the difference between the new method and the classic one? First, let's look at the classic method, which works through gate 0x80.

The classic mechanism for servicing system calls in Linux.

Interrupts in x86 architecture.

As mentioned above, gate 0x80 (int 0x80) was previously used to service requests from user applications. A system based on the IA-32 architecture is driven by interrupts (strictly speaking, this applies to all x86-based systems). When some event occurs (a new timer tick, activity on a device, an error such as division by zero, and so on), an interrupt is generated. An interrupt is so named because it typically interrupts the normal flow of code. Interrupts are usually divided into hardware and software interrupts. Hardware interrupts are generated by system and peripheral devices. When a device needs the attention of the OS kernel, it raises a signal on its interrupt request line (IRQ - Interrupt ReQuest line). As a result, a corresponding signal appears at certain processor inputs, and the processor decides to interrupt the instruction stream and transfer control to the interrupt handler, which then finds out what happened and what needs to be done. Hardware interrupts are asynchronous in nature: an interrupt can occur at any time. Besides peripheral devices, the processor itself can generate interrupts (more precisely, hardware exceptions, such as the already mentioned division by zero). This is done to notify the OS that an abnormal situation has occurred, so that the OS can take some action in response. After the interrupt has been processed, the processor returns to executing the interrupted program. An interrupt can also be initiated by a user application. This type of interrupt is called a software interrupt. Software interrupts, unlike hardware ones, are synchronous: when an interrupt is invoked, the code that invoked it is suspended until the interrupt has been serviced.
On exit from the interrupt handler, control returns to the far address saved on the stack earlier (when the interrupt was invoked), that is, to the instruction following the interrupt instruction (int). An interrupt handler is a resident piece of code (permanently located in memory). As a rule it is a small program, although in the Linux kernel an interrupt handler is not always that small. The interrupt handler is identified by a vector. A vector is nothing more than the address (segment and offset) of the start of the code that handles interrupts with a given index. Working with interrupts differs significantly between the processor's real mode and protected mode (let me remind you that here and below we mean Intel processors and compatibles). In real (unprotected) mode, interrupt handlers are determined by their vectors, which are always stored at the beginning of memory; the required address is selected from the vector table by an index that is also the interrupt number. By overwriting the vector at a given index, you can assign your own handler to the interrupt.

In protected mode, interrupt handlers (gates) are no longer defined through a vector table. Instead, a gate table, or more correctly an interrupt descriptor table, the IDT (Interrupt Descriptor Table), is used. This table is built by the kernel, and its address is stored in the processor's idtr register. This register is not directly accessible; it can be worked with only through the lidt/sidt instructions. The first (lidt) loads idtr with the value given in its operand, which is the base address of the interrupt descriptor table; the second (sidt) stores the table address held in idtr into the specified operand. Just as information about a segment is fetched from a descriptor table by means of a selector, the descriptor of the segment that services an interrupt in protected mode is fetched as well. Memory protection has been supported by Intel processors since the i80286 (not quite in its present form, if only because the 286 was a 16-bit processor, so Linux cannot run on it) and the i80386; the processor performs all the necessary lookups itself, so we will not delve into all the intricacies of protected mode (and Linux does operate in protected mode). Unfortunately, neither time nor space allows us to dwell on the mechanism for handling interrupts in protected mode, and that was not the goal of this article. All the information given here about x86 family processors is rather superficial and is provided only to help you better understand how kernel system calls work. Some things can be learned directly from the kernel code, although to fully understand what is happening it is still advisable to become familiar with the principles of protected mode. The section of code that initializes (but does not load!) the IDT is located in arch/i386/kernel/head.S:

/*
 * setup_idt
 *
 * sets up a idt with 256 entries pointing to
 * ignore_int, interrupt gates. It doesn't actually load
 * idt - that can be done only after paging has been enabled
 * and the kernel moved to PAGE_OFFSET. Interrupts
 * are enabled elsewhere, when we can be relatively
 * sure everything is ok.
 *
 * Warning: %esi is live across this function.
 */
1. setup_idt:
2.     lea ignore_int,%edx
3.     movl $(__KERNEL_CS << 16),%eax
4.     movw %dx,%ax /* selector = 0x0010 = cs */
5.     movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
6.     lea idt_table,%edi
7.     mov $256,%ecx
8. rp_sidt:
9.     movl %eax,(%edi)
10.    movl %edx,4(%edi)
11.    addl $8,%edi
12.    dec %ecx
13.    jne rp_sidt
14. .macro set_early_handler handler,trapno
15.    lea \handler,%edx
16.    movl $(__KERNEL_CS << 16),%eax
17.    movw %dx,%ax
18.    movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
19.    lea idt_table,%edi
20.    movl %eax,8*\trapno(%edi)
21.    movl %edx,8*\trapno+4(%edi)
22. .endm
23.    set_early_handler handler=early_divide_err,trapno=0
24.    set_early_handler handler=early_illegal_opcode,trapno=6
25.    set_early_handler handler=early_protection_fault,trapno=13
26.    set_early_handler handler=early_page_fault,trapno=14
28.    ret

A few notes on the code: it is written in the AT&T dialect of assembler, so knowledge of assembler in its usual Intel notation may only confuse you. The most important difference is the order of the operands. Whereas Intel notation uses the order "accumulator" <- "source" (destination first), AT&T assembler uses the direct order, source first and destination second. Processor registers must, as a rule, carry the "%" prefix, and immediate values (constants) are prefixed with the dollar sign "$". AT&T syntax is traditionally used on Un*x systems.

In the example above, lines 2-4 set the address of the default handler for all interrupts. The default handler is the ignore_int function, which does nothing. Such a stub is needed so that all interrupts are handled correctly at this stage, since there are simply no other handlers yet (traps, however, are installed a little further down in the code; for traps see the Intel architecture reference manual or something similar, we will not touch on them here). Line 5 sets the gate type. In line 6 we load the index register with the address of our IDT. The table contains 256 entries of 8 bytes each. In lines 8-13 we fill the whole table with the same values placed earlier in the eax and edx registers, that is, an interrupt gate referring to the ignore_int handler. Just below, in lines 14-22, we define a macro for setting traps. In lines 23-26, using this macro, we set traps for the following exceptions: early_divide_err - division by zero (0), early_illegal_opcode - unknown processor instruction (6), early_protection_fault - memory protection fault (13), early_page_fault - page translation fault (14). The numbers of the "interrupts" generated when the corresponding abnormal situation occurs are given in parentheses. Before the processor type is checked in arch/i386/kernel/head.S, the IDT is set up by a call to setup_idt:

/*
 * start system 32-bit setup. We need to re-do some of the things done
 * in 16-bit mode for the "real" operations.
 */
1.     call setup_idt
...
2.     call check_x87
3.     lgdt early_gdt_descr
4.     lidt idt_descr

After determining the type of (co)processor and carrying out all the preparatory steps, in lines 3 and 4 we load the GDT and IDT, which will be used during the very first stages of the kernel's operation.

System calls and int 0x80.

From interrupts, let's return to system calls. So, what is needed to serve a process that requests some service? To begin with, we need to move from ring 3 (privilege level CPL=3) to the most privileged level 0 (Ring 0, CPL=0), because the kernel code resides in the segment with the highest privileges. In addition, we need handler code that will service the process. This is exactly what gate 0x80 is for. Although there are quite a few system calls, they all use a single entry point, int 0x80. The handler itself is installed when the function arch/i386/kernel/traps.c::trap_init() is called:

void __init trap_init(void)
{
...
    set_system_gate(SYSCALL_VECTOR,&system_call);
...
}

This is the line in trap_init() that interests us most. Higher up in the same file you can find the code of the set_system_gate() function:

static void __init set_system_gate(unsigned int n, void *addr)
{
    _set_gate(n, DESCTYPE_TRAP | DESCTYPE_DPL3, addr, __KERNEL_CS);
}

Here you can see that the gate for interrupt 0x80 (this is the value defined by the SYSCALL_VECTOR macro, you can take my word for it :)) is installed as a trap with privilege level DPL=3 (Ring 3), i.e. this interrupt is accepted when invoked from user space. The problem of the transition from Ring 3 to Ring 0 is thus solved. The _set_gate() function is defined in the header file include/asm-i386/desc.h. For the especially curious, the code is given below, though without lengthy explanations:

static inline void _set_gate(int gate, unsigned int type, void *addr, unsigned short seg)
{
    __u32 a, b;
    pack_gate(&a, &b, (unsigned long)addr, seg, type, 0);
    write_idt_entry(idt_table, gate, a, b);
}

Let's return to the trap_init() function. It is called from the start_kernel() function in init/main.c.
If you look at the code of trap_init(), you can see that this function rewrites some IDT entries: the handlers used in the early stages of kernel initialization (early_page_fault, early_divide_err, early_illegal_opcode, early_protection_fault) are replaced with those that will be used while the kernel is running. So, we have almost reached the point and already know that all system calls are handled uniformly, through the int 0x80 gate. The system_call() function is installed as the handler for int 0x80, as can again be seen from the fragment of arch/i386/kernel/traps.c::trap_init() above.

system_call().

The code of the system_call() function is located in the file arch/i386/kernel/entry.S and looks like this:

	# system call handler stub
ENTRY(system_call)
	RING0_INT_FRAME			# can't unwind into user space anyway
	pushl %eax			# save orig_eax
	CFI_ADJUST_CFA_OFFSET 4
	SAVE_ALL
	GET_THREAD_INFO(%ebp)
					# system call tracing in operation / emulation
	/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
	testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
	jnz syscall_trace_entry
	cmpl $(nr_syscalls), %eax
	jae syscall_badsys
syscall_call:
	call *sys_call_table(,%eax,4)
	movl %eax,PT_EAX(%esp)		# store the return value
	...

The code is not shown in full. As you can see, system_call() first sets up the stack for work in Ring 0, pushes the value passed in eax onto the stack, saves all registers on the stack, obtains information about the calling thread, and checks that the passed value, the system call number, does not fall outside the system call table. Finally, using the value in eax as an index, system_call() transfers control to the actual system call handler, the one referenced by the table element at that index. Now recall the good old table of interrupt vectors from real mode. Doesn't it remind you of anything? In reality, of course, everything is somewhat more complicated. In particular, the system call must copy results from the kernel stack to the user stack, pass back a return code, and do a few other things. If the value in eax does not refer to an existing system call (it is out of range), control passes to the syscall_badsys label. There, the value -ENOSYS (system call not implemented) is written to the stack at the offset where the eax value should be located. This completes the execution of system_call().

The system call table is located in the file arch/i386/kernel/syscall_table.S and has a fairly simple form:

    ENTRY(sys_call_table)
        .long sys_restart_syscall   /* 0 - old "setup()" system call, used for restarting */
        .long sys_exit
        .long sys_fork
        .long sys_read
        .long sys_write
        .long sys_open              /* 5 */
        .long sys_close
        .long sys_waitpid
        .long sys_creat
        ...

In other words, the entire table is nothing more than an array of function addresses arranged in the order of the system call numbers that these functions serve. The table is an ordinary array of double machine words (or 32-bit words, as you prefer). The code of the functions that service system calls is located partly in the platform-dependent part, arch/i386/kernel/sys_i386.c, and partly in the platform-independent part, kernel/sys.c.
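The range check followed by an indexed call can be imitated in ordinary user-space C. The following is only a sketch of the idea, not kernel code: the names demo_call_table, demo_dispatch and the two toy handlers are invented for illustration, and real handlers take their arguments from the saved registers rather than from a C parameter.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type; the real kernel handlers read their
 * arguments from the register copies saved on the kernel stack. */
typedef long (*syscall_fn)(long arg);

static long demo_exit(long code) { return code; }
static long demo_double(long x)  { return 2 * x; }

/* A toy table: the array index plays the role of the system call number. */
static const syscall_fn demo_call_table[] = {
    demo_exit,    /* 0 */
    demo_double,  /* 1 */
};

#define DEMO_NR_SYSCALLS \
    (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

/* Mirrors the cmpl/jae/call sequence: range check, then indexed call. */
long demo_dispatch(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;               /* the syscall_badsys path */
    return demo_call_table[nr](arg);  /* call *table(,%eax,4) */
}
```

Calling demo_dispatch(1, 21) goes through the table to demo_double, while any number past the end of the table yields -ENOSYS without touching the table at all.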

That is how things stand with system calls and gate 0x80.

New mechanism for handling system calls in Linux. sysenter/sysexit.

As mentioned, it quickly became clear that the traditional method of processing system calls through gate 0x80 causes a loss of performance on Intel Pentium 4 processors. Linus Torvalds therefore implemented a new mechanism in the kernel, based on the sysenter/sysexit instructions and designed to improve kernel performance on machines equipped with a Pentium II or later processor (Intel processors support the sysenter/sysexit instructions starting with the Pentium II). What is the essence of the new mechanism? Oddly enough, the essence remains the same; only the execution has changed. According to the Intel documentation, the sysenter instruction is part of the "fast system call" mechanism. This instruction is optimized for a quick move from one privilege level to another; more precisely, it speeds up the transition to ring 0 (CPL=0). The operating system must prepare the processor for the use of sysenter, and this setup is done once, when the OS kernel is loaded and initialized. When executed, sysenter sets the processor registers according to the model-specific registers (MSRs) previously written by the OS. In particular, it loads the code segment register and the instruction pointer (cs:eip) as well as the stack segment and the stack pointer (ss:esp). The jump to the new code segment and offset takes the processor from ring 3 to ring 0.

The sysexit instruction does the opposite: it makes a fast transition from privilege level 0 to privilege level 3 (CPL=3). The code segment register is loaded with 16 plus the cs value stored in the processor's model-specific register, and eip receives the contents of the edx register. The ss register is loaded with 24 plus that same cs value, which the OS wrote to the model-specific register earlier while preparing the context for the sysenter instruction, and esp receives the contents of the ecx register. The values needed by the sysenter/sysexit instructions are stored at the following addresses:

SYSENTER_CS_MSR 0x174 - selector of the code segment that contains the system call handler code.
SYSENTER_ESP_MSR 0x175 - pointer to the top of the stack for the system call handler.
SYSENTER_EIP_MSR 0x176 - offset within the code segment; points to the beginning of the system call handler code.
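The selector arithmetic described above can be written down as a small C sketch. The +16 and +24 offsets come from the text; the +8 offset for the ring 0 stack segment and the handling of the two low privilege bits follow the Intel instruction reference as I understand it. The names sep_selectors and sep_derive are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Selectors implied by a single SYSENTER_CS_MSR value.
 * The low two bits of a selector hold the requested privilege level. */
struct sep_selectors {
    uint16_t enter_cs;  /* ring 0 code segment, loaded by sysenter */
    uint16_t enter_ss;  /* ring 0 stack segment, loaded by sysenter */
    uint16_t exit_cs;   /* ring 3 code segment, loaded by sysexit */
    uint16_t exit_ss;   /* ring 3 stack segment, loaded by sysexit */
};

struct sep_selectors sep_derive(uint16_t sysenter_cs_msr)
{
    struct sep_selectors s;
    uint16_t base = sysenter_cs_msr & 0xFFFC;  /* strip the privilege bits */

    /* A null selector in the MSR makes sysenter fault instead of entering. */
    assert(base != 0);

    s.enter_cs = base;
    s.enter_ss = (uint16_t)(base + 8);
    s.exit_cs  = (uint16_t)((base + 16) | 3);  /* privilege level 3 */
    s.exit_ss  = (uint16_t)((base + 24) | 3);
    return s;
}
```

Assuming __KERNEL_CS is 0x60 on this kernel (an assumption, not a value taken from the text), sep_derive(0x60) gives 0x73 and 0x7b for the ring 3 selectors, which should correspond to the __USER_CS and __USER_DS values that sysenter_entry pushes on the stack.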

These addresses refer to model-specific registers that do not have names. Values are written to model-specific registers with the wrmsr instruction: edx:eax must contain the high and low parts of the 64-bit value, respectively, and ecx must contain the address of the register to be written. In Linux, the addresses of the model-specific registers are defined in the header file include/asm-i386/msr-index.h as follows (before version 2.6.22, at least, they were defined in the header file include/asm-i386/msr.h; let me remind you that we are examining the system call mechanism using the Linux kernel 2.6.22 as an example):

    #define MSR_IA32_SYSENTER_CS    0x00000174
    #define MSR_IA32_SYSENTER_ESP   0x00000175
    #define MSR_IA32_SYSENTER_EIP   0x00000176

The kernel code responsible for setting the model-specific registers is located in the file arch/i386/kernel/sysenter.c and looks like this:

    void enable_sep_cpu(void)
    {
        int cpu = get_cpu();
        struct tss_struct *tss = &per_cpu(init_tss, cpu);

        if (!boot_cpu_has(X86_FEATURE_SEP)) {
            put_cpu();
            return;
        }

        tss->x86_tss.ss1 = __KERNEL_CS;
        tss->x86_tss.esp1 = sizeof(struct tss_struct) + (unsigned long) tss;
        wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
        wrmsr(MSR_IA32_SYSENTER_ESP, tss->x86_tss.esp1, 0);
        wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
        put_cpu();
    }

Here the tss variable receives the address of the structure that describes the task state segment. The TSS (Task State Segment) describes the task context and is part of the hardware multitasking mechanism of the x86 architecture. Linux, however, makes practically no use of hardware task switching. According to the Intel documentation, a switch to another task is performed either by an intersegment jump instruction (jmp or call) that references a TSS segment or by one that references a task gate descriptor in the GDT (LDT).

A special processor register that is invisible to the programmer, TR (Task Register), contains the task descriptor selector. Loading this register also loads the software-invisible base and limit registers associated with TR.
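The edx:eax convention of wrmsr mentioned above amounts to splitting a 64-bit value into two 32-bit halves. The sketch below only shows that arithmetic in portable C; it does not execute wrmsr (which is a privileged instruction), and the names msr_halves, msr_split and msr_join are hypothetical.

```c
#include <stdint.h>

/* The two 32-bit halves that wrmsr expects in eax (low) and edx (high). */
struct msr_halves {
    uint32_t eax;
    uint32_t edx;
};

/* Split a 64-bit MSR value into the halves passed in eax and edx. */
struct msr_halves msr_split(uint64_t value)
{
    struct msr_halves h;
    h.eax = (uint32_t)(value & 0xFFFFFFFFu);
    h.edx = (uint32_t)(value >> 32);
    return h;
}

/* Reassemble the value, as rdmsr would return it in edx:eax. */
uint64_t msr_join(struct msr_halves h)
{
    return ((uint64_t)h.edx << 32) | h.eax;
}
```

In enable_sep_cpu() the third argument of each wrmsr() call is 0, which matches this picture: the selector, the stack pointer and the entry address all fit into the low half.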

Even though Linux does not use hardware context switching, the kernel is forced to allocate a TSS entry for each processor installed in the system. This is because when the processor switches from user mode to kernel mode, it retrieves the kernel stack address from the TSS. In addition, the TSS is needed to control access to I/O ports: it contains a map of port access rights, and this map makes it possible to control, for each process, access to ports through the in/out instructions. Here tss->x86_tss.esp1 points to the kernel stack, and __KERNEL_CS naturally refers to the kernel code segment. The offset (eip) is the address of the sysenter_entry() function.
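The map of port access rights is a plain bitmap with one bit per port. As a rough illustration, assuming the usual x86 convention that a set bit denies access and that ports beyond the end of the map are denied, a lookup could look like this. The function name io_port_allowed is invented here, and the sketch covers only single-byte port accesses.

```c
#include <stddef.h>

/* Returns 1 when the bitmap permits access to `port`, 0 otherwise.
 * A set bit means "denied"; ports past the end of the map are denied. */
int io_port_allowed(const unsigned char *bitmap, size_t bitmap_bytes,
                    unsigned int port)
{
    size_t byte = port / 8;

    if (byte >= bitmap_bytes)
        return 0;
    return ((bitmap[byte] >> (port % 8)) & 1u) == 0;
}
```

The processor consults the map only for code that runs with insufficient I/O privilege, so the kernel itself is not restricted by it.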

The sysenter_entry() function is defined in the file arch/i386/kernel/entry.S and looks like this:

    /* SYSENTER_RETURN points to after the "sysenter" instruction in
       the vsyscall page. See vsyscall-sysentry.S, which defines the symbol. */

        # sysenter call handler stub
    ENTRY(sysenter_entry)
        CFI_STARTPROC simple
        CFI_SIGNAL_FRAME
        CFI_DEF_CFA esp, 0
        CFI_REGISTER esp, ebp
        movl TSS_sysenter_esp0(%esp),%esp
    sysenter_past_esp:
        /*
         * No need to follow this irqs on/off section: the syscall
         * disabled irqs and here we enable it straight after entry:
         */
        ENABLE_INTERRUPTS(CLBR_NONE)
        pushl $(__USER_DS)
        CFI_ADJUST_CFA_OFFSET 4
        /*CFI_REL_OFFSET ss, 0*/
        pushl %ebp
        CFI_ADJUST_CFA_OFFSET 4
        CFI_REL_OFFSET esp, 0
        pushfl
        CFI_ADJUST_CFA_OFFSET 4
        pushl $(__USER_CS)
        CFI_ADJUST_CFA_OFFSET 4
        /*CFI_REL_OFFSET cs, 0*/
        /*
         * Push current_thread_info()->sysenter_return to the stack.
         * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
         * pushed above; +8 corresponds to copy_thread's esp0 setting.
         */
        pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
        CFI_ADJUST_CFA_OFFSET 4
        CFI_REL_OFFSET eip, 0

        /*
         * Load the potential sixth argument from user stack.
         * Careful about security.
         */
        cmpl $__PAGE_OFFSET-3,%ebp
        jae syscall_fault
    1:  movl (%ebp),%ebp
    .section __ex_table,"a"
        .align 4
        .long 1b,syscall_fault
    .previous

        pushl %eax
        CFI_ADJUST_CFA_OFFSET 4
        SAVE_ALL
        GET_THREAD_INFO(%ebp)

        /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
        call *sys_call_table(,%eax,4)
        movl %eax,PT_EAX(%esp)
        DISABLE_INTERRUPTS(CLBR_ANY)
        TRACE_IRQS_OFF
        movl TI_flags(%ebp), %ecx
        testw $_TIF_ALLWORK_MASK, %cx
        jne syscall_exit_work
        /* if something modifies registers it must also disable sysexit */
        movl PT_EIP(%esp), %edx
        movl PT_OLDESP(%esp), %ecx
        xorl %ebp,%ebp
        TRACE_IRQS_ON
    1:  mov PT_FS(%esp), %fs
        ENABLE_INTERRUPTS_SYSEXIT
        CFI_ENDPROC
    .pushsection .fixup,"ax"
    2:  movl $0,PT_FS(%esp)
        jmp 1b
    .section __ex_table,"a"
        .align 4
        .long 1b,2b
    .popsection
    ENDPROC(sysenter_entry)

As with system_call(), most of the work is done in the line call *sys_call_table(,%eax,4). This is where the specific system call handler is called. So it is clear that little has changed fundamentally. The fact that the interrupt vector is now built into the hardware, and that the processor helps us move faster from one privilege level to another, changes only some details of the execution while the content stays the same. True, the changes do not end there. Remember how the story began: at the very beginning I mentioned virtual shared objects. Earlier, the implementation of a system call in, say, the libc system library looked like an interrupt call (even though the library took over some functions to reduce the number of context switches). Now, thanks to the VDSO, a system call can be made almost directly, without libc. It could be made directly before as well, again as an interrupt, but now the call can be requested as a regular function exported from a dynamically linked library (DSO).
At boot, the kernel determines which mechanism should and can be used on the given platform. Depending on the circumstances, the kernel sets the entry point of the function that performs the system call. This function is then exported to user space as the linux-gate.so.1 library. The linux-gate.so.1 library does not physically exist on disk. It is, so to speak, emulated by the kernel and exists exactly as long as the system is running. If you stop the system and mount its root file system from another system, you will not find this file there. In fact, you will not find it even on a running system: physically it simply does not exist. That is why linux-gate.so.1 is nothing other than a VDSO, i.e. a Virtual Dynamically Shared Object. The kernel maps the dynamic library emulated in this way into the address space of each process. You can easily verify this by running the following command:

    f0x@devel0:~$ cat /proc/self/maps
    08048000-0804c000 r-xp 00000000 08:01 46 /bin/cat
    0804c000-0804d000 rw-p 00003000 08:01 46 /bin/cat
    0804d000-0806e000 rw-p 0804d000 00:00 0
    ...
    b7fdf000-b7fe1000 rw-p 00019000 08:01 2066 /lib/ld-2.5.so
    bffd2000-bffe8000 rw-p bffd2000 00:00 0
    ffffe000-fffff000 r-xp 00000000 00:00 0

Here the very last line is the object we are interested in:

    ffffe000-fffff000 r-xp 00000000 00:00 0

From this example it is clear that the object occupies exactly one page of memory, 4096 bytes, almost at the very end of the address space. Let's do another experiment:

    f0x@devel0:~$ ldd `which cat`
    linux-gate.so.1 => (0xffffe000)
    libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7e87000)
    /lib/ld-linux.so.2 (0xb7fdf000)
    f0x@devel0:~$ ldd `which gcc`
    linux-gate.so.1 => (0xffffe000)
    libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7e3c000)
    /lib/ld-linux.so.2 (0xb7f94000)
    f0x@devel0:~$

Here we simply took two applications at random. It can be seen that the library is mapped into the process address space at the same fixed address, 0xffffe000. Now let's try to see what is actually stored on this memory page...
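The claim that the object occupies exactly one page can be checked mechanically from a line of /proc/self/maps. The helper below is a small sketch with an invented name, maps_line_size; it looks only at the "start-end" prefix of a line and ignores the remaining fields.

```c
#include <stdio.h>

/* Parses the "start-end" prefix of one /proc/<pid>/maps line.
 * Returns the mapping size in bytes, or 0 if the line is malformed. */
unsigned long long maps_line_size(const char *line)
{
    unsigned long long start, end;

    if (sscanf(line, "%llx-%llx", &start, &end) != 2 || end < start)
        return 0;
    return end - start;
}
```

For the line ffffe000-fffff000 r-xp 00000000 00:00 0 the difference is 0x1000, that is, 4096 bytes.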

You can dump the memory page where the VDSO shared code is stored using the following program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        /* address at which the VDSO page was mapped in the examples above */
        char *vdso = (char *) 0xffffe000;
        char *buffer;
        FILE *f;

        buffer = malloc(4096);
        if (!buffer)
            exit(1);

        memcpy(buffer, vdso, 4096);

        if (!(f = fopen("test.dump", "w+b"))) {
            free(buffer);
            exit(1);
        }

        fwrite(buffer, 4096, 1, f);
        fclose(f);
        free(buffer);

        return 0;
    }

Strictly speaking, this used to be easier to do with the command dd if=/proc/self/mem of=test.dump bs=4096 skip=1048574 count=1, but kernels since version 2.6.22, or perhaps even earlier, no longer expose process memory through /proc/`pid`/mem in this way. The file has apparently been kept for compatibility, but it no longer provides this information.
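The skip=1048574 value in the dd command is simply the index of the VDSO page when the address space is counted in 4096-byte blocks. A one-line sketch of that arithmetic, with the hypothetical names DEMO_PAGE_SIZE and page_index:

```c
#define DEMO_PAGE_SIZE 4096UL

/* Index of the page that contains `address`, counting pages from zero. */
unsigned long page_index(unsigned long address)
{
    return address / DEMO_PAGE_SIZE;
}
```

For 0xffffe000 this gives 0xffffe, which is 1048574 in decimal, the number of blocks dd has to skip before copying the page.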

Let's compile and run this program, and then try to disassemble the resulting dump:

    f0x@devel0:~/tmp$ objdump --disassemble ./test.dump

    ./test.dump:     file format elf32-i386

    Disassembly of section .text:

    ffffe400 <__kernel_vsyscall>:
    ffffe400:   51          push   %ecx
    ffffe401:   52          push   %edx
    ffffe402:   55          push   %ebp
    ffffe403:   89 e5       mov    %esp,%ebp
    ffffe405:   0f 34       sysenter
    ...
    ffffe40e:   eb f3       jmp    ffffe403 <__kernel_vsyscall+0x3>
    ffffe410:   5d          pop    %ebp
    ffffe411:   5a          pop    %edx
    ffffe412:   59          pop    %ecx
    ffffe413:   c3          ret
    ...
    f0x@devel0:~/tmp$

Here it is, our gateway for system calls, in full view. A process (or the libc system library) that calls the __kernel_vsyscall function ends up at address 0xffffe400 (in our case). Next, __kernel_vsyscall saves the contents of the ecx, edx and ebp registers on the stack of the user process. We have already discussed the purpose of the ecx and edx registers; ebp is used later to restore the user stack. Then the sysenter instruction is executed, the "interrupt" is taken and, as a result, control passes to sysenter_entry (see above). The jmp instruction at 0xffffe40e is inserted to restart a system call with 6 arguments (see http://lkml.org/lkml/2002/12/18/). The code placed on the page is in the file arch/i386/kernel/vsyscall-sysenter.S (or arch/i386/kernel/vsyscall-int80.S for the 0x80 trap). Although in my experiments the address of the __kernel_vsyscall function was constant, some hold that this is not guaranteed. The usual way to find the entry point of __kernel_vsyscall() is to look up the AT_SYSINFO entry in the ELF auxiliary vector (auxv). The auxiliary vector is passed to the process on the stack at startup and contains various information needed while the program is running. It is placed on the stack alongside the program arguments and the environment variables.
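Looking up AT_SYSINFO comes down to walking an array of (type, value) pairs until the terminating entry. The sketch below works on an array supplied by the caller rather than on the real process stack. The names demo_auxv and demo_auxv_lookup are invented, and the constants are defined locally with the values I believe <elf.h> uses for AT_NULL and AT_SYSINFO.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_AT_NULL     0   /* terminating entry of the vector */
#define DEMO_AT_SYSINFO 32   /* assumed to match AT_SYSINFO in <elf.h> */

/* One auxiliary vector entry: a type tag and its value. */
struct demo_auxv {
    unsigned long type;
    unsigned long value;
};

/* Walks the vector until the terminator. Returns 1 and stores the
 * value in *out when an entry of the requested type is found. */
int demo_auxv_lookup(const struct demo_auxv *vec, unsigned long type,
                     unsigned long *out)
{
    assert(vec != NULL && out != NULL);

    for (; vec->type != DEMO_AT_NULL; vec++) {
        if (vec->type == type) {
            *out = vec->value;
            return 1;
        }
    }
    return 0;
}
```

On systems whose C library provides getauxval(), the same lookup on the real vector can be done with that function instead of scanning the stack by hand.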

Here is a small example in C of how you can call the __kernel_vsyscall function directly:

    #include <stdio.h>

    int pid;

    int main()
    {
        __asm("movl $20, %eax    \n"
              "call *%gs:0x10    \n"
              "movl %eax, pid    \n");

        printf("pid: %d\n", pid);

        return 0;
    }

This example is taken from Manu Garg's page, http://www.manugarg.com. In it we make the getpid() system call (number 20, otherwise known as __NR_getpid). To avoid searching the process stack for the AT_SYSINFO entry, we take advantage of the fact that the libc.so system library copies the AT_SYSINFO value into the Thread Control Block (TCB) when it is loaded. This block of information is typically referenced through the selector in gs. We assume that the required value is located at offset 0x10 and make a call to the address stored at %gs:0x10.

Results.

In practice, it is not always possible to achieve a significant performance increase even when the platform supports FSCF (Fast System Call Facility). The problem is that, one way or another, a process rarely accesses the kernel directly, and there are good reasons for this. Using the libc library guarantees program portability regardless of the kernel version, and it is through the standard system library that most system calls go. Even if you build and install the latest kernel compiled for a platform that supports FSCF, this does not guarantee a performance gain. The fact is that your system library libc.so will continue to use int 0x80, and the only way to deal with this is to rebuild glibc. Whether glibc supports the VDSO interface and __kernel_vsyscall at all, I honestly find it difficult to say at the moment.

Links.

Manu Garg's page, http://www.manugarg.com
Scatter/Gather thoughts by Johan Petersson, http://www.trilithium.com/johan/2005/08/linux-gate/
Good old Understanding the Linux Kernel. Where would we be without it :)
And, of course, the Linux source code (2.6.22)

System calls

So far, all the programs we've made have had to use well-defined kernel mechanisms to register /proc files and device drivers. This is great if you want to do something already provided by the kernel programmers, such as write a device driver. But what if you want to do something unusual, change the behavior of the system in some way?

This is where kernel programming gets dangerous. While writing the example below, I broke the open system call. This meant that I could not open any files, could not run any programs, and could not shut the system down with the shutdown command. I had to turn off the power to stop it. Fortunately, no files were destroyed. To make sure you do not lose files either, please run sync before you issue the insmod and rmmod commands.

Forget about /proc files and device files. They are just minor details. The real mechanism of communication with the kernel, used by all processes, is system calls. When a process requests a service from the kernel (such as opening a file, starting a new process, or requesting more memory), this is the mechanism that is used. If you want to change kernel behavior in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run strace followed by the command and its arguments.

In general, a process is not able to access the kernel. It cannot access kernel memory and cannot call kernel functions. The CPU hardware enforces this state of affairs (there is a reason it is called `protected mode'). System calls are the exception to this general rule. The process fills the registers with the appropriate values and then executes a special instruction that jumps to a predefined location in the kernel (that location is, of course, readable by user processes but not writable by them). On Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to that location, you are no longer running in restricted user mode. Instead, you are running as the operating system kernel, and therefore you are allowed to do whatever you want.

The location in the kernel that a process can jump to is called system_call. The procedure located there checks the system call number, which tells the kernel what exactly the process wants. It then looks in the system call table (sys_call_table) to find the address of the kernel function to call. The desired function is then called, and after it returns a value, several system checks are made. The result is then returned to the process (or to another process, if the process has terminated). If you want to see the code that does all this, it is in the source file arch/<architecture>/kernel/entry.S, after the line ENTRY(system_call).

So, if we want to change how some system call works, the first thing we have to do is write our own function that does the appropriate thing (usually by adding a bit of our own code and then calling the original function), and then change the pointer in sys_call_table to point to our function. Since our module may be removed later and we do not want to leave the system in an inconsistent state, it is important for cleanup_module to restore the table to its original state.

The source code given here is an example of such a module. We want to "spy" on a certain user, and send a message via printk whenever that user opens a file. We replace the file open system call with our own function called our_sys_open. This function checks the uid (user id) of the current process, and if it is equal to the uid we are spying on, calls printk to display the name of the file that will be opened. Then it calls the original open function with the same parameters, actually opening the file.

The init_module function changes the corresponding location in sys_call_table and stores the original pointer in a variable. The cleanup_module function uses this variable to restore everything back to normal. This approach is dangerous because two modules may modify the same system call. Imagine that we have two modules, A and B. Let's call module A's open system call A_open and module B's B_open. When A is inserted into the kernel, the system call is replaced with A_open, which calls the original sys_open once it has done what it needs to do. Then B is inserted into the kernel and replaces the system call with B_open, which calls what it thinks is the original system call but is actually A_open.

Now, if B is removed first, everything will be fine: it will simply restore the system call to A_open, which calls the original. However, if A is removed and then B is removed, the system will crash. Removing A restores the system call to the original, sys_open, cutting B out of the loop. Then, when B is removed, it restores the system call to what it thinks is the original. The call is in fact directed to A_open, which is no longer in memory. At first glance it seems that we could solve this particular problem by checking whether the system call is equal to our open function and, if it is not, leaving the value of that call unchanged (so that B does not change the system call when it is removed), but that would cause an even worse problem. When A is removed, it sees that the system call has been changed to B_open and no longer points to A_open, so it does not restore the pointer to sys_open before being removed from memory. Unfortunately, B_open will still try to call A_open, which is no longer in memory, so the system will crash even without removing B.
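The two removal orders can be replayed in user space with a single function pointer standing in for the table slot. This is only a model of the bookkeeping, with invented names (sys_open_demo, a_open, b_open, slot_after_unload); nothing is actually loaded or unloaded, and a slot that ends up pointing at a_open represents a pointer into code that would no longer be in memory.

```c
#include <stddef.h>

typedef int (*open_fn)(void);

static int sys_open_demo(void) { return 0; }  /* the original handler */
static int a_open(void)        { return 1; }  /* module A's replacement */
static int b_open(void)        { return 2; }  /* module B's replacement */

/* Loads A, then B, then unloads them starting with the given module
 * ('A' or 'B'). Returns whatever is left in the slot at the end. */
open_fn slot_after_unload(char first_removed)
{
    open_fn slot = sys_open_demo;
    open_fn a_saved, b_saved;

    a_saved = slot;  slot = a_open;   /* insmod A: saves sys_open */
    b_saved = slot;  slot = b_open;   /* insmod B: saves a_open */

    if (first_removed == 'B') {
        slot = b_saved;               /* rmmod B: back to a_open */
        slot = a_saved;               /* rmmod A: back to sys_open */
    } else {
        slot = a_saved;               /* rmmod A: sys_open, B is cut out */
        slot = b_saved;               /* rmmod B: a_open, already unloaded */
    }
    return slot;
}

/* 1 if the slot ends up referring to module A after both are gone. */
int slot_is_dangling(char first_removed)
{
    return slot_after_unload(first_removed) == a_open;
}
```

Removing B first leaves the original handler in the slot, while removing A first leaves the slot pointing at A's function after A is gone, which is the crash scenario described above.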

I see two ways to prevent this problem. First: restore access to the original value of sys_open. Unfortunately, sys_open is not part of the kernel table in /proc/ksyms, so we can't access it. Another solution is to use a reference counter to prevent the module from being unloaded. This is good for regular modules, but bad for "educational" modules.

    /* syscall.c
     *
     * System call "stealing" sample
     */

    /* Copyright (C) 1998-99 by Ori Pomerantz */

    /* The necessary header files */

    /* Standard in kernel modules */
    #include <linux/kernel.h>   /* We're doing kernel work */
    #include <linux/module.h>   /* Specifically, a module */

    /* Deal with CONFIG_MODVERSIONS */
    #if CONFIG_MODVERSIONS==1
    #define MODVERSIONS
    #include <linux/modversions.h>
    #endif

    #include <sys/syscall.h>    /* The list of system calls */

    /* For the current (process) structure, we need
     * this to know who the current user is. */
    #include <linux/sched.h>

    /* In 2.2.3 /usr/include/linux/version.h includes a
     * macro for this, but 2.0.35 doesn't - so I add it
     * here if necessary. */
    #ifndef KERNEL_VERSION
    #define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
    #endif

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    #include <asm/uaccess.h>
    #endif

    /* The system call table (a table of functions). We
     * just define this as external, and the kernel will
     * fill it up for us when we are insmod'ed */
    extern void *sys_call_table[];

    /* UID we want to spy on - will be filled from the
     * command line */
    int uid;

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    MODULE_PARM(uid, "i");
    #endif

    /* A pointer to the original system call. The reason
     * we keep this, rather than call the original function
     * (sys_open), is because somebody else might have
     * replaced the system call before us. Note that this
     * is not 100% safe, because if another module
     * replaced sys_open before us, then when we're inserted
     * we'll call the function in that module - and it
     * might be removed before we are.
     *
     * Another reason for this is that we can't get sys_open.
     * It's a static variable, so it is not exported. */
    asmlinkage int (*original_call)(const char *, int, int);

    /* For some reason, in 2.2.3 current->uid gave me
     * zero, not the real user ID. I tried to find what went
     * wrong, but I couldn't do it in a short time, and
     * I'm lazy - so I'll just use the system call to get the
     * uid, the way a process would.
     *
     * For some reason, after I recompiled the kernel this
     * problem went away.
     */
    asmlinkage int (*getuid_call)();

    /* The function we'll replace sys_open (the function
     * called when you call the open system call) with. To
     * find the exact prototype, with the number and type
     * of arguments, we find the original function first
     * (it's at fs/open.c).
     *
     * In theory, this means that we're tied to the
     * current version of the kernel. In practice, the
     * system calls almost never change (it would wreak havoc
     * and require programs to be recompiled, since the system
     * calls are the interface between the kernel and the
     * processes). */
    asmlinkage int our_sys_open(const char *filename, int flags, int mode)
    {
        int i = 0;
        char ch;

        /* Check if this is the user we're spying on */
        if (uid == getuid_call()) {
            /* getuid_call is the getuid system call,
             * which gives the uid of the user who
             * ran the process which called the system
             * call we got */

            /* Report the file, if relevant */
            printk("Opened file by %d: ", uid);
            do {
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
                get_user(ch, filename+i);
    #else
                ch = get_user(filename+i);
    #endif
                i++;
                printk("%c", ch);
            } while (ch != 0);
            printk("\n");
        }

        /* Call the original sys_open - otherwise, we lose
         * the ability to open files */
        return original_call(filename, flags, mode);
    }

    /* Initialize the module - replace the system call */
    int init_module()
    {
        /* Warning - too late for it now, but maybe for
         * next time... */
        printk("I'm dangerous. I hope you did a ");
        printk("sync before you insmod'ed me.\n");
        printk("My counterpart, cleanup_module(), is even ");
        printk("more dangerous. If\n");
        printk("you value your file system, it will ");
        printk("be \"sync; rmmod\" \n");
        printk("when you remove this module.\n");

        /* Keep a pointer to the original function in
         * original_call, and then replace the system call
         * in the system call table with our_sys_open */
        original_call = sys_call_table[__NR_open];
        sys_call_table[__NR_open] = our_sys_open;

        /* To get the address of the function for system
         * call foo, go to sys_call_table[__NR_foo]. */

        printk("Spying on UID:%d\n", uid);

        /* Get the system call for getuid */
        getuid_call = sys_call_table[__NR_getuid];

        return 0;
    }

    /* Cleanup - restore the original system call */
    void cleanup_module()
    {
        /* Return the system call back to normal */
        if (sys_call_table[__NR_open] != our_sys_open) {
            printk("Somebody else also played with the ");
            printk("open system call\n");
            printk("The system may be left in ");
            printk("an unstable state.\n");
        }

        sys_call_table[__NR_open] = original_call;
    }

This material is a modification of the article of the same name by Vladimir Meshkov, published in the magazine "System Administrator"

This material is a copy of articles by Vladimir Meshkov from the magazine "System Administrator". The articles can be found at the links below. Some of the example program sources were also changed: improved and completed. (Example 4.2 was modified considerably, since we had to intercept a slightly different system call.) URLs: http://www.samag.ru/img/uploaded/p.pdf http://www.samag.ru/img/uploaded/a3.pdf


1. General view of Linux architecture
2. Loadable kernel module
3. System call interception algorithm based on LKM
4. Examples of intercepting system calls based on LKM
- 4.1 Prohibition of directory creation

1. General view of Linux architecture

In the most general view, the system is a two-level model: kernel <=> progs. At the center (on the left) is the system kernel. The kernel interacts directly with the computer hardware and isolates application programs from the details of the architecture. The kernel offers a set of services to application programs. These services include I/O operations (opening, reading, writing, and managing files), the creation and management of processes, their synchronization, and interprocess communication. All applications request kernel services through system calls.

The second level consists of applications or tasks, both system ones, which determine the functionality of the system, and application ones, which provide the Linux user interface. However, despite the external heterogeneity of the applications, the interaction schemes with the kernel are the same.

Interaction with the kernel occurs through a standard system call interface. The system call interface represents a set of kernel services and defines the format of service requests. A process requests a service through a system call to a specific kernel procedure, similar in appearance to a regular library function call. The kernel, on behalf of the process, executes the request and returns the necessary data to the process.

In the example below, the program opens a file, reads data from it, and closes the file. The operations of opening (open), reading (read) and closing (close) the file are performed by the kernel at the request of the task, and the open(2), read(2) and close(2) functions are system calls.

    /* Source 1.0 */
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        int fd;
        char buf[80];

        /* Open the file - get a link (file descriptor) fd */
        fd = open("file1", O_RDONLY);

        /* Read 80 characters into the buffer buf */
        read(fd, buf, sizeof(buf));

        /* Close the file */
        close(fd);

        return 0;
    }
    /* EOF */

A complete list of Linux system calls can be found in the file /usr/include/asm/unistd.h. Let's now look at the mechanism for executing system calls using this example. The call to the open() function is ultimately turned into machine code that loads the system call number corresponding to this function, together with its parameters, into the processor registers and then invokes interrupt 0x80. The following values are loaded into the processor registers:

into the EAX register - the number of the system call. In our case the system call number is 5 (see __NR_open).
into the EBX register - the first parameter of the function (for open() this is a pointer to a string containing the name of the file to be opened).
into the ECX register - the second parameter (the file access mode flags).

The third parameter is loaded into the EDX register; in this case, we do not have it. To perform a system call in OS Linux, use the system_call function, which is defined (depending on the architecture, in this case i386) in the file /usr/src/linux/arch/i386/kernel/entry.S. This function is the entry point for all system calls. The kernel responds to interrupt 0x80 by calling the system_call function, which is essentially a handler for interrupt 0x80.
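The register assignment for open() can be pictured as filling a small structure before the interrupt is raised. The following is a user-space sketch only: it does not execute int 0x80, the structure demo_regs and the function demo_marshal_open are invented for illustration, and uintptr_t is used instead of 32-bit registers so that the sketch also compiles on 64-bit machines.

```c
#include <stdint.h>

#define DEMO_NR_OPEN 5   /* __NR_open on i386, as cited in the text */

/* Stand-ins for the four registers involved in the call. */
struct demo_regs {
    uintptr_t eax;  /* system call number */
    uintptr_t ebx;  /* first parameter: pointer to the file name */
    uintptr_t ecx;  /* second parameter: access mode flags */
    uintptr_t edx;  /* third parameter: mode, unused without O_CREAT */
};

/* Lays the arguments of open() out the way the text describes. */
struct demo_regs demo_marshal_open(const char *path, int flags, int mode)
{
    struct demo_regs r;

    r.eax = DEMO_NR_OPEN;
    r.ebx = (uintptr_t)path;
    r.ecx = (uintptr_t)flags;
    r.edx = (uintptr_t)mode;
    return r;
}
```

For open("file1", O_RDONLY) this yields eax = 5, ebx pointing at the string "file1", and ecx = 0, since O_RDONLY is zero.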

To make sure we're on the right track, let's look at the code for the open() function in the libc system library:

    # gdb -q /lib/libc.so.6
    (gdb) disas open
    Dump of assembler code for function open:
    0x000c8080 <open+0>:    call   0x1082be <__i686.get_pc_thunk.cx>
    0x000c8085 <open+5>:    add    $0x6423b,%ecx
    0x000c808b <open+11>:   cmpl   $0x0,0x1a84(%ecx)
    0x000c8092 <open+18>:   jne    0xc80b1 <open+49>
    0x000c8094 <open+20>:   push   %ebx
    0x000c8095 <open+21>:   mov    0x10(%esp,1),%edx
    0x000c8099 <open+25>:   mov    0xc(%esp,1),%ecx
    0x000c809d <open+29>:   mov    0x8(%esp,1),%ebx
    0x000c80a1 <open+33>:   mov    $0x5,%eax
    0x000c80a6 <open+38>:   int    $0x80
    ...

As is easy to see from the last lines, the parameters are loaded into the EDX, ECX and EBX registers, and finally the system call number, which as we already know is 5, is placed in the EAX register.

Now let's get back to the system call mechanism. So, the kernel calls the handler of interrupt 0x80, the system_call function. system_call places copies of the registers that contain the call parameters on the stack by means of the SAVE_ALL macro and then invokes the required system function with the call instruction. The table of pointers to the kernel functions that implement system calls is located in the sys_call_table array (see the file arch/i386/kernel/entry.S). The system call number, which is held in the EAX register, is an index into this array. Thus, if EAX contains the value 5, the kernel function sys_open() will be called. Why is the SAVE_ALL macro needed? The explanation is very simple. Since almost all kernel system functions are written in C, they look for their parameters on the stack, and it is SAVE_ALL that pushes the parameters onto the stack. The return value of the system call is stored in the EAX register.

Now let's figure out how to intercept a system call. The mechanism of loadable kernel modules will help us with this.

2. Loadable kernel module

A loadable kernel module (commonly abbreviated LKM, Loadable Kernel Module) is program code executed in kernel space. The main feature of an LKM is that it can be loaded and unloaded dynamically, without rebooting the entire system or recompiling the kernel.

Each LKM contains at least two functions:

the module initialization function, called when the LKM is loaded into memory: int init_module(void) { ... }
the module unload function: void cleanup_module(void) { ... }

Here's an example of a simple module:

    /* Source 2.0 */
    #include <linux/module.h>

    int init_module(void)
    {
        printk("Hello World\n");
        return 0;
    }

    void cleanup_module(void)
    {
        printk("Bye\n");
    }
    /* EOF */

Compile and load the module. A module is loaded into memory with the insmod command, and the loaded modules are listed with the lsmod command:

    # gcc -c -DMODULE -I/usr/src/linux/include/ src-2.0.c
    # insmod src-2.0.o
    Warning: loading src-2.0.o will taint the kernel: no license
    Module src-2.0 loaded, with warnings
    # dmesg | tail -n 1
    Hello World
    # lsmod | grep src
    src-2.0 336 0 (unused)
    # rmmod src-2.0
    # dmesg | tail -n 1
    Bye

3. System call interception algorithm based on LKM

To implement a module that intercepts a system call, it is necessary to define an interception algorithm. The algorithm is as follows:

save a pointer to the original call so that it can be restored later
create a function that implements the new system call
replace the entry in the system call table sys_call_table, i.e. point it at the new system call
when the work is done (on module unload), restore the original system call from the saved pointer
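The four steps can be rehearsed in user space before touching a real kernel. The following is a minimal sketch, assuming a toy table and toy handlers; all toy_ names are hypothetical and nothing here runs in kernel space. The replacement handler reports success without doing any work, and a counter shows whether the original handler was reached.

```c
#include <stddef.h>

#define TOY_NR_MKDIR 39   /* the mkdir number used later in the article */
#define TOY_NR_MAX   64

typedef int (*toy_handler)(const char *path);

static int toy_orig_calls;          /* how many times the original ran */

/* Toy stand-in for the original sys_mkdir. */
static int toy_sys_mkdir(const char *path)
{
    (void)path;
    toy_orig_calls++;
    return 0;
}

/* Step 2: the replacement; reports success but does nothing. */
static int toy_own_mkdir(const char *path)
{
    (void)path;
    return 0;
}

static toy_handler toy_table[TOY_NR_MAX] = { [TOY_NR_MKDIR] = toy_sys_mkdir };
static toy_handler toy_saved;       /* step 1: saved original pointer */

/* Steps 1 and 3: save the original pointer, then swap in ours. */
void toy_hook(void)
{
    if (toy_saved == NULL) {
        toy_saved = toy_table[TOY_NR_MKDIR];
        toy_table[TOY_NR_MKDIR] = toy_own_mkdir;
    }
}

/* Step 4: put the original pointer back. */
void toy_unhook(void)
{
    if (toy_saved != NULL) {
        toy_table[TOY_NR_MKDIR] = toy_saved;
        toy_saved = NULL;
    }
}

/* What a caller sees: always dispatches through the table. */
int toy_mkdir(const char *path)
{
    return toy_table[TOY_NR_MKDIR](path);
}

int toy_original_call_count(void)
{
    return toy_orig_calls;
}
```

Guarding toy_hook and toy_unhook against repeated calls is a design choice of this sketch: hooking twice would otherwise save the replacement as the "original" and make the restore step a no-op.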

Tracing shows which system calls a user application makes while it runs. From the trace you can determine which system call has to be intercepted to take control of the application.

    # ltrace -S ./src-1.0
    ...
    open("file1", 0, 01 <unfinished ...>
    SYS_open("file1", 0, 01) = 3
    <... open resumed> ) = 3
    read(3, <unfinished ...>
    SYS_read(3, "123\n", 80) = 4
    <... read resumed> "123\n", 80) = 4
    close(3 <unfinished ...>
    SYS_close(3) = 0
    <... close resumed> ) = 0
    ...

Now we have enough information to begin studying example modules that intercept system calls.

4. Examples of intercepting system calls based on LKM

4.1 Prohibition of directory creation

When a directory is created, the kernel function sys_mkdir is called. Its parameter is a string containing the name of the directory to create. Let's look at the code that intercepts this system call.

    /* Source 4.1 */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <sys/syscall.h>

    /* Declare the exported system call table */
    extern void *sys_call_table[];

    /* Pointer for saving the original call */
    int (*orig_mkdir)(const char *path);

    /* Our own system call. It does nothing and simply returns zero */
    int own_mkdir(const char *path)
    {
        return 0;
    }

    /* On module initialization, save the pointer to the original call
       and replace the system call */
    int init_module(void)
    {
        orig_mkdir = sys_call_table[SYS_mkdir];
        sys_call_table[SYS_mkdir] = own_mkdir;
        printk("sys_mkdir replaced\n");
        return 0;
    }

    /* On unload, restore the original call */
    void cleanup_module(void)
    {
        sys_call_table[SYS_mkdir] = orig_mkdir;
        printk("sys_mkdir moved back\n");
    }
    /* EOF */

To obtain the object module, compile the code, load the module with insmod, and run a series of experiments on the system:

    # gcc -c -DMODULE -I/usr/src/linux/include/ src-3.1.c
    # insmod src-3.1.o
    # dmesg | tail -n 1
    sys_mkdir replaced
    # mkdir test
    # ls -ald test
    ls: test: No such file or directory
    # rmmod src-3.1
    # dmesg | tail -n 1
    sys_mkdir moved back
    # mkdir test
    # ls -ald test
    drwxr-xr-x 2 root root 4096 2003-12-23 03:46 test

As you can see, the mkdir command has no effect: it reports no error, but no directory is created. To restore normal behavior it is enough to unload the module, as was done above.

4.2 Hiding a file entry in a directory

Let's determine which system call is responsible for reading the contents of a directory. To do this, we write another test fragment that reads the current directory:

    /* Source 4.2.1 */
    #include <sys/types.h>
    #include <dirent.h>

    int main()
    {
        DIR *d;
        struct dirent *dp;

        d = opendir(".");
        dp = readdir(d);
        return 0;
    }
    /* EOF */

Build the executable and trace it:

    # gcc -o src-3.2.1 src-3.2.1.c
    # ltrace -S ./src-3.2.1
    ...
    opendir("." <unfinished ...>
    SYS_open(".", 100352, 010005141300) = 3
    SYS_fstat64(3, 0xbffff79c, 0x4014c2c0, 3, 0xbffff874) = 0
    SYS_fcntl64(3, 2, 1, 1, 0x4014c2c0) = 0
    SYS_brk(NULL) = 0x080495f4
    SYS_brk(0x0806a5f4) = 0x0806a5f4
    SYS_brk(NULL) = 0x0806a5f4
    SYS_brk(0x0806b000) = 0x0806b000
    <... opendir resumed> ) = 0x08049648
    readdir(0x08049648 <unfinished ...>
    SYS_getdents64(3, 0x08049678, 4096, 0x40014400, 0x4014c2c0) = 528
    <... readdir resumed> ) = 0x08049678
    ...

Note the SYS_getdents64 line. The contents of the directory are read by the getdents64 system call (other kernels may use getdents). The result is stored as a list of structures of type struct dirent, and the call itself returns the total length in bytes of the entries it read. We are interested in two fields of this structure:

d_reclen - record size
d_name - file name
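To see how d_reclen and d_name are used when walking such a list, here is a small user-space sketch that issues getdents64 directly through syscall(2). It assumes a Linux system; the function name dir_contains is hypothetical, and the record structure is declared by hand, as the getdents(2) manual page suggests.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by getdents64, declared by hand. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;   /* size of this record in bytes */
    unsigned char  d_type;
    char           d_name[];   /* NUL-terminated file name */
};

/* Returns 1 if `dir` has an entry called `name`, 0 if not,
 * and -1 if the directory cannot be opened. */
int dir_contains(const char *dir, const char *name)
{
    /* The union keeps the byte buffer suitably aligned for the struct. */
    union { char bytes[4096]; uint64_t align; } buf;
    int found = 0;
    int fd = open(dir, O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return -1;

    for (;;) {
        /* The return value is the number of bytes filled in. */
        long n = syscall(SYS_getdents64, fd, buf.bytes, sizeof buf.bytes);
        if (n <= 0)
            break;
        for (long pos = 0; pos < n; ) {
            struct linux_dirent64 *d =
                (struct linux_dirent64 *)(buf.bytes + pos);
            if (strcmp(d->d_name, name) == 0)
                found = 1;
            pos += d->d_reclen;   /* step to the next record */
        }
    }
    close(fd);
    return found;
}
```

The inner loop never indexes the buffer as an array of fixed-size structures: the records have variable length, so d_reclen is the only way to reach the next one.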

To hide the record for a file (in other words, to make the file invisible), you need to intercept the sys_getdents64 system call, find the corresponding record in the list of returned structures and delete it. Let's look at the code that performs this operation (the author of the original code is Michal Zalewski):

    /* Source 4.2.2 */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/types.h>
    #include <linux/dirent.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <sys/syscall.h>
    #include <asm/uaccess.h>

    extern void *sys_call_table[];

    int (*orig_getdents)(u_int fd, struct dirent *dirp, u_int count);

    /* Define our system call */
    int own_getdents(u_int fd, struct dirent *dirp, u_int count)
    {
        unsigned int n;
        int tmp, t;   /* signed, so that an error from the original call is not mistaken for a length */

        struct dirent64 {
            int d_ino1, d_ino2;
            int d_off1, d_off2;
            unsigned short d_reclen;
            unsigned char d_type;
            char d_name[0];
        } *dirp2, *dirp3;

        /* Name of the file we want to hide */
        char hide[] = "file1";

        /* Determine the length of the directory entries */
        tmp = (*orig_getdents)(fd, dirp, count);

        if (tmp > 0) {
            /* Allocate memory for the structure in kernel space
               and copy the contents of the directory into it */
            dirp2 = (struct dirent64 *)kmalloc(tmp, GFP_KERNEL);
            if (!dirp2)
                return tmp;
            copy_from_user(dirp2, dirp, tmp);

            /* Use a second pointer and save the length of the entries */
            dirp3 = dirp2;
            t = tmp;

            /* Start searching for our file */
            while (t > 0) {
                /* Read the length of the current entry and determine
                   the remaining length of the entries in the directory */
                n = dirp3->d_reclen;
                t -= n;

                /* Check whether the file name in the current entry
                   matches the one we are looking for */
                if (strstr((char *)&(dirp3->d_name), (char *)&hide) != NULL) {
                    /* If so, overwrite the entry with the ones that follow
                       and calculate the new length of the entries. memmove is
                       used because the regions overlap; the pointer stays in
                       place so that the entry moved here is checked as well */
                    memmove(dirp3, (char *)dirp3 + n, t);
                    tmp -= n;
                } else {
                    /* Otherwise move the pointer to the next record
                       and continue searching */
                    dirp3 = (struct dirent64 *)((char *)dirp3 + n);
                }
            }

            /* Return the result and free the memory */
            copy_to_user(dirp, dirp2, tmp);
            kfree(dirp2);
        }

        /* Return the length of the entries in the directory */
        return tmp;
    }

    /* The module initialization and unload functions have the standard form */
    int init_module(void)
    {
        orig_getdents = sys_call_table[SYS_getdents64];
        sys_call_table[SYS_getdents64] = own_getdents;
        return 0;
    }

    void cleanup_module()
    {
        sys_call_table[SYS_getdents64] = orig_getdents;
    }
    /* EOF */

After compiling this code and loading the module, "file1" disappears from directory listings, which is what we set out to demonstrate.
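The record-removal step is easy to get wrong, so it is worth exercising outside the kernel. The sketch below is a user-space toy, not the module's code: it uses a hypothetical simplified record (a 2-byte length followed by the name), all toy_ and rec_ names are made up, and it matches names exactly with strcmp where the module uses a strstr substring match. It keeps the current position after removing a record, so two matching records in a row are both removed.

```c
#include <stddef.h>
#include <string.h>

/* Toy record: unsigned short length, then a NUL-terminated name. */
#define REC_HDR sizeof(unsigned short)

static unsigned short rec_len(const unsigned char *rec)
{
    unsigned short n;
    memcpy(&n, rec, sizeof n);   /* avoids unaligned access */
    return n;
}

static const char *rec_name(const unsigned char *rec)
{
    return (const char *)rec + REC_HDR;
}

/* Appends a record for `name`; returns the new number of bytes used.
 * The caller must make sure the buffer is large enough. */
size_t toy_append(unsigned char *buf, size_t used, const char *name)
{
    unsigned short n = (unsigned short)(REC_HDR + strlen(name) + 1);
    memcpy(buf + used, &n, sizeof n);
    strcpy((char *)buf + used + REC_HDR, name);
    return used + n;
}

/* Removes every record named `hide`; returns the new number of bytes. */
size_t toy_hide(unsigned char *buf, size_t used, const char *hide)
{
    size_t pos = 0;
    while (pos < used) {
        size_t n = rec_len(buf + pos);
        if (strcmp(rec_name(buf + pos), hide) == 0) {
            /* Slide the tail over the record; regions overlap. */
            memmove(buf + pos, buf + pos + n, used - pos - n);
            used -= n;           /* stay at pos: a new record is here now */
        } else {
            pos += n;
        }
    }
    return used;
}

/* Returns 1 if a record named `name` is present, else 0. */
int toy_contains(const unsigned char *buf, size_t used, const char *name)
{
    for (size_t pos = 0; pos < used; pos += rec_len(buf + pos))
        if (strcmp(rec_name(buf + pos), name) == 0)
            return 1;
    return 0;
}
```

The value returned by toy_hide plays the role of the shortened length that the intercepting call hands back to user space.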

5. Direct access method to the kernel address space /dev/kmem

Let's first consider theoretically how interception is carried out using direct access to the kernel address space, and then we will proceed to practical implementation.

Direct access to the kernel address space is provided by the device file /dev/kmem. This file maps the entire available virtual address space, including the swap partition. It is handled with the standard system functions open(), read() and write(). Having opened /dev/kmem in the usual way, we can access any address in the system by using it as an offset into this file. This method was developed by Silvio Cesare.

System functions are accessed by loading function parameters into processor registers and then calling software interrupt 0x80. The handler for this interrupt, the system_call function, pushes the call parameters onto the stack, retrieves the address of the called system function from the sys_call_table table and transfers control to this address.

Having full access to the kernel address space, we can obtain the entire contents of the system call table, i.e. addresses of all system functions. By changing the address of any system call, we will thereby intercept it. But for this you need to know the table address, or, in other words, the offset in the /dev/kmem file at which this table is located.

To determine the address of the sys_call_table table, you first need to calculate the address of the system_call function. Since this function is an interrupt handler, let's look at how interrupts are handled in protected mode.

In real mode, when an interrupt occurs, the processor consults the interrupt vector table, which is always located at the very beginning of memory and holds two-word addresses of the interrupt handlers. In protected mode, the counterpart of the interrupt vector table is the interrupt descriptor table (IDT), which is set up by the protected-mode operating system. For the processor to be able to access this table, its address must be loaded into the IDTR (Interrupt Descriptor Table Register). The IDT contains descriptors of the interrupt handlers, which include, among other things, the handlers' addresses. These descriptors are called gates. When an interrupt occurs, the processor fetches the gate from the IDT by the interrupt number, determines the handler address and transfers control to it.

To obtain the address of the system_call function from the IDT, we extract the gate for interrupt int $0x80 and take from it the address of the corresponding handler, i.e. the address of system_call. Inside system_call, the sys_call_table table is accessed by the instruction call *<table_address>(,%eax,4). By finding the opcode (signature) of this instruction in the /dev/kmem file, we also find the address of the system call table.
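Extracting the handler address from a gate is a matter of gluing two 16-bit halves together. The following sketch shows that arithmetic on its own; the structure uses the same field layout as the gate description in Source 6.0, while the function names gate_address and gate_handler are hypothetical.

```c
#include <stdint.h>

/* 32-bit interrupt gate: the handler offset is split in two halves. */
struct idt_gate {
    uint16_t off1;    /* handler offset, bits 0..15 */
    uint16_t sel;     /* code segment selector */
    uint8_t  none;    /* reserved */
    uint8_t  flags;   /* gate type, privilege level, present bit */
    uint16_t off2;    /* handler offset, bits 16..31 */
} __attribute__((packed));

/* Address of gate number `vector`: each gate occupies 8 bytes. */
uint32_t gate_address(uint32_t idt_base, unsigned vector)
{
    return idt_base + 8u * vector;
}

/* Reassembles the 32-bit handler address from the two halves. */
uint32_t gate_handler(const struct idt_gate *g)
{
    return ((uint32_t)g->off2 << 16) | g->off1;
}
```

For interrupt 0x80 the gate lies 0x400 bytes past the IDT base, and a gate with off1 = 0x4cbc and off2 = 0xc019 yields the handler address 0xc0194cbc.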

To determine the opcode, we will use the debugger and disassemble the system_call function:

    # gdb -q /usr/src/linux/vmlinux
    (gdb) disas system_call
    Dump of assembler code for function system_call:
    0xc0194cbc: push   %eax
    0xc0194cbd: cld
    0xc0194cbe: push   %es
    0xc0194cbf: push   %ds
    0xc0194cc0: push   %eax
    0xc0194cc1: push   %ebp
    0xc0194cc2: push   %edi
    0xc0194cc3: push   %esi
    0xc0194cc4: push   %edx
    0xc0194cc5: push   %ecx
    0xc0194cc6: push   %ebx
    0xc0194cc7: mov    $0x18,%edx
    0xc0194ccc: mov    %edx,%ds
    0xc0194cce: mov    %edx,%es
    0xc0194cd0: mov    $0xffffe000,%ebx
    0xc0194cd5: and    %esp,%ebx
    0xc0194cd7: testb  $0x2,0x18(%ebx)
    0xc0194cdb: jne    0xc0194d3c
    0xc0194cdd: cmp    $0x10e,%eax
    0xc0194ce2: jae    0xc0194d69
    0xc0194ce8: call   *0xc02cbb0c(,%eax,4)
    0xc0194cef: mov    %eax,0x18(%esp,1)
    0xc0194cf3: nop
    End of assembler dump.

The line "call *0xc02cbb0c(,%eax,4)" is the access to the sys_call_table table. The value 0xc02cbb0c is the table address (your numbers will most likely be different). Let's get the opcode of this instruction:

    (gdb) x/xw system_call+44
    0xc0194ce8: 0x0c8514ff

We have found the opcode of the instruction that accesses the sys_call_table table. It is \xff\x14\x85 (gdb prints the word in little-endian order, so the bytes in memory are ff 14 85 0c). The next 4 bytes are the table address. You can verify this by entering the command:

    (gdb) x/xw system_call+44+3
    0xc0194ceb: 0xc02cbb0c

Thus, by finding the sequence \xff\x14\x85 in the /dev/kmem file and reading the 4 bytes that follow it, we get the address of the system call table sys_call_table. Knowing its address, we can read the contents of the table (the addresses of all system functions) and intercept any system call by changing its address.
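The signature search can be tried on an ordinary byte buffer before pointing it at kernel memory. This is a minimal sketch, assuming the code bytes have already been read into memory; the function name find_table_address is hypothetical, and a plain loop with memcmp is used in place of the GNU-specific memmem so that the block needs only standard C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scans code[0..len) for the bytes ff 14 85 and returns the 32-bit
 * little-endian operand that follows them, or 0 if the signature is
 * absent or the operand would run past the end of the buffer. */
uint32_t find_table_address(const unsigned char *code, size_t len)
{
    static const unsigned char sig[3] = { 0xff, 0x14, 0x85 };

    for (size_t i = 0; i + sizeof sig + 4 <= len; i++) {
        if (memcmp(code + i, sig, sizeof sig) == 0) {
            const unsigned char *p = code + i + sizeof sig;
            return (uint32_t)p[0]
                 | (uint32_t)p[1] << 8
                 | (uint32_t)p[2] << 16
                 | (uint32_t)p[3] << 24;
        }
    }
    return 0;
}
```

Fed the bytes ff 14 85 0c bb 2c c0 from the dump above, the function returns 0xc02cbb0c. Assembling the operand byte by byte keeps the result independent of the host's alignment rules and byte order.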

Let's look at the pseudocode that performs the interception operation:

    readaddr(old_syscall, sct + SYS_CALL*4, 4);
    writeaddr(new_syscall, sct + SYS_CALL*4, 4);

The readaddr function reads the address of the system call from the system call table and stores it in the old_syscall variable. Each entry in the sys_call_table table takes 4 bytes. The required address is located at offset sct + SYS_CALL*4 in the file /dev/kmem (here sct is the address of the sys_call_table table and SYS_CALL is the number of the system call). The writeaddr function overwrites the address of the SYS_CALL system call with the address of the new_syscall function, so that all calls to SYS_CALL are serviced by this function.
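The two pseudocode operations map directly onto lseek(), read() and write(). Below is a minimal sketch under stated assumptions: the names open_image, read_slot, write_slot and hook_slot are hypothetical, the slot size of 4 bytes matches the i386 table described here, and the descriptor can refer to any seekable file, which makes the logic easy to rehearse on an ordinary file instead of /dev/kmem.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens the memory image for reading and writing. */
int open_image(const char *path)
{
    return open(path, O_RDWR);
}

/* readaddr: fetch the 4-byte entry number `nr` of the table at `sct`.
 * Returns 0 on success, -1 on failure. */
int read_slot(int fd, off_t sct, unsigned nr, uint32_t *out)
{
    off_t where = sct + (off_t)nr * 4;

    if (lseek(fd, where, SEEK_SET) != where)
        return -1;
    return read(fd, out, sizeof *out) == (ssize_t)sizeof *out ? 0 : -1;
}

/* writeaddr: store `value` into entry number `nr`. */
int write_slot(int fd, off_t sct, unsigned nr, uint32_t value)
{
    off_t where = sct + (off_t)nr * 4;

    if (lseek(fd, where, SEEK_SET) != where)
        return -1;
    return write(fd, &value, sizeof value) == (ssize_t)sizeof value ? 0 : -1;
}

/* Interception: remember the old entry, then install the new one. */
int hook_slot(int fd, off_t sct, unsigned nr,
              uint32_t new_addr, uint32_t *old_addr)
{
    if (read_slot(fd, sct, nr, old_addr) != 0)
        return -1;
    return write_slot(fd, sct, nr, new_addr);
}
```

Reading the old entry before writing the new one is what makes the interception reversible: the value stored through old_addr is exactly what has to be written back to undo it.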

It seems that everything is simple and the goal has been achieved. However, remember that we are working in the user address space. If we place the new system function in this address space, calling it will produce a nice error message. Hence the conclusion: the new system call must be placed in the kernel address space. To do this, we need to obtain a block of memory in kernel space and place the new system call in that block.

You can allocate memory in kernel space using the kmalloc function. But it is impossible to call a kernel function directly from the user address space, so we will use the following algorithm:

knowing the address of the sys_call_table table, get the address of some system call (for example, sys_mkdir)
define a function that calls kmalloc and returns a pointer to a block of memory in the kernel address space; let's call this function get_kmalloc
save the first N bytes of the sys_mkdir system call, where N is the size of the get_kmalloc function
overwrite the first N bytes of sys_mkdir with the get_kmalloc function
invoke the sys_mkdir system call, thereby running get_kmalloc
restore the first N bytes of the sys_mkdir system call

As a result, we will have at our disposal a block of memory located in kernel space.

But to implement this algorithm we need the address of the kmalloc function. There are several ways to find it. The simplest is to read the address from the System.map file or to determine it with the gdb debugger (print &kmalloc). If module support is enabled in the kernel, the kmalloc address can be obtained with the get_kernel_syms() function. This option is discussed below. If there is no support for kernel modules, the address of kmalloc has to be found by searching for the opcode of an instruction that calls kmalloc, similar to what was done for the sys_call_table table.

The kmalloc function takes two parameters: the size of the requested memory and the GFP specifier. To search for the opcode, we will use the debugger and disassemble any kernel function that contains a call to the kmalloc function.

    # gdb -q /usr/src/linux/vmlinux
    (gdb) disas inter_module_register
    Dump of assembler code for function inter_module_register:
    0xc01a57b4: push   %ebp
    0xc01a57b5: push   %edi
    0xc01a57b6: push   %esi
    0xc01a57b7: push   %ebx
    0xc01a57b8: sub    $0x10,%esp
    0xc01a57bb: mov    0x24(%esp,1),%ebx
    0xc01a57bf: mov    0x28(%esp,1),%esi
    0xc01a57c3: mov    0x2c(%esp,1),%ebp
    0xc01a57c7: movl   $0x1f0,0x4(%esp,1)
    0xc01a57cf: movl   $0x14,(%esp,1)
    0xc01a57d6: call   0xc01bea2a
    ...

It doesn't matter what the function does; what matters is that it contains what we need, a call to the kmalloc function. Look at the last lines. First the parameters are placed on the stack (the esp register points to the top of the stack), then the function is called. The GFP specifier is the first to be placed on the stack (movl $0x1f0,0x4(%esp,1)). For kernel versions 2.4.9 and higher, this value is 0x1f0. Let's find the opcode of this instruction:

    (gdb) x/xw inter_module_register+19
    0xc01a57c7: 0x042444c7

If we find this opcode, we can calculate the address of the kmalloc function. At first glance, the address of this function is the argument of the call instruction, but this is not quite true. Unlike in the system_call function, here the instruction contains not the kmalloc address but the offset to it relative to the current address. Let's verify this by looking at the opcode of the instruction call 0xc01bea2a:

    (gdb) x/xw inter_module_register+34
    0xc01a57d6: 0x01924fe8

The first byte is e8, the opcode of the call instruction. Let's find the value of the argument of this instruction:

    (gdb) x/xw inter_module_register+35
    0xc01a57d7: 0x0001924f

Now if we add the current address 0xc01a57d6, the offset 0x0001924f and the 5 bytes of the instruction, we get the desired address of the kmalloc function: 0xc01bea2a.
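The address calculation above is compact enough to check in code. This is a small sketch, assuming a 32-bit near call encoded as the byte e8 followed by a little-endian 32-bit displacement; the function name call_target is hypothetical.

```c
#include <stdint.h>

/* addr is the address of the e8 opcode byte and insn points at the five
 * bytes of the instruction. Returns the call target, or 0 if the first
 * byte is not the near-call opcode. */
uint32_t call_target(uint32_t addr, const unsigned char insn[5])
{
    uint32_t rel;

    if (insn[0] != 0xe8)
        return 0;

    rel = (uint32_t)insn[1]
        | (uint32_t)insn[2] << 8
        | (uint32_t)insn[3] << 16
        | (uint32_t)insn[4] << 24;

    /* The displacement is relative to the end of the instruction.
     * Unsigned arithmetic wraps modulo 2^32, so negative (backward)
     * displacements work as well. */
    return addr + 5u + rel;
}
```

With addr = 0xc01a57d6 and the bytes e8 4f 92 01 00 from the dump, the result is 0xc01bea2a, the same value obtained by hand.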

This concludes the theoretical calculations and, using the above method, we will intercept the sys_mkdir system call.

6. Example of interception using /dev/kmem

    /* Source 6.0 */
    #define _GNU_SOURCE   /* for memmem() */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <linux/module.h>

    /* Number of the system call to intercept */
    #define _SYS_MKDIR_ 39
    #define KMEM_FILE "/dev/kmem"
    #define MAX_SYMS 4096

    /* Format of the IDTR register */
    struct {
        unsigned short limit;
        unsigned int base;
    } __attribute__ ((packed)) idtr;

    /* Format of an IDT interrupt gate */
    struct {
        unsigned short off1;
        unsigned short sel;
        unsigned char none, flags;
        unsigned short off2;
    } __attribute__ ((packed)) idt;

    /* Structure for the get_kmalloc function */
    struct kma_struc {
        ulong (*kmalloc) (uint, int);  // - address of the kmalloc function
        int size;                      // - size of the memory to allocate
        int flags;                     // - flag, for kernels > 2.4.9 = 0x1f0 (GFP)
        ulong mem;
    } __attribute__ ((packed)) kmalloc;

    /* A function that only allocates a block of memory
       in the kernel address space */
    int get_kmalloc(struct kma_struc *k)
    {
        k->mem = k->kmalloc(k->size, k->flags);
        return 0;
    }

    /* A function that returns the address of a function
       (needed for the kmalloc search) */
    ulong get_sym(char *n)
    {
        struct kernel_sym tab[MAX_SYMS];
        int numsyms;
        int i;

        numsyms = get_kernel_syms(NULL);
        if (numsyms > MAX_SYMS || numsyms < 0) return 0;
        get_kernel_syms(tab);
        for (i = 0; i < numsyms; i++) {
            if (!strncmp(n, tab[i].name, strlen(n)))
                return tab[i].value;
        }
        return 0;
    }

    /* Our new system function; it does nothing ;) */
    int new_mkdir(const char *path)
    {
        return 0;
    }

    /* Reads size bytes of data from /dev/kmem at offset into buf */
    static inline int rkm(int fd, uint offset, void *buf, uint size)
    {
        if (lseek(fd, offset, 0) != offset) {
            printf("lseek err\n");
            return 0;
        }
        if (read(fd, buf, size) != size) return 0;
        return size;
    }

    /* The same, but writes to /dev/kmem */
    static inline int wkm(int fd, uint offset, void *buf, uint size)
    {
        if (lseek(fd, offset, 0) != offset) return 0;
        if (write(fd, buf, size) != size) return 0;
        return size;
    }

    /* Reads 4 bytes of data from /dev/kmem */
    static inline int rkml(int fd, uint offset, ulong *buf)
    {
        return rkm(fd, offset, buf, sizeof(ulong));
    }

    /* The same, but writes */
    static inline int wkml(int fd, uint offset, ulong buf)
    {
        return wkm(fd, offset, &buf, sizeof(ulong));
    }

    /* Function for obtaining the address of sys_call_table */
    ulong get_sct(int kmem)
    {
        ulong sys_call_off;  // - address of the int $0x80 interrupt
                             //   handler (the system_call function)
        char *p;
        char sc_asm[128];

        asm("sidt %0" : "=m" (idtr));
        if (!rkm(kmem, idtr.base+(8*0x80), &idt, sizeof(idt))) return 0;
        sys_call_off = (idt.off2 << 16) | idt.off1;
        if (!rkm(kmem, sys_call_off, &sc_asm, 128)) return 0;
        p = (char *)memmem(sc_asm, 128, "\xff\x14\x85", 3);
        if (!p) return 0;    /* signature not found */
        p += 3;              /* skip the opcode; p now points at the table address */
        printf("call for sys_call_table at %08x\n", p);
        return *(ulong *)p;
    }

    /* Function for determining the address of the kmalloc function */
    ulong get_kma(ulong pgoff)
    {
        uint i;
        unsigned char buf[0x10000], *p, *p1;
        int kmemz;
        ulong ret;

        ret = get_sym("kmalloc");
        if (ret) {
            printf("\nZer gut!\n");
            return ret;
        }

        kmemz = open("/dev/kmem", O_RDONLY);
        if (kmemz < 0) return 0;

        for (i = pgoff+0x100000; i < (pgoff + 0x1000000); i += 0x10000) {
            if (!rkm(kmemz, i, buf, sizeof(buf))) return 0;
            p1 = (char *)memmem(buf, sizeof(buf), "\x68\xf0\x01\x00", 4);
            if (p1) {
                /* search only the part of the buffer that follows the match */
                p = (char *)memmem(p1+4, sizeof(buf)-(p1+4-buf), "\xe8", 1);
                if (p) {
                    p++;     /* skip the e8 opcode */
                    close(kmemz);
                    return *(unsigned long *)p+i+(p-buf)+4;
                }
            }
        }
        close(kmemz);
        return 0;
    }

    int main()
    {
        int kmem;
                                 // !! - empty, must be filled in
        ulong get_kmalloc_size;  // - size of the get_kmalloc function !!
        ulong get_kmalloc_addr;  // - address of the get_kmalloc function !!
        ulong new_mkdir_size;    // - size of the interceptor function !!
        ulong new_mkdir_addr;    // - address of the interceptor function !!
        ulong sys_mkdir_addr;    // - address of the sys_mkdir system call
        ulong page_offset;       // - lower bound of the kernel address space
        ulong sct;               // - address of the sys_call_table table
        ulong kma;               // - address of the kmalloc function
        unsigned char tmp[1024];

        kmem = open(KMEM_FILE, O_RDWR, 0);
        if (kmem < 0) return 0;

        sct = get_sct(kmem);
        page_offset = sct & 0xF0000000;
        kma = get_kma(page_offset);

        printf("OK\n"
               "page_offset\t\t:\t0x%08x\n"
               "sys_call_table\t:\t0x%08x\n"
               "kmalloc()\t\t:\t0x%08x\n",
               page_offset, sct, kma);

        /* Find the address of sys_mkdir */
        if (!rkml(kmem, sct+(_SYS_MKDIR_*4), &sys_mkdir_addr)) {
            printf("Cannot get addr of %d syscall\n", _SYS_MKDIR_);
            perror("er: ");
            return 1;
        }

        /* Save the first N bytes of the sys_mkdir call */
        if (!rkm(kmem, sys_mkdir_addr, tmp, get_kmalloc_size)) {
            printf("Cannot save old %d syscall!\n", _SYS_MKDIR_);
            return 1;
        }

        /* Overwrite the first N bytes with the get_kmalloc function */
        if (!wkm(kmem, sys_mkdir_addr, (void *)get_kmalloc_addr, get_kmalloc_size)) {
            printf("Can't overwrite our syscall %d!\n", _SYS_MKDIR_);
            return 1;
        }

        kmalloc.kmalloc = (void *) kma;  // - address of the kmalloc function
        kmalloc.size = new_mkdir_size;   // - size of the requested memory
                                         //   (size of the new_mkdir interceptor function)
        kmalloc.flags = 0x1f0;           // - GFP specifier

        /* Execute the sys_mkdir system call, thereby running
           our get_kmalloc function */
        mkdir((char *)&kmalloc, 0);

        /* Restore the original sys_mkdir call */
        if (!wkm(kmem, sys_mkdir_addr, tmp, get_kmalloc_size)) {
            printf("Can't restore syscall %d !\n", _SYS_MKDIR_);
            return 1;
        }

        if (kmalloc.mem < page_offset) {
            printf("Allocated memory is too low (%08x < %08x)\n",
                   kmalloc.mem, page_offset);
            return 1;
        }

        /* Display the results */
        printf("sys_mkdir_addr\t\t:\t0x%08x\n"
               "get_kmalloc_size\t:\t0x%08x (%d bytes)\n\n"
               "our kmem region\t\t:\t0x%08x\n"
               "size of our kmem\t:\t0x%08x (%d bytes)\n\n",
               sys_mkdir_addr, get_kmalloc_size, get_kmalloc_size,
               kmalloc.mem, kmalloc.size, kmalloc.size);

        /* Place our new system call in kernel space */
        if (!wkm(kmem, kmalloc.mem, (void *)new_mkdir_addr, new_mkdir_size)) {
            printf("Unable to locate new system call !\n");
            return 1;
        }

        /* Point the sys_call_table entry at our new call */
        if (!wkml(kmem, sct+(_SYS_MKDIR_*4), kmalloc.mem)) {
            printf("Eh ...");
            return 1;
        }

        return 0;
    }
    /* EOF */

Compile the resulting code and determine the addresses and sizes of the get_kmalloc and new_mkdir functions. It is too early to run the result! To find the addresses and sizes, we use the objdump utility:

    # gcc -o src-6.0 src-6.0.c
    # objdump -x ./src-6.0 > dump

Let's open the dump file and find the data we are interested in:

    080485a4 g F .text 00000032 get_kmalloc
    080486b1 g F .text 0000000a new_mkdir

Now let's enter these values into our program:

    ulong get_kmalloc_size=0x32;
    ulong get_kmalloc_addr=0x080485a4;
    ulong new_mkdir_size=0x0a;
    ulong new_mkdir_addr=0x080486b1;

Now let's recompile the program. Running it intercepts the sys_mkdir system call: from now on, all calls to sys_mkdir are serviced by the new_mkdir function.

End Of Paper/EOP

The functionality of the code from all sections was tested on kernel 2.4.22. When preparing the report, materials from the site were used
