Hacking Linux Kernel

Author: Sanjay Ahuja

 

User space vs. Kernel space

Linux is a protected operating system. It is implemented over the protected mode of the i386 series of CPUs.

Memory is divided into roughly two parts: kernel space and user space. Kernel space is where the kernel code lives, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.

Unfortunately, this is also the case for kernel code: it can't write to user space directly either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do so directly; it must use specific kernel functions instead. Likewise, when parameters are passed by address to a kernel function, the kernel function cannot read them directly. It must use other kernel functions to read each byte of the parameters.

Here are a few useful functions to use in kernel mode for transferring data bytes to or from user memory.

#include <asm/segment.h>

get_user(ptr)

Gets the given byte, word, or long from user memory. This is a macro, and it relies on the type of the argument to determine the number of bytes to transfer. You then have to use typecasts wisely.

put_user(x, ptr)

This is the same as get_user(), but instead of reading, it writes the value x to user memory at ptr.

memcpy_fromfs(void *to, const void *from,unsigned long n)

Copies n bytes from *from in user memory to *to in kernel memory.

memcpy_tofs(void *to, const void *from, unsigned long n)

Copies n bytes from *from in kernel memory to *to in user memory.
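The way get_user() and put_user() pick the transfer width from the type of the pointer can be modelled in ordinary user-space C. The sketch below is not kernel code: model_get_user(), model_put_user() and the two helpers are hypothetical names, plain memcpy() stands in for the access to user memory that the kernel performs, and the macros rely on the GCC __typeof__ extension. It also assumes a 2-byte short and a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fetch `size` bytes at `addr` and widen them. */
static unsigned long model_fetch(const void *addr, size_t size)
{
    unsigned char  b;
    unsigned short w;
    unsigned int   l;

    assert(size == 1 || size == 2 || size == 4);
    switch (size) {
    case 1:  memcpy(&b, addr, 1); return b;
    case 2:  memcpy(&w, addr, 2); return w;
    default: memcpy(&l, addr, 4); return l;
    }
}

/* Hypothetical helper: store the low `size` bytes of `val` at `addr`. */
static void model_store(unsigned long val, void *addr, size_t size)
{
    unsigned char  b = (unsigned char)val;
    unsigned short w = (unsigned short)val;
    unsigned int   l = (unsigned int)val;

    assert(size == 1 || size == 2 || size == 4);
    switch (size) {
    case 1:  memcpy(addr, &b, 1); break;
    case 2:  memcpy(addr, &w, 2); break;
    default: memcpy(addr, &l, 4); break;
    }
}

/* The width is never passed explicitly: sizeof(*(ptr)) decides it. */
#define model_get_user(ptr) \
    ((__typeof__(*(ptr)))model_fetch((ptr), sizeof(*(ptr))))
#define model_put_user(x, ptr) \
    model_store((unsigned long)(x), (ptr), sizeof(*(ptr)))
```

Because the width comes from the pointer type, passing a char * where a long * was meant silently transfers one byte instead of four, which is why the typecasts matter.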

 

 

System calls

Most libc calls rely on system calls, which are the simplest kernel functions a user program can call. These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically linkable kernel code.

Like MS-DOS and many others, Linux implements its system calls through a multiplexor invoked with a given software interrupt. In Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the function _system_call()), and the actual demultiplexing process occurs.

 

 

How does _system_call() work?

First, all registers are saved and the content of the %eax register is checked against the global system call table, which enumerates all system calls and their addresses. This table can be accessed with the extern void *sys_call_table[] variable. Each system call corresponds to a given number and a memory address in this table. System call numbers can be found in /usr/include/sys/syscall.h. The following listing shows my syscall.h:

#ifndef _SYS_SYSCALL_H

#define _SYS_SYSCALL_H

#define SYS_setup 0 /* Used only by init, to get system going. */

SYS_exit 1

SYS_fork 2 /* system call for the well-known fork() function in user space */

SYS_read 3

SYS_write 4

SYS_open 5

SYS_close 6

SYS_waitpid 7

SYS_creat 8

SYS_link 9

SYS_unlink 10

SYS_execve 11

SYS_chdir 12

SYS_time 13

SYS_mknod 14

SYS_chmod 15

SYS_lchown 16

SYS_break 17

SYS_oldstat 18

SYS_lseek 19

SYS_getpid 20

SYS_mount 21

SYS_umount 22

SYS_setuid 23 /* systemcalls for managing UID etc */

SYS_getuid 24 /* systemcalls for managing UID etc */

SYS_stime 25

SYS_ptrace 26

SYS_alarm 27

SYS_oldfstat 28

SYS_pause 29

SYS_utime 30

SYS_stty 31

SYS_gtty 32

SYS_access 33

SYS_nice 34

SYS_ftime 35

SYS_sync 36

SYS_kill 37

SYS_rename 38

SYS_mkdir 39

SYS_rmdir 40

SYS_dup 41

SYS_pipe 42

SYS_times 43

SYS_prof 44

SYS_brk 45 /* changes the size of used DS (data segment) */

SYS_setgid 46

SYS_getgid 47

SYS_signal 48

SYS_geteuid 49

SYS_getegid 50

SYS_acct 51

SYS_umount2 52

SYS_lock 53

SYS_ioctl 54

SYS_fcntl 55

SYS_mpx 56

SYS_setpgid 57

SYS_ulimit 58

SYS_oldolduname 59

SYS_umask 60

SYS_chroot 61

SYS_ustat 62

SYS_dup2 63

SYS_getppid 64

SYS_getpgrp 65

SYS_setsid 66

SYS_sigaction 67

SYS_sgetmask 68

SYS_ssetmask 69

SYS_setreuid 70

SYS_setregid 71

SYS_sigsuspend 72

SYS_sigpending 73

SYS_sethostname 74

SYS_setrlimit 75

SYS_getrlimit 76 /* Back compatible 2Gig limited rlimit */

SYS_getrusage 77

SYS_gettimeofday 78

SYS_settimeofday 79

SYS_getgroups 80

SYS_setgroups 81

SYS_select 82

SYS_symlink 83

SYS_oldlstat 84

SYS_readlink 85

SYS_uselib 86

SYS_swapon 87

SYS_reboot 88

SYS_readdir 89

SYS_mmap 90

SYS_munmap 91

SYS_truncate 92

SYS_ftruncate 93

SYS_fchmod 94

SYS_fchown 95

SYS_getpriority 96

SYS_setpriority 97

SYS_profil 98

SYS_statfs 99

SYS_fstatfs 100

SYS_ioperm 101

SYS_socketcall 102

SYS_syslog 103

SYS_setitimer 104

SYS_getitimer 105

SYS_stat 106

SYS_lstat 107

SYS_fstat 108

SYS_olduname 109

SYS_iopl 110

SYS_vhangup 111

SYS_idle 112

SYS_vm86old 113

SYS_wait4 114

SYS_swapoff 115

SYS_sysinfo 116

SYS_ipc 117

SYS_fsync 118

SYS_sigreturn 119

SYS_clone 120

SYS_setdomainname 121

SYS_uname 122

SYS_modify_ldt 123

SYS_adjtimex 124

SYS_mprotect 125

SYS_sigprocmask 126

SYS_create_module 127

SYS_init_module 128

SYS_delete_module 129

SYS_get_kernel_syms 130

SYS_quotactl 131

SYS_getpgid 132

SYS_fchdir 133

SYS_bdflush 134

SYS_sysfs 135

SYS_personality 136

SYS_afs_syscall 137 /* Syscall for Andrew File System */

SYS_setfsuid 138

SYS_setfsgid 139

SYS__llseek 140

SYS_getdents 141

SYS__newselect 142

SYS_flock 143

SYS_msync 144

SYS_readv 145

SYS_writev 146

SYS_getsid 147

SYS_fdatasync 148

SYS__sysctl 149

SYS_mlock 150

SYS_munlock 151

SYS_mlockall 152

SYS_munlockall 153

SYS_sched_setparam 154

SYS_sched_getparam 155

SYS_sched_setscheduler 156

SYS_sched_getscheduler 157

SYS_sched_yield 158

SYS_sched_get_priority_max 159

SYS_sched_get_priority_min 160

SYS_sched_rr_get_interval 161

SYS_nanosleep 162

SYS_mremap 163

SYS_setresuid 164

SYS_getresuid 165

SYS_vm86 166

SYS_query_module 167

SYS_poll 168

SYS_nfsservctl 169

SYS_setresgid 170

SYS_getresgid 171

SYS_prctl 172

SYS_rt_sigreturn 173

SYS_rt_sigaction 174

SYS_rt_sigprocmask 175

SYS_rt_sigpending 176

SYS_rt_sigtimedwait 177

SYS_rt_sigqueueinfo 178

SYS_rt_sigsuspend 179

SYS_pread 180

SYS_pwrite 181

SYS_chown 182

SYS_getcwd 183

SYS_capget 184

SYS_capset 185

SYS_sigaltstack 186

SYS_sendfile 187

SYS_getpmsg 188 /* some people actually want streams */

SYS_putpmsg 189 /* some people actually want streams */

SYS_vfork 190

SYS_ugetrlimit 191 /* SuS compliant getrlimit */

SYS_mmap2 192

SYS_truncate64 193

SYS_ftruncate64 194

SYS_stat64 195

SYS_lstat64 196

SYS_fstat64 197

SYS_lchown32 198

SYS_getuid32 199

SYS_getgid32 200

SYS_geteuid32 201

SYS_getegid32 202

SYS_setreuid32 203

SYS_setregid32 204

SYS_getgroups32 205

SYS_setgroups32 206

SYS_fchown32 207

SYS_setresuid32 208

SYS_getresuid32 209

SYS_setresgid32 210

SYS_getresgid32 211

SYS_chown32 212

SYS_setuid32 213

SYS_setgid32 214

SYS_setfsuid32 215

SYS_setfsgid32 216

SYS_pivot_root 217

SYS_mincore 218

SYS_madvise 219

SYS_madvise1 219 /* delete when C lib stub is removed */

SYS_getdents64 220

SYS_fcntl64 221

SYS_security 223 /* syscall for security modules */

SYS_gettid 224

SYS_readahead 225

SYS_setxattr 226

SYS_lsetxattr 227

SYS_fsetxattr 228

SYS_getxattr 229

SYS_lgetxattr 230

SYS_fgetxattr 231

SYS_listxattr 232

SYS_llistxattr 233

SYS_flistxattr 234

SYS_removexattr 235

SYS_lremovexattr 236

SYS_fremovexattr 237

SYS_tkill 238

SYS_sendfile64 239

SYS_futex 240

SYS_sched_setaffinity 241

SYS_sched_getaffinity 242

SYS_set_thread_area 243

#endif /* */

 

They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned. Otherwise, the system call exists and the corresponding entry in the table is the memory address of the system call code.

Here is an example of an invalid system call:

 

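[root@test kernel]# cat no1.c
#include <stdio.h>
#include <errno.h>

/* System call number 245 doesn't exist at this time. */
int sc(void)
{
    int res;

    /* Load 245 into %eax, trap into the kernel, read the result from %eax. */
    __asm__ __volatile__("int $0x80" : "=a"(res) : "0"(245) : "memory");
    return res;
}

int main(void)
{
    /* The kernel returns a negative error code; perror() wants it positive. */
    errno = -sc();
    perror("test of invalid syscall");
    return 0;
}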

[root@test kernel]# gcc no1.c

[root@test kernel]# ./a.out

test of invalid syscall: Function not implemented

[root@test kernel]# exit

 

If the entry is valid, control is transferred to the actual system call, which performs whatever you requested and returns. _system_call() then calls _ret_from_sys_call() to take care of housekeeping such as pending signals and rescheduling, and ultimately returns to user memory.
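The demultiplexing step described above can be modelled in user space with a table of function pointers. This is only a sketch under stated assumptions: model_system_call(), model_sys_call_table and the two placeholder calls are hypothetical names, the table has just 8 slots, and the nr argument plays the role of %eax. Empty slots are NULL, which mirrors the cells set to 0 for unimplemented system calls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_NR_SYSCALLS 8          /* hypothetical table size */

typedef long (*model_syscall_fn)(long a, long b, long c);

/* Placeholder "system call": hands its first argument back. */
static long model_sys_exit(long code, long b, long c)
{
    (void)b;
    (void)c;
    return code;
}

/* Placeholder "system call": adds its first two arguments. */
static long model_sys_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Slots that are not listed stay NULL, i.e. "not implemented". */
static model_syscall_fn model_sys_call_table[MODEL_NR_SYSCALLS] = {
    [1] = model_sys_exit,
    [4] = model_sys_add,
};

/* Look the number up and either dispatch or report -ENOSYS. */
static long model_system_call(long nr, long a, long b, long c)
{
    if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
        return -ENOSYS;
    if (model_sys_call_table[nr] == NULL)
        return -ENOSYS;
    return model_sys_call_table[nr](a, b, c);
}
```

A call such as model_system_call(245, 0, 0, 0) returns -ENOSYS, the same error that perror() reports as "Function not implemented" in the no1.c session.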

 

 

What is the Kernel-Symbol-Table?

There is another very important point we need to understand: the kernel symbol table. Take a look at /proc/ksyms. Every entry in this file represents an exported (public) kernel symbol, which can be accessed by our LKM. By default, every symbol defined in our LKM (a function, for example) is also exported to the public and listed in that file. LKM developers can use the following piece of code to limit the exported symbols of their module:

static struct symbol_table module_syms= { /*we define our own symbol table !*/

#include <linux/symtab_begin.h> /*symbols we want to export, do we ?*/

...

};

register_symtab(&module_syms); /*do the actual registration*/

Since we don't want to export any symbols to the public, we can use the following construction:

register_symtab(NULL);

This line must be inserted in the init_module() function.
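To see what the entries in /proc/ksyms carry, a small user-space parser can help. The sketch below assumes each line has the form "address name" optionally followed by "[module]"; that layout, like the names model_ksym and model_parse_ksym(), is an assumption made for illustration and should be checked against the file on your own kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one exported symbol. */
struct model_ksym {
    unsigned long addr;   /* address of the symbol in kernel memory */
    char name[128];       /* symbol name */
    char module[64];      /* owning module, empty for the kernel itself */
};

/* Parse one line; returns 0 on success, -1 if the line doesn't match. */
static int model_parse_ksym(const char *line, struct model_ksym *out)
{
    int n;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* Assumed layout: hex address, symbol name, optional "[module]". */
    n = sscanf(line, "%lx %127s [%63[^]]",
               &out->addr, out->name, out->module);
    return n >= 2 ? 0 : -1;
}
```

Feeding it every line read with fgets() from /proc/ksyms would show whether the symbols of a freshly loaded module appear there, which is one way to check the effect of register_symtab(NULL).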

 

libc

The int $0x80 instruction isn't normally used directly for system calls; rather, libc functions, which are often wrappers around interrupt 0x80, are used.

libc generally implements the system calls using the _syscallX() macros, where X is the number of parameters for the system call.

For example, the libc entry for write(2) would be implemented with a _syscall3 macro, since the actual write(2) prototype requires 3 parameters. Before calling interrupt 0x80, the _syscallX macros load the system call number and the arguments into registers, as the preprocessed output further down shows.

Finally, when the _system_call() (which is triggered with int $0x80) returns, the _syscallX() macro will check for a negative return value (in %eax) and will set errno accordingly.

Let's check another example with write(2) and see how it gets preprocessed.

 

[root@test kernel]# cat no2.c

#include <linux/types.h>

#include <linux/fs.h>

#include <sys/syscall.h>

#include <asm/unistd.h>

#include <sys/types.h>

#include <stdio.h>

#include <errno.h>

#include <fcntl.h>

#include <ctype.h>

#include <string.h>

_syscall3(ssize_t,write,int,fd,const void *,buf,size_t,count);

main()

{

char *t = "this is a test.\n";

write(0, t, strlen(t));

}

[root@test kernel]# gcc -E no2.c > no2.C

[root@test kernel]# indent no2.C -kr

indent:no2.C:3304: Warning: old style assignment ambiguity in "=-". Assuming "= -"

[root@test kernel]# tail -n 50 no2.C

 

#9 "no2.c" 2

 

ssize_t write(int fd, const void *buf, size_t count)

{

long __res;

__asm__ __volatile("int $0x80":"=a"(__res):"0"(4), "b"((long) (fd)), "c"((long) (buf)), "d"((long) (count)));

if (__res >= 0)

return (ssize_t) __res;

errno = -__res;

return -1;

};

main()

{

char *t = "this is a test.\n";

write(0, t, strlen(t));

}

[root@test kernel]# exit

 

Note that the "0"(4) in the write() function above matches the SYS_write definition in /usr/include/sys/syscall.h.
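On current glibc systems the same experiment can be done without the _syscallX() macros, through the generic syscall() function declared in <unistd.h>. The sketch below is an alternative to the macro shown above, not the article's method; raw_write() is a hypothetical wrapper name, while syscall() and SYS_write are the real glibc names. Note that the value of SYS_write depends on the architecture: it is 4 on i386 as listed above, but differs elsewhere (it is 1 on x86-64).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: issue write(2) by number instead of through
 * the libc write() function. syscall() applies the same convention as
 * the _syscallX() macros: on failure it returns -1 and sets errno.
 */
static long raw_write(int fd, const void *buf, size_t count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

Calling raw_write(1, "this is a test.\n", 16) writes the string to standard output, and a call with an invalid descriptor such as -1 returns -1 with errno set to EBADF.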

 

 

 

Making your own system calls.

There are a few ways to make your own system calls.

For example, you could modify the kernel sources and append your own code. A far easier way, however, would be to write a loadable kernel module.

A loadable kernel module is nothing more than an object file containing code that will be dynamically linked into the kernel when it is needed.

The main purposes of this feature are to have a small kernel, and to load a given driver when it is needed with the insmod(1) command. It's also easier to write a Kernel Loadable Module than to write code in the kernel source tree.

 

 

 

Writing a Kernel Loadable Module

A Kernel Loadable Module is easily made in C. It contains a chunk of #defines, some functions, an initialization function called init_module(), and an unload function called cleanup_module().

LKMs can be manually loaded using insmod and they can be removed using rmmod. For unloading the module the "Usage Counter" must be 0.
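The usage counter rule can be pictured with a tiny user-space model. Everything here is hypothetical (struct model_module, model_inc_use(), model_dec_use(), model_rmmod()); it only mimics the bookkeeping that the MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT macros perform in a real module, and the error codes are an assumption chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bookkeeping for one loaded module. */
struct model_module {
    const char *name;
    int use_count;   /* how many users currently hold the module */
    int loaded;      /* 1 while the module is in the "kernel" */
};

/* A user starts using the module. */
static void model_inc_use(struct model_module *m)
{
    m->use_count++;
}

/* A user is done with the module. */
static void model_dec_use(struct model_module *m)
{
    assert(m->use_count > 0);
    m->use_count--;
}

/* Unloading is refused while anyone still uses the module. */
static int model_rmmod(struct model_module *m)
{
    if (!m->loaded)
        return -ENOENT;
    if (m->use_count != 0)
        return -EBUSY;
    m->loaded = 0;
    return 0;
}
```

In this model a removal attempt between model_inc_use() and model_dec_use() fails with -EBUSY and leaves the module loaded.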

Loading a module - normally restricted to root - is managed by issuing the following command:

# insmod module.o

Among other things, this command makes the system invoke the init_module system call, which initialises the LKM by executing its int init_module(void) function.

Here is a typical Kernel Loadable Module source structure:

 

#define MODULE

#define __KERNEL__

#define __KERNEL_SYSCALLS__

#include <linux/config.h>

#ifdef MODULE

#include <linux/module.h>

#include <linux/version.h>

#else

#define MOD_INC_USE_COUNT

#define MOD_DEC_USE_COUNT

#endif

#include <linux/types.h>

#include <linux/fs.h>

#include <linux/mm.h>

#include <linux/errno.h>

#include <asm/segment.h>

#include <sys/syscall.h>

#include <linux/dirent.h>

#include <asm/unistd.h>

#include <sys/types.h>

#include <stdio.h>

#include <errno.h>

#include <fcntl.h>

#include <ctype.h>

int errno;

char tmp[64];

/* for example, we may need to use ioctl */

_syscall3(int, ioctl, int, d, int, request, unsigned long, arg);

int myfunction(int parm1,char *parm2)

{

int i,j,k;

/* ... */

}

int init_module(void)

{

/* ... */

printk("\nModule loaded.\n");

return 0;

}

void cleanup_module(void)

{

/* ... */

}

Note the mandatory #defines (#define MODULE, #define __KERNEL__) and #includes (#include <linux/config.h> ...).

Also note that as our Kernel Loadable Module will be running in kernel mode, we can't use libc functions, but we can use system calls with the previously discussed _syscallX() macros.

You would compile this module with 'gcc -c -O3 module.c' and insert it into the kernel with 'insmod module.o' (optimization must be turned on).

As the title suggests, Kernel Loadable Modules can also be used to modify kernel code without having to rebuild it entirely. For example, you could patch the write(2) system call to hide portions of a given file. Seems like a good place for backdoors, too: what would you do if you couldn't trust your own kernel?

 

 

Kernel and system calls backdoors

The main idea behind this is pretty simple. We'll redirect those damn system calls to our own ones in a Kernel Loadable Module, which will enable us to force the kernel to react as we want it to. For example, we could hide a sniffer by patching the IOCTL system call and masking the PROMISC bit. Lame but efficient.
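The redirection idea can be tried out safely in user space before touching a real kernel. The sketch below is a model only: model_table stands in for sys_call_table[], model_init() and model_cleanup() stand in for init_module() and cleanup_module(), and all names are hypothetical. For simplicity the replacement calls the saved original pointer directly, which is one possible way to reach the original code.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*model_syscall_fn)(long arg);

/* Placeholder for the real system call's work. */
static long model_original_call(long arg)
{
    return arg * 2;
}

/* Stand-in for sys_call_table[]; slot 0 is the call we redirect. */
static model_syscall_fn model_table[1] = { model_original_call };

static model_syscall_fn model_saved;   /* original entry, kept for restore */
static int model_hook_hits;            /* how often the replacement ran */

/* Replacement: do something extra, then pass through to the original. */
static long model_hooked_call(long arg)
{
    model_hook_hits++;
    return model_saved(arg);
}

/* What init_module() would do: remember the entry, then overwrite it. */
static int model_init(void)
{
    model_saved = model_table[0];
    model_table[0] = model_hooked_call;
    return 0;
}

/* What cleanup_module() would do: put the original entry back. */
static void model_cleanup(void)
{
    assert(model_saved != NULL);
    model_table[0] = model_saved;
    model_saved = NULL;
}
```

Callers that go through model_table[0] get the same results before and after model_init(), which is exactly why a redirected entry is hard to notice from the outside.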

To modify a given system call, just add the definition of the extern void *sys_call_table[] in your Kernel Loadable Module, and have the init_module() function modify the corresponding entry in the sys_call_table to point to your own code. The modified call can then do whatever you wish it to, call the original system call by modifying sys_call_table once more, and ...