Hacking Linux Kernel
Author: Sanjay Ahuja
Linux is a protected operating system. It is implemented over the protected mode of the i386 series of CPUs.
Memory is divided into roughly two parts: kernel space and user space. Kernel space is where the kernel code lives, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.
Unfortunately, the same restriction applies to kernel code: it can't write to user space directly either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do it directly, but must use specific kernel functions instead. Likewise, when parameters are passed by address to a kernel function, the kernel function cannot read the parameters directly. It must use other kernel functions to read each byte of the parameters.
Here are a few useful functions to use in kernel mode for transferring data bytes to or from user memory.
#include <asm/segment.h>
get_user(ptr)
Gets the given byte, word, or long from user memory. This is a macro, and it relies on the type of the argument to determine the number of bytes to transfer, so choose your typecasts carefully.
put_user(x, ptr)
This is the same as get_user(), but instead of reading, it writes the value x to user memory at ptr.
memcpy_fromfs(void *to, const void *from,unsigned long n)
Copies n bytes from *from in user memory to *to in kernel memory.
memcpy_tofs(void *to, const void *from, unsigned long n)
Copies n bytes from *from in kernel memory to *to in user memory.
Most libc calls rely on system calls, which are the simplest kernel functions a user program can call. These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically linkable kernel code.
As in MS-DOS and many other systems, Linux system calls are implemented through a multiplexor invoked with a given software interrupt. In Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the function _system_call()), and the actual demultiplexing process occurs.
How does _system_call() work?
First, all registers are saved and the content of the %eax register is checked against the global system calls table, which enumerates all system calls and their addresses. This table can be accessed with the extern void *sys_call_table[] variable. A given number and memory address in this table corresponds to each system call. System call numbers can be found in /usr/include/sys/syscall.h. The following list shows my syscall.h:
#ifndef _SYS_SYSCALL_H
#define _SYS_SYSCALL_H
#define SYS_setup 0 /* Used only by init, to get system going. */
SYS_exit 1
SYS_fork 2 /* system call for the well-known fork() function in user space */
SYS_read 3
SYS_write 4
SYS_open 5
SYS_close 6
SYS_waitpid 7
SYS_creat 8
SYS_link 9
SYS_unlink 10
SYS_execve 11
SYS_chdir 12
SYS_time 13
SYS_mknod 14
SYS_chmod 15
SYS_lchown 16
SYS_break 17
SYS_oldstat 18
SYS_lseek 19
SYS_getpid 20
SYS_mount 21
SYS_umount 22
SYS_setuid 23 /* systemcalls for managing UID etc */
SYS_getuid 24 /* systemcalls for managing UID etc */
SYS_stime 25
SYS_ptrace 26
SYS_alarm 27
SYS_oldfstat 28
SYS_pause 29
SYS_utime 30
SYS_stty 31
SYS_gtty 32
SYS_access 33
SYS_nice 34
SYS_ftime 35
SYS_sync 36
SYS_kill 37
SYS_rename 38
SYS_mkdir 39
SYS_rmdir 40
SYS_dup 41
SYS_pipe 42
SYS_times 43
SYS_prof 44
SYS_brk 45 /* changes the size of used DS (data segment) */
SYS_setgid 46
SYS_getgid 47
SYS_signal 48
SYS_geteuid 49
SYS_getegid 50
SYS_acct 51
SYS_umount2 52
SYS_lock 53
SYS_ioctl 54
SYS_fcntl 55
SYS_mpx 56
SYS_setpgid 57
SYS_ulimit 58
SYS_oldolduname 59
SYS_umask 60
SYS_chroot 61
SYS_ustat 62
SYS_dup2 63
SYS_getppid 64
SYS_getpgrp 65
SYS_setsid 66
SYS_sigaction 67
SYS_sgetmask 68
SYS_ssetmask 69
SYS_setreuid 70
SYS_setregid 71
SYS_sigsuspend 72
SYS_sigpending 73
SYS_sethostname 74
SYS_setrlimit 75
SYS_getrlimit 76 /* Back compatible 2Gig limited rlimit */
SYS_getrusage 77
SYS_gettimeofday 78
SYS_settimeofday 79
SYS_getgroups 80
SYS_setgroups 81
SYS_select 82
SYS_symlink 83
SYS_oldlstat 84
SYS_readlink 85
SYS_uselib 86
SYS_swapon 87
SYS_reboot 88
SYS_readdir 89
SYS_mmap 90
SYS_munmap 91
SYS_truncate 92
SYS_ftruncate 93
SYS_fchmod 94
SYS_fchown 95
SYS_getpriority 96
SYS_setpriority 97
SYS_profil 98
SYS_statfs 99
SYS_fstatfs 100
SYS_ioperm 101
SYS_socketcall 102
SYS_syslog 103
SYS_setitimer 104
SYS_getitimer 105
SYS_stat 106
SYS_lstat 107
SYS_fstat 108
SYS_olduname 109
SYS_iopl 110
SYS_vhangup 111
SYS_idle 112
SYS_vm86old 113
SYS_wait4 114
SYS_swapoff 115
SYS_sysinfo 116
SYS_ipc 117
SYS_fsync 118
SYS_sigreturn 119
SYS_clone 120
SYS_setdomainname 121
SYS_uname 122
SYS_modify_ldt 123
SYS_adjtimex 124
SYS_mprotect 125
SYS_sigprocmask 126
SYS_create_module 127
SYS_init_module 128
SYS_delete_module 129
SYS_get_kernel_syms 130
SYS_quotactl 131
SYS_getpgid 132
SYS_fchdir 133
SYS_bdflush 134
SYS_sysfs 135
SYS_personality 136
SYS_afs_syscall 137 /* Syscall for Andrew File System */
SYS_setfsuid 138
SYS_setfsgid 139
SYS__llseek 140
SYS_getdents 141
SYS__newselect 142
SYS_flock 143
SYS_msync 144
SYS_readv 145
SYS_writev 146
SYS_getsid 147
SYS_fdatasync 148
SYS__sysctl 149
SYS_mlock 150
SYS_munlock 151
SYS_mlockall 152
SYS_munlockall 153
SYS_sched_setparam 154
SYS_sched_getparam 155
SYS_sched_setscheduler 156
SYS_sched_getscheduler 157
SYS_sched_yield 158
SYS_sched_get_priority_max 159
SYS_sched_get_priority_min 160
SYS_sched_rr_get_interval 161
SYS_nanosleep 162
SYS_mremap 163
SYS_setresuid 164
SYS_getresuid 165
SYS_vm86 166
SYS_query_module 167
SYS_poll 168
SYS_nfsservctl 169
SYS_setresgid 170
SYS_getresgid 171
SYS_prctl 172
SYS_rt_sigreturn 173
SYS_rt_sigaction 174
SYS_rt_sigprocmask 175
SYS_rt_sigpending 176
SYS_rt_sigtimedwait 177
SYS_rt_sigqueueinfo 178
SYS_rt_sigsuspend 179
SYS_pread 180
SYS_pwrite 181
SYS_chown 182
SYS_getcwd 183
SYS_capget 184
SYS_capset 185
SYS_sigaltstack 186
SYS_sendfile 187
SYS_getpmsg 188 /* some people actually want streams */
SYS_putpmsg 189 /* some people actually want streams */
SYS_vfork 190
SYS_ugetrlimit 191 /* SuS compliant getrlimit */
SYS_mmap2 192
SYS_truncate64 193
SYS_ftruncate64 194
SYS_stat64 195
SYS_lstat64 196
SYS_fstat64 197
SYS_lchown32 198
SYS_getuid32 199
SYS_getgid32 200
SYS_geteuid32 201
SYS_getegid32 202
SYS_setreuid32 203
SYS_setregid32 204
SYS_getgroups32 205
SYS_setgroups32 206
SYS_fchown32 207
SYS_setresuid32 208
SYS_getresuid32 209
SYS_setresgid32 210
SYS_getresgid32 211
SYS_chown32 212
SYS_setuid32 213
SYS_setgid32 214
SYS_setfsuid32 215
SYS_setfsgid32 216
SYS_pivot_root 217
SYS_mincore 218
SYS_madvise 219
SYS_madvise1 219 /* delete when C lib stub is removed */
SYS_getdents64 220
SYS_fcntl64 221
SYS_security 223 /* syscall for security modules */
SYS_gettid 224
SYS_readahead 225
SYS_setxattr 226
SYS_lsetxattr 227
SYS_fsetxattr 228
SYS_getxattr 229
SYS_lgetxattr 230
SYS_fgetxattr 231
SYS_listxattr 232
SYS_llistxattr 233
SYS_flistxattr 234
SYS_removexattr 235
SYS_lremovexattr 236
SYS_fremovexattr 237
SYS_tkill 238
SYS_sendfile64 239
SYS_futex 240
SYS_sched_setaffinity 241
SYS_sched_getaffinity 242
SYS_set_thread_area 243
#endif /* */
They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned. Otherwise, the system call exists and the corresponding entry in the table is the memory address of the system call code.
Here is an example of an invalid system call:
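[root@test kernel]# cat no1.c
#include <linux/errno.h>
#include <sys/syscall.h>
#include <errno.h>
#include <stdio.h>

/* Issue system call number 245, which doesn't exist at this time,
   and return whatever the kernel leaves in %eax. */
int sc(void)
{
    int res;
    __asm__ __volatile__("int $0x80" : "=a"(res) : "0"(245));
    return res;
}

int main(void)
{
    errno = -sc();   /* the kernel returns -ENOSYS; negate it for perror() */
    perror("test of invalid syscall");
    return 0;
}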
[root@test kernel]# gcc no1.c
[root@test kernel]# ./a.out
test of invalid syscall: Function not implemented
[root@test kernel]# exit
When the system call does exist, control is transferred to the actual system call code, which performs whatever you requested and returns.
_system_call() then calls _ret_from_sys_call() to handle housekeeping such as pending signals and rescheduling, and ultimately returns to user memory.
What is the Kernel Symbol Table?
There is another very important point we need to understand - the Kernel Symbol Table. Take a look at /proc/ksyms. Every entry in this file represents an exported (public) Kernel Symbol, which can be accessed by our LKM. Every Symbol used in our LKM (like a function) is also exported to the public, and is also listed in that file. LKM developers are able to use the following piece of regular code to limit the exported symbols of their module:
static struct symbol_table module_syms= { /*we define our own symbol table !*/
#include /*symbols we want to export, do we ?*/
...
};
register_symtab(&module_syms); /*do the actual registration*/
As we don't want to export any symbols to the public, we can use the following construction:
register_symtab(NULL);
This line must be inserted in the init_module() function.
The int $0x80 isn't used directly for system calls; rather, libc functions, which are often wrappers to interrupt 0x80, are used.
libc generally implements the system calls using the _syscallX() macros, where X is the number of parameters for the system call.
For example, the libc entry for write(2) would be implemented with a _syscall3 macro, since the actual write(2) prototype requires 3 parameters. Before calling interrupt 0x80, the _syscallX macros are supposed to set up the stack frame and the argument list required for the system call.
Finally, when _system_call() (which is triggered with int $0x80) returns, the _syscallX() macro will check for a negative return value (in %eax) and will set errno accordingly.
Let's check another example with write(2) and see how it gets preprocessed.
[root@test kernel]# cat no2.c
#include <linux/types.h>
#include <linux/fs.h>
#include <sys/syscall.h>
#include <asm/unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <ctype.h>
_syscall3(ssize_t,write,int,fd,const void *,buf,size_t,count);
main()
{
char *t = "this is a test.\n";
write(0, t, strlen(t));
}
[root@test kernel]# gcc -E no2.c > no2.C
[root@test kernel]# indent no2.C -kr
indent:no2.C:3304: Warning: old style assignment ambiguity in "=-". Assuming "= -"
[root@test kernel]# tail -n 50 no2.C
#9 "no2.c" 2
ssize_t write(int fd, const void *buf, size_t count)
{
long __res;
__asm__ __volatile("int $0x80":"=a"(__res):"0"(4), "b"((long) (fd)), "c"((long) (buf)), "d"((long) (count)));
if (__res >= 0)
return (ssize_t) __res;
errno = -__res;
return -1;
};
main()
{
char *t = "this is a test.\n";
write(0, t, strlen(t));
}
[root@test kernel]# exit
Note that the "0"(4) in the write() function above matches the SYS_write definition in /usr/include/sys/syscall.h.
There are a few ways to make your own system calls.
For example, you could modify the kernel sources and append your own code. A far easier way, however, would be to write a loadable kernel module.
A loadable kernel module is nothing more than an object file containing code that will be dynamically linked into the kernel when it is needed.
The main purposes of this feature are to have a small kernel, and to load a given driver when it is needed with the insmod(1) command. It's also easier to write a Kernel Loadable Module than to write code in the kernel source tree.
Writing a Kernel Loadable Module
A Kernel Loadable Module is easily made in C. It contains a chunk of #defines, some functions, an initialization function called init_module(), and an unload function called cleanup_module().
LKMs can be manually loaded using insmod and they can be removed using rmmod. To unload a module, its "Usage Counter" must be 0.
Loading a module - normally restricted to root - is managed by issuing the following command:
# insmod module.o
This command forces the system to do several things. After these, the init_module system call is used for the LKM initialisation, which executes int init_module(void) and so on.
Here is a typical Kernel Loadable Module source structure:
#define MODULE
#define __KERNEL__
#define __KERNEL_SYSCALLS__
#include <linux/config.h>
#ifdef MODULE
#include <linux/module.h>
#include <linux/version.h>
#else
#define MOD_INC_USE_COUNT
#define MOD_DEC_USE_COUNT
#endif
#include <linux/types.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/errno.h>
#include <asm/segment.h>
#include <sys/syscall.h>
#include <linux/dirent.h>
#include <asm/unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <ctype.h>
int errno;
char tmp[64];
/* for example, we may need to use ioctl */
_syscall3(int, ioctl, int, d, int, request, unsigned long, arg);
int myfunction(int parm1,char *parm2)
{
int i,j,k;
/* ... */
return 0;
}
int init_module(void)
{
/* ... */
printk("\nModule loaded.\n");
return 0;
}
void cleanup_module(void)
{
/* ... */
}
Check the mandatory #defines (#define MODULE, #define __KERNEL__) and #includes (#include <linux/config.h> ...)
Also note that as our Kernel Loadable Module will be running in kernel mode, we can't use libc functions, but we can use system calls with the previously discussed _syscallX() macros.
You would compile this module with 'gcc -c -O3 module.c' and insert it into the kernel with 'insmod module.o' (optimization must be turned on).
As the title suggests, a Kernel Loadable Module can also be used to modify kernel code without having to rebuild it entirely. For example, you could patch the write(2) system call to hide portions of a given file. Seems like a good place for backdoors, too: what would you do if you couldn't trust your own kernel?
Kernel and system calls backdoors
The main idea behind this is pretty simple. We'll redirect those damn system calls to our own ones in a Kernel Loadable Module, which will enable us to force the kernel to react as we want it to. For example, we could hide a sniffer by patching the ioctl system call and masking the PROMISC bit. Lame but efficient.
To modify a given system call, just add the definition of the extern void *sys_call_table[] in your Kernel Loadable Module, and have the init_module() function modify the corresponding entry in the sys_call_table to point to your own code. The modified call can then do whatever you wish it to, call the original system call by modifying sys_call_table once more, and ...