***************************************************************************** * Kernel debugging, part 1: "Understanding what's gone wrong" * * * * Copyright (c) 2001 Daniel P. Bovet, Marco Cesati, and Cosimo Comella * * Permission is granted to copy, distribute and/or modify this document * * under the terms of the GNU Free Documentation License, Version 1.1, * * published by the Free Software Foundation; with no Invariant Sections, * * with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the * * license is included in the file named LICENSE. * * * * (version 1.0) * ***************************************************************************** Every complex program has bugs, and the Linux kernel is not an exception. However, debugging the kernel is an hard task, because: * the kernel is BIG (millions of lines) * the kernel is very complex (multithreaded, hardware-related, ...) * there is no higher program that monitors it You are usually made aware of a kernel bug when: * your User Mode program crashes, freezes, or behaves in unexpected ways * you discover a kernel oops on the console or in the log files * your system does not reply to ping, your console no longer responds ... * your cabinet smokes :-) Fixing a kernel bug involves three different steps: 1. finding where the bug showed itself 2. understanding what is wrong with the existing code 3. fixing the code As you may well suppose, steps 1 and 3 are relatively easy, while step 2 is hairy. You need to understand how the kernel is supposed to work, and how it actually works. If you finds a difference, then you have likely found a bug. Now, I have to admit that the title of this lecture is a lie: I cannot give you any help to cope with step 2. You have to study the kernel by yourself, and trying to make a sense of it. In this lecture we'll focus on step 1, namely, locating the exact instruction in the kernel source code where a particular bug showed up. Be warned that the bug is not necessarily due to a wrong instruction in the exact place where it showed up: the bug might be a design flaw, a stale pointer passed by another function, and so on. We shall proceed by examples. Example 1. The "mystery1" program hangs up. If "mystery1" is in foreground, it cannot be stopped or killed. But we can use the "ps" command to collect some information about it: $ ps u -C mystery1 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND cesati 208 0.0 0.1 1200 312 pts/1 D 18:35 0:00 ./mystery1 The "D" flag gives us two bits of information: 1. the process is in Kernel Mode 2. the process is in TASK_UNINTERRUPTIBLE state In Linux, every process alternates between User Mode and Kernel Mode. In User Mode the process executes application-specific code, while in Kernel Mode it executes kernel code. (In Intel processors, User and Kernel Modes are different hardware states encoded in the last two bits of the cs and ss segmentation registers.) While in Kernel Mode, the process can be in one of the following states: TASK_RUNNING running or waiting for the CPU TASK_INTERRUPTIBLE waiting for some event, interruptible by signals TASK_UNINTERRUPTIBLE waiting for some event, not interruptible TASK_STOPPED stopped by a signal or terminal keypress (Ctrl-Z) TASK_ZOMBIE execution terminated, waiting to be destroyed Thus, our "mystery1" program is stuck in Kernel Mode waiting for some event. Understanding what precisely the process is waiting for might clarify a lot. The "ps" command is still useful here: $ ps n -o pid,tt,user,fname,wchan -C mystery1 PID TT USER COMMAND WCHAN 208 pts/1 1000 mystery1 161dad The WCHAN field is the information we need: it denotes the "wait channel" of the process, namely, an offset of a function where the process is blocked. The "n" flag of "ps" forces numerical output, thus 161dad must be interpreted as hex number 0x161dad. How can we make sense out of this number? First of all, recall that the kernel lives in the fourth gigabyte of linear address space. The address returned by WCHAN should therefore be considered as 0xc00161dad. Now, we must know what instruction lies in RAM at that address. Association between kernel symbols and RAM addresses is mostly done in the linking phase of kernel compilation. Beside the kernel image file to be loaded in RAM, the compilation yields a file named System.map, which lists the linear addresses of all kernel symbols. When installing a new kernel, you should always save such file in a safe place. Even better, you should save this file in a place where the "ps" command may find it. IMHO, the best place is the /boot directory. Also, "ps" looks first for a file named "System.map-X.Y.Z", where "X.Y.Z" is the release of the kernel as returned by "uname -r" (first four lines of the top Makefile). Without the "n" flag, "ps" will search the System.map file and translate the address by yourself: $ ps -o pid,tt,user,fname,wchan -C mystery1 PID TT USER COMMAND WCHAN 208 pts/1 cesati mystery1 semctl_down WARNING: be sure that "ps" looks at the right System.map file before using the symbolic name! In our case, "ps" looks at the wrong file, therefore it gives a wrong result. Using the correct System.map file yields: $ ps -o pid,tt,user,fname,wchan -C mystery1 PID TT USER COMMAND WCHAN 208 pts/1 cesati mystery1 noninterruptible_delay Now, we should look up noninterruptible_delay() in the kernel source code. Either you already know the source file where the function is defined, or you have to search. I'm used to type a command like this in the top source directory: $ find . -name \*.[chS] -exec grep -l interruptible_delay {} \; ./lkhc/hack1.c ./lkhc/hack-static.c In the lkhc/hack1.c file, interruptible_delay() is invoked: ====== excerpt from lkhc/hack1.c ====== static int hack1_read(char *page, char **start, off_t off, int count, int *eof, void *data) { int written; written = sprintf(page+off, "jiffies=%ld ", jiffies); noninterruptible_delay(1); /* sleep one second */ written += sprintf(page+off, "%ld\n", jiffies); return written; } ======================================= In the lkhc/hack-static.c file, interruptible_delay() is defined as: ====== excerpt from lkhc/hack-static.c ====== void noninterruptible_delay(int seconds) { DECLARE_WAIT_QUEUE_HEAD(wait); sleep_on_timeout(&wait, seconds*HZ*1000); } ============================================= We eventually found where our process is blocked: the sleep_on_timeout() function, which puts the current process in TASK_UNINTERRUPTIBLE state until either it is awakened by another process, or a timeout expires. Now step 2 can start: how is the kernel behaving? From its definition, it is quite clear that the parameter "seconds" represents the timeout in seconds. However, there is a mistake in the invocation of sleep_on_timeout(), because this function expects a time interval in ticks. Step 3 is also trivial: just replace the wrong line with: sleep_on_timeout(&wait, seconds*HZ); Let's try a slightly different example. Example 2. The "mystery2" program hangs up. When applying our "ps" command, we get a weird result: $ ps -o pid,tt,user,fname,wchan -C mystery2 PID TT USER COMMAND WCHAN 10687 pts/3 cesati mystery2 end The symbol "end" is too vague to be a real symbol name, and actually it corresponds to the linker symbol denoting the end of the kernel code. Something's going wrong. Let's try "ps" with numerical output: $ ps n -o pid,tt,user,fname,wchan -C mystery2 PID TT USER COMMAND WCHAN 10687 pts/3 1000 mystery2 5509d Now, if we try to interpret 5509d as the linear address 0xc005509d, we fail: looking at the Symbol.map file, we don't find it. As you have already guessed, the wait channel falls inside a module. It is easy to understand that the System.map file cannot hold the linear addresses of symbols defined inside modules, because any module is relocated when it is loaded at run-time. Thus, we need another source of information: /proc/ksyms virtual file to the rescue! The /proc/ksyms file has not been designed for debugging. It is mainly used when loading a module, since it includes the linear addresses of all kernel symbols (both from static code and loaded modules) that are exported. For example, consider our hack1.c module, that refers to the jiffies global variable: the /proc/ksyms file includes the linear address of jiffies: $ grep jiffies /proc/ksyms c02559e4 jiffies The insmod program looks into /proc/ksyms, fetches the address, and replace any occurrence of jiffies in hack1.o when preparing the relocated executable code of the module. This discussion suggests that /proc/ksyms does not include the address of every function of every module. In fact: $ cat /proc/ksyms cd055000 __insmod_hack2_O/home/cesati/LKHC/linux-2.4.16-lkhc1/lkhc/hack2.o_M3C1659AA_V132112 [hack2] cd055060 __insmod_hack2_S.text_L208 [hack2] cd055130 __insmod_hack2_S.rodata_L24 [hack2] cd0551f4 __insmod_hack2_S.bss_L4 [hack2] cd053000 __insmod_hack1_O/home/cesati/LKHC/linux-2.4.16-lkhc1/lkhc/hack1.o_M3C164209_V132112 [hack1] cd053060 __insmod_hack1_S.text_L144 [hack1] cd0530f0 __insmod_hack1_S.rodata_L24 [hack1] cd0531b4 __insmod_hack1_S.bss_L4 [hack1] cd046c88 pcmcia_lookup_bus [pcmcia_core] cd0465e4 pcmcia_access_configuration_register [pcmcia_core] cd044d8c pcmcia_adjust_resource_info [pcmcia_core] cd0466a0 pcmcia_bind_device [pcmcia_core] cd04676c pcmcia_bind_mtd [pcmcia_core] cd045830 pcmcia_check_erase_queue [pcmcia_core] cd0458f8 pcmcia_close_memory [pcmcia_core] (snipped) The hack1 and hack2 modules don't export any symbol. Fortunately, the insmod program produces a few symbols that mark the beginning and the end of the kernel address space allocated to any loaded module. In our example, the code of module hack1 starts at linear address 0xcd053060 (text section), and the code of module hack2 starts at 0xcd055060. These symbols allow us to make sense of the number 0x5509d returned by "ps": it looks like an offset inside the code of hack2. (Short digression: you might wonder why statically compiled symbols start at 0xc0000000, while module symbols start at 0xcd000000. The kernel uses the initial portion of the fourth gigabyte of linear address space to linearly map all physical RAM in the system; the statically compiled kernel "lives" here. The rest of the fourth gigabyte is used by the kernel to establish temporary mappings; modules are loaded using these temporary mappings.) Now, we would like to establish the exact function inside the hack2 module that corresponds to 0xcd05509d. The module starts at 0xcd055060, thus the instruction is at offset 0xcd05509d-0xcd055060=0x3d=61. In order to get the Assembly code of the module, let's use the objdump command: $ objdump -d hack2.o hack2.o: file format elf32-i386 Disassembly of section .text: 0000000000000000 <_noninterruptible_delay>: 0: 83 ec 18 sub $0x18,%esp 3: 53 push %ebx 4: 8b 4c 24 20 mov 0x20(%esp,1),%ecx 8: c7 44 24 04 00 00 00 movl $0x0,0x4(%esp,1) f: 00 10: 8b 44 24 04 mov 0x4(%esp,1),%eax 14: 8d 54 24 14 lea 0x14(%esp,1),%edx 18: 89 54 24 08 mov %edx,0x8(%esp,1) 1c: 89 54 24 0c mov %edx,0xc(%esp,1) 20: 89 54 24 14 mov %edx,0x14(%esp,1) 24: 89 54 24 18 mov %edx,0x18(%esp,1) 28: 69 d1 a0 86 01 00 imul $0x186a0,%ecx,%edx 2e: 8d 5c 24 10 lea 0x10(%esp,1),%ebx 32: 89 44 24 10 mov %eax,0x10(%esp,1) 36: 89 d8 mov %ebx,%eax 38: e8 fc ff ff ff call 39 <_noninterruptible_delay+0x39> 3d: 5b pop %ebx 3e: 83 c4 18 add $0x18,%esp 41: c3 ret 42: 89 f6 mov %esi,%esi 0000000000000044 : 44: 6a 00 push $0x0 46: 68 a4 01 00 00 push $0x1a4 4b: 68 00 00 00 00 push $0x0 50: e8 fc ff ff ff call 51 55: a3 00 00 00 00 mov %eax,0x0 5a: 83 c4 0c add $0xc,%esp (snipped) The instruction precedings those starting at offset 0x3d is where our process is blocked. As you see, the instruction is inside the _uninterruptible_timeout function, and it is a call to an external function whose symbol is yet to be resolved by insmod. If things are not clear, you can also compare the output of objdump with the Assembly code generated by the compiler. To generate the Assembly file, go into the top source directory and give the command: # make lkhc/hack2.s Assembly files have suffix ".s". In our case it contains: ====== excerpt from lkhc/hack2.s ====== .file "hack2.c" .version "01.01" gcc2_compiled.: .text .align 4 .type _noninterruptible_delay,@function _noninterruptible_delay: subl $24,%esp pushl %ebx movl 32(%esp),%ecx movl $0,4(%esp) movl 4(%esp),%eax leal 20(%esp),%edx movl %edx,8(%esp) movl %edx,12(%esp) movl %edx,20(%esp) movl %edx,24(%esp) imull $100000,%ecx,%edx leal 16(%esp),%ebx movl %eax,16(%esp) movl %ebx,%eax call sleep_on_timeout popl %ebx addl $24,%esp ret .Lfe1: .size _noninterruptible_delay,.Lfe1-_noninterruptible_delay .section .rodata .LC11: (snipped) Thus, the problem is again the invocation of the sleep_on_timeout() function. We look in the source code of the module, and we discover the same problem as in the previous example: ====== excerpt from lkhc/hack2.c ====== static void _noninterruptible_delay(int seconds) { DECLARE_WAIT_QUEUE_HEAD(wait); sleep_on_timeout(&wait, seconds*HZ*1000); } (A hint that may save you some trouble: remember to remove your Assembly file from the kernel source code tree before recompiling, otherwise the make program will use it instead of the corresponding C file.) Example 3. The "mystery3" program fails. The "mystery3" program fails with a "Segmentation fault" message. This message is usually due to a system call that returned the -EFAULT error code, but it could also be raised by a uncaught SIGSEGV signal sent to the process. If we have the source code of the failing program, we could easily understand what happened. Things are a bit harder if the source code of the program is not available. Let's suppose we don't have the source code of mystery3. In order to understand what happens, we must trace the system calls invoked by the program by means of the "strace" command: $ strace mystery3 execve("./mystery3", ["./mystery3"], [/* 36 vars */]) = 0 brk(0) = 0x80496c8 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/opt/kde/lib/i686/mmx/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64(0xbfffedbc, 0xbfffee14) = -1 ENOENT (No such file or directory) open("/opt/kde/lib/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64(0xbfffedbc, 0xbfffee14) = -1 ENOENT (No such file or directory) open("/opt/kde/lib/mmx/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64(0xbfffedbc, 0xbfffee14) = -1 ENOENT (No such file or directory) open("/opt/kde/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64(0xbfffedbc, 0xbfffee14) = 0 open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(0x3, 0xbfffedc4) = 0 old_mmap(NULL, 37203, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\264\323"..., 1024) = 1024 fstat64(0x3, 0xbfffee0c) = 0 old_mmap(NULL, 1116516, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40021000 mprotect(0x40128000, 39268, PROT_NONE) = 0 old_mmap(0x40128000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x106000) = 0x40128000 old_mmap(0x4012e000, 14692, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4012e000 close(3) = 0 munmap(0x40017000, 37203) = 0 getpid() = 220 open("/proc/hack3", O_RDONLY) = 3 read(3, "jiffies=14462\n", 128) = 14 fstat64(0x1, 0xbffff20c +++ killed by SIGSEGV +++ The first system calls are issued by the ELF program interpreter that sets up the execution context of the program. We are interested in the last three system calls. The program successfully open the /proc/hack3 virtual file (file descriptor 3). Then it read 128 bytes from the file. The read() system call fills the program buffer with the string "jiffies=14462\n" (length = 14 bytes). Then the program invokes the fstat64() system call, but this system call never terminates: the process receives a SIGSEGV signal and aborts. The SIGSEGV signal is sent either by a User Mode process, or by the kernel when a serious error occurs. In this second case, the kernel builds a so-called "kernel Oops". A kernel Oops is a short message that describes the erroneous condition. There are several ways to read the message (and you often have to try many of them): 1. The Oops appears on the system console, or on a terminal that acts as the console. This is not very convenient for you, because console messages scrolls as new messages arrive, and the Oops is eventually lost. 2. The Oops can be read in the /proc/kmsg virtual file. Usually a system process named klogd continuously read from this file, and sends its content to another system process called syslogd. In turn, syslogd could write the message in one of the system log files in /var/log. 3. If the klogd process is not running, you could read the /proc/kmsg file by yourself. 4. The kernel messages (including Oops) can also be read with the dmesg command. This is my preferred method. Be aware, however, that the kernel message buffer usually contains only 16KB, thus the kernel messages are cyclically discarded. Let's use the dmesg command to look for the Oops: $ dmesg (snipped) Unable to handle kernel NULL pointer dereference at virtual address 00000008 printing eip: c012ecc9 *pde = 00000000 Oops: 0000 CPU: 0 EIP: 0010:[] Not tainted EFLAGS: 00010246 eax: 00000000 ebx: c8840000 ecx: 00000001 edx: 00000000 esi: fffffff7 edi: 00000001 ebp: bffff1ac esp: c8841fb0 ds: 0018 es: 0018 ss: 0018 Process mystery3 (pid: 217, stackpage=c8841000) Stack: c0134065 c8840000 40131170 00000001 c0106b0b 00000001 bffff20c 4012d760 40131170 00000001 bffff1ac 000000c5 0000002b 0000002b 000000c5 400df05c 00000023 00000212 bffff140 0000002b Call Trace: [] [] Code: 3b 48 08 73 06 8b 40 14 8b 14 88 85 d2 74 03 ff 42 14 89 d0 As you see, there is lot of information here. First of all, a description of the error: "Unable to handle kernel NULL pointer dereference at virtual address XXX". This is equivalent to state that the kernel tried to access an object addressed by a null pointer. The CPU raised an exception ("General protection fault"), and the corresponding exception handler established that the fault was due to a kernel bug. The Oops then shows the content of the eip register (the Program Counter) when the exception was raised. Finally, this is something we can understand. We could look at the System.map file, and search the corresponding function: ====== excerpt from System.map ====== 00000000c012ecb8 T fget 00000000c012ece0 T put_filp ===================================== Thus, we have a function name, fget(), and an offset inside its code: 0xc012ecc9 - 0xc012ecb8 = 0x11 = 17 Next steps are straightforward: we search fget() inside the kernel source code with the find command. It turns out to be defined in the fs/file_table.c file: ====== excerpt from fs/file_table.c ====== struct file * fget(unsigned int fd) { struct file * file; struct files_struct *files = current->files; read_lock(&files->file_lock); file = fcheck(fd); if (file) get_file(file); read_unlock(&files->file_lock); return file; } =========================================== Now we have to locate the faulting instruction. By invoking objdump on the correspoding object file we get: $ objdump -d fs/file_table.o (snipped) 0000000000000248 : 248: 89 c1 mov %eax,%ecx 24a: b8 00 e0 ff ff mov $0xffffe000,%eax 24f: 21 e0 and %esp,%eax 251: 8b 80 54 05 00 00 mov 0x554(%eax),%eax 257: 31 d2 xor %edx,%edx 259: 3b 48 08 cmp 0x8(%eax),%ecx 25c: 73 06 jae 264 25e: 8b 40 14 mov 0x14(%eax),%eax 261: 8b 14 88 mov (%eax,%ecx,4),%edx 264: 85 d2 test %edx,%edx 266: 74 03 je 26b 268: ff 42 14 incl 0x14(%edx) 26b: 89 d0 mov %edx,%eax 26d: c3 ret 26e: 89 f6 mov %esi,%esi (snipped) Linus suggests using the GNU debugger on the uncompressed kernel image: $ gdb linux-2.4.16-lkhc/vmlinux GNU gdb 5.0 Copyright 2000 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-slackware-linux"...(no debugging symbols found)... (gdb) disassemble fget Dump of assembler code for function fget: 0xc012ecb8 : mov %eax,%ecx 0xc012ecba : mov $0xffffe000,%eax 0xc012ecbf : and %esp,%eax 0xc012ecc1 : mov 0x554(%eax),%eax 0xc012ecc7 : xor %edx,%edx 0xc012ecc9 : cmp 0x8(%eax),%ecx 0xc012eccc : jae 0xc012ecd4 0xc012ecce : mov 0x14(%eax),%eax 0xc012ecd1 : mov (%eax,%ecx,4),%edx 0xc012ecd4 : test %edx,%edx 0xc012ecd6 : je 0xc012ecdb 0xc012ecd8 : incl 0x14(%edx) 0xc012ecdb : mov %edx,%eax 0xc012ecdd : ret 0xc012ecde : mov %esi,%esi End of assembler dump. (gdb) quit The instruction at offset 17 is "cmp 0x8(%eax),%ecx". (This is also confirmed by the Code section in our Oops, which shows the bytes of the machine code of the faulting instruction.) The Assembly instruction addresses the data structure pointed to by the eax register, and store the field at offset 8 in the ecx register. The source code of the function suggests that eax should contain the content of the files field of the current process descriptor (at offset 0x554); 8 is the offset of the file_lock field inside the data structure pointed to by files. Now, we have to consider another precious information coming from the Oops: the dump of all register contents. In particular, before executing the faulting instruction, %eax is set to 0. This causes the "NULL pointer dereference", the general protection fault, the exception handler, the SIGSEGV signal sent to mystery3, and so on. However, the crucial point is: what is the bug? Of course, the Oops could be avoided by simply adding a check on the files field before using its value. In this way, however, the bug is not fixed, it is just hidden. Step 2 of our bug fixing procedure starts here: is it conceivable that the files field is NULL? A basic understanding of the kernel is enough to establish that files should definitevely be not null here. In fact, the field stores a pointer to the table of opened file of the process; moreover, the fget() function is invoked to lookup a particular opened file specified by its file descriptor. Clearly, the bug is not in the fget() function itself. Most likely, the bug has been introduced in some function invoked before fget(). The Oops comes once again to our rescue, because it shows a dump of the kernel stack when the exception was raised. Even better, the "call trace" section lists the addresses on the stack corresponding to the functions, discarding the function arguments: Call Trace: [] [] Looking into the System.map file we find that 0xc0134065 corresponds to sys_fstat64(), while 0xc0106b0b is inside the system_call() function. In this particular case, the call trace section is not particularly rewarding: the Oops occurred while processing the fstat64() system call, as we already know. Searching the function and looking at it doesn't help, either: ====== excerpt from fs/stat.c ====== asmlinkage long sys_fstat64(unsigned long fd, struct stat64 * statbuf, long flags) { struct file * f; int err = -EBADF; f = fget(fd); if (f) { struct dentry * dentry = f->f_dentry; err = do_revalidate(dentry); if (!err) err = cp_new_stat64(dentry->d_inode, statbuf); fput(f); } return err; } =================================== The function doesn't touch the files field. Therefore, the bug has been introduced in a previous kernel function, most likely in the previous read() system call. Unfortunately, the kernel stack is cleared when returning in User Mode, thus the Oops cannot show the information relative to the read() system call execution. However, the strace command told us that the read() was reading from the /proc/hack3 virtual file. We can thus dig in the kernel source code looking for the function that implements /proc/hack3. We easily find the lkhc/hack3.c source file, and discover the buggy instruction: ====== excerpt from lkhc/hack3.c ====== static int hack3_read(char *page, char **start, off_t off, int count, int *eof, void *data) { int written; current->files = NULL; written = sprintf(page+off, "jiffies=%d\n", jiffies); return written; } ======================================= Eureka! (A final note on kernel Oops. A small program named ksymoops fetches several bits of information from the System.map file, the uncompressed kernel image, the /proc/ksyms file, and so on, and then interprets the Oops for you. The home page of the program is listed in the scripts/ksymoops directory of the Linux source code.)