Debugging

GNU debugger

For an introduction to gdb, check out RMS's gdb Tutorial (note that this is not "the" RMS). Pay particular attention to the example debugging sessions.

Compile with the flags -g3 -gdwarf-2 so that macro definitions are included in the debugging information as well. Remove any optimization flags such as -O2.

An overview of commands:

n                   Next line, stepping over function calls
s                   Step into
fin                 Finish current function (step out)
p var               Print variable
p macro             Print macro, if compiled with -g3 -gdwarf-2
p *var              Print what var points to; to print a string (defined as char*), omit the *
p/x var             Print variable in hexadecimal
p/x *var@length     Print length elements starting at pointer var in hexadecimal (for example, a char buffer)
set var name=value  Modify variable name
l                   List source code around the current line
l start,end         List source code from line start to line end; both parameters are optional
c                   Continue running
f                   Show the currently selected stack frame, including the current line
bt                  Show backtrace
disp var            Like print, but repeated after every step
i macro macroname   Print information about macro macroname
i args              Print arguments of current function
i locals            Print all local variables
x/3tb ptr           Examine memory at address ptr, print 3 bytes in binary
x/3xb ptr           The same, except print 3 bytes in hex
x/60cb ptr          The same, except print 60 bytes as characters, including their ASCII values
j line              Jump straight to line

Breakpoint commands:

b main              Set breakpoint at function main
b blah.c:88         Set breakpoint at file blah.c, line 88
b 7 if var==0       Break at line 7 if variable var equals zero
i b                 List all breakpoints
d n                 Delete breakpoint n

GDB options

Some useful options:

listsize count          Number of lines shown when giving the l (list) command
history save on         Save the command history when exiting
follow-fork-mode mode   Choose which process to debug after a fork: mode is parent (the default) or child; older GDB versions also documented ask

Put options in $HOME/.gdbinit, one per line. Each line has the form below, for example set listsize 20:

  set option param1

Controlling core dumping

When a segfault occurs, your program quits immediately. To find the cause, it's useful to let the program "dump core". The core file can then be loaded into the debugger to see the "backtrace", i.e. the stack of function calls that led to the crash. To enable core dumps, compile your program with debugging options (-g) and enter the following command either on the command line or in your shell startup file:

  ulimit -c unlimited

This removes the limit on the size of the core file. But be careful: core dumps with full debugging information can easily add up in size. An example of a few core dumps:

 -rw-------  1 test users  56M 2007-05-24 22:40 /tmp/core_srv_tscu_29328
 -rw-------  1 test users  57M 2007-06-04 18:07 /tmp/core_srv_tscu_4102
 -rw-------  1 test users  57M 2007-06-05 20:48 /tmp/core_srv_tscu_5266

Particularly pay attention to the 5th column, the file size.

When the program is run and the segfault occurs, a file called core.pid (where pid is the process ID) is created in the directory where the executable was started. To change this behaviour, edit /etc/sysctl.conf and add a line like:

  kernel.core_pattern=/tmp/core_%e_%p

Check man proc for more options. After doing a sysctl -p as root, all core files will be written to /tmp with a name like core_executablename_pid. To see the current setting:

  $ sysctl kernel.core_pattern

You can test this as follows:

  $ cat

Now press CTRL-\ (Ctrl and backslash).

  Quit (core dumped)

If that doesn't work:

  $ cat &
  $ kill -3 $!

Examining core dumps

Start the program in the debugger and load the core file. Then give the backtrace command to show what it was doing at the time of the crash:

  $ gdb ./srv_dlr
  Using host libthread_db library "/lib/tls/libthread_db.so.1".
  (gdb) core core.3902
  (gdb) bt
  #0  memcpy () from /lib/libc.so.6
  #1  dlrdb_query (p=, seq_flag=1, max_id=694) at dlrdb.c:227
  #2  dlr_packet_submit (p=, seq_flag=1, answer=1, nhk_avail=) at dlr.c:175
  #3  dlr_packet (opts=, nhk_avail=, p=) at dlr.c:236
  #4  process_packet (opts=, nhk_avail=, p=) at process.c:93
  #5  run_server (opts=) at server.c:428
  #6  tuce_srv (opts=) at server.c:580
  #7  main (argc=2, argv=) at main.c:180

You can now jump to the frame in the backtrace that you suspect is the culprit. Since we can be confident that libc's memcpy isn't the problem itself, we check frame 1:

  (gdb) f 1
  #1  dlrdb_query (p=, seq_flag=1, max_id=694) at dlrdb.c:227
  227                         memcpy((void *)p_ptr, (void *)bufptr, PRIM_HDRLEN + length + 1);
  (gdb)

We can now see what the contents of the variables were before we experienced the segmentation violation, but first you'll want to give the 'l' list command to see the context of the current source code line.

If you type bt and just get a listing as follows:

 (gdb) bt
 #0  ?? ()
 #1  ?? ()
 #2  ?? ()
 ... more lines ...

Then either you didn't compile with debugging flags, or you forgot to pass the original executable when you started gdb.

Stray core files

If you have a core file lying around and you don't know where it came from, use the file utility:

  $ file core
  core: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'telisky'

Reproducing segmentation faults

If you can reproduce the issue, you can also run the program through gdb. Any signals will be caught and can then be examined. For instance, the program 'srv_dlr' generated a segmentation fault upon startup:

  $ gdb ./srv_dlr
  Using host libthread_db library "/lib/libthread_db.so.1".
  (gdb)

Run the program. In this case, the segmentation fault immediately happens:

  (gdb) r
  Starting program: src/telis/tuce/server/srv_dlr
  [Thread debugging using libthread_db enabled]
  [New Thread 16384 (LWP 11573)]
  
  Program received signal SIGSEGV, Segmentation fault.
  [Switching to Thread 16384 (LWP 11573)]
  0x401afbce in ____strtol_l_internal () from /lib/libc.so.6
  (gdb)

The segmentation fault just happened. Let's see where it occurred with the backtrace command:

  (gdb) bt
  #0  0x401afbce in ____strtol_l_internal () from /lib/libc.so.6
  #1  0x401af90a in __strtol_internal () from /lib/libc.so.6
  #2  0x401ad226 in atoi () from /lib/libc.so.6
  #3  0x08051c4e in dlrdb_init () at dlrdb.c:45
  #4  0x0805238f in dlr_initialize () at dlr.c:38
  #5  0x0804b727 in process_init (opts=0xbffff470) at process.c:43
  #6  0x0804b58b in tuce_srv (opts=0xbffff470) at server.c:563
  #7  0x0804a3e8 in main (argc=1, argv=0xbffff514) at main.c:180

We can probably trust GNU's libc, so let's check out frame number three:

  (gdb) f 3
  #3  0x08051c4e in dlrdb_init () at dlrdb.c:45
  45                      actual_id = atoi(row[0]);
  (gdb)

We can examine the variables here:

  (gdb) p row[0]
  $1 = 0x0
  (gdb)

That wasn't the expected content of variable row[0]. Now we can examine why this variable wasn't properly filled.

What's it doing?

You're taking over a codebase. You compile, run the binary and nothing happens. It just sits there, waiting for something... If you want to know what it's doing, start gdb program_name. Type 'r' and, once the program has started, press CTRL-C and type 'bt'.

This will show the current call stack. You can now see what function the program is stuck in. When you're done examining, type 'c' plus Enter. The program will continue running.

Multithreaded programming and debugging

To debug a multithreaded program, the following commands are useful:

i th         List all threads
thr number   Switch to thread number (now you can do a backtrace, set breakpoints, step, etc.)

Useful links

Find code which uses memory returned by malloc without initializing it and code which uses a pointer after it is freed