Debugging the debugger

While working on our high-performance serverless platform rFaaS, I stumbled upon a curious bug in the GNU debugger gdb. In rFaaS, we allow clients to submit the function code directly to a remote executor by shipping the contents of the shared library across the network. This works quite well for simple functions in homogeneous HPC clusters since there are no issues with binary compatibility.

Sometimes, the function does not execute correctly even if it works when called directly from the application. There might be a bug in serializing function arguments, deserializing binary data in a function, or maybe an unnoticed dependency on the shared state. The easiest way to debug the issue or an unexpected crash is to use the debugger. However, in this case, we observed a very unusual behavior - gdb always hangs! Thus, we’re going to look at how to reproduce this issue and find a possible cause.

Since rFaaS spawns new processes to execute the function, we attached gdb to the process running the function. However, this step is unnecessary to reproduce the issue, and we will skip the attaching from this point on.

Let’s take a look at the following code that reads a shared library, extracts a function foo that accepts a single integer and returns an integer, and executes the function.


void* library_handle = dlopen("./lib.so", RTLD_NOW);
assert(library_handle);

typedef int (*func_t)(int);
func_t func = dlsym(library_handle, "foo");
assert(func);

func(42);

dlclose(library_handle);
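When a function misbehaves only after being shipped, it helps to know whether loading or symbol lookup failed before suspecting the function itself. The following is a minimal sketch of the same steps with error reporting through dlerror instead of assert. The helper name call_int_function is hypothetical and not part of rFaaS.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

typedef int (*func_t)(int);

// Hypothetical helper: loads `path`, looks up `symbol`, and calls it with `arg`.
// Returns 0 and stores the result on success, -1 if loading or lookup fails.
static int call_int_function(const char* path, const char* symbol, int arg, int* result)
{
  void* handle = dlopen(path, RTLD_NOW);
  if (!handle) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return -1;
  }

  dlerror(); // clear any stale error message before calling dlsym
  func_t func = (func_t) dlsym(handle, symbol);
  const char* err = dlerror();
  if (err || !func) {
    fprintf(stderr, "dlsym failed: %s\n", err ? err : "symbol resolved to NULL");
    dlclose(handle);
    return -1;
  }

  *result = func(arg);
  dlclose(handle);
  return 0;
}
```

Calling call_int_function("./lib.so", "foo", 42, &result) mirrors the snippet above. On older glibc versions the program has to be linked with -ldl.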

When the shared library is transmitted over the network, it would be wasteful to write the contents to a file and read them again. Instead, we can create an anonymous in-memory file with memfd_create, map it into our address space, and store the data there. In rFaaS, the data is transmitted over RDMA directly to that memory location. For simplicity, here we replace the transfer by copying the data from a file into the memory buffer.

const char* path = "lib.so";
// Receive code information
FILE* file = fopen(path, "rb");
assert(file);
fseek(file, 0, SEEK_END);
size_t size = ftell(file);
rewind(file);

// memfd_create returns -1 on failure; 0 is a valid descriptor
int fd = memfd_create("libfunction", 0);
assert(fd >= 0);
int ret = ftruncate(fd, size);
assert(ret == 0);

// mmap reports failure with MAP_FAILED, not NULL
void* memory_handle = mmap(NULL, size, PROT_WRITE, MAP_SHARED, fd, 0);
assert(memory_handle != MAP_FAILED);
size_t bytes_read = fread(memory_handle, 1, size, file);
assert(bytes_read == size);
fclose(file);

char buf[32];
snprintf(buf, 32, "%s%d", "/proc/self/fd/", fd);
void* library_handle = dlopen(buf, RTLD_NOW);

This code works fine as well - the library_handle can be used identically as in the previous code snippet.
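The steps above can also be packed into a small reusable helper. The sketch below is an illustration, not the rFaaS implementation: it fills the in-memory file with write instead of mmap plus fread, and the names memfd_from_buffer and memfd_path are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: copies `size` bytes from `data` into a new anonymous
// in-memory file and returns its descriptor, or -1 on failure.
static int memfd_from_buffer(const char* name, const void* data, size_t size)
{
  int fd = memfd_create(name, 0);
  if (fd == -1)
    return -1;

  const char* cursor = data;
  size_t remaining = size;
  while (remaining > 0) {
    // write may store fewer bytes than requested, so loop until done
    ssize_t written = write(fd, cursor, remaining);
    if (written <= 0) {
      close(fd);
      return -1;
    }
    cursor += written;
    remaining -= (size_t) written;
  }
  return fd;
}

// Builds the /proc/self/fd/<fd> path that can be passed to dlopen.
// Returns 0 on success, -1 if the buffer is too small.
static int memfd_path(int fd, char* buf, size_t len)
{
  int n = snprintf(buf, len, "/proc/self/fd/%d", fd);
  return (n > 0 && (size_t) n < len) ? 0 : -1;
}
```

The path produced by memfd_path can then be handed to dlopen exactly as in the snippet above.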

Let’s assume that our function experiences some issues, and we want to find the root cause. For a spawned process, we could insert an artificial loop waiting on a test variable, attach the debugger to the process, set up breakpoints as desired, and change the variable value to continue execution. To simplify the discussion, we will execute the code directly from the main application and skip the process spawn.
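For reference, the artificial wait loop mentioned above might look like the following sketch. The helper name wait_for_debugger and the polling limit are assumptions made for illustration; the limit only exists so that a forgotten loop does not block the executor forever.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical helper: checks `*flag` once per second until it becomes
// non-zero or `max_polls` checks have passed. Returns 1 if the flag was set.
static int wait_for_debugger(volatile int* flag, int max_polls)
{
  // Print the pid so that the debugger can be attached with `gdb -p <pid>`
  fprintf(stderr, "pid %d waiting for debugger\n", (int) getpid());
  for (int i = 0; i < max_polls; ++i) {
    if (*flag)
      return 1;
    sleep(1);
  }
  return *flag != 0;
}
```

After attaching, the flag can be changed from within gdb, for example with `set var *flag = 1` in the frame of wait_for_debugger, followed by `continue`.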

~ gdb ./from_memory 
GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./from_memory...
(gdb) r
Starting program: /home/mcopik/bug_report/from_memory 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Here we observe the issue: gdb hangs. It does not respond to any commands, it does not respond to signals, and we need to use SIGKILL to terminate the debugging session. But what can be going wrong in this example? Can it be caused just by our memory-mapped file? Let’s attach a second gdb to the frozen gdb instance to find out what might be happening.

Attaching to process 546764
[New LWP 546767]
[New LWP 546768]
[New LWP 546769]
[New LWP 546770]
[New LWP 546771]
[New LWP 546772]
[New LWP 546773]
[New LWP 546774]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__GI___libc_read (nbytes=4096, buf=0x564e51949e30, fd=15) at ../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  __GI___libc_read (nbytes=4096, buf=0x564e51949e30, fd=15) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=15, buf=0x564e51949e30, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f69046c3cb6 in _IO_new_file_underflow (fp=0x564e517345d0) at ./libio/libioP.h:947
#3  0x00007f69046c24b8 in __GI__IO_file_xsgetn (fp=0x564e517345d0, data=<optimized out>, n=64) at ./libio/fileops.c:1321
#4  0x00007f69046b6c29 in __GI__IO_fread (buf=0x7ffc6e69b6f0, size=1, count=64, fp=0x564e517345d0) at ./libio/iofread.c:38
#5  0x0000564e4f0c833e in ?? ()
#6  0x0000564e4f0c842a in ?? ()
#7  0x0000564e4f0c7224 in ?? ()
#8  0x0000564e4f0f338b in ?? ()
#9  0x0000564e4f0cc08a in ?? ()
#10 0x0000564e4f0cb946 in ?? ()
#11 0x0000564e4efb9bae in ?? ()
#12 0x0000564e4efb8c97 in ?? ()
#13 0x0000564e4efbac77 in ?? ()
#14 0x0000564e4efbb6cb in ?? ()
#15 0x0000564e4efbb923 in ?? ()
#16 0x0000564e4ecf1cc5 in ?? ()
#17 0x0000564e4ee79d96 in ?? ()
#18 0x0000564e4ee7bba3 in ?? ()
#19 0x0000564e4ee7d5c1 in ?? ()
#20 0x0000564e4f1b6576 in ?? ()
#21 0x0000564e4f1b6a5a in ?? ()
#22 0x0000564e4eec227d in ?? ()
#23 0x0000564e4eec3f65 in ?? ()
#24 0x0000564e4ec5a150 in ?? ()
#25 0x00007f6904660d90 in __libc_start_call_main (main=main@entry=0x564e4ec5a110, argc=argc@entry=2, argv=argv@entry=0x7ffc6e69c568)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#26 0x00007f6904660e40 in __libc_start_main_impl (main=0x564e4ec5a110, argc=2, argv=0x7ffc6e69c568, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffc6e69c558) at ../csu/libc-start.c:392
#27 0x0000564e4ec5fbf5 in ?? ()

Stack frame #4 shows that gdb is blocked in an fread call that attempts to read some file data, but the frames belonging to gdb itself are unresolved because this build carries no debug symbols. Since I first noticed this issue with gdb version 12.0.90-0ubuntu1, I manually built the newest version, 12.1. The entire process is straightforward: run ./configure and make -j${CPUS}. The problem persists, but we can now use a debug build of gdb to get more insight.

Attaching to process 573765
[New LWP 573767]
[New LWP 573768]
[New LWP 573769]
[New LWP 573770]
[New LWP 573771]
[New LWP 573772]
[New LWP 573773]
[New LWP 573774]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__GI___libc_read (nbytes=4096, buf=0x558434f1b220, fd=15) at ../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  __GI___libc_read (nbytes=4096, buf=0x558434f1b220, fd=15) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=15, buf=0x558434f1b220, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f82b8697cb6 in _IO_new_file_underflow (fp=0x558434e4f2c0) at ./libio/libioP.h:947
#3  0x00007f82b86964b8 in __GI__IO_file_xsgetn (fp=0x558434e4f2c0, data=<optimized out>, n=64) at ./libio/fileops.c:1321
#4  0x00007f82b868ac29 in __GI__IO_fread (buf=buf@entry=0x7ffc32027bb0, size=size@entry=1, count=count@entry=64, fp=fp@entry=0x558434e4f2c0)
    at ./libio/iofread.c:38
#5  0x000055843364f53e in fread (__stream=0x558434e4f2c0, __n=64, __size=1, __ptr=0x7ffc32027bb0) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:293
#6  cache_bread_1 (nbytes=64, buf=0x7ffc32027bb0, f=0x558434e4f2c0) at cache.c:319
#7  cache_bread (abfd=<optimized out>, buf=0x7ffc32027bb0, nbytes=64) at cache.c:358
#8  0x000055843364e564 in bfd_bread (ptr=ptr@entry=0x7ffc32027bb0, size=<optimized out>, size@entry=64, abfd=<optimized out>, abfd@entry=0x558434f0f210)
    at bfdio.c:259
#9  0x000055843366ca53 in bfd_elf64_object_p (abfd=0x558434f0f210) at /home/mcopik/bug_report/build/gdb-12.1/bfd/elfcode.h:519
#10 0x000055843365199c in bfd_check_format_matches (abfd=0x558434f0f210, format=<optimized out>, matching=0x0) at format.c:344
#11 0x000055843351edfe in solib_bfd_open (pathname=0x558434e90200 "/proc/self/fd/4") at ./../gdbsupport/gdb_ref_ptr.h:130
#12 0x000055843351dee7 in solib_map_sections (so=0x558434f0ff00) at solib.c:540
#13 0x000055843351fe56 in update_solib_list (from_tty=<optimized out>) at solib.c:860
#14 0x0000558433520877 in solib_add (pattern=pattern@entry=0x0, from_tty=from_tty@entry=0, readsyms=1) at solib.c:960
#15 0x0000558433520b00 in handle_solib_event () at solib.c:1269
#16 0x0000558433252165 in bpstat_stop_status (aspace=<optimized out>, bp_addr=bp_addr@entry=140737353900800, thread=thread@entry=0x558434ddf090, ws=..., 
    stop_chain=stop_chain@entry=0x0) at breakpoint.c:5455
#17 0x00005584333dbc8b in handle_signal_stop (ecs=0x7ffc32028700) at infrun.c:6191
#18 0x00005584333dda68 in handle_stop_requested (ecs=<optimized out>) at infrun.c:4465
#19 handle_stop_requested (ecs=<optimized out>) at infrun.c:4460
#20 handle_inferior_event (ecs=0x7ffc32028700) at infrun.c:5695
#21 0x00005584333df48e in fetch_inferior_event () at infrun.c:4085
#22 0x00005584336f7ef6 in gdb_wait_for_event (block=block@entry=0) at event-loop.cc:700
#23 0x00005584336f83da in gdb_wait_for_event (block=0) at event-loop.cc:596
#24 gdb_do_one_event () at event-loop.cc:212
#25 0x0000558433424275 in start_event_loop () at main.c:421
#26 captured_command_loop () at main.c:481
#27 0x0000558433425e75 in captured_main (data=0x7ffc320288a0) at main.c:1351
#28 gdb_main (args=args@entry=0x7ffc320288d0) at main.c:1366
#29 0x00005584331b9d10 in main (argc=<optimized out>, argv=<optimized out>) at gdb.c:32

Stack frame #11 shows that gdb is trying to open the file descriptor associated with our memory-mapped file, using the path /proc/self/fd/4. The request is then passed to gdb's internal cache of file descriptors at stack frames #6 and #7, where a single call to fread is made. The function cache_bread_1 attempts to read 64 bytes from the file, and this read never returns.
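One detail worth checking here: /proc/self is resolved by whichever process opens the path, so /proc/self/fd/4 opened from inside gdb may refer to gdb's own descriptor 4 and not to the one held by the debugged program. This is a hypothesis, not something confirmed in this investigation. A small diagnostic sketch that reads the symlink behind /proc/<pid>/fd/<fd> can show what a descriptor number points to in a given process; the helper name describe_fd is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: stores the target of /proc/<pid>/fd/<fd> in `out`.
// Returns 0 on success, -1 if the link cannot be read.
static int describe_fd(pid_t pid, int fd, char* out, size_t len)
{
  if (len == 0)
    return -1;

  char link[64];
  int n = snprintf(link, sizeof(link), "/proc/%ld/fd/%d", (long) pid, fd);
  if (n < 0 || (size_t) n >= sizeof(link))
    return -1;

  ssize_t written = readlink(link, out, len - 1);
  if (written < 0)
    return -1;
  out[written] = '\0'; // readlink does not null-terminate the result
  return 0;
}
```

Running this with the pid of the hanging gdb and with the pid of the debugged program would show whether descriptor 4 names the same object in both. Unlike /proc/self, a pid-qualified path names the same file no matter which process opens it.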

What if gdb cannot read the data because it’s simply not there? We might have to flush the data back to the filesystem to make it visible to other processes.

msync(memory_handle, size, MS_SYNC);

Unfortunately, this does not resolve the issue. The problem seems to be a core gdb issue that I cannot resolve by myself, so I opened a bug report in their Bugzilla.

You can find all of the code and compilation scripts on GitHub.



