The author of this post works for Undo and uses examples that reference his company’s product. The methods mentioned, however, do not depend on any specific tool.
The most successful development teams are those that are agile and can respond quickly to rapid business changes while also maintaining the quality of their hardware or software development. To help realize team agility, Continuous Integration (CI) has quickly become an industry best practice. CI is a practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early. Whilst this mode of working allows new features to be integrated quickly, it is important to maintain a proper backlog in order to discover potential hiccups, and in a busy development environment, these hiccups can sometimes be neglected.
Currently, a developer is usually informed about a failure via an automated report, which might be little more than a flag that a particular test has failed after a certain commit was made, together with some logging statements that may or may not identify the problem. Backtracking from this terse statement of failure to the source of the problem is often laborious, time-consuming and painful for the developer, and that is just the case for failures that are reproducible.
If a test doesn’t fail on the machine that a developer was working on, they will wait for it to be reproduced on a different machine before they try to fix the failure. Irreproducible failures or intermittently failing tests add an extra dimension of complexity to the investigation. They may appear only once in every 300 runs of the program, which makes them difficult to track down. It can seem easier (and more cost-effective) to ignore them and hope that they do not appear again. Consequently, some testing systems accumulate a backlog of “known-failing” tests: a dumping ground for (possibly important) test failures that no one has the time, inclination or energy to fix.
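To put a number on why a 1-in-300 failure is so hard to catch, here is a minimal sketch in C. It assumes that test runs are independent and that the failure rate is constant, which real flaky tests may not satisfy. The function name runs_needed() is ours, purely for illustration.

```c
#include <assert.h>

/* Number of independent runs needed before the chance of seeing at least
   one failure reaches `confidence`, given a per-run failure probability.
   Assumes independent runs with a constant failure rate. */
unsigned runs_needed(double failure_rate, double confidence)
{
    assert(failure_rate > 0.0 && failure_rate < 1.0);
    assert(confidence > 0.0 && confidence < 1.0);

    double all_pass = 1.0;   /* probability that every run so far passed */
    unsigned runs = 0;

    while (1.0 - all_pass < confidence) {
        all_pass *= 1.0 - failure_rate;   /* one more passing run */
        runs++;
    }
    return runs;
}
```

Under these assumptions, runs_needed(1.0 / 300.0, 0.95) comes out just under 900: a developer would have to rerun the test nearly nine hundred times to be 95% sure of seeing the failure even once.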
Capture the real failure
Falling victim to irreproducible failures can be costly. Without proper backlogs, it can be extremely difficult to find and fix problems. In fact, a recent study by the University of Cambridge Judge Business School put the cost of debugging to the industry at $312 billion each year.
There are many different types of software bugs, all of which can impact software performance. For C/C++ programmers, common bugs include execution state corruption, data structure corruption, race conditions, deadlocks and memory leaks. These bugs can appear regularly in software development.
Fundamentally, even in well-developed software, bugs occur because people don’t understand what their software really does. However, there is now a range of tools that allow insight into the murky depths of software execution. Deterministic recording technology allows developers to capture exact replicas of failures, effectively a perfect reproduction of a bug. Using reversible debugging, such recordings can be replayed and rewound to home in on the root cause of the failure.
For example, if your program fails in the cloud, you could download a recording which is an exact copy of the failure to determine what went wrong from the comfort of your laptop.
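To give a feel for the idea behind deterministic recording, here is a toy sketch in C. It is not how Undo or any other real recorder is implemented; it only illustrates the principle that if every nondeterministic input is logged during the original run, feeding the same log back in reproduces the same execution. All names (event_trace, nondet_input, run_program) are hypothetical, and rand() stands in for anything nondeterministic such as timing, I/O or thread scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TRACE_CAPACITY 64

enum trace_mode { MODE_RECORD, MODE_REPLAY };

struct event_trace {
    enum trace_mode mode;
    size_t count;                 /* number of events recorded */
    size_t cursor;                /* next event to hand out on replay */
    int events[TRACE_CAPACITY];
};

/* Single entry point for nondeterministic input. In record mode the value
   is logged; in replay mode the logged value is returned instead. */
int nondet_input(struct event_trace *trace)
{
    if (trace->mode == MODE_REPLAY) {
        assert(trace->cursor < trace->count);
        return trace->events[trace->cursor++];
    }

    int value = rand();           /* stand-in for a nondeterministic source */
    assert(trace->count < TRACE_CAPACITY);
    trace->events[trace->count++] = value;
    return value;
}

/* Toy "program under test": its result depends only on the inputs it reads. */
unsigned long long run_program(struct event_trace *trace)
{
    unsigned long long acc = 0;
    for (int i = 0; i < 8; i++)
        acc = acc * 31 + (unsigned)(nondet_input(trace) % 100);
    return acc;
}
```

Running run_program() once in MODE_RECORD, then resetting cursor to 0 and running it again in MODE_REPLAY, gives the same result both times, however unpredictable the original inputs were. A real recorder has to do this for every source of nondeterminism in a process, which is what makes the recording an exact copy of the failure.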
In the debugger session below, we’ve outlined an example in a stripped-down version of something that could have appeared in an embedded environment.
Squash that bug—Finding the root cause
For this example, we have used Undo’s recording and reverse debugging products.
To illustrate the power of record-and-replay technology, let’s assume we were running a test with recording enabled, and on this run, the failure we’re interested in occurs. We would load the recording and run to the end of the program’s execution:
    undodb-gdb: Have loaded Undo Recording:
    undodb-gdb:     my-recording.undo
    undodb-gdb: Note that the debuggee is currently at the beginning of
    undodb-gdb: the recording. You can use the "continue" command to
    undodb-gdb: run to the end of the recording.
    (undodb-gdb) continue
    Continuing.
    Program received signal SIGSTOP, Stopped (signal).
    0x00007f55d0b5dc19 in __GI__exit (status=status@entry=0) at ../sysdeps/unix/sysv/li
    32          INLINE_SYSCALL (exit_group, 1, status);
    (undodb-gdb)
To get a closer look at the error manifestation, set a breakpoint at the place where the error was detected, and let the program run backwards:
    (undodb-gdb) break example.c:86
    Breakpoint 3 at 0x400a62: file example.c, line 86.
    (undodb-gdb) reverse-continue
    Continuing.
    Breakpoint 3, main () at example.c:86
    86          printf ("the factorial of %u is: %zu\n", m, q);
    (undodb-gdb)
Unsurprisingly, we find that ret is wrong. Since ret is the factorial of a number below 10, it should never be larger than factorial (9) = 362880. Let’s use that bound as the condition on a watchpoint, run backwards, and search for the point at which ret becomes okay again:
    (undodb-gdb) print m
    $8 = 0
    (undodb-gdb) print ret
    $9 = 515396071800
    (undodb-gdb) watch ret if ret < 362880
    Hardware watchpoint 7: ret
    (undodb-gdb) reverse-continue
    Continuing.
    Hardware watchpoint 7: ret
    Old value = 4294967265
    New value = 24
    factorial (m=5) at example.c:30
    30          size_t ret = 1;
    (undodb-gdb) print m
    $10 = 5
    (undodb-gdb) print ret
    $11 = 24
If the watchpoint fired here, that means going back one more step will show us where the bad thing happened, right?
    (undodb-gdb) reverse-step
    29      {
    (undodb-gdb) reverse-step
    84          size_t q = factorial (m);
However, we’re back in main, and there is nothing suspicious. This is the last time the function was called where ret was okay, which is not quite what we wanted. The debugger tries to be clever and only enables the watchpoint where the variable is actually in scope. We need to watch the address itself if we want it to trip on every write access, but first let’s step back into the function by rewinding and replaying the program to explore the code as much as needed:
    (undodb-gdb) step
    factorial (m=5) at example.c:30
    30          size_t ret = 1;
    (undodb-gdb) step
    32          while (m > 0) {
    (undodb-gdb) print &ret
    $14 = (size_t *) 0x7fff799e4698
    (undodb-gdb) watch *(size_t *) 0x7fff799e4698
    Hardware watchpoint 8: *(size_t *) 0x7fff799e4698
    (undodb-gdb) continue
    Continuing.
    [Switching to Thread 13229.13254]
    Hardware watchpoint 15: *(size_t *) 0x7fff799e4698
    Old value = 1
    New value = 0
    0x0000000000400843 in timeout_fn (val=...) at example.c:22
    22          *((unsigned *) val.sival_ptr) -= 1
As you may have noticed, timeout_fn() is writing to this address, causing an overflow, even though the timer is deleted in line 66. The man page of timer_delete() warns that the treatment of any pending signal generated by the deleted timer is unspecified. Even though this example uses threads instead of signals, that seems to be exactly what is going wrong: later invocations of timeout_fn() try to change the value of init_param, which was passed by address during setup(), but setup() has since returned. The pointer to it is therefore invalid (C99 standard, section 6.2.4), and in this particular case it was pointing at the local variables of the function factorial().
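The source of example.c is not reproduced in this post, so the following is only a hypothetical sketch of one way such a bug could be avoided, not the actual fix. The names setup(), timeout_fn() and init_param come from the session above; teardown(), the heap allocation and the guard against decrementing below zero are our assumptions. No real timer is created here: the comment marks where a real setup() would hand the pointer to timer_create().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>   /* union sigval */
#include <stdlib.h>

/* Hypothetical setup(): the counter lives on the heap, so the pointer given
   to the timer stays valid after setup() returns. In the buggy version,
   init_param was a local variable whose lifetime ended on return. */
unsigned *setup(unsigned initial)
{
    unsigned *init_param = malloc(sizeof *init_param);
    if (init_param != NULL)
        *init_param = initial;
    /* A real setup() would pass init_param to timer_create() here, through
       the sigev_value.sival_ptr field of a struct sigevent. */
    return init_param;
}

/* Timer callback with the same shape as the one in the recording, plus a
   guard so that a late invocation cannot wrap the counter below zero. */
void timeout_fn(union sigval val)
{
    unsigned *remaining = val.sival_ptr;
    if (remaining != NULL && *remaining > 0)
        *remaining -= 1;
}

/* Hypothetical teardown(): free the counter only once no further callback
   can run, which timer_delete() alone does not guarantee. */
void teardown(unsigned *init_param)
{
    free(init_param);
}
```

The point of the sketch is the lifetime: whatever timeout_fn() writes to must outlive every possible invocation of the callback. In a real program using thread-based notification, access to the counter would also need synchronization, since the callback runs on a different thread from the code that reads it.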
Developers who repeat this debugging session will find that it takes only a matter of minutes to complete. Seasoned engineers who spend time debugging similar issues can attest that other approaches can take weeks to reach the same resolution.
How to win at CI
Thanks to the power of record-and-replay technology, developers can reduce debugging time and ship quality C/C++ code on time and on budget. This approach applies as much to day-to-day development as it does to fixing test failures. Using record-and-replay technology to find the root of the failure can speed up the QA process and allow for a faster development cycle.