
Compiler-error-abuse-based large scale renaming in C++

Many developers hate compiler errors. They can become a daily pain, an endless stream of misery and frustration. But what if I told you that compiler errors may help you safely and quickly refactor an entire codebase to rename class members and types?

A Story of Renaming

At my company, we have been using string views for quite some time in our codebase - long before std::string_view was cool (that is, before C++17) - and we recently realized that some of our initial choices might not have been completely wise at that time.

Here’s what we used to write nearly ten years ago (lots of boilerplate removed):

class MyFancyView
{
public:
    MyFancyView(const char* ptr, size_t ptrlen) ...
... lots of fancy functions out there

public:
    const char* data;	// public data pointer
    size_t len;		    // public length
};

The first obvious issue with this primitive view is the pair of public members data and len, something which is now frowned upon (for good reasons, as it tends to allow hacky external changes to the pointer and length). The other issue is that, if we want to be closer to the standard library API, the accessors need to be data() and size().
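
For reference, the target shape looks roughly like this (a minimal sketch, not our actual class; the private member names are illustrative):

class MyFancyView
{
public:
    MyFancyView(const char* ptr, size_t ptrlen) : _data(ptr), _len(ptrlen) {}

    // Standard-library-style accessors
    const char* data() const { return _data; }
    size_t size() const { return _len; }

private:
    const char* _data{ nullptr };   // data pointer, now private
    size_t _len{ 0 };               // length, now private
};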

The solution is rather straightforward: we need to rewrite the thousands of lines of code that use those direct members so that they go through accessors instead, and sometimes change the code logic to use the view's own operations - typically trimming views, taking a sub-view, etc. Simple but overwhelming: the operation needs to be carefully done, line by line.

Besides,

  • Using an editor to do the work may work, but it may also lead to errors. You may replace members in places where they do not belong. You may also miss a lot of ugly legacy macros embedding the renamed members. The compiler is the only one that knows which type has to be renamed, and where.
  • Tools such as clang-rename did not appear to help either, as they seem more convenient for replacing one occurrence at a time, typically with an offset provided by an editor.
  • A blind replacement of every data into data() and len into size() in the codebase would lead to havoc: we have lots of stuff out there with those names, and this would never work.

And there’s a last catch: programmers are lazy. There is no world where we’ll do that manually!

Oh, and there's a perfectly valid excuse not to do it anyway: the risk of introducing subtle regressions is great, and the risk of driving any code reviewer crazy is even greater.

So let’s start from the beginning: we’d like to rename those two members.

The strategy needs to be a bit subtle:

  1. Rename the member definitions to an intermediate name that shall be unique in the codebase (i.e. a pattern that appears nowhere else)
  2. Identify all member references to be renamed and replace them with the intermediate pattern
  3. Then, we can start doing blind replacements in the codebase

The first item has a pretty straightforward hack: let’s rename data into dAtA, and len into lEn in our definition file!

 public:
-    const char* data{ nullptr };
-    size_t len{ 0 };
+    const char* dAtA{ nullptr };
+    size_t lEn{ 0 };
 };

Why the weird l33t case? I'm glad you asked! Here's the reason:

git grep -E "(dAtA|lEn)"

The command above did not yield any matches in our codebase. We have our perfect names!

One more thing: you might not have realized it, but those two new names also have the same length as the original ones, so replacing them won't change any line's length. You'll see later why this is important.

What happens if we build the code without any further changes? Well, we'll obviously get a bunch of errors like this one before the compiler gives up on our foolishness:

../types/MixedFancy.h:84:16: error: no member named 'len' in 'MyFancyView'
    return lhs.len == rhs.size() && (lhs.len == 0 || __builtin_memcmp(lhs.data, rhs.data(), lhs.len) == 0);

This message is actually cool. It provides four major pieces of information:

  • The error we seek to fix (no member named ‘len’)
  • The file involved (types/MixedFancy.h)
  • The line number involved (84)
  • And even the column number involved (16)

And remember, this information is extracted from the compiler itself: we can rely on it.

What more could you even ask for ? A margarita ?

If we grep for error: no member named, then isolate the leftmost part (file:line:column), we are almost set to fix the error.

With this, if we extract the file, the line, and the column, we can now do our replacement:

LANG=C sed \
    -E "84s/^(.{15})len/\1lEn/" \
    -i "../types/MixedFancy.h"

The leading 84 before sed's s command is the line number, and we need to capture 15 (16 - 1) characters from the beginning of the line, i.e. everything before column 16. Then we emit the captured part followed by the new name, and we're all set. Using LANG=C is required, as the compiler we use (clang) provides offsets at the byte level.

A last detail: we're lazy (we're merely humans after all), but the compiler shouldn't be. We need to tell the compiler not to worry if we have thousands of errors, by adding -ferror-limit=0 to the CMake CXX flags. We also need to tell ninja not to be too lazy (by using -k0).

With all this, we can have the final renaming script main logic:

ninja -k0 -C . 2>&1 |
  # capture compiler errors
  grep 'error: no member named' |
  # extract file, line, and column
  cut -f1-3 -d':' |
  # sort all
  sort -u |
  # only keep first file/line when several patterns exist on a single line
  sort -u -k1,2 -t':' |
# now bulk replace
while read line; do
    # extract filename, line, column
    f="$(echo "$line"|cut -f1 -d':')"
    l="$(echo "$line"|cut -f2 -d':')"
    c="$(echo "$line"|cut -f3 -d':')"
    # replace at line l and column c either of the pattern
    LANG=C sed -E -i "$f" \
        -e "${l}s/^(.{$((c-1))})data/\1dAtA/" \
        -e "${l}s/^(.{$((c-1))})len/\1lEn/"
done

It is highly suggested that you run each step above separately until you're confident enough, so that you can hunt down any mistakes.

A note on why the replacement names must keep the same length:

std::find(_elements.data, _elements.data + _elements.size, fun)

The code above will trigger two errors on the same line, at two different columns. If we were to replace with a name of a different length, later columns on that line would no longer be valid, and this would completely mess up our script.

Going Further

That’s nice, but we also have a lot of stuff inside macros. I know, macros are bad, but legacy reasons…

Errors inside those macros will typically generate stacked error messages, such as in this case where two macros calling each other are involved (here, we were renaming MyFancyView itself):

test/MyFancyTests.cpp:72:9: error: unknown type name 'MyFancyView'
        TEST_ENTITY("〹", 0x3039, false);
        ^
test/MyFancyTests.cpp:64:48: note: expanded from macro 'TEST_ENTITY'
#define TEST_ENTITY(name, value, onlyAlphaNum) _TEST_ENTITY(entities, name, value, onlyAlphaNum)
                                               ^
test/MyFancyTests.cpp:48:9: note: expanded from macro '_TEST_ENTITY'
...

The location that actually needs to be patched is MyFancyTests.cpp:48:9, but the error itself is reported at MyFancyTests.cpp:72:9.

To capture those cases, a multiline grep capture (as demonstrated on StackOverflow) with grep -Pzo will do the magic:

errMatch="(unknown type name|use of undeclared identifier|no type named|no member named)"
fromMacro="note: expanded from macro"
grep -aPzo \
		"(?s)[^\n]*: error: ${errMatch} '[A-Za-z0-9\._]*'[^\n]*\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*(\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*)*"

Then, with this capture, we need to clean up the null characters (emitted by the -z flag), keep only what we need (i.e. filter the desired symbols), and extract the four pieces of information we need.

The Final Script

The final script I used is the following. It was aimed at clang, but could be adapted to other compilers. Feel free to use it as inspiration, and happy renaming!

  • It enables -ferror-limit=0
  • It first patches the files holding the class/members to be renamed
  • It computes aLtErNaTe case intermediate patterns for the replacement
  • It checks that we don't have hits on the intermediate patterns
  • It replaces the files holding the definitions of the types we want to rename
  • It loops until we don't have any more errors
    • Capture errors and locate patterns to replace
    • Replace patterns
  • Final name replacements from the aLtErNaTe case to the final versions
  • Removal of idempotent type aliases if we replaced a type by its final alias
  • A bit of clang-format while we are at it!

#!/bin/bash

set -e

# Source and build location
readonly sources=$HOME/git/codebase
readonly build=${sources}/build
readonly headerFiles=(MyFancyView.hpp)

# List of replacements
# TODO: put your own replacements here
declare -A replacements=(
	["MyFancyView"]="StandardView"
	["MyFancuString"]="U16StandardView"
	["MyFancuUnicodeString"]="U32StandardView"
	)

# Function to return aLtErNaTe case version of $1
altCase() {
	local result=
	local i
	local up=
	local c
	for i in $(printf "%s" "$1" | sed -e 's/\(.\)/\1\n/g'); do
		if [[ -n "$up" ]]; then
			c=$(echo "$i" | tr '[:lower:]' '[:upper:]')
			up=
		else
			c=$(echo "$i" | tr '[:upper:]' '[:lower:]')
			up=yes
		fi
		result="${result}${c}"
	done
	echo "$result"
}

# Print a message and fail
die() {
	echo "** FATAL: $*" >&2
	exit 1
}

# Emit info stuff
info() {
	echo "[ $* ] ..." >&2
}

# Rebuild logic; places errors in /tmp/errors
# TODO: put your own (re)build logic here. It should build as much as possible, ignoring errors!
rebuild() {
	# clean repository
	git clean -dxf

	# rebuild (-k0 to continue despite errors)
	rm -rf "build" \
		&& mkdir -p "build" \
		&& (cd "build" && cmake -GNinja .) \
		&& ninja -C "build" -k0 \
		| tee /tmp/errors
}

# Code format logic, because we are worth it
format() {
	# TODO: insert your code format logic here
	info "Cleanup clang-format"
	git diff -U0 --no-prefix --no-color | clang-format-diff-14 -style file -i
}

# TODO: put your own logic here to add -ferror-limit=0 on the build
info "Set error limit to infinity"
sed -e 's/CFLAGS=/CFLAGS=-ferror-limit=0/' -i CMakeLists.txt
git commit CMakeLists.txt -m "chore: Add -ferror-limit=0"

# Compute intermediate class names with mIxEd cAsE for line column stability
declare -A intermediate
srcAllowed=
for from in "${!replacements[@]}"; do
	to="${replacements[$from]}"
	intermediate[$from]=$(altCase "$from")
	if [[ -n "$srcAllowed" ]]; then
		srcAllowed="${srcAllowed}|"
	fi
	srcAllowed="${srcAllowed}${from}"
done

info "Check we don't have hits on intermediate patterns"
for from in "${!intermediate[@]}"; do
	to="${intermediate[$from]}"
	! git grep -q "$to" || die "Oops: $to did hit; please change the logic"
done

info "Do all intermediate replacements in types"
for from in "${!intermediate[@]}"; do
	to="${intermediate[$from]}"
	echo "$from$to"

	# Let's replace directly concerned files first except includes
    # TODO: put your own logic here to specify the file(s) holding the types to be renamed to patch
	for i in ${headerFiles[@]}; do
		sed -e "s/\b${from}\b/${to}/g" -i "$i"
		sed -e "s%${to}\.h\"%${from}\.h\"%g" -i "$i"
	done
done

info "Do all intermediate forward declaration replacements"
for from in "${!intermediate[@]}"; do
	to="${intermediate[$from]}"
	echo "$from$to"

	for i in $(git grep -E "^class ${from};" | cut -f1 -d':' | sort -u); do
		sed -e "s/^class ${from};$/class ${to};/g" -i "$i"
	done
done

# Let's loop until we converge without errors
step=1
while true; do
	info "Rebuild #$step"
	rebuild

	# Let the magic begin: fetch error information and replace exact pattern at the exact places
	errMatch="(unknown type name|use of undeclared identifier|no type named|no member named)"
	fromMacro="note: expanded from macro"

	# First handle inlined macros. Trickier, as this is a multiline capture. (LANG=C because way faster)
	#
	# /code/library/utils/MyFancyTools.cpp:60:45: error: use of undeclared identifier 'UnicodeChar'
	#     constexpr UnicodeChar replacementChar = UNICODE_REPLACEMENT_CHAR;
	#                                             ^
	# /code/library/unicode/unicode.h:24:36: note: expanded from macro 'UNICODE_REPLACEMENT_CHAR'
	#
	# -or- with nested macros:
	#
	# /code/test/MyFancyTests.cpp:72:9: error: unknown type name 'AString'
	#         TEST_ENTITY("〹", 0x3039, false);
	#         ^
	# /code/test/MyFancyTests.cpp:64:48: note: expanded from macro 'TEST_ENTITY'
	# #define TEST_ENTITY(name, value, onlyAlphaNum) _TEST_ENTITY(entities, name, value, onlyAlphaNum)
	#                                                ^
	# /code/test/MyFancyTests.cpp:48:9: note: expanded from macro '_TEST_ENTITY'
	LANG=C \
	grep -aPzo \
		"(?s)[^\n]*: error: ${errMatch} '[A-Za-z0-9\._]*'[^\n]*\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*(\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*)*" \
			/tmp/errors |
		tr '\0' '\n' |
		grep -aE "(error: ${errMatch}|${fromMacro})" |
		sed \
			-e "s/^.*error: [^']* '\([^']*\)'[^\n]*$/%\1/" \
			-e "s/^\([^:]*\):\([^:]*\):\([^:]*\): ${fromMacro}.*/\1\t\2\t\3/" |
		tr '\n%' '\t\n' |
		sed -E 's/^([^\t]*)\t([^\t]*\t[^\t]*\t[^\t]*\t)*([^\t]*)\t([^\t]*)\t([^\t]*)\t$/\3;\4;\5;\1/g' |
		grep -aE ";${srcAllowed};" |
		tr ';' '\t' |
		grep -avE '^$' |
		sort -u |
		while read -r elt; do
			file=$(echo "$elt" | cut -f1)
			line=$(echo "$elt" | cut -f2)
			col=$(echo "$elt" | cut -f3)
			from=$(echo "$elt" | cut -f4)
			to="${intermediate[$from]}" || die "Can't find $from in intermediate"
			printf "\r\33[K\r%s %s %s %s %s" "$file" "$line" "$col" "$from" "$to"

			# Note: We need LANG=C as columns are apparently expressed in byte offset
			LANG=C sed -E "${line}s/^(.{$((col-1))})${from}/\1${to}/" -i "$file"
		done

	# Then fetch direct code error information and replace exact pattern at the exact places
	#
	# /code/library/FooWriter.h:35:25: error: unknown type name 'MyFancyView'
	# /code/library/FooWriter.h:431:21: error: use of undeclared identifier 'MyFancyView'
	info "Execute replacements"
	grep -a " error: " /tmp/errors |
		grep -aE "${errMatch}" |
		sed -e "s/^\([^:]*\):\([^:]*\):\([^:]*\):[^']*'\([^']*\)'.*/\1\t\2\t\3\t\4/" |
		grep -aE "\b(${srcAllowed})\b" |
		sort -u |
		while read -r elt; do
			file=$(echo "$elt" | cut -f1)
			line=$(echo "$elt" | cut -f2)
			col=$(echo "$elt" | cut -f3)
			from=$(echo "$elt" | cut -f4)
			to="${intermediate[$from]}" || die "Can't find $from in intermediate"
			printf "\r\33[K\r%s %s %s %s %s" "$file" "$line" "$col" "$from" "$to"

			# Note: We need LANG=C as columns are apparently expressed in byte offset
			LANG=C sed -E "${line}s/^(.{$((col-1))})${from}/\1${to}/" -i "$file"
		done

	# If no changes, bail out
	if git diff-index --quiet HEAD --; then
		break
	fi

	# Format code
	format

	# Commit
	if [[ $step == 1 ]]; then
		info "Commit result"
		git commit -a -m "chore: Automated replacements ($0)"
	else
		info "Merging commit result"
		git commit -a --amend --no-edit
	fi

	step=$((step+1))
done

info "Final name replacements"
for key in "${!replacements[@]}"; do
	from="${intermediate[$key]}"
	to="${replacements[$key]}"
	echo "$from$to"

	for i in $(git grep "$from" | cut -f1 -d':' | sort -u); do
		sed -e "s/\b${from}\b/$to/g" -i "$i"
	done
done

info "Remove idempotent types (using X=X)"
for from in "${!replacements[@]}"; do
	to="${replacements[$from]}"
	echo "$from$to"

	for i in $(git grep -E "^using ${to} = ${to};" | cut -f1 -d':'); do
		sed -E "/^using ${to} = ${to};/d" -i "$i"
	done
done

# Format code
format

info "Merging commit result"
git commit -a --amend --no-edit

info "Final build"
rebuild

Redirecting Stdio is Hard

The Two Nasty Bugs

You know what it is like: you never saw them, but some people did, and these people are reliable, so there must be something wrong. But nobody on the engineering team managed to reproduce the issues. And the issues are random, and only happen rarely. Very rarely, such as once a month for a given customer, or even once a year for another one. And you know what: these bugs are horrible, because it seems you'll never catch them. They hide, and silently trigger when you are asleep.

First bug report: Can not run an external crash reporter on Linux

This was the nicest one, actually, and it took only a few days to investigate. Basically, in case of a crash, a crash reporter was supposed to be run, sending useful information (stack traces et al.). Unfortunately, the crash reporter did not appear to work correctly: it did not start, for an unknown reason (there were no error messages).

Fortunately, this bug was reproducible internally.

Second bug report: I see log files in my database on Windows

Yes, log files in my database. Or, more precisely, log output in my database binary files. I was a bit surprised, too. It seemed that, randomly, some log output that was supposed to go to the log files was redirected to another random file, including database ones.

Obviously, the database was not happy after that.

That was the report after months of investigation: the first ones were just the usual "there was a random crash and I have no clue why - I scratched everything and now it works".

These two issues were actually related, and the root cause was the way file descriptors are handled.

Oh, what are file descriptors, by the way?

The Three Little Descriptors

All C programmers know the three standard library input/output files: stdin, stdout, stderr. These are the default standard input, output, and error streams. The first one typically receives the console input (keyboard); the latter ones are used to display output.

If you look at the stdin manual page, you will see that

A file with associated buffering is called a stream and is declared to be a pointer to a defined type FILE. (…) At program start-up, three streams shall be predefined and need not be opened explicitly: standard input (for reading conventional input), standard output (for writing conventional output), and standard error (for writing diagnostic output).

Basically, these FILE streams encapsulate a file descriptor, that is, a lower-level input or output resource (generally identified by an integer, i.e. an int), and provide additional features, such as buffering, as well as a set of higher-level functions, such as fprintf to print formatted output.

By default, the input/output streams are connected to the console (Linux, or DOS), but you may redirect them easily - this feature is especially known in the Linux/Un*x world:

ls -l /tmp/foo >/tmp/list 2>/tmp/errorlog

In this very basic shell example, the standard output of the ls -l /tmp/foo command is redirected to the file /tmp/list, and the standard error output is redirected to /tmp/errorlog. Note that >/tmp/list can also be written as 1>/tmp/list: the ‘1’ and ‘2’ refer to file descriptors 1 and 2, which are the standard output and error file descriptors. The standard input is file descriptor 0:

read A 0</tmp/bar

Why are the standard input/output streams identified by 0, 1 and 2? Well, this is just a convention: by default, file descriptors are allocated in increasing order, and the first three files opened (this is done by the system at program start-up) are these streams.
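
A trivial way to see this convention in action (a small sketch; on a POSIX system this prints 0, 1 and 2):

#include <stdio.h>

int main(void) {
    /* The three standard FILE streams wrap the first three file descriptors. */
    printf("stdin: %d, stdout: %d, stderr: %d\n",
           fileno(stdin), fileno(stdout), fileno(stderr));
    return 0;
}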

We can redirect the standard streams easily using shell commands - but can we do that programmatically, especially for the output streams ?

Redirection to a File

We have two output file descriptors (1 and 2), and we want to capture them in a file - say, a log file. This is quite convenient, especially for server applications: log files can then be rotated, inspected…

The standard way to reassign a file descriptor is the dup2 standard library function. This function takes an opened file descriptor and clones it onto the second file descriptor provided. You can also extract the file descriptor beneath a FILE stream through the fileno standard library function. Here's a simple piece of code redirecting the standard output and error streams to a file:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

/** Redirect stdout and stderr. Returns 0 upon success. **/
static int redirect_std_out_err(const char *log_name) {
  int failure = 1;

  /* Open log FILE stream. */
  FILE *const fp = fopen(log_name, "wb");
  if (fp != NULL) {

    /* Extract log FILE fd, and standard stdout/stderr ones. */
    const int log_fd = fileno(fp);
    const int stdout_fd = fileno(stdout);
    const int stderr_fd = fileno(stderr);
    if (log_fd != -1 && stdout_fd != -1 && stderr_fd != -1) {

      /* Clone log fd to standard stdout/stderr ones. */
      if (dup2(log_fd, stdout_fd) != -1 && dup2(log_fd, stderr_fd) != -1) {
        failure = 0;
      }
    }

    /* Close initial file. We do not need it, as we cloned it. */
    fclose(fp);
  }
  return failure;
}

int main(int argc, char **argv) {
  /* Redirect stdout and stderr to a single log file */
  if (argc != 2 || redirect_std_out_err(argv[1])) {
    return EXIT_FAILURE;
  }

  /* Print output to both streams */
  printf("Hey, stdout is redirected!\n");
  fprintf(stderr, "Hey, stderr is also redirected!\n");

  return EXIT_SUCCESS;
}

And that’s it! Simple, isn’t it ?

Here Come The Problems

Redirection is what we typically do at my company - I even modified some of the involved code. But sending output to a log file has its drawbacks, especially regarding log file size: the file keeps growing, and we need to clear it from time to time, or the disk will eventually fill up. This is generally done through a "rotate" operation (a small sketch follows the list), that is:

  • closing the current log file
  • renaming it with some timestamp
  • opening a fresh one
  • cleaning up old logs if necessary
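
Here is a minimal sketch of such a rotation on a POSIX system, reusing the redirect_std_out_err() function above (the rotate_log name is mine; error handling and old-log cleanup are omitted):

#include <stdio.h>
#include <time.h>

/* Rotate the log: archive the current file under a timestamped name,
   then re-point stdout/stderr to a fresh file. Returns 0 upon success. */
static int rotate_log(const char *log_name) {
  char archived[1024];

  /* Rename the current log with some timestamp; the standard streams keep
     writing to the renamed file until they are re-pointed below. */
  snprintf(archived, sizeof(archived), "%s.%ld", log_name, (long) time(NULL));
  if (rename(log_name, archived) != 0) {
    return 1;
  }

  /* Open a fresh log file and redirect stdout/stderr to it. */
  return redirect_std_out_err(log_name);
}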

More generally, file descriptors are special resources in most operating systems, as they are inherited when spawning a process. By default, all file descriptors are transmitted to spawned processes (on POSIX systems, fork clones them, and the exec functions keep them open by default), and this is a pain: pipes and sockets are also inherited. This may lead to the following case:

  • the main program opens a socket to a remote location
  • it spawns a process, which inherits this socket (file descriptor) too
  • the main program sends commands to the remote location and closes the socket
  • the spawned program still lives, and still has its copy of the socket file descriptor open
  • the remote location does not "see" the client close() and is stuck

The thing is that spawned processes do not give a damn about already opened file descriptors (except for stdio). But the standard behavior is broken for historical reasons, very unfortunately.

The solution is painful: you need to set the FD_CLOEXEC flag on every file descriptor you open. This flag explicitly tells the system that the file descriptor shall not be inherited when spawning a process:

  • open(..., ...) => open(..., ... | O_CLOEXEC)
  • fopen(..., ...) => open(..., ... | O_CLOEXEC) + fdopen(...) OR fopen(..., ... + "e")
  • dup(...) => fcntl(..., F_DUPFD_CLOEXEC, (long) 0)
  • dup2(..., ...) => dup3(..., ..., O_CLOEXEC)
  • opendir(...) => open(..., O_RDONLY | O_DIRECTORY | O_CLOEXEC) + fdopendir(...)
  • pipe(...) => pipe2(..., O_CLOEXEC)
  • accept(..., ..., ...) => accept4(..., ..., ..., SOCK_CLOEXEC)
  • socket(..., ..., ...) => socket(..., ... | SOCK_CLOEXEC, ...)
  • and so on…

Some of these functions (such as accept4) are unfortunately non-standard. Some others are not available everywhere (Windows). Some old Linux kernels will sometimes not support these features - and some old glibc versions will not work either (especially regarding the "e" fopen flag). Yes, this is a real pain.
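
As an illustration of the open() case (a sketch under POSIX/Linux assumptions, not our production code):

#include <fcntl.h>

/* Open a log file that spawned processes will NOT inherit. */
static int open_uninheritable_log(const char *path) {
  /* O_CLOEXEC sets the FD_CLOEXEC flag atomically at open() time, so there
     is no window during which a concurrent fork()/exec() in another thread
     could inherit the descriptor. */
  return open(path, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644);
}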

Oh, one last thing: do not dare to use fcntl(fd, F_SETFD, ...) to clear the FD_CLOEXEC flag: this is insanely hopeless in a multithreaded environment. Can you guess why ?

So What Was That All About ?

Okay, back to our nasty bugs!

  1. The first one was quite easy to investigate.

Here are the investigation steps, ordered temporally:

  • “can not run an external crash reporter on Linux”; ie. can not fork()/execv() the external crash reporter in a signal handler, with a status code returned by waitpid() reporting a signal 6 caught during launch (ie. the launched program called abort())
  • we can not run the crash reporter outside the signal handler either
  • we can not run the crash reporter using system() either
  • we can not run “/bin/echo foo” through system() either; the command returns the status “return code 1”!
  • … BUT we CAN run “/bin/echo -n” through system()!

At this stage, it is clear that the issue is related to the output file descriptors.

Further investigation showed that:

  • we redirected output to the logs, opening the log file with the FD_CLOEXEC flag set, AND using the Linux-specific dup3() call
  • as a result of dup3(), the standard streams had their FD_CLOEXEC flag set
  • spawning an external program therefore led to the standard streams being closed in the child
  • attempting to write to the closed standard streams led to either a broken pipe signal or an assertion failure, depending on the program involved

The solution was simply to use the default dup2() function for standard streams. These are the only streams we want to be inherited, after all!

  2. The second bug took a bit more time

The biggest issue is that the bug could not be reproduced internally at first - until our QA team managed to reproduce it three days in a row (these guys are really good at finding bugs, I must admit).

The log reports were strange, and totally unrelated. But one log was kind of interesting:

[2014/11/28-10:59:08.514] [info] ...
[2014/11/28-10:59:08.659] [info] ...
[Fatal Error] :2:999961: XML document structures must start and end within the same entity.

The error was a corrupted XML document produced by our code, with another, unrelated log entry inside it, in the middle of the structure. The final error did not have any timestamp, but the previous line was logged just before midnight (GMT). Hmm, a midnight bug? What are the chances of this being a coincidence?

I then realized that midnight was the trigger time for a number of maintenance operations, including… log rotation.

But after thoroughly inspecting the code, everything looked fine. And the bug never popped up on Linux; it was Windows-specific (or at least, much more frequent on this operating system).

But there was another clue: all the weird log entries were actually coming from the Java virtual machine (either through Log4j, or through System.out.println() calls).

When using System.out (or System.err) in Java, we are actually using a PrintStream, an object wrapped in various FilterOutputStream objects for buffering, the final instance being a FileOutputStream, which is, according to the documentation, an:

output stream for writing data to a File or to a FileDescriptor

And, in our case, the FileDescriptors involved were FileDescriptor.out and FileDescriptor.err, that is, the Java handles to the standard output and error streams.

Fortunately, the sources of OpenJDK are public, and it is possible to dig into the real code beneath, that is, the native code:

  • ./jdk/src/share/classes/java/io/FileOutputStream.java
  • ./jdk/src/windows/native/java/io/FileOutputStream_md.c
  • ./jdk/src/solaris/native/java/io/FileOutputStream_md.c

Hummm… there is a Windows-specific implementation of the FileOutputStream native calls. Strange, isn't it?

JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_write(JNIEnv *env, jobject this, jint byte, jboolean append) {
    writeSingle(env, this, byte, append, fos_fd);
}

And the writeSingle call is using a file descriptor returned by GET_FD():

void
writeSingle(JNIEnv *env, jobject this, jint byte, jboolean append, jfieldID fid) {
    // Discard the 24 high-order bits of byte. See OutputStream#write(int)
    char c = (char) byte;
    jint n;
    FD fd = GET_FD(this, fid);

And this macro just calls GetObjectField(..., IO_handle_fdID) to retrieve … the “handle” member of the corresponding FileDescriptor class.

What ? A handle ? Not a file descriptor ?

Yep: a HANDLE - and this was the root cause of all our problems.

Let me give you the reason: on Windows, a C library file descriptor is actually just a kernel HANDLE wrapped by the C runtime, nothing more.

It means that if you replace the standard input/output streams, you'll only replace the C library's view of them.

Here is the complete bug scenario on Windows after collecting all the clues:

  • the log rotation function is called at a low level, and replaces the standard output and error streams 1 and 2 using dup2()
  • the Visual C++ C library replaces the streams by closing the existing underlying HANDLEs and replacing them with new ones
  • the old HANDLEs are now totally meaningless ("dangling handles")
  • Java has no knowledge of the replacement, and still has the old HANDLE stored in its FileDescriptor internals
  • Java System.out calls are now sending data to nowhere (this is bad)
  • someone opens a new file, and Windows recycles the old HANDLE value
  • Java System.out calls are now sending data to a valid – yet unrelated – file (this is very very bad)
  • data corruption ensues (and not profit)

The root cause is that Java is unaware that the low-level HANDLE has changed.

The solution is simple: let's tell it!

For each System.out and System.err OutputStream instances, we need to:

  • get the “out” member field of the FilterOutputStream class and make it accessible (see Field.setAccessible())
  • unwrap all the FilterOutputStream layers by fetching the “out” member field recursively
  • cast the resulting instance to a FileOutputStream
  • fetch the FileDescriptor through FileOutputStream.getFD()
  • use SharedSecrets.getJavaIOFileDescriptorAccess() to set the HANDLE (a long) value to -1 temporarily
  • rotate the logs
  • fetch the new HANDLE (a long) through a native call to _get_osfhandle()
  • set the new HANDLE using the previous steps

This is a bit cumbersome - but mixing native code and Java is always cumbersome, so you have to be used to it :)

Epilogue

The two issues were rather different:

  • one was a mistake created when attempting to fix the insane default handle inheritance on both POSIX and Windows API, causing external programs not to run anymore
  • one was a design bug between the Visual C++ library and the JVM (ie. there is no dup2() for HANDLEs), causing random corruptions on files or sockets

In both cases, a simple mishandling of the standard streams was the root cause. But the consequences were nasty, the resolution painful, and the time spent unreasonable (the second bug took months to investigate).

TL;DR: redirecting Stdio is Hard, and prone to nasty bugs if you are not careful enough!

I Found A Bug In strncat()

Yes, A Bug

This code won't behave correctly (i.e. it performs a partial copy) on many glibc versions on Linux:

strncat(dest, source, SIZE_MAX);

This is a corner case, but this is a bug.

(hint: the correct behavior is to do the equivalent of strcat(dest, source))

A Secure Version Of Strcat

It began with code cleanup.

In httrack, there is a lot of legacy (old, dirty) code lying around. Rewriting everything from scratch would probably be a sane decision, but I just don't have the time (or the courage) to do that, so I'm cleaning up bits and pieces, attempting to improve the code quality when possible.

One of the terrible design decisions was to use C strings everywhere, along with the associated strcpy, strcat, etc. primitives. It was an unfortunate decision, because of the lack of flexibility (fixed capacity) and the security issues (buffer overflows that might be introduced).

In this hypothetical case, we’re adding a string to a structure foo:

struct foo {
  char filename[1024];
...
};

void append_filename(struct foo *foo, const char *filename) {
  strcat(foo->filename, filename);
}

This code will overflow if the passed string, appended to the existing one, exceeds the capacity of the filename buffer.

The strlcat function exists on BSD, and is a reasonable solution to mitigate this problem. Unfortunately, some people do not like BSD, and decided that these functions would never be part of glibc. (insert flame here)

Besides, switching to strlcat and its friends would involve checking every single occurrence of strcat, strcpy, etc., probably introducing new bugs!

I then decided to use a modified version of the dangerous libc primitives, using compile-time type checking:

/**
 * Check whether 'VAR' is of type char[].
 */
#if (defined(__GNUC__) && !defined(__cplusplus))
/* Note: char[] and const char[] are compatible */
#define HTS_IS_CHAR_BUFFER(VAR) ( __builtin_types_compatible_p ( typeof (VAR), char[] ) )
#else
/* Note: a bit lame as char[8] won't be seen. */
#define HTS_IS_CHAR_BUFFER(VAR) ( sizeof(VAR) != sizeof(char*) )
#endif
#define HTS_IS_NOT_CHAR_BUFFER(VAR) ( ! HTS_IS_CHAR_BUFFER(VAR) )

/**
 * Append at most N characters from "B" to "A".
 * If "A" is a char[] variable then the size is assumed to be the capacity of this array.
 */
#define strncatbuff(A, B, N) \
  ( HTS_IS_NOT_CHAR_BUFFER(A) \
  ? strncat(A, B, N) \
  : strncat_safe_(A, sizeof(A), B, \
  HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
  "overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) )

The idea was to sniff the arguments passed to the modified strncat version at compile time: when an opaque char* is used, there's nothing you can do, but when a regular char[...] array is passed, a secure version can be used. This is not perfect (especially because, with non-GCC compilers, char[8] is seen as char*), but at least it allows you to quickly mitigate many potential buffer overflows without touching existing code.
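
For illustration, a bounded append along these lines could look like the following (a hypothetical, simplified sketch - the real strncat_safe_ used by the macro above takes extra arguments and reports the overflow location):

#include <stdio.h>
#include <string.h>

/* Append at most 'n' bytes of 'src' to 'dest', never writing more than
   'dest_capa' bytes in total (including the terminating null byte). */
static char *bounded_strncat(char *dest, size_t dest_capa,
                             const char *src, size_t n) {
  const size_t dest_len = strlen(dest);
  const size_t src_len = strlen(src);
  size_t copy = src_len < n ? src_len : n;

  if (dest_len + copy + 1 > dest_capa) {
    /* Truncate instead of overflowing (a real version would also log where). */
    copy = dest_capa > dest_len + 1 ? dest_capa - dest_len - 1 : 0;
    fprintf(stderr, "overflow while appending\n");
  }
  memcpy(dest + dest_len, src, copy);
  dest[dest_len + copy] = '\0';
  return dest;
}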

But at one point I did this quite daring thing to create the secure strcat version:

#define strcatbuff(A, B) strncatbuff(A, B, (size_t) -1)

The (size_t) -1 value was simply passed as the boundary of strncatbuff - at some point, we would be doing strncat(dest, source, (size_t) -1), and this seemed fine to me.

Soon after, bugs started to appear (strings were copied partially). It took me a bit of time (with the kind help of the bug reporter) to figure out that my daring macro was involved. It took a while because the bug only appeared on Linux, on some glibc releases, and quite randomly.

I managed to have a reproducible minimalistic case:

/* strncat_size_t-1_testcase. */

/* gcc -Wall strncat_size_t-1_testcase.c -o strncat_size_t-1_testcase */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* strncat_size_t-1_testcase "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor" */
int main(int argc, char **argv) {
  const size_t size = strlen(argv[1]);
  char *const buffer = malloc(size + 1);
  int success;
  buffer[0] = '\0';
  strncat(buffer, argv[1], SIZE_MAX);
  success = strlen(buffer) == size;
  if (!success) {
    fprintf(stderr, "** result: '%s'\n", buffer);
  }
  assert(success);
  return success ? EXIT_SUCCESS : EXIT_FAILURE;
}

If you build this program, and use it:

gcc -Wall strncat_size_t-1_testcase.c -o strncat_size_t-1_testcase
./strncat_size_t-1_testcase "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor"

… then you may get an error. Or maybe not. Randomly.

Is This A glibc Bug, Or A Misuse Of Strncat ?

The only way to be sure was to carefully read the strncat standard.

The strncat() function shall append not more than n bytes (a null byte and bytes that follow it are not appended) from the array pointed to by s2 to the end of the string pointed to by s1. The initial byte of s2 overwrites the null byte at the end of s1. A terminating null byte is always appended to the result. If copying takes place between objects that overlap, the behavior is undefined.

With the help of specialists from comp.unix.programmer, several points must be noted:

  • strncat is not a secure version of strcat (strlcat is), and the third argument is not the destination capacity
  • using (size_t) -1 as third argument is actually using SIZE_MAX
  • the third argument is an additional limit and can have any value

Therefore, strncat(..., ..., SIZE_MAX) should behave as strcat(..., ...).
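
A tiny illustration of those semantics (assuming the destination is known to be large enough, which remains the caller's job):

#include <stdio.h>
#include <string.h>

int main(void) {
  char dest[16] = "foo";

  /* The third argument only caps how many bytes are taken from the source;
     it says nothing about the capacity of 'dest'. */
  strncat(dest, "barbaz", 3);   /* appends only "bar" */
  printf("%s\n", dest);         /* prints "foobar" */

  /* To protect the destination, the limit must be computed by the caller. */
  strncat(dest, "quux", sizeof(dest) - strlen(dest) - 1);
  printf("%s\n", dest);         /* prints "foobarquux" */
  return 0;
}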

Yes, yes, this is a corner case. But this is a legit one.

Where Is The Bug ?

The reason why not all glibc releases were impacted is simple: this is an optimization bug. (insert laughs from experienced developers)

The original C strncat source is not used on x86-64 architectures: the optimized SSE2 version is used instead.

The code might have been introduced in 2013 in this commit, which was supposed to bring a "Faster strlen on x64" - but some existing files seem to have been merged at that time.

Unfortunately, the assembly source is uncommented, and I did not have the courage to dig deeper. But this source is buggy.

Funny Remark: By the way, I'm not even convinced that the assembly version is faster at all: code execution speed is probably irrelevant compared to L1/L2/L3/memory bandwidth issues.

What's Next ?

First, I obviously stopped using SIZE_MAX with strncat, dispatching to strcat accordingly. Using this corner-case value was a daring thing anyway, prone to … corner-case bugs.

The second step was to file a bug report against glibc.

TL;DR: Beware when using corner cases, even when perfectly conforming to the standard. And never "optimize" code by rewriting obvious functions into 300 lines of undocumented assembly!

Coucal, Cuckoo-hashing-based hashtable with stash area C library

Greater Coucal Centropus sinensis

Coucal ?

According to Wikipedia,

A coucal is one of about 30 species of birds in the cuckoo family. All of them belong in the subfamily Centropodinae and the genus Centropus. Unlike many Old World cuckoos, coucals are not brood parasites.

If you read my previous blog entry on HTTrack's new hashtables, you probably already know that cuckoo hashing is a rather recent (2001) hashing technique used to build efficient hashtables.

The idea is to have a smarter open-addressing strategy, based on kicking out an existing entry when a collision occurs and relocating it to its alternate position. This is why the name Cuckoo hashtable was chosen.

Note that the name is a bit misleading: a typical cuckoo will not only kick out an existing foe, but will consequently kill it (nature, you scary!). Here, we only relocate it, which is why I chose the Coucal, a nicer member of the cuckoo family that does not destroy existing entries - I mean, existing eggs!
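
For readers who skipped the previous entry, the core insertion step goes roughly like this (a simplified sketch with made-up hash functions, not Coucal's actual code; Coucal additionally falls back to a small stash area and re-hashes when too many kicks occur):

#include <stddef.h>

#define MAX_KICKS 64  /* beyond this, fall back to the stash, then rehash */

typedef struct { int used; unsigned key; int value; } slot_t;

/* Two (deliberately naive) independent hash functions for the sketch. */
static size_t hash1(unsigned key) { return key * 2654435761u; }
static size_t hash2(unsigned key) { return (key ^ 0x9e3779b9u) * 40503u; }

/* Each key has two candidate slots; on collision, kick out the resident
   entry and relocate it to its other slot, possibly cascading. */
static int cuckoo_insert(slot_t *table, size_t size, unsigned key, int value) {
  slot_t cur = { 1, key, value };
  size_t pos = hash1(cur.key) % size;

  for (int kick = 0; kick < MAX_KICKS; kick++) {
    if (!table[pos].used) {          /* free slot: done */
      table[pos] = cur;
      return 1;
    }
    /* The "cuckoo" step: evict the resident entry and re-place it. */
    slot_t evicted = table[pos];
    table[pos] = cur;
    cur = evicted;
    const size_t p1 = hash1(cur.key) % size;
    const size_t p2 = hash2(cur.key) % size;
    pos = (pos == p1) ? p2 : p1;
  }
  return 0;  /* too many kicks: a real implementation uses the stash */
}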

This implementation has the following features:

  • guaranteed constant time (Θ(1)) lookup
  • guaranteed constant time (Θ(1)) delete or replace
  • average constant time (O(1)) insert
  • one large memory chunk for table (and one for the key pool)
  • simple enumeration

This implementation has been tested in HTTrack since July 2013, and so far no bugs have been reported (the only issues spotted were bugs in upstream code that were corrupting existing keys).

Please have a look at this library implementation: https://github.com/xroche/coucal

TL;DR: cuckoo hashtables are both efficient and smart!