xavier roche's homework

random thoughts and sometimes pieces of code

Compiler-error-abuse-based large scale renaming in C++

Many developers hate compiler errors. This can become a daily pain, an endless stream of misery and frustration. But what if I told you compiler errors may help you refactor safely and quickly an entire codebase to rename class members and types ?

A Story of Renaming

At my company, we have been using string views for quite some time in our codebase - long before std::string_view was cool (that is, before C++17) - and we recently realized that some of our initial choices might not have been completely wise at that time.

Here’s what we used to write nearly ten years ago (lots of boilerplate removed):

class MyFancyView
{
public:
    MyFancyView(const char* ptr, size_t ptrlen) ...
... lots of fancy functions out there

public:
    const char* data;	// public data pointer
    size_t len;		    // public length
};

First obvious issue with this primitive view are the public members data and len, something which is now frowned upon (for good reasons, as this tend to allow hacky changes in address and pointer). The other issue is that, if we want to be closer to the standard library API, accessors needs to be data() and size().

The solution is rather straightforward: we need to rewrite thousands of lines of code using those direct members, using accessors, and sometimes change the code logic to use internal view logic - typically trimming views, taking a sub-view etc. Simple but overwhelming: the operation needs to be carefully done, line by line.

Besides,

  • Using an editor to do the work may work, but it also may lead to errors. You may replace members in stuff that do not belong. You may also miss a lot of ugly legacy macros embedding the renamed members. The compiler is the only one knowing which type has to be replaced and where.
  • Using tools such as clang-rename did not appear to help either, as they seem more convenient to replace one file at a time, with an offset provided by an editor typically.
  • Using a straight replacement of all data into data() and len into size() in the codebase would lead to havoc: we have a lots of stuff out there with those names and this will never work.

And there’s a last catch: programmers are lazy. There is no world where we’ll do that manually!

Oh, and there’s a perfectly valid excuse not to do it anyway: the risk of introducing subtle regressions is great, and the risk of tuning any code reviewer crazy is even greater.

So let’s start from the beginning: we’d like to rename those two members.

The strategy needs to be a bit subtle:

  1. Rename the member definitions to an intermediate name that shall be unique (ie. in a unique pattern) in the codebase
  2. Identify all members references to be renamed and replace them with the intermediate pattern
  3. Then, we can start doing blind replacements in the codebase

The first item has a pretty straightforward hack: let’s rename data into dAtA, and len into lEn in our definition file!

 public:
-    const char* data{ nullptr };
-    size_t len{ 0 };
+    const char* dAtA{ nullptr };
+    size_t lEn{ 0 };
 };

Why the weird l33t case ? I’m glad you ask! Here’s the reason:

git grep -E "(dAtA|lEn)"

The command above did not yield any matches in our codebase. We have our perfect namings!

One more thing: you might have not realize it, but those two new namings do also have the same length as previously, so that replacing them won’t change a line size. You’ll see later why this is important.

What’s happening if we build the code without any changes ? Well, we’ll obviously have a couple of those errors before the compiler gives up on our own foolishness:

../types/MixedFancy.h:84:16: error: no member named 'len' in 'MyFancyView'
    return lhs.len == rhs.size() && (lhs.len == 0 || __builtin_memcmp(lhs.data, rhs.data(), lhs.len) == 0);

This message is actually cool. It provides four major information:

  • The error we seek to fix (no member named ‘len’)
  • The file involved (types/MixedFancy.h)
  • The line number involved (84)
  • And even the column number involved (16)

And remember those information are extracted from the compiler itself: we can rely on them.

What more could you even ask for ? A margarita ?

If we grep for error: no member named, then isolate the leftmost stuff (file:line:column), we almost are set to fix the error.

With this, if we extract the file, the line, and the column, we can now do our replacement:

LANG=C sed \
    -E "84s/^(.{15})len/\1lEn/" \
    -i "../types/MixedFancy.h"

The leading 84 before sed s option is the line number, and we need to capture 15 (16 - 1) characters from the beginning of the line before column 16. Then, replace the captured part, the new name, and we’re all set. Using LANG=C is required as the compiler we use (clang) will provide offsets at byte level.

A last detail: we’re lazy (we’re merely humans after all), but the compiler shouldn’t be. We need to tell the compiler not to worry if we have thousands of errors, by adding -ferror-limit=0 in the cmake’s cxx_flags. We also need to tell ninja not to be too lazy (by using -k0)

With all this, we can have the final renaming script main logic:

ninja -k0 -C . 2>&1 |
  # capture compiler errors
  grep 'error: no member named' |
  # extract file, line, and column
  cut -f1-3 -d':' |
  # sort all
  sort -u |
  # only keep first file/line when several patterns exist on a single line
  sort -u -k1,2 -t':' |
# now bulk replace
while read line; do
    # extract filename, line, column
    f="$(echo "$line"|cut -f1 -d':')"
    l="$(echo "$line"|cut -f2 -d':')"
    c="$(echo "$line"|cut -f3 -d':')"
    # replace at line l and column c either of the pattern
    LANG=C sed -E -i "$f" \
        -e "${l}s/^(.{$((c-1))})data/\1dAtA/" \
        -e "${l}s/^(.{$((c-1))})len/\1lEn/"
done

It is highly suggested you split each step above until you’re confident enough, to be able to hunt for any mistakes.

A note on the fact we have the same name length for replaced versions:

std::find(_elements.data, _elements.data + _elements.size, fun)

This code above will trigger two warnings, on the same line, but on two different columns. If we were to replace with a name of a different length, this would completely mess up our script.

Going Further

That’s nice, but we also have a lot of stuff inside macros. I know, macros are bad, but legacy reasons…

Errors inside those macros will typically generate stacked error messages; such as in this case where we have two macros calling each other involved (in this case, we were renaming MyFancyView itself):

test/MyFancyTests.cpp:72:9: error: unknown type name 'MyFancyView'
        TEST_ENTITY("〹", 0x3039, false);
        ^
test/MyFancyTests.cpp:64:48: note: expanded from macro 'TEST_ENTITY'
#define TEST_ENTITY(name, value, onlyAlphaNum) _TEST_ENTITY(entities, name, value, onlyAlphaNum)
                                               ^
test/MyFancyTests.cpp:48:9: note: expanded from macro '_TEST_ENTITY'
...

The error location is actually on MyFancyTests.cpp:48:9, but its message is triggered at MyFancyTests.cpp:72:9.

To capture those cases, a multiline grep capture (as demonstrated on StackOverflow) with grep -Pzo will do the magic:

errMatch="(unknown type name|use of undeclared identifier|no type named|no member named)"
fromMacro="note: expanded from macro"
grep -aPzo \
		"(?s)[^\n]*: error: ${errMatch} '[A-Za-z0-9\._]*'[^\n]*\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*(\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*)*"

Then, with this capture, we need to cleanup null characters (emitted by the -z flag), keep only what we need (ie. filter only the desired symbols), and capture the four information we need.

The Final Script

The final script I used is the following. It was aimed for clang, but could be adapted for other compilers. Feel free to use it as inspiration, and happy renaming!

  • It enables -ferror-limit=0
  • It first patches files holding the class/members to be renamed
  • Compute aLtErNaTe case intermediate patterns for replacement
  • Check we don’t have hits on intermediate patterns
  • Replace the files holding the definitions of the types we want to rename
  • Loop until we don’t have any more errors
    • Capture errors and locate patterns to replace
    • Replace patterns
  • Final name replacements from aLtErNaTe case to final versions
  • Removal of idempotent types if we replaced a type by its final alias
  • A bit of clang-tidy where we are at it!
#!/bin/bash

set -e

# Source and build location
readonly sources=$HOME/git/codebase
readonly build=${sources}/build
readonly headerFiles=(MyFancyView.hpp)

# List of replacements
# TODO: put your own replacements here
declare -A replacements=(
	["MyFancyView"]="StandardView"
	["MyFancuString"]="U16StandardView"
	["MyFancuUnicodeString"]="U32StandardView"
	)

# Function to return aLtErNaTe case version of $1
altCase() {
	local result=
	local i
	local up=
	local c
	for i in $(printf "%s" "$1" | sed -e 's/\(.\)/\1\n/g'); do
		if [[ -n "$up" ]]; then
			c=$(echo "$i" | tr '[:lower:]' '[:upper:]')
			up=
		else
			c=$(echo "$i" | tr '[:upper:]' '[:lower:]')
			up=yes
		fi
		result="${result}${c}"
	done
	echo "$result"
}

# Print a message and fail
die() {
	echo "** FATAL: $*" >&2
	exit 1
}

# Emit info stuff
info() {
	echo "[ $* ] ..." >&2
}

# Rebuild logic; places errors in /tmp/errors
# TODO: put your own (re)build logic here. It should build as much as possible, ignoring errors!
rebuild() {
	# clean repository
	git clean -dxf

	# rebuild (-k0 to continue despite of errors)
	rm -rf "build" \
		&& mkdir -p "build" \
		&& (cd "build" && cmake -GNinja .) \
		&& ninja -C "build" -k0 \
		| tee /tmp/errors
}

# Code format logic, because we are worth it
format() {
	# TODO: insert your code format logic here
	info "Cleanup clang-format"
	git diff -U0 --no-prefix --no-color | clang-format-diff-14 -style file -i
}

# TODO: put your own logic here to add -ferror-limit=0 on the build
info "Set error limit to infinity"
sed -e 's/CFLAGS=/CFLAGS=-ferror-limit=0/' -i CMakeLists.txt
git commit CMakeLists.txt -m "chore: Add -ferror-limit=0"

# Compute intermediate class names with mIxEd cAsE for line column stability
declare -A intermediate
srcAllowed=
for from in "${!replacements[@]}"; do
	to="${replacements[$from]}"
	intermediate[$from]=$(altCase "$from")
	if [[ -n "$srcAllowed" ]]; then
		srcAllowed="${srcAllowed}|"
	fi
	srcAllowed="${srcAllowed}${from}"
done

info "Check we don't have hits on intermediate patterns"
for from in "${!intermediate[@]}"; do
	to="${intermediate[$from]}"
	! git grep -q "$to" || die "Oops: $to did hit; please change the logic"
done

info "Do all intermediate replacements in types"
for from in "${!intermediate[@]}"; do
	to="${intermediate[$from]}"
	echo "$from$to"

	# Let's replace directly concerned files first except includes
    # TODO: put your own logic here to specify the file(s) holding the types to be renamed to patch
	for i in ${headerFiles[@]}; do
		sed -e "s/\b${from}\b/${to}/g" -i "$i"
		sed -e "s%${to}\.h\"%${from}\.h\"%g" -i "$i"
	done
done

info "Do all intermediate forward declaration replacements"
for from in "${!intermediate[@]}"; do
	to="${intermediate[$from]}"
	echo "$from$to"

	for i in $(git grep -E "^class ${from};" | cut -f1 -d':' | sort -u); do
		sed -e "s/^class ${from};$/class ${to};/g" -i "$i"
	done
done

# Let's loop until we converge without errors
step=1
while true; do
	info "Rebuild #$step"
	rebuild

	# Let the magic begin: fetch error information and replace exact pattern at the exact places
	errMatch="(unknown type name|use of undeclared identifier|no type named|no member named)"
	fromMacro="note: expanded from macro"

	# First handle inlined macros. Trickier, as this is a multiline capture. (LANG=C because way faster)
	#
	# /code/library/utils/MyFancyTools.cpp:60:45: error: use of undeclared identifier 'UnicodeChar'
	#     constexpr UnicodeChar replacementChar = UNICODE_REPLACEMENT_CHAR;
	#                                             ^
	# /code/library/unicode/unicode.h:24:36: note: expanded from macro 'UNICODE_REPLACEMENT_CHAR'
	#
	# -or- with nested macros:
	#
	# /code/test/MyFancyTests.cpp:72:9: error: unknown type name 'AString'
	#         TEST_ENTITY("〹", 0x3039, false);
	#         ^
	# /code/test/MyFancyTests.cpp:64:48: note: expanded from macro 'TEST_ENTITY'
	# #define TEST_ENTITY(name, value, onlyAlphaNum) _TEST_ENTITY(entities, name, value, onlyAlphaNum)
	#                                                ^
	# /code/test/MyFancyTests.cpp:48:9: note: expanded from macro '_TEST_ENTITY'
	LANG=C \
	grep -aPzo \
		"(?s)[^\n]*: error: ${errMatch} '[A-Za-z0-9\._]*'[^\n]*\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*(\n([^\n]*\n[^\n]*\n)?[^\n]*: ${fromMacro}[^\n]*)*" \
			/tmp/errors |
		tr '\0' '\n' |
		grep -aE "(error: ${errMatch}|${fromMacro})" |
		sed \
			-e "s/^.*error: [^']* '\([^']*\)'[^\n]*$/%\1/" \
			-e "s/^\([^:]*\):\([^:]*\):\([^:]*\): ${fromMacro}.*/\1\t\2\t\3/" |
		tr '\n%' '\t\n' |
		sed -E 's/^([^\t]*)\t([^\t]*\t[^\t]*\t[^\t]*\t)*([^\t]*)\t([^\t]*)\t([^\t]*)\t$/\3;\4;\5;\1/g' |
		grep -aE ";${srcAllowed};" |
		tr ';' '\t' |
		grep -avE '^$' |
		sort -u |
		while read -r elt; do
			file=$(echo "$elt" | cut -f1)
			line=$(echo "$elt" | cut -f2)
			col=$(echo "$elt" | cut -f3)
			from=$(echo "$elt" | cut -f4)
			to="${intermediate[$from]}" || die "Can't find $from in intermediate"
			printf "\r\33[K\r%s %s %s %s %s" "$file" "$line" "$col" "$from" "$to"

			# Note: We need LANG=C as columns are apparently expressed in byte offset
			LANG=C sed -E "${line}s/^(.{$((col-1))})${from}/\1${to}/" -i "$file"
		done

	# Then fetch direct code error information and replace exact pattern at the exact places
	#
	# /code/library/FooWriter.h:35:25: error: unknown type name 'MyFancyView'
	# /code/library/FooWriter.h:431:21: error: use of undeclared identifier 'MyFancyView'
	info "Execute replacements"
	grep -a " error: " /tmp/errors |
		grep -aE "${errMatch}" |
		sed -e "s/^\([^:]*\):\([^:]*\):\([^:]*\):[^']*'\([^']*\)'.*/\1\t\2\t\3\t\4/" |
		grep -aE "\b(${srcAllowed})\b" |
		sort -u |
		while read -r elt; do
			file=$(echo "$elt" | cut -f1)
			line=$(echo "$elt" | cut -f2)
			col=$(echo "$elt" | cut -f3)
			from=$(echo "$elt" | cut -f4)
			to="${intermediate[$from]}" || die "Can't find $from in intermediate"
			printf "\r\33[K\r%s %s %s %s %s" "$file" "$line" "$col" "$from" "$to"

			# Note: We need LANG=C as columns are apparently expressed in byte offset
			LANG=C sed -E "${line}s/^(.{$((col-1))})${from}/\1${to}/" -i "$file"
		done

	# If no changes, bail out
	if git diff-index --quiet HEAD --; then
		break
	fi

	# Format code
	format

	# Commit
	if [[ $step == 1 ]]; then
		info "Commit result"
		git commit -a -m "chore: Automated replacements ($0)"
	else
		info "Merging commit result"
		git commit -a --amend --no-edit
	fi

	step=$((step+1))
done

info "Final name replacements"
for key in "${!replacements[@]}"; do
	from="${intermediate[$key]}"
	to="${replacements[$key]}"
	echo "$from$to"

	for i in $(git grep "$from" | cut -f1 -d':' | sort -u); do
		sed -e "s/\b${from}\b/$to/g" -i "$i"
	done
done

info "Remove idempotent types (using X=X)"
for from in "${!replacements[@]}"; do
	to="${replacements[$from]}"
	echo "$from$to"

	for i in $(git grep -E "^using ${to} = ${to};" | cut -f1 -d':'); do
		sed -E "/^using ${to} = ${to};/d" -i "$i"
	done
done

# Format code
format

info "Merging commit result"
git commit -a --amend --no-edit

info "Final build"
rebuild

Comments