xavier roche's homework

random thoughts and sometimes pieces of code

C Corner Cases (and Funny Things)

Enter The void

void is a very strange animal. Basically void means nothing, emptiness. A function “returning” void actually does not return anything. On the machine code side, the register(s) meant to hold return values are simply left unused in that case.

You won’t be surprised to hear that the size of void is undefined for this reason; i.e. sizeof(void) is not supposed to exist, even if some compilers have rather dirty extensions that set this size to 1 to allow pointer arithmetic.

But what about void* ? A pointer to something that does not exist, what’s that ?

Well, this is the first trap: void* is not really related to void. The void* type is used to store addresses of unspecified types; i.e. you can convert from/to the void* type, for example when passing buffers to generic functions (such as read(), write(), memset()...).

When I say cast, I don’t really mean cast, actually. You can view the pointer to void as a kind of a supertype, and aliasing rules do not apply per se: casting int* to void* is legit and vice-versa. But casting the resulting void* pointer to long*, for example, is not, because aliasing rules would be violated by indirectly casting from int* to long*. Yes, aliasing rules are a whole lot of fun. We’ll see that later.
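As a minimal sketch of void* used as a generic address type, here is a hypothetical helper (swap_objects is an illustrative name, not a library function) that swaps two objects of any type, given their size:

```c
#include <stddef.h>

/* Hypothetical helper: swap two objects of the same size, whatever their type.
 * Any object pointer converts to void* (and back) without an explicit cast. */
static void swap_objects(void *a, void *b, size_t size) {
  unsigned char *pa = a;  /* implicit conversion from void* */
  unsigned char *pb = b;
  for (size_t i = 0; i < size; i++) {
    unsigned char tmp = pa[i];
    pa[i] = pb[i];
    pb[i] = tmp;
  }
}
```

A caller would simply write swap_objects(&x, &y, sizeof(x)) for two int variables x and y; no cast is needed on either side.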

Functions Are Not Data

Here comes another trap: you can cast any pointer from/to void*, except pointers to functions. Yep, in C, pointers to data and pointers to functions are not necessarily of the same size. Basically this means that the address space dedicated to data (data/bss segments) and the one dedicated to code (text segment) are not necessarily of the same size (you could have 32-bit code pointers and 64-bit data pointers, for example). Of course, in practice they almost always are of the same size, because the universe would otherwise collapse, and standard functions such as dlsym would not work anymore.

To quote the holy standard (POSIX, in its dlsym rationale):

The ISO C standard does not require that pointers to functions can be cast back and forth to pointers to data. Indeed, the ISO C standard does not require that an object of type void * can hold a pointer to a function. Implementations supporting the XSI extension, however, do require that an object of type void * can hold a pointer to a function.
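If you need a generic holder for a function address without relying on void*, one portable option is another function pointer type: ISO C guarantees that a function pointer survives a round trip through a different function pointer type. A minimal sketch, where generic_fn, add_one and call_via_generic are illustrative names only:

```c
/* Hypothetical generic "function address" type. */
typedef void (*generic_fn)(void);

static int add_one(int x) {
  return x + 1;
}

/* Convert back to the real type before calling: calling through the
 * wrong function type would be undefined behavior. */
static int call_via_generic(generic_fn fn, int x) {
  int (*typed)(int) = (int (*)(int)) fn;
  return typed(x);
}
```

For instance, call_via_generic((generic_fn) add_one, 41) converts the pointer twice and still calls add_one.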

Function Pointer Definition

Oh, yes.

  int (*function)(char* s, int c) = my_function;

Function pointer definitions are sometimes seen as tricky by newcomers. They are NOT. Simply replace (*function) on the left-hand side (the parentheses are there for operator priority reasons) by a function name, and you just have the classic function declaration.
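Here is a small sketch of the same declaration syntax used for a callback parameter. The names count_char and apply are made up for the example:

```c
#include <stddef.h>

/* Hypothetical function matching int (*)(char *s, int c):
 * counts the occurrences of c in s. */
static int count_char(char *s, int c) {
  int count = 0;
  for (size_t i = 0; s[i] != '\0'; i++) {
    if (s[i] == c) {
      count++;
    }
  }
  return count;
}

/* The parameter is declared exactly like the variable in the text:
 * replace (*function) by a name and you get a function declaration. */
static int apply(int (*function)(char *s, int c), char *s, int c) {
  return function(s, c);
}
```

Calling apply(count_char, text, 'a') then goes through the pointer; the function name converts to a pointer to the function automatically.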

Is char Nice ?

In comparison to void* and function pointers, char is the nice guy:

  • its size is 1 by definition, which makes the pointer variant suitable for pointer arithmetic
  • any other pointer type can also be cast from and to char* without violating aliasing rules (still per se: re-casting to a third incompatible type would still violate aliasing rules)
  • it can be dereferenced after a cast (i.e. you can actually read the bytes within an int safely)
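To illustrate the last point, here is a minimal sketch that inspects the bytes of an arbitrary object through an unsigned char pointer (count_nonzero_bytes is a hypothetical helper written for this example):

```c
#include <stddef.h>

/* Hypothetical helper: count the non-zero bytes in any object.
 * Reading an object through a character type is always allowed. */
static size_t count_nonzero_bytes(const void *object, size_t size) {
  const unsigned char *bytes = object;
  size_t count = 0;
  for (size_t i = 0; i < size; i++) {
    if (bytes[i] != 0) {
      count++;
    }
  }
  return count;
}
```

The result does not depend on endianness: an unsigned int holding 1 has exactly one non-zero byte, wherever that byte is stored.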

But char Is Generally Signed

Yes, generally, because this is not specified in the standard. This is actually annoying, because when handling strings, you may end up with negative characters for byte values above 127, and these negative values are then promoted to negative integers.

Here’s a buggy example:

/** Return the next character within a \0-terminated string, or EOF. **/
int my_read_char(const char *buffer, size_t *offs) {
  if (buffer[*offs] != '\0') {
    return buffer[(*offs)++];  /* here's the trap: char may be signed */
  } else {
    return EOF;
  }
}

This function will return negative values for byte values above 127 (when char is signed), and especially the value -1 for the byte 255 (0xFF), which is typically also the value of EOF.

You need to explicitly cast to an unsigned char to have a working version:

/** Return the next character within a \0-terminated string, or EOF. **/
int my_read_char(const char *buffer, size_t *offs) {
  if (buffer[*offs] != '\0') {
    /* read as unsigned char so that bytes above 127 stay positive */
    return (unsigned char) buffer[(*offs)++];
  } else {
    return EOF;
  }
}

You May Use T* for const T*, But Not T** for const T**

The typical example is that:

  • a char* string may be implicitly converted to const char*:
      char *foo = strdup("Hello, world!");
      const char *bar = foo;  /* No need for a dirty cast. */

... but a char** array of strings may NOT be implicitly converted to const char**:

  char *foo[2];
  foo[0] = foo[1] = strdup("Hello, world!");
  const char **bar = foo;  /* Warning. */

The compiler will complain:

1.c:6:22: warning: initialization from incompatible pointer type [enabled by default]
  const char **bar = foo;  /* Warning. */

The reason behind this is the same reason why, in Java, you cannot cast List<String> into List<Object>: a function taking the supertype array version may add an incompatible type without being noticed.

Here’s an example:

static void replace_the_first_element(const char **array) {
  static const char *hello = "hello, world!";
  array[0] = hello;  /* replace first element by a constant string */
}
...
  char *foo[2];
  foo[0] = foo[1] = strdup("Hello, world!");
  replace_the_first_element(foo);

In this example, without a warning, the function replace_the_first_element would silently replace the first element of foo by a constant string, and the caller would end up with a constant string within its array of non-constant strings when the function returns. The code is strictly equivalent to:

  static const char *hello = "hello, world!";
  char *foo[2];
  foo[0] = hello;  /* now you can see the problem */

I would say that we should be able to silently convert T** into const T*const* (a pointer to constant pointers to constant T), but anyway…
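A minimal sketch of that last idea: a read-only function taking const char *const * cannot sneak a constant string into the array, so the conversion is safe, but C still wants an explicit cast at the call site. The name total_length is only illustrative:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical read-only consumer: neither the pointers nor the
 * characters can be modified through 'array'. */
static size_t total_length(const char *const *array, size_t count) {
  size_t total = 0;
  for (size_t i = 0; i < count; i++) {
    total += strlen(array[i]);
  }
  return total;
}
```

With a char *foo[2] array, the call has to be written total_length((const char *const *) foo, 2); the cast is harmless here, but the compiler cannot be asked to check that for you.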

An Array Is Not Always An Array

Quiz time: what is the test function below supposed to print? (don’t cheat!)

static void test(char foo[42]) {
  char bar[42];
  printf("foo: %zu bytes\n", sizeof(foo));
  printf("bar: %zu bytes\n", sizeof(bar));
}

Well, you might be surprised to see (on a typical 64-bit machine):

foo: 8 bytes
bar: 42 bytes

Yes, an array of type T in a function argument list is strictly equivalent to declaring the type as T*. This is why hardly anybody uses this confusing syntax in function argument lists.

To quote the standard:

“A declaration of a parameter as ‘array of type’ shall be adjusted to ‘qualified pointer to type’, where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation.”

This also implies that the array is effectively passed by reference, not by value.

… and this is totally inconsistent with the handling of structures as function argument, or as return type. Yes, yes, historical mistakes.
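The inconsistency can be sketched with two hypothetical functions, one taking an array parameter and one taking a structure that wraps the same array (struct wrapped, touch_array and touch_struct are names invented for the example):

```c
/* Hypothetical wrapper: an array inside a structure. */
struct wrapped {
  char data[42];
};

/* 'foo' is really a char*: the caller's array is modified. */
static void touch_array(char foo[42]) {
  foo[0] = 'x';
}

/* 'w' is a copy of the whole structure: the caller's one is untouched. */
static char touch_struct(struct wrapped w) {
  w.data[0] = 'x';
  return w.data[0];
}
```

After touch_array(array) the caller sees the 'x'; after touch_struct(w) the caller's w.data[0] still holds its old value, even though the same 42 bytes were involved.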

Switch And Case Are Hacks

switch can be seen as a dispatch function, and case as a label. Each case must live inside a switch, but you are free to interleave loops if you want, such as in:

switch (step) {
  while (1) {
    case step_A:
    case step_B:
      ...
   }
}

See my previous An Unusual Case of Case (/switch) entry for more fun!
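The best-known use of this interleaving is probably Duff's device, an unrolled copy loop where the switch jumps into the middle of a do/while body. The sketch below is a variant of it, with copy_bytes as an illustrative name; it is shown for the syntax, not as a recommendation over memcpy:

```c
#include <stddef.h>

/* Variant of Duff's device: copy 'count' bytes, 8 per loop iteration.
 * The switch dispatches into the loop body to handle the remainder first. */
static void copy_bytes(char *to, const char *from, size_t count) {
  if (count == 0) {
    return;
  }
  size_t n = (count + 7) / 8;  /* number of passes through the loop */
  switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
  }
}
```

With count equal to 13, the switch enters at case 5, copies 5 bytes, and the loop then runs one full pass of 8.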

Pointer To Arrays

Let’s say I have a char foo[42] and I want a pointer to this array in bar. I can write:

  • bar = foo
  • bar = &foo
  • bar = &foo[0]

What shall I use? The first and third pointers are identical: they are char* ones, the kind you generally want to use, pointing to a char area of unspecified size. The second one is a pointer to the array of 42 chars itself. You may then use:

  • char *bar = foo;
  • char (*bar)[42] = &foo;
  • char *bar = &foo[0];

The pointer to the array of 42 chars syntax is a bit weird for newcomers, but it is logical: *bar is an array of 42 chars, so you just have to replace foo by (*bar) in the standard definition to have the correct syntax (the parentheses are needed for operator priority reasons, otherwise you’ll declare an array of 42 pointers to char). (This trick is also helpful to understand the apparently obscure pointer to function syntax.)

The advantage of the second definition is that you still have a pointer to an array of 42 chars: sizeof(*bar) is 42, and *bar is of type char[42] (suitable for specific compiler optimizations and safe string operations).
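A short sketch of the difference, using two hypothetical functions: one receives a pointer to the whole array and keeps its size, the other receives a plain char* and only knows the size of a pointer.

```c
#include <stddef.h>

/* The pointed-to type is char[42], so the size information survives. */
static size_t pointed_array_size(char (*bar)[42]) {
  return sizeof(*bar);
}

/* Here only the address is left: this is the size of a pointer. */
static size_t plain_pointer_size(char *bar) {
  return sizeof(bar);
}
```

With char foo[42], pointed_array_size(&foo) yields 42, while plain_pointer_size(foo) yields sizeof(char*). A nice side effect is that passing the address of an array of any other size to the first function is a type error.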

Empty Structure

struct { } foo;

This code is undefined in C: an empty structure is not officially supported. Some compilers such as GCC support them, and define the size of the structure as 0.

It is supported in C++ (because many classes do not have any member within) and in such a case the sizeof of such an empty structure is … 1, because C++ needs a different pointer for every instance of a given class. Oh, and because you also need to do pointer arithmetic, and having a null size would cause issues.

Strings Are char Arrays, And Are Not Const (But They Really Are)

Another corner case: string literals, such as "hello world" are actually arrays of char, with a terminating \0. The two definitions are equivalent:

char *foo = "hi";

static const char hi[] = { 'h', 'i', '\0' };
char *foo = (char*) hi;

Note that sizeof("hi") is equal to 3 (an array of 3 chars).

Oh, did you notice the (char*) in the second example ? Isn’t it a bit dirty ?

  • yes, it is
  • in C, string literals are unfortunately of type “array of char” and not “array of const char”
  • in practice, they behave like “array of const char”: if you dare to write inside a string literal, chances are you’ll segfault, because the string literal usually lives inside a global read-only data segment
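If you actually need to modify the characters, initialize an array from the literal instead of pointing at it: the array is then a private, writable copy. A minimal sketch (both function names are made up for the example):

```c
#include <stddef.h>

/* sizeof applied to the literal counts the terminating \0. */
static size_t literal_size(void) {
  return sizeof("hi");
}

/* An array initialized from a literal is a writable copy of it. */
static char first_char_after_edit(void) {
  char writable[] = "hi";  /* 3 bytes on the stack: 'h', 'i', '\0' */
  writable[0] = 'H';       /* fine: we are writing to our own array */
  return writable[0];
}
```

Writing through a char* that points directly at the literal would compile just as well, which is exactly the trap.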

Parentheses Sometimes Have Different Meanings

Let’s have a look at two code samples:

int i = 0, j = 0;
foo(i++, j+=i, j++);

int i = 0, j = 0;
bar((i++, j+=i, j++));

The first sample calls a function named foo taking three integers as arguments. It has undefined behavior: the side-effects on i and j will be executed at an undefined time, and might even be optimized. You basically don’t have a clue of what foo will receive as arguments.

The second sample calls a function named bar taking one integer argument. It has a perfectly defined behavior. This integer is the result of the comma operator i++, j+=i, j++, which can be rewritten as the pseudo-C++-code:

static int my_comma(int &i, int &j) {
  i++;
  j+=i;
  return j++;
}
...

int i = 0, j = 0;
bar(my_comma(i, j));

The comma operator evaluates each expression starting from the leftmost member to the rightmost member, and yields the last member as value. All other values are discarded (each of them is evaluated as a void expression, actually). And between each member evaluation, there is a sequence point: side-effects of each expression are committed before the next expression is evaluated. This is why this expression has a perfectly defined behavior.
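The same evaluation can be checked in plain C. The sketch below wraps the comma expression in a hypothetical comma_demo function that reports the final values of i and j through out-parameters:

```c
/* Evaluate (i++, j += i, j++) from i = 0, j = 0 and report what happened. */
static int comma_demo(int *i_out, int *j_out) {
  int i = 0, j = 0;
  int value = (i++, j += i, j++);
  /* step 1: i++     -> i becomes 1, value discarded
   * step 2: j += i  -> j becomes 1, value discarded
   * step 3: j++     -> yields 1, then j becomes 2 */
  *i_out = i;
  *j_out = j;
  return value;
}
```

The value passed to bar in the second sample is therefore 1, with i equal to 1 and j equal to 2 afterwards.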

Logical And/Or Can Replace If

Two pretty nice features of the “boolean and” (&&) and the “boolean or” (||) operators:

  • they commit side-effects between each operation: if (foo[i++] != '\0' && bar[i++] != '\0') is perfectly defined, as the post-increment of i will be committed before the evaluation of the right expression (i.e. a sequence point exists between the evaluation of the left and right operands)
  • they are short-circuit operators, which means that the right member is guaranteed to be evaluated only if the left member alone could not determine the result of the expression; this allows you to write something like if (foo != NULL && *foo != '\0') without fear of dereferencing a NULL pointer
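A small sketch of both uses, with hypothetical helper names: the first function relies on short-circuit evaluation to guard a dereference, the second uses && in place of an if statement.

```c
#include <stddef.h>

/* The right-hand side is never evaluated when s is NULL. */
static int is_non_empty(const char *s) {
  return s != NULL && *s != '\0';
}

/* Same effect as: if (count != NULL) { ++*count; } */
static void bump_if_present(int *count) {
  (void) (count != NULL && ++*count);
}
```

Whether the second form is more readable than a plain if is debatable; it mostly shows up in macros, where an expression is needed.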

Beware of char buffer[1024] = ""

Not only is it a bit dirty to have large buffers on the stack (I must admit that some legacy code in httrack is filled with that, cough cough), but the C standard ensures that when initializing a structure or an array, missing elements shall be initialized to zero.

This basically means here that the first element of the array is explicitly initialized to zero (this is the terminating \0 of the empty string), and the 1023 other elements will be implicitly initialized to zero, too. The performance impact is comparable to that of memset(buffer, 0, sizeof(buffer)).

Use at least an explicit buffer[0] = '\0' without any initializer when declaring the buffer.
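A minimal sketch of the suggested pattern, assuming a hypothetical format_greeting function that builds a string in a local buffer:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical example: build "hello, <name>" in a local buffer,
 * then copy it to 'out'. Returns the length of the result. */
static size_t format_greeting(const char *name, char *out, size_t out_size) {
  char buffer[1024];  /* no initializer: nothing is zero-filled */
  buffer[0] = '\0';   /* only one byte is written to make it an empty string */
  strncat(buffer, "hello, ", sizeof(buffer) - strlen(buffer) - 1);
  strncat(buffer, name, sizeof(buffer) - strlen(buffer) - 1);
  snprintf(out, out_size, "%s", buffer);
  return strlen(out);
}
```

The rest of the buffer stays uninitialized, which is fine as long as the code only ever reads up to the terminating \0.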

The Bogus strchr (and Friends) Definitions

Did you notice that ?

char *strchr(const char *s, int c);
char *strstr(const char *haystack, const char *needle);
...

All these functions take a const char* and return a char*. How can a function transform an array of constant chars into a non-constant one?

Well, obviously this is impossible, and the only reason for this oddity was to offer a somewhat generic function that would work for both char* and const char* strings: accepting const char* as argument will also accept char* versions by silently converting them, and returning char* will also allow you to store the result in a const char* for the same reason.

Unfortunately, this may lead to:

char *pos = strchr(my_const_string, ' ');
if (pos != NULL) {
  *pos++ = '\0';  /* oops, I forgot that the string was constant, and the compiler did not warn me! */
  /* kaboom here */
} 

A sane decision would be to split all these string handling functions into const and non-const ones. In C++, you can have overloaded versions.
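In C, such a split can be sketched with two thin wrappers around strchr; the names find_char_const and find_char_mutable are invented for the example and are not part of any library:

```c
#include <string.h>

/* Constant in, constant out: the result cannot be written through. */
static const char *find_char_const(const char *s, int c) {
  return strchr(s, c);
}

/* Non-constant in, non-constant out. Passing a const char* here
 * makes the compiler complain, which is the whole point. */
static char *find_char_mutable(char *s, int c) {
  return strchr(s, c);
}
```

With these, the buggy snippet above no longer compiles silently: storing the result of find_char_const in a char* draws a warning, and find_char_mutable refuses a constant string.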

Never Divide By Zero .. Or Divide INT_MIN by -1 After Midnight

Dividing an integer by zero is undefined behavior in C; on typical x86 systems it leads to a floating point exception (Yes, this is not a floating point operation, but the error triggers one.).

But dividing the smallest signed integer by -1 also leads to the same floating point exception. This is a bit of a curiosity: when dividing the smallest integer (-2147483648) by -1, the result should be 2147483648, but this is too large (by one) to be represented in an integer (the largest integer is 2147483647), and the divide operation then triggers the same exception as the one triggered by the division by zero.

However, multiplying the smallest signed integer by -1 does not trigger anything because… well, because the multiply operation does not raise anything, even when overflowing.

Therefore, the first printf below will gently print a result, but the second one won’t, and the program will abort:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Note: using volatile to prevent the div to be optimized out */
volatile int int_min = INT_MIN;
volatile int minus_1 = -1;
int main(int argc, char **argv) {
  printf("%d\n", int_min * minus_1);
  printf("%d\n", int_min / minus_1);

  return EXIT_SUCCESS;
}

By the way, what’s the first result? Well, -2147483648 (on the usual machines, that is: signed overflow is formally undefined behavior), because the exact product of the two 32-bit numbers is 2147483648 (0x0000000080000000 as a 64-bit value), and only the lowest 32 bits are kept, giving 0x80000000, which is… -2147483648 in two’s complement arithmetic. Basically the overflow was silently ignored. Got it?
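A defensive division only has to rule out those two cases before dividing. A minimal sketch, where safe_div is a hypothetical helper and not a standard function:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: divide a by b, refusing the two cases that trap.
 * Returns false (and leaves *result untouched) when the division is unsafe. */
static bool safe_div(int a, int b, int *result) {
  if (b == 0) {
    return false;  /* division by zero */
  }
  if (a == INT_MIN && b == -1) {
    return false;  /* the quotient would not fit in an int */
  }
  *result = a / b;
  return true;
}
```

The caller then decides what an unsafe division should mean (error, saturation, and so on) instead of letting the program abort.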

Macros Are Not Functions

Macros are often considered evil (and will even cause C++-fans to scream and foam at the mouth), especially when hiding multiple argument evaluation:

#define min(a, b) ( (a) < (b) ? (a) : (b) )

This macro will not behave well when used with something like min(++i, ++j): a second increment will be committed in either i or j…

Another difference between a function and its macro version lies in the sequence points: a function call guarantees that the side effects of its arguments are committed before the function body runs, and those of the body before it returns, whereas a macro is plain text substitution and offers no such guarantee.
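The double evaluation is easy to observe by comparing the macro with a function version. In this sketch, MIN_MACRO is the macro from above under another name, and min_int, macro_result and function_result are illustrative helpers:

```c
#define MIN_MACRO(a, b) ( (a) < (b) ? (a) : (b) )

/* Function version: each argument is evaluated exactly once. */
static int min_int(int a, int b) {
  return a < b ? a : b;
}

/* Expands to ((++*i) < (++*j) ? (++*i) : (++*j)): one extra increment. */
static int macro_result(int *i, int *j) {
  return MIN_MACRO(++*i, ++*j);
}

static int function_result(int *i, int *j) {
  return min_int(++*i, ++*j);
}
```

Starting from i = 0 and j = 5, the macro version returns 2 and leaves i at 2, while the function version returns 1 and leaves i at 1; j ends at 6 in both cases.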

Preprocessor Magic

You know that __FILE__ is a magic preprocessor macro holding the current source filename, and __LINE__ the line number at the point of macro expansion. This is quite convenient for printing debugging information:

#define PRINT_CURRENT_FILE_AND_LINE printf("at %s:%d\n", __FILE__, __LINE__)
...
int main(...) {
...
  PRINT_CURRENT_FILE_AND_LINE;  /* print some debug */
...

The __FILE__ and __LINE__ macros will be expanded to the location in the main() function, at expansion time.

You can use the expansion time behavior to convert the __LINE__ numerical token to a string one, for example using a double macro expansion:

#include <stdio.h>

#define L_(TOKEN) #TOKEN
#define L(TOKEN) L_(TOKEN)
#define FILE_AND_LINE __FILE__ ":" L(__LINE__)

int main(int argc, char **argv) {
  printf("we are at: %s\n", FILE_AND_LINE);
  return 0;
}

Expanding FILE_AND_LINE will form a string using the three substrings (in C, "foo" "bar" is equivalent to "foobar"). We need the intermediate L() macro because otherwise __LINE__ would not be expanded before being stringified, and the resulting string would simply be … "__LINE__".
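The difference between the one-level and two-level versions can be checked directly. In this sketch the macros are renamed STR_DIRECT and STR_EXPANDED for clarity, and the two functions only exist to expose the resulting strings:

```c
#define STR_DIRECT(TOKEN) #TOKEN
#define STR_EXPANDED(TOKEN) STR_DIRECT(TOKEN)

/* The # operator is applied to the unexpanded argument:
 * this yields the literal text "__LINE__". */
static const char *direct_line(void) {
  return STR_DIRECT(__LINE__);
}

/* __LINE__ is expanded while passing through the outer macro,
 * so this yields the line number as a string of digits. */
static const char *expanded_line(void) {
  return STR_EXPANDED(__LINE__);
}
```

The rule behind it is that macro arguments are fully expanded before substitution, except where they are operands of # or ##.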

Aliasing Rules

My headache is killing me, so I’m going to redirect you to the excellent article on cellperformance. But aliasing rules are very, very, very fun, trust me!

More ?

Do not hesitate to comment on this article to complete it! There are other C corner cases I probably didn’t mention…

TL;DR: corner cases are not reserved to C++!
