xavier roche's homework

random thoughts and sometimes pieces of code

Embrace Unicode (and Do Not Worry)

The Fear of Unicode/UTF-8

As a programmer – especially a C/C++ programmer – you might be reluctant to embrace the Unicode beast. You vaguely know that wchar.h contains various functions to handle the dirty things, but it does not make things clearer. In Java, programmers use String without trouble, until they want to convert it from/to arrays of bytes (hey, how is Java supposed to handle supplemental characters?)

Unicode is very simple in its essence, even if you can make things unreasonably complex (have you ever decomposed a Unicode character?): it is basically the extension of the ASCII concept (assigning a number to a character) to fit more characters.

The Simplicity Behind The Complexity

Yes, Unicode is rather simple if you keep things simple.

Let’s begin with some concepts:

  • Unicode: a consortium aiming to standardize all possible characters in use on Earth (even the most improbable languages – but this does not include Klingon), such as Latin characters, Chinese, Arabic, etc., and assign each of them a number. Numbers from 0 to 127 are the standard 7-bit ASCII codes, and code points above are non-ASCII characters, obviously (the first range assigned above being Latin letters with accents)

  • UTF-8: a way to encode a Unicode code point into a sequence of bytes; 7-bit ASCII is encoded “as is” (ie. the code point for the letter “A”, which is number 65 in the Unicode world, is encoded as the ASCII byte “A”, which is also 65 in the ASCII world), and the rest uses sequences of bytes in the range (in hex) 80..FD (FE and FF are never used, which is pretty cool, because this greatly helps to tell a UTF-8 stream apart from other encodings)

  • UCS-2: encoding of Unicode code points as 16-bit wide characters – basically, an array of uint16_t. Note that you may only encode code points from 0 to 65535 (a range called the Basic Multilingual Plane, or BMP). This is generally used internally only – most people won’t store or send raw arrays of uint16_t. It comes in two flavors: UCS-2BE (big-endian) and UCS-2LE (little-endian), and a common trick to detect the endianness is to put a zero-width no-break space (sometimes called the byte order mark), whose code point is FEFF, as the first, invisible character of a text stream. The “reversed” code point FFFE is invalid in Unicode and should never appear anywhere. Oh, did I tell you that FE and FF never appear in a UTF-8 stream? If you use the byte order mark, you’ll never confuse a UCS stream with a UTF-8 one.

  • UCS-4: encoding of Unicode code points as 32-bit wide characters – basically, an array of uint32_t. It also comes in two flavors, UCS-4BE (big-endian) and UCS-4LE (little-endian), and the byte-order-mark trick is used here too.

  • UTF-16: encoding of Unicode code points as 16-bit wide characters (like UCS-2), except that code points beyond the Basic Multilingual Plane are encoded as two 16-bit characters in the surrogate range. This is a bit wicked: we are encoding a character using two characters, each of them 16 bits wide. You can treat a UTF-16 stream as a regular UCS-2 stream – in this case, code points beyond the Basic Multilingual Plane will be seen as pairs of unknown-class characters (surrogates).

  • UTF-32: actually a synonym for UCS-4 (and, in my opinion, a confusing name: this is not a variable-length encoding). Oh, and you are not supposed to have surrogates (used by UTF-16, see above) inside a UCS-4/UTF-32 stream.

More Precisely, What About UTF-8?

Let’s have a look at UTF-8, which is generally the preferred encoding for storing and sending data (most RFCs use UTF-8 as the default encoding nowadays).

The good news is that 7-bit ASCII is a subset of UTF-8: nothing will change on this side.

For the rest, UTF-8 is a very simple and smart encoding (Ken Thompson is behind this fine piece of work), and decoding it is not really hard.

If a character outside the 7-bit range is used, UTF-8 will encode it as a byte sequence. For example, the “é” letter (Latin letter e with acute accent), which is in the Unicode world the code point E9 (in hexadecimal – that is, 233 in decimal), will be encoded as the bytes C3 A9 (displayed as “Ã©” – something you’ll learn to recognize, and which generally is a sign of UTF-8 not being recognized by a client). The “姫” Chinese character (“princess”), which is the code point 59EB (23019 in decimal), will be encoded as E5 A7 AB (displayed as “å§«” if your display client or browser has trouble recognizing it).

To understand UTF-8, you just need to know one thing: for a given byte, visualize its 8 bits, and count the number of leading “1”s (ie. the number of “1”s before the first “0”). This is the only information you need to know whether this byte is a single 7-bit ASCII character, the beginning of a UTF-8 sequence, or an intermediate byte within a UTF-8 sequence. Oh, and if the byte is the beginning of a sequence, you also know how long the sequence is: its length is precisely the number of leading “1”s. Isn’t that magic?

Yep, this is it.

Basically,

  • no leading “1”: a 7-bit ASCII character (0..7F)
  • one leading “1”: a continuation byte, in the middle of a sequence
  • two or more leading “1”s: the start of a sequence, whose length is the number of leading “1”s

This is an extremely good property of UTF-8: you can easily cut a UTF-8 string, skip characters, or count characters without bothering about actual code points. Only bother about the leading “1”s.

Oh, by the way: did you know that the number of leading “1”s can be computed without using a loop? Take a look at the Hacker’s Delight recipes (do not hesitate to buy the book – it is worth it, and the recipes are presented with their mathematical explanations; there is no pure magic out there)

Here’s the trick (this computes the number of leading zeros, but you probably have an idea of how to use this function to compute the number of leading ones):

/* Hacker's delight number of leading zeros. */
static unsigned int nlz8(unsigned char x) {
  unsigned int b = 0;

  if (x & 0xf0) {
    x >>= 4;
  } else {
    b += 4;
  }

  if (x & 0x0c) {
    x >>= 2;
  } else {
    b += 2;
  }

  if (! (x & 0x02) ) {
    b++;
  }

  return b;
}

We know how to recognize UTF-8 sequences, and we even know how to get their length. But how do we encode Unicode code points?

Encoding is extremely simple. It could not be simpler, actually: visualize your Unicode code point as a stream of bits (starting from its most significant “1”). You encode those bits into the space remaining in the leading byte, then into the continuation bytes of the sequence. The available space within each byte is what remains after the leading “1”s and the “0” delimiter.

For example, if the leading byte introduces a 4-byte sequence, it starts with four “1”s and a “0” delimiter: three bits remain available for encoding (the first three bits of the code point, starting from its most significant “1”, are stored there). Continuation bytes start with a single leading “1” followed by the “0” delimiter: six bits are available for encoding in each of them.

Let’s go back to the “姫” Chinese character. This character is 59EB in the Unicode charts; here’s how it is encoded as E5 A7 AB (and sorry for my lame Chinese writing skills!):

Encoding 59EB

An Exercise For You!

As an exercise, you can now easily rewrite the getwchar() function for a UTF-8 environment (locale).

Here’s my try (do not cheat! look at this solution only after you have tried!):

#include <stdio.h>
#include <stdlib.h>

#define UTF8_ERROR ( (int) (-2) )

/* Hacker's delight number of leading zeros. */
static unsigned int nlz8(unsigned char x) {
  unsigned int b = 0;

  if (x & 0xf0) {
    x >>= 4;
  } else {
    b += 4;
  }

  if (x & 0x0c) {
    x >>= 2;
  } else {
    b += 2;
  }

  if (! (x & 0x02) ) {
    b++;
  }

  return b;
}

/* Length of a UTF-8 sequence, given its lead byte. */
static size_t utf8_length(const char lead) {
  const unsigned char f = (unsigned char) lead;
  return nlz8(~f);
}

/* Equivalent to getwchar() on an UTF-8 locale. */
static int utf8_getchar() {
  const int c = getchar();
  const size_t len = utf8_length(c);
  if (c < 0) {  /* EOF */
    return EOF;
  } else if (len == 1) {  /* Error: lone continuation byte (in-sequence) */
    return UTF8_ERROR;
  } else if (len == 0) {  /* 7-bit ASCII */
    return c;
  } else {  /* UTF-8 */
    unsigned int uc = c & ( (1 << (7 - len)) - 1 );
    size_t i;
    for(i = 0 ; i + 1 < len ; i++) {
      const int c = getchar();
      /* not EOF, and must be a continuation byte (top bits 10) */
      if (c != -1 && ( c >> 6 ) == 0x2) {
        uc <<= 6;
        uc |= (c & 0x3f);
      } else if (c == -1) {
        return EOF;
      } else {
        return UTF8_ERROR;
      }
    }
    return (int) uc;
  }
}

int main(int argc, char* argv[]) {
  for(;;) {
    const int c = utf8_getchar();
    if (c == EOF) {
      break;
    } else if (c == UTF8_ERROR) {
      fprintf(stderr, "* UTF-8 error\n");
    } else {
      printf("unicode character 0x%04x\n", c);
    }
  }
  return EXIT_SUCCESS;
}

Note that the above code can be improved a little bit, by unrolling the for() loop and using a few switch statements.

TL;DR: Do not worry, and embrace UTF-8 and the Unicode world.

What Are Your GCC Flags?

Say What?

Your GCC build flags. Yes, this is actually an interesting question! I have been building code at the various places where I work for many years, on different architectures, and tweaking the build flags has always been an important task.

I mean, you are probably too experienced to just use something like:

gcc -c fubar.c -o fubar.o

And you probably use -W -Wall, or additional flags, to tune the compiler behavior.

The idea is not only to carefully optimize the produced code, but also to improve its quality and its security. Compiler warnings are critical to spot many programming errors or lethal typos that would otherwise consume days of debugging (and a big pile of money!).

Many beginners still use the default gcc command without further tuning, and this always makes me cringe. Warning messages are not annoying; they are useful. And the minutes you spend fixing them (note that I did not say hiding, but fixing) will spare you days or even months of nightmares.

Here are the flags we are using where I work:

gcc -pipe -m64 -ansi -fPIC -g -O3 -fno-exceptions -fstack-protector -Wl,-z,relro -Wl,-z,now -fvisibility=hidden -W -Wall -Wno-unused-parameter -Wno-unused-function -Wno-unused-label -Wpointer-arith -Wformat -Wreturn-type -Wsign-compare -Wmultichar -Wformat-nonliteral -Winit-self -Wuninitialized -Wno-deprecated -Wformat-security -Werror -c source.c -o dest.o

Among these fancy flags, I was the craziest tuner. Here’s a summary of what I shamelessly added:

  • -pipe I prefer NOT to use temporary files when possible, and to use pipes instead, for example when compiling preprocessed code into intermediate assembly source (.S). This has no real impact on performance (temporary files are generally put on some kind of tmpfs filesystem), but it is a bit cleaner.

  • -fno-exceptions I don’t like C++ exceptions. And we don’t use them where I work. So let’s remove the overhead generated by unwinding code (ie. the code and data that allow the runtime to “roll back” function calls when an exception is thrown, eg. object destructors that need to be called, etc.). Bonus: it may also produce slightly faster code. What if an exception is thrown (eg. by the runtime or the STL), by the way? Well, abort() will be called instead, which is perfectly fine, as thrown exceptions are either programming errors (out_of_range, bad_cast, etc.) or critical conditions (new throwing bad_alloc)

  • -fstack-protector -Wl,-z,relro -Wl,-z,now -Wformat-security These flags are actually ripped from the Debian Hardening Wiki, and are aimed at detecting stack-smashing issues, making the global offset table read-only (preventing attacks involving writes through the GOT), and flagging suspicious format string usages.

  • -fvisibility=hidden Ah yes, this one is cool. It changes the default symbol export mode in GCC to “hidden”. Basically, it means that “extern” symbols are visible to all units within a library, but not outside that library. If you ever wrote code on Windows environments, you probably remember those: __declspec(dllexport) and __declspec(dllimport). Without them, symbols inside your DLL would not be exported (or imported). On POSIX systems, all extern symbols are exported by default, and become visible in your .so library. But this default mode has several drawbacks:

    • You are exporting internal functions you do not want people to mess with
    • You may have internal symbols conflicting with other libraries
    • You may have troubles having your code building on Windows “out of the box”

The last argument was actually the strongest one: the build would break several times per week because someone forgot to properly export the symbols, and the code would build fine on Linux but not on Windows. For all these reasons, using -fvisibility=hidden (and properly exporting symbols using __attribute__ ((visibility("default")))) was a true relief!

  • -Wpointer-arith We do not want to know the size of void, and we won’t use it!

  • -Wformat-nonliteral Banishing “non-literal format strings” was actually a sane decision, especially when the format string comes from some kind of user-generated input – we’re not in 1990 anymore, and security issues involving format strings cannot be ignored anymore.

  • -Winit-self This is actually a joke in the C standard.

int i = i;

Yes, this code is perfectly valid, and won’t even produce a single warning by default. And, of course, i is left “initialized” with an indeterminate (garbage) value. I have been bitten by this idiotic default behavior too often before turning on this specific warning. Shame on me.

  • -Werror Haha. This one was painful to propagate through all of our code. It basically breaks the build once you have a warning somewhere. Yes, the code won’t build unless it is totally warning-free. Yes, people were a bit upset at me at the beginning :), and they were even more upset when committing code that would break on a new compiler version (but not on an older one). This was painful. This was hard. But I did not give up! And at the end, the overall impact was tremendously positive.
    • Warnings emitted during the build are useless unless someone checks them regularly (and grep‘ping thousands of lines of logs is not a cool thing to do – nobody is going to do that)
    • As more warnings are emitted and ignored, new warnings tend to be ignored too, and people give up fixing them (Hey! I’m not the janitor!)
    • Warnings are sometimes annoying, but most of the time they are signs that you are doing something wrong (possibly really wrong)
    • It is far better to spend five minutes fixing a warning than to spend a few months on a vicious bug (yes, strict aliasing rules, I am talking about you)
    • You can always disable specific warnings (-Wno-* flags) if needed!

Oh, and on the linker side, I did a bit of tweaking too:

  • -Wl,-O1 Did you know that there is also an optimization flag for the linker? Now you know!

  • -Wl,--discard-all Discard all local symbols, for the sake of library size.

  • -Wl,--no-undefined Yes, I don’t like missing (unresolved) symbols at link time, even if this is actually a feature. It also makes the Windows build easier, by preventing behavioral differences between Linux and Windows.

  • -Wl,--build-id=sha1 This adds a fancy “build identifier” to all produced modules, with a deterministic value (ie. two builds with the same sources and build flags will produce the same identifier), allowing you to verify whether a rebuild from scratch produced the same binaries.

  • -rdynamic (note: this actually passes the -export-dynamic flag to ld) This linker option exports more symbols to the dynamic symbol table, which is actually a nice thing when attempting to get readable backtraces!

That’s all, folks.

TL;DR: Take the time to read carefully and understand the man gcc and man ld pages. It is worth it.

Thanks to Andrew Hochhaus for changing -fvisibility=internal to -fvisibility=hidden. Thanks for the insightful remarks to all hacker news contributors, including additional flags and their advantages.

Fancy Standalone Visual C++ Compiler

A Fancy Standalone Compiler?

Yes. Many people like me develop on multiple platforms, including Windows, but do not necessarily use Windows as their primary platform. Besides, you may log in on a random Windows machine with a very minimalistic setup (say, a new virtual machine, for example), and not want to have a full compiler installed. You may also use multiple versions of a compiler (playing with Visual C++ 2008, 2010, 2012, 2013…), and switching between them may become a real burden.

For these many reasons, having a standalone compiler, say, ready-to-use on a network drive, is often a very handy solution.

Can We Do That?

Technically, using a full Visual C++ install relocated to an external location, without actually installing anything, is not really supported (I would rather say untested, though)

Fortunately enough, Visual C++ is nice and well-educated enough to accept being used in a standalone fashion; at least for the command-line tools, such as cl.exe or link.exe. Yay!

The Recipe

The recipe is almost the same for all Visual C++ releases (tested from 2005 to 2013): blindly copy the important directories to a safe place, then adjust some DLLs. I will describe the rough necessary steps for 2013 and 2010 – you’ll need to adapt a bit to fit your needs, and sometimes dig into subfolders to catch some DLLs, but this should not be too cumbersome.

  • Start on a fresh machine. Preferably a virtual machine, up to date (service packs). The machine can be trashed after creating the remote compiler, of course. Having a fresh install is quite handy to spot what has been installed, and to avoid being polluted by external DLLs (such as external redistributables)

  • Install a regular Visual C++ Express version. Yes, I’m using the Express flavor, as it is simpler to use, and probably has fewer licensing issues than the “paid” release. You will miss some components, though, such as ATL and MFC (but who uses them nowadays? Err, I may still have some MFC code lying around, by the way…)


  • Create a standalone directory with the compiler binaries (to be placed later on a network drive, for example, or on a USB key). In the tables below:
    • vssrc is generally something like C:\Program Files (x86)\Microsoft Visual Studio 12.0 or C:\Program Files (x86)\Microsoft Visual Studio 10.0
    • mssdk is generally something like C:\Program Files\Microsoft SDKs\Windows
    • programfiles is generally something like C:\Program Files

Do not forget to install every possible update, patch, and service pack…

Visual C++ 2013

Source → Destination
vssrc\VC → VC
vssrc\Common7\IDE 10.0\Common7\IDE\{mspdb*.dll} → VC\bin\
vssrc\VC\redist\x86\Microsoft.VC120.CRT\*.dll → VC\bin\
vssrc\VC\redist\x86\Microsoft.VC120.CRT\*.dll → VC\bin\amd64
vssrc\VC\bin\{mspdb*.*} → VC\bin\amd64
vssrc\VC\bin\msobj120.dll → VC\bin\amd64
programfiles\Microsoft SDKs\Windows\v7.1A\Bin → bin
programfiles\Microsoft SDKs\Windows\v7.1A\Include → include
programfiles\Microsoft SDKs\Windows\v7.1A\Lib → lib
programfiles\MSBuild\12.0\Bin → NET\Framework
programfiles\MSBuild\12.0\Bin\amd64 → NET\Framework64


Visual C++ 2010

Source → Destination
vssrc\VC → VC
vssrc\Common7\IDE\msobj100.dll → VC\bin\
vssrc\Common7\IDE\mspdb100.dll → VC\bin\
vssrc\Common7\IDE\mspdbcore.dll → VC\bin\
vssrc\Common7\IDE\mspdbsrv.exe → VC\bin\
mssdk\v7.1\Redist\VC..\msvcr100.dll → VC\bin\
mssdk\v7.1\Redist\VC..\x64\msvcr100.dll → VC\bin\amd64
programfiles\Microsoft SDKs\Windows\v7.1\Bin → bin
programfiles\Microsoft SDKs\Windows\v7.1\Include → include
programfiles\Microsoft SDKs\Windows\v7.1\Lib → lib
C:\Windows\Microsoft.NET\Framework → NET\Framework
C:\Windows\Microsoft.NET\Framework64 → NET\Framework64

In all cases, you may also want to have Process Explorer and Dependency Walker within the same place, as they are really useful tools for a developer.

Watch out: 64-bit binaries may be placed in different subfolders:

  • x86_x64
  • x64
  • amd64
  • x86_amd64

Yes, Microsoft had some troubles using a single naming scheme for 64-bit :)

After that, you only need to set environment variables to put the 64-bit or 32-bit directories in your PATH, in some kind of cmd script; eg:

@SET VSINSTALLDIR=X:\data\my-standalone-compiler
@SET VCINSTALLDIR=%VSINSTALLDIR%\VC
@SET WindowsSdkDir=%VSINSTALLDIR%\

@set PATH=%VCINSTALLDIR%\BIN\x86_amd64;%VCINSTALLDIR%\BIN;%VSINSTALLDIR%\bin\x64;%PATH%
@set LIB=%VCINSTALLDIR%\LIB\amd64;%VSINSTALLDIR%\LIB\x64
@set INCLUDE=%VCINSTALLDIR%\INCLUDE;%VSINSTALLDIR%\INCLUDE
@set LIBPATH=%VCINSTALLDIR%\LIB\amd64;%VSINSTALLDIR%\LIB\x64

You may deploy Visual C++ redistributables (32-bit or 64-bit) depending on your needs.

After that, you’re done – enjoy the simplicity of having a ready-to-use build setup anywhere.

A last note: I’m not a fan of .NET, but apparently some tools will give you headaches if you use this solution. You have been warned! :)

TL;DR: Who needs to install a compiler on every machine?

I Do Not Want Your Search Bar

I am writing to follow up on an email regarding a possible sponsorship on HTTrack.com. …

I would like to offer you a partnership agreement to monetize your application HTTrack …

We are a software monetization platform and I’m contacting you to discuss possible partnership …

Sheesh.

The first time I got one of these emails, I really didn’t know what it was all about. At first glance, I thought they wanted to offer some kind of ad platform for httrack.com, or maybe some kind of sponsored links.

I generally always decline these offers, as I do not plan to put more ads on the main site (the only ad, placed on the download page, is there to support bandwidth fees, as I’m hosting the binaries directly). Oh, and also because many of these ad providers are really terrible – do you really want ads for counterfeit clothes, fake enhancement pills, or pyramid-scheme businesses? Me neither.

After several emails, I decided to ask for a bit more information on these “partnerships” and “opportunities” to understand what they really wanted.

It appears that what they were interested in was putting a “download bar” (or a “software bundle”, or a “download optimizer”) in the httrack installer executable, and sharing the profits with me (well, at least that is what they said).

The only purpose of the search bar was to “enhance the browser experience” and things like that (yes, really).

Oh great, a search toolbar!

.. said nobody. Ever.

Do you have a search bar installed on your PC? I mean, you have a browser, and you decide to install an additional bar to… search for things. Like with the regular bar.

No, I suppose. Me neither. Nobody does. Not even your grandma.

So, what’s the point?

  • each time the search bar is installed by your bundled application, you get some revenue
  • each time the user uses the search bar, the search is redirected to some kind of second-rate search site loaded with ads (and crap)
  • ads do generate revenue, fueling the system

Great, isn’t it?

Hell no, this is crapware!

Adware, spyware, crapware. Yes. And this was my first reaction, and then I politely declined.

But you can opt-out your users during install!

Oh great. So each time a user installs the program and does not look carefully at the checkbox (enabled by default), he’s going to install the bar. This is the kind of thing that infuriates me.

But you can opt-in your users during install, too!

I was still skeptical, but why not, after all? People installing the program by clicking “next-next” would not be harmed. And after all, it is just a search bar, isn’t it?

Pact With The Devil

Well, no, this is wrong. So wrong.

First, if you take a look at the reputation of all these companies, you’ll be horrified. Endless lists of users trying to remove the infamous search bar, trying to figure out why they caught this virus, why they have pornographic pop-ups everywhere, etc.

Second, nobody will install this component when it is opt-in.

More precisely:

  • very few people will install the opt-in component, generating nearly no revenue at all
  • the people who trust you will install it – and they will be betrayed, they will hate you, and it will harm your reputation

What’s the point, for these companies, in allowing opt-in, by the way, if it does not generate any real revenue?

The bleak truth is that, in this case, they are not providing you marginal revenue. They are using your reputation to cover their bad one.

And this is the last reason why you should never, ever, make a deal with these people: as a developer, you have a reputation. You built something great (with a few bugs, possibly) and you have a large user base trusting you. Do not betray them.

Oh, And One More Thing

Unfortunately, some über-jerks found an even better solution: bundling your application without telling you, and putting it on their smelly download site.

The consequences are extremely annoying:

  • users complaining that httrack contains trojans
  • users who are reluctant to trust httrack because of the bad PR
  • more generally, bad PR for “free” software (is it even safe?)

That is why I recently decided to start signing the Windows installer version with an official trusted code-signing certificate (registered to Xavier Roche). I will probably do the same for the program DLLs soon. By the way, on Debian platforms, I have been signing the sources from the beginning, and this is a very good practice.

TL;DR: Search bars are crap, and you should always download programs from trusted sites.