Should you use std::string, std::u16string, or std::u32string?

 

You are reading the old blog! This post has been moved to http://www.ohadsoft.com/2014/11/should-you-use-stdstring-stdu16string-or-stdu32string/

 

C++11 introduced a couple of new string classes on top of std::string:

  1. u16string
  2. u32string

“Finally”, you must think, “C++ has addressed the sorry state of Unicode development in portable code! All I have to do is choose one of these classes and I’m all set!”.

Well, you might want to rethink that. To see why, let’s take a look at some definitions:

typedef basic_string<char> string;

typedef basic_string<char16_t> u16string;

typedef basic_string<char32_t> u32string;

As you can see, they all use the exact same template class. In other words, there is nothing Unicode-aware, or anything special at all for that matter, about the new classes. You don’t get “Unicode for free” or anything like that. We do see, however, an important difference between them – each class uses a different type as its underlying “character”.
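If you want to convince yourself of this, here is a minimal sketch that checks it at compile time. The helper name code_unit_bytes is made up for this example and is not part of the standard library.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// The three string types are the same template, instantiated with a
// different code unit type. Nothing here knows anything about Unicode.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value,
              "std::u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value,
              "std::u32string is basic_string<char32_t>");

// Hypothetical helper: size in bytes of one "character" (code unit)
// of the given string type.
template <typename String>
constexpr std::size_t code_unit_bytes() {
    return sizeof(typename String::value_type);
}
```

On mainstream platforms code_unit_bytes returns 1, 2 and 4 for the three types, and that really is the only difference between them.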

Why do I say “character” with double quotes? Well, when used correctly, these underlying data types should actually represent code units (minimal Unicode encoding blocks) – not characters! For example, suppose you have a UTF-8 encoded std::string containing the Hebrew word “שלום”. Since each Hebrew letter takes two bytes in UTF-8, the string will actually contain 8 char “characters” – not 4!
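Here is a small sketch of that example. The word is spelled out as explicit UTF-8 bytes so the result does not depend on how your source file is encoded. The names shalom and count_code_points are made up for illustration, and the counting function assumes its input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// "שלום" (U+05E9 U+05DC U+05D5 U+05DD) as raw UTF-8 bytes.
const std::string shalom = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

// Hypothetical helper: counts code points by skipping UTF-8
// continuation bytes (those of the form 10xxxxxx).
// Assumes valid UTF-8; malformed input gives a meaningless count.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

shalom.size() is 8 because size() counts char code units, while count_code_points(shalom) is 4.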

And this is not only true for variable-length encodings such as UTF-8 (and indeed, UTF-16). Suppose your UTF-32 encoded std::u32string contains the grapheme cluster (what we normally think of as a “character”) ў. In its decomposed form, that cluster is a combination of the Cyrillic у character with the Breve diacritic (which is a combining code point), so your string will actually contain 2 char32_t “characters” – not 1! (A precomposed code point, U+045E, also exists, but nothing guarantees that your input uses it.)
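To make this concrete, here is a sketch with the same grapheme cluster written two ways. The variable names are just for illustration.

```cpp
#include <string>

// Decomposed form: CYRILLIC SMALL LETTER U (U+0443)
// followed by COMBINING BREVE (U+0306).
const std::u32string decomposed = U"\u0443\u0306";

// Precomposed form: CYRILLIC SMALL LETTER SHORT U (U+045E).
const std::u32string precomposed = U"\u045E";
```

Both render as ў, yet decomposed.size() is 2, precomposed.size() is 1, and the two strings do not compare equal with operator==. Making them compare equal requires Unicode normalization, which std::u32string does not provide.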

In other words, these strings should really be thought of as sequences of code units, where each string type is more suitable for a different Unicode encoding:

  • std::string is suitable for UTF-8
  • std::u16string is suitable for UTF-16
  • std::u32string is suitable for UTF-32
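As a rough illustration, here is the same single code point (the euro sign, U+20AC) stored in each of the three types. The variable names are made up for this example, and the UTF-8 version is written as explicit bytes to stay independent of source file encoding.

```cpp
#include <string>

// EURO SIGN (U+20AC) in each encoding form.
const std::string    euro_utf8  = "\xE2\x82\xAC"; // three 8-bit code units
const std::u16string euro_utf16 = u"\u20AC";      // one 16-bit code unit
const std::u32string euro_utf32 = U"\u20AC";      // one 32-bit code unit
```

The three strings hold the same text, but size() reports 3, 1 and 1 respectively, because each one counts its own kind of code unit.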

Unfortunately, after all this talk we’re back to square one – what string class should we use? Well, since we now understand this is a question of encoding, the question becomes what encoding we should use. Fortunately, even though this is somewhat of a religious war, “the internet” has all but declared UTF-8 as the winner. Here’s what renowned Perl/Unicode expert Tom Christiansen had to say about UTF-16 (emphasis mine):

I yesterday just found a bug in the Java core String class’s equalsIgnoreCase method (also others in the string class) that would never have been there had Java used either UTF-8 or UTF-32. There are millions of these sleeping bombshells in any code that uses UTF-16, and I am sick and tired of them. UTF-16 is a vicious pox that plagues our software with insidious bugs forever and ever. It is clearly harmful, and should be deprecated and banned.

Other experts, such as the author of Boost.Locale, have a similar view. The key arguments follow (for many more see the links above):

  1. Most people who work with UTF-16 assume it is a fixed-width encoding (2 bytes per code point). It is not (and even if it were, as we already saw, code points are not characters). This can be a source of hard-to-find bugs that may very well creep into production and only occur when some user spells their name with characters outside the Basic Multilingual Plane (BMP), such as rare CJK ideographs, which take two 16-bit code units (a surrogate pair) each. In UTF-8 these things will pop up far sooner, as you’ll be running into multi-byte code points very quickly (e.g. Arabic).
  2. UTF-16 is not backward-compatible with ASCII. UTF-8 is: any ASCII string is already valid UTF-8, byte for byte. (In Unicode there may be multiple code point sequences that define the exact same grapheme cluster, but that does not affect ASCII. UTF-8 forbids overlong encodings, so an ASCII string has exactly one valid UTF-8 byte sequence.)
  3. UTF-16 has endianness issues. UTF-8 is endianness independent.
  4. UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character). Since a lot of strings are inherently English (code, XML, etc.) this tradeoff makes sense in most scenarios.
  5. The World Wide Web is almost universally UTF-8.
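The first argument in the list is easy to demonstrate. The sketch below stores one code point from outside the BMP (U+1F600) in a std::u16string. The names emoji and count_code_points_utf16 are made up for illustration, and the counting function assumes well-formed UTF-16 (an unpaired surrogate is simply counted as one code point).

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the BMP, so UTF-16 stores it as the
// surrogate pair D83D DE00.
const std::u16string emoji = u"\U0001F600";

// Hypothetical helper: counts code points, treating a high surrogate
// followed by a low surrogate as a single code point.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool low_follows = i + 1 < s.size() &&
                                 s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && low_follows) {
            ++i; // skip the second half of the pair
        }
        ++count;
    }
    return count;
}
```

emoji.size() is 2 even though the string holds one code point. Code that indexes or truncates by size() will work for years on BMP-only text and then split a pair in half.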

So now that we know what string class we should use (std::string) and what encoding we should use with it (UTF-8), you may be wondering how we should deal with these beasts. For example – how do we count grapheme clusters?

Unfortunately, the answer depends on your use case and can be extremely complex. A couple of good places to start would be UTF8-CPP and Boost.Locale. Good luck 🙂
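To give a feel for what the first step looks like without a library, here is a hand-rolled sketch that decodes a UTF-8 std::string into code points. It is not how UTF8-CPP or Boost.Locale do it, and the name decode_utf8 is made up. It checks lead and continuation bytes but, as a simplification, does not reject overlong encodings or encoded surrogates, so do not treat it as a validator. Note that it yields code points, not grapheme clusters: grouping code points into clusters needs the Unicode segmentation rules, which is where the libraries come in.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on a bad lead byte, a bad continuation
// byte, or a truncated sequence. Overlong forms are NOT rejected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, decoding the 8 bytes of “שלום” gives a std::u32string of length 4, one element per Hebrew letter.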

One Response to “Should you use std::string, std::u16string, or std::u32string?”

  1. Cobbler Says:

    Hello,

    yes, there are valid points I made those early 2000 like many other devs from non-english countries where intl was already a requirement ;

    it looks like the brits at large just discover it, but however you miss many points about string manipulations and scanning (not talking about some strange char) even if it takes more memory ; utf8 has a huge space complexity for iterating character ; splitting and substr and so on;

    BTW std::string is suitable for holding utf8 like any 8bit arrays, but the container itself is not, everything is binary, it should have a std::u8string collection aware ; then std::string as is ; should be totally deprecated ; wiped out ; but microsoft lobbying is a pain in the arse on many topics ( c++ std ) ; they impose their last 15 years of stupidity instead of embracing their time and do like everyone else ; somehow a trojan/ poisonous folks in the committee ; they should comply or be kicked out like WWW consortium, they abuse their position in the industry, they might smiling on camera and write nice blogs ; but beyond the facade they are full of shit ; I say it because many think the same, the obvious but they shut their mouths hoping for a job offer.

    Best Regards.
