Anyone who has dealt with both “narrow” and “wide” strings in the same C++ module is inevitably faced with this little annoyance:
```cpp
std::wstring get_wide_text() { /* . . . */ }

void write_narrow_text(std::ostream& os)
{
    // ERROR! can't write wide string to narrow stream!
    os << get_wide_text();
}
```
In MSVC, you might see an error like C2676 ("binary '<<': 'std::basic_ostream<char,std::char_traits<char>>' does not define this operator or a conversion to a type acceptable to the predefined operator"). But if you think about it, it's totally sensible for the C++ standard library not to define any default operators that work across different string types.
For one thing, it is not possible to know for certain the encoding of a std::string or std::wstring just by looking at its character type. Plenty of code out there gets away with assuming std::string means UTF-8 and std::wstring means UTF-16. That is, however, just convention and not contractual. So, if we want our program to interoperate with these string types, we have to pay for what we use. We must convert!
You might try to reach for an STL solution to this surely near-universal problem. Sadly, the code that did exist to handle this scenario, the codecvt header, is now deprecated. That being said, it is not too hard to cook up a Win32 replacement for the very simple use case we are dealing with above, i.e., conversion from UTF-16 to UTF-8. The Windows API for this is WideCharToMultiByte.
Now, we can fairly easily write a nice C++ wrapper over this C-style API (see commit on GitHub):
```cpp
std::string to_utf8(const std::wstring& utf16_input)
{
    const auto input_size = gsl::narrow<int>(utf16_input.size());
    const auto required_size = WideCharToMultiByte(
        CP_UTF8, 0, utf16_input.c_str(), input_size, nullptr, 0, nullptr, nullptr);
    THROW_LAST_ERROR_IF_MSG(required_size == 0, "Failed to get required size");

    auto utf8_result = std::string(required_size, '\0');
    const auto actual_size = WideCharToMultiByte(
        CP_UTF8, 0, utf16_input.c_str(), input_size,
        utf8_result.data(), required_size, nullptr, nullptr);
    THROW_LAST_ERROR_IF_MSG(actual_size == 0, "Failed to convert to UTF-8");

    return utf8_result;
}
```
Note that we’re using a few helper libraries here: GSL (for a bit of safety) and WIL (for a bit of convenience). Also note the “call twice” pattern: the first call discovers how long the converted string will be, so that we can prepare a buffer of the right size for the final call. (This pattern is used by so many Win32 APIs that WIL has helpers for one common resize-and-retry strategy.)
If you want simplicity, it’s hard to beat this implementation. Ah, but wait, it’s a bit too simple. Because 0 is used as an error signal in this Win32 API, we cannot tell the difference between a true zero-length string and zero-because-we-failed. This should fix it (see commit on GitHub):
```cpp
const auto input_size = gsl::narrow<int>(utf16_input.size());
if (input_size == 0)
{
    return {};
}
```
Wait, we still need to account for totally invalid inputs (see commit on GitHub):
```cpp
const auto required_size = WideCharToMultiByte(
    CP_UTF8, WC_ERR_INVALID_CHARS, utf16_input.c_str(), input_size,
    nullptr, 0, nullptr, nullptr);
// . . .
const auto actual_size = WideCharToMultiByte(
    CP_UTF8, WC_ERR_INVALID_CHARS, utf16_input.c_str(), input_size,
    utf8_result.data(), required_size, nullptr, nullptr);
```
Okay, so now it’s simple and it definitely solves the problem we had above:
```cpp
void write_narrow_text(std::ostream& os)
{
    // Error is gone, and everything works...
    // ...as long as we're prepared to deal with UTF-8 output
    os << to_utf8(get_wide_text());
}
```
But there is something not quite satisfying about this solution. In particular, what if our access pattern looked more like this?
```cpp
void write_narrow_text(std::ostream& os)
{
    const auto s = to_utf8(get_wide_text());
    for (auto c : s)
    {
        // put commas before each space char
        if (c == ' ')
        {
            os << ',';
        }
        os << c;
    }
}
```
The function above would produce an output like "a, b, c" when given a (wide string) input of L"a b c". We have to walk through each char of the converted string individually, but we also had to walk the data when converting the input — twice, in fact, as discussed before. Luckily, there is no rule that says we have to produce a full-fledged std::string from the conversion process — WideCharToMultiByte, being a typical C-callable Win32 API, works purely on raw buffers. Wouldn’t it make sense, then, to convert as we go, block by block? Let’s find out!
First, we have to figure out a strategy for doing this block conversion. Using a size_t template parameter to define the output block size gives us some niceties around constexpr-ness. But what should the input block size be? Astute readers of the Unicode standard will note that UTF-16 can represent any character in at most two code units. In other words, we might have one or two wchar_t elements for each “character” (or really, code point). Further, UTF-8 may need up to four code units (which we store as char) for the same code point. The absolute worst case, then, if we knew nothing more than this, is the expansion of one wchar_t of input to four chars of output. (It turns out only surrogate pairs should end up producing the longest sequences in UTF-8, but this 1:4 ratio is a safe upper bound for our purposes.)
Next, we are keeping our char-by-char goal in mind, which means we should build a range type. Internally, the range should use some sort of input iterator to do the heavy lifting. How do we know when we’ve reached the end? We could use a sentinel.
Putting all this together, we end up with something like this for our “public” API:
```cpp
template <size_t N>
    requires (N > 3)
detail::Utf8Range<N> to_utf8(const std::wstring& utf16_input)
{
    return { std::wstring_view{utf16_input} };
}
```
The user needs to decide on a block size N and pass an input string, and we give back the range as promised:
```cpp
template <size_t N>
    requires (N > 3)
class Utf8Range
{
public:
    using Input = std::wstring_view;
    using Output = gsl::span<char>;

    class Iterator;
    struct Sentinel {};

    Utf8Range(Input input) noexcept : m_input{ input }, m_block{} {}

    Iterator begin() { return { m_input, Output{m_block} }; }
    Sentinel end() { return {}; }

    // . . .

private:
    Input m_input;
    std::array<char, N> m_block;
};
```
This should be enough to support a range-based for loop that does a single pass over the converted output string. Now what should the iterator look like? It has a few complexities, such as needing to convert the next block when we exhaust the current one, and the expectation of handing back the first char right away (unlike C# enumerators, which are positioned before the first element). As a first cut, this might work (see commit on GitHub):
```cpp
class Iterator
{
public:
    using difference_type = std::ptrdiff_t;
    using value_type = char;

    Iterator() noexcept : m_source{}, m_next_offset{}, m_block{}, m_index{} {}

    Iterator(Input source, Output block) :
        m_source{ source }, m_next_offset{}, m_block{ block }, m_index{ m_block.size() - 1 }
    {
        move_next();
    }

    char operator*() const { return m_block[m_index]; }

    Iterator& operator++()
    {
        if (!at_end())
        {
            move_next_char();
        }
        return *this;
    }

    Iterator operator++(int)
    {
        Iterator prev = *this;
        ++*this;
        return prev;
    }

    bool operator==(Sentinel) const noexcept { return at_end(); }

private:
    bool at_block_end() const noexcept { return m_index == m_block.size() - 1; }

    bool at_end() const noexcept
    {
        return (m_next_offset >= m_source.size()) && at_block_end();
    }

    void move_next_char()
    {
        if (!at_block_end())
        {
            ++m_index;
        }
        else
        {
            move_next();
        }
    }

    void move_next()
    {
        if (at_end())
        {
            return;
        }
        const auto hr = try_move_next();
        THROW_IF_FAILED_MSG(hr, "Error near offset %llu while attempting UTF-8 conversion", m_next_offset);
        m_index = 0;
    }

    HRESULT try_move_next()
    {
        const auto input_block = next_block();
        const auto input_size = gsl::narrow<int>(input_block.size());
        static constexpr const auto output_size = gsl::narrow<int>(N);
        const auto actual_size = WideCharToMultiByte(
            CP_UTF8, WC_ERR_INVALID_CHARS, input_block.data(), input_size,
            m_block.data(), output_size, nullptr, nullptr);
        if (actual_size == 0)
        {
            return HRESULT_FROM_WIN32(GetLastError());
        }
        // Rebuild the span from the (stable) buffer start rather than calling
        // subspan, which would be invalid once a previous block had shrunk it.
        m_block = Output{ m_block.data(), gsl::narrow<size_t>(actual_size) };
        // Advance by the number of *input* code units consumed, not output bytes.
        m_next_offset += input_block.size();
        return S_OK;
    }

    auto next_block() const
    {
        static constexpr const auto M = (N / 4);
        const auto max_input_size = m_source.size() - m_next_offset;
        const auto input_size = (M <= max_input_size) ? M : max_input_size;
        return m_source.substr(m_next_offset, input_size);
    }

    Input m_source;
    size_t m_next_offset;
    Output m_block;
    size_t m_index;
};
```
This works for the scenarios we described above, correctly sliding the block forward as we reach the end of each previous one. And then someone asks, “What about emoji?” Ah, yes — emoji characters use the surrogate pair encoding we talked about before. Each one uses two wchar_ts, and it would be an error to try to split one “down the middle,” so to speak. We can see this error in action by using this test:
```cpp
TEST(to_utf8_test, emoji_block)
{
    const auto raw = std::array<uint16_t, 5>{ 0xD83D, 0xDE0E, 0xD83D, 0xDC4D, 0x0000 };
    const auto input = std::wstring{ reinterpret_cast<const wchar_t*>(raw.data()) };
    const auto expected = std::string{ "\xF0\x9F\x98\x8E\xF0\x9F\x91\x8D" };

    test_utf8_block<4>(input, expected);
    test_utf8_block<9>(input, expected);
    test_utf8_block<14>(input, expected);
    test_utf8_block<19>(input, expected);
}
```
The first test function (test_utf8_block<4>) fails because the four-char output block is mapped to a one-wchar_t input block, and it is not valid to have an unpaired surrogate. The fix is to detect this error and then widen the input block by one wchar_t as a retry step. Worst case, we’ll fail again and report the same error. With luck, though, we’ll continue happily converting the rest of the string (see commit on GitHub):
```cpp
if (FAILED(try_move_next<false>()))
{
    const auto hr = try_move_next<true>();
    THROW_IF_FAILED_MSG(hr, "Error near offset %llu while attempting UTF-8 conversion", m_next_offset);
}

// . . .

template <bool OneMore>
HRESULT try_move_next()
{
    const auto input_block = next_block<OneMore>();
    // . . .
}

template <bool OneMore>
auto next_block() const
{
    static constexpr const auto M = (N / 4) + (OneMore ? 1 : 0);
    // . . .
}
```
Whew! That’s a fair amount of work, but it seems to do the job. Now to finally answer the question asked in the title of this article — how fast is this? Let’s run some benchmarks (see full commit on GitHub):
```cpp
static void to_utf8_very_long_string(benchmark::State& state)
{
    const auto input = make_very_long_string();
    auto ss = std::stringstream{};
    for (auto _ : state)
    {
        ss.clear();
        const auto result = str::to_utf8(input);
        ss << result;
    }
}

// . . .

static void to_utf8_very_long_string_by_char(benchmark::State& state)
{
    const auto input = make_very_long_string();
    auto ss = std::stringstream{};
    for (auto _ : state)
    {
        ss.clear();
        for (auto c : str::to_utf8(input))
        {
            ss << c;
        }
    }
}

// . . .

template <size_t N>
static void to_utf8_block_very_long_string(benchmark::State& state)
{
    const auto input = make_very_long_string();
    auto ss = std::stringstream{};
    for (auto _ : state)
    {
        ss.clear();
        for (auto c : str::to_utf8<N>(input))
        {
            ss << c;
        }
    }
}
```
These are the basic benchmarks we will consider. We have the full round-trip through a std::string tested in to_utf8_very_long_string and to_utf8_very_long_string_by_char, as well as a template function, to_utf8_block_very_long_string, for testing a set of block-based variations. The results on my machine look like this:
```
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
to_utf8_very_long_string                  56961 ns        52550 ns        13380
to_utf8_very_long_string_by_char        2036363 ns      1801273 ns          373
to_utf8_64_block_very_long_string        728050 ns       697545 ns         1120
to_utf8_256_block_very_long_string       694001 ns       655692 ns         1120
to_utf8_1024_block_very_long_string      714926 ns       655692 ns         1120
to_utf8_4096_block_very_long_string      718333 ns       655692 ns         1120
to_utf8_16384_block_very_long_string     997483 ns      1024933 ns          747
```
This tells us that if char-by-char access is what we need, the block-based algorithm beats the simpler round-trip-then-iterate approach by roughly 3x (block sizes from about 256 to 4096 all perform similarly well). But if you just need to output the whole string as-is, the full round-trip conversion is still more than an order of magnitude faster (maybe there is some vectorization going on?). In any case, we have shown yet again the triumph of measurement over assumption!