An Algorithm for Converting Strings to C++17 String Literals


I’m currently working on a pull request for Descent 3. One of Descent 3’s maintainers has asked me to change part of that pull request. As part of implementing their requested change, I need to write code that takes a string and converts it into a properly escaped C++ string literal.

Doing so is surprisingly challenging.

What Characters Can Be in the Input String?

Our code has to start with a variable that contains some text that is going to get converted. I’m going to call that variable the input string. What characters can the input string possibly contain? Well, in my specific situation, the input string is going to be a path. The first thing we need to do is figure out what characters could possibly be in a valid path.

What Is and Is Not Allowed in Paths?

Restrictions That Linux® Places on Paths

Linux places basically no restrictions on the format of paths. Paths on Linux must end with a null terminator and must not contain any other null bytes. Other than those two rules, paths on Linux can contain pretty much any combination of bytes. Paths on Linux can use any character encoding or no character encoding at all.

Restrictions That Windows Places on Paths

Windows places more restrictions on paths. As far as the Windows NT kernel is concerned, files and directories are just some of the different types of objects that are handled by the Object Manager. If an object has a name, then it has a path. Paths to objects are called NT paths. Here are some examples of NT paths:

  • \Device\HarddiskVolume4\Program Files\Git\git-cmd.exe
  • \Device\HarddiskVolume6\Games
  • \REGISTRY
  • \Driver\HTTP
  • \ObjectTypes\Job

The Windows NT kernel stores NT paths using the UNICODE_STRING data type. Every UNICODE_STRING is supposed to be encoded using UTF-16, but as far as I know, the Windows NT kernel does not enforce this rule. This means that it’s technically possible to create a file object that has invalid UTF-16 in its NT path. For the purposes of this blog post, I’m going to say that an NT path is invalid if it doesn’t contain valid UTF-16. This means that a valid NT path can contain any Unicode® character.
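To make "invalid UTF-16" concrete: UTF-16 encodes characters outside the Basic Multilingual Plane as surrogate pairs, and a lone surrogate code unit on its own is not valid UTF-16. Here's a quick Python sketch of that distinction (Python is just a convenient tool here; it has nothing to do with the Windows kernel):

```python
# A lone high surrogate code unit (0xD800) is not valid UTF-16 by itself;
# it must be followed by a low surrogate (0xDC00-0xDFFF) to form a pair.
lone_surrogate = bytes([0x00, 0xD8])  # 0xD800 in little-endian byte order

try:
    lone_surrogate.decode("utf-16-le")
    is_valid_utf16 = True
except UnicodeDecodeError:
    is_valid_utf16 = False

# A proper surrogate pair decodes fine: 0xD83C 0xDF55 is U+1F355 (🍕).
pair = bytes([0x3C, 0xD8, 0x55, 0xDF])
decoded_pair = pair.decode("utf-16-le")
```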

If you’re used to C and C++, you might be surprised to hear that NT paths can contain any Unicode character. Most of the time, strings in C or C++ are null terminated. Null-terminated UTF-16 strings can’t contain the U+0000 null character. NT paths can contain U+0000 null because UNICODE_STRINGs are not necessarily null-terminated.

NT paths are not the only type of paths that exist on Windows. Win32 paths are an alternative to NT paths that allow developers and end users to write paths in a more familiar and more backwards compatible manner. Here are some examples of Win32 paths:

  • C:\Program Files\Git\git-cmd.exe
  • D:\Games\
  • Example\relative\path\

Valid NT paths can contain any Unicode character, but what about valid Win32 paths? What characters can they contain?

Win32 paths must be converted to NT paths before the kernel can actually use them. Most applications (including Descent 3) only ever deal with Win32 paths and allow Windows APIs to automatically do the conversion for them. The rules for converting a Win32 path to an NT path are kind of complicated, but if you begin a Win32 path with \\?\, then most of those rules are disabled. If you begin a Win32 path with \\?\, then the only rule is that the first 4 characters of the Win32 path are replaced with \??\. Any other characters in the path are left alone. This means that Win32 paths can contain all the same characters that NT paths can contain. Win32 paths can contain any Unicode character. That being said, it’s unlikely that you would actually be able to open a file that has U+0000 null characters in its name using a Win32 path. Most of the time, Win32 paths are stored as null-terminated strings, and null-terminated strings can’t typically contain U+0000 null characters.
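That one rewrite rule for \\?\-prefixed paths can be sketched in a couple of lines (a toy illustration of the rule, not how Windows actually implements the conversion):

```python
def win32_verbatim_to_nt(win32_path: str) -> str:
    # The only rule for a \\?\-prefixed Win32 path: the first four
    # characters become \??\ and every other character is left alone.
    assert win32_path.startswith("\\\\?\\")
    return "\\??\\" + win32_path[4:]

nt_path = win32_verbatim_to_nt("\\\\?\\C:\\Program Files\\Git\\git-cmd.exe")
```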

Restrictions That Descent 3 Places on Paths

If Descent 3 were as perfect as it could possibly be, then it would support paths that contain any arbitrary sequence of bytes, including null bytes (for the sake of weirdly named files on Windows). Descent 3 is not 100% perfect, though. Descent 3 often (always?) stores paths as std::filesystem::path objects. Descent 3 also often converts std::filesystem::path objects to UTF-8 encoded C strings by doing this:

example_path.u8string().c_str()

This puts two additional restrictions on paths:

  1. It must be possible to encode the path as valid UTF-8, or else u8string() might fail.
  2. It can’t contain a U+0000 null character because c_str() returns a null-terminated string and null-terminated UTF-8 strings can’t contain U+0000 null characters.
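Both restrictions are easy to demonstrate from Python (a rough analogy for what the C++ code does, assuming u8string() behaves like UTF-8 encoding and c_str() behaves like reading up to the first null byte):

```python
# Restriction 1: not every sequence of code points can be encoded as UTF-8.
# A lone surrogate such as U+D800 has no valid UTF-8 encoding.
try:
    "\ud800".encode("utf_8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Restriction 2: a null-terminated C string effectively ends at the first
# null byte, so everything after an embedded U+0000 would be silently lost.
path_with_null = "Games\x00secret"
as_a_c_string_would_see_it = path_with_null.encode("utf_8").split(b"\x00")[0]
```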

What This Means for Our Input String

Now, back to the original question: What characters can be in the input string? When we take a look at all of the previously mentioned limitations, we can combine them into two simple rules:

  1. We must be able to store the string as valid UTF-8 or UTF-16, so the string must be a sequence of valid Unicode characters.
  2. The string can’t contain a U+0000 null character.

Dangerous Characters

Now that we know what characters our input string can contain, we have another problem. Some characters need to be escaped. For example, we can’t just write a string literal like this:

"So-called "hot" potatoes"

We have to escape the quotation marks like this:

"So-called \"hot\" potatoes"

I call characters like quotation marks dangerous characters because they might cause problems if they aren’t properly escaped. I tried to find a list of dangerous characters, but I wasn’t able to. I could find lists of characters that could be escaped, but I couldn’t find a list of characters that have to be escaped. This means that we’re going to have to dive into the C++ standard and figure it out ourselves. Specifically, I’m going to take a look at C++17 because Descent 3 uses that version of the standard at the moment.

End-of-Line Indicators

The C++ standard describes how C++ programs are compiled. It breaks up the process of compiling C++ programs into 9 “translation phases”. During the first translation phase, this happens:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (5.3) is replaced by the universal-character-name that designates that character.

— ISO/IEC 14882:2017, section 5.2

Terminology

There are a few terms in that quote that need explaining. The first one is “physical source file characters”. Descent 3 uses UTF-8 for its source files, so “physical source file characters” means “Unicode characters” in this situation. The second one is “basic source character set”. Section 5.3 explains that the basic source character set is just a group of 96 Unicode characters. & is an example of a character that’s in the basic source character set. 🍕 is an example of a character that is not in the basic source character set. Characters like 🍕 get replaced with a universal-character-name. The universal-character-name for 🍕 is \U0001F355.
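Computing a universal-character-name is mechanical: it’s \U followed by the character’s code point as eight uppercase hexadecimal digits. A quick Python sketch:

```python
def universal_character_name(character: str) -> str:
    # \U followed by the Unicode code point, zero-padded to 8 hex digits.
    return f"\\U{ord(character):08X}"

pizza_ucn = universal_character_name("🍕")  # U+1F355
```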

The next term that we need to look at is “new-line character”. On its face, that term sounds vague because there are multiple different Unicode characters that can create new lines. The standard often talks about “the new-line character”, which suggests that (as far as C++ is concerned) there is exactly one new-line character, but the standard never directly says which character is the new-line character. The C++17 standard does mention that some terms are defined in other standards:

For the purposes of this document, the terms and definitions given in ISO/IEC 2382-1:1993, the terms, definitions, and symbols given in ISO 80000-2:2009, and the following apply.

— ISO/IEC 14882:2017, section 3

I checked both ISO/IEC 2382-1:1993 and ISO 80000-2:2009. Neither of them has a definition for “the new-line character”. That being said, part of the C++17 standard does hint at what the new-line character actually is:

Table 8 — Escape sequences

  new-line        NL(LF)  \n
  horizontal tab  HT      \t
  vertical tab    VT      \v

— ISO/IEC 14882:2017, section 5.13.3

Based on the “(LF)” in that table, I think that it’s safe to say that “the new-line character” is U+000A line feed. I suppose it’s possible that the new-line character is something else in certain implementations, but as far as I can tell, it will always be exactly one character.

The last term from section 5.2 of the C++17 standard that we need to understand is “end-of-line indicator”. Unfortunately, there isn’t a definition for “end-of-line indicator” in the C++17 standard, ISO/IEC 2382-1:1993 or ISO 80000-2:2009. I also wasn’t able to find a definition in any of those standards for the related terms “end-of-line” and “line”. This means that each implementation can have a different definition for “end-of-line indicator”.

There is a little bit of hope here because Descent 3 uses UTF-8 for its source code. The Unicode Core Specification has a whole section called “Newline Guidelines”. Here’s a relevant quote from that section:

Converting to Other Character Code Sets.

R3 If the intended target is known, map NLF, LS, and PS depending on the target conventions.

For example, when mapping to Microsoft Word’s internal conventions for documents, LS would be mapped to VT, and PS and any NLF would be mapped to CRLF.

— The Unicode Standard Version 16.0 – Core Specification, section 5.8.3

In our situation, UTF-8 text is being converted to text in C++17’s basic source character set. The convention for C++17’s basic source character set is to always use “the new-line character”. This means that implementations should replace LS characters and NLFs with the new-line character (I’m not sure how PS characters should be handled). This raises a question, though. What about other characters? Is there a chance that something other than the LS character, the PS character or NLFs would get converted into “the new-line character”?

Unfortunately, the answer is yes. Every Unicode character has a “General Category”. One of the General Categories is called Line_Separator, or Zl for short. It would be reasonable for an implementation to treat all Line_Separator characters as end-of-line indicators. Unfortunately, the list of Line_Separator characters is not fixed. Future versions of the Unicode Standard could add additional Line_Separator characters. Even if there’s never a new version of the Unicode Standard, the list of Line_Separator characters can still change over time. This is because of how private-use characters work. The Unicode Core Specification has this to say about private-use characters:

Properties. No private agreement can change which character codes are reserved for private use. However, many Unicode algorithms use the General_Category property or properties which are derived by reference to the General_Category property. Private agreements may override the General_Category or derivations based on it, except where overriding is expressly disallowed in the conformance statement for a specific algorithm. In other words, private agreements may define which private-use characters should be treated like spaces, digits, letters, punctuation, and so on, by all parties to those private agreements. In particular, when a private agreement overrides the General_Category of a private-use character from the default value of gc = Co to some other value such as gc = Lu or gc = Nd, such a change does not change its inherent identity as a private-use character, but merely specifies its intended behavior according to the private agreement.

— The Unicode Standard Version 16.0 – Core Specification, section 23.5

This means that a C++17 compiler could decide that a certain set of private-use characters are Line_Separators. The compiler could then decide to interpret all Line_Separators as end-of-line indicators. In other words, the list of end-of-line indicators is implementation defined, even if you use UTF-8.
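Python’s unicodedata module can show the General_Category lookups being discussed (this is Unicode’s default property data, before any private agreement overrides it):

```python
import unicodedata

# U+2028 LINE SEPARATOR has gc = Zl (Line_Separator) in the default
# Unicode character data.
ls_category = unicodedata.category("\u2028")

# Private-use characters default to gc = Co; a private agreement could
# decide to treat one as a Line_Separator, but the default data doesn't.
private_use_category = unicodedata.category("\ue000")
```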

Working Around the Fact That End-of-Line Indicators Are Dangerous

Now that we understand the terminology, let’s take a look at that section 5.2 quote again:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (5.3) is replaced by the universal-character-name that designates that character.

— ISO/IEC 14882:2017, section 5.2

That quote means that end-of-line indicators are dangerous. Consider this sequence of Unicode characters:

  1. A
  2. U+000D carriage return
  3. B
  4. U+000A line feed
  5. C
  6. U+000D carriage return
  7. U+000A line feed
  8. D
  9. U+0085 next line
  10. E
  11. U+2028 LINE SEPARATOR
  12. F

It would be reasonable for a C++17 implementation to turn that sequence of Unicode characters into this sequence of basic source characters:

  1. A
  2. The new-line character (whatever it actually is)
  3. B
  4. The new-line character
  5. C
  6. The new-line character
  7. D
  8. The new-line character
  9. E
  10. The new-line character
  11. F

This transformation causes us to lose information. I want to say that all end-of-line indicators are dangerous because of this one-way transformation, but there’s no way of knowing which characters or sequences of characters are end-of-line indicators. What can we do in this situation?
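Python’s str.splitlines() happens to treat a similar set of characters as line boundaries, so it makes a handy demonstration of how one-way this transformation is (Python’s set of line boundaries is not necessarily the same as any particular C++ implementation’s, of course):

```python
# The sequence from above: CR, LF, CRLF, NEL and LINE SEPARATOR all act
# as end-of-line indicators.
original = "A\rB\nC\r\nD\x85E\u2028F"

# Mapping every end-of-line indicator to a single new-line character...
mapped = "\n".join(original.splitlines())

# ...is lossy: there's no way to recover which indicator each "\n" was.
round_trippable = (mapped == original)
```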

Luckily, the section 5.2 quote gives us an escape hatch. It says that characters are mapped if necessary. If we only ever use characters in the basic source character set, then no mapping will be necessary. If no mapping is necessary, then no mapping will happen, and we’ll avoid this one-way transformation. This means that all characters that aren’t in the basic source character set are dangerous and should be escaped.

Characters That Can’t Be in S-Char-Sequences

Section 5.13.5 of the C++17 standard describes string literals. Here’s the relevant syntax notation:

5.13.5 String literals

  • string-literal:

    • encoding-prefix_opt " s-char-sequence_opt "
    • encoding-prefix_opt R raw-string
  • s-char-sequence:

    • s-char
    • s-char-sequence s-char
  • s-char:

    • any member of the source character set except
      the double-quote ", backslash \, or new-line character
    • escape-sequence
    • universal-character-name
— ISO/IEC 14882:2017, section 5.13.5

This means that the double-quote, backslash and new-line characters are all dangerous. They should all be escaped.

Trigraphs

Question marks can be used to create trigraph sequences. For example, if a C++ compiler encounters ??', then it might replace those three characters with ^. Trigraphs were partially removed in C++17:

Change: Removal of trigraph support as a required feature.

Rationale: Prevents accidental uses of trigraphs in non-raw string literals and comments.

Effect on original feature: Valid C++ 2014 code that uses trigraphs may not be valid or may have different semantics in this International Standard. Implementations may choose to translate trigraphs as specified in C++ 2014 if they appear outside of a raw string literal, as part of the implementation-defined mapping from physical source file characters to the basic source character set.

— ISO/IEC 14882:2017, section C.4.1

This means that question marks are potentially dangerous. They should be escaped just in case code is compiled with trigraphs enabled.
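For reference, these are the nine trigraph sequences from C++14 and earlier. Escaping every question mark as \? (a valid escape sequence that always means a literal question mark) guarantees that none of them can form:

```python
# The nine trigraph sequences and their replacements (C++14 and earlier).
trigraphs = {
    "??=": "#", "??/": "\\", "??'": "^",
    "??(": "[", "??)": "]", "??!": "|",
    "??<": "{", "??>": "}", "??-": "~",
}

def escape_question_marks(text: str) -> str:
    # After this replacement, no two question marks are adjacent, so no
    # trigraph sequence can appear in the output.
    return text.replace("?", "\\?")

escaped = escape_question_marks("What??'")
```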

The Final List

Here’s the final list of dangerous characters that should be escaped:

  • any character that’s not in the basic source character set,
  • the double-quote character ("),
  • the backslash character (\),
  • the new-line character and
  • the question mark character (?).

Potential Algorithms

Now that we know all of the relevant constraints, we can actually start writing some code! I’m going to start by implementing these algorithms in the Python® programming language because it’s the language that I’m most familiar with. After that, I’ll see if it’s possible to port the code to the CMake Language. I need to find an algorithm that will work in the CMake Language because Descent 3 uses CMake as its build system.

Algorithm 1: Use Universal-Character-Names for Dangerous Characters

We can replace any dangerous character with a universal-character-name to prevent the dangerous character from causing problems. Here’s how we would do that in the Python programming language:

basic_source_character_set = {
    " ", "\t", "\v", "\f", "\n",

    "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p",
    "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",

    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
    "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",

    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",

    "_", "{", "}", "[", "]", "#", "(", ")", "<", ">", "%", ":", ";", ".", "?", "*",
    "+", "-", "/", "^", "&", "|", "~", "!", "=", ",", "\\", "\"", "'"
}
dangerous_characters_in_the_bscs = { "\"", "\\", "\n", "?" }
safe_character_set = basic_source_character_set - dangerous_characters_in_the_bscs

input_string = "/home/user/ディセント3"
# Input strings can contain Unicode characters, but the execution character set
# (and the execution wide-character set) may not support all Unicode
# characters. Using a prefix like u8 is wise here because it guarantees that
# our string will be capable of holding any Unicode character (other than
# U+0000 null, I guess).
output_string = "u8\""
for character in input_string:
    if character in safe_character_set:
        output_string += character
    else:
        output_string += f"\\U{ord(character):08X}"
output_string += "\""
print(output_string)

I really like this solution, but… it has an annoying problem. Unfortunately, I don’t think that it’s possible to port this algorithm to the CMake Language. In the Python programming language, you can use the ord() function to get the Unicode code point for a particular character. As far as I can tell, there’s no way to do that in the CMake Language. I haven’t been able to find a CMake equivalent of the ord() function.

Algorithm 2: Use \xhh for Each Byte in the String

In C++, narrow string literals are always supposed to be encoded using the execution character set. Compilers like GCC allow you to choose which execution character set to use when compiling your C++ program. Descent 3 is supposed to always use UTF-8 as its execution character set. We can take advantage of that fact:

input_string = "/home/user/ディセント3"
input_string_as_bytes = input_string.encode(encoding="utf_8")
# For C++17, using the u8 prefix isn’t going to do anything here. In
# C++17, the u8 prefix affects how characters are turned into bytes, but we’re
# taking direct control over what the bytes are going to be. That being said,
# the u8 prefix will be a good thing in future versions of C++. In C++17, the
# type for both "example" and u8"example" is const char[N]. In C++20 and later,
# the type for "example" is const char[N] and the type for u8"example" is
# const char8_t[N].
# See <https://en.cppreference.com/w/cpp/language/string_literal>.
output_string = "u8\""
for byte in input_string_as_bytes:
    output_string += f"\\x{byte:02X}"
output_string += "\""
print(output_string)

Can this algorithm be ported to the CMake Language? The answer is: probably. In the Python programming language, you use the str data type when you want to store sequences of characters and you use the bytes data type when you want to store sequences of bytes. We wanted to turn a sequence of characters into a sequence of bytes, so we had to convert a str object into a bytes object. When we do the conversion, we explicitly set encoding to "utf_8", so we knew for a fact that the bytes would use the correct encoding.

Unfortunately, the CMake Language doesn’t make the same distinction. In the CMake Language, you use the STRING data type when you want to store sequences of characters and you use the same STRING data type when you want to store sequences of bytes. You can figure out the hexadecimal value for each of the bytes in a STRING using string(HEX…). Here’s the problem: if I create a STRING variable using a string literal, then what character encoding will that STRING use? If it always uses UTF-8, then string(HEX…) will always do the right thing here. If it sometimes uses something other than UTF-8, then string(HEX…) will sometimes do the wrong thing here.

I wasn’t able to find any documentation about the character encoding of string literals in the CMake Language, so I decided to ask about it on the CMake Discourse instance. I haven’t received a response yet. In the meantime, I’ve decided to do an experiment. I created this CMakeLists.txt file:

cmake_minimum_required(VERSION 3.29.5)
project(example_project)

set(EXAMPLE_VARIABLE "🅭" CACHE STRING "Example cache variable")

string(HEX "${EXAMPLE_VARIABLE}" EXAMPLE_VARIABLE_HEX)
message(STATUS "That previous string as hex: ${EXAMPLE_VARIABLE_HEX}")

If the CMake Language always uses UTF-8 for string literals, then this should be printed every time I use that CMakeLists.txt file:

-- That previous string as hex: f09f85ad
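As a sanity check, that expected hex can be computed independently in Python (assuming Python’s UTF-8 encoder and string(HEX …) agree on the bytes):

```python
# 🅭 is U+1F16D; its UTF-8 encoding should be the four bytes f0 9f 85 ad.
expected_hex = "🅭".encode("utf_8").hex()
```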

I tried using that CMakeLists.txt file in these situations:

  • NixOS 24.11 with an en_US.UTF-8 locale,
  • NixOS 24.11 with an en_US.ANSI_X3.4-1968 locale,
  • Windows 11 with an English (United States) locale and “Beta: Use Unicode UTF-8 for worldwide language support” turned off,
  • Windows 11 with an English (United States) locale and “Beta: Use Unicode UTF-8 for worldwide language support” turned on,
  • Windows 11 with a Japanese (Japan) locale and “Beta: Use Unicode UTF-8 for worldwide language support” turned off and finally
  • Windows 11 with a Japanese (Japan) locale and “Beta: Use Unicode UTF-8 for worldwide language support” turned on.

In all of those situations, the expected message was printed, which seems to suggest that CMake STRING literals always use UTF-8. I wish that there were some sort of guarantee that this would be true, but I haven’t found one yet.

This means that algorithm 2 can be ported to the CMake Language, but we can’t 100% guarantee that it will work properly:

cmake_minimum_required(VERSION 3.29.5)
project(example_project)

set(INPUT_STRING "/home/user/ディセント3" CACHE STRING "The STRING that will be converted into a C++ string literal")
set(OUTPUT_STRING "u8\"")

string(LENGTH "${INPUT_STRING}" INPUT_STRING_LENGTH)
math(EXPR LAST_INDEX "${INPUT_STRING_LENGTH} - 1")
foreach(I RANGE "${LAST_INDEX}")
	string(SUBSTRING "${INPUT_STRING}" "${I}" 1 CURRENT_BYTE)
	string(HEX "${CURRENT_BYTE}" CURRENT_BYTE_HEX)
	string(APPEND OUTPUT_STRING "\\x${CURRENT_BYTE_HEX}")
endforeach()
string(APPEND OUTPUT_STRING "\"")
message(STATUS "${OUTPUT_STRING}")

Conclusion

This whole situation sucks. I often try to make sure that my code does the right thing in all possible situations. The more I try to do that, the more I feel alienated. It seems like everyone else just settles for good enough. I don’t want things to be good enough. I want them to be the best that they can possibly be, but the way that C++17 and the CMake Language were designed consistently makes that difficult.