diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 09:22:09 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 09:22:09 +0000 |
commit | 43a97878ce14b72f0981164f87f2e35e14151312 (patch) | |
tree | 620249daf56c0258faa40cbdcf9cfba06de2a846 /xpcom/docs/stringguide.rst | |
parent | Initial commit. (diff) | |
download | firefox-upstream.tar.xz firefox-upstream.zip |
Adding upstream version 110.0.1.upstream/110.0.1upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'xpcom/docs/stringguide.rst')
-rw-r--r-- | xpcom/docs/stringguide.rst | 1110 |
1 files changed, 1110 insertions, 0 deletions
diff --git a/xpcom/docs/stringguide.rst b/xpcom/docs/stringguide.rst new file mode 100644 index 0000000000..97613276a4 --- /dev/null +++ b/xpcom/docs/stringguide.rst @@ -0,0 +1,1110 @@ +String Guide +============ + +Most of the Mozilla code uses a C++ class hierarchy to pass string data, +rather than using raw pointers. This guide documents the string classes which +are visible to code within the Mozilla codebase (code which is linked into +``libxul``). + +Introduction +------------ + +The string classes are a library of C++ classes which are used to manage +buffers of wide (16-bit) and narrow (8-bit) character strings. The headers +and implementation are in the `xpcom/string +<https://searchfox.org/mozilla-central/source/xpcom/string>`_ directory. All +strings are stored as a single contiguous buffer of characters. + +The 8-bit and 16-bit string classes have completely separate base classes, +but share the same APIs. As a result, you cannot assign a 8-bit string to a +16-bit string without some kind of conversion helper class or routine. For +the purpose of this document, we will refer to the 16-bit string classes in +class documentation. Every 16-bit class has an equivalent 8-bit class: + +===================== ====================== +Wide Narrow +===================== ====================== +``nsAString`` ``nsACString`` +``nsString`` ``nsCString`` +``nsAutoString`` ``nsAutoCString`` +``nsDependentString`` ``nsDependentCString`` +===================== ====================== + +The string classes distinguish, as part of the type hierarchy, between +strings that must have a null-terminator at the end of their buffer +(``ns[C]String``) and strings that are not required to have a null-terminator +(``nsA[C]String``). nsA[C]String is the base of the string classes (since it +imposes fewer requirements) and ``ns[C]String`` is a class derived from it. +Functions taking strings as parameters should generally take one of these +four types. + +In order to avoid unnecessary copying of string data (which can have +significant performance cost), the string classes support different ownership +models. All string classes support the following three ownership models +dynamically: + +* reference counted, copy-on-write, buffers (the default) + +* adopted buffers (a buffer that the string class owns, but is not reference + counted, because it came from somewhere else) + +* dependent buffers, that is, an underlying buffer that the string class does + not own, but that the caller that constructed the string guarantees will + outlive the string instance + +Auto strings will prefer reference counting an existing reference-counted +buffer over their stack buffer, but will otherwise use their stack buffer for +anything that will fit in it. + +There are a number of additional string classes: + + +* Classes which exist primarily as constructors for the other types, + particularly ``nsDependent[C]String`` and ``nsDependent[C]Substring``. These + types are really just convenient notation for constructing an + ``nsA[C]String`` with a non-default ownership mode; they should not be + thought of as different types. + +* ``nsLiteral[C]String`` which should rarely be constructed explicitly but + usually through the ``""_ns`` and ``u""_ns`` user-defined string literals. + ``nsLiteral[C]String`` is trivially constructible and destructible, and + therefore does not emit construction/destruction code when stored in static, + as opposed to the other string classes. + +The Major String Classes +------------------------ + +The list below describes the main base classes. Once you are familiar with +them, see the appendix describing What Class to Use When. + + +* **nsAString**/**nsACString**: the abstract base class for all strings. It + provides an API for assignment, individual character access, basic + manipulation of characters in the string, and string comparison. This class + corresponds to the XPIDL ``AString`` or ``ACString`` parameter types. + ``nsA[C]String`` is not necessarily null-terminated. + +* **nsString**/**nsCString**: builds on ``nsA[C]String`` by guaranteeing a + null-terminated storage. This allows for a method (``.get()``) to access the + underlying character buffer. + +The remainder of the string classes inherit from either ``nsA[C]String`` or +``ns[C]String``. Thus, every string class is compatible with ``nsA[C]String``. + +.. note:: + + In code which is generic over string width, ``nsA[C]String`` is sometimes + known as ``nsTSubstring<CharT>``. ``nsAString`` is a type alias for + ``nsTSubstring<char16_t>``, and ``nsACString`` is a type alias for + ``nsTSubstring<char>``. + +.. note:: + + The type ``nsLiteral[C]String`` technically does not inherit from + ``nsA[C]String``, but instead inherits from ``nsStringRepr<CharT>``. This + allows the type to not generate destructors when stored in static + storage. + + It can be implicitly coerced to ``const ns[C]String&`` (though can never + be accessed mutably) and generally acts as-if it was a subclass of + ``ns[C]String`` in most cases. + +Since every string derives from ``nsAString`` (or ``nsACString``), they all +share a simple API. Common read-only methods include: + +* ``.Length()`` - the number of code units (bytes for 8-bit string classes and ``char16_t`` for 16-bit string classes) in the string. +* ``.IsEmpty()`` - the fastest way of determining if the string has any value. Use this instead of testing ``string.Length() == 0`` +* ``.Equals(string)`` - ``true`` if the given string has the same value as the current string. Approximately the same as ``operator==``. + +Common methods that modify the string: + +* ``.Assign(string)`` - Assigns a new value to the string. Approximately the same as ``operator=``. +* ``.Append(string)`` - Appends a value to the string. +* ``.Insert(string, position)`` - Inserts the given string before the code unit at position. +* ``.Truncate(length)`` - shortens the string to the given length. + +More complete documentation can be found in the `Class Reference`_. + +As function parameters +~~~~~~~~~~~~~~~~~~~~~~ + +In general, use ``nsA[C]String`` references to pass strings across modules. For example: + +.. code-block:: c++ + + // when passing a string to a method, use const nsAString& + nsFoo::PrintString(const nsAString& str); + + // when getting a string from a method, use nsAString& + nsFoo::GetString(nsAString& result); + +The Concrete Classes - which classes to use when +------------------------------------------------ + +The concrete classes are for use in code that actually needs to store string +data. The most common uses of the concrete classes are as local variables, +and members in classes or structs. + +.. digraph:: concreteclasses + + node [shape=rectangle] + + "nsA[C]String" -> "ns[C]String"; + "ns[C]String" -> "nsDependent[C]String"; + "nsA[C]String" -> "nsDependent[C]Substring"; + "nsA[C]String" -> "ns[C]SubstringTuple"; + "ns[C]String" -> "nsAuto[C]StringN"; + "ns[C]String" -> "nsLiteral[C]String" [style=dashed]; + "nsAuto[C]StringN" -> "nsPromiseFlat[C]String"; + "nsAuto[C]StringN" -> "nsPrintfCString"; + +The following is a list of the most common concrete classes. Once you are +familiar with them, see the appendix describing What Class to Use When. + +* ``ns[C]String`` - a null-terminated string whose buffer is allocated on the + heap. Destroys its buffer when the string object goes away. + +* ``nsAuto[C]String`` - derived from ``nsString``, a string which owns a 64 + code unit buffer in the same storage space as the string itself. If a string + less than 64 code units is assigned to an ``nsAutoString``, then no extra + storage will be allocated. For larger strings, a new buffer is allocated on + the heap. + + If you want a number other than 64, use the templated types ``nsAutoStringN`` + / ``nsAutoCStringN``. (``nsAutoString`` and ``nsAutoCString`` are just + typedefs for ``nsAutoStringN<64>`` and ``nsAutoCStringN<64>``, respectively.) + +* ``nsDependent[C]String`` - derived from ``nsString``, this string does not + own its buffer. It is useful for converting a raw string pointer (``const + char16_t*`` or ``const char*``) into a class of type ``nsAString``. Note that + you must null-terminate buffers used by to ``nsDependentString``. If you + don't want to or can't null-terminate the buffer, use + ``nsDependentSubstring``. + +* ``nsPrintfCString`` - derived from ``nsCString``, this string behaves like an + ``nsAutoCString``. The constructor takes parameters which allows it to + construct a 8-bit string from a printf-style format string and parameter + list. + +There are also a number of concrete classes that are created as a side-effect +of helper routines, etc. You should avoid direct use of these classes. Let +the string library create the class for you. + +* ``ns[C]SubstringTuple`` - created via string concatenation +* ``nsDependent[C]Substring`` - created through ``Substring()`` +* ``nsPromiseFlat[C]String`` - created through ``PromiseFlatString()`` +* ``nsLiteral[C]String`` - created through the ``""_ns`` and ``u""_ns`` user-defined literals + +Of course, there are times when it is necessary to reference these string +classes in your code, but as a general rule they should be avoided. + +Iterators +--------- + +Because Mozilla strings are always a single buffer, iteration over the +characters in the string is done using raw pointers: + +.. code-block:: c++ + + /** + * Find whether there is a tab character in `data` + */ + bool HasTab(const nsAString& data) { + const char16_t* cur = data.BeginReading(); + const char16_t* end = data.EndReading(); + + for (; cur < end; ++cur) { + if (char16_t('\t') == *cur) { + return true; + } + } + return false; + } + +Note that ``end`` points to the character after the end of the string buffer. +It should never be dereferenced. + +Writing to a mutable string is also simple: + +.. code-block:: c++ + + /** + * Replace every tab character in `data` with a space. + */ + void ReplaceTabs(nsAString& data) { + char16_t* cur = data.BeginWriting(); + char16_t* end = data.EndWriting(); + + for (; cur < end; ++cur) { + if (char16_t('\t') == *cur) { + *cur = char16_t(' '); + } + } + } + +You may change the length of a string via ``SetLength()``. Note that +Iterators become invalid after changing the length of a string. If a string +buffer becomes smaller while writing it, use ``SetLength`` to inform the +string class of the new size: + +.. code-block:: c++ + + /** + * Remove every tab character from `data` + */ + void RemoveTabs(nsAString& data) { + int len = data.Length(); + char16_t* cur = data.BeginWriting(); + char16_t* end = data.EndWriting(); + + while (cur < end) { + if (char16_t('\t') == *cur) { + len -= 1; + end -= 1; + if (cur < end) + memmove(cur, cur + 1, (end - cur) * sizeof(char16_t)); + } else { + cur += 1; + } + } + + data.SetLength(len); + } + +Note that using ``BeginWriting()`` to make a string longer is not OK. +``BeginWriting()`` must not be used to write past the logical length of the +string indicated by ``EndWriting()`` or ``Length()``. Calling +``SetCapacity()`` before ``BeginWriting()`` does not affect what the previous +sentence says. To make the string longer, call ``SetLength()`` before +``BeginWriting()`` or use the ``BulkWrite()`` API described below. + +Bulk Write +---------- + +``BulkWrite()`` allows capacity-aware cache-friendly low-level writes to the +string's buffer. + +Capacity-aware means that the caller is made aware of how the +caller-requested buffer capacity was rounded up to mozjemalloc buckets. This +is useful when initially requesting best-case buffer size without yet knowing +the true size need. If the data that actually needs to be written is larger +than the best-case estimate but still fits within the rounded-up capacity, +there is no need to reallocate despite requesting the best-case capacity. + +Cache-friendly means that the zero terminator for C compatibility is written +after the new content of the string has been written, so the result is a +forward-only linear write access pattern instead of a non-linear +back-and-forth sequence resulting from using ``SetLength()`` followed by +``BeginWriting()``. + +Low-level means that writing via a raw pointer is possible as with +``BeginWriting()``. + +``BulkWrite()`` takes three arguments: The new capacity (which may be rounded +up), the number of code units at the beginning of the string to preserve +(typically the old logical length), and a boolean indicating whether +reallocating a smaller buffer is OK if the requested capacity would fit in a +buffer that's smaller than current one. It returns a ``mozilla::Result`` which +contains either a usable ``mozilla::BulkWriteHandle<T>`` (where ``T`` is the +string's ``char_type``) or an ``nsresult`` explaining why none can be had +(presumably OOM). + +The actual writes are performed through the returned +``mozilla::BulkWriteHandle<T>``. You must not access the string except via this +handle until you call ``Finish()`` on the handle in the success case or you let +the handle go out of scope without calling ``Finish()`` in the failure case, in +which case the destructor of the handle puts the string in a mostly harmless but +consistent state (containing a single REPLACEMENT CHARACTER if a capacity +greater than 0 was requested, or in the ``char`` case if the three-byte UTF-8 +representation of the REPLACEMENT CHARACTER doesn't fit, an ASCII SUBSTITUTE). + +``mozilla::BulkWriteHandle<T>`` autoconverts to a writable +``mozilla::Span<T>`` and also provides explicit access to itself as ``Span`` +(``AsSpan()``) or via component accessors named consistently with those on +``Span``: ``Elements()`` and ``Length()``. (The latter is not the logical +length of the string but the writable length of the buffer.) The buffer +exposed via these methods includes the prefix that you may have requested to +be preserved. It's up to you to skip past it so as to not overwrite it. + +If there's a need to request a different capacity before you are ready to +call ``Finish()``, you can call ``RestartBulkWrite()`` on the handle. It +takes three arguments that match the first three arguments of +``BulkWrite()``. It returns ``mozilla::Result<mozilla::Ok, nsresult>`` to +indicate success or OOM. Calling ``RestartBulkWrite()`` invalidates +previously-obtained span, raw pointer or length. + +Once you are done writing, call ``Finish()``. It takes two arguments: the new +logical length of the string (which must not exceed the capacity returned by +the ``Length()`` method of the handle) and a boolean indicating whether it's +OK to attempt to reallocate a smaller buffer in case a smaller mozjemalloc +bucket could accommodate the new logical length. + +Helper Classes and Functions +---------------------------- + +Converting Cocoa strings +~~~~~~~~~~~~~~~~~~~~~~~~ + +Use ``mozilla::CopyCocoaStringToXPCOMString()`` in +``mozilla/MacStringHelpers.h`` to convert Cocoa strings to XPCOM strings. + +Searching strings - looking for substrings, characters, etc. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``nsReadableUtils.h`` header provides helper methods for searching in runnables. + +.. code-block:: c++ + + bool FindInReadable(const nsAString& pattern, + nsAString::const_iterator start, nsAString::const_iterator end, + nsStringComparator& aComparator = nsDefaultStringComparator()); + +To use this, ``start`` and ``end`` should point to the beginning and end of a +string that you would like to search. If the search string is found, +``start`` and ``end`` will be adjusted to point to the beginning and end of +the found pattern. The return value is ``true`` or ``false``, indicating +whether or not the string was found. + +An example: + +.. code-block:: c++ + + const nsAString& str = GetSomeString(); + nsAString::const_iterator start, end; + + str.BeginReading(start); + str.EndReading(end); + + constexpr auto valuePrefix = u"value="_ns; + + if (FindInReadable(valuePrefix, start, end)) { + // end now points to the character after the pattern + valueStart = end; + } + +Checking for Memory Allocation failure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Like other types in Gecko, the string classes use infallible memory +allocation by default, so you do not need to check for success when +allocating/resizing "normal" strings. + +Most functions that modify strings (``Assign()``, ``SetLength()``, etc.) also +have an overload that takes a ``mozilla::fallible_t`` parameter. These +overloads return ``false`` instead of aborting if allocation fails. Use them +when creating/allocating strings which may be very large, and which the +program could recover from if the allocation fails. + +Substrings (string fragments) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is very simple to refer to a substring of an existing string without +actually allocating new space and copying the characters into that substring. +``Substring()`` is the preferred method to create a reference to such a +string. + +.. code-block:: c++ + + void ProcessString(const nsAString& str) { + const nsAString& firstFive = Substring(str, 0, 5); // from index 0, length 5 + // firstFive is now a string representing the first 5 characters + } + +Unicode Conversion +------------------ + +Strings can be stored in two basic formats: 8-bit code unit (byte/``char``) +strings, or 16-bit code unit (``char16_t``) strings. Any string class with a +capital "C" in the classname contains 8-bit bytes. These classes include +``nsCString``, ``nsDependentCString``, and so forth. Any string class without +the "C" contains 16-bit code units. + +A 8-bit string can be in one of many character encodings while a 16-bit +string is always in potentially-invalid UTF-16. (You can make a 16-bit string +guaranteed-valid UTF-16 by passing it to ``EnsureUTF16Validity()``.) The most +common encodings are: + + +* ASCII - 7-bit encoding for basic English-only strings. Each ASCII value + is stored in exactly one byte in the array with the most-significant 8th bit + set to zero. + +* `UCS2 <http://www.unicode.org/glossary/#UCS_2>`_ - 16-bit encoding for a + subset of Unicode, `BMP <http://www.unicode.org/glossary/#BMP>`_. The Unicode + value of a character stored in UCS2 is stored in exactly one 16-bit + ``char16_t`` in a string class. + +* `UTF-8 <http://www.faqs.org/rfcs/rfc3629.html>`_ - 8-bit encoding for + Unicode characters. Each Unicode characters is stored in up to 4 bytes in a + string class. UTF-8 is capable of representing the entire Unicode character + repertoire, and it efficiently maps to `UTF-32 + <http://www.unicode.org/glossary/#UTF_32>`_. (Gtk and Rust natively use + UTF-8.) + +* `UTF-16 <http://www.unicode.org/glossary/#UTF_16>`_ - 16-bit encoding for + Unicode storage, backwards compatible with UCS2. The Unicode value of a + character stored in UTF-16 may require one or two 16-bit ``char16_t`` in a + string class. The contents of ``nsAString`` always has to be regarded as in + this encoding instead of UCS2. UTF-16 is capable of representing the entire + Unicode character repertoire, and it efficiently maps to UTF-32. (Win32 W + APIs and Mac OS X natively use UTF-16.) + +* Latin1 - 8-bit encoding for the first 256 Unicode code points. Used for + HTTP headers and for size-optimized storage in text node and SpiderMonkey + strings. Latin1 converts to UTF-16 by zero-extending each byte to a 16-bit + code unit. Note that this kind of "Latin1" is not available for encoding + HTML, CSS, JS, etc. Specifying ``charset=latin1`` means the same as + ``charset=windows-1252``. Windows-1252 is a similar but different encoding + used for interchange. + +In addition, there exist multiple other (legacy) encodings. The Web-relevant +ones are defined in the `Encoding Standard <https://encoding.spec.whatwg.org/>`_. +Conversions from these encodings to +UTF-8 and UTF-16 are provided by `mozilla::Encoding +<https://searchfox.org/mozilla-central/source/intl/Encoding.h#109>`_. +Additionally, on Windows the are some rare cases (e.g. drag&drop) where it's +necessary to call a system API with data encoded in the Windows +locale-dependent legacy encoding instead of UTF-16. In those rare cases, use +``MultiByteToWideChar``/``WideCharToMultiByte`` from kernel32.dll. Do not use +``iconv`` on *nix. We only support UTF-8-encoded file paths on *nix, non-path +Gtk strings are always UTF-8 and Cocoa and Java strings are always UTF-16. + +When working with existing code, it is important to examine the current usage +of the strings that you are manipulating, to determine the correct conversion +mechanism. + +When writing new code, it can be confusing to know which storage class and +encoding is the most appropriate. There is no single answer to this question, +but the important points are: + + +* **Surprisingly many strings are very often just ASCII.** ASCII is a subset of + UTF-8 and is, therefore, efficient to represent as UTF-8. Representing ASCII + as UTF-16 bad both for memory usage and cache locality. + +* **Rust strongly prefers UTF-8.** If your C++ code is interacting with Rust + code, using UTF-8 in ``nsACString`` and merely validating it when converting + to Rust strings is more efficient than using ``nsAString`` on the C++ side. + +* **Networking code prefers 8-bit strings.** Networking code tends to use 8-bit + strings: either with UTF-8 or Latin1 (byte value is the Unicode scalar value) + semantics. + +* **JS and DOM prefer UTF-16.** Most Gecko code uses UTF-16 for compatibility + with JS strings and DOM string which are potentially-invalid UTF-16. However, + both DOM text nodes and JS strings store strings that only contain code points + below U+0100 as Latin1 (byte value is the Unicode scalar value). + +* **Windows and Cocoa use UTF-16.** Windows system APIs take UTF-16. Cocoa + ``NSString`` is UTF-16. + +* **Gtk uses UTF-8.** Gtk APIs take UTF-8 for non-file paths. In the Gecko + case, we support only UTF-8 file paths outside Windows, so all Gtk strings + are UTF-8 for our purposes though file paths received from Gtk may not be + valid UTF-8. + +To assist with ASCII, Latin1, UTF-8, and UTF-16 conversions, there are some +helper methods and classes. Some of these classes look like functions, +because they are most often used as temporary objects on the stack. + +Short zero-terminated ASCII strings +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you have a short zero-terminated string that you are certain is always +ASCII, use these special-case methods instead of the conversions described in +the later sections. + +* If you are assigning an ASCII literal to an ``nsACString``, use + ``AssignLiteral()``. +* If you are assigning a literal to an ``nsAString``, use ``AssignLiteral()`` + and make the literal a ``u""`` literal. If the literal has to be a ``""`` + literal (as opposed to ``u""``) and is ASCII, still use ``AppendLiteral()``, + but be aware that this involves a run-time inflation. +* If you are assigning a zero-terminated ASCII string that's not a literal from + the compiler's point of view at the call site and you don't know the length + of the string either (e.g. because it was looked up from an array of literals + of varying lengths), use ``AssignASCII()``. + +UTF-8 / UTF-16 conversion +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. cpp:function:: NS_ConvertUTF8toUTF16(const nsACString&) + + a ``nsAutoString`` subclass that converts a UTF-8 encoded ``nsACString`` + or ``const char*`` to a 16-bit UTF-16 string. If you need a ``const + char16_t*`` buffer, you can use the ``.get()`` method. For example: + + .. code-block:: c++ + + /* signature: void HandleUnicodeString(const nsAString& str); */ + object->HandleUnicodeString(NS_ConvertUTF8toUTF16(utf8String)); + + /* signature: void HandleUnicodeBuffer(const char16_t* str); */ + object->HandleUnicodeBuffer(NS_ConvertUTF8toUTF16(utf8String).get()); + +.. cpp:function:: NS_ConvertUTF16toUTF8(const nsAString&) + + a ``nsAutoCString`` which converts a 16-bit UTF-16 string (``nsAString``) + to a UTF-8 encoded string. As above, you can use ``.get()`` to access a + ``const char*`` buffer. + + .. code-block:: c++ + + /* signature: void HandleUTF8String(const nsACString& str); */ + object->HandleUTF8String(NS_ConvertUTF16toUTF8(utf16String)); + + /* signature: void HandleUTF8Buffer(const char* str); */ + object->HandleUTF8Buffer(NS_ConvertUTF16toUTF8(utf16String).get()); + +.. cpp:function:: CopyUTF8toUTF16(const nsACString&, nsAString&) + + converts and copies: + + .. code-block:: c++ + + // return a UTF-16 value + void Foo::GetUnicodeValue(nsAString& result) { + CopyUTF8toUTF16(mLocalUTF8Value, result); + } + +.. cpp:function:: AppendUTF8toUTF16(const nsACString&, nsAString&) + + converts and appends: + + .. code-block:: c++ + + // return a UTF-16 value + void Foo::GetUnicodeValue(nsAString& result) { + result.AssignLiteral("prefix:"); + AppendUTF8toUTF16(mLocalUTF8Value, result); + } + +.. cpp:function:: CopyUTF16toUTF8(const nsAString&, nsACString&) + + converts and copies: + + .. code-block:: c++ + + // return a UTF-8 value + void Foo::GetUTF8Value(nsACString& result) { + CopyUTF16toUTF8(mLocalUTF16Value, result); + } + +.. cpp:function:: AppendUTF16toUTF8(const nsAString&, nsACString&) + + converts and appends: + + .. code-block:: c++ + + // return a UTF-8 value + void Foo::GetUnicodeValue(nsACString& result) { + result.AssignLiteral("prefix:"); + AppendUTF16toUTF8(mLocalUTF16Value, result); + } + + +Latin1 / UTF-16 Conversion +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following should only be used when you can guarantee that the original +string is ASCII or Latin1 (in the sense that the byte value is the Unicode +scalar value; not in the windows-1252 sense). These helpers are very similar +to the UTF-8 / UTF-16 conversion helpers above. + + +UTF-16 to Latin1 converters +``````````````````````````` + +These converters are **very dangerous** because they **lose information** +during the conversion process. You should **avoid UTF-16 to Latin1 +conversions** unless your strings are guaranteed to be Latin1 or ASCII. (In +the future, these conversions may start asserting in debug builds that their +input is in the permissible range.) If the input is actually in the Latin1 +range, each 16-bit code unit in narrowed to an 8-bit byte by removing the +high half. Unicode code points above U+00FF result in garbage whose nature +must not be relied upon. (In the future the nature of the garbage will be CPU +architecture-dependent.) If you want to ``printf()`` something and don't care +what happens to non-ASCII, please convert to UTF-8 instead. + + +.. cpp:function:: NS_LossyConvertUTF16toASCII(const nsAString&) + + A ``nsAutoCString`` which holds a temporary buffer containing the Latin1 + value of the string. + +.. cpp:function:: void LossyCopyUTF16toASCII(Span<const char16_t>, nsACString&) + + Does an in-place conversion from UTF-16 into an Latin1 string object. + +.. cpp:function:: void LossyAppendUTF16toASCII(Span<const char16_t>, nsACString&) + + Appends a UTF-16 string to a Latin1 string. + +Latin1 to UTF-16 converters +``````````````````````````` + +These converters are very dangerous because they will **produce wrong results +for non-ASCII UTF-8 or windows-1252 input** into a meaningless UTF-16 string. +You should **avoid ASCII to UTF-16 conversions** unless your strings are +guaranteed to be ASCII or Latin1 in the sense of the byte value being the +Unicode scalar value. Every byte is zero-extended into a 16-bit code unit. + +It is correct to use these on most HTTP header values, but **it's always +wrong to use these on HTTP response bodies!** (Use ``mozilla::Encoding`` to +deal with response bodies.) + +.. cpp:function:: NS_ConvertASCIItoUTF16(const nsACString&) + + A ``nsAutoString`` which holds a temporary buffer containing the value of + the Latin1 to UTF-16 conversion. + +.. cpp:function:: void CopyASCIItoUTF16(Span<const char>, nsAString&) + + does an in-place conversion from Latin1 to UTF-16. + +.. cpp:function:: void AppendASCIItoUTF16(Span<const char>, nsAString&) + + appends a Latin1 string to a UTF-16 string. + +Comparing ns*Strings with C strings +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can compare ``ns*Strings`` with C strings by converting the ``ns*String`` +to a C string, or by comparing directly against a C String. + +.. cpp:function:: bool nsAString::EqualsASCII(const char*) + + Compares with an ASCII C string. + +.. cpp:function:: bool nsAString::EqualsLiteral(...) + + Compares with a string literal. + +Common Patterns +--------------- + +Literal Strings +~~~~~~~~~~~~~~~ + +A literal string is a raw string value that is written in some C++ code. For +example, in the statement ``printf("Hello World\n");`` the value ``"Hello +World\n"`` is a literal string. It is often necessary to insert literal +string values when an ``nsAString`` or ``nsACString`` is required. Two +user-defined literals are provided that implicitly convert to ``const +nsString&`` resp. ``const nsCString&``: + +* ``""_ns`` for 8-bit literals, converting implicitly to ``const nsCString&`` +* ``u""_ns`` for 16-bit literals, converting implicitly to ``const nsString&`` + +The benefits of the user-defined literals may seem unclear, given that +``nsDependentCString`` will also wrap a string value in an ``nsCString``. The +advantage of the user-defined literals is twofold. + +* The length of these strings is calculated at compile time, so the string does + not need to be scanned at runtime to determine its length. + +* Literal strings live for the lifetime of the binary, and can be moved between + the ``ns[C]String`` classes without being copied or freed. + +Here are some examples of proper usage of the literals (both standard and +user-defined): + +.. code-block:: c++ + + // call Init(const nsLiteralString&) - enforces that it's only called with literals + Init(u"start value"_ns); + + // call Init(const nsAString&) + Init(u"start value"_ns); + + // call Init(const nsACString&) + Init("start value"_ns); + +In case a literal is defined via a macro, you can just convert it to +``nsLiteralString`` or ``nsLiteralCString`` using their constructor. You +could consider not using a macro at all but a named ``constexpr`` constant +instead. + +In some cases, an 8-bit literal is defined via a macro, either within code or +from the environment, but it can't be changed or is used both as an 8-bit and +a 16-bit string. In these cases, you can use the +``NS_LITERAL_STRING_FROM_CSTRING`` macro to construct a ``nsLiteralString`` +and do the conversion at compile-time. + +String Concatenation +~~~~~~~~~~~~~~~~~~~~ + +Strings can be concatenated together using the + operator. The resulting +string is a ``const nsSubstringTuple`` object. The resulting object can be +treated and referenced similarly to a ``nsAString`` object. Concatenation *does +not copy the substrings*. The strings are only copied when the concatenation +is assigned into another string object. The ``nsSubstringTuple`` object holds +pointers to the original strings. Therefore, the ``nsSubstringTuple`` object is +dependent on all of its substrings, meaning that their lifetime must be at +least as long as the ``nsSubstringTuple`` object. + +For example, you can use the value of two strings and pass their +concatenation on to another function which takes an ``const nsAString&``: + +.. code-block:: c++ + + void HandleTwoStrings(const nsAString& one, const nsAString& two) { + // call HandleString(const nsAString&) + HandleString(one + two); + } + +NOTE: The two strings are implicitly combined into a temporary ``nsString`` +in this case, and the temporary string is passed into ``HandleString``. If +``HandleString`` assigns its input into another ``nsString``, then the string +buffer will be shared in this case negating the cost of the intermediate +temporary. You can concatenate N strings and store the result in a temporary +variable: + +.. code-block:: c++ + + constexpr auto start = u"start "_ns; + constexpr auto middle = u"middle "_ns; + constexpr auto end = u"end"_ns; + // create a string with 3 dependent fragments - no copying involved! + nsString combinedString = start + middle + end; + + // call void HandleString(const nsAString&); + HandleString(combinedString); + +It is safe to concatenate user-defined literals because the temporary +``nsLiteral[C]String`` objects will live as long as the temporary +concatenation object (of type ``nsSubstringTuple``). + +.. code-block:: c++ + + // call HandlePage(const nsAString&); + // safe because the concatenated-string will live as long as its substrings + HandlePage(u"start "_ns + u"end"_ns); + +Local Variables +~~~~~~~~~~~~~~~ + +Local variables within a function are usually stored on the stack. The +``nsAutoString``/``nsAutoCString`` classes are subclasses of the +``nsString``/``nsCString`` classes. They own a 64-character buffer allocated +in the same storage space as the string itself. If the ``nsAutoString`` is +allocated on the stack, then it has at its disposal a 64-character stack +buffer. This allows the implementation to avoid allocating extra memory when +dealing with small strings. ``nsAutoStringN``/``nsAutoCStringN`` are more +general alternatives that let you choose the number of characters in the +inline buffer. + +.. code-block:: c++ + + ... + nsAutoString value; + GetValue(value); // if the result is less than 64 code units, + // then this just saved us an allocation + ... + +Member Variables +~~~~~~~~~~~~~~~~ + +In general, you should use the concrete classes ``nsString`` and +``nsCString`` for member variables. + +.. code-block:: c++ + + class Foo { + ... + // these store UTF-8 and UTF-16 values respectively + nsCString mLocalName; + nsString mTitle; + }; + +A common incorrect pattern is to use ``nsAutoString``/``nsAutoCString`` +for member variables. As described in `Local Variables`_, these classes have +a built in buffer that make them very large. This means that if you include +them in a class, they bloat the class by 64 bytes (``nsAutoCString``) or 128 +bytes (``nsAutoString``). + + +Raw Character Pointers +~~~~~~~~~~~~~~~~~~~~~~ + +``PromiseFlatString()`` and ``PromiseFlatCString()`` can be used to create a +temporary buffer which holds a null-terminated buffer containing the same +value as the source string. ``PromiseFlatString()`` will create a temporary +buffer if necessary. This is most often used in order to pass an +``nsAString`` to an API which requires a null-terminated string. + +In the following example, an ``nsAString`` is combined with a literal string, +and the result is passed to an API which requires a simple character buffer. + +.. code-block:: c++ + + // Modify the URL and pass to AddPage(const char16_t* url) + void AddModifiedPage(const nsAString& url) { + constexpr auto httpPrefix = u"http://"_ns; + const nsAString& modifiedURL = httpPrefix + url; + + // creates a temporary buffer + AddPage(PromiseFlatString(modifiedURL).get()); + } + +``PromiseFlatString()`` is smart when handed a string that is already +null-terminated. It avoids creating the temporary buffer in such cases. + +.. code-block:: c++ + + // Modify the URL and pass to AddPage(const char16_t* url) + void AddModifiedPage(const nsAString& url, PRBool addPrefix) { + if (addPrefix) { + // MUST create a temporary buffer - string is multi-fragmented + constexpr auto httpPrefix = u"http://"_ns; + AddPage(PromiseFlatString(httpPrefix + modifiedURL)); + } else { + // MIGHT create a temporary buffer, does a runtime check + AddPage(PromiseFlatString(url).get()); + } + } + +.. note:: + + It is **not** possible to efficiently transfer ownership of a string + class' internal buffer into an owned ``char*`` which can be safely + freed by other components due to the COW optimization. + + If working with a legacy API which requires malloced ``char*`` buffers, + prefer using ``ToNewUnicode``, ``ToNewCString`` or ``ToNewUTF8String`` + over ``strdup`` to create owned ``char*`` pointers. + +``printf`` and a UTF-16 string +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For debugging, it's useful to ``printf`` a UTF-16 string (``nsString``, +``nsAutoString``, etc). To do this usually requires converting it to an 8-bit +string, because that's what ``printf`` expects. Use: + +.. code-block:: c++ + + printf("%s\n", NS_ConvertUTF16toUTF8(yourString).get()); + +Sequence of appends without reallocating +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``SetCapacity()`` allows you to give the string a hint of the future string +length caused by a sequence of appends (excluding appends that convert +between UTF-16 and UTF-8 in either direction) in order to avoid multiple +allocations during the sequence of appends. However, the other +allocation-avoidance features of XPCOM strings interact badly with +``SetCapacity()`` making it something of a footgun. + +``SetCapacity()`` is appropriate to use before a sequence of multiple +operations from the following list (without operations that are not on the +list between the ``SetCapacity()`` call and operations from the list): + +* ``Append()`` +* ``AppendASCII()`` +* ``AppendLiteral()`` +* ``AppendPrintf()`` +* ``AppendInt()`` +* ``AppendFloat()`` +* ``LossyAppendUTF16toASCII()`` +* ``AppendASCIItoUTF16()`` + +**DO NOT** call ``SetCapacity()`` if the subsequent operations on the string +do not meet the criteria above. Operations that undo the benefits of +``SetCapacity()`` include but are not limited to: + +* ``SetLength()`` +* ``Truncate()`` +* ``Assign()`` +* ``AssignLiteral()`` +* ``Adopt()`` +* ``CopyASCIItoUTF16()`` +* ``LossyCopyUTF16toASCII()`` +* ``AppendUTF16toUTF8()`` +* ``AppendUTF8toUTF16()`` +* ``CopyUTF16toUTF8()`` +* ``CopyUTF8toUTF16()`` + +If your string is an ``nsAuto[C]String`` and you are calling +``SetCapacity()`` with a constant ``N``, please instead declare the string as +``nsAuto[C]StringN<N+1>`` without calling ``SetCapacity()`` (while being +mindful of not using such a large ``N`` as to overflow the run-time stack). + +There is no need to include room for the null terminator: it is the job of +the string class. + +Note: Calling ``SetCapacity()`` does not give you permission to use the +pointer obtained from ``BeginWriting()`` to write past the current length (as +returned by ``Length()``) of the string. Please use either ``BulkWrite()`` or +``SetLength()`` instead. + +.. _stringguide.xpidl: + +XPIDL +----- + +The string library is also available through IDL. By declaring attributes and +methods using the specially defined IDL types, string classes are used as +parameters to the corresponding methods. + +XPIDL String types +~~~~~~~~~~~~~~~~~~ + +The C++ signatures follow the abstract-type convention described above, such +that all method parameters are based on the abstract classes. The following +table describes the purpose of each string type in IDL. + ++-----------------+----------------+----------------------------------------------------------------------------------+ +| XPIDL Type | C++ Type | Purpose | ++=================+================+==================================================================================+ +| ``string`` | ``char*`` | Raw character pointer to ASCII (7-bit) string, no string classes used. | +| | | | +| | | High bit is not guaranteed across XPConnect boundaries. | ++-----------------+----------------+----------------------------------------------------------------------------------+ +| ``wstring`` | ``char16_t*`` | Raw character pointer to UTF-16 string, no string classes used. | ++-----------------+----------------+----------------------------------------------------------------------------------+ +| ``AString`` | ``nsAString`` | UTF-16 string. | ++-----------------+----------------+----------------------------------------------------------------------------------+ +| ``ACString`` | ``nsACString`` | 8-bit string. All bits are preserved across XPConnect boundaries. | ++-----------------+----------------+----------------------------------------------------------------------------------+ +| ``AUTF8String`` | ``nsACString`` | UTF-8 string. | +| | | | +| | | Converted to UTF-16 as necessary when value is used across XPConnect boundaries. | ++-----------------+----------------+----------------------------------------------------------------------------------+ + +Callers should prefer using the string classes ``AString``, ``ACString`` and +``AUTF8String`` over the raw pointer types ``string`` and ``wstring`` in +almost all situations. + +C++ Signatures +~~~~~~~~~~~~~~ + +In XPIDL, ``in`` parameters are read-only, and the C++ signatures for +``*String`` parameters follows the above guidelines by using ``const +nsAString&`` for these parameters. ``out`` and ``inout`` parameters are +defined simply as ``nsAString&`` so that the callee can write to them. + +.. code-block:: + + interface nsIFoo : nsISupports { + attribute AString utf16String; + AUTF8String getValue(in ACString key); + }; + +.. code-block:: c++ + + class nsIFoo : public nsISupports { + NS_IMETHOD GetUtf16String(nsAString& aResult) = 0; + NS_IMETHOD SetUtf16String(const nsAString& aValue) = 0; + NS_IMETHOD GetValue(const nsACString& aKey, nsACString& aResult) = 0; + }; + +In the above example, ``utf16String`` is treated as a UTF-16 string. The +implementation of ``GetUtf16String()`` will use ``aResult.Assign`` to +"return" the value. In ``SetUtf16String()`` the value of the string can be +used through a variety of methods including `Iterators`_, +``PromiseFlatString``, and assignment to other strings. + +In ``GetValue()``, the first parameter, ``aKey``, is treated as a raw +sequence of 8-bit values. Any non-ASCII characters in ``aKey`` will be +preserved when crossing XPConnect boundaries. The implementation of +``GetValue()`` will assign a UTF-8 encoded 8-bit string into ``aResult``. If +the this method is called across XPConnect boundaries, such as from a script, +then the result will be decoded from UTF-8 into UTF-16 and used as a Unicode +value. + +String Guidelines +----------------- + +Follow these simple rules in your code to keep your fellow developers, +reviewers, and users happy. + +* Use the most abstract string class that you can. Usually this is: + * ``nsAString`` for function parameters + * ``nsString`` for member variables + * ``nsAutoString`` for local (stack-based) variables +* Use the ``""_ns`` and ``u""_ns`` user-defined literals to represent literal strings (e.g. ``"foo"_ns``) as nsAString-compatible objects. +* Use string concatenation (i.e. the "+" operator) when combining strings. +* Use ``nsDependentString`` when you have a raw character pointer that you need to convert to an nsAString-compatible string. +* Use ``Substring()`` to extract fragments of existing strings. +* Use `iterators`_ to parse and extract string fragments. + +Class Reference +--------------- + +.. cpp:class:: template<T> nsTSubstring<T> + + .. note:: + + The ``nsTSubstring<char_type>`` class is usually written as + ``nsAString`` or ``nsACString``. + + .. cpp:function:: size_type Length() const + + .. cpp:function:: bool IsEmpty() const + + .. cpp:function:: bool IsVoid() const + + .. cpp:function:: const char_type* BeginReading() const + + .. cpp:function:: const char_type* EndReading() const + + .. cpp:function:: bool Equals(const self_type&, comparator_type = ...) const + + .. cpp:function:: char_type First() const + + .. cpp:function:: char_type Last() const + + .. cpp:function:: size_type CountChar(char_type) const + + .. cpp:function:: int32_t FindChar(char_type, index_type aOffset = 0) const + + .. cpp:function:: void Assign(const self_type&) + + .. cpp:function:: void Append(const self_type&) + + .. cpp:function:: void Insert(const self_type&, index_type aPos) + + .. cpp:function:: void Cut(index_type aCutStart, size_type aCutLength) + + .. cpp:function:: void Replace(index_type aCutStart, size_type aCutLength, const self_type& aStr) + + .. cpp:function:: void Truncate(size_type aLength) + + .. cpp:function:: void SetIsVoid(bool) + + Make it null. XPConnect and WebIDL will convert void nsAStrings to + JavaScript ``null``. + + .. cpp:function:: char_type* BeginWriting() + + .. cpp:function:: char_type* EndWriting() + + .. cpp:function:: void SetCapacity(size_type) + + Inform the string about buffer size need before a sequence of calls + to ``Append()`` or converting appends that convert between UTF-16 and + Latin1 in either direction. (Don't use if you use appends that + convert between UTF-16 and UTF-8 in either direction.) Calling this + method does not give you permission to use ``BeginWriting()`` to + write past the logical length of the string. Use ``SetLength()`` or + ``BulkWrite()`` as appropriate. + + .. cpp:function:: void SetLength(size_type) + + .. cpp:function:: Result<BulkWriteHandle<char_type>, nsresult> BulkWrite(size_type aCapacity, size_type aPrefixToPreserve, bool aAllowShrinking) + + +Original Document Information +----------------------------- + +This document was originally hosted on MDN as part of the XPCOM guide. + +* Author: `Alec Flett <mailto:alecf@flett.org>`_ +* Copyright Information: Portions of this content are © 1998–2007 by individual mozilla.org contributors; content available under a Creative Commons license. +* Thanks to David Baron for `actual docs <http://dbaron.org/mozilla/coding-practices>`_, +* Peter Annema for lots of direction +* Myk Melez for some more docs +* David Bradley for a diagram +* Revised by Darin Fisher for Mozilla 1.7 +* Revised by Jungshik Shin to clarify character encoding issues +* Migrated to in-tree documentation by Nika Layzell |