summaryrefslogtreecommitdiffstats
path: root/src/doc/book/src/ch08-02-strings.md
diff options
context:
space:
mode:
Diffstat (limited to 'src/doc/book/src/ch08-02-strings.md')
-rw-r--r--src/doc/book/src/ch08-02-strings.md412
1 files changed, 412 insertions, 0 deletions
diff --git a/src/doc/book/src/ch08-02-strings.md b/src/doc/book/src/ch08-02-strings.md
new file mode 100644
index 000000000..bf79c3925
--- /dev/null
+++ b/src/doc/book/src/ch08-02-strings.md
@@ -0,0 +1,412 @@
+## Storing UTF-8 Encoded Text with Strings
+
+We talked about strings in Chapter 4, but we’ll look at them in more depth now.
+New Rustaceans commonly get stuck on strings for a combination of three
+reasons: Rust’s propensity for exposing possible errors, strings being a more
+complicated data structure than many programmers give them credit for, and
+UTF-8. These factors combine in a way that can seem difficult when you’re
+coming from other programming languages.
+
+We discuss strings in the context of collections because strings are
+implemented as a collection of bytes, plus some methods to provide useful
+functionality when those bytes are interpreted as text. In this section, we’ll
+talk about the operations on `String` that every collection type has, such as
+creating, updating, and reading. We’ll also discuss the ways in which `String`
+is different from the other collections, namely how indexing into a `String` is
+complicated by the differences between how people and computers interpret
+`String` data.
+
+### What Is a String?
+
+We’ll first define what we mean by the term *string*. Rust has only one string
+type in the core language, which is the string slice `str` that is usually seen
+in its borrowed form `&str`. In Chapter 4, we talked about *string slices*,
+which are references to some UTF-8 encoded string data stored elsewhere. String
+literals, for example, are stored in the program’s binary and are therefore
+string slices.
+
+The `String` type, which is provided by Rust’s standard library rather than
+coded into the core language, is a growable, mutable, owned, UTF-8 encoded
+string type. When Rustaceans refer to “strings” in Rust, they might be
+referring to either the `String` or the string slice `&str` types, not just one
+of those types. Although this section is largely about `String`, both types are
+used heavily in Rust’s standard library, and both `String` and string slices
+are UTF-8 encoded.
+
+### Creating a New String
+
+Many of the same operations available with `Vec<T>` are available with `String`
+as well, because `String` is actually implemented as a wrapper around a vector
+of bytes with some extra guarantees, restrictions, and capabilities. An example
+of a function that works the same way with `Vec<T>` and `String` is the `new`
+function to create an instance, shown in Listing 8-11.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-11/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-11: Creating a new, empty `String`</span>
+
+This line creates a new empty string called `s`, which we can then load data
+into. Often, we’ll have some initial data that we want to start the string
+with. For that, we use the `to_string` method, which is available on any type
+that implements the `Display` trait, as string literals do. Listing 8-12 shows
+two examples.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-12/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-12: Using the `to_string` method to create a
+`String` from a string literal</span>
+
+This code creates a string containing `initial contents`.
+
+We can also use the function `String::from` to create a `String` from a string
+literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12
+that uses `to_string`.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-13/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-13: Using the `String::from` function to create
+a `String` from a string literal</span>
+
+Because strings are used for so many things, we can use many different generic
+APIs for strings, providing us with a lot of options. Some of them can seem
+redundant, but they all have their place! In this case, `String::from` and
+`to_string` do the same thing, so which you choose is a matter of style and
+readability.
+
+Remember that strings are UTF-8 encoded, so we can include any properly encoded
+data in them, as shown in Listing 8-14.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-14: Storing greetings in different languages in
+strings</span>
+
+All of these are valid `String` values.
+
+### Updating a String
+
+A `String` can grow in size and its contents can change, just like the contents
+of a `Vec<T>`, if you push more data into it. In addition, you can conveniently
+use the `+` operator or the `format!` macro to concatenate `String` values.
+
+#### Appending to a String with `push_str` and `push`
+
+We can grow a `String` by using the `push_str` method to append a string slice,
+as shown in Listing 8-15.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-15/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-15: Appending a string slice to a `String`
+using the `push_str` method</span>
+
+After these two lines, `s` will contain `foobar`. The `push_str` method takes a
+string slice because we don’t necessarily want to take ownership of the
+parameter. For example, in the code in Listing 8-16, we want to be able to use
+`s2` after appending its contents to `s1`.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-16/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-16: Using a string slice after appending its
+contents to a `String`</span>
+
+If the `push_str` method took ownership of `s2`, we wouldn’t be able to print
+its value on the last line. However, this code works as we’d expect!
+
+The `push` method takes a single character as a parameter and adds it to the
+`String`. Listing 8-17 adds the letter “l” to a `String` using the `push`
+method.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-17/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-17: Adding one character to a `String` value
+using `push`</span>
+
+As a result, `s` will contain `lol`.
+
+#### Concatenation with the `+` Operator or the `format!` Macro
+
+Often, you’ll want to combine two existing strings. One way to do so is to use
+the `+` operator, as shown in Listing 8-18.
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-18/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-18: Using the `+` operator to combine two
+`String` values into a new `String` value</span>
+
+The string `s3` will contain `Hello, world!`. The reason `s1` is no longer
+valid after the addition, and the reason we used a reference to `s2`, has to do
+with the signature of the method that’s called when we use the `+` operator.
+The `+` operator uses the `add` method, whose signature looks something like
+this:
+
+```rust,ignore
+fn add(self, s: &str) -> String {
+```
+
+In the standard library, you'll see `add` defined using generics and associated
+types. Here, we’ve substituted in concrete types, which is what happens when we
+call this method with `String` values. We’ll discuss generics in Chapter 10.
+This signature gives us the clues we need to understand the tricky bits of the
+`+` operator.
+
+First, `s2` has an `&`, meaning that we’re adding a *reference* of the second
+string to the first string. This is because of the `s` parameter in the `add`
+function: we can only add a `&str` to a `String`; we can’t add two `String`
+values together. But wait—the type of `&s2` is `&String`, not `&str`, as
+specified in the second parameter to `add`. So why does Listing 8-18 compile?
+
+The reason we’re able to use `&s2` in the call to `add` is that the compiler
+can *coerce* the `&String` argument into a `&str`. When we call the `add`
+method, Rust uses a *deref coercion*, which here turns `&s2` into `&s2[..]`.
+We’ll discuss deref coercion in more depth in Chapter 15. Because `add` does
+not take ownership of the `s` parameter, `s2` will still be a valid `String`
+after this operation.
+
+Second, we can see in the signature that `add` takes ownership of `self`,
+because `self` does *not* have an `&`. This means `s1` in Listing 8-18 will be
+moved into the `add` call and will no longer be valid after that. So although
+`let s3 = s1 + &s2;` looks like it will copy both strings and create a new one,
+this statement actually takes ownership of `s1`, appends a copy of the contents
+of `s2`, and then returns ownership of the result. In other words, it looks
+like it’s making a lot of copies but isn’t; the implementation is more
+efficient than copying.
+
+If we need to concatenate multiple strings, the behavior of the `+` operator
+gets unwieldy:
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/no-listing-01-concat-multiple-strings/src/main.rs:here}}
+```
+
+At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"`
+characters, it’s difficult to see what’s going on. For more complicated string
+combining, we can instead use the `format!` macro:
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/no-listing-02-format/src/main.rs:here}}
+```
+
+This code also sets `s` to `tic-tac-toe`. The `format!` macro works like
+`println!`, but instead of printing the output to the screen, it returns a
+`String` with the contents. The version of the code using `format!` is much
+easier to read, and the code generated by the `format!` macro uses references
+so that this call doesn’t take ownership of any of its parameters.
+
+### Indexing into Strings
+
+In many other programming languages, accessing individual characters in a
+string by referencing them by index is a valid and common operation. However,
+if you try to access parts of a `String` using indexing syntax in Rust, you’ll
+get an error. Consider the invalid code in Listing 8-19.
+
+```rust,ignore,does_not_compile
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-19/src/main.rs:here}}
+```
+
+<span class="caption">Listing 8-19: Attempting to use indexing syntax with a
+String</span>
+
+This code will result in the following error:
+
+```console
+{{#include ../listings/ch08-common-collections/listing-08-19/output.txt}}
+```
+
+The error and the note tell the story: Rust strings don’t support indexing. But
+why not? To answer that question, we need to discuss how Rust stores strings in
+memory.
+
+#### Internal Representation
+
+A `String` is a wrapper over a `Vec<u8>`. Let’s look at some of our properly
+encoded UTF-8 example strings from Listing 8-14. First, this one:
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:spanish}}
+```
+
+In this case, `len` will be 4, which means the vector storing the string “Hola”
+is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. The
+following line, however, may surprise you. (Note that this string begins with
+the capital Cyrillic letter Ze, not the Arabic number 3.)
+
+```rust
+{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:russian}}
+```
+
+Asked how long the string is, you might say 12. In fact, Rust’s answer is 24:
+that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because
+each Unicode scalar value in that string takes 2 bytes of storage. Therefore,
+an index into the string’s bytes will not always correlate to a valid Unicode
+scalar value. To demonstrate, consider this invalid Rust code:
+
+```rust,ignore,does_not_compile
+let hello = "Здравствуйте";
+let answer = &hello[0];
+```
+
+You already know that `answer` will not be `З`, the first letter. When encoded
+in UTF-8, the first byte of `З` is `208` and the second is `151`, so it would
+seem that `answer` should in fact be `208`, but `208` is not a valid character
+on its own. Returning `208` is likely not what a user would want if they asked
+for the first letter of this string; however, that’s the only data that Rust
+has at byte index 0. Users generally don’t want the byte value returned, even
+if the string contains only Latin letters: if `&"hello"[0]` were valid code
+that returned the byte value, it would return `104`, not `h`.
+
+The answer, then, is that to avoid returning an unexpected value and causing
+bugs that might not be discovered immediately, Rust doesn’t compile this code
+at all and prevents misunderstandings early in the development process.
+
+#### Bytes and Scalar Values and Grapheme Clusters! Oh My!
+
+Another point about UTF-8 is that there are actually three relevant ways to
+look at strings from Rust’s perspective: as bytes, scalar values, and grapheme
+clusters (the closest thing to what we would call *letters*).
+
+If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is
+stored as a vector of `u8` values that looks like this:
+
+```text
+[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
+224, 165, 135]
+```
+
+That’s 18 bytes and is how computers ultimately store this data. If we look at
+them as Unicode scalar values, which are what Rust’s `char` type is, those
+bytes look like this:
+
+```text
+['न', 'म', 'स', '्', 'त', 'े']
+```
+
+There are six `char` values here, but the fourth and sixth are not letters:
+they’re diacritics that don’t make sense on their own. Finally, if we look at
+them as grapheme clusters, we’d get what a person would call the four letters
+that make up the Hindi word:
+
+```text
+["न", "म", "स्", "ते"]
+```
+
+Rust provides different ways of interpreting the raw string data that computers
+store so that each program can choose the interpretation it needs, no matter
+what human language the data is in.
+
+A final reason Rust doesn’t allow us to index into a `String` to get a
+character is that indexing operations are expected to always take constant time
+(O(1)). But it isn’t possible to guarantee that performance with a `String`,
+because Rust would have to walk through the contents from the beginning to the
+index to determine how many valid characters there were.
+
+### Slicing Strings
+
+Indexing into a string is often a bad idea because it’s not clear what the
+return type of the string-indexing operation should be: a byte value, a
+character, a grapheme cluster, or a string slice. If you really need to use
+indices to create string slices, therefore, Rust asks you to be more specific.
+
+Rather than indexing using `[]` with a single number, you can use `[]` with a
+range to create a string slice containing particular bytes:
+
+```rust
+let hello = "Здравствуйте";
+
+let s = &hello[0..4];
+```
+
+Here, `s` will be a `&str` that contains the first 4 bytes of the string.
+Earlier, we mentioned that each of these characters was 2 bytes, which means
+`s` will be `Зд`.
+
+If we were to try to slice only part of a character’s bytes with something like
+`&hello[0..1]`, Rust would panic at runtime in the same way as if an invalid
+index were accessed in a vector:
+
+```console
+{{#include ../listings/ch08-common-collections/output-only-01-not-char-boundary/output.txt}}
+```
+
+You should use ranges to create string slices with caution, because doing so
+can crash your program.
+
+### Methods for Iterating Over Strings
+
+The best way to operate on pieces of strings is to be explicit about whether
+you want characters or bytes. For individual Unicode scalar values, use the
+`chars` method. Calling `chars` on “Зд” separates out and returns two values
+of type `char`, and you can iterate over the result to access each element:
+
+```rust
+for c in "Зд".chars() {
+ println!("{}", c);
+}
+```
+
+This code will print the following:
+
+```text
+```
+
+Alternatively, the `bytes` method returns each raw byte, which might be
+appropriate for your domain:
+
+```rust
+for b in "Зд".bytes() {
+ println!("{}", b);
+}
+```
+
+This code will print the four bytes that make up this string:
+
+```text
+208
+151
+208
+180
+```
+
+But be sure to remember that valid Unicode scalar values may be made up of more
+than 1 byte.
+
+Getting grapheme clusters from strings as with the Devanagari script is
+complex, so this functionality is not provided by the standard library. Crates
+are available on [crates.io](https://crates.io/)<!-- ignore --> if this is the
+functionality you need.
+
+### Strings Are Not So Simple
+
+To summarize, strings are complicated. Different programming languages make
+different choices about how to present this complexity to the programmer. Rust
+has chosen to make the correct handling of `String` data the default behavior
+for all Rust programs, which means programmers have to put more thought into
+handling UTF-8 data upfront. This trade-off exposes more of the complexity of
+strings than is apparent in other programming languages, but it prevents you
+from having to handle errors involving non-ASCII characters later in your
+development life cycle.
+
+The good news is that the standard library offers a lot of functionality built
+off the `String` and `&str` types to help handle these complex situations
+correctly. Be sure to check out the documentation for useful methods like
+`contains` for searching in a string and `replace` for substituting parts of a
+string with another string.
+
+Let’s switch to something a bit less complex: hash maps!