diff options
Diffstat (limited to '')
-rw-r--r-- | src/doc/book/nostarch/chapter08.md | 1284 |
1 files changed, 1284 insertions, 0 deletions
diff --git a/src/doc/book/nostarch/chapter08.md b/src/doc/book/nostarch/chapter08.md new file mode 100644 index 000000000..1c7968c99 --- /dev/null +++ b/src/doc/book/nostarch/chapter08.md @@ -0,0 +1,1284 @@ +<!-- DO NOT EDIT THIS FILE. + +This file is periodically generated from the content in the `/src/` +directory, so all fixes need to be made in `/src/`. +--> + +[TOC] + +# Common Collections + +Rust’s standard library includes a number of very useful data structures called +*collections*. Most other data types represent one specific value, but +collections can contain multiple values. Unlike the built-in array and tuple +types, the data these collections point to is stored on the heap, which means +the amount of data does not need to be known at compile time and can grow or +shrink as the program runs. Each kind of collection has different capabilities +and costs, and choosing an appropriate one for your current situation is a +skill you’ll develop over time. In this chapter, we’ll discuss three +collections that are used very often in Rust programs: + +* A *vector* allows you to store a variable number of values next to each other. +* A *string* is a collection of characters. We’ve mentioned the `String` type + previously, but in this chapter we’ll talk about it in depth. +* A *hash map* allows you to associate a value with a particular key. It’s a + particular implementation of the more general data structure called a *map*. + +To learn about the other kinds of collections provided by the standard library, +see the documentation at *https://doc.rust-lang.org/std/collections/index.html*. + +We’ll discuss how to create and update vectors, strings, and hash maps, as well +as what makes each special. + +## Storing Lists of Values with Vectors + +The first collection type we’ll look at is `Vec<T>`, also known as a *vector*. +Vectors allow you to store more than one value in a single data structure that +puts all the values next to each other in memory. Vectors can only store values +of the same type. They are useful when you have a list of items, such as the +lines of text in a file or the prices of items in a shopping cart. + +### Creating a New Vector + +To create a new empty vector, we call the `Vec::new` function, as shown in +Listing 8-1. + +``` +let v: Vec<i32> = Vec::new(); +``` + +Listing 8-1: Creating a new, empty vector to hold values of type `i32` + +Note that we added a type annotation here. Because we aren’t inserting any +values into this vector, Rust doesn’t know what kind of elements we intend to +store. This is an important point. Vectors are implemented using generics; +we’ll cover how to use generics with your own types in Chapter 10. For now, +know that the `Vec<T>` type provided by the standard library can hold any type. +When we create a vector to hold a specific type, we can specify the type within +angle brackets. In Listing 8-1, we’ve told Rust that the `Vec<T>` in `v` will +hold elements of the `i32` type. + +More often, you’ll create a `Vec<T>` with initial values and Rust will infer +the type of value you want to store, so you rarely need to do this type +annotation. Rust conveniently provides the `vec!` macro, which will create a +new vector that holds the values you give it. Listing 8-2 creates a new +`Vec<i32>` that holds the values `1`, `2`, and `3`. The integer type is `i32` +because that’s the default integer type, as we discussed in the “Data Types” +section of Chapter 3. + +``` +let v = vec![1, 2, 3]; +``` + +Listing 8-2: Creating a new vector containing values + +Because we’ve given initial `i32` values, Rust can infer that the type of `v` +is `Vec<i32>`, and the type annotation isn’t necessary. Next, we’ll look at how +to modify a vector. + +### Updating a Vector + +To create a vector and then add elements to it, we can use the `push` method, +as shown in Listing 8-3. + +``` +let mut v = Vec::new(); + +v.push(5); +v.push(6); +v.push(7); +v.push(8); +``` + +Listing 8-3: Using the `push` method to add values to a vector + +As with any variable, if we want to be able to change its value, we need to +make it mutable using the `mut` keyword, as discussed in Chapter 3. The numbers +we place inside are all of type `i32`, and Rust infers this from the data, so +we don’t need the `Vec<i32>` annotation. + +<!-- +I think people from other languages may get stuck a bit here because this is +the first time (I think?) that we're showing a hindley-milner style type +inference in action (rather than using the initializer to infer the type). + +Should we show the definition for `push`? That'd let us tie together the method +call, mutable reference to self drawing on the `impl` we saw in earlier +chapters and help to explain a little why the above works without having to +annotate the type of the Vec. +/JT ---> +<!-- I think readers would be more confused showing the definition of `push` +here because we haven't covered generics yet. I haven't gotten comments about +people being confused at this point (which doesn't mean they aren't), but +personally when I learned this, it made sense to me that the type of the vector +would be known from what I put in it. I'm leaning towards not elaborating here. +/Carol --> + +### Reading Elements of Vectors + +There are two ways to reference a value stored in a vector: via indexing or +using the `get` method. In the following examples, we’ve annotated the types of +the values that are returned from these functions for extra clarity. + +Listing 8-4 shows both methods of accessing a value in a vector, with indexing +syntax and the `get` method. + +``` +let v = vec![1, 2, 3, 4, 5]; + +[1] let third: &i32 = &v[2]; +println!("The third element is {}", third); + +[2] let third: Option<&i32> = v.get(2); +match third { + Some(third) => println!("The third element is {}", third), + None => println!("There is no third element."), +} +``` + +Listing 8-4: Using indexing syntax or the `get` method to access an item in a +vector + +Note a few details here. We use the index value of `2` to get the third element +[1] because vectors are indexed by number, starting at zero. Using `&` and `[]` +gives us a reference to the element at the index value. When we use the `get` +method with the index passed as an argument [2], we get an `Option<&T>` that we +can use with `match`. + +<!--- +I think it should be "Second, we get the third element by using both `&` and +`[]`" +/JT ---> +<!-- No, it shouldn't, but I reworded this whole paragraph and added wingdings +because it was unclear /Carol --> + +The reason Rust provides these two ways to reference an element is so you can +choose how the program behaves when you try to use an index value outside the +range of existing elements. As an example, let’s see what happens when we have +a vector of five elements and then we try to access an element at index 100 +with each technique, as shown in Listing 8-5. + +``` +let v = vec![1, 2, 3, 4, 5]; + +let does_not_exist = &v[100]; +let does_not_exist = v.get(100); +``` + +Listing 8-5: Attempting to access the element at index 100 in a vector +containing five elements + +When we run this code, the first `[]` method will cause the program to panic +because it references a nonexistent element. This method is best used when you +want your program to crash if there’s an attempt to access an element past the +end of the vector. + +When the `get` method is passed an index that is outside the vector, it returns +`None` without panicking. You would use this method if accessing an element +beyond the range of the vector may happen occasionally under normal +circumstances. Your code will then have logic to handle having either +`Some(&element)` or `None`, as discussed in Chapter 6. For example, the index +could be coming from a person entering a number. If they accidentally enter a +number that’s too large and the program gets a `None` value, you could tell the +user how many items are in the current vector and give them another chance to +enter a valid value. That would be more user-friendly than crashing the program +due to a typo! + +When the program has a valid reference, the borrow checker enforces the +ownership and borrowing rules (covered in Chapter 4) to ensure this reference +and any other references to the contents of the vector remain valid. Recall the +rule that states you can’t have mutable and immutable references in the same +scope. That rule applies in Listing 8-6, where we hold an immutable reference +to the first element in a vector and try to add an element to the end. This +program won’t work if we also try to refer to that element later in the +function: + +``` +let mut v = vec![1, 2, 3, 4, 5]; + +let first = &v[0]; + +v.push(6); + +println!("The first element is: {}", first); +``` + +Listing 8-6: Attempting to add an element to a vector while holding a reference +to an item + +Compiling this code will result in this error: + +``` + --> src/main.rs:6:5 + | +4 | let first = &v[0]; + | - immutable borrow occurs here +5 | +6 | v.push(6); + | ^^^^^^^^^ mutable borrow occurs here +7 | +8 | println!("The first element is: {}", first); + | ----- immutable borrow later used here +``` + +The code in Listing 8-6 might look like it should work: why should a reference +to the first element care about changes at the end of the vector? This error is +due to the way vectors work: because vectors put the values next to each other +in memory, adding a new element onto the end of the vector might require +allocating new memory and copying the old elements to the new space, if there +isn’t enough room to put all the elements next to each other where the vector +is currently stored. In that case, the reference to the first element would be +pointing to deallocated memory. The borrowing rules prevent programs from +ending up in that situation. + +> Note: For more on the implementation details of the `Vec<T>` type, see “The +> Rustonomicon” at *https://doc.rust-lang.org/nomicon/vec/vec.html*. + +### Iterating over the Values in a Vector + +To access each element in a vector in turn, we would iterate through all of the +elements rather than use indices to access one at a time. Listing 8-7 shows how +to use a `for` loop to get immutable references to each element in a vector of +`i32` values and print them. + +``` +let v = vec![100, 32, 57]; +for i in &v { + println!("{}", i); +} +``` + +Listing 8-7: Printing each element in a vector by iterating over the elements +using a `for` loop + +We can also iterate over mutable references to each element in a mutable vector +in order to make changes to all the elements. The `for` loop in Listing 8-8 +will add `50` to each element. + +``` +let mut v = vec![100, 32, 57]; +for i in &mut v { + *i += 50; +} +``` + +Listing 8-8: Iterating over mutable references to elements in a vector + +To change the value that the mutable reference refers to, we have to use the +`*` dereference operator to get to the value in `i` before we can use the +`+=` operator. We’ll talk more about the dereference operator in the +“Following the Pointer to the Value with the Dereference Operator” +section of Chapter 15. + +Iterating over a vector, whether immutably or mutably, is safe because of the +borrow checker’s rules. If we attempted to insert or remove items in the `for` +loop bodies in Listing 8-7 and Listing 8-8, we would get a compiler error +similar to the one we got with the code in Listing 8-6. The reference to the +vector that the `for` loop holds prevents simultaneous modification of the +whole vector. + +<!-- +Maybe worth a mention: the above use of the mutable reference while you iterate +is perfectly safe because there's no changing that's happening to the vector +that would invalidate the iterator. But, if you wanted to iterate the vector +while also trying to remove or insert elements, you'd get an error. For example: + +``` +let mut v = vec![100, 32, 57]; +for i in &mut v { + *i += 50; + if *i > 100 { + v.push(10); // <-- a second mutable reference is needed and will fail to compile + } +} +``` + +Things like this help Rust prevent some classic C++ issues where people didn't +think about the implications of growing/shrinking a container while iterating +over it. +/JT ---> +<!-- I thought Listing 8-6 covered this, but I can see how driving home the +connection with iteration as well is worthwhile so I added a paragraph just +before this comment. Please check for clarity Liz! /Carol --> + +### Using an Enum to Store Multiple Types + +Vectors can only store values that are the same type. This can be inconvenient; +there are definitely use cases for needing to store a list of items of +different types. Fortunately, the variants of an enum are defined under the +same enum type, so when we need one type to represent elements of different +types, we can define and use an enum! + +For example, say we want to get values from a row in a spreadsheet in which +some of the columns in the row contain integers, some floating-point numbers, +and some strings. We can define an enum whose variants will hold the different +value types, and all the enum variants will be considered the same type: that +of the enum. Then we can create a vector to hold that enum and so, ultimately, +holds different types. We’ve demonstrated this in Listing 8-9. + +``` +enum SpreadsheetCell { + Int(i32), + Float(f64), + Text(String), +} + +let row = vec![ + SpreadsheetCell::Int(3), + SpreadsheetCell::Text(String::from("blue")), + SpreadsheetCell::Float(10.12), +]; +``` + +Listing 8-9: Defining an `enum` to store values of different types in one +vector + +Rust needs to know what types will be in the vector at compile time so it knows +exactly how much memory on the heap will be needed to store each element. We +must also be explicit about what types are allowed in this vector. If Rust +allowed a vector to hold any type, there would be a chance that one or more of +the types would cause errors with the operations performed on the elements of +the vector. Using an enum plus a `match` expression means that Rust will ensure +at compile time that every possible case is handled, as discussed in Chapter 6. + +If you don’t know the exhaustive set of types a program will get at runtime to +store in a vector, the enum technique won’t work. Instead, you can use a trait +object, which we’ll cover in Chapter 17. + +Now that we’ve discussed some of the most common ways to use vectors, be sure +to review the API documentation for all the many useful methods defined on +`Vec<T>` by the standard library. For example, in addition to `push`, a `pop` +method removes and returns the last element. + +### Dropping a Vector Drops Its Elements + +Like any other `struct`, a vector is freed when it goes out of scope, as +annotated in Listing 8-10. + +``` +{ + let v = vec![1, 2, 3, 4]; + + // do stuff with v +} // <- v goes out of scope and is freed here +``` + +Listing 8-10: Showing where the vector and its elements are dropped + +When the vector gets dropped, all of its contents are also dropped, meaning the +integers it holds will be cleaned up. The borrow checker ensures that any +references to contents of a vector are only used while the vector itself is +valid. + +Let’s move on to the next collection type: `String`! + +<!-- +nit: I think "meaning the integers it holds will be cleaned up" reads a little +better + +nit #2: imho dropping isn't as imports when you start using vectors as reading +elements from the vector. Is it better for training to mention it here, or +would it be possible to move it later? +/JT --> +<!-- Took both nit suggestions-- reworded for nit #1 and moved this section to +the end of the Vec section (and renumbered the listings) for nit #2. Liz, +please check to make sure I didn't miss anything in the way the Vec section +flows now! /Carol --> + +## Storing UTF-8 Encoded Text with Strings + +We talked about strings in Chapter 4, but we’ll look at them in more depth now. +New Rustaceans commonly get stuck on strings for a combination of three +reasons: Rust’s propensity for exposing possible errors, strings being a more +complicated data structure than many programmers give them credit for, and +UTF-8. These factors combine in a way that can seem difficult when you’re +coming from other programming languages. + +We discuss strings in the context of collections because strings are +implemented as a collection of bytes, plus some methods to provide useful +functionality when those bytes are interpreted as text. In this section, we’ll +talk about the operations on `String` that every collection type has, such as +creating, updating, and reading. We’ll also discuss the ways in which `String` +is different from the other collections, namely how indexing into a `String` is +complicated by the differences between how people and computers interpret +`String` data. + +### What Is a String? + +We’ll first define what we mean by the term *string*. Rust has only one string +type in the core language, which is the string slice `str` that is usually seen +in its borrowed form `&str`. In Chapter 4, we talked about *string slices*, +which are references to some UTF-8 encoded string data stored elsewhere. String +literals, for example, are stored in the program’s binary and are therefore +string slices. + +The `String` type, which is provided by Rust’s standard library rather than +coded into the core language, is a growable, mutable, owned, UTF-8 encoded +string type. When Rustaceans refer to “strings” in Rust, they might be +referring to either the `String` or the string slice `&str` types, not just one +of those types. Although this section is largely about `String`, both types are +used heavily in Rust’s standard library, and both `String` and string slices +are UTF-8 encoded. + +<!--- +I'm wondering if listing the above makes it a bit more cumbersome. In effect, +out of gate we're saying there are a lot of different string types. + +But perhaps we could focus on String and &str here and let them learn about +CString/CStr when doing FFI and OsString/OsStr when they work on paths? +Basically, I'm wondering if we should cut down on the concept count and let +them come across those alternate strings more naturally. +/JT ---> +<!-- I'm ok with that! I removed the paragraph talking about the other, rarer +string types. /Carol --> + +### Creating a New String + +Many of the same operations available with `Vec<T>` are available with `String` +as well, because `String` is actually implemented as a wrapper around a vector +of bytes with some extra guarantees, restrictions, and capabilities. An example +of a function that works the same way with `Vec<T>` and `String` is the `new` +function to create an instance, shown in Listing 8-11. + +``` +let mut s = String::new(); +``` + +Listing 8-11: Creating a new, empty `String` + +This line creates a new empty string called `s`, which we can then load data +into. Often, we’ll have some initial data that we want to start the string +with. For that, we use the `to_string` method, which is available on any type +that implements the `Display` trait, as string literals do. Listing 8-12 shows +two examples. + +``` +let data = "initial contents"; + +let s = data.to_string(); + +// the method also works on a literal directly: +let s = "initial contents".to_string(); +``` + +Listing 8-12: Using the `to_string` method to create a `String` from a string +literal + +This code creates a string containing `initial contents`. + +We can also use the function `String::from` to create a `String` from a string +literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12 +that uses `to_string`. + +``` +let s = String::from("initial contents"); +``` + +Listing 8-13: Using the `String::from` function to create a `String` from a +string literal + +Because strings are used for so many things, we can use many different generic +APIs for strings, providing us with a lot of options. Some of them can seem +redundant, but they all have their place! In this case, `String::from` and +`to_string` do the same thing, so which you choose is a matter of style and +readability. + +Remember that strings are UTF-8 encoded, so we can include any properly encoded +data in them, as shown in Listing 8-14. + +``` +let hello = String::from("السلام عليكم"); +let hello = String::from("Dobrý den"); +let hello = String::from("Hello"); +let hello = String::from("שָׁלוֹם"); +let hello = String::from("नमस्ते"); +let hello = String::from("こんにちは"); +let hello = String::from("안녕하세요"); +let hello = String::from("你好"); +let hello = String::from("Olá"); +let hello = String::from("Здравствуйте"); +let hello = String::from("Hola"); +``` + +Listing 8-14: Storing greetings in different languages in strings + +All of these are valid `String` values. + +### Updating a String + +A `String` can grow in size and its contents can change, just like the contents +of a `Vec<T>`, if you push more data into it. In addition, you can conveniently +use the `+` operator or the `format!` macro to concatenate `String` values. + +#### Appending to a String with `push_str` and `push` + +We can grow a `String` by using the `push_str` method to append a string slice, +as shown in Listing 8-15. + +``` +let mut s = String::from("foo"); +s.push_str("bar"); +``` + +Listing 8-15: Appending a string slice to a `String` using the `push_str` method + +After these two lines, `s` will contain `foobar`. The `push_str` method takes a +string slice because we don’t necessarily want to take ownership of the +parameter. For example, in the code in Listing 8-16, we want to be able to use +`s2` after appending its contents to `s1`. + +``` +let mut s1 = String::from("foo"); +let s2 = "bar"; +s1.push_str(s2); +println!("s2 is {}", s2); +``` + +Listing 8-16: Using a string slice after appending its contents to a `String` + +If the `push_str` method took ownership of `s2`, we wouldn’t be able to print +its value on the last line. However, this code works as we’d expect! + +The `push` method takes a single character as a parameter and adds it to the +`String`. Listing 8-17 adds the letter "l" to a `String` using the `push` +method. + +``` +let mut s = String::from("lo"); +s.push('l'); +``` + +Listing 8-17: Adding one character to a `String` value using `push` + +As a result, `s` will contain `lol`. + +#### Concatenation with the `+` Operator or the `format!` Macro + +Often, you’ll want to combine two existing strings. One way to do so is to use +the `+` operator, as shown in Listing 8-18. + +``` +let s1 = String::from("Hello, "); +let s2 = String::from("world!"); +let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used +``` + +Listing 8-18: Using the `+` operator to combine two `String` values into a new +`String` value + +The string `s3` will contain `Hello, world!`. The reason `s1` is no longer +valid after the addition, and the reason we used a reference to `s2`, has to do +with the signature of the method that’s called when we use the `+` operator. +The `+` operator uses the `add` method, whose signature looks something like +this: + +``` +fn add(self, s: &str) -> String { +``` + +In the standard library, you’ll see `add` defined using generics and associated +types. Here, we’ve substituted in concrete types, which is what happens when we +call this method with `String` values. We’ll discuss generics in Chapter 10. +This signature gives us the clues we need to understand the tricky bits of the +`+` operator. + +First, `s2` has an `&`, meaning that we’re adding a *reference* of the second +string to the first string. This is because of the `s` parameter in the `add` +function: we can only add a `&str` to a `String`; we can’t add two `String` +values together. But wait—the type of `&s2` is `&String`, not `&str`, as +specified in the second parameter to `add`. So why does Listing 8-18 compile? + +<!-- +The above isn't quite right - the trait for ops::Add uses an Rhs associated type +instead of using T for both lhs and rhs. + +``` +pub trait Add<Rhs = Self> { + type Output; + fn add(self, rhs: Rhs) -> Self::Output; +} +``` + +The implementation of Add for String fills in Rhs with the slice: + +``` +impl<'_> Add<&'_ str> for String +``` + +Not sure if it's better to fix the description and not have deref coercion +discussion following, or fix the example so you can have the coercion +discussion. +/JT ---> +<!-- I've made an edit above to address this /Carol --> + +The reason we’re able to use `&s2` in the call to `add` is that the compiler +can *coerce* the `&String` argument into a `&str`. When we call the `add` +method, Rust uses a *deref coercion*, which here turns `&s2` into `&s2[..]`. +We’ll discuss deref coercion in more depth in Chapter 15. Because `add` does +not take ownership of the `s` parameter, `s2` will still be a valid `String` +after this operation. + +Second, we can see in the signature that `add` takes ownership of `self`, +because `self` does *not* have an `&`. This means `s1` in Listing 8-18 will be +moved into the `add` call and will no longer be valid after that. So although +`let s3 = s1 + &s2;` looks like it will copy both strings and create a new one, +this statement actually takes ownership of `s1`, appends a copy of the contents +of `s2`, and then returns ownership of the result. In other words, it looks +like it’s making a lot of copies but isn’t; the implementation is more +efficient than copying. + +If we need to concatenate multiple strings, the behavior of the `+` operator +gets unwieldy: + +``` +let s1 = String::from("tic"); +let s2 = String::from("tac"); +let s3 = String::from("toe"); + +let s = s1 + "-" + &s2 + "-" + &s3; +``` + +At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"` +characters, it’s difficult to see what’s going on. For more complicated string +combining, we can instead use the `format!` macro: + +``` +let s1 = String::from("tic"); +let s2 = String::from("tac"); +let s3 = String::from("toe"); + +let s = format!("{}-{}-{}", s1, s2, s3); +``` + +This code also sets `s` to `tic-tac-toe`. The `format!` macro works like +`println!`, but instead of printing the output to the screen, it returns a +`String` with the contents. The version of the code using `format!` is much +easier to read, and the code generated by the `format!` macro uses references +so that this call doesn’t take ownership of any of its parameters. + +### Indexing into Strings + +In many other programming languages, accessing individual characters in a +string by referencing them by index is a valid and common operation. However, +if you try to access parts of a `String` using indexing syntax in Rust, you’ll +get an error. Consider the invalid code in Listing 8-19. + +``` +let s1 = String::from("hello"); +let h = s1[0]; +``` + +Listing 8-19: Attempting to use indexing syntax with a String + +This code will result in the following error: + +``` +error[E0277]: the type `String` cannot be indexed by `{integer}` + --> src/main.rs:3:13 + | +3 | let h = s1[0]; + | ^^^^^ `String` cannot be indexed by `{integer}` + | + = help: the trait `Index<{integer}>` is not implemented for `String` +``` + +The error and the note tell the story: Rust strings don’t support indexing. But +why not? To answer that question, we need to discuss how Rust stores strings in +memory. + +#### Internal Representation + +A `String` is a wrapper over a `Vec<u8>`. Let’s look at some of our properly +encoded UTF-8 example strings from Listing 8-14. First, this one: + +``` +let hello = String::from("Hola"); +``` + +In this case, `len` will be 4, which means the vector storing the string “Hola” +is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. The +following line, however, may surprise you. (Note that this string begins with +the capital Cyrillic letter Ze, not the Arabic number 3.) + +``` +let hello = String::from("Здравствуйте"); +``` + +Asked how long the string is, you might say 12. In fact, Rust’s answer is 24: +that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because +each Unicode scalar value in that string takes 2 bytes of storage. Therefore, +an index into the string’s bytes will not always correlate to a valid Unicode +scalar value. To demonstrate, consider this invalid Rust code: + +``` +let hello = "Здравствуйте"; +let answer = &hello[0]; +``` + +You already know that `answer` will not be `З`, the first letter. When encoded +in UTF-8, the first byte of `З` is `208` and the second is `151`, so it would +seem that `answer` should in fact be `208`, but `208` is not a valid character +on its own. Returning `208` is likely not what a user would want if they asked +for the first letter of this string; however, that’s the only data that Rust +has at byte index 0. Users generally don’t want the byte value returned, even +if the string contains only Latin letters: if `&"hello"[0]` were valid code +that returned the byte value, it would return `104`, not `h`. + +The answer, then, is that to avoid returning an unexpected value and causing +bugs that might not be discovered immediately, Rust doesn’t compile this code +at all and prevents misunderstandings early in the development process. + +#### Bytes and Scalar Values and Grapheme Clusters! Oh My! + +Another point about UTF-8 is that there are actually three relevant ways to +look at strings from Rust’s perspective: as bytes, scalar values, and grapheme +clusters (the closest thing to what we would call *letters*). + +If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is +stored as a vector of `u8` values that looks like this: + +``` +[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, +224, 165, 135] +``` + +That’s 18 bytes and is how computers ultimately store this data. If we look at +them as Unicode scalar values, which are what Rust’s `char` type is, those +bytes look like this: + +``` +['न', 'म', 'स', '्', 'त', 'े'] +``` + +There are six `char` values here, but the fourth and sixth are not letters: +they’re diacritics that don’t make sense on their own. Finally, if we look at +them as grapheme clusters, we’d get what a person would call the four letters +that make up the Hindi word: + +``` +["न", "म", "स्", "ते"] +``` + +Rust provides different ways of interpreting the raw string data that computers +store so that each program can choose the interpretation it needs, no matter +what human language the data is in. + +A final reason Rust doesn’t allow us to index into a `String` to get a +character is that indexing operations are expected to always take constant time +(O(1)). But it isn’t possible to guarantee that performance with a `String`, +because Rust would have to walk through the contents from the beginning to the +index to determine how many valid characters there were. + +### Slicing Strings + +Indexing into a string is often a bad idea because it’s not clear what the +return type of the string-indexing operation should be: a byte value, a +character, a grapheme cluster, or a string slice. If you really need to use +indices to create string slices, therefore, Rust asks you to be more specific. + +Rather than indexing using `[]` with a single number, you can use `[]` with a +range to create a string slice containing particular bytes: + +``` +let hello = "Здравствуйте"; + +let s = &hello[0..4]; +``` + +Here, `s` will be a `&str` that contains the first 4 bytes of the string. +Earlier, we mentioned that each of these characters was 2 bytes, which means +`s` will be `Зд`. + +If we were to try to slice only part of a character’s bytes with something like +`&hello[0..1]`, Rust would panic at runtime in the same way as if an invalid +index were accessed in a vector: + +``` +thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'З' (bytes 0..2) of `Здравствуйте`', src/main.rs:4:14 +``` + +You should use ranges to create string slices with caution, because doing so +can crash your program. + +### Methods for Iterating Over Strings + +<!--- is there a reason this comes after how to slice, rather than after the +discussion on why we can't directly index into a string? /LC ---> +<!-- I think the idea was that we show this progression of from worst technique +to best: + +1. direct indexing, which doesn't compile +2. slicing with a range, which looks similar to indexing, which does compile +but might panic at runtime +3. iterating over chars or bytes, which compiles and won't panic + +Do you have suggestions on making this clearer? I've tried to add a bit at the +beginning of this section /Carol +--> +<!-- JT, what do you think -- is this ordering clear to you? /LC --> +<!--- +I'm okay with the current order - I think showing why it's bad, what's close to +what you try first, and then finally the idiomatic Rust solution reads okay. + +One tiny nit, for flow, would be to use the Cyrillic example first here to show +how `.chars()` works well for it and then mention that for more complex +scripts, like Hindi, you'll need to use the more full-featured string handling +you find on crates.io. +/JT ---> +<!-- I've taken JT's suggestion here to use part of the Cyrillic string, then +mention you'll need a crate to correctly get the grapheme clusters for Hindi +/Carol --> + +The best way to operate on pieces of strings is to be explicit about whether +you want characters or bytes. For individual Unicode scalar values, use the +`chars` method. Calling `chars` on “Зд” separates out and returns two values +of type `char`, and you can iterate over the result to access each element: + +```rust +for c in "Зд".chars() { + println!("{}", c); +} +``` + +This code will print the following: + +```text +З +д +``` + +Alternatively, the `bytes` method returns each raw byte, which might be +appropriate for your domain: + +```rust +for b in "Зд".bytes() { + println!("{}", b); +} +``` + +This code will print the four bytes that make up this string: + +```text +208 +151 +208 +180 +``` + +But be sure to remember that valid Unicode scalar values may be made up of more +than 1 byte. + +Getting grapheme clusters from strings as with the Devanagari script is +complex, so this functionality is not provided by the standard library. Crates +are available on *https://crates.io/* if this is the functionality you need. + +### Strings Are Not So Simple + +<!--- +Because Strings are quite complicated, and have complications that are all +their own and unlike any other containers, I wonder if maybe this chapter +should be two different chapters with one specifically being about strings, +string slices, chars, and related? +/JT ---> +<!-- I don't think I want to make that big of a change at this point... the +original idea was to compare and contrast the different containers, perhaps +that's not serving its purpose as well as a chapter split could... I'll think +about this for the next major revision. /Carol --> + +<!--- +We don't talk about searching in a string. Feels like it could use an example +or two? +/JT ---> +<!-- To address this suggestion and a bit of the previous suggestion as well, I +changed the first paragraph in the "Creating a New String" section to mention +that a `String` is implemented using a `Vec`. Then, to echo the last paragraph +before the "Dropping a Vector Drops Its Elements" section, I've added some text +here to again urge the reader to check out the standard library documentation +for more functionality. /Carol --> + +To summarize, strings are complicated. Different programming languages make +different choices about how to present this complexity to the programmer. Rust +has chosen to make the correct handling of `String` data the default behavior +for all Rust programs, which means programmers have to put more thought into +handling UTF-8 data upfront. This trade-off exposes more of the complexity of +strings than is apparent in other programming languages, but it prevents you +from having to handle errors involving non-ASCII characters later in your +development life cycle. + +The good news is that the standard library offers a lot of functionality built +off the `String` and `&str` types to help handle these complex situations +correctly. Be sure to check out the documentation for useful methods like +`contains` for searching in a string and `replace` for substituting parts of a +string with another string. + +Let’s switch to something a bit less complex: hash maps! + +## Storing Keys with Associated Values in Hash Maps + +The last of our common collections is the *hash map*. The type `HashMap<K, V>` +stores a mapping of keys of type `K` to values of type `V` using a +*hashing function*, which determines how it places these keys and values into +memory. Many programming languages support this kind of data structure, but +they often use a different name, such as hash, map, object, hash table, +dictionary, or associative array, just to name a few. + +Hash maps are useful when you want to look up data not by using an index, as +you can with vectors, but by using a key that can be of any type. For example, +in a game, you could keep track of each team’s score in a hash map in which +each key is a team’s name and the values are each team’s score. Given a team +name, you can retrieve its score. + +We’ll go over the basic API of hash maps in this section, but many more goodies +are hiding in the functions defined on `HashMap<K, V>` by the standard library. +As always, check the standard library documentation for more information. + +### Creating a New Hash Map + +One way to create an empty hash map is using `new` and adding elements with +`insert`. In Listing 8-20, we’re keeping track of the scores of two teams whose +names are *Blue* and *Yellow*. The Blue team starts with 10 points, and the +Yellow team starts with 50. + +``` +use std::collections::HashMap; + +let mut scores = HashMap::new(); + +scores.insert(String::from("Blue"), 10); +scores.insert(String::from("Yellow"), 50); +``` + +Listing 8-20: Creating a new hash map and inserting some keys and values + +Note that we need to first `use` the `HashMap` from the collections portion of +the standard library. Of our three common collections, this one is the least +often used, so it’s not included in the features brought into scope +automatically in the prelude. Hash maps also have less support from the +standard library; there’s no built-in macro to construct them, for example. + +Just like vectors, hash maps store their data on the heap. This `HashMap` has +keys of type `String` and values of type `i32`. Like vectors, hash maps are +homogeneous: all of the keys must have the same type as each other, and all of +the values must have the same type. + +<!--- +I'm not sure I've seen this in the wild? I'm tempted to say to skip the zip +example for flow and go from creating the hash map to working with its +contents. +/JT ---> +<!-- Cut Listing 8-21 and renumbered! /Carol --> + +### Accessing Values in a Hash Map + +<!--- +For flow, would it make sense for this section to follow creating the hash map? +That way we introduce a useful concept and also continue the teams example. +/JT ---> +<!-- Ok, I've switched the order of "Accessing Values in a Hash Map" and "Hash +Maps and Ownership" and renumbered! Does this still make sense Liz? /Carol --> + +We can get a value out of the hash map by providing its key to the `get` +method, as shown in Listing 8-21. + +``` +use std::collections::HashMap; + +let mut scores = HashMap::new(); + +scores.insert(String::from("Blue"), 10); +scores.insert(String::from("Yellow"), 50); + +let team_name = String::from("Blue"); +let score = scores.get(&team_name).unwrap_or(0); +``` + +Listing 8-21: Accessing the score for the Blue team stored in the hash map + +Here, `score` will have the value that’s associated with the Blue team, and the +result will be `10`. The `get` method returns an `Option<&V>`; if there’s no +value for that key in the hash map, `get` will return `None`. This program +handles the `Option` by calling `unwrap_or` to set `score` to zero if `scores` +doesn’t have an entry for the key. + +<!--- +Should there be a quick example here to show handling Some/None again before +we move on to iteration? +/JT ---> +<!-- I've changed the code in Listing 8-21 a bit to actually handle the +`Option` instead of referencing chapter 6, what do you think Liz? /Carol --> + +We can iterate over each key/value pair in a hash map in a similar manner as we +do with vectors, using a `for` loop: + +``` +use std::collections::HashMap; + +let mut scores = HashMap::new(); + +scores.insert(String::from("Blue"), 10); +scores.insert(String::from("Yellow"), 50); + +for (key, value) in &scores { + println!("{}: {}", key, value); +} +``` + +This code will print each pair in an arbitrary order: + +``` +Yellow: 50 +Blue: 10 +``` + +### Hash Maps and Ownership + +For types that implement the `Copy` trait, like `i32`, the values are copied +into the hash map. For owned values like `String`, the values will be moved and +the hash map will be the owner of those values, as demonstrated in Listing 8-22. + +``` +use std::collections::HashMap; + +let field_name = String::from("Favorite color"); +let field_value = String::from("Blue"); + +let mut map = HashMap::new(); +map.insert(field_name, field_value); +// field_name and field_value are invalid at this point, try using them and +// see what compiler error you get! +``` + +Listing 8-22: Showing that keys and values are owned by the hash map once +they’re inserted + +We aren’t able to use the variables `field_name` and `field_value` after +they’ve been moved into the hash map with the call to `insert`. + +If we insert references to values into the hash map, the values won’t be moved +into the hash map. The values that the references point to must be valid for at +least as long as the hash map is valid. We’ll talk more about these issues in +the “Validating References with Lifetimes” section in Chapter 10. + +### Updating a Hash Map + +Although the number of key and value pairs is growable, each unique key can +only have one value associated with it at a time (but not vice versa: for +example, both the Blue team and the Yellow team could have value 10 stored in +the `scores` hash map). +<!--- And vice versa? /LC ---> +<!-- No, you could have a hashmap that has ("Blue", 10) and ("Yellow", 10) for +example. Stating this here feels a bit off topic for updating the value of an +existing key, though, I'm not sure how to work it in. Do you think that's +important enough to state here? If so, do you have suggestions on how to do it +without distracting from the main point of this section? /Carol --> +<!-- It may not be important enough, what do you think JT? /LC --> +<!--- +I think it's maybe worth calling out. Something you could use to drive +this home is the `.entry()` call. This makes it clear that for any key there's +one cell (or entry) that you're updating in the hash map. I see we use it +later, though worth a thought if bringing it earlier helps? +/JT ---> +<!-- I've added a short sentence here, but every time I try to add something +more, I end up getting tangled in saying things like "key value" as opposed to +"value value", which is terrible... or I worry about misleading readers into +thinking that you can't use a `Vec<T>` as a HashMap value type, which you +totally can to store multiple "values" in one vector "value", which you totally +can, it's just a little more complicated. Or I try to say "multiple keys can +have the same value" which sounds like it could imply that there would be a +*shared* value stored in the HashMap, which wouldn't be the case, there would +be two separate allocations that would happen to have the same value... I just +haven't heard a reader wondering if each value can only have one key with it +before (which doesn't mean they haven't wondered it, I just haven't heard of +it) so I don't want to lead readers astray if they weren't already going that +direction? What do you think about what's here now, Liz? /Carol --> + +When you want to change the data in a hash map, you have to decide how to +handle the case when a key already has a value assigned. You could replace the +old value with the new value, completely disregarding the old value. You could +keep the old value and ignore the new value, only adding the new value if the +key *doesn’t* already have a value. Or you could combine the old value and the +new value. Let’s look at how to do each of these! + +#### Overwriting a Value + +If we insert a key and a value into a hash map and then insert that same key +with a different value, the value associated with that key will be replaced. +Even though the code in Listing 8-23 calls `insert` twice, the hash map will +only contain one key/value pair because we’re inserting the value for the Blue +team’s key both times. + +``` +use std::collections::HashMap; + +let mut scores = HashMap::new(); + +scores.insert(String::from("Blue"), 10); +scores.insert(String::from("Blue"), 25); + +println!("{:?}", scores); +``` + +Listing 8-23: Replacing a value stored with a particular key + +This code will print `{"Blue": 25}`. The original value of `10` has been +overwritten. + +#### Adding a Key and Value Only If a Key Isn’t Present + +<!--- to be clear, are we talking about default values here, or just checking +for an existing value before allowing insertion of a value? /LC---> +<!-- I'm not sure what you mean exactly. Checking for an existing value before +allowing insertion of a value can be used to insert whatever value would mean +"default" in your program, or it can be used to insert some other value that +you wouldn't call a default. That is, in Listing 8-25, would you call 50 a +default value or no? (I don't think we've given enough information about what +the program is ultimately trying to do to tell if 50 is a default or not, and I +don't think it matters, but I am interested to know if there's something I'm +missing that you're trying to get at). Can you elaborate on what was confusing +and perhaps propose wording that would have cleared this up for you, and I can +fix if needed? /Carol--> +<!-- I suppose what I'm asking is whether a value is inserted from the started +as a default value and then updated, meaning the key never has no value, or +whether we're only allowing insertion of a value if there isn't already a +value. I think it's the latter and maybe that's clear enough as is! JT, what do +you think? /LC --> +<!--- +I think the idea is generally right, we're going to insert the value if the +key is not already in the hash map. Maybe the title could be: + +"Adding a key and value only if a key isn't present" + +Worth a note: I think "default" values are a bit of a loaded term in Rust. If +we use it, we may confuse people later if we they come across `Default`, which +is the default value of a type (like 0 is for i64, via `i64::default()`) +/JT ---> +<!-- Ok, I've taken JT's suggestion for the section title and tried to reword +the text here a bit; is this clearer, Liz? I share JT's concern about using the +word "default"... /Carol --> + +It’s common to check whether a particular key already exists in the hash map +with a value then take the following actions: if the key does exist in the hash +map, the existing value should remain the way it is. If the key doesn’t exist, +insert it and a value for it. + +Hash maps have a special API for this called `entry` that takes the key you +want to check as a parameter. The return value of the `entry` method is an enum +called `Entry` that represents a value that might or might not exist. Let’s say +we want to check whether the key for the Yellow team has a value associated +with it. If it doesn’t, we want to insert the value 50, and the same for the +Blue team. Using the `entry` API, the code looks like Listing 8-24. + +``` +use std::collections::HashMap; + +let mut scores = HashMap::new(); +scores.insert(String::from("Blue"), 10); + +scores.entry(String::from("Yellow")).or_insert(50); +scores.entry(String::from("Blue")).or_insert(50); + +println!("{:?}", scores); +``` + +Listing 8-24: Using the `entry` method to only insert if the key does not +already have a value + +The `or_insert` method on `Entry` is defined to return a mutable reference to +the value for the corresponding `Entry` key if that key exists, and if not, +inserts the parameter as the new value for this key and returns a mutable +reference to the new value. This technique is much cleaner than writing the +logic ourselves and, in addition, plays more nicely with the borrow checker. + +Running the code in Listing 8-24 will print `{"Yellow": 50, "Blue": 10}`. The +first call to `entry` will insert the key for the Yellow team with the value +50 because the Yellow team doesn’t have a value already. The second call to +`entry` will not change the hash map because the Blue team already has the +value 10. + +#### Updating a Value Based on the Old Value + +Another common use case for hash maps is to look up a key’s value and then +update it based on the old value. For instance, Listing 8-25 shows code that +counts how many times each word appears in some text. We use a hash map with +the words as keys and increment the value to keep track of how many times we’ve +seen that word. If it’s the first time we’ve seen a word, we’ll first insert +the value 0. + +``` +use std::collections::HashMap; + +let text = "hello world wonderful world"; + +let mut map = HashMap::new(); + +for word in text.split_whitespace() { + let count = map.entry(word).or_insert(0); + *count += 1; +} + +println!("{:?}", map); +``` + +Listing 8-25: Counting occurrences of words using a hash map that stores words +and counts + +This code will print `{"world": 2, "hello": 1, "wonderful": 1}`. You might see +the same key/value pairs printed in a different order: recall from the +“Accessing Values in a Hash Map” section that iterating over a hash map happens +in an arbitrary order. + +The `split_whitespace` method returns an iterator over sub-slices, separated by +whitespace, of the value in `text`. The `or_insert` method returns a mutable +reference (`&mut V`) to the value for the specified key. Here we store that +mutable reference in the `count` variable, so in order to assign to that value, +we must first dereference `count` using the asterisk (`*`). The mutable +reference goes out of scope at the end of the `for` loop, so all of these +changes are safe and allowed by the borrowing rules. + +<!--- +Running the above gave me `{"world": 2, "wonderful": 1, "hello": 1}` so the key +order may not be deterministic or may change based on changes to the hashing +function in the std lib. +/JT ---> +<!-- I've added a note that getting a different order is perfectly normal +/Carol --> + +### Hashing Functions + +By default, `HashMap` uses a hashing function called *SipHash* that can provide +resistance to Denial of Service (DoS) attacks involving hash tables. This is +not the fastest hashing algorithm available, but the trade-off for better +security that comes with the drop in performance is worth it. If you profile +your code and find that the default hash function is too slow for your +purposes, you can switch to another function by specifying a different hasher. +A *hasher* is a type that implements the `BuildHasher` trait. We’ll talk about +traits and how to implement them in Chapter 10. You don’t necessarily have to +implement your own hasher from scratch; *https://crates.io/* has libraries +shared by other Rust users that provide hashers implementing many common +hashing algorithms. + +## Summary + +Vectors, strings, and hash maps will provide a large amount of functionality +necessary in programs when you need to store, access, and modify data. Here are +some exercises you should now be equipped to solve: + +* Given a list of integers, use a vector and return the median (when sorted, + the value in the middle position) and mode (the value that occurs most often; + a hash map will be helpful here) of the list. +* Convert strings to pig latin. The first consonant of each word is moved to + the end of the word and “ay” is added, so “first” becomes “irst-fay.” Words + that start with a vowel have “hay” added to the end instead (“apple” becomes + “apple-hay”). Keep in mind the details about UTF-8 encoding! +* Using a hash map and vectors, create a text interface to allow a user to add + employee names to a department in a company. For example, “Add Sally to + Engineering” or “Add Amir to Sales.” Then let the user retrieve a list of all + people in a department or all people in the company by department, sorted + alphabetically. + +The standard library API documentation describes methods that vectors, strings, +and hash maps have that will be helpful for these exercises! + +We’re getting into more complex programs in which operations can fail, so, it’s +a perfect time to discuss error handling. We’ll do that next! + |