# Parallel Compilation

As of <!-- date-check --> August 2022, the only stage of the compiler that
is already parallel is codegen. Some other parts of the compiler have
parallel implementations as well, such as query evaluation, type checking,
and monomorphization, but the default build of the compiler does not enable
them. **To try out the current parallel compiler**, one can build rustc from
source with `parallel-compiler = true` set in `config.toml`.

The lack of parallelism at other stages (for example, macro expansion) also 
represents an opportunity for improving compiler performance.

These next few sections describe where and how parallelism is currently used,
and the current status of making parallel compilation the default in `rustc`.

## Codegen

During [monomorphization][monomorphization] the compiler splits up all the code to
be generated into smaller chunks called _codegen units_. These are then generated by
independent instances of LLVM running in parallel. At the end, the linker
is run to combine all the codegen units together into one binary. This process
occurs in the `rustc_codegen_ssa::base` module.
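
As a rough mental model, the process can be pictured with the sketch below.
This is illustrative only: `CodegenUnit` and `compile_codegen_units` are
invented names for this sketch, plain threads stand in for rustc's actual job
scheduling, and the real work is driven from the `rustc_codegen_ssa::base`
module mentioned above.

```rust
use std::thread;

// Stand-in for a real codegen unit produced during monomorphization.
struct CodegenUnit {
    name: String,
}

fn compile_codegen_units(units: Vec<CodegenUnit>) {
    // Each codegen unit is handed to an independent backend (LLVM) instance;
    // plain threads stand in here for rustc's actual job scheduling.
    let handles: Vec<_> = units
        .into_iter()
        .map(|cgu| {
            thread::spawn(move || {
                // A real backend invocation would produce an object file here.
                println!("compiled codegen unit `{}`", cgu.name);
            })
        })
        .collect();

    // Wait for all units to finish; afterwards the linker combines the
    // resulting object files into one binary.
    for handle in handles {
        handle.join().unwrap();
    }
}
```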

## Data Structures

The underlying thread-safe data structures used in the parallel compiler
can be found in the `rustc_data_structures::sync` module. These data structures
are implemented differently depending on whether `parallel-compiler` is true
(a sketch of this switching follows the table below).

| data structure                   | parallel                                            | non-parallel |
| -------------------------------- | --------------------------------------------------- | ------------ |
| Lrc                              | [std::sync::Arc][Arc]                               | [std::rc::Rc][Rc] |
| Weak                             | std::sync::Weak                                     | std::rc::Weak |
| Atomic{Bool}/{Usize}/{U32}/{U64} | std::sync::atomic::Atomic{Bool}/{Usize}/{U32}/{U64} | std::cell::Cell\<bool/usize/u32/u64> |
| OnceCell                         | std::sync::OnceLock                                 | std::cell::OnceCell |
| Lock\<T>                         | parking_lot::Mutex\<T>                              | std::cell::RefCell\<T> |
| RwLock\<T>                       | parking_lot::RwLock\<T>                             | std::cell::RefCell\<T> |
| MTRef<'a, T>                     | &'a T                                               | &'a mut T |
| MTLock\<T>                       | Lock\<T>                                            | T |
| ReadGuard                        | parking_lot::RwLockReadGuard                        | std::cell::Ref |
| MappedReadGuard                  | parking_lot::MappedRwLockReadGuard                  | std::cell::Ref |
| WriteGuard                       | parking_lot::RwLockWriteGuard                       | std::cell::RefMut |
| MappedWriteGuard                 | parking_lot::MappedRwLockWriteGuard                 | std::cell::RefMut |
| LockGuard                        | parking_lot::MutexGuard                             | std::cell::RefMut |
| MappedLockGuard                  | parking_lot::MappedMutexGuard                       | std::cell::RefMut |
| MetadataRef                      | [`OwningRef<Box<dyn Erased + Send + Sync>, [u8]>`][OwningRef] | [`OwningRef<Box<dyn Erased>, [u8]>`][OwningRef] |
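
As a rough illustration of what the table above means in practice, the
switching can be pictured as `cfg`-conditional definitions along the following
lines. This is a simplified sketch, not the actual `rustc_data_structures::sync`
source (which wraps several of these types to give them a unified API); the
`parallel_compiler` cfg name and the `parking_lot` dependency are assumed here.

```rust
// Simplified sketch: the real module defines wrapper types with extra methods,
// but the cfg-based switching follows this pattern.
#[cfg(parallel_compiler)]
pub use std::sync::Arc as Lrc;
#[cfg(not(parallel_compiler))]
pub use std::rc::Rc as Lrc;

#[cfg(parallel_compiler)]
pub type Lock<T> = parking_lot::Mutex<T>;
#[cfg(not(parallel_compiler))]
pub type Lock<T> = std::cell::RefCell<T>;

#[cfg(parallel_compiler)]
pub type MTRef<'a, T> = &'a T;
#[cfg(not(parallel_compiler))]
pub type MTRef<'a, T> = &'a mut T;
```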

- These thread-safe data structures, interspersed throughout compilation, can
  cause a lot of lock contention, which actually degrades performance as the
  number of threads increases beyond 4. This motivates auditing how these
  data structures are used, leading either to refactoring that reduces the
  use of shared state, or to persistent documentation covering invariants,
  atomicity, and lock orderings.

- On the other hand, we still need to figure out what other invariants
  during compilation might not hold under parallel compilation.

### WorkerLocal

`WorkerLocal` is a special data structure implemented for the parallel compiler.
It holds worker-local values for each thread in a thread pool. You can only
access a worker-local value through the `Deref` impl on the thread pool it
was constructed on; accessing it from outside that pool panics.

`WorkerLocal` is used to implement the `Arena` allocator in the parallel
environment, which is critical for parallel queries. Its implementation
is located in the `rustc-rayon-core::worker_local` module. In the
non-parallel compiler, it is implemented as `OneThread<T>`, whose inner `T`
can be accessed directly through `Deref::deref`.
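
To make the semantics concrete, here is a deliberately simplified toy model of
the idea, not rustc's `WorkerLocal`: each worker thread in a pool owns one
slot, so reads and writes to that slot need no synchronization. The real type
resolves the current worker's slot through `Deref` and panics when used from a
thread outside its pool.

```rust
use std::cell::Cell;

// Toy stand-in for WorkerLocal: one slot per worker thread, indexed by the
// worker's position in the pool.
struct ToyWorkerLocal<T> {
    slots: Vec<T>,
}

impl<T> ToyWorkerLocal<T> {
    fn new(workers: usize, init: impl FnMut(usize) -> T) -> Self {
        ToyWorkerLocal {
            slots: (0..workers).map(init).collect(),
        }
    }

    // The real WorkerLocal finds this index itself (via the rayon worker
    // thread it runs on) and exposes the slot through `Deref`.
    fn get(&self, worker_index: usize) -> &T {
        &self.slots[worker_index]
    }
}

fn main() {
    // Four workers, each starting with its own zeroed counter (an arena, in
    // rustc's case).
    let counters = ToyWorkerLocal::new(4, |_| Cell::new(0usize));
    // Worker 0 bumps its own counter without synchronizing with workers 1..3.
    counters.get(0).set(counters.get(0).get() + 1);
    assert_eq!(counters.get(0).get(), 1);
}
```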

## Parallel Iterator

The parallel iterators provided by the [`rayon`] crate are an easy way
to implement parallelism. The current implementation of the parallel
compiler uses a custom [fork][rustc-rayon] of [`rayon`] to run tasks in parallel.

Some iterator functions are implemented to run loops in parallel
when `parallel-compiler` is true; a sketch of the underlying pattern
follows the table below.

| Function (omitting `Send` and `Sync` bounds)                 | Description                                                  | Owning Module              |
| ------------------------------------------------------------ | ------------------------------------------------------------ | -------------------------- |
| **par_iter**<T: IntoParallelIterator>(t: T) -> T::Iter       | generate a parallel iterator                                 | rustc_data_structures::sync |
| **par_for_each_in**<T: IntoParallelIterator>(t: T, for_each: impl Fn(T::Item)) | generate a parallel iterator and run `for_each` on each element | rustc_data_structures::sync |
| **Map::par_body_owners**(self, f: impl Fn(LocalDefId))       | run `f` on all hir owners in the crate                       | rustc_middle::hir::map     |
| **Map::par_for_each_module**(self, f: impl Fn(LocalDefId))   | run `f` on all modules and sub modules in the crate          | rustc_middle::hir::map     |
| **ModuleItems::par_items**(&self, f: impl Fn(ItemId))        | run `f` on all items in the module                           | rustc_middle::hir          |
| **ModuleItems::par_trait_items**(&self, f: impl Fn(TraitItemId)) | run `f` on all trait items in the module                     | rustc_middle::hir          |
| **ModuleItems::par_impl_items**(&self, f: impl Fn(ImplItemId)) | run `f` on all impl items in the module                      | rustc_middle::hir          |
| **ModuleItems::par_foreign_items**(&self, f: impl Fn(ForeignItemId)) | run `f` on all foreign items in the module                   | rustc_middle::hir          |
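
The pattern behind these helpers can be pictured roughly as follows. This is a
hedged sketch, not the real `rustc_data_structures::sync` code: the signature
is simplified (slices instead of `IntoParallelIterator`), and the
`parallel_compiler` cfg check via `cfg!` is an assumption about how the switch
could look.

```rust
use rayon::prelude::*;

// Sketch: run `for_each` over `items`, in parallel when the (assumed)
// `parallel_compiler` cfg is set, sequentially otherwise.
fn par_for_each_in<T, F>(items: &[T], for_each: F)
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    if cfg!(parallel_compiler) {
        // Hand the loop to rayon's work-stealing thread pool.
        items.par_iter().for_each(|item| for_each(item));
    } else {
        // Ordinary sequential fallback running the same loop body.
        for item in items {
            for_each(item);
        }
    }
}

fn main() {
    let names = vec!["expansion", "type check", "codegen"];
    par_for_each_in(&names, |name| println!("visiting {name}"));
}
```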

There are a lot of loops in the compiler which can potentially be
parallelized using these functions. As of <!-- date-check--> August
2022, the scenarios where parallel iterator functions are already used
are as follows:

| caller                                                  | scenario                                                     | callee                   |
| ------------------------------------------------------- | ------------------------------------------------------------ | ------------------------ |
| rustc_metadata::rmeta::encoder::prefetch_mir            | Prefetch queries which will be needed later by metadata encoding | par_iter                 |
| rustc_monomorphize::collector::collect_crate_mono_items | Collect monomorphized items reachable from non-generic items | par_for_each_in          |
| rustc_interface::passes::analysis                       | Check the validity of match statements                       | Map::par_body_owners     |
| rustc_interface::passes::analysis                       | MIR borrow check                                             | Map::par_body_owners     |
| rustc_typeck::check::typeck_item_bodies                 | Type check                                                   | Map::par_body_owners     |
| rustc_interface::passes::hir_id_validator::check_crate  | Check the validity of hir                                    | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Check the validity of loop bodies, attributes, naked functions, unstable ABIs, const bodies | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Liveness and intrinsic checking of MIR                       | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Deathness checking                                           | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Privacy checking                                             | Map::par_for_each_module |
| rustc_lint::late::check_crate                           | Run per-module lints                                         | Map::par_for_each_module |
| rustc_typeck::check_crate                               | Well-formedness checking                                         | Map::par_for_each_module |

There are still many loops that have the potential to use parallel iterators.

## Query System

The query model has some properties that make it actually feasible to evaluate
multiple queries in parallel without too much of an effort:

- All data a query provider can access is accessed via the query context, so
  the query context can take care of synchronizing access.
- Query results are required to be immutable so they can safely be used by
  different threads concurrently.

When a query `foo` is evaluated, the cache table for `foo` is locked
(a schematic code sketch of this protocol follows the list).

- If there already is a result, we can clone it, release the lock and
  we are done.
- If there is no cache entry and no other active query invocation computing the
  same result, we mark the key as being "in progress", release the lock and
  start evaluating.
- If there *is* another query invocation for the same key in progress, we
  release the lock, and just block the thread until the other invocation has
  computed the result we are waiting for. **Cycle error detection** in the parallel
  compiler requires more complex logic than in single-threaded mode: when
  worker threads in parallel queries stop making progress because they are
  waiting on each other's results, the compiler uses an extra thread (the
  *deadlock handler*) to detect, remove, and report the cycle error.
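
Schematically, the protocol described above looks like the sketch below. This
is not rustc's actual query engine: the names (`QueryCache`, `QueryEntry`,
`get_or_compute`) are invented for this sketch, plain `std::sync` primitives
stand in for the compiler's own synchronization types, and the deadlock
handler's cycle detection is omitted entirely.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Condvar, Mutex};

// Handle that waiters block on until the computing thread publishes a result.
struct InProgress<V> {
    done: Mutex<Option<V>>,
    cond: Condvar,
}

enum QueryEntry<V> {
    Started(Arc<InProgress<V>>),
    Finished(V),
}

struct QueryCache<K, V> {
    table: Mutex<HashMap<K, QueryEntry<V>>>,
}

impl<K: Eq + Hash + Clone, V: Clone> QueryCache<K, V> {
    fn get_or_compute(&self, key: K, compute: impl FnOnce() -> V) -> V {
        let waiter = {
            let mut table = self.table.lock().unwrap();
            match table.get(&key) {
                // Cached result: clone it, release the lock, done.
                Some(QueryEntry::Finished(v)) => return v.clone(),
                // Another invocation is computing this key: wait on its handle.
                Some(QueryEntry::Started(ip)) => Some(ip.clone()),
                // Nobody started it yet: mark the key as "in progress" and
                // compute it ourselves once the lock is released.
                None => {
                    let ip = Arc::new(InProgress {
                        done: Mutex::new(None),
                        cond: Condvar::new(),
                    });
                    table.insert(key.clone(), QueryEntry::Started(ip));
                    None
                }
            }
        }; // table lock released here

        if let Some(ip) = waiter {
            // Block until the other invocation publishes the result.
            // (This is where the deadlock handler would step in on a cycle.)
            let mut done = ip.done.lock().unwrap();
            while done.is_none() {
                done = ip.cond.wait(done).unwrap();
            }
            return done.as_ref().unwrap().clone();
        }

        // We are the computing thread: run the provider without holding the
        // table lock, then publish the result and wake any blocked waiters.
        let value = compute();
        let previous = {
            let mut table = self.table.lock().unwrap();
            table.insert(key, QueryEntry::Finished(value.clone()))
        };
        if let Some(QueryEntry::Started(ip)) = previous {
            *ip.done.lock().unwrap() = Some(value.clone());
            ip.cond.notify_all();
        }
        value
    }
}
```

Note that the table lock is only held while inspecting or updating the entry,
never while the provider itself runs; that is what allows independent queries
to be evaluated in parallel.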

The parallel query system still has a lot of work left to do, most of which
relates to the `Data Structures` and `Parallel Iterators` described above.
See [this tracking issue][tracking].

## Rustdoc

As of <!-- date-check--> May 2022, there are still a number of steps
to complete before rustdoc rendering can be made parallel. More details on
this issue can be found [here][parallel-rustdoc].

## Resources

Here are some resources that can be used to learn more (note that some of them
are a bit out of date):

- [This IRLO thread by Zoxc, one of the pioneers of the effort][irlo0]
- [This list of interior mutability in the compiler by nikomatsakis][imlist]
- [This IRLO thread by alexcrichton about performance][irlo1]

[`rayon`]: https://crates.io/crates/rayon
[rustc-rayon]: https://github.com/rust-lang/rustc-rayon
[irlo0]: https://internals.rust-lang.org/t/parallelizing-rustc-using-rayon/6606
[imlist]: https://github.com/nikomatsakis/rustc-parallelization/blob/master/interior-mutability-list.md
[irlo1]: https://internals.rust-lang.org/t/help-test-parallel-rustc/11503
[tracking]: https://github.com/rust-lang/rust/issues/48685
[monomorphization]: backend/monomorph.md
[parallel-rustdoc]: https://github.com/rust-lang/rust/issues/82741
[Arc]: https://doc.rust-lang.org/std/sync/struct.Arc.html
[Rc]: https://doc.rust-lang.org/std/rc/struct.Rc.html
[OwningRef]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_data_structures/owning_ref/index.html