# imara-diff

[![crates.io](https://img.shields.io/crates/v/imara-diff?style=flat-square)](https://crates.io/crates/imara-diff)
[![crates.io](https://img.shields.io/docsrs/imara-diff?style=flat-square)](https://docs.rs/imara-diff/latest/imara_diff/)
![crates.io](https://img.shields.io/crates/l/imara-diff?style=flat-square)

Imara-diff is a solid (imara in swahili) diff library for rust.
Solid refers to the fact that imara-diff provides very good runtime performance even
in pathologic cases so that your application never appears to freeze while waiting on a diff.
The performance improvements are achieved using battle tested heuristics used in gnu-diff and git
that are known to perform well while still providing good results. 

Imara-diff is also designed to be flexible so that it can be used with arbitrary collections and
not just lists and strings and even allows reusing large parts of the computation when
comparing the same file to multiple different files.

Imara-diff provides two diff algorithms:

* The linear-space variant of the well known [Myers algorithm](http://www.xmailserver.org/diff2.pdf) 
* The **Histogram** algorithm which variant of the patience diff algorithm.

Myers algorithm has been enhanced with preprocessing and multiple heuristics to ensure fast runtime in pathological 
cases to avoid quadratic time complexity and closely matches the behavior of gnu-diff and git.
The histogram algorithm was originally ported from git but has been heavily optimized.
The **Histogram algorithm outperforms Myers algorithm** by 10% - 100% across a **wide variety of workloads**.

## Limitations

Even with the optimizations in this crate, performing a large diff without any tokenization (like character diff for a string) does not perform well.
To work around this problem a diff of the entire file with large tokens (like lines for a string) can be performed first.
The `Sink` implementation can then perform fine-grained diff on changed regions.
Note that this fine-grained diff should not be performed for pure insertions, pure deletions and very large changes.

In an effort to improve performance, `imara-diff` makes heavy use of pointer compression.
That means that it can only support files with at most `2^31 - 2` tokens.
This should be rarely an issue in practice for textual diffs, because most (large) real-world files 
have an average line-length of at least 8.
That means that this limitation only becomes a problem for files above 16GB while performing line-diffs.

## Benchmarks

The most used diffing libraries in the rust ecosystem are [similar](https://crates.io/crates/similar) and [dissimilar](https://crates.io/crates/similar).
The fastest diff implementation both of these offer is a simple implementation of Myers algorithm
without preprocessing or additional heuristics.
As these implementations are very similar only `similar` was included in the benchmark.

To provide a benchmark to reflects real-world workloads, the git history of different open source projects were used.
For each repo two (fairly different) tags were chosen.
A tree diff is performed with [gitoxide](https://github.com/Byron/gitoxide) and the pairs of files that should be saved are stored in memory.
The diffs collected using this method are often fairly large, because the repositories are compared over a large span of time.
Therefore, the tree diff of the last 30 commit before the tag (equivalent of `git diff TAG^ TAG`, `git diff TAG^^ TAG^^`) were also used to also include smaller diffs.

The benchmark measures the runtime of performing a **line diff** between the collected files.
As a measure of complexity for each change `(M + N) D` was used where `M` and `N` are the lengths of the two compared files
and `D` is the length of the edit script required to transform these files into each other (determined with Myers algorithm).
This complexity measure is used to divide the changes into 10 badges.
The time to compute the line diffs in each badge was benchmarked.

The plots below show the runtime for each **average** complexity (runtime is normalized by the number of diffs).
Note that these plots are shown in logarithmic scale due to the large runtime of `similar` for complex diffs.
Furthermore, to better highlight the performance of the Histogram algorithm, the speedup of the Histogram algorithm
compared to the Myers algorithm is shown separately.

* [Linux](#Linux)
* [Rust](#Rust)
* [VSCode](#VSCode)
* [Helix](#Helix)

# Linux

The sourcecode of the linux kernel.

- **Repo** - https://kernel.org
- **Tags** - `v5.7` and `v6.0`

<img src='plots/linux_comparison.svg' width="700">
<img src='plots/linux_speedup.svg' width="700">

# Rust

The sourcecode of the rust compiler, standard library and various related tooling.

- **Repo** - https://github.com/rust-lang/rust
- **Tags** - `1.50.0` and `1.64.0`

<img src='plots/rust_comparison.svg' width="700">
<img src='plots/rust_speedup.svg' width="700">

# VScode

The sourcecode of the vscode editor.

- **Repo** - https://github.com/microsoft/vscode
- **Tags** - `1.41.0` and `1.72.2`

<img src='plots/vscode_comparison.svg' width="700">
<img src='plots/vscode_speedup.svg' width="700">

# Helix

The sourcecode of the helix editor.

- **Repo** - https://github.com/helix-editor/helix
- **Tags** - `v0.5.0` and `22.08.1`

<img src='plots/helix_comparison.svg' width="700">
<img src='plots/helix_speedup.svg' width="700">