Diffstat (limited to 'doc/userguide/performance')
-rw-r--r--  doc/userguide/performance/analysis.rst                          186
-rw-r--r--  doc/userguide/performance/analysis/htopelephantflow.png         bin 0 -> 35627 bytes
-rw-r--r--  doc/userguide/performance/analysis/perftop.png                  bin 0 -> 36521 bytes
-rw-r--r--  doc/userguide/performance/high-performance-config.rst           382
-rw-r--r--  doc/userguide/performance/hyperscan.rst                         84
-rw-r--r--  doc/userguide/performance/ignoring-traffic.rst                  106
-rw-r--r--  doc/userguide/performance/index.rst                             16
-rw-r--r--  doc/userguide/performance/packet-capture.rst                    77
-rw-r--r--  doc/userguide/performance/packet-profiling.rst                  58
-rw-r--r--  doc/userguide/performance/rule-profiling.rst                    33
-rw-r--r--  doc/userguide/performance/runmodes.rst                          66
-rw-r--r--  doc/userguide/performance/runmodes/Runmode_autofp.png           bin 0 -> 51070 bytes
-rw-r--r--  doc/userguide/performance/runmodes/autofp1.png                  bin 0 -> 42331 bytes
-rw-r--r--  doc/userguide/performance/runmodes/autofp2.png                  bin 0 -> 50616 bytes
-rw-r--r--  doc/userguide/performance/runmodes/single.png                   bin 0 -> 23671 bytes
-rw-r--r--  doc/userguide/performance/runmodes/threading1.png               bin 0 -> 17080 bytes
-rw-r--r--  doc/userguide/performance/runmodes/workers.png                  bin 0 -> 30595 bytes
-rw-r--r--  doc/userguide/performance/statistics.rst                        161
-rw-r--r--  doc/userguide/performance/tcmalloc.rst                          39
-rw-r--r--  doc/userguide/performance/tuning-considerations.rst             133
20 files changed, 1341 insertions, 0 deletions
diff --git a/doc/userguide/performance/analysis.rst b/doc/userguide/performance/analysis.rst
new file mode 100644
index 0000000..cfaf636
--- /dev/null
+++ b/doc/userguide/performance/analysis.rst
@@ -0,0 +1,186 @@
+Performance Analysis
+====================
+
+There are many potential causes for performance issues. In this section we
+will guide you through some options. The first part will cover basic steps and
+introduce some helpful tools. The second part will cover more in-depth
+explanations and corner cases.
+
+System Load
+-----------
+
+The first step should be to check the system load. Run a top tool like **htop**
+to get an overview of the system load and to see whether there is a bottleneck
+in the traffic distribution. For example, if only a small number of CPU cores
+hit 100% all the time while others don't, it could be related to bad traffic
+distribution or to elephant flows, as in the screenshot below where one core
+peaks due to a single big elephant flow.
+
+.. image:: analysis/htopelephantflow.png
+
+If all cores are at peak load the system might be too slow for the traffic load
+or it might be misconfigured. Also keep an eye on memory usage: if the actual
+memory usage is too high and the system needs to swap, it will result in very
+poor performance.
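+
+A quick way to check memory and swap usage alongside **htop** is, for example:
+
+::
+
+ free -h
+ vmstat 1 5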
+
+The load gives you a first indication of where to start debugging; the specific
+parts are described in more detail in the second part of this section.
+
+Logfiles
+--------
+
+The next step is to check all the log files, with a focus on **stats.log**
+and **suricata.log**, for any obvious issues. The most obvious indicator
+is the **capture.kernel_drops** value, which ideally would not show up at all
+and should stay below 1% of the **capture.kernel_packets** value, as high drop
+rates lead to a reduced number of events and alerts.
+
+If **memcap** counters show up in the stats, the corresponding memcap values in
+the configuration could be increased. This can result in higher memory usage
+and should be taken into account when the settings are changed.
+
+Don't forget to check the system logs as well; even a quick **dmesg** run can
+reveal potential issues.
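+
+For example, assuming the default log directory ``/var/log/suricata``, the drop
+and memcap counters can be checked quickly like this:
+
+::
+
+ grep -E "capture.kernel_(packets|drops)" /var/log/suricata/stats.log | tail
+ grep -i "memcap" /var/log/suricata/stats.log | tail
+ sudo dmesg | tail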
+
+Suricata Load
+-------------
+
+Besides the system load, another indicator for potential performance issues is
+the load of Suricata itself. A helpful tool for that is **perf**. Make sure it
+is installed along with the debug symbols for Suricata, or the output won't be
+very helpful. This output is also helpful when you report performance issues,
+as the Suricata development team can use it to narrow down possible causes.
+
+::
+
+ sudo perf top -p $(pidof suricata)
+
+If you see specific function calls at the top in red it's a hint that those are
+the bottlenecks. For example, if you see **IPOnlyMatchPacket** it can be either
+a result of high drop rates or of incomplete flows, both of which decrease
+performance. To look into the performance issues on a specific thread you can
+pass **-t TID** to perf top. In other cases you might see functions that hint
+that a specific protocol parser is used a lot; you can then either try to debug
+a performance bug or try to filter the related traffic.
+
+.. image:: analysis/perftop.png
+
+In general, try to experiment with the different configuration options that
+Suricata provides, with a focus on the options described in
+:doc:`high-performance-config`.
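+
+For example, to profile a single worker thread, first look up its thread ID and
+then pass it to perf top (thread names depend on your configuration):
+
+::
+
+ ps -T -p $(pidof suricata) -o tid,comm
+ sudo perf top -t $TID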
+
+Traffic
+-------
+
+In most cases where the hardware is fast enough to handle the traffic but the
+drop rate is still high, the cause is a specific traffic issue.
+
+Basics
+^^^^^^
+
+Some of the basic checks are:
+
+- Check if the traffic is bidirectional; if it's mostly unidirectional you're
+  missing relevant parts of the flow (see the **tshark** example at the bottom).
+  Another indicator could be a big discrepancy between the SYN, SYN-ACK and
+  RST counters in the Suricata stats.
+
+- Check for encapsulated traffic; while GRE, MPLS etc. are supported they could
+  also lead to performance issues, especially if there are several layers of
+  encapsulation.
+
+- Use tools like **iftop** to spot elephant flows. Flows with a rate of over
+  1Gbit/s for a long time can cause one CPU core to peak at 100% all the time
+  and increase the drop rate, while it might not even make sense to dig deep
+  into this traffic.
+
+- Another approach to narrow down issues is the usage of a **bpf filter**. For
+  example, filter out all HTTPS traffic with **not port 443** to exclude traffic
+  that might be problematic, or look at only one specific port with **port 25**
+  if you expect issues with a specific protocol. See :doc:`ignoring-traffic`
+  for more details and the example after this list.
+
+- If VLAN is used it might help to disable **vlan.use-for-tracking** in
+ scenarios where only one direction of the flow has the VLAN tag.
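+
+As an example of the checks above, elephant flows can be spotted with **iftop**
+and a quick test run with a bpf filter can be done directly on the command line
+(``$INTERFACE`` is a placeholder for your capture interface):
+
+::
+
+ sudo iftop -i $INTERFACE -nNP
+ suricata -i $INTERFACE -c suricata.yaml -v "not port 443"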
+
+Advanced
+^^^^^^^^
+
+There are several advanced steps and corner cases when it comes to a deep dive
+into the traffic.
+
+If VLAN QinQ (IEEE 802.1ad) is used, be very cautious if you use **cluster_qm**
+in combination with Intel drivers and the AF_PACKET runmode. While the standard
+expects ethertypes 0x8100 and 0x88A8 in this case (see
+https://en.wikipedia.org/wiki/IEEE_802.1ad), most implementations only add
+0x8100 on each layer. If the outer layer always has the same VLAN tag but the
+inner ones have different VLAN tags, the traffic will still end up in the same
+queue in **cluster_qm** mode. This was observed with the i40e driver up to
+2.8.20 and firmware version up to 7.00; feel free to report if newer versions
+have fixed this (see https://suricata.io/support/).
+
+
+If you want to use **tshark** to get an overview of the traffic direction use
+this command:
+
+::
+
+ sudo tshark -i $INTERFACE -q -z conv,ip -a duration:10
+
+The output will show all flows seen within 10 seconds; if you see 0 for one
+direction you have unidirectional traffic and are, for example, not seeing the
+ACK packets. Since Suricata works on flows, this has a rather big impact on
+visibility. Focus on fixing the unidirectional traffic; if that is not possible
+at all you can enable **async-oneside** in the **stream** configuration section.
+
+Check for other unusual or complex protocols that aren't supported very well.
+You can try to filter those to see if it has any impact on the performance. In
+this example we filter Cisco Fabric Path (ethertype 0x8903) with the bpf filter
+**not ether proto 0x8903**, as it's assumed to cause a performance issue (see
+https://redmine.openinfosecfoundation.org/issues/3637).
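+
+Such a filter can be passed to Suricata on the command line, for example:
+
+::
+
+ suricata -i $INTERFACE -c suricata.yaml -v "not ether proto 0x8903"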
+
+Elephant Flows
+^^^^^^^^^^^^^^
+
+So called elephant flows or traffic spikes are quite difficult to deal with. In
+most cases those are big file transfers or backup traffic and it's not feasible
+to decode the whole traffic. From a network security monitoring perspective it's
+often enough to log the metadata of such a flow and inspect packets only at the
+beginning of the flow rather than for its whole duration.
+
+If you can spot specific flows as described above, try to filter those. The
+easiest solution would be a bpf filter, but that would still come with a
+performance impact. Ideally you can filter such traffic even earlier, on the
+driver or NIC level (see eBPF/XDP), or even before it reaches the system where
+Suricata is running. Some commercial packet brokers support such filtering,
+where it's called **Flow Shunting** or **Flow Slicing**.
+
+Rules
+-----
+
+The ruleset plays an important role in detection but also in the performance
+of Suricata. Thus it's recommended to look into the impact of the enabled rules
+as well.
+
+If you run into performance issues and struggle to narrow them down, start by
+running Suricata without any rules enabled and use the tools explained in the
+first part again. Keep in mind that even without signatures enabled Suricata
+still does most of the decoding and traffic analysis, so a fair amount of load
+should still be seen. If the load is still very high, drops are seen and the
+hardware should be capable of dealing with such traffic loads, dig deeper to
+find out whether there is a specific traffic issue (see above) or report the
+performance issue so it can be investigated (see
+https://suricata.io/join-our-community/).
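+
+One simple way to run Suricata without any rules is to exclusively load an
+empty rule file, for example:
+
+::
+
+ touch /tmp/empty.rules
+ suricata -c suricata.yaml -i $INTERFACE -S /tmp/empty.rules -v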
+
+Suricata also provides several traffic-related signatures in the rules folder
+that could be enabled for testing to spot specific traffic issues. Those are
+found in the **rules** folder; start with **decoder-events.rules**,
+**stream-events.rules** and **app-layer-events.rules**.
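+
+A minimal sketch of enabling those signature files in suricata.yaml (paths and
+existing entries depend on your setup):
+
+::
+
+ rule-files:
+   - suricata.rules
+   - decoder-events.rules
+   - stream-events.rules
+   - app-layer-events.rules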
+
+It can also be helpful to use :doc:`rule-profiling` and/or
+:doc:`packet-profiling` to find problematic rules or traffic patterns. This
+requires compiling Suricata with **--enable-profiling**, but keep in mind
+that this has an impact on performance and should only be used for
+troubleshooting.
diff --git a/doc/userguide/performance/analysis/htopelephantflow.png b/doc/userguide/performance/analysis/htopelephantflow.png
new file mode 100644
index 0000000..0976644
--- /dev/null
+++ b/doc/userguide/performance/analysis/htopelephantflow.png
Binary files differ
diff --git a/doc/userguide/performance/analysis/perftop.png b/doc/userguide/performance/analysis/perftop.png
new file mode 100644
index 0000000..303fc7a
--- /dev/null
+++ b/doc/userguide/performance/analysis/perftop.png
Binary files differ
diff --git a/doc/userguide/performance/high-performance-config.rst b/doc/userguide/performance/high-performance-config.rst
new file mode 100644
index 0000000..7d54f7b
--- /dev/null
+++ b/doc/userguide/performance/high-performance-config.rst
@@ -0,0 +1,382 @@
+High Performance Configuration
+==============================
+
+NIC
+---
+
+One of the major dependencies for Suricata's performance is the Network
+Interface Card. There are many vendors and possibilities. Some NICs have and
+require their own specific instructions and tools for setting up the NIC,
+which ensures the greatest benefit when running Suricata. Vendors like
+Napatech, Netronome, Accolade and Myricom include those tools and documentation
+as part of their sources.
+
+For Intel, Mellanox and commodity NICs the suggestions below can be used.
+
+It is recommended that the latest available stable NIC drivers are used. In
+general, when changing NIC settings it is advisable to use the latest
+``ethtool`` version. Some NICs ship with their own ``ethtool``, which is then
+the recommended one to use. Here is an example of how to build and install
+``ethtool`` if needed:
+
+::
+
+ wget https://mirrors.edge.kernel.org/pub/software/network/ethtool/ethtool-5.2.tar.xz
+ tar -xf ethtool-5.2.tar.xz
+ cd ethtool-5.2
+ ./configure && make clean && make && make install
+ /usr/local/sbin/ethtool --version
+
+When doing high performance optimisation make sure ``irqbalance`` is off and
+not running:
+
+::
+
+ service irqbalance stop
+
+Depending on the NIC's available queues (for example Intel's x710/i40 has 64
+available per port/interface) the worker threads can be set up accordingly.
+Usually the available queues can be seen by running:
+
+::
+
+ /usr/local/sbin/ethtool -l eth1
+
+Some NICs - generally lower end 1Gbps - do not support symmetric hashing (see
+:doc:`packet-capture`). On those systems, due to considerations for out of
+order packets, the following setup with af-packet is suggested (the example
+below uses ``eth1``):
+
+::
+
+ /usr/local/sbin/ethtool -L eth1 combined 1
+
+then set up af-packet with the desired number of worker threads ``threads: auto``
+(auto will by default use the number of CPUs available) and
+``cluster-type: cluster_flow`` (also the default setting).
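+
+A minimal af-packet sketch for such a single-queue setup could look like this
+(assuming ``eth1``):
+
+::
+
+ af-packet:
+   - interface: eth1
+     threads: auto
+     cluster-id: 99
+     cluster-type: cluster_flow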
+
+For higher end systems/NICs, a better and more performant solution can be to
+utilize the NIC itself a bit more. x710/i40 and similar Intel NICs or the
+Mellanox MT27800 Family [ConnectX-5] for example can easily be set up to do a
+bigger chunk of the work, using more RSS queues and symmetric hashing in order
+to allow for increased performance on the Suricata side by using af-packet
+with ``cluster-type: cluster_qm`` mode. In that mode all packets linked by the
+network card to an RSS queue are sent to the same socket. Below is an example
+of a suggested configuration for a 16 core, one CPU/NUMA node socket system
+using an x710:
+
+::
+
+ rmmod i40e && modprobe i40e
+ ifconfig eth1 down
+ /usr/local/sbin/ethtool -L eth1 combined 16
+ /usr/local/sbin/ethtool -K eth1 rxhash on
+ /usr/local/sbin/ethtool -K eth1 ntuple on
+ ifconfig eth1 up
+ /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
+ /usr/local/sbin/ethtool -A eth1 rx off
+ /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
+ /usr/local/sbin/ethtool -G eth1 rx 1024
+
+The commands above can be reviewed in detail in the help or manpages of
+``ethtool``. In brief, the sequence makes sure the NIC is reset, the number of
+RSS queues is set to 16, load balancing is enabled for the NIC, a low entropy
+Toeplitz key is inserted to allow for symmetric hashing, receive offloading is
+disabled, adaptive interrupt coalescing is disabled for the lowest possible
+latency and, last but not least, the rx ring descriptor size is set to 1024.
+Make sure the RSS hash function is Toeplitz:
+
+::
+
+ /usr/local/sbin/ethtool -X eth1 hfunc toeplitz
+
+Let the NIC balance as much as possible:
+
+::
+
+ for proto in tcp4 udp4 tcp6 udp6; do
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+ done
+
+In some cases:
+
+::
+
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sd
+
+might be enough or even better, depending on the type of traffic. However, not
+all NICs allow it. The ``sd`` specifies that the multi queue hashing algorithm
+of the NIC (for the particular protocol) uses src IP and dst IP only. The
+``sdfn`` allows the tuple src IP, dst IP, src port, dst port to be used for the
+hashing algorithm.
+In the af-packet section of suricata.yaml:
+
+::
+
+ af-packet:
+ - interface: eth1
+ threads: 16
+ cluster-id: 99
+ cluster-type: cluster_qm
+ ...
+ ...
+
+CPU affinity and NUMA
+---------------------
+
+Intel based systems
+~~~~~~~~~~~~~~~~~~~
+
+If the system has more than one NUMA node there are some more possibilities.
+In those cases it is generally recommended to use as many worker threads as
+CPU cores are available/possible - from the same NUMA node. The example below
+uses a 72 core machine where the sniffing NIC that Suricata uses resides on
+NUMA node 1. In such 2 socket configurations it is recommended to have Suricata
+and the sniffing NIC running and residing on the second NUMA node, as by
+default CPU 0 is widely used by many services in Linux. In a case where this is
+not possible it is recommended that (via the CPU affinity config section in
+suricata.yaml and the irq affinity script for the NIC) CPU 0 is never used.
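+
+To verify on which NUMA node the sniffing NIC resides (assuming ``eth1``), the
+sysfs entry of the device can be checked:
+
+::
+
+ cat /sys/class/net/eth1/device/numa_node
+ lscpu | grep NUMA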
+
+In the case below 36 worker threads are used out of NUMA node 1's CPUs, with
+the af-packet runmode and ``cluster-type: cluster_qm``.
+
+If the CPU's NUMA set up is as follows:
+
+::
+
+ lscpu
+ Architecture: x86_64
+ CPU op-mode(s): 32-bit, 64-bit
+ Byte Order: Little Endian
+ CPU(s): 72
+ On-line CPU(s) list: 0-71
+ Thread(s) per core: 2
+ Core(s) per socket: 18
+ Socket(s): 2
+ NUMA node(s): 2
+ Vendor ID: GenuineIntel
+ CPU family: 6
+ Model: 79
+ Model name: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
+ Stepping: 1
+ CPU MHz: 1199.724
+ CPU max MHz: 3600.0000
+ CPU min MHz: 1200.0000
+ BogoMIPS: 4589.92
+ Virtualization: VT-x
+ L1d cache: 32K
+ L1i cache: 32K
+ L2 cache: 256K
+ L3 cache: 46080K
+ NUMA node0 CPU(s): 0-17,36-53
+ NUMA node1 CPU(s): 18-35,54-71
+
+It is recommended that 36 worker threads are used and the NIC set up could be
+as follows:
+
+::
+
+ rmmod i40e && modprobe i40e
+ ifconfig eth1 down
+ /usr/local/sbin/ethtool -L eth1 combined 36
+ /usr/local/sbin/ethtool -K eth1 rxhash on
+ /usr/local/sbin/ethtool -K eth1 ntuple on
+ ifconfig eth1 up
+ ./set_irq_affinity local eth1
+ /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 36
+ /usr/local/sbin/ethtool -A eth1 rx off tx off
+ /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
+ /usr/local/sbin/ethtool -G eth1 rx 1024
+ for proto in tcp4 udp4 tcp6 udp6; do
+ echo "/usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn"
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+ done
+
+In the example above the ``set_irq_affinity`` script is used from the NIC
+driver's sources.
+In the cpu affinity section of suricata.yaml config:
+
+::
+
+ # Suricata is multi-threaded. Here the threading can be influenced.
+ threading:
+ cpu-affinity:
+ - management-cpu-set:
+ cpu: [ "1-10" ] # include only these CPUs in affinity settings
+ - receive-cpu-set:
+ cpu: [ "0-10" ] # include only these CPUs in affinity settings
+ - worker-cpu-set:
+ cpu: [ "18-35", "54-71" ]
+ mode: "exclusive"
+ prio:
+ low: [ 0 ]
+ medium: [ "1" ]
+ high: [ "18-35","54-71" ]
+ default: "high"
+
+In the af-packet section of suricata.yaml config :
+
+::
+
+ - interface: eth1
+ # Number of receive threads. "auto" uses the number of cores
+ threads: 18
+ cluster-id: 99
+ cluster-type: cluster_qm
+ defrag: no
+ use-mmap: yes
+ mmap-locked: yes
+ tpacket-v3: yes
+ ring-size: 100000
+ block-size: 1048576
+ - interface: eth1
+ # Number of receive threads. "auto" uses the number of cores
+ threads: 18
+ cluster-id: 99
+ cluster-type: cluster_qm
+ defrag: no
+ use-mmap: yes
+ mmap-locked: yes
+ tpacket-v3: yes
+ ring-size: 100000
+ block-size: 1048576
+
+That way 36 worker threads can be mapped (18 per af-packet interface slot)
+in total onto NUMA node 1's CPU range - 18-35,54-71. That part is done via the
+``worker-cpu-set`` affinity settings. ``ring-size`` and ``block-size`` in the
+config section above are decent default values to start with. Those can be
+adjusted further if needed, as explained in :doc:`tuning-considerations`.
+
+AMD based systems
+~~~~~~~~~~~~~~~~~
+
+Another example is an AMD based system where the architecture and design of
+the system itself plus the NUMA node interaction are different, as they are
+based on the HyperTransport (HT) technology. In that case per-NUMA-node thread
+pinning/locking would not be needed. The example below shows a suggestion for
+such a configuration utilising af-packet with ``cluster-type: cluster_flow``.
+The Mellanox NIC is located on NUMA node 0.
+
+The CPU set up is as follows:
+
+::
+
+ Architecture: x86_64
+ CPU op-mode(s): 32-bit, 64-bit
+ Byte Order: Little Endian
+ CPU(s): 128
+ On-line CPU(s) list: 0-127
+ Thread(s) per core: 2
+ Core(s) per socket: 32
+ Socket(s): 2
+ NUMA node(s): 8
+ Vendor ID: AuthenticAMD
+ CPU family: 23
+ Model: 1
+ Model name: AMD EPYC 7601 32-Core Processor
+ Stepping: 2
+ CPU MHz: 1200.000
+ CPU max MHz: 2200.0000
+ CPU min MHz: 1200.0000
+ BogoMIPS: 4391.55
+ Virtualization: AMD-V
+ L1d cache: 32K
+ L1i cache: 64K
+ L2 cache: 512K
+ L3 cache: 8192K
+ NUMA node0 CPU(s): 0-7,64-71
+ NUMA node1 CPU(s): 8-15,72-79
+ NUMA node2 CPU(s): 16-23,80-87
+ NUMA node3 CPU(s): 24-31,88-95
+ NUMA node4 CPU(s): 32-39,96-103
+ NUMA node5 CPU(s): 40-47,104-111
+ NUMA node6 CPU(s): 48-55,112-119
+ NUMA node7 CPU(s): 56-63,120-127
+
+The ``ethtool``, ``show_irq_affinity.sh`` and ``set_irq_affinity_cpulist.sh``
+tools are provided with the official driver sources.
+Set up the NIC, including offloading and load balancing:
+
+::
+
+ ifconfig eno6 down
+ /opt/mellanox/ethtool/sbin/ethtool -L eno6 combined 15
+ /opt/mellanox/ethtool/sbin/ethtool -K eno6 rxhash on
+ /opt/mellanox/ethtool/sbin/ethtool -K eno6 ntuple on
+ ifconfig eno6 up
+ /sbin/set_irq_affinity_cpulist.sh 1-7,64-71 eno6
+ /opt/mellanox/ethtool/sbin/ethtool -X eno6 hfunc toeplitz
+ /opt/mellanox/ethtool/sbin/ethtool -X eno6 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A
+
+In the example above (1-7,64-71 for the irq affinity) CPU 0 is skipped as it is
+usually used by default on Linux systems by many applications/tools.
+Let the NIC balance as much as possible:
+
+::
+
+ for proto in tcp4 udp4 tcp6 udp6; do
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+ done
+
+In the cpu affinity section of suricata.yaml config :
+
+::
+
+ # Suricata is multi-threaded. Here the threading can be influenced.
+ threading:
+ set-cpu-affinity: yes
+ cpu-affinity:
+ - management-cpu-set:
+ cpu: [ "120-127" ] # include only these cpus in affinity settings
+ - receive-cpu-set:
+ cpu: [ 0 ] # include only these cpus in affinity settings
+ - worker-cpu-set:
+ cpu: [ "8-55" ]
+ mode: "exclusive"
+ prio:
+ high: [ "8-55" ]
+ default: "high"
+
+In the af-packet section of suricata.yaml config:
+
+::
+
+ - interface: eth1
+ # Number of receive threads. "auto" uses the number of cores
+ threads: 48 # 48 worker threads on cpus "8-55" above
+ cluster-id: 99
+ cluster-type: cluster_flow
+ defrag: no
+ use-mmap: yes
+ mmap-locked: yes
+ tpacket-v3: yes
+ ring-size: 100000
+ block-size: 1048576
+
+
+In the example above there are 15 RSS queues pinned to cores 1-7,64-71 on NUMA
+node 0 and 48 worker threads using other CPUs on different NUMA nodes. The
+reason why CPU 0 is skipped in this setup is that on Linux systems it is very
+common for CPU 0 to be used by default by many tools/services. The NIC itself
+in this config is positioned on NUMA node 0, so starting with 15 RSS queues on
+that NUMA node and keeping those CPUs reserved away from other tools in the
+system could offer the best advantage.
+
+.. note:: Performance and optimization of the whole system can be affected by
+   regular NIC driver and pkg/kernel upgrades, so it should be monitored
+   regularly and tested out in QA/test environments first. As a general
+   suggestion it is always recommended to run the latest stable firmware and
+   drivers, as instructed and provided by the particular NIC vendor.
+
+Other considerations
+~~~~~~~~~~~~~~~~~~~~
+
+Another advanced option to consider is the ``isolcpus`` kernel boot parameter,
+which isolates CPU cores from the general system scheduler. This ensures total
+dedication of those CPUs/ranges to the Suricata process only.
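+
+As a sketch, on a Debian/Ubuntu style GRUB setup the worker CPU range could be
+isolated via the kernel command line followed by a reboot (the range
+18-35,54-71 from the Intel example above is purely illustrative):
+
+::
+
+ # in /etc/default/grub
+ GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=18-35,54-71"
+ # then
+ sudo update-grub && sudo reboot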
+
+``stream.wrong_thread`` / ``tcp.pkt_on_wrong_thread`` are counters available
+in ``stats.log`` or ``eve.json`` (as ``event_type: stats``) that indicate
+issues with the load balancing. They can also be related to the traffic or to
+NIC settings. With very high or heavily increasing counter values it is
+recommended to experiment with a different load balancing method, either via
+the NIC or for example using XDP/eBPF. There is an open issue,
+https://redmine.openinfosecfoundation.org/issues/2725, that serves as a
+placeholder for feedback and findings.
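+
+The counters can be checked quickly, for example (assuming the default log
+directory and that ``jq`` is installed for the eve.json variant):
+
+::
+
+ grep -E "wrong_thread" /var/log/suricata/stats.log | tail
+ jq 'select(.event_type=="stats") | .stats.tcp.pkt_on_wrong_thread' /var/log/suricata/eve.json | tail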
diff --git a/doc/userguide/performance/hyperscan.rst b/doc/userguide/performance/hyperscan.rst
new file mode 100644
index 0000000..055fa7f
--- /dev/null
+++ b/doc/userguide/performance/hyperscan.rst
@@ -0,0 +1,84 @@
+Hyperscan
+=========
+
+Introduction
+~~~~~~~~~~~~
+
+"Hyperscan is a high performance regular expression matching library (...)" (https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-hyperscan.html)
+
+In Suricata it can be used to perform multi pattern matching (mpm) or single pattern matching (spm).
+
+Support for hyperscan in Suricata was initially implemented by Justin Viiret and Jim Xu from Intel via https://github.com/OISF/suricata/pull/1965.
+
+Hyperscan is only available for Intel x86 based processor architectures at this
+time. For ARM processors, Vectorscan is a drop-in replacement for Hyperscan
+(https://github.com/VectorCamp/vectorscan).
+
+
+Basic Installation (Package)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some Linux distributions include hyperscan in their respective package
+collections, for example:
+
+Fedora 37+/CentOS 8+::
+
+ sudo dnf install hyperscan-devel
+
+Ubuntu/Debian::
+
+ sudo apt-get install libhyperscan-dev
+
+
+Advanced Installation (Source)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Hyperscan has the following dependencies in order to build from
+source:
+
+* boost development libraries (minimum boost library version is 1.58)
+* cmake
+* C++ compiler (e.g. gcc-c++)
+* libpcap development libraries
+* pcre2 development libraries
+* python3
+* ragel
+* sqlite development libraries
+
+**Note:** git is an additional dependency if cloning the
+hyperscan GitHub repository. Otherwise downloading the
+hyperscan zip from the GitHub repository will work too.
+
+The steps to build and install hyperscan are:
+
+::
+
+ git clone https://github.com/intel/hyperscan
+ cd hyperscan
+ cmake -DBUILD_STATIC_AND_SHARED=1
+ cmake --build ./
+ sudo cmake --install ./
+
+**Note:** Hyperscan can take a long time to build/compile.
+
+**Note:** It may be necessary to add /usr/local/lib or
+/usr/local/lib64 to the `ld` search path. Typically this is
+done by adding a file under /etc/ld.so.conf.d/ with the contents
+of the directory location of libhs.so.5 (for hyperscan 5.x).
+
+
+Using Hyperscan
+~~~~~~~~~~~~~~~
+
+Confirm that the installed Suricata version has Hyperscan enabled:
+
+::
+
+ suricata --build-info | grep Hyperscan
+ Hyperscan support: yes
+
+
+To use Hyperscan support, edit suricata.yaml and change the mpm-algo and
+spm-algo values to 'hs'.
+
+Alternatively, use the command-line options: --set mpm-algo=hs --set spm-algo=hs
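+
+For example, the yaml change and the equivalent command-line invocation could
+look like this (interface and config path are placeholders):
+
+::
+
+ # suricata.yaml
+ mpm-algo: hs
+ spm-algo: hs
+
+ # or on the command line
+ suricata -c suricata.yaml -i eth0 --set mpm-algo=hs --set spm-algo=hs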
+
+**Note**: The default suricata.yaml configuration settings for
+mpm-algo and spm-algo are "auto". With the "auto" setting Suricata will use
+Hyperscan if it is present on the system.
+
+
+If the current Suricata installation does not have Hyperscan
+support, refer to :ref:`installation`
\ No newline at end of file
diff --git a/doc/userguide/performance/ignoring-traffic.rst b/doc/userguide/performance/ignoring-traffic.rst
new file mode 100644
index 0000000..a2c7a88
--- /dev/null
+++ b/doc/userguide/performance/ignoring-traffic.rst
@@ -0,0 +1,106 @@
+Ignoring Traffic
+================
+
+In some cases there are reasons to ignore certain traffic. Certain hosts
+may be trusted, or perhaps a backup stream should be ignored.
+
+capture filters (BPF)
+---------------------
+
+Through BPFs the capture methods pcap, af-packet, netmap and pf_ring can be
+told what to send to Suricata, and what not. For example a simple
+filter 'tcp' will only capture tcp packets.
+
+If some hosts and/or nets need to be ignored, use something like "not
+(host IP1 or IP2 or IP3 or net NET/24)".
+
+Example::
+
+ not host 1.2.3.4
+
+Capture filters are specified on the command-line after all other options::
+
+ suricata -i eth0 -v not host 1.2.3.4
+ suricata -i eno1 -c suricata.yaml tcp or udp
+
+Capture filters can be set per interface in the pcap, af-packet, netmap
+and pf_ring sections. It can also be put in a file::
+
+ echo "not host 1.2.3.4" > capture-filter.bpf
+ suricata -i ens5f0 -F capture-filter.bpf
+
+Using a capture filter limits what traffic Suricata processes. So the
+traffic not seen by Suricata will not be inspected, logged or otherwise
+recorded.
+
+BPF and IPS
+^^^^^^^^^^^
+
+In case of IPS modes using af-packet and netmap, BPFs affect how traffic
+is forwarded. If a capture NIC does not capture a packet because of a BPF,
+it will also not be forwarded to the peering NIC.
+
+So in the example of `not host 1.2.3.4`, traffic to and from the IP `1.2.3.4`
+is effectively dropped.
+
+pass rules
+----------
+
+Pass rules are Suricata rules that if matching, pass the packet and in
+case of TCP the rest of the flow. They look like normal rules, except
+that instead of `alert` or `drop` they use `pass` as the action.
+
+Example::
+
+ pass ip 1.2.3.4 any <> any any (msg:"pass all traffic from/to 1.2.3.4"; sid:1;)
+
+A big difference with capture filters is that logs such as Eve or http.log
+are still generated for this traffic.
+
+suppress
+--------
+
+Suppress rules can be used to make sure no alerts are generated for a
+host. This is not efficient however, as the suppression is only
+considered post-matching. In other words, Suricata first inspects a
+rule, and only then will it consider per-host suppressions.
+
+Example::
+
+ suppress gen_id 0, sig_id 0, track by_src, ip 1.2.3.4
+
+
+encrypted traffic
+-----------------
+
+The TLS app layer parser has the ability to stop processing encrypted traffic
+after the initial handshake. By setting the `app-layer.protocols.tls.encryption-handling`
+option to `bypass` the rest of this flow is ignored. If flow bypass is enabled,
+the bypass is done in the kernel or in hardware.
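+
+A minimal sketch of the relevant suricata.yaml section:
+
+::
+
+ app-layer:
+   protocols:
+     tls:
+       encryption-handling: bypass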
+
+.. _bypass:
+
+bypassing traffic
+-----------------
+
+Aside from using the ``bypass`` keyword in rules, there are three other ways
+to bypass traffic.
+
+- Within suricata (local bypass). Suricata reads a packet, decodes it, checks
+ it in the flow table. If the corresponding flow is local bypassed then it
+ simply skips all streaming, detection and output and the packet goes directly
+ out in IDS mode and to verdict in IPS mode.
+
+- Within the kernel (capture bypass). When Suricata decides to bypass, it calls
+  a function provided by the capture method to declare the bypass in the
+  capture. For NFQ this is a simple mark that will be used by the
+  iptables/nftables ruleset. For AF_PACKET this will be a call to add an element
+  to an eBPF hash table stored in the kernel.
+
+- Within the NIC driver. This method relies upon XDP; XDP can process the
+  traffic prior to it reaching the kernel.
+
+Additional bypass documentation:
+
+https://suricon.net/wp-content/uploads/2017/12/SuriCon17-Manev_Purzynski.pdf
+https://www.stamus-networks.com/2016/09/28/suricata-bypass-feature/
diff --git a/doc/userguide/performance/index.rst b/doc/userguide/performance/index.rst
new file mode 100644
index 0000000..369fd74
--- /dev/null
+++ b/doc/userguide/performance/index.rst
@@ -0,0 +1,16 @@
+Performance
+===========
+
+.. toctree::
+
+ runmodes
+ packet-capture
+ tuning-considerations
+ hyperscan
+ high-performance-config
+ statistics
+ ignoring-traffic
+ packet-profiling
+ rule-profiling
+ tcmalloc
+ analysis
diff --git a/doc/userguide/performance/packet-capture.rst b/doc/userguide/performance/packet-capture.rst
new file mode 100644
index 0000000..d41f668
--- /dev/null
+++ b/doc/userguide/performance/packet-capture.rst
@@ -0,0 +1,77 @@
+Packet Capture
+==============
+
+Load balancing
+--------------
+
+To get the best performance, Suricata will need to run in 'workers' mode. This effectively means that there are multiple threads, each running a full packet pipeline and each receiving packets from the capture method. This means that we rely on the capture method to distribute the packets over the various threads. One critical aspect of this is that Suricata needs to get both sides of a flow in the same thread, in the correct order.
+
+The AF_PACKET and PF_RING capture methods both have options to select the 'cluster-type'. These default to 'cluster_flow' which instructs the capture method to hash by flow (5 tuple). This hash is symmetric. Netmap does not have a cluster_flow mode built-in. It can be added separately by using the 'lb' tool: https://github.com/luigirizzo/netmap/tree/master/apps/lb
+
+On multi-queue NICs, which is almost any modern NIC, RSS settings need to be considered.
+
+RSS
+---
+
+Receive Side Scaling is a technique used by network cards to distribute incoming traffic over various queues on the NIC. This is meant to improve performance but it is important to realize that it was designed for normal traffic, not for the IDS packet capture scenario. RSS uses a hash algorithm to distribute the incoming traffic over the various queues. This hash is normally *not* symmetrical. This means that when receiving both sides of a flow, each side may end up in a different queue. Sadly, when deploying Suricata, this is the common scenario when using span ports or taps.
+
+The problem here is that by having both sides of the traffic in different queues, the order of processing of packets becomes unpredictable. Timing differences on the NIC, the driver, the kernel and in Suricata will lead to a high chance of packets coming in at a different order than on the wire. This is specifically about a mismatch between the two traffic directions. For example, Suricata tracks the TCP 3-way handshake. Due to this timing issue, the SYN/ACK may only be received by Suricata long after the client to server side has already started sending data. Suricata would see this traffic as invalid.
+
+None of the supported capture methods like AF_PACKET, PF_RING or NETMAP can fix this problem for us. It would require buffering and packet reordering which is expensive.
+
+To see how many queues are configured:
+
+::
+
+
+ $ ethtool -l ens2f1
+ Channel parameters for ens2f1:
+ Pre-set maximums:
+ RX: 0
+ TX: 0
+ Other: 1
+ Combined: 64
+ Current hardware settings:
+ RX: 0
+ TX: 0
+ Other: 1
+ Combined: 8
+
+Some NICs allow you to set them into a symmetric mode. The Intel X(L)710 card can do this in theory, but the drivers aren't capable of enabling this yet (work is underway to try to address this). Another way to address this is by setting a special "Random Secret Key" that will make the RSS symmetrical. See http://www.ndsl.kaist.edu/~kyoungsoo/papers/TR-symRSS.pdf (PDF).
+
+In most scenarios, however, the optimal solution is to reduce the number of RSS queues to 1:
+
+Example:
+
+::
+
+
+ # Intel X710 with i40e driver:
+ ethtool -L $DEV combined 1
+
+Some drivers do not support setting the number of queues through ethtool. In some cases there is a module load time option. Read the driver docs for the specifics.
+
+
+Offloading
+----------
+
+Network cards, drivers and the kernel itself have various techniques to speed up packet handling. Generally these will all have to be disabled.
+
+LRO/GRO lead to merging various smaller packets into big 'super packets'. These will need to be disabled as they break the dsize keyword as well as TCP state tracking.
+
+Checksum offloading can be left enabled on AF_PACKET and PF_RING, but needs to be disabled on PCAP, NETMAP and others.
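+
+As an example, most of these offloads can be disabled with ethtool (``$DEV`` is
+a placeholder; not every NIC supports every option):
+
+::
+
+ ethtool -K $DEV gro off lro off tso off gso off sg off rx off tx off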
+
+
+
+Recommendations
+---------------
+
+Read your driver's documentation! E.g. for i40e the ethtool change of RSS queues may lead to kernel panics if done wrong.
+
+Generic: set RSS queues to 1 or make sure RSS hashing is symmetric. Disable NIC offloading.
+
+AF_PACKET: 1 RSS queue and stay on kernel <=4.2 or make sure you have >=4.4.16, >=4.6.5 or >=4.7. Exception: if RSS is symmetric cluster-type 'cluster_qm' can be used to bind Suricata to the RSS queues. Disable NIC offloading except the rx/tx csum.
+
+PF_RING: 1 RSS queue and use cluster-type 'cluster_flow'. Disable NIC offloading except the rx/tx csum.
+
+NETMAP: 1 RSS queue. There is no flow based load balancing built-in, but the 'lb' tool can be helpful. Another option is to use the 'autofp' runmode. Exception: if RSS is symmetric, load balancing is based on the RSS hash and multiple RSS queues can be used. Disable all NIC offloading.
diff --git a/doc/userguide/performance/packet-profiling.rst b/doc/userguide/performance/packet-profiling.rst
new file mode 100644
index 0000000..5496447
--- /dev/null
+++ b/doc/userguide/performance/packet-profiling.rst
@@ -0,0 +1,58 @@
+Packet Profiling
+================
+
+This guide explains how to enable packet profiling and use it with the most
+recent code of Suricata on Ubuntu. It is based on the assumption that you have
+already installed Suricata once from the GIT repository.
+
+Packet profiling is convenient if you would like to know how long packets take
+to be processed. It is a way to figure out why certain packets are being
+processed more quickly than others, and is therefore a good tool for developing
+Suricata.
+
+Update Suricata by following the steps from :ref:`Installation from GIT`. Start
+at the end at
+
+::
+
+ cd suricata/suricata
+ git pull
+
+And follow the described next steps. To enable packet profiling, make
+sure you enter the following during the configuring stage:
+
+::
+
+ ./configure --enable-profiling
+
+Find a folder in which you have pcaps. If you do not have pcaps yet,
+you can get these with Wireshark. See `Sniffing Packets with Wireshark
+<https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Sniffing_Packets_with_Wireshark>`_.
+
+Go to the directory of your pcaps. For example:
+
+::
+
+ cd ~/Desktop
+
+With the ls command you can see the content of the folder. Choose a
+folder and a pcap file
+
+for example:
+
+::
+
+ cd ~/Desktop/2011-05-05
+
+Run Suricata with that pcap:
+
+::
+
+ suricata -c /etc/suricata/suricata.yaml -r log.pcap.(followed by the number/name of your pcap)
+
+for example:
+
+::
+
+ suricata -c /etc/suricata/suricata.yaml -r log.pcap.1304589204
diff --git a/doc/userguide/performance/rule-profiling.rst b/doc/userguide/performance/rule-profiling.rst
new file mode 100644
index 0000000..f05e8fb
--- /dev/null
+++ b/doc/userguide/performance/rule-profiling.rst
@@ -0,0 +1,33 @@
+Rule Profiling
+==============
+
+Rule profiling requires Suricata to be built with **--enable-profiling**. An
+example of the resulting rule profiling output:
+
+::
+
+ --------------------------------------------------------------------------
+ Date: 9/5/2013 -- 14:59:58
+ --------------------------------------------------------------------------
+ Num Rule Gid Rev Ticks % Checks Matches Max Ticks Avg Ticks Avg Match Avg No Match
+ -------- ------------ -------- -------- ------------ ------ -------- -------- ----------- ----------- ----------- --------------
+ 1 2210021 1 3 12037 4.96 1 1 12037 12037.00 12037.00 0.00
+ 2 2210054 1 1 107479 44.26 12 0 35805 8956.58 0.00 8956.58
+ 3 2210053 1 1 4513 1.86 1 0 4513 4513.00 0.00 4513.00
+ 4 2210023 1 1 3077 1.27 1 0 3077 3077.00 0.00 3077.00
+ 5 2210008 1 1 3028 1.25 1 0 3028 3028.00 0.00 3028.00
+ 6 2210009 1 1 2945 1.21 1 0 2945 2945.00 0.00 2945.00
+ 7 2210055 1 1 2945 1.21 1 0 2945 2945.00 0.00 2945.00
+ 8 2210007 1 1 2871 1.18 1 0 2871 2871.00 0.00 2871.00
+ 9 2210005 1 1 2871 1.18 1 0 2871 2871.00 0.00 2871.00
+ 10 2210024 1 1 2846 1.17 1 0 2846 2846.00 0.00 2846.00
+
+The meaning of the individual fields:
+
+* Ticks -- total ticks spent on this rule, so a sum of all inspections
+* % -- share of this single sig in the total cost of inspection
+* Checks -- number of times a signature was inspected
+* Matches -- number of times it matched. This may not have resulted in an alert due to suppression and thresholding.
+* Max ticks -- single most expensive inspection
+* Avg ticks -- per inspection average, so "ticks" / "checks".
+* Avg match -- avg ticks spent resulting in match
+* Avg No Match -- avg ticks spent resulting in no match.
+
+The "ticks" are CPU clock ticks: http://en.wikipedia.org/wiki/CPU_time
diff --git a/doc/userguide/performance/runmodes.rst b/doc/userguide/performance/runmodes.rst
new file mode 100644
index 0000000..4afc5d5
--- /dev/null
+++ b/doc/userguide/performance/runmodes.rst
@@ -0,0 +1,66 @@
+Runmodes
+========
+
+Suricata consists of several 'building blocks' called threads,
+thread-modules and queues. A thread is like a process that runs on a
+computer. Suricata is multi-threaded, so multiple threads are active
+at once. A thread-module implements a piece of functionality. One module
+is, for example, for decoding a packet, another is the detect-module and
+another one the output-module. A packet can be processed by more than
+one thread. The packet will then be passed on to the next thread through
+a queue. Packets will be processed by one thread at a time, but there
+can be multiple packets being processed at a time by the engine (see
+:ref:`suricata-yaml-max-pending-packets`). A thread can have one or
+more thread-modules. If it has more modules, only one of them can be
+active at a time. The way threads, modules and queues are arranged
+together is called the "Runmode".
+
+Different runmodes
+~~~~~~~~~~~~~~~~~~
+
+You can choose a runmode out of several predefined runmodes. The
+command line option ``--list-runmodes`` shows all available runmodes. All
+runmodes have a name: single, workers, autofp.
+
+Generally, the ``workers`` runmode performs the best. In this mode the
+NIC/driver makes sure packets are properly balanced over Suricata's
+processing threads. Each packet processing thread then contains the
+full packet pipeline.
+
+.. image:: runmodes/workers.png
+
+For processing PCAP files, or in case of certain IPS setups (like NFQ),
+``autofp`` is used. Here there are one or more capture threads, that
+capture the packet and do the packet decoding, after which it is passed
+on to the ``flow worker`` threads.
+
+.. image:: runmodes/autofp1.png
+
+.. image:: runmodes/autofp2.png
+
+Finally, the ``single`` runmode is the same as the ``workers`` mode,
+however there is only a single packet processing thread. This is mostly
+useful during development.
+
+.. image:: runmodes/single.png
+
+For more information about the command line options concerning the
+runmode, see :doc:`../command-line-options`.
+
+Load balancing
+~~~~~~~~~~~~~~
+
+Suricata may use different ways to load balance the packets to process
+between different threads with the configuration option `autofp-scheduler`.
+
+The default value is `hash`, which means the packet is assigned to threads
+using the 5-7 tuple hash, which is also used to store the flows in memory.
+
+This option can also be set to:
+
+- `ippair`: packets are assigned to threads using addresses only.
+- `ftp-hash`: same as `hash` except for flows that may be ftp or ftp-data,
+  so that these flows get processed by the same thread. This avoids the
+  concurrency issue in recognizing ftp-data flows that would otherwise arise
+  from processing them before the ftp flow got processed. For such a flow, a
+  variant of the hash is used.
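+
+A sketch of selecting a different scheduler in suricata.yaml:
+
+::
+
+ autofp-scheduler: ftp-hash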
diff --git a/doc/userguide/performance/runmodes/Runmode_autofp.png b/doc/userguide/performance/runmodes/Runmode_autofp.png
new file mode 100644
index 0000000..42db21d
--- /dev/null
+++ b/doc/userguide/performance/runmodes/Runmode_autofp.png
Binary files differ
diff --git a/doc/userguide/performance/runmodes/autofp1.png b/doc/userguide/performance/runmodes/autofp1.png
new file mode 100644
index 0000000..6bbcc94
--- /dev/null
+++ b/doc/userguide/performance/runmodes/autofp1.png
Binary files differ
diff --git a/doc/userguide/performance/runmodes/autofp2.png b/doc/userguide/performance/runmodes/autofp2.png
new file mode 100644
index 0000000..d9c944d
--- /dev/null
+++ b/doc/userguide/performance/runmodes/autofp2.png
Binary files differ
diff --git a/doc/userguide/performance/runmodes/single.png b/doc/userguide/performance/runmodes/single.png
new file mode 100644
index 0000000..1623a4b
--- /dev/null
+++ b/doc/userguide/performance/runmodes/single.png
Binary files differ
diff --git a/doc/userguide/performance/runmodes/threading1.png b/doc/userguide/performance/runmodes/threading1.png
new file mode 100644
index 0000000..399bf67
--- /dev/null
+++ b/doc/userguide/performance/runmodes/threading1.png
Binary files differ
diff --git a/doc/userguide/performance/runmodes/workers.png b/doc/userguide/performance/runmodes/workers.png
new file mode 100644
index 0000000..eabbe27
--- /dev/null
+++ b/doc/userguide/performance/runmodes/workers.png
Binary files differ
diff --git a/doc/userguide/performance/statistics.rst b/doc/userguide/performance/statistics.rst
new file mode 100644
index 0000000..454777f
--- /dev/null
+++ b/doc/userguide/performance/statistics.rst
@@ -0,0 +1,161 @@
+Statistics
+==========
+
+The stats.log produces statistics records on a fixed interval, by
+default every 8 seconds.
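+
+The interval can be changed in the ``stats`` section of suricata.yaml, for
+example:
+
+::
+
+ stats:
+   enabled: yes
+   interval: 8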
+
+stats.log file
+--------------
+
+::
+
+ -------------------------------------------------------------------
+ Counter | TM Name | Value
+ -------------------------------------------------------------------
+ flow_mgr.closed_pruned | FlowManagerThread | 154033
+ flow_mgr.new_pruned | FlowManagerThread | 67800
+ flow_mgr.est_pruned | FlowManagerThread | 100921
+ flow.memuse | FlowManagerThread | 6557568
+ flow.spare | FlowManagerThread | 10002
+ flow.emerg_mode_entered | FlowManagerThread | 0
+ flow.emerg_mode_over | FlowManagerThread | 0
+ decoder.pkts | RxPcapem21 | 450001754
+ decoder.bytes | RxPcapem21 | 409520714250
+ decoder.ipv4 | RxPcapem21 | 449584047
+ decoder.ipv6 | RxPcapem21 | 9212
+ decoder.ethernet | RxPcapem21 | 450001754
+ decoder.raw | RxPcapem21 | 0
+ decoder.sll | RxPcapem21 | 0
+ decoder.tcp | RxPcapem21 | 448124337
+ decoder.udp | RxPcapem21 | 542040
+ decoder.sctp | RxPcapem21 | 0
+ decoder.icmpv4 | RxPcapem21 | 82292
+ decoder.icmpv6 | RxPcapem21 | 9164
+ decoder.ppp | RxPcapem21 | 0
+ decoder.pppoe | RxPcapem21 | 0
+ decoder.gre | RxPcapem21 | 0
+ decoder.vlan | RxPcapem21 | 0
+ decoder.avg_pkt_size | RxPcapem21 | 910
+ decoder.max_pkt_size | RxPcapem21 | 1514
+ defrag.ipv4.fragments | RxPcapem21 | 4
+ defrag.ipv4.reassembled | RxPcapem21 | 1
+ defrag.ipv4.timeouts | RxPcapem21 | 0
+ defrag.ipv6.fragments | RxPcapem21 | 0
+ defrag.ipv6.reassembled | RxPcapem21 | 0
+ defrag.ipv6.timeouts | RxPcapem21 | 0
+ tcp.sessions | Detect | 41184
+ tcp.ssn_memcap_drop | Detect | 0
+ tcp.pseudo | Detect | 2087
+ tcp.invalid_checksum | Detect | 8358
+ tcp.no_flow | Detect | 0
+ tcp.reused_ssn | Detect | 11
+ tcp.memuse | Detect | 36175872
+ tcp.syn | Detect | 85902
+ tcp.synack | Detect | 83385
+ tcp.rst | Detect | 84326
+ tcp.segment_memcap_drop | Detect | 0
+ tcp.stream_depth_reached | Detect | 109
+ tcp.reassembly_memuse | Detect | 67755264
+ tcp.reassembly_gap | Detect | 789
+ detect.alert | Detect | 14721
+
+Detecting packet loss
+~~~~~~~~~~~~~~~~~~~~~
+
+At shutdown, Suricata reports the packet loss statistics it gets from
+pcap, pfring or afpacket:
+
+::
+
+ [18088] 30/5/2012 -- 07:39:18 - (RxPcapem21) Packets 451595939, bytes 410869083410
+ [18088] 30/5/2012 -- 07:39:18 - (RxPcapem21) Pcap Total:451674222 Recv:451596129 Drop:78093 (0.0%).
+
+Usually, this is not the complete story though. These are kernel drop
+stats, but the NIC may also have dropped packets. Use ethtool to get
+to those:
+
+::
+
+ # ethtool -S em2
+ NIC statistics:
+ rx_packets: 35430208463
+ tx_packets: 216072
+ rx_bytes: 32454370137414
+ tx_bytes: 53624450
+ rx_broadcast: 17424355
+ tx_broadcast: 133508
+ rx_multicast: 5332175
+ tx_multicast: 82564
+ rx_errors: 47
+ tx_errors: 0
+ tx_dropped: 0
+ multicast: 5332175
+ collisions: 0
+ rx_length_errors: 0
+ rx_over_errors: 0
+ rx_crc_errors: 51
+ rx_frame_errors: 0
+ rx_no_buffer_count: 0
+ rx_missed_errors: 0
+ tx_aborted_errors: 0
+ tx_carrier_errors: 0
+ tx_fifo_errors: 0
+ tx_heartbeat_errors: 0
+ tx_window_errors: 0
+ tx_abort_late_coll: 0
+ tx_deferred_ok: 0
+ tx_single_coll_ok: 0
+ tx_multi_coll_ok: 0
+ tx_timeout_count: 0
+ tx_restart_queue: 0
+ rx_long_length_errors: 0
+ rx_short_length_errors: 0
+ rx_align_errors: 0
+ tx_tcp_seg_good: 0
+ tx_tcp_seg_failed: 0
+ rx_flow_control_xon: 0
+ rx_flow_control_xoff: 0
+ tx_flow_control_xon: 0
+ tx_flow_control_xoff: 0
+ rx_long_byte_count: 32454370137414
+ rx_csum_offload_good: 35270755306
+ rx_csum_offload_errors: 65076
+ alloc_rx_buff_failed: 0
+ tx_smbus: 0
+ rx_smbus: 0
+ dropped_smbus: 0
+
+Kernel drops
+------------
+
+stats.log contains interesting information in the
+capture.kernel_packets and capture.kernel_drops counters. Their meaning
+differs depending on the capture mode.
+
+In AF_PACKET mode:
+
+* kernel_packets is the number of packets correctly sent to userspace
+* kernel_drops is the number of packets that have been discarded instead of being sent to userspace
+
+In PF_RING mode:
+
+* kernel_packets is the total number of packets seen by pf_ring
+* kernel_drops is the number of packets that have been discarded instead of being sent to userspace
+
+In the Suricata stats.log the TCP data gap counter is also an
+indicator, as it accounts for missing data packets in TCP streams:
+
+::
+
+ tcp.reassembly_gap | Detect | 789
+
+Ideally, this number is 0. It is not only affected by packet loss though; bad
+checksums and the stream engine running out of memory also increase it.
+
+Tools to plot graphs
+--------------------
+
+Some people made nice tools to plot graphs of the statistics file.
+
+* `ipython and matplotlib script <https://github.com/regit/suri-stats>`_
+* `Monitoring with Zabbix or other <http://christophe.vandeplas.com/2013/11/suricata-monitoring-with-zabbix-or-other.html>`_ and `Code on GitHub <https://github.com/cvandeplas/suricata_stats>`_
diff --git a/doc/userguide/performance/tcmalloc.rst b/doc/userguide/performance/tcmalloc.rst
new file mode 100644
index 0000000..b5f559f
--- /dev/null
+++ b/doc/userguide/performance/tcmalloc.rst
@@ -0,0 +1,39 @@
+Tcmalloc
+========
+
+'tcmalloc' is a library Google created as part of the google-perftools
+suite for improving memory handling in a threaded program. It's very
+simple to use and does work fine with Suricata. It leads to minor
+speed ups and also reduces memory usage quite a bit.
+
+Installation
+~~~~~~~~~~~~
+
+On Ubuntu, install the libtcmalloc-minimal4 package:
+
+::
+
+ apt-get install libtcmalloc-minimal4
+
+On Fedora, install the gperftools-libs package:
+
+::
+
+ yum install gperftools-libs
+
+Usage
+~~~~~
+
+Use the tcmalloc by preloading it:
+
+Ubuntu:
+
+::
+
+ LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4" suricata -c suricata.yaml -i eth0
+
+Fedora:
+
+::
+
+ LD_PRELOAD="/usr/lib64/libtcmalloc_minimal.so.4" suricata -c suricata.yaml -i eth0
diff --git a/doc/userguide/performance/tuning-considerations.rst b/doc/userguide/performance/tuning-considerations.rst
new file mode 100644
index 0000000..b184f6c
--- /dev/null
+++ b/doc/userguide/performance/tuning-considerations.rst
@@ -0,0 +1,133 @@
+Tuning Considerations
+=====================
+
+Settings to check for optimal performance.
+
+max-pending-packets: <number>
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This setting controls the number of simultaneous packets that the engine
+can handle. Setting this higher generally keeps the threads busier,
+but setting it too high will lead to degradation.
+
+Suggested setting: 10000 or higher. Max is ~65000. This setting is per thread.
+The memory is set up at start and the usage is as follows:
+
+::
+
+ number_of.threads X max-pending-packets X (default-packet-size + ~750 bytes)
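+
+As a rough worked example (assuming 20 packet processing threads, the default
+``default-packet-size`` of 1514 and ``max-pending-packets: 10000``):
+
+::
+
+ 20 x 10000 x (1514 + 750) bytes = ~453 MB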
+
+mpm-algo: <ac|hs|ac-bs|ac-ks>
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Controls the pattern matcher algorithm. AC (``Aho–Corasick``) is the default.
+On supported platforms, :doc:`hyperscan` is the best option. On commodity
+hardware if Hyperscan is not available the suggested setting is
+``mpm-algo: ac-ks`` (``Aho–Corasick`` Ken Steele variant) as it performs better than
+``mpm-algo: ac``
+
+detect.profile: <low|medium|high|custom>
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The detection engine tries to split out separate signatures into
+groups so that a packet is only inspected against signatures that can
+actually match. As in a large rule set this would result in far too many
+groups and too much memory usage, similar groups are merged together. The
+profile setting controls how aggressively this merging is done. The default
+setting of ``high`` usually is good enough.
+
+The "custom" setting allows modification of the group sizes:
+
+::
+
+ custom-values:
+ toclient-groups: 100
+ toserver-groups: 100
+
+In general, increasing the number of groups will improve performance, at the
+cost of only a minimal increase in memory usage.
+The default value for ``toclient-groups`` and ``toserver-groups`` with
+``detect.profile: high`` is 75.
+
+detect.sgh-mpm-context: <auto|single|full>
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The multi pattern matcher can have its context per signature group
+(full) or globally (single). Auto selects between single and full
+based on the **mpm-algo** selected. ac, ac-bs, ac-ks and hs default to "single".
+Setting this to "full" with ``mpm-algo: ac`` or ``mpm-algo: ac-ks`` offers
+better performance. Setting this to "full" with ``mpm-algo: hs`` is not
+recommended as it leads to a much higher startup time. Instead, with Hyperscan
+either ``detect.profile: high`` or bigger custom group size settings can be
+used as explained above, which offers better performance than ``ac`` and
+``ac-ks`` even with ``detect.sgh-mpm-context: full``.
+
+af-packet
+~~~~~~~~~
+
+If using ``af-packet`` (default on Linux) it is recommended that af-packet v3
+is used for IDS/NSM deployments. For IPS deployments af-packet v2 is
+recommended. To make sure af-packet v3 is used, it can be specifically enforced
+in the ``af-packet`` config section of suricata.yaml like so:
+
+::
+
+ af-packet:
+ - interface: eth0
+ ....
+ ....
+ ....
+ use-mmap: yes
+ tpacket-v3: yes
+
+ring-size
+~~~~~~~~~
+
+Ring-size is another ``af-packet`` variable that can be considered for tuning
+and performance benefits. It basically means the buffer size for packets per
+thread. So if the setting is ``ring-size: 100000`` like below:
+
+::
+
+ af-packet:
+ - interface: eth0
+ threads: 5
+ ring-size: 100000
+
+it means there will be 100,000 packets allowed in each buffer of the 5 threads.
+If any of the buffers gets filled (for example packet processing can not keep up)
+that will result in packet ``drop`` counters increasing in the stats logs.
+
+The memory used for those is set up and dedicated at start and is calculated
+as follows:
+
+::
+
+ af-packet.threads X af-packet.ring-size X (default-packet-size + ~750 bytes)
+
+where ``af-packet.threads``, ``af-packet.ring-size`` and ``default-packet-size``
+are the values set in suricata.yaml. Config values, for example for af-packet,
+can be quickly displayed on the command line with
+``suricata --dump-config | grep af-packet``.
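+
+Using the example values above (5 threads, ``ring-size: 100000`` and the
+default ``default-packet-size`` of 1514) this gives roughly:
+
+::
+
+ 5 x 100000 x (1514 + 750) bytes = ~1.1 GB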
+
+stream.bypass
+~~~~~~~~~~~~~
+
+Another option that can be used to improve performance is ``stream.bypass``.
+In the example below:
+
+::
+
+ stream:
+ memcap: 64mb
+ checksum-validation: yes # reject wrong csums
+ inline: auto # auto will use inline mode in IPS mode, yes or no set it statically
+ bypass: yes
+ reassembly:
+ memcap: 256mb
+ depth: 1mb # reassemble 1mb into a stream
+ toserver-chunk-size: 2560
+ toclient-chunk-size: 2560
+ randomize-chunk-size: yes
+
+Inspection will be skipped when ``stream.reassembly.depth`` of 1mb is reached for a particular flow.