## tc.plugin

Live demo: **[see it in action here](https://registry.my-netdata.io/#menu_tc)**!

![qos](https://cloud.githubusercontent.com/assets/2662304/14439411/b7f36254-0033-11e6-93f0-c739bb6a1c3a.gif)

Netdata monitors `tc` QoS classes for all interfaces.

If you also use [FireQOS](http://firehol.org/tutorial/fireqos-new-user/), netdata will also collect
the interface and class names.

There is a [shell helper](tc-qos-helper.sh.in) for this. All parsing is done by the plugin
in `C` code; the shell script only configures the command that is run to get the `tc` output.

The source of the tc plugin is [here](plugin_tc.c). It is somewhat complex, because a state
machine was needed to keep track of all the `tc` classes, including the pseudo classes tc
dynamically creates.

## Motivation

One category of metrics missing in Linux monitoring is bandwidth consumption for each open
socket (inbound and outbound traffic). So, you cannot tell how much bandwidth your web server,
your database server, your backup, your ssh sessions, etc. are using.

To solve this problem, the most *adventurous* Linux monitoring tools install kernel modules to
capture all traffic, analyze it and provide reports per application. This is a lot of work, it is
CPU intensive, and it carries a great degree of risk (the kernel modules involved might affect the
stability of the whole system). Not to mention that such solutions are probably better suited
for a core Linux router in your network.

Others use NFACCT, the netfilter accounting module which is already part of the Linux firewall.
However, this would require configuring a firewall on every system whose bandwidth you want to measure.

QoS monitoring attempts to solve this in a much cleaner way.

## Introduction to QoS

One of the features the Linux kernel has, but which is rarely used, is its ability to
**apply QoS on traffic**. Even more interesting is that it can apply QoS to **both inbound and
outbound traffic**.

QoS is about two features:

1. **Classify traffic**

   Classification is the process of organizing traffic in groups, called **classes**.
   Classification can evaluate every aspect of network packets, like source and destination ports,
   source and destination IPs, netfilter marks, etc.

   When you classify traffic, you just assign a label to it. For example, **I call `web server`
   the traffic from and to my server's tcp/80 and tcp/443, while I call `web surfing` all other
   tcp/80 and tcp/443 traffic**. You can use any combination you like. There is no limit.

2. **Apply traffic shaping rules to these classes**

   Traffic shaping is used to control how network interface bandwidth should be shared among the
   classes. Of course, we do not need this feature just to monitor the traffic.
   Classification alone is enough for monitoring everything.
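
To make the labelling idea concrete, here is a minimal Python sketch of the `web server` / `web surfing` example above. It is only a conceptual illustration: in Linux the classification is done by `tc` filters in the kernel, not by user code. The function name `classify`, the `direction` argument and the `default` label are assumptions made for this sketch.

```python
# Conceptual illustration only: real classification happens in the kernel via tc.
WEB_PORTS = {80, 443}


def classify(direction, src_port, dst_port):
    """Return a class label for one TCP packet seen on the server.

    direction is "in" for packets the server receives, "out" for packets it sends.
    """
    # Traffic to my server's tcp/80, tcp/443 (inbound) or from them (outbound).
    if direction == "in" and dst_port in WEB_PORTS:
        return "web server"
    if direction == "out" and src_port in WEB_PORTS:
        return "web server"
    # All other tcp/80 and tcp/443 traffic: this machine is acting as the client.
    if src_port in WEB_PORTS or dst_port in WEB_PORTS:
        return "web surfing"
    # Hypothetical catch-all label for everything else.
    return "default"
```

A request arriving at the server's port 443 gets the `web server` label, while a request the same machine sends to someone else's port 443 gets `web surfing`.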

The key reasons for applying QoS on all servers (even cloud ones) are:

   - **ensure administrative tasks (like ssh, dns, etc) will always have a small but guaranteed
      bandwidth.** QoS can guarantee that services like ssh, dns, ntp, etc will always have a small
      supply of bandwidth. So, no matter what happens, you will be able to ssh to your server and
      DNS will always work.

   - **ensure other administrative tasks will not monopolize all the available bandwidth.**
      Services like backups, file copies, database dumps, etc. can easily monopolize all the
      available bandwidth. It is common, for example, for a nightly backup or a huge file transfer
      to negatively influence the end-user experience. QoS can fix that.

   - **ensure each end-user connection will get a fair cut of the available bandwidth.**
      Several QoS queuing disciplines in Linux do this automatically, without any configuration from you.
      The result is that new sockets are favored over older ones, so that users will get a snappier
      experience, while others are transferring large amounts of traffic.
   
   - **protect the servers from DDoS attacks.**
      When your system is under a DDoS attack, it receives far more traffic than it can handle
      and your applications will probably crash. Setting a limit on the inbound traffic using
      QoS will protect your servers (it throttles the requests) and, depending on the size of the
      attack, may allow your legitimate users to access the server while the attack is taking place.


Once **traffic classification** is applied, netdata can visualize the bandwidth consumption per
class in real-time (no configuration is needed for netdata - it will figure it out).
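
As a rough idea of where the per-class numbers come from, the sketch below reads the `Sent N bytes` counter of each class from two samples of `tc -s class show dev <interface>` style output and turns the difference into a rate. This is not how `plugin_tc.c` works internally. The sample text, the class ids and the function names are assumptions made for illustration, and counter resets are ignored.

```python
import re

# Hypothetical samples, shaped like the output of `tc -s class show dev eth0`.
SAMPLE_T0 = """\
class htb 1:10 parent 1:1 leaf 10: prio 0 rate 1Mbit ceil 50Mbit burst 1600b cburst 1600b
 Sent 1000 bytes 10 pkt (dropped 0, overlimits 0 requeues 0)
class htb 1:11 parent 1:1 leaf 11: prio 1 rate 2Mbit ceil 50Mbit burst 1600b cburst 1600b
 Sent 5000 bytes 40 pkt (dropped 0, overlimits 0 requeues 0)
"""
SAMPLE_T1 = SAMPLE_T0.replace("Sent 1000 bytes 10 pkt", "Sent 126000 bytes 135 pkt")

CLASS_RE = re.compile(r"^class\s+\S+\s+(\S+)")    # captures the class id, e.g. 1:10
SENT_RE = re.compile(r"^\s*Sent\s+(\d+)\s+bytes")  # captures the byte counter


def parse_sent_bytes(text):
    """Map each class id to its cumulative 'Sent' byte counter."""
    counters = {}
    current = None
    for line in text.splitlines():
        match = CLASS_RE.match(line)
        if match:
            current = match.group(1)
            continue
        match = SENT_RE.match(line)
        if match and current is not None:
            counters[current] = int(match.group(1))
            current = None
    return counters


def rates_bits_per_sec(before, after, interval_sec):
    """Per-class rate in bits/s from two counter snapshots taken interval_sec apart."""
    return {
        class_id: (after[class_id] - before[class_id]) * 8 / interval_sec
        for class_id in after
        if class_id in before
    }


rates = rates_bits_per_sec(
    parse_sent_bytes(SAMPLE_T0), parse_sent_bytes(SAMPLE_T1), 1.0
)
```

With the two samples above taken one second apart, class `1:10` sent 125000 bytes, which is 1 Mbit/s, while class `1:11` was idle.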

QoS is extremely light. You configure it once, and that is it. It will not bother you again
and it will not use any noticeable CPU resources, especially on application and database servers.

## QoS in Linux? Have you lost your mind?

Yes I know... but no, I have not!

Of course, `tc` is probably **the most undocumented, complicated and unfriendly** command in Linux. 

For example, to match a simple port range in `tc`, e.g. all the high ports from 1025 to 65535
inclusive, you have to match all of these value/mask pairs:

```
1025/0xffff 1026/0xfffe 1028/0xfffc  1032/0xfff8  1040/0xfff0
1056/0xffe0 1088/0xffc0 1152/0xff80  1280/0xff00  1536/0xfe00
2048/0xf800 4096/0xf000 8192/0xe000 16384/0xc000 32768/0x8000
```
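
The list above can be derived mechanically: split the range into the largest aligned power-of-two blocks that fit, and express each block as a start value plus a bit mask. The following Python sketch shows one way to do it. The function name `port_range_to_masks` is made up for this illustration, and this is not necessarily how FireQOS computes it.

```python
def port_range_to_masks(start, end, bits=16):
    """Split the inclusive range [start, end] into (value, mask) pairs.

    Each pair matches every port p for which (p & mask) == value.
    """
    full_mask = (1 << bits) - 1
    pairs = []
    while start <= end:
        # Largest power-of-two block that is aligned at `start`.
        size = start & -start if start else 1 << bits
        # Shrink the block until it no longer overshoots `end`.
        while size > end - start + 1:
            size >>= 1
        # Keep the high bits fixed, let the low bits of the block vary.
        pairs.append((start, full_mask & ~(size - 1)))
        start += size
    return pairs


for value, mask in port_range_to_masks(1025, 65535):
    print(f"{value}/{mask:#06x}")
```

Called for 1025 to 65535, it yields the same 15 pairs as above, starting with `1025/0xffff` and ending with `32768/0x8000`.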

I know what you are thinking right now! **And I agree!**

This is why I wrote **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, a tool to
simplify QoS management in Linux.

The **[FireHOL](https://firehol.org/)** package already distributes **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**.
Check the **[FireQOS tutorial](https://firehol.org/tutorial/fireqos-new-user/)**
to learn how to write your own QoS configuration.

With **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, it is **really simple for everyone
to use QoS in Linux**. Just install the package `firehol`. It should already be available for your
distribution. If not, check the **[FireHOL Installation Guide](https://firehol.org/installing/)**.
After that, you will have the `fireqos` command.

This is the file `/etc/firehol/fireqos.conf` we use at the netdata demo site:

```sh
    # configure the netdata ports
    server_netdata_ports="tcp/19999"

    interface eth0 world bidirectional ethernet balanced rate 50Mbit
       class arp
          match arp

       class icmp
          match icmp

       class dns commit 1Mbit
          server dns
          client dns

       class ntp
          server ntp
          client ntp

       class ssh commit 2Mbit
          server ssh
          client ssh

       class rsync commit 2Mbit max 10Mbit
          server rsync
          client rsync

       class web_server commit 40Mbit
          server http
          server netdata

       class client
          client surfing

       class nms commit 1Mbit
          match input src 10.2.3.5
```

Nothing more is needed. You just run `fireqos start` to apply this configuration, restart netdata
and you have real-time visualization of the bandwidth consumption of your applications. FireQOS is
not a daemon. It just converts the configuration to `tc` commands, runs them and exits.

**IMPORTANT**: If you copy this configuration to apply it to your system, please adapt the
speeds - experiment in non-production environments to learn the tool, before applying it on
your servers.

And this is what you are going to get:

![image](https://cloud.githubusercontent.com/assets/2662304/14436322/c91d90a4-0024-11e6-9fb1-57cdef1580df.png)

---

## More examples:

This is QoS from a Linux router. Check these features:

1. It is real-time (per second updates)
2. QoS really works in Linux - check that the `background` traffic is squeezed when `surfing` needs it.

![test2](https://cloud.githubusercontent.com/assets/2662304/14093004/68966020-f553-11e5-98fe-ffee2086fafd.gif)