diff options
Diffstat (limited to 'doc/userguide/rules/datasets.rst')
-rw-r--r-- | doc/userguide/rules/datasets.rst | 350 |
1 files changed, 350 insertions, 0 deletions
diff --git a/doc/userguide/rules/datasets.rst b/doc/userguide/rules/datasets.rst new file mode 100644 index 0000000..069ee72 --- /dev/null +++ b/doc/userguide/rules/datasets.rst @@ -0,0 +1,350 @@ +.. _datasets: + +Datasets +======== + +Using the ``dataset`` and ``datarep`` keyword it is possible to match on +large amounts of data against any sticky buffer. + +For example, to match against a DNS black list called ``dns-bl``:: + + dns.query; dataset:isset,dns-bl; + +These keywords are aware of transforms. So to look up a DNS query against +a MD5 black list:: + + dns.query; to_md5; dataset:isset,dns-bl; + +Global config (optional) +------------------------ + +Datasets can optionally be defined in the main config. Sets can also be +declared from the rule syntax. + +Example of sets for tracking unique values:: + + datasets: + ua-seen: + type: string + state: ua-seen.lst + dns-sha256-seen: + type: sha256 + state: dns-sha256-seen.lst + +Rules to go with the above: + +.. container:: example-rule + + alert dns any any -> any any (msg:"dns list test"; dns.query; to_sha256; dataset:isset,dns-sha256-seen; sid:123; rev:1;) + +.. container:: example-rule + + alert http any any -> any any (msg: "http user-agent test"; http.user_agent; dataset:set,ua-seen; sid:234; rev:1;) + +It is also possible to optionally define global default memcap and hashsize. + +Example:: + + datasets: + defaults: + memcap: 100mb + hashsize: 2048 + ua-seen: + type: string + load: ua-seen.lst + +or define memcap and hashsize per dataset. + +Example:: + + datasets: + ua-seen: + type: string + load: ua-seen.lst + memcap: 10mb + hashsize: 1024 + +.. note:: The `hashsize` should be close to the amount of entries in the dataset to avoid collisions. If it's set too low, this could result in rather long startup time. + +Rule keywords +------------- + +dataset +~~~~~~~ + +Datasets are binary: something is in the set or it's not. + +Syntax:: + + dataset:<cmd>,<name>,<options>; + + dataset:<set|isset|isnotset>,<name> \ + [, type <string|md5|sha256|ipv4|ip>, save <file name>, load <file name>, state <file name>, memcap <size>, hashsize <size>]; + +type <type> + the data type: string, md5, sha256, ipv4, ip +load <file name> + file name for load the data when Suricata starts up +state + sets file name for loading and saving a dataset +save <file name> + advanced option to set the file name for saving the in-memory data + when Suricata exits. +memcap <size> + maximum memory limit for the respective dataset +hashsize <size> + allowed size of the hash for the respective dataset + +.. note:: 'type' is mandatory and needs to be set. + +.. note:: 'load' and 'state' or 'save' and 'state' cannot be mixed. + +Example rules could look like: + +1. Detect unique User-Agents: + +.. container:: example-rule + + alert http any any -> any any (msg:"LOCAL HTTP new UA"; http.user_agent; dataset:set,http-ua-seen, type string, state http-ua-seen.csv; sid:8000001; rev:1;) + +2. Detect unique TLDs: + +.. container:: example-rule + + alert dns $HOME_NET any -> any any (msg:"LOCAL DNS unique TLD"; dns.query; pcrexform:"\\.([^\\.]+)$"; dataset:set,dns-tld-seen, type string, state dns-tld-seen.csv; sid:8000002; rev:1;) + +Following image is a pictorial representation of how the ``pcrexform`` works +on domain names to find TLDs in the dataset ``dns-tld-seen``: + +.. image:: dataset-examples/detect-unique-tlds.png + +Notice how it is not possible to do certain operations alone with datasets +(example 2 above), but, it is possible to use a combination of other rule +keywords. Keep in mind the cost of additional keywords though e.g. in the +second example rule above, negative performance impact can be expected due +to ``pcrexform``. + +datarep +~~~~~~~ + +Data Reputation allows matching data against a reputation list. + +Syntax:: + + datarep:<name>,<operator>,<value>, \ + [, load <file name>, type <string|md5|sha256|ipv4|ip>, memcap <size>, hashsize <size>]; + +Example rules could look like:: + + alert dns any any -> any any (dns.query; to_md5; datarep:dns_md5, >, 200, load dns_md5.rep, type md5, memcap 100mb, hashsize 2048; sid:1;) + alert dns any any -> any any (dns.query; to_sha256; datarep:dns_sha256, >, 200, load dns_sha256.rep, type sha256; sid:2;) + alert dns any any -> any any (dns.query; datarep:dns_string, >, 200, load dns_string.rep, type string; sid:3;) + +In these examples the DNS query string is checked against three different +reputation lists. A MD5 list, a SHA256 list, and a raw string (buffer) list. +The rules will only match if the data is in the list and the reputation +value is higher than 200. + + +Rule Reloads +------------ + +Sets that are defined in the yaml, or sets that only use `state` or `save`, are +considered `dynamic` sets. These are not reloaded during rule reloads. + +Sets that are defined in rules using only `load` are considered `static` tests. +These are not expected to change during runtime. During rule reloads these are +reloaded from disk. This reload is effective when the complete rule reload +process is complete. + + +Unix Socket +----------- + +dataset-add +~~~~~~~~~~~ + +Unix Socket command to add data to a set. On success, the addition becomes +active instantly. + +Syntax:: + + dataset-add <set name> <set type> <data> + +set name + Name of an already defined dataset +type + Data type: string, md5, sha256, ipv4, ip +data + Data to add in serialized form (base64 for string, hex notation for md5/sha256, string representation for ipv4/ip) + +Example adding 'google.com' to set 'myset':: + + dataset-add myset string Z29vZ2xlLmNvbQ== + +dataset-remove +~~~~~~~~~~~~~~ + +Unix Socket command to remove data from a set. On success, the removal becomes +active instantly. + +Syntax:: + + dataset-remove <set name> <set type> <data> + +set name + Name of an already defined dataset +type + Data type: string, md5, sha256, ipv4, ip +data + Data to remove in serialized form (base64 for string, hex notation for md5/sha256, string representation for ipv4/ip) + +dataset-clear +~~~~~~~~~~~~~ + +Unix Socket command to remove all data from a set. On success, the removal becomes +active instantly. + +Syntax:: + + dataset-clear <set name> <set type> + +set name + Name of an already defined dataset +type + Data type: string, md5, sha256, ipv4, ip + +dataset-lookup +~~~~~~~~~~~~~~ + +Unix Socket command to test if data is in a set. + +Syntax:: + + dataset-lookup <set name> <set type> <data> + +set name + Name of an already defined dataset +type + Data type: string, md5, sha256, ipv4, ip +data + Data to test in serialized form (base64 for string, hex notation for md5/sha256, string notation for ipv4/ip) + +Example testing if 'google.com' is in the set 'myset':: + + dataset-lookup myset string Z29vZ2xlLmNvbQ== + +dataset-dump +~~~~~~~~~~~~ + +Unix socket command to trigger a dump of datasets to disk. + +Syntax:: + + dataset-dump + +File formats +------------ + +Datasets use a simple CSV format where data is per line in the file. + +data types +~~~~~~~~~~ + +string + in the file as base64 encoded string +md5 + in the file as hex encoded string +sha256 + in the file as hex encoded string +ipv4 + in the file as string +ip + in the file as string, it can be IPv6 or IPv4 address (standard notation or IPv4 in IPv6 one) + + +dataset +~~~~~~~ + +Datasets have a simple structure, where there is one piece of data +per line in the file. + +Syntax:: + + <data> + +e.g. for ua-seen with type string:: + + TW96aWxsYS80LjAgKGNvbXBhdGlibGU7ICk= + +which when piped to ``base64 -d`` reveals its value:: + + Mozilla/4.0 (compatible; ) + + +datarep +~~~~~~~ + +The datarep format follows the dataset, expect that there are 1 more CSV +field: + +Syntax:: + + <data>,<value> + +.. _datasets_file_locations: + +File Locations +-------------- + +Dataset filenames configured in the ``suricata.yaml`` can exist +anywhere on your filesytem. + +When a dataset filename is specified in rule, the following *rules* +are applied: + +- For ``load``, the filename is opened relative to the rule file + containing the rule. Absolute filenames and parent directory + traversals are allowed. +- For ``save`` and ``state`` the filename is relative to + ``$LOCALSTATEDIR/suricata/data``. On many installs this will be + ``/var/lib/suricata/data``, but run ``suricata --build-info`` and + check the value of ``--localstatedir`` to verify this location onn + your installation. + + - Absolute filenames, or filenames containing parent directory + traversal (``..``) are not allowed unless the configuration + paramater ``datasets.allow-absolute-filenames`` is set to + ``true``. + +.. _datasets_security: + +Security +-------- + +As datasets potentially allow a rule distributor write access to your +system with ``save`` and ``state`` dataset rules, the locations +allowed are strict by default, however there are two dataset options +to tune the security of rules utilizing dataset filenames:: + + datasets: + rules: + # Set to true to allow absolute filenames and filenames that use + # ".." components to reference parent directories in rules that specify + # their filenames. + allow-absolute-filenames: false + + # Allow datasets in rules write access for "save" and + # "state". This is enabled by default, however write access is + # limited to the data directory. + allow-write: true + +By setting ``datasets.rules.allow-write`` to false, all ``save`` and +``state`` rules will fail to load. This option is enabled by default +to preserve compatiblity with previous 6.0 Suricata releases, however +may change in a future major release. + +Pre-Suricata 6.0.13 behavior can be restored by setting +``datasets.rules.allow-absolute-filenames`` to ``true``, however +allowing so will allow any rule to overwrite any file on your system +that Suricata has write access to. |