author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-07-24 09:54:23 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-07-24 09:54:44 +0000
commit: 836b47cb7e99a977c5a23b059ca1d0b5065d310e
tree: 1604da8f482d02effa033c94a84be42bc0c848c3 /docs/netdata-cloud/netdata-cloud-on-prem
parent: Releasing debian version 1.44.3-2.
Merging upstream version 1.46.3.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs/netdata-cloud/netdata-cloud-on-prem')
-rw-r--r-- docs/netdata-cloud/netdata-cloud-on-prem/README.md 77
-rw-r--r-- docs/netdata-cloud/netdata-cloud-on-prem/infrastructure.jpeg bin 0 -> 517302 bytes
-rw-r--r-- docs/netdata-cloud/netdata-cloud-on-prem/installation.md 212
-rw-r--r-- docs/netdata-cloud/netdata-cloud-on-prem/poc-without-k8s.md 70
-rw-r--r-- docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md 37
5 files changed, 396 insertions, 0 deletions
diff --git a/docs/netdata-cloud/netdata-cloud-on-prem/README.md b/docs/netdata-cloud/netdata-cloud-on-prem/README.md
new file mode 100644
index 000000000..49373c454
--- /dev/null
+++ b/docs/netdata-cloud/netdata-cloud-on-prem/README.md
@@ -0,0 +1,77 @@
+# Netdata Cloud On-Prem
+
+Netdata Cloud is built as microservices and is orchestrated by a Kubernetes cluster, providing a highly available and auto-scaled observability platform.
+
+The overall architecture looks like this:
+
+```mermaid
+flowchart TD
+ agents("🌍 <b>Netdata Agents</b><br/>Users' infrastructure<br/>Netdata Children & Parents")
+ users[["🔥 <b>Unified Dashboards</b><br/>Integrated Infrastructure<br/>Dashboards"]]
+ ingress("🛡️ <b>Ingress Gateway</b><br/>TLS termination")
+ traefik((("🔒 <b>Traefik</b><br/>Authentication &<br/>Authorization")))
+ emqx(("📤 <b>EMQX</b><br/>Agents Communication<br/>Message Bus<br/>MQTT"))
+ pulsar(("⚡ <b>Pulsar</b><br/>Internal Microservices<br/>Message Bus"))
+ frontend("🌐 <b>Front-End</b><br/>Static Web Files")
+ auth("👨‍💼 <b>Users &amp; Agents</b><br/>Authorization<br/>Microservices")
+ spaceroom("🏡 <b>Spaces, Rooms,<br/>Nodes, Settings</b><br/>Microservices for<br/>managing Spaces,<br/>Rooms, Nodes and<br/>related settings")
+ charts("📈 <b>Metrics & Queries</b><br/>Microservices for<br/>dispatching queries<br/>to Netdata agents")
+ alerts("🔔 <b>Alerts & Notifications</b><br/>Microservices for<br/>tracking alert<br/>transitions and<br/>deduplicating alerts")
+ sql[("✨ <b>PostgreSQL</b><br/>Users, Spaces, Rooms,<br/>Agents, Nodes, Metric<br/>Names, Metrics Retention,<br/>Custom Dashboards,<br/>Settings")]
+ redis[("🗒️ <b>Redis</b><br/>Caches needed<br/>by Microservices")]
+ elk[("🗞️ <b>Elasticsearch</b><br/>Feed Events Database")]
+ bridges("🤝 <b>Input & Output</b><br/>Microservices bridging<br/>agents to internal<br/>components")
+ notifications("📢 <b>Notifications Integrations</b><br/>Dispatch alert<br/>notifications to<br/>3rd party services")
+ feed("📝 <b>Feed & Events</b><br/>Microservices for<br/>managing the events feed")
+ users --> ingress
+ agents --> ingress
+ ingress --> traefik
+ ingress ==>|agents<br/>websockets| emqx
+ traefik -.- auth
+ traefik ==>|http| spaceroom
+ traefik ==>|http| frontend
+ traefik ==>|http| charts
+ traefik ==>|http| alerts
+ spaceroom o-...-o pulsar
+ spaceroom -.- redis
+ spaceroom x-..-x sql
+ spaceroom -.-> feed
+ charts o-.-o pulsar
+ charts -.- redis
+ charts x-.-x sql
+ charts -..-> feed
+ alerts o-.-o pulsar
+ alerts -.- redis
+ alerts x-.-x sql
+ alerts -..-> feed
+ auth o-.-o pulsar
+ auth -.- redis
+ auth x-.-x sql
+ auth -.-> feed
+ feed <--> elk
+ alerts ----> notifications
+ %% auth ~~~ spaceroom
+ emqx <.-> bridges o-..-o pulsar
+```
+
+## Requirements
+
+The following components are required to run Netdata Cloud On-Prem:
+
+- **Kubernetes cluster** version 1.23+
+- **Kubernetes metrics server** (for autoscaling)
+- **TLS certificate** for secure connections. A single endpoint is required, but there is an option to split the frontend, API, and MQTT endpoints. The certificate must be trusted by all entities connecting to it.
+- Default **storage class configured and working** (persistent volumes based on SSDs are preferred)
+
+The following 3rd party components are used, which can be pulled with the `netdata-cloud-dependency` package we provide:
+
+- **Ingress controller** supporting HTTPS
+- **PostgreSQL** version 13.7 (main database for all metadata Netdata Cloud maintains)
+- **EMQX** version 5.11 (MQTT Broker that allows Agents to send messages to the On-Prem Cloud)
+- **Apache Pulsar** version 2.10+ (message broker for inter-container communication)
+- **Traefik** version 2.7.x (internal API Gateway)
+- **Elasticsearch** version 8.8.x (stores the feed of events)
+- **Redis** version 6.2 (caching)
+- imagePullSecret (our ECR repos are secured)
+
+Keep in mind though that the pulled versions are not configured properly for production use. Customers of Netdata Cloud On-Prem are expected to configure these applications according to their needs and policies for production use. Netdata Cloud On-Prem can be configured to use all these applications as a shared resource from other existing production installations.
diff --git a/docs/netdata-cloud/netdata-cloud-on-prem/infrastructure.jpeg b/docs/netdata-cloud/netdata-cloud-on-prem/infrastructure.jpeg
new file mode 100644
index 000000000..a866e141c
--- /dev/null
+++ b/docs/netdata-cloud/netdata-cloud-on-prem/infrastructure.jpeg
Binary files differ
diff --git a/docs/netdata-cloud/netdata-cloud-on-prem/installation.md b/docs/netdata-cloud/netdata-cloud-on-prem/installation.md
new file mode 100644
index 000000000..259ddb5ce
--- /dev/null
+++ b/docs/netdata-cloud/netdata-cloud-on-prem/installation.md
@@ -0,0 +1,212 @@
+# Netdata Cloud On-Prem Installation
+
+This installation guide assumes that the prerequisites for installing Netdata Cloud On-Prem are satisfied. For more information, please refer to the [requirements documentation](/docs/netdata-cloud/netdata-cloud-on-prem/README.md#requirements).
+
+## Installation Requirements
+
+The following components are required to install Netdata Cloud On-Prem:
+
+- **AWS** CLI
+- **Helm** version 3.12+ with OCI Configuration (explained in the installation section)
+- **Kubectl**
+
+## Preparations for Installation
+
+### Configure AWS CLI
+
+Install [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
+
+There are 2 options for configuring `aws cli` to work with the provided credentials. The first one is to set the environment variables:
+
+```bash
+export AWS_ACCESS_KEY_ID=<your_secret_id>
+export AWS_SECRET_ACCESS_KEY=<your_secret_key>
+```
+
+The second one is to use an interactive shell:
+
+```bash
+aws configure
+```
+
+### Configure helm to use secured ECR repository
+
+Using the `aws` command, generate a token that allows helm to access the secured ECR repository:
+
+```bash
+aws ecr get-login-password --region us-east-1 | helm registry login --username AWS --password-stdin 362923047827.dkr.ecr.us-east-1.amazonaws.com
+```
+
+After this step you should be able to add the repository to your helm or just pull the helm chart:
+
+```bash
+helm pull oci://362923047827.dkr.ecr.us-east-1.amazonaws.com/netdata-cloud-dependency --untar #optional
+helm pull oci://362923047827.dkr.ecr.us-east-1.amazonaws.com/netdata-cloud-onprem --untar
+```
+
+Local folders with the newest versions of the helm charts should appear in your working directory.
+
+## Installation
+
+Netdata provides access to two helm charts:
+
+1. `netdata-cloud-dependency` - required applications for `netdata-cloud-onprem`.
+2. `netdata-cloud-onprem` - the application itself + provisioning
+
+### netdata-cloud-dependency
+
+This helm chart is designed to install the necessary applications:
+
+- Redis
+- Elasticsearch
+- EMQX
+- Apache Pulsar
+- PostgreSQL
+- Traefik
+- Mailcatcher
+- k8s-ecr-login-renew
+- kubernetes-ingress
+
+Although we provide an easy way to install all these applications, we expect users of Netdata Cloud On-Prem to provide production quality versions for them. Therefore, every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-dependency helm chart. All configuration options are described in `README.md` which is a part of the helm chart.
+
+Each component can be enabled/disabled individually. It is done by true/false switches in `values.yaml`. This way, it is easier to migrate to production-grade components gradually.
+
+Unless you prefer otherwise, `k8s-ecr-login-renew` is responsible for calling the AWS API to regenerate the token. This token is then injected into the secret that every node uses to authenticate with the secured ECR when pulling the images.
+
+The default setting in `values.yaml` of `netdata-cloud-onprem` - `.global.imagePullSecrets` is configured to work out of the box with the dependency helm chart.
+
+To install the helm chart, save your changes in `values.yaml` and execute:
+
+```shell
+cd [your helm chart location]
+helm upgrade --wait --install netdata-cloud-dependency -n netdata-cloud --create-namespace -f values.yaml .
+```
+
+Keep in mind that `netdata-cloud-dependency` is provided only as a proof of concept. Users installing Netdata Cloud On-Prem should properly configure these components.
+
+### netdata-cloud-onprem
+
+Every configuration option is available in `values.yaml` in the folder that contains your `netdata-cloud-onprem` helm chart. All configuration options are described in the `README.md` which is a part of the helm chart.
+
+#### Installing Netdata Cloud On-Prem
+
+```shell
+cd [your helm chart location]
+helm upgrade --wait --install netdata-cloud-onprem -n netdata-cloud --create-namespace -f values.yaml .
+```
+
+##### Important notes
+
+1. Installation takes care of provisioning the resources with migration services.
+
+2. During the first installation, a secret called `netdata-cloud-common` is created. It contains several randomly generated entries. Neither deleting the helm chart nor reinstalling the whole On-Prem deletes this secret; it is removed only when a Kubernetes administrator deletes it manually. The content of this secret is critical: the strings it contains are essential parts of encryption. Losing or changing this data will result in data loss.
+
+## Short description of Netdata Cloud microservices
+
+#### cloud-accounts-service
+
+Responsible for user registration & authentication. Manages user account information.
+
+#### cloud-agent-data-ctrl-service
+
+Forwards requests from the cloud to the relevant agents.
+The requests include:
+- Fetching chart metadata from the agent
+- Fetching chart data from the agent
+- Fetching function data from the agent
+
+#### cloud-agent-mqtt-input-service
+
+Forwards MQTT messages emitted by the agent related to the agent entities to the internal Pulsar broker. These include agent connection state updates.
+
+#### cloud-agent-mqtt-output-service
+
+Forwards Pulsar messages emitted in the cloud related to the agent entities to the MQTT broker. From there, the messages reach the relevant agent.
+
+#### cloud-alarm-config-mqtt-input-service
+
+Forwards MQTT messages emitted by the agent related to the alarm-config entities to the internal Pulsar broker. These include the data for the alarm configuration as seen by the agent.
+
+#### cloud-alarm-log-mqtt-input-service
+
+Forwards MQTT messages emitted by the agent related to the alarm-log entities to the internal Pulsar broker. These contain data about the alarm transitions that occurred in an agent.
+
+#### cloud-alarm-mqtt-output-service
+
+Forwards Pulsar messages emitted in the cloud related to the alarm entities to the MQTT broker. From there, the messages reach the relevant agent.
+
+#### cloud-alarm-processor-service
+
+Persists latest alert statuses received from the agent in the cloud.
+Aggregates alert statuses from relevant node instances.
+Exposes API endpoints to fetch alert data for visualization on the cloud.
+Determines if notifications need to be sent when alert statuses change and emits relevant messages to Pulsar.
+Exposes API endpoints to store and return notification-silencing data.
+
+#### cloud-alarm-streaming-service
+
+Responsible for starting the alert stream between the agent and the cloud.
+Ensures that messages are processed in the correct order, and starts a reconciliation process between the cloud and the agent if out-of-order processing occurs.
+
+#### cloud-charts-mqtt-input-service
+
+Forwards MQTT messages emitted by the agent related to the chart entities to the internal Pulsar broker. These include the chart metadata that is used to display relevant charts on the cloud.
+
+#### cloud-charts-mqtt-output-service
+
+Forwards Pulsar messages emitted in the cloud related to the charts entities to the MQTT broker. From there, the messages reach the relevant agent.
+
+#### cloud-charts-service
+
+Exposes API endpoints to fetch the chart metadata.
+Forwards data requests via the `cloud-agent-data-ctrl-service` to the relevant agents to fetch chart data points.
+Exposes API endpoints to call various other endpoints on the agent, for instance, functions.
+
+#### cloud-custom-dashboard-service
+
+Exposes API endpoints to fetch and store custom dashboard data.
+
+#### cloud-environment-service
+
+Serves as the first contact point between the agent and the cloud.
+Returns authentication and MQTT endpoints to connecting agents.
+
+#### cloud-feed-service
+
+Processes incoming feed events and stores them in Elasticsearch.
+Exposes API endpoints to fetch feed events from Elasticsearch.
+
+#### cloud-frontend
+
+Contains the on-prem cloud website. Serves static content.
+
+#### cloud-iam-user-service
+
+Acts as a middleware for authentication on most of the API endpoints. Validates incoming token headers, injects the relevant ones, and forwards the requests.
+
+#### cloud-metrics-exporter
+
+Exports various metrics from an On-Prem Cloud installation. Uses the Prometheus metric exposition format.
+
+#### cloud-netdata-assistant
+
+Exposes API endpoints to fetch human-friendly explanations of various Netdata configuration options, namely the alerts.
+
+#### cloud-node-mqtt-input-service
+
+Forwards MQTT messages emitted by the agent related to the node entities to the internal Pulsar broker. These include the node metadata as well as their connectivity state, either direct or via parents.
+
+#### cloud-node-mqtt-output-service
+
+Forwards Pulsar messages emitted in the cloud related to the node entities to the MQTT broker. From there, the messages reach the relevant agent.
+
+#### cloud-notifications-dispatcher-service
+
+Exposes API endpoints to handle integrations.
+Handles incoming notification messages and uses the relevant channels (email, Slack, ...) to notify relevant users.
+
+#### cloud-spaceroom-service
+
+Exposes API endpoints to fetch and store relations between agents, nodes, spaces, users, and rooms.
+Acts as a provider of authorization for other cloud endpoints.
+Exposes API endpoints to authenticate agents connecting to the cloud.
diff --git a/docs/netdata-cloud/netdata-cloud-on-prem/poc-without-k8s.md b/docs/netdata-cloud/netdata-cloud-on-prem/poc-without-k8s.md
new file mode 100644
index 000000000..6be4066bd
--- /dev/null
+++ b/docs/netdata-cloud/netdata-cloud-on-prem/poc-without-k8s.md
@@ -0,0 +1,70 @@
+# Netdata Cloud On-Prem PoC without k8s
+
+These instructions describe how to install a light version of Netdata Cloud, for clients who do not have a Kubernetes cluster installed. This setup is **only for demonstration purposes**, as it has no built-in resilience against failures of any kind.
+
+## Requirements
+
+- Ubuntu 22.04 (clean installation will work best).
+- 10 CPU Cores and 24 GiB of memory.
+- Shell access with sudo privileges.
+- TLS certificate for Netdata Cloud On-Prem PoC. A single endpoint is required. The certificate must be trusted by all entities connecting to this installation.
+- AWS ID and License Key. We should have provided these to you; if not, contact us at <info@netdata.cloud>.
+
+To install the whole environment, log in to the designated host and run:
+
+```bash
+curl https://netdata-cloud-netdata-static-content.s3.amazonaws.com/provision.sh -o provision.sh
+chmod +x provision.sh
+sudo ./provision.sh install \
+ -key-id "" \
+ -access-key "" \
+ -onprem-license-key "" \
+ -onprem-license-subject "" \
+ -onprem-url "" \
+ -certificate-path "" \
+ -private-key-path ""
+```
+
+What does the script do during installation?
+
+1. Prompts the user to provide:
+ - `-key-id` - AWS ECR access key ID.
+ - `-access-key` - AWS ECR Access Key.
+ - `-onprem-license-key` - Netdata Cloud On-Prem license key.
+ - `-onprem-license-subject` - Netdata Cloud On-Prem license subject.
+ - `-onprem-url` - URL for the On-prem (without http(s) protocol).
+ - `-certificate-path` - path to your PEM encoded certificate.
+ - `-private-key-path` - path to your PEM encoded key.
+
+2. Once all of the above is provided, the installation begins. The script will install:
+ - Helm
+ - Kubectl
+ - AWS CLI
+ - K3s cluster (single node)
+
+3. Once all the required software is installed, the script starts provisioning the K3s cluster with the gathered data.
+
+After cluster provisioning, Netdata is ready to be used.
+
+> WARNING:
+> This script will automatically expose not only Netdata but also a mailcatcher under `<URL from point 1.>/mailcatcher`.
+
+## How to log in?
+
+Only login by email works without further configuration. Every email this Netdata Cloud On-Prem sends will appear in the mailcatcher, which acts as the SMTP server and provides a simple GUI for reading the emails.
+
+Steps:
+
+1. Open Netdata Cloud On-Prem PoC in a web browser at the URL you specified.
+2. Enter your email address and click the button to confirm.
+3. Mailcatcher will catch all the emails, so go to `<URL from point 1.>/mailcatcher`. Find yours and click the link.
+4. You are now logged into Netdata Cloud. Add your first nodes!
+
+## How to remove Netdata Cloud On-Prem PoC?
+
+To uninstall the whole PoC, use the same script that installed it, with the `uninstall` switch.
+
+```shell
+cd <script dir>
+sudo ./provision.sh uninstall
+```
diff --git a/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md b/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md
new file mode 100644
index 000000000..ac8bdf6f8
--- /dev/null
+++ b/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md
@@ -0,0 +1,37 @@
+# Netdata Cloud On-Prem Troubleshooting
+
+Netdata Cloud is a sophisticated piece of software that relies on multiple infrastructure components for its operation.
+
+We assume that your team already properly manages and monitors the components Netdata Cloud depends upon, like the PostgreSQL, Redis and Elasticsearch databases, the Pulsar and EMQX message brokers, the traffic controllers (Ingress and Traefik) and, of course, the health of the Kubernetes cluster itself.
+
+The following are questions that are usually asked by Netdata Cloud On-Prem operators.
+
+## Loading charts takes a long time or ends with an error
+
+The charts service collects data from the agents involved in the query. In most cases, this microservice queries many agents (depending on the Room), and all of them have to reply for the query to be satisfied.
+
+One or more of the following may be the cause:
+
+1. **Slow Netdata Agent or Netdata Agents with unreliable connections**
+
+ If any of the Netdata agents queried is slow or has an unreliable network connection, the query will stall and Netdata Cloud will time out before responding.
+
+ When agents are overloaded or have unreliable connections, we suggest installing more Netdata Parents to provide reliable backends to Netdata Cloud. They will automatically be preferred for all queries, when available.
+
+2. **Poor Kubernetes cluster management**
+
+ Another common issue is poor management of the Kubernetes cluster. When a node of a Kubernetes cluster is saturated, or the limits set on its containers are too low, Netdata Cloud microservices get throttled by Kubernetes and do not get the resources required to process the responses of Netdata agents and aggregate the results for the dashboard.
+
+ We recommend reviewing the throttling of the containers and increasing the limits if required.
+
+3. **Saturated Database**
+
+ Slow responses may also indicate performance issues in the PostgreSQL database.
+
+ Please review the resource utilization of the database server (CPU, Memory, and Disk I/O) and take action to improve the situation.
+
+4. **Messages piling up in Pulsar**
+
+ Depending on the size of the infrastructure being monitored and the resources allocated to Pulsar and the microservices, messages may be piling up. When this happens, you may also notice that node status updates (online, offline, stale) are slow, or that alert transitions take time to appear on the dashboard.
+
+ We recommend reviewing the Pulsar configuration and the resources allocated to the microservices, to ensure that there is no saturation.