summaryrefslogtreecommitdiffstats
path: root/docs/guides/troubleshoot
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2023-06-14 19:20:36 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2023-06-14 19:20:36 +0000
commitdd24e74edfbafc09eaeb2dde0fda7eb3e1e86d0b (patch)
tree1e52f4dac2622ab377c7649f218fb49003b4cbb9 /docs/guides/troubleshoot
parentReleasing debian version 1.39.1-2. (diff)
downloadnetdata-dd24e74edfbafc09eaeb2dde0fda7eb3e1e86d0b.tar.xz
netdata-dd24e74edfbafc09eaeb2dde0fda7eb3e1e86d0b.zip
Merging upstream version 1.40.0.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs/guides/troubleshoot')
-rw-r--r--docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md138
1 files changed, 86 insertions, 52 deletions
diff --git a/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md b/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md
index a0e8973f..ad747cb7 100644
--- a/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md
+++ b/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md
@@ -1,31 +1,71 @@
# Troubleshoot Agent-Cloud connectivity issues
-Learn how to troubleshoot the Netdata Agent showing as offline after claiming, so you can connect the Agent to Netdata Cloud.
+Learn how to troubleshoot connectivity issues leading to agents not appearing at all in Netdata Cloud, or
+appearing with a status other than `live`.
-When you are claiming a node, you might not be able to immediately see it online in Netdata Cloud.
-This could be due to an error in the claiming process or a temporary outage of some services.
+After installing an agent with the claiming token provided by Netdata Cloud, you should see charts from that node on
+Netdata Cloud within seconds. If you don't see charts, check if the node appears in the list of nodes
+(Nodes tab, top right Node filter, or Manage Nodes screen). If your node does not appear in the list, or it does appear with a status other than "Live", this guide will help you troubleshoot what's happening.
-We identified some scenarios that might cause this delay and possible actions you could take to overcome each situation.
+ The most common explanation for connectivity issues usually falls into one of the following three categories:
-The most common explanation for the delay usually falls into one of the following three categories:
+- If the node does not appear at all in Netdata Cloud, [the claiming process was unsuccessful](#the-claiming-process-was-unsuccessful).
+- If the node appears as in Netdata Cloud, but is in the "Unseen" state, [the Agent was claimed but can not connect](#the-agent-was-claimed-but-can-not-connect).
+- If the node appears as in Netdata Cloud as "Offline" or "Stale", it is a [previously connected agent that can no longer connect](#previously-connected-agent-that-can-no-longer-connect).
-- [Troubleshoot Agent-Cloud connectivity issues](#troubleshoot-agent-cloud-connectivity-issues)
- - [The claiming process of the kickstart script was unsuccessful](#the-claiming-process-of-the-kickstart-script-was-unsuccessful)
- - [The kickstart script auto-claimed the Agent but there was no error message displayed](#the-kickstart-script-auto-claimed-the-agent-but-there-was-no-error-message-displayed)
- - [Claiming on an older, deprecated version of the Agent](#claiming-on-an-older-deprecated-version-of-the-agent)
- - [Network issues while connecting to the Cloud](#network-issues-while-connecting-to-the-cloud)
- - [Verify that your IP is whitelisted from Netdata Cloud](#verify-that-your-ip-is-whitelisted-from-netdata-cloud)
- - [Make sure that your node has internet connectivity and can resolve network domains](#make-sure-that-your-node-has-internet-connectivity-and-can-resolve-network-domains)
+## The claiming process was unsuccessful
-## The claiming process of the kickstart script was unsuccessful
+If the claiming process fails, the node will not appear at all in Netdata Cloud.
-Here, we will try to define some edge cases you might encounter when claiming a node.
+First ensure that you:
+- Use the newest possible stable or nightly version of the agent (at least v1.32).
+- Your node can successfully issue an HTTPS request to https://api.netdata.cloud
-### The kickstart script auto-claimed the Agent but there was no error message displayed
+Other possible causes differ between kickstart installations and Docker installations.
-The kickstart script will install/update your Agent and then try to claim the node to the Cloud (if tokens are provided). To
-complete the second part, the Agent must be running. In some platforms, the Netdata service cannot be enabled by default
-and you must do it manually, using the following steps:
+### Verify your node can access Netdata Cloud
+
+If you run either `curl` or `wget` to do an HTTPS request to https://api.netdata.cloud, you should get
+back a 404 response. If you do not, check your network connectivity, domain resolution,
+and firewall settings for outbound connections.
+
+If your firewall is configured to completely prevent outbound connections, you need to whitelist `api.netdata.cloud` and `mqtt.netdata.cloud`. If you can't whitelist domains in your firewall, you can whitelist the IPs that the hostnames resolve to, but keep in mind that they can change without any notice.
+
+If you use an outbound proxy, you need to [take some extra steps]( https://github.com/netdata/netdata/blob/master/claim/README.md#connect-through-a-proxy).
+
+### Troubleshoot claiming with kickstart.sh
+
+Claiming is done by executing `netdata-claim.sh`, a script that is usually located under `${INSTALL_PREFIX}/netdata/usr/sbin/netdata-claim.sh`. Possible error conditions we have identified are:
+- No script found at all in any of our search paths.
+- The path where the claiming script should be does not exist.
+- The path exists, but is not a file.
+- The path is a file, but is not executable.
+Check the output of the kickstart script for any reported errors claiming and verify that the claiming script exists
+and can be executed.
+
+### Troubleshoot claiming with Docker
+
+First verify that the NETDATA_CLAIM_TOKEN parameter is correctly configured and then check for any errors during
+initialization of the container.
+
+The most common issue we have seen claiming nodes in Docker is [running on older hosts with seccomp enabled](https://github.com/netdata/netdata/blob/master/claim/README.md#known-issues-on-older-hosts-with-seccomp-enabled).
+
+## The Agent was claimed but can not connect
+
+Agents that appear on the cloud with state "Unseen" have successfully been claimed, but have never
+been able to successfully establish an ACLK connection.
+
+Agents that appear with state "Offline" or "Stale" were able to connect at some point, but are currently not
+connected. The difference between the two is that "Stale" nodes had some of their data replicated to a
+parent node that is still connected.
+
+### Verify that the agent is running
+
+#### Troubleshoot connection establishment with kickstart.sh
+
+The kickstart script will install/update your Agent and then try to claim the node to the Cloud
+(if tokens are provided). To complete the second part, the Agent must be running. In some platforms,
+the Netdata service cannot be enabled by default and you must do it manually, using the following steps:
1. Check if the Agent is running:
@@ -53,17 +93,39 @@ and you must do it manually, using the following steps:
> In some cases a simple restart of the Agent can fix the issue.
> Read more about [Starting, Stopping and Restarting the Agent](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md).
-## Claiming on an older, deprecated version of the Agent
+#### Troubleshoot connection establishment with Docker
+
+If a Netdata container exits or is killed before it properly starts, it may be able to complete the claiming
+process, but not have enough time to establish the ACLK connection.
+
+### Verify that your firewall allows websockets
+
+The agent initiates an SSL connection to `api.netdata.cloud` and then upgrades that connection to use secure
+websockets. Some firewalls completely prevent the use of websockets, even for outbound connections.
+
+## Previously connected agent that can no longer connect
-Make sure that you are using the latest version of Netdata if you are using the [Claiming script](https://github.com/netdata/netdata/blob/master/claim/README.md#claiming-script).
+The states "Offline" and "Stale" suggest that the agent was able to connect at some point in the past, but
+that it is currently not connected.
-With the introduction of our new architecture, Agents running versions lower than `v1.32.0` can face claiming problems, so we recommend you [update the Netdata Agent](https://github.com/netdata/netdata/blob/master/packaging/installer/UPDATE.md) to the latest stable version.
+### Verify that network connectivity is still possible
-## Network issues while connecting to the Cloud
+Verify that you can still issue HTTPS requests to api.netdata.cloud and that no firewall or proxy changes were made.
-### Verify that your IP is whitelisted from Netdata Cloud
+### Verify that the claiming info is persisted
-Most of the nodes change IPs dynamically. It is possible that your current IP has been restricted from accessing `api.netdata.cloud` due to security concerns.
+If you use Docker, verify that the contents of `/var/lib/netdata` are preserved across container restarts, using a persistent volume.
+
+### Verify that the claiming info is not cloned
+
+A relatively common case we have seen especially with VMs is two or more nodes sharing the same credentials.
+This happens if you claim a node in a VM and then create an image based on that node. Netdata can't properly
+work this way, as we have unique node identification information under `/var/lib/netdata`.
+
+### Verify that your IP is not blocked by Netdata Cloud
+
+Most of the nodes change IPs dynamically. It is possible that your current IP has been restricted from accessing `api.netdata.cloud` due to security concerns, usually because it was spamming Netdata Coud with too many
+failed requests (old versions of the agent).
To verify this:
@@ -83,31 +145,3 @@ To verify this:
- Contact our team to whitelist your IP by submitting a ticket in the [Netdata forum](https://community.netdata.cloud/)
- Change your node's IP
-
-### Make sure that your node has internet connectivity and can resolve network domains
-
-1. Try to reach a well known host:
-
- ```bash
- ping 8.8.8.8
- ```
-
-2. If you can reach external IPs, then check your domain resolution.
-
- ```bash
- host api.netdata.cloud
- ```
-
- The expected output should be something like this:
-
- ```bash
- api.netdata.cloud is an alias for main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.
- main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com has address 54.198.178.11
- main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com has address 44.207.131.212
- main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com has address 44.196.50.41
- ```
-
- > ### Info
- >
- > There will be cases in which the firewall restricts network access. In those cases, you need to whitelist `api.netdata.cloud` and `mqtt.netdata.cloud` domains to be able to see your nodes in Netdata Cloud.
- > If you can't whitelist domains in your firewall, you can whitelist the IPs that the above command will produce, but keep in mind that they can change without any notice.