Adding upstream version 1.44.3.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-19 02:57:58 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-19 02:57:58 +0000
commit: be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 (patch)
tree: 9754ff1ca740f6346cf8483ec915d4054bc5da2d /health/guides/hdfs/hdfs_dead_nodes.md
parent: Initial commit. (diff)
1 files changed, 44 insertions, 0 deletions
diff --git a/health/guides/hdfs/hdfs_dead_nodes.md b/health/guides/hdfs/hdfs_dead_nodes.md
new file mode 100644
index 00000000..9c65a0c6
--- /dev/null
+++ b/health/guides/hdfs/hdfs_dead_nodes.md
@@ -0,0 +1,44 @@
+### Understand the Alert
+
+The Netdata Agent monitors the number of DataNodes that are currently dead. Receiving this alert indicates that there are dead DataNodes in your HDFS cluster. The NameNode marks a DataNode as dead if it receives no heartbeat message from it for approximately 10 minutes. Any block replicas stored on a dead DataNode are no longer available to HDFS.
+
+This alert enters the critical state when the number of dead DataNodes is 1 or more.
+
+### Troubleshoot the Alert
+
+1. Generate an HDFS report to identify the dead DataNodes.
+
+    ```
+    root@netdata # hdfs dfsadmin -report
+    ```
+
+    Inspect the output and check which DataNode is dead.
+
+2. Connect to the dead DataNode and check its logs. You can also check the status of the system service for errors; the service name (`hadoop` in this example) may vary depending on your installation.
+
+    ```
+    root@netdata # systemctl status hadoop
+    ```
+
+    Restart the service if needed.
+
+
+3. Verify that network connectivity between the NameNode and the DataNodes is functional. You can use tools like `ping` and `traceroute` to confirm connectivity.
+
+4. Check the logs of the dead DataNode(s) for any issues. The log location may vary depending on your installation, but you can typically find the logs in the `/var/log/hadoop-hdfs/` directory. Analyze them to identify any errors that may have caused the DataNode to be marked dead.
+
+    ```
+    root@netdata # tail -f /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
+    ```
+
+5. If the DataNode service is not running or has crashed, attempt to restart it.
+
+    ```
+    root@netdata # systemctl restart hadoop
+    ```
+
+### Useful resources
+
+1. [Hadoop Commands Guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html)
+
+Remember that troubleshooting and resolving issues, especially in a production environment, requires a good understanding of the system and its architecture. Proceed with caution, and always back up your data and confirm that the environment is safe before performing any action.