summaryrefslogtreecommitdiffstats
path: root/health/guides/hdfs/hdfs_dead_nodes.md
diff options
context:
space:
mode:
Diffstat (limited to 'health/guides/hdfs/hdfs_dead_nodes.md')
-rw-r--r--health/guides/hdfs/hdfs_dead_nodes.md44
1 files changed, 0 insertions, 44 deletions
diff --git a/health/guides/hdfs/hdfs_dead_nodes.md b/health/guides/hdfs/hdfs_dead_nodes.md
deleted file mode 100644
index 9c65a0c66..000000000
--- a/health/guides/hdfs/hdfs_dead_nodes.md
+++ /dev/null
@@ -1,44 +0,0 @@
-### Understand the Alert
-
-The Netdata Agent monitors the number of DataNodes that are currently dead. Receiving this alert indicates that there are dead DataNodes in your HDFS cluster. The NameNode characterizes a DataNode as dead if no heartbeat message is exchanged for approximately 10 minutes. Any data that was registered to a dead DataNode is not available to HDFS anymore.
-
-This alert is triggered into critical when the number of dead DataNodes is 1 or more.
-
-### Troubleshoot the Alert
-
-1. Fix corrupted or missing blocks.
-
- ```
- root@netdata # hadoop dfsadmin -report
- ```
-
- Inspect the output and check which DataNode is dead.
-
-2. Connect to the DataNode and check the log of the DataNode. You can also check for errors in the system services.
-
- ```
- root@netdata # systemctl status hadoop
- ```
-
- Restart the service if needed.
-
-
-3. Verify that the network connectivity between NameNode and DataNodes is functional. You can use tools like `ping` and `traceroute` to confirm the connectivity.
-
-4. Check the logs of the dead DataNode(s) for any issues. Log location may vary depending on your installation, but you can typically find them in the `/var/log/hadoop-hdfs/` directory. Analyze the logs to identify any errors or issues that may have caused the DataNode to become dead.
-
- ```
- root@netdata # tail -f /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
- ```
-
-5. If the DataNode service is not running or has crashed, attempt to restart it.
-
- ```
- root@netdata # systemctl restart hadoop
- ```
-
-### Useful resources
-
-1. [Hadoop Commands Guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html)
-
-Remember that troubleshooting and resolving issues, especially on a production environment, requires a good understanding of the system and its architecture. Proceed with caution and always ensure data backup and environmental safety before performing any action.