author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-05 12:08:03 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-05 12:08:18 +0000 |
commit | 5da14042f70711ea5cf66e034699730335462f66 (patch) | |
tree | 0f6354ccac934ed87a2d555f45be4c831cf92f4a /src/health/guides/hdfs/hdfs_dead_nodes.md | |
parent | Releasing debian version 1.44.3-2. (diff) | |
Merging upstream version 1.45.3+dfsg.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/health/guides/hdfs/hdfs_dead_nodes.md')
-rw-r--r-- | src/health/guides/hdfs/hdfs_dead_nodes.md | 44 |
1 files changed, 44 insertions, 0 deletions
diff --git a/src/health/guides/hdfs/hdfs_dead_nodes.md b/src/health/guides/hdfs/hdfs_dead_nodes.md
new file mode 100644
index 000000000..9c65a0c66
--- /dev/null
+++ b/src/health/guides/hdfs/hdfs_dead_nodes.md
@@ -0,0 +1,44 @@
+### Understand the Alert
+
+The Netdata Agent monitors the number of DataNodes that are currently dead. Receiving this alert indicates that there are dead DataNodes in your HDFS cluster. The NameNode characterizes a DataNode as dead if no heartbeat message is exchanged for approximately 10 minutes. Any data that was registered to a dead DataNode is not available to HDFS anymore.
+
+This alert is triggered into critical when the number of dead DataNodes is 1 or more.
+
+### Troubleshoot the Alert
+
+1. Fix corrupted or missing blocks.
+
+   ```
+   root@netdata # hadoop dfsadmin -report
+   ```
+
+   Inspect the output and check which DataNode is dead.
+
+2. Connect to the DataNode and check the log of the DataNode. You can also check for errors in the system services.
+
+   ```
+   root@netdata # systemctl status hadoop
+   ```
+
+   Restart the service if needed.
+
+
+3. Verify that the network connectivity between NameNode and DataNodes is functional. You can use tools like `ping` and `traceroute` to confirm the connectivity.
+
+4. Check the logs of the dead DataNode(s) for any issues. Log location may vary depending on your installation, but you can typically find them in the `/var/log/hadoop-hdfs/` directory. Analyze the logs to identify any errors or issues that may have caused the DataNode to become dead.
+
+   ```
+   root@netdata # tail -f /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
+   ```
+
+5. If the DataNode service is not running or has crashed, attempt to restart it.
+
+   ```
+   root@netdata # systemctl restart hadoop
+   ```
+
+### Useful resources
+
+1. [Hadoop Commands Guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html)
+
+Remember that troubleshooting and resolving issues, especially on a production environment, requires a good understanding of the system and its architecture. Proceed with caution and always ensure data backup and environmental safety before performing any action.
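
The guide added in this commit says the NameNode treats a DataNode as dead after "approximately 10 minutes" without a heartbeat. As a rough sketch of where that figure may come from, the snippet below assumes the commonly documented HDFS rule of twice the recheck interval plus ten heartbeat intervals, with the stock defaults of 300000 ms and 3 s. The formula, the property names in the comments, and the helper name `dead_timeout_seconds` are not part of the commit and are used here only for illustration.

```shell
#!/bin/sh
# Sketch only: estimate how long a DataNode can stay silent before the
# NameNode marks it dead.
# Assumption (not stated in the guide):
#   timeout = 2 * dfs.namenode.heartbeat.recheck-interval (milliseconds)
#           + 10 * dfs.heartbeat.interval (seconds)

# dead_timeout_seconds RECHECK_MS HEARTBEAT_S
# Hypothetical helper; prints the estimated timeout in whole seconds.
dead_timeout_seconds() {
    recheck_ms=$1
    heartbeat_s=$2
    echo $(( (2 * recheck_ms + 10 * 1000 * heartbeat_s) / 1000 ))
}

# Assumed defaults: 300000 ms recheck interval, 3 s heartbeat interval.
dead_timeout_seconds 300000 3   # prints 630, i.e. 10 minutes 30 seconds
```

On a running cluster the two inputs could likely be read with `hdfs getconf -confKey <property>`. Depending on the Hadoop version, the returned values may carry a unit suffix that would need stripping before the arithmetic.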