summaryrefslogtreecommitdiffstats
path: root/health/guides/hdfs/hdfs_stale_nodes.md
diff options
context:
space:
mode:
Diffstat (limited to 'health/guides/hdfs/hdfs_stale_nodes.md')
-rw-r--r--health/guides/hdfs/hdfs_stale_nodes.md46
1 files changed, 46 insertions, 0 deletions
diff --git a/health/guides/hdfs/hdfs_stale_nodes.md b/health/guides/hdfs/hdfs_stale_nodes.md
new file mode 100644
index 00000000..71ca50f9
--- /dev/null
+++ b/health/guides/hdfs/hdfs_stale_nodes.md
@@ -0,0 +1,46 @@
+### Understand the alert
+
+The `hdfs_stale_nodes` alert is triggered when there is at least one stale DataNode in the Hadoop Distributed File System (HDFS) due to missed heartbeats. A stale DataNode is one that has not been reachable for `dfs.namenode.stale.datanode.interval` (default is 30 seconds). Stale DataNodes are avoided and marked as the last possible target for a read or write operation.
+
+### Troubleshoot the alert
+
+1. Identify the stale node(s)
+
+ Run the following command to generate a report on the state of the HDFS cluster:
+
+ ```
+ hadoop dfsadmin -report
+ ```
+
+ Inspect the output and look for any stale DataNodes.
+
+2. Check the DataNode logs and system services status
+
+ Connect to the identified stale DataNode and check the log of the DataNode for any issues. Also, check the status of the system services.
+
+ ```
+ systemctl status hadoop
+ ```
+
+ If required, restart the HDFS service:
+
+ ```
+ systemctl restart hadoop
+ ```
+
+3. Monitor the HDFS cluster
+
+ After resolving issues identified in the logs or restarting the service, continue to monitor the HDFS cluster to ensure the problem is resolved. Re-run the `hadoop dfsadmin -report` command to check if the stale DataNode status has been cleared.
+
+4. Ensure redundant data storage
+
+ To protect against data loss or unavailability, HDFS stores data in multiple nodes, providing fault tolerance. Make sure that the replication factor for your HDFS cluster is set correctly, typically with a factor of 3, so that data is stored on three different nodes. A higher replication factor will increase data redundancy and reliability.
+
+5. Review HDFS cluster configuration
+
+ Examine the HDFS cluster's configuration settings to ensure that they are appropriate for your specific use case and hardware setup. Identifying performance bottlenecks, such as slow or unreliable network connections, can help avoid stale DataNodes in the future.
+
+### Useful resources
+
+1. [Apache Hadoop on Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
+2. [HDFS Architecture](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) \ No newline at end of file