summaryrefslogtreecommitdiffstats
path: root/health/guides/hdfs/hdfs_dead_nodes.md
blob: 9c65a0c667212db7b73a4a2f97fb9bad6939b90b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
### Understand the Alert

The Netdata Agent monitors the number of DataNodes that are currently dead. Receiving this alert indicates that there are dead DataNodes in your HDFS cluster. The NameNode characterizes a DataNode as dead if no heartbeat message is exchanged for approximately 10 minutes. Any data that was registered to a dead DataNode is not available to HDFS anymore.

This alert is triggered into critical when the number of dead DataNodes is 1 or more.

### Troubleshoot the Alert 

1. Fix corrupted or missing blocks.

    ```
    root@netdata #  hadoop dfsadmin -report
    ```

    Inspect the output and check which DataNode is dead.

2. Connect to the DataNode and check the log of the DataNode. You can also check for errors in the system services.

    ```
    root@netdata #  systemctl status hadoop
    ```

   Restart the service if needed.


3. Verify that the network connectivity between NameNode and DataNodes is functional. You can use tools like `ping` and `traceroute` to confirm the connectivity.

4. Check the logs of the dead DataNode(s) for any issues. Log location may vary depending on your installation, but you can typically find them in the `/var/log/hadoop-hdfs/` directory. Analyze the logs to identify any errors or issues that may have caused the DataNode to become dead.

    ```
    root@netdata # tail -f /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
    ```

5. If the DataNode service is not running or has crashed, attempt to restart it.

    ```
    root@netdata # systemctl restart hadoop
    ```

### Useful resources

1. [Hadoop Commands Guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html)

Remember that troubleshooting and resolving issues, especially on a production environment, requires a good understanding of the system and its architecture. Proceed with caution and always ensure data backup and environmental safety before performing any action.