Diffstat (limited to 'health/guides/hdfs')
 health/guides/hdfs/hdfs_capacity_usage.md     | 42 ++++++++
 health/guides/hdfs/hdfs_dead_nodes.md         | 44 ++++++++
 health/guides/hdfs/hdfs_missing_blocks.md     | 47 +++++++++
 health/guides/hdfs/hdfs_num_failed_volumes.md | 44 ++++++++
 health/guides/hdfs/hdfs_stale_nodes.md        | 46 ++++++++
 5 files changed, 223 insertions(+), 0 deletions(-)
diff --git a/health/guides/hdfs/hdfs_capacity_usage.md b/health/guides/hdfs/hdfs_capacity_usage.md
new file mode 100644
index 00000000..666dcdc2
--- /dev/null
+++ b/health/guides/hdfs/hdfs_capacity_usage.md
@@ -0,0 +1,42 @@
+### Understand the alert
+
+This alert calculates the percentage of used space capacity across all DataNodes in the Hadoop Distributed File System (HDFS). If you receive this alert, it means that space utilization across your HDFS DataNodes is high.
+
+The alert enters a warning state when the percentage of used space capacity across all DataNodes is between 70-80%, and a critical state when it is between 80-90%.
+
+### Troubleshoot the alert
+
+Data is priceless. Before you perform any action, make sure that you have taken any necessary backup steps. Netdata is not liable for any loss or corruption of any data, database, or software.
+
+#### Check your Disk Usage across the cluster
+
+1. Inspect the Disk Usage for each DataNode:
+
+   ```
+   root@netdata # hdfs dfsadmin -report
+   ```
+
+   If all DataNodes are under disk pressure, you should consider adding more disk space. Otherwise, you can rebalance data between the DataNodes.
+
+2. Perform a balance:
+
+ ```
+   root@netdata # hdfs balancer -threshold 15
+ ```
+
+   This means the balancer will move blocks from over-utilized to under-utilized nodes until each DataNode's disk usage differs from the cluster's average utilization by no more than 15 percent.
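That threshold rule can be sketched with plain arithmetic. This is an illustrative sketch with made-up utilization numbers, not output from a real cluster:

```
# Sketch of the balancer threshold rule (illustrative numbers only).
# A node is over-utilized when its usage exceeds cluster average + threshold,
# and under-utilized when it falls below cluster average - threshold.
classify_usage() {
    usage=$1; avg=$2; threshold=$3
    if [ "$usage" -gt $((avg + threshold)) ]; then
        echo "over-utilized"
    elif [ "$usage" -lt $((avg - threshold)) ]; then
        echo "under-utilized"
    else
        echo "balanced"
    fi
}

# Cluster average 60% used, threshold 15 (as in `hdfs balancer -threshold 15`):
for u in 85 70 40; do
    echo "usage ${u}%: $(classify_usage "$u" 60 15)"
done
```

With these numbers, only the 85% node would donate blocks and only the 40% node would receive them; the 70% node is within the threshold and is left alone.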
+
+#### Investigate high disk usage
+
+1. Review your Hadoop applications, jobs, and scripts that write data to HDFS. Identify the ones with excessive disk usage or logging.
+
+2. Optimize or refactor these applications, jobs, or scripts to reduce their disk usage.
+
+3. Delete any unnecessary or temporary files from HDFS, if safe to do so.
+
+4. Consider data compression or deduplication strategies, if applicable, to reduce storage usage in HDFS.
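To see where the space is going before deleting anything, you can rank paths by their HDFS footprint. A minimal sketch that sorts `hdfs dfs -du`-style output (`<bytes> <path>` per line); the sample lines and paths below are made up and stand in for a live cluster:

```
# top_consumers: sort "<bytes> <path>" lines descending by size, keep the N largest.
top_consumers() {
    sort -rn | head -n "$1"
}

# On a real cluster you would pipe the live report:
#   hdfs dfs -du / | top_consumers 5
# Sample data stands in for the live command here:
printf '%s\n' \
    '1099511627776 /user/etl' \
    '53687091200 /tmp' \
    '10737418240 /user/alice' | top_consumers 2
```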
+
+### Useful resources
+
+1. [Apache Hadoop on Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
+2. [HDFS architecture](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
\ No newline at end of file
diff --git a/health/guides/hdfs/hdfs_dead_nodes.md b/health/guides/hdfs/hdfs_dead_nodes.md
new file mode 100644
index 00000000..9c65a0c6
--- /dev/null
+++ b/health/guides/hdfs/hdfs_dead_nodes.md
@@ -0,0 +1,44 @@
+### Understand the alert
+
+The Netdata Agent monitors the number of DataNodes that are currently dead. Receiving this alert indicates that there are dead DataNodes in your HDFS cluster. The NameNode characterizes a DataNode as dead if no heartbeat message is exchanged for approximately 10 minutes. Any data that was registered to a dead DataNode is not available to HDFS anymore.
+
+The alert enters a critical state when there is at least one dead DataNode.
+
+### Troubleshoot the alert
+
+1. Identify the dead DataNode(s).
+
+   ```
+   root@netdata # hdfs dfsadmin -report
+   ```
+
+   Inspect the output and check which DataNode is dead.
+
+2. Connect to the DataNode and check its logs. You can also check the status of the system services (the service name varies between installations; `hadoop` is used here as an example).
+
+   ```
+   root@netdata # systemctl status hadoop
+   ```
+
+   Restart the service if needed.
+
+3. Verify that the network connectivity between NameNode and DataNodes is functional. You can use tools like `ping` and `traceroute` to confirm the connectivity.
+
+4. Check the logs of the dead DataNode(s) for any issues. Log location may vary depending on your installation, but you can typically find them in the `/var/log/hadoop-hdfs/` directory. Analyze the logs to identify any errors or issues that may have caused the DataNode to become dead.
+
+ ```
+ root@netdata # tail -f /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
+ ```
+
+5. If the DataNode service is not running or has crashed, attempt to restart it.
+
+ ```
+ root@netdata # systemctl restart hadoop
+ ```
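The dead-node count from step 1 can also be extracted from the report mechanically, which is handy for quick checks. A minimal sketch, assuming the `Dead datanodes (N):` header printed by recent Hadoop releases (older versions format the report differently):

```
# dead_count: extract N from a "Dead datanodes (N):" header on stdin.
dead_count() {
    sed -n 's/^Dead datanodes (\([0-9]*\)).*/\1/p'
}

# Live usage would be:  hdfs dfsadmin -report | dead_count
# Sample report lines stand in for a real cluster here:
printf '%s\n' 'Live datanodes (3):' 'Dead datanodes (1):' | dead_count
```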
+
+### Useful resources
+
+1. [Hadoop Commands Guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html)
+
+Remember that troubleshooting and resolving issues, especially in a production environment, requires a good understanding of the system and its architecture. Proceed with caution and always take the necessary backups before performing any action.
diff --git a/health/guides/hdfs/hdfs_missing_blocks.md b/health/guides/hdfs/hdfs_missing_blocks.md
new file mode 100644
index 00000000..49002880
--- /dev/null
+++ b/health/guides/hdfs/hdfs_missing_blocks.md
@@ -0,0 +1,47 @@
+### Understand the alert
+
+This alert monitors the number of missing blocks in a Hadoop Distributed File System (HDFS). If you receive this alert, it means that there is at least one missing block in one of the DataNodes. This issue could be caused by a problem with the underlying storage or filesystem of a DataNode.
+
+### Troubleshoot the alert
+
+#### Fix corrupted or missing blocks
+
+Before you perform any action, make sure that you have taken any necessary backup steps. Netdata is not liable for any loss or corruption of any data, database, or software.
+
+1. Identify which files are facing issues.
+
+```sh
+root@netdata # hdfs fsck / -list-corruptfileblocks
+```
+
+Inspect the output and track the path(s) to the corrupted files.
+
+2. Determine where the file's blocks might live. If the file is larger than your block size, it consists of multiple blocks.
+
+```sh
+root@netdata # hdfs fsck <path_to_corrupted_file> -locations -blocks -files
+```
+
+This command will print out locations for every "problematic" block.
+
+3. Search in the corresponding DataNode and the NameNode's logs for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines. Use `fsck`.
+
+4. If there are files or blocks that you cannot fix, you must delete them so that the HDFS becomes healthy again.
+
+- For a specific file:
+
+```sh
+root@netdata # hdfs dfs -rm <path_to_file_with_unrecoverable_blocks>
+```
+
+- For all the "problematic" files:
+
+```sh
+root@netdata # hdfs fsck / -delete
+```
+
+Note that this permanently removes every file with unrecoverable blocks, so review the list from step 1 first.
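Before deleting anything, it helps to collect the affected file paths into a single list for review. A minimal sketch; the `blk_<id><TAB><path>` line format is an assumption, so verify it against your Hadoop version's `fsck` output:

```sh
# corrupt_paths: keep only the file path column of "blk_<id><TAB><path>" lines.
corrupt_paths() {
    awk -F'\t' '/^blk_/ {print $2}' | sort -u
}

# Live usage:  hdfs fsck / -list-corruptfileblocks | corrupt_paths
# A sample line (hypothetical path) stands in for real output:
printf 'blk_1073741825\t/data/events/part-0000\n' | corrupt_paths
```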
+
+### Useful resources
+
+1. [Apache Hadoop on Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
+2. [HDFS Architecture](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
+3. [fsck man page](https://linux.die.net/man/8/fsck)
\ No newline at end of file
diff --git a/health/guides/hdfs/hdfs_num_failed_volumes.md b/health/guides/hdfs/hdfs_num_failed_volumes.md
new file mode 100644
index 00000000..bdb23f24
--- /dev/null
+++ b/health/guides/hdfs/hdfs_num_failed_volumes.md
@@ -0,0 +1,44 @@
+### Understand the alert
+
+This alert is triggered when the number of failed volumes in your Hadoop Distributed File System (HDFS) cluster increases. A failed volume may be due to hardware failure or misconfiguration, such as duplicate mounts. When a single volume fails on a DataNode, the entire node may go offline depending on the `dfs.datanode.failed.volumes.tolerated` setting for your cluster. This can lead to increased network traffic and potential performance degradation as the NameNode needs to copy any under-replicated blocks lost on that node.
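The shutdown decision can be sketched as a plain comparison (illustrative only; the real check happens inside the DataNode process). On a live node, the tolerated value can be read with `hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated`:

```bash
# volumes_verdict: will a DataNode shut down, given its failed-volume count
# and dfs.datanode.failed.volumes.tolerated (default 0)?
volumes_verdict() {
    failed=$1; tolerated=$2
    if [ "$failed" -gt "$tolerated" ]; then
        echo "DataNode shuts down"
    else
        echo "DataNode keeps running"
    fi
}

volumes_verdict 1 0   # with the default tolerance, one failed volume stops the node
volumes_verdict 1 2   # a tolerance of 2 keeps the node up
```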
+
+### Troubleshoot the alert
+
+#### 1. Identify which DataNode has a failing volume
+
+Use the `hdfs dfsadmin -report` command to identify DataNodes that are offline:
+
+```bash
+root@netdata # hdfs dfsadmin -report
+```
+
+Find any nodes that are missing from the output of the command. If all nodes are listed, you'll need to run the next command for each DataNode.
+
+#### 2. Review the volumes status
+
+Use the `hdfs dfsadmin -getVolumeReport` command, specifying the DataNode hostname and IPC port:
+
+```bash
+root@netdata # hdfs dfsadmin -getVolumeReport datanodehost:port
+```
+
+#### 3. Inspect the DataNode logs
+
+Connect to the affected DataNode and check its logs using `journalctl -xe`. If you have the Netdata Agent running on the DataNodes, you should be able to identify the problem. You may also receive alerts about the disks and mounts on this system.
+
+#### 4. Take necessary actions
+
+Based on the information gathered in the previous steps, take appropriate actions to resolve the issue. This may include:
+
+- Repairing or replacing faulty hardware.
+- Fixing misconfigurations such as duplicate mounts.
+- Ensuring that the HDFS processes are running on the affected DataNode.
+- Ensuring that the affected DataNode is properly communicating with the NameNode.
+
+**Note**: When working with HDFS, it's essential to have proper backups of your data. Netdata is not responsible for any loss or corruption of data, database, or software.
+
+### Useful resources
+
+1. [Apache Hadoop on Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
+2. [HDFS architecture](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
+3. [HDFS 3.3.1 commands guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
diff --git a/health/guides/hdfs/hdfs_stale_nodes.md b/health/guides/hdfs/hdfs_stale_nodes.md
new file mode 100644
index 00000000..71ca50f9
--- /dev/null
+++ b/health/guides/hdfs/hdfs_stale_nodes.md
@@ -0,0 +1,46 @@
+### Understand the alert
+
+The `hdfs_stale_nodes` alert is triggered when there is at least one stale DataNode in the Hadoop Distributed File System (HDFS) due to missed heartbeats. A stale DataNode is one from which the NameNode has not received a heartbeat for `dfs.namenode.stale.datanode.interval` (30 seconds by default). Stale DataNodes are avoided, and are used only as a last-resort target for read and write operations.
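The staleness rule itself is a simple timeout comparison, sketched below in seconds for readability (the actual `dfs.namenode.stale.datanode.interval` value in `hdfs-site.xml` is configured in milliseconds):

```
# is_stale: a DataNode is stale when the time since its last heartbeat
# exceeds the stale interval (30 seconds by default).
is_stale() {
    since_heartbeat_s=$1; interval_s=${2:-30}
    if [ "$since_heartbeat_s" -gt "$interval_s" ]; then
        echo "stale"
    else
        echo "fresh"
    fi
}

is_stale 45   # 45s of silence against the 30s default
is_stale 10
```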
+
+### Troubleshoot the alert
+
+1. Identify the stale node(s)
+
+ Run the following command to generate a report on the state of the HDFS cluster:
+
+ ```
+   hdfs dfsadmin -report
+ ```
+
+ Inspect the output and look for any stale DataNodes.
+
+2. Check the DataNode logs and system services status
+
+ Connect to the identified stale DataNode and check the log of the DataNode for any issues. Also, check the status of the system services.
+
+ ```
+ systemctl status hadoop
+ ```
+
+ If required, restart the HDFS service:
+
+ ```
+ systemctl restart hadoop
+ ```
+
+3. Monitor the HDFS cluster
+
+   After resolving issues identified in the logs or restarting the service, continue to monitor the HDFS cluster to ensure the problem is resolved. Re-run the `hdfs dfsadmin -report` command to check that the stale DataNode status has been cleared.
+
+4. Ensure redundant data storage
+
+ To protect against data loss or unavailability, HDFS stores data in multiple nodes, providing fault tolerance. Make sure that the replication factor for your HDFS cluster is set correctly, typically with a factor of 3, so that data is stored on three different nodes. A higher replication factor will increase data redundancy and reliability.
+
+5. Review HDFS cluster configuration
+
+ Examine the HDFS cluster's configuration settings to ensure that they are appropriate for your specific use case and hardware setup. Identifying performance bottlenecks, such as slow or unreliable network connections, can help avoid stale DataNodes in the future.
+
+### Useful resources
+
+1. [Apache Hadoop on Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
+2. [HDFS Architecture](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
\ No newline at end of file