Diffstat:
 -rw-r--r--  health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_05.md  | 35
 -rw-r--r--  health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_09.md  | 58
 -rw-r--r--  health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_099.md | 58
 -rw-r--r--  health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_05.md   | 59
 -rw-r--r--  health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_09.md   | 45
 -rw-r--r--  health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_099.md  | 36
 -rw-r--r--  health/guides/kubelet/kubelet_node_config_error.md                    | 56
 -rw-r--r--  health/guides/kubelet/kubelet_operations_error.md                     | 61
 -rw-r--r--  health/guides/kubelet/kubelet_token_requests.md                       | 44
 9 files changed, 452 insertions(+), 0 deletions(-)
diff --git a/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_05.md b/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_05.md
new file mode 100644
index 00000000..595fae8a
--- /dev/null
+++ b/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_05.md
@@ -0,0 +1,35 @@
+### Troubleshoot the alert
+
+1. Check Kubelet logs
+ To diagnose issues with the PLEG relist process, look at the Kubelet logs. Log in to the affected node and fetch them with:
+
+ ```
+ journalctl -u kubelet
+ ```
+
+ Look for any error messages related to PLEG or the container runtime.
+
+2. Check container runtime status
+ Monitor the health and performance of the container runtime (e.g., Docker or containerd) with the appropriate commands, such as `docker ps` and `docker info` for Docker, or `ctr version` and `crictl info` (if `crictl` is installed) for containerd. Check the container runtime logs for any issues as well.
+
+3. Inspect node resources
+ Verify if the node is overloaded or under excessive pressure by checking the CPU, memory, disk, and network resources. Use tools like `top`, `vmstat`, `df`, and `iostat`. You can also use the Kubernetes `kubectl top node` command to view resource utilization on your nodes.
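+
+ As a quick first pass, the following commands (assuming the usual Linux utilities are installed on the node) summarize CPU, memory, disk, and per-node Kubernetes usage:
+
+ ```
+ top -b -n 1 | head -n 15      # snapshot of load, CPU and memory, plus the busiest processes
+ vmstat 1 5                    # memory, swap and CPU activity sampled over 5 seconds
+ df -h                         # filesystem usage
+ iostat -x 1 3                 # extended disk I/O statistics
+ kubectl top node <node_name>  # Kubernetes view of the node's resource utilization
+ ```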
+
+4. Limit maximum Pods per node
+ To avoid overloading nodes in your cluster, consider limiting the maximum number of Pods that can run on a single node. You can follow these steps to update the max Pods value:
+
+ - Edit the Kubelet configuration file on the affected node (usually `/var/lib/kubelet/config.yaml` on kubeadm-based clusters; note that `/etc/kubernetes/kubelet.conf` is the Kubelet's kubeconfig, not its configuration file).
+ - Change the value of the `maxPods` parameter to a more appropriate number. The default value is 110.
+ - Restart the Kubelet service with `systemctl restart kubelet` or `service kubelet restart`.
+ - Check the Kubelet logs to ensure the new value is effective.
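+
+ For example, on a kubeadm-based node (a rough sketch; paths can differ between distributions):
+
+ ```
+ # inspect the current value (the default is 110)
+ grep maxPods /var/lib/kubelet/config.yaml
+ # after editing the file, restart the Kubelet...
+ sudo systemctl restart kubelet
+ # ...and confirm the node now reports the expected Pod capacity
+ kubectl get node <node_name> -o jsonpath='{.status.capacity.pods}'
+ ```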
+
+5. Check Pod eviction thresholds
+ Review the Pod eviction thresholds defined in the Kubelet configuration, which might cause Pods to be evicted due to resource pressure. Adjust the threshold values if needed.
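+
+ A quick way to see what is currently configured (assuming the kubeadm default configuration path):
+
+ ```
+ grep -A 5 -i eviction /var/lib/kubelet/config.yaml
+ ```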
+
+6. Investigate Pods causing high relisting latency
+ Analyze the Pods running on the affected node and identify any Pods that might be causing high PLEG relist latency. These could be Pods with a large number of containers or high resource usage. Consider optimizing or removing these Pods if they are not essential to your workload.
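+
+ To list the Pods scheduled on the affected node and rank them by resource usage (a sketch; `<node_name>` is a placeholder and `kubectl top` requires the metrics server):
+
+ ```
+ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node_name> -o wide
+ kubectl top pods --all-namespaces --sort-by=cpu | head -n 20
+ ```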
+
+### Useful resources
+
+1. [Kubelet CLI in Kubernetes official docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
+2. [PLEG mechanism explained on Red Hat's blog](https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes/)
diff --git a/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_09.md b/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_09.md
new file mode 100644
index 00000000..05c03064
--- /dev/null
+++ b/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_09.md
@@ -0,0 +1,58 @@
+### Understand the alert
+
+This alert indicates that the Kubelet's average Pod Lifecycle Event Generator (PLEG) relisting latency over the last 10 seconds (quantile 0.9) has increased significantly compared to the last minute. Sustained high PLEG latency can cause the node to be marked unavailable (NotReady) due to a `PLEG is not healthy` event.
+
+### Troubleshoot the alert
+
+1. Check for high node resource usage
+
+ First, ensure that the node does not have an overly high number of Pods. High resource usage could increase the PLEG relist latency, leading to poor Kubelet performance. You can check the current number of running Pods on a node using the following command:
+
+ ```
+ kubectl get pods --all-namespaces -o wide | grep <node-name>
+ ```
+
+2. Check Kubelet logs for errors
+
+ Inspect the Kubelet logs for any errors that might be causing the increased PLEG relist latency. You can check the Kubelet logs using the following command:
+
+ ```
+ sudo journalctl -u kubelet
+ ```
+
+ Look for any errors associated with PLEG or the container runtime, such as Docker or containerd.
+
+3. Check container runtime health
+
+ If you find any issues in the Kubelet logs related to the container runtime, investigate the health of the container runtime, such as Docker or containerd, and its logs to identify any issues:
+
+ - For Docker, you can check its health using:
+
+ ```
+ sudo docker info
+ sudo journalctl -u docker
+ ```
+
+ - For containerd, you can check its health using:
+
+ ```
+ sudo ctr version
+ sudo journalctl -u containerd
+ ```
+
+4. Adjust the maximum number of Pods per node
+
+ If you have configured your cluster manually (e.g., with `kubeadm`), you can update the value of max Pods in the Kubelet configuration file. The default file location is `/var/lib/kubelet/config.yaml`. Change the `maxPods` value according to your requirements and restart the Kubelet service:
+
+ ```
+ sudo systemctl restart kubelet
+ ```
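+
+ You can then verify the limit the node is actually advertising (a quick check; `<node_name>` is a placeholder):
+
+ ```
+ kubectl get node <node_name> -o jsonpath='{.status.capacity.pods}'
+ ```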
+
+5. Monitor the PLEG relist latency
+
+ After making any necessary changes, continue monitoring the PLEG relist latency to ensure the issue has been resolved.
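+
+ Besides the Netdata dashboard, you can inspect the raw Kubelet metric through the API server proxy (a sketch; on recent Kubernetes releases the metric is named `kubelet_pleg_relist_duration_seconds`, on older ones `kubelet_pleg_relist_latency_microseconds`):
+
+ ```
+ kubectl get --raw "/api/v1/nodes/<node_name>/proxy/metrics" | grep pleg_relist
+ ```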
+
+### Useful resources
+
+1. [Kubelet CLI in Kubernetes official docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
+2. [PLEG mechanism explained on Red Hat's blog](https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes#)
\ No newline at end of file
diff --git a/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_099.md b/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_099.md
new file mode 100644
index 00000000..76f1123e
--- /dev/null
+++ b/health/guides/kubelet/kubelet_10s_pleg_relist_latency_quantile_099.md
@@ -0,0 +1,58 @@
+### Understand the alert
+
+This alert is related to the Kubernetes Kubelet, the primary node agent responsible for ensuring that containers run in a Pod. The alert specifically relates to the Pod Lifecycle Event Generator (PLEG) module, which periodically relists the container runtime state and keeps the Pod cache up to date. When there is a significant increase in the PLEG relisting time, you'll receive a `kubelet_10s_pleg_relist_latency_quantile_099` alert.
+
+### Troubleshoot the alert
+
+Follow the steps below to troubleshoot this alert:
+
+1. Check the container runtime health status
+
+ If you are using Docker as the container runtime, run the following command:
+
+ ```
+ sudo docker info
+ ```
+
+ Check for any reported errors or issues.
+
+ If you are using a different container runtime like containerd or CRI-O, refer to the respective documentation for health check commands.
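+
+ If `crictl` is installed on the node, it offers a runtime-agnostic status check for CRI-compatible runtimes:
+
+ ```
+ sudo crictl info
+ ```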
+
+2. Check Kubelet logs for any errors.
+
+ You can do this by running the following command:
+
+ ```
+ sudo journalctl -u kubelet -n 1000
+ ```
+
+ Look for any relevant error messages or warnings in the output.
+
+3. Validate that the node is not overloaded with too many Pods.
+
+ Run the following commands:
+
+ ```
+ kubectl get nodes
+ kubectl describe node <node_name>
+ ```
+
+ Adjust the max number of Pods per node if needed, either by setting `maxPods` in the Kubelet configuration file (`/var/lib/kubelet/config.yaml` on kubeadm-based clusters) or by adding the `--max-pods=<NUMBER>` flag to the Kubelet drop-in `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`, and then restarting the Kubelet:
+
+ ```
+ sudo systemctl daemon-reload
+ sudo systemctl restart kubelet
+ ```
+
+4. Check for issues related to the underlying storage or network.
+
+ Inspect the Node's storage and ensure there are no I/O limitations or bottlenecks causing the increased latency. Also, check for network-related issues that could affect the communication between the Kubelet and the container runtime.
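+
+ A few starting points on the node (assuming the usual Linux utilities are available; `<api_server_host>` is a placeholder for your control plane endpoint):
+
+ ```
+ iostat -x 1 3               # look for saturated disks or very high await/util values
+ df -h && df -i              # check for space or inode exhaustion
+ ping -c 3 <api_server_host> # basic connectivity from the node to the control plane
+ ```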
+
+5. Verify the performance and health of the Kubernetes API server.
+
+ High workload on the API server could affect the Kubelet's ability to communicate and process Pod updates. Check the API server logs and metrics to find any performance bottlenecks or errors.
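+
+ A quick health probe of the API server (the `readyz` endpoint is available on reasonably recent Kubernetes versions, and the label selector assumes a kubeadm-managed control plane):
+
+ ```
+ kubectl get --raw '/readyz?verbose'
+ kubectl -n kube-system get pods -l component=kube-apiserver
+ ```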
+
+### Useful resources
+
+1. [Kubelet CLI in Kubernetes official docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
+2. [PLEG mechanism explained on Red Hat's blog](https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes#)
\ No newline at end of file
diff --git a/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_05.md b/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_05.md
new file mode 100644
index 00000000..b448c4d9
--- /dev/null
+++ b/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_05.md
@@ -0,0 +1,59 @@
+### Understand the alert
+
+This alert is related to Kubernetes and is triggered when the median (0.5 quantile) `Pod Lifecycle Event Generator (PLEG)` relisting latency over the last minute is higher than the expected threshold. If you receive this alert, it means that the kubelet is experiencing latency issues, which may affect the scheduling and management of your Kubernetes Pods.
+
+### What is PLEG?
+
+The Pod Lifecycle Event Generator (PLEG) is a component within the kubelet responsible for keeping track of changes (events) to the Pod and updating the kubelet's internal status. This ensures that the kubelet can successfully manage and schedule Pods on the Kubernetes node.
+
+### What does relisting latency mean?
+
+Relisting latency refers to the time taken by the PLEG to detect, process, and update the kubelet about the events or changes in a Pod's lifecycle. High relisting latency can lead to delays in the kubelet reacting to these changes, which can affect the overall functioning of the Kubernetes cluster.
+
+### Troubleshoot the alert
+
+1. Check the kubelet logs for any errors or warnings related to PLEG:
+
+ ```
+ sudo journalctl -u kubelet
+ ```
+
+ Look for any logs related to PLEG delays, issues, or timeouts.
+
+2. Restart the kubelet if necessary:
+
+ ```
+ sudo systemctl restart kubelet
+ ```
+
+ Sometimes, restarting the kubelet can resolve sporadic latency issues.
+
+3. Monitor the Kubernetes node's resource usage (CPU, Memory, Disk) using `kubectl top nodes`:
+
+ ```
+ kubectl top nodes
+ ```
+
+ If the node's resource usage is too high, consider scaling your cluster or optimizing workloads.
+
+4. Check the overall health of your Kubernetes cluster:
+
+ ```
+ kubectl get nodes
+ kubectl get pods --all-namespaces
+ ```
+
+ These commands will help you identify any issues with other nodes or Pods in your cluster.
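+
+ To narrow the output down to Pods that are not in the `Running` phase (a convenient filter):
+
+ ```
+ kubectl get pods --all-namespaces --field-selector status.phase!=Running
+ ```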
+
+5. Investigate the specific Pods experiencing latency in PLEG:
+
+ ```
+ kubectl describe pod <pod_name> -n <namespace>
+ ```
+
+ Look for any signs of the Pod being stuck in a pending state, startup issues, or container crashes.
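+
+ The Pod's recent events can also be listed directly (same placeholders as above):
+
+ ```
+ kubectl get events -n <namespace> --field-selector involvedObject.name=<pod_name> --sort-by=.lastTimestamp
+ ```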
+
+### Useful resources
+
+1. [Kubernetes Kubelet - PLEG](https://kubernetes.io/docs/concepts/overview/components/#kubelet)
+2. [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
diff --git a/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_09.md b/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_09.md
new file mode 100644
index 00000000..6c71f1cf
--- /dev/null
+++ b/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_09.md
@@ -0,0 +1,45 @@
+### Understand the alert
+
+This alert tracks the Pod Lifecycle Event Generator (PLEG) relisting latency over the last minute, using the 0.9 quantile. It is related to the Kubelet, a critical component of the Kubernetes cluster that ensures containers run correctly inside Pods. If you receive this alert, it means that the relisting latency has increased in your Kubernetes cluster, possibly affecting the performance of your workloads.
+
+### What does PLEG relisting latency mean?
+
+In Kubernetes, the PLEG is responsible for keeping track of container lifecycle events, such as container start, stop, or pause. It periodically relists the containers on the node, generates the corresponding events, and updates the Pod status, ensuring the Kubelet and other components know the correct state of the containers. An increased relisting latency can lead to slower Pod status updates and overall degraded performance.
+
+### What does 0.9 quantile mean?
+
+The 0.9 quantile is the value below which 90% of the relisting latencies fall. When this alert triggers, even the 90th-percentile latency has exceeded the configured threshold; in other words, at least 10% of relists are taking noticeably longer than expected, which could lead to issues in your cluster.
+
+### Troubleshoot the alert
+
+1. Check Kubelet logs for errors or warnings related to PLEG:
+
+ Access the logs of the Kubelet component running on the affected node:
+
+ ```
+ sudo journalctl -u kubelet
+ ```
+
+2. Monitor the overall performance of your Kubernetes cluster:
+
+ Use `kubectl top nodes` to check the resource usage of your nodes and identify any bottlenecks, such as high CPU or memory consumption.
+
+3. Check the status of Pods:
+
+ Use `kubectl get pods --all-namespaces` to check the status of all Pods in your cluster. Look for Pods in an abnormal state (e.g., Pending, CrashLoopBackOff, or Terminating), which could be related to high PLEG relisting latency.
+
+4. Analyze Pod logs for issues:
+
+ Investigate the logs of the affected Pods to understand any issues with the container lifecycle events:
+
+ ```
+ kubectl logs <pod-name> -n <namespace>
+ ```
+
+5. Review the Kubelet configuration:
+
+ Ensure that your Kubelet configuration is set up correctly to handle your workloads. If necessary, adjust the settings to improve PLEG relisting performance.
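+
+ One way to see the configuration the Kubelet is actually running with is its `configz` endpoint, reachable through the API server proxy (a sketch; `<node-name>` is a placeholder and the endpoint may be restricted in hardened clusters):
+
+ ```
+ kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | python3 -m json.tool
+ ```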
+
+### Useful resources
+
+1. [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
diff --git a/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_099.md b/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_099.md
new file mode 100644
index 00000000..39e03162
--- /dev/null
+++ b/health/guides/kubelet/kubelet_1m_pleg_relist_latency_quantile_099.md
@@ -0,0 +1,36 @@
+### Understand the alert
+
+This alert tracks the 0.99 quantile of the Pod Lifecycle Event Generator (PLEG) relisting latency over the last minute, measured in microseconds. If you receive this alert, it means that the Kubelet's PLEG latency is high, which can slow down Pod management in your Kubernetes cluster.
+
+### What does PLEG latency mean?
+
+Pod Lifecycle Event Generator (PLEG) is a component of the Kubelet that watches for container events on the system and generates events for a pod's lifecycle. High PLEG latency indicates a delay in processing these events, which can cause delays in pod startup, termination, and updates.
+
+### Troubleshoot the alert
+
+1. Check the overall Kubelet performance and system load:
+
+ a. Run `kubectl get nodes` to check the status of the nodes in your cluster.
+ b. Investigate the node with high PLEG latency using `kubectl describe node <NODE_NAME>` to view detailed information about resource usage and events.
+ c. Use monitoring tools like `top`, `htop`, or `vmstat` to check for high CPU, memory, or disk usage on the node.
+
+2. Look for problematic pods or containers:
+
+ a. Run `kubectl get pods --all-namespaces` to check the status of all pods across namespaces.
+ b. Use `kubectl logs <POD_NAME> -n <NAMESPACE>` to check the logs of the pods in the namespace.
+ c. Investigate pods with high restart counts, crash loops, or other abnormal statuses.
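+
+ For example, to surface the Pods with the most container restarts (a sketch; the JSONPath sort key looks at the first container of each Pod):
+
+ ```
+ kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'
+ ```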
+
+3. Verify Kubelet configurations and logs:
+
+ a. Check the Kubelet configuration on the node. Look for any misconfigurations or settings that could cause high latency.
+ b. Check Kubelet logs using `journalctl -u kubelet` for more information about PLEG events and errors.
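+
+ To focus on PLEG-related entries only, a simple filter such as the following can help:
+
+ ```
+ journalctl -u kubelet | grep -iE 'pleg|relist'
+ ```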
+
+4. Consider evaluating your workloads and scaling your cluster:
+
+ a. If you have multiple nodes experiencing high PLEG latency or if the overall load on your nodes is consistently high, you might need to scale your cluster.
+ b. Evaluate your workloads and adjust resource requests and limits to make the best use of your available resources.
+
+### Useful resources
+
+1. [Understanding the Kubernetes Kubelet](https://kubernetes.io/docs/concepts/overview/components/#kubelet)
+2. [Troubleshooting Kubernetes Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
diff --git a/health/guides/kubelet/kubelet_node_config_error.md b/health/guides/kubelet/kubelet_node_config_error.md
new file mode 100644
index 00000000..695a479c
--- /dev/null
+++ b/health/guides/kubelet/kubelet_node_config_error.md
@@ -0,0 +1,56 @@
+### Understand the alert
+
+This alert, `kubelet_node_config_error`, is related to the Kubernetes Kubelet component. If you receive this alert, it means that there is a configuration-related error in one of the nodes in your Kubernetes cluster.
+
+### What is Kubernetes Kubelet?
+
+Kubernetes Kubelet is an agent that runs on each node in a Kubernetes cluster. It ensures that containers are running in a pod and manages the lifecycle of those containers.
+
+### Troubleshoot the alert
+
+1. Identify the node with the configuration error
+
+ The alert should provide information about the node experiencing the issue. You can also use the `kubectl get nodes` command to list all nodes in your cluster and their statuses:
+
+ ```
+ kubectl get nodes
+ ```
+
+2. Check the Kubelet logs on the affected node
+
+ The Kubelet logs can be found on each node of your cluster. Log in to the affected node and check them using either `journalctl` or the log files in `/var/log/`.
+
+ ```
+ journalctl -u kubelet
+ ```
+ or
+ ```
+ sudo cat /var/log/kubelet.log
+ ```
+
+ Look for any error messages related to the configuration issue or other problems.
+
+3. Review and update the node configuration
+
+ Based on the error messages you found in the logs, review the Kubelet configuration on the affected node. You might need to update the Kubelet configuration file (for example, `/var/lib/kubelet/config.yaml` on kubeadm-based clusters) or other related files specific to your setup.
+
+ If any changes are made, don't forget to restart the Kubelet service on the affected node:
+
+ ```
+ sudo systemctl restart kubelet
+ ```
+
+4. Check the health of the cluster
+
+ After the configuration issue is resolved, make sure to check the health of your cluster using `kubectl`:
+
+ ```
+ kubectl get nodes
+ ```
+
+ Ensure that all nodes are in a `Ready` state and no errors are reported for the affected node.
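+
+ You can also review the node's reported conditions and any recent Node events in one place (`<node_name>` is a placeholder):
+
+ ```
+ kubectl describe node <node_name> | grep -A 10 'Conditions:'
+ kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node_name>
+ ```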
+
+### Useful resources
+
+1. [Kubernetes Documentation: Kubelet](https://kubernetes.io/docs/concepts/overview/components/#kubelet)
+2. [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
\ No newline at end of file
diff --git a/health/guides/kubelet/kubelet_operations_error.md b/health/guides/kubelet/kubelet_operations_error.md
new file mode 100644
index 00000000..870993b5
--- /dev/null
+++ b/health/guides/kubelet/kubelet_operations_error.md
@@ -0,0 +1,61 @@
+### Understand the alert
+
+This alert indicates that there is an increase in the number of Docker or runtime operation errors in your Kubernetes cluster's kubelet. A high number of errors can affect the overall stability and performance of your cluster.
+
+### What are Docker or runtime operation errors?
+
+Docker or runtime operation errors are errors that occur while the kubelet is managing container-related operations. These errors can be related to creating, starting, stopping, or deleting containers in your Kubernetes cluster.
+
+### Troubleshoot the alert
+
+1. Check kubelet logs:
+
+ You need to inspect the kubelet logs of the affected nodes to find more information about the reported errors. SSH into the affected node and use the following command to stream the kubelet logs:
+
+ ```
+ journalctl -u kubelet -f
+ ```
+
+ Look for any error messages or patterns that could indicate a problem.
+
+2. Inspect containers' logs:
+
+ If an error is related to a specific container, you can inspect the logs of that container using the following command:
+
+ ```
+ kubectl logs <pod_name> -n <namespace>
+ ```
+
+ Replace `<pod_name>` and `<namespace>` with the appropriate values. To target a specific container inside the Pod, add `-c <container_name>`.
+
+3. Check Docker or runtime logs:
+
+ On the affected node, check Docker or container runtime logs for any issues:
+
+ - For Docker, use: `journalctl -u docker`
+ - For containerd, use: `journalctl -u containerd`
+ - For CRI-O, use: `journalctl -u crio`
+
+4. Examine Kubernetes events:
+
+ Run the following command to see recent events in your cluster:
+
+ ```
+ kubectl get events
+ ```
+
+ Look for any error messages or patterns that could indicate a kubelet or container-related problem.
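+
+ To see the most recent events across all namespaces, sorted by time (a convenient variant of the command above):
+
+ ```
+ kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30
+ ```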
+
+5. Verify resource allocation:
+
+ Ensure that the node has enough resources available (such as CPU, memory, and disk space) for the containers running on it. You can use commands like `kubectl describe node <node_name>` or monitor your cluster resources using Netdata.
+
+6. Investigate other issues:
+
+ If the above steps didn't reveal the cause of the errors, investigate other potential causes, such as network issues, filesystem corruption, hardware problems, or misconfigurations.
+
+### Useful resources
+
+1. [Kubernetes Debugging and Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
+2. [Troubleshoot the Kubelet](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application-introspection/)
+3. [Access Clusters Using the Kubernetes API](https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/)
\ No newline at end of file
diff --git a/health/guides/kubelet/kubelet_token_requests.md b/health/guides/kubelet/kubelet_token_requests.md
new file mode 100644
index 00000000..28d70241
--- /dev/null
+++ b/health/guides/kubelet/kubelet_token_requests.md
@@ -0,0 +1,44 @@
+### Understand the alert
+
+This alert is related to Kubernetes Kubelet token requests. It monitors the number of failed `Token()` requests to an alternate token source. If you receive this alert, it means that your system is experiencing an increased rate of token request failures.
+
+### What does a token request in Kubernetes mean?
+
+In Kubernetes, tokens are used for authentication purposes when making requests to the API server. The Kubelet uses tokens to authenticate itself when it needs to access cluster information or manage resources on the API server.
+
+### Troubleshoot the alert
+
+Investigate the reason behind the failed token requests by working through the following steps:
+
+1. Check the Kubelet logs for any error messages or warnings related to the token requests. You can use the following command to view the logs:
+
+ ```
+ journalctl -u kubelet
+ ```
+
+ Look for any entries related to `Token()` request failures or authentication issues.
+
+2. Verify the alternate token source configuration
+
+ Review the Kubelet configuration file, usually located at `/var/lib/kubelet/config.yaml` on kubeadm-based clusters (the exact path varies by distribution). Check the `authentication` and `authorization` sections to ensure all the required settings have been correctly configured.
+
+ Make sure that the specified alternate token source is available and working correctly.
+
+3. Check the API server logs
+
+ Inspect the logs of the API server to identify any issues that may prevent the Kubelet from successfully requesting tokens. Use the following command to view the logs:
+
+ ```
+ kubectl logs -n kube-system kube-apiserver-<YOUR_NODE_NAME>
+ ```
+
+ Look for any entries related to authentication, especially if they are connected to the alternate token source.
+
+4. Monitor kubelet_token_requests metric
+
+ Keep an eye on the `kubelet_token_requests` metric using the Netdata dashboard or a monitoring system of your choice. If the number of failed requests continues to increase, this might indicate an underlying issue that requires further investigation.
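+
+ If you want to look at the raw Kubelet counters behind this chart, you can grep the Kubelet metrics endpoint through the API server proxy (a sketch; exact metric names vary across Kubernetes versions and `<node_name>` is a placeholder):
+
+ ```
+ kubectl get --raw "/api/v1/nodes/<node_name>/proxy/metrics" | grep -i token
+ ```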
+
+### Useful resources
+
+1. [Understanding Kubernetes authentication](https://kubernetes.io/docs/reference/access-authn-authz/authentication/)
+2. [Kubelet configuration reference](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/)