Our approach, depending on the monitoring tools already in place:
Prometheus and Grafana: If this stack is in use, we can deploy Prometheus agents on all Kubernetes nodes to collect metrics such as CPU and memory utilization, and configure alerts in Grafana with Prometheus as the data source.
Managed Services: For managed resources such as Kafka and MemoryDB, we'll route their CloudWatch metrics into Grafana for alerting.
Functional Alerts: For functional checks (e.g., failed runs), we'll send metrics to Prometheus via a connector, enabling alerting in Grafana; a minimal sketch of such a connector follows this list.
Datadog/Splunk: If Datadog or Splunk is the monitoring tool of choice, we can deploy the corresponding agents (e.g., the Datadog Agent) to monitor the entire infrastructure.
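
For illustration, here is a minimal sketch of the functional-alerts connector mentioned above: a job pushes its outcome to a Prometheus Pushgateway, and a Grafana alert then fires on the pushed metric. The gateway address, job name, and metric names are illustrative assumptions, not agreed-upon values.

```python
# Minimal sketch of a functional-alert "connector" using the Pushgateway.
# Assumes the prometheus_client package and a reachable Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway.monitoring.svc:9091"  # assumed in-cluster address

def report_run(job_name: str, succeeded: bool) -> None:
    """Push the outcome of a functional check so Grafana can alert on it."""
    registry = CollectorRegistry()
    last_run_failed = Gauge(
        "job_last_run_failed",
        "1 if the most recent run of this job failed, 0 otherwise",
        registry=registry,
    )
    last_run_ts = Gauge(
        "job_last_run_timestamp_seconds",
        "Unix time of the most recent run of this job",
        registry=registry,
    )
    last_run_failed.set(0 if succeeded else 1)
    last_run_ts.set_to_current_time()
    # Metrics are grouped under the given job label on the Pushgateway.
    push_to_gateway(PUSHGATEWAY, job=job_name, registry=registry)

if __name__ == "__main__":
    report_run("nightly-etl", succeeded=False)
```

A Grafana alert on `job_last_run_failed == 1`, or on a stale `job_last_run_timestamp_seconds`, would then surface failed runs.
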
Alerts
| Alert Name | Threshold | Severity |
| --- | --- | --- |
| CPU Utilization (%) | > 70% | L1 Warning |
| CPU Utilization (%) | > 90% | L2 Critical |
| FreeLocalStorage | < 10 GB | L1 Warning |
| FreeLocalStorage | < 5 GB | L2 Critical |
| Database Connections (count) | < 50 | L1 Warning |
| Database Connections (count) | < 25 | L2 Critical |
| Read Latency | > 3 s | L1 Warning |
| Read Latency | > 5 s | L2 Critical |
| Write Latency | > 3 s | L1 Warning |
| Write Latency | > 5 s | L2 Critical |
| Database Memory Usage (%) | > 85% | L1 Warning |
| Database Memory Usage (%) | > 90% | L2 Critical |
| Engine CPU Utilization (%) | > 85% | L1 Warning |
| Engine CPU Utilization (%) | > 90% | L2 Critical |
| Client connections over last hour | < 15 | L2 Critical |
| Authentication failures over last hour | > 3 | L2 Critical |
| Disk usage by broker (%) | > 80% | L1 Warning |
| Disk usage by broker (%) | > 90% | L2 Critical |
| CPU (user) usage by broker (%) | > 80% | L1 Warning |
| CPU (user) usage by broker (%) | > 90% | L2 Critical |
| Lag on topics | > 500 | L1 Warning |
| Lag on topics | > 1000 | L2 Critical |
| Kafka partition count per broker | > 1000 | L2 Critical |
| Kafka connection count per broker | <= 0 | L2 Critical |
| Container in waiting status | > 1 min | L2 Critical |
| Container restarts (over last 10 min) | > 5 | L2 Critical |
| Container terminated with error (over last 10 min) | >= 1 | L2 Critical |
| Pod high CPU usage (%) | > 80% | L1 Warning |
| Pod high CPU usage (%) | > 90% | L2 Critical |
| Pod high memory usage (%) | > 80% | L1 Warning |
| Pod high memory usage (%) | > 90% | L2 Critical |
| Kubernetes pod crash looping (over last 5 min) | >= 1 | L2 Critical |
| Node Not Ready (duration) | > 4 min | L1 Warning |
| Node Not Ready (duration) | > 5 min | L2 Critical |
| Node high CPU utilization (%) | > 80% | L1 Warning |
| Node high CPU utilization (%) | > 90% | L2 Critical |
| Kubernetes PVC available space (%) | < 20% | L1 Warning |
| Kubernetes PVC available space (%) | < 10% | L2 Critical |
| Kubernetes PVC Pending / Lost (over last 10 min) | >= 1 | L2 Critical |
| Full GC events on pods (over last 5 min) | > 0 | L2 Critical |
| Free IP addresses in subnet | 0 | L2 Critical |
| 401 status code | > 50 hits within 5 minutes | L1 Warning |
| 500 status code | > 10 hits within 1 minute | L1 Warning |
| Status codes above 500 | > 10 hits within 1 minute | L1 Warning |
| SQL routine load errors | > 10 hits within 5 minutes | L1 Warning |
| gRPC exceptions | > 5 hits within 5 minutes | L1 Warning |
| NullPointerException errors | > 5 hits within 5 minutes | L1 Warning |
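
As an example of how one row of this table could translate into a Prometheus query, the sketch below checks the "Pod high CPU usage" thresholds over the Prometheus HTTP API. The Prometheus address and the metric names (from cAdvisor and kube-state-metrics) are assumptions that depend on the exporters actually deployed; in practice the same expression would live in a Prometheus or Grafana alert rule rather than a script.

```python
# Checks the "Pod high CPU usage > 80%" threshold via the Prometheus HTTP API.
# URL and metric names are assumptions for illustration only.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed address

# CPU usage as a percentage of each pod's CPU limit, averaged over 5 minutes.
POD_CPU_PERCENT = (
    'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
    ' / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})'
    " * 100"
)

def pods_over_threshold(threshold: float) -> list[tuple[str, str, float]]:
    """Return (namespace, pod, cpu_percent) for pods above the threshold."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f"({POD_CPU_PERCENT}) > {threshold}"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [
        (r["metric"]["namespace"], r["metric"]["pod"], float(r["value"][1]))
        for r in result
    ]

if __name__ == "__main__":
    for namespace, pod, pct in pods_over_threshold(80.0):  # L1 Warning threshold
        print(f"{namespace}/{pod}: {pct:.1f}% CPU")
```

The same expression with a `> 90` comparison would cover the L2 Critical threshold; the other table rows would follow the same pattern with their respective metrics.
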