L1/L2 Alerts
Our alerting approach depends on the monitoring tools in place:

  • Prometheus and Grafana: If this stack is in use, we can deploy Prometheus agents on all Kubernetes nodes to collect metrics such as CPU and memory, and configure alerts in Grafana with Prometheus as the data source.

  • Managed Services: For managed resources such as Kafka and MemoryDB, we’ll route their CloudWatch metrics into Grafana for alerting.

  • Functional Alerts: For functional checks (e.g., failed runs), we’ll send metrics to Prometheus via a connector, enabling alerting in Grafana.

  • Datadog/Splunk: If Datadog or Splunk is the monitoring tool, we can deploy the vendor’s agents (e.g., the Datadog Agent) to monitor the entire infrastructure.
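
As a concrete illustration of the Prometheus/Grafana path, the node CPU thresholds from the alerts table below could be expressed as Prometheus alerting rules. This is a sketch only: the group and alert names, the `node_cpu_seconds_total` metric (assumes node_exporter is deployed), and the 5-minute `for` windows are assumptions, not part of this runbook.

```yaml
groups:
  - name: node-cpu-alerts            # illustrative group name
    rules:
      - alert: NodeCPUHighL1Warning
        # Busy CPU % = 100 - idle %; fires after the condition holds for 5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 5m
        labels:
          severity: L1-warning
      - alert: NodeCPUHighL2Critical
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: L2-critical
```

Grafana can evaluate the same PromQL expressions as alert queries with Prometheus configured as the data source.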

Alerts

| Alert Name | Threshold | Severity |
| --- | --- | --- |
| CPU Utilization (percent) | > 70% | L1 Warning |
| CPU Utilization (percent) | > 90% | L2 Critical |
| FreeLocalStorage (bytes) | < 10 GB | L1 Warning |
| FreeLocalStorage (bytes) | < 5 GB | L2 Critical |
| Database Connections (count) | < 50 | L1 Warning |
| Database Connections (count) | < 25 | L2 Critical |
| Read Latency (seconds) | > 3 s | L1 Warning |
| Read Latency (seconds) | > 5 s | L2 Critical |
| Write Latency (seconds) | > 3 s | L1 Warning |
| Write Latency (seconds) | > 5 s | L2 Critical |
| Database Memory Usage (percent) | > 85% | L1 Warning |
| Database Memory Usage (percent) | > 90% | L2 Critical |
| Engine CPU Utilization (percent) | > 85% | L1 Warning |
| Engine CPU Utilization (percent) | > 90% | L2 Critical |
| Client connections (last hour) | < 15 | L2 Critical |
| Authentication failures (last hour) | > 3 | L2 Critical |
| Disk usage by broker | > 80% | L1 Warning |
| Disk usage by broker | > 90% | L2 Critical |
| CPU (user) usage by broker | > 80% | L1 Warning |
| CPU (user) usage by broker | > 90% | L2 Critical |
| Consumer lag on topics | > 500 | L1 Warning |
| Consumer lag on topics | > 1000 | L2 Critical |
| Kafka partition count per broker | > 1000 | L2 Critical |
| Kafka connection count per broker | <= 0 | L2 Critical |
| Container in waiting status | > 1 min | L2 Critical |
| Container restarts (last 10 min) | > 5 | L2 Critical |
| Container terminated with error (last 10 min) | >= 1 | L2 Critical |
| Pod CPU usage (percent) | > 80% | L1 Warning |
| Pod CPU usage (percent) | > 90% | L2 Critical |
| Pod memory usage (percent) | > 80% | L1 Warning |
| Pod memory usage (percent) | > 90% | L2 Critical |
| Pod crash looping (last 5 min) | >= 1 | L2 Critical |
| Node NotReady (duration) | > 4 min | L1 Warning |
| Node NotReady (duration) | > 5 min | L2 Critical |
| Node CPU utilization (percent) | > 80% | L1 Warning |
| Node CPU utilization (percent) | > 90% | L2 Critical |
| PVC available space (percent) | < 20% | L1 Warning |
| PVC available space (percent) | < 10% | L2 Critical |
| PVC Pending / Lost (last 10 min) | >= 1 | L2 Critical |
| Full GC alerts on pods (last 5 min) | > 0 | L2 Critical |
| Subnet free IP addresses | 0 | L2 Critical |
| 401 status codes | > 50 hits within 5 min | L1 Warning |
| 500 status codes | > 10 hits within 1 min | L1 Warning |
| Status codes > 500 | > 10 hits within 1 min | L1 Warning |
| SQL routine load errors | > 10 hits within 5 min | L1 Warning |
| gRPC exceptions | > 5 hits within 5 min | L1 Warning |
| NullPointerException errors | > 5 hits within 5 min | L1 Warning |
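
For the functional-alert path described above (sending check results to Prometheus via a connector), the sketch below shows how a check result can be rendered in the Prometheus text exposition format using only the standard library. The metric name `failed_runs_total`, the `job` label, and the helper itself are hypothetical; in practice a client library (e.g., `prometheus_client`) and a scrape endpoint or Pushgateway would carry this payload.

```python
# Minimal stdlib-only sketch: render one functional-check sample in the
# Prometheus text exposition format so a connector can forward it.
# Metric and label names are illustrative, not from this runbook.

def format_metric(name: str, value: float, labels: dict[str, str]) -> str:
    """Render one sample as: name{label="value",...} value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# A functional check found 3 failed runs in the last window.
sample = format_metric("failed_runs_total", 3, {"job": "nightly_etl"})
print(sample)  # failed_runs_total{job="nightly_etl"} 3
```

A Prometheus rule alerting on `failed_runs_total > 0` would then surface these failures in Grafana alongside the infrastructure alerts.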