Our approach, depending on the monitoring tools already in place:
Prometheus and Grafana: If this stack is in use, we can deploy Prometheus agents on all Kubernetes nodes to collect metrics such as CPU and memory utilization, and configure alerts in Grafana with Prometheus as the data source.
Managed Services: For managed resources such as Kafka and MemoryDB, we'll route their CloudWatch metrics into Grafana for alerting.
Functional Alerts: For functional checks (e.g., failed runs), we'll send metrics to Prometheus via a connector, enabling alerting in Grafana; a minimal sketch of such a connector follows this list.
Datadog/Splunk: If Datadog or Splunk is the monitoring tool of choice, we can deploy the corresponding agents (e.g., the Datadog Agent) to monitor the entire infrastructure.
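
For illustration, here is a minimal sketch of the functional-alerts connector mentioned above: a job pushes its outcome to a Prometheus Pushgateway, and a Grafana alert then fires on the pushed metric. The gateway address, job name, and metric names are illustrative assumptions, not agreed-upon values.

```python
# Minimal sketch of a functional-alert "connector" using the Pushgateway.
# Assumes the prometheus_client package and a reachable Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway.monitoring.svc:9091"  # assumed in-cluster address

def report_run(job_name: str, succeeded: bool) -> None:
    """Push the outcome of a functional check so Grafana can alert on it."""
    registry = CollectorRegistry()
    last_run_failed = Gauge(
        "job_last_run_failed",
        "1 if the most recent run of this job failed, 0 otherwise",
        registry=registry,
    )
    last_run_ts = Gauge(
        "job_last_run_timestamp_seconds",
        "Unix time of the most recent run of this job",
        registry=registry,
    )
    last_run_failed.set(0 if succeeded else 1)
    last_run_ts.set_to_current_time()
    # Metrics are grouped under the given job label on the Pushgateway.
    push_to_gateway(PUSHGATEWAY, job=job_name, registry=registry)

if __name__ == "__main__":
    report_run("nightly-etl", succeeded=False)
```

A Grafana alert on `job_last_run_failed == 1`, or on a stale `job_last_run_timestamp_seconds`, would then surface failed runs.
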
Alerts
| Alert Name | Threshold | Severity |
| --- | --- | --- |
| CPU Utilization (%) | > 70% | L1 Warning |
| CPU Utilization (%) | > 90% | L2 Critical |
| FreeLocalStorage | < 10 GB | L1 Warning |
| FreeLocalStorage | < 5 GB | L2 Critical |
| Database Connections (count) | < 50 | L1 Warning |
| Database Connections (count) | < 25 | L2 Critical |
| Read Latency | > 3 s | L1 Warning |
| Read Latency | > 5 s | L2 Critical |
| Write Latency | > 3 s | L1 Warning |
| Write Latency | > 5 s | L2 Critical |
| Database Memory Usage (%) | > 85% | L1 Warning |
| Database Memory Usage (%) | > 90% | L2 Critical |
| Engine CPU Utilization (%) | > 85% | L1 Warning |
| Engine CPU Utilization (%) | > 90% | L2 Critical |
| Client connections over last hour | < 15 | L2 Critical |
| Authentication failures over last hour | > 3 | L2 Critical |
| Disk usage by broker (%) | > 80% | L1 Warning |
| Disk usage by broker (%) | > 90% | L2 Critical |
| CPU (user) usage by broker (%) | > 80% | L1 Warning |
| CPU (user) usage by broker (%) | > 90% | L2 Critical |
| Lag on topics | > 500 | L1 Warning |
| Lag on topics | > 1000 | L2 Critical |
| Kafka partition count per broker | > 1000 | L2 Critical |
| Kafka connection count per broker | <= 0 | L2 Critical |
| Container in waiting status | > 1 min | L2 Critical |
| Container restarts (over last 10 min) | > 5 | L2 Critical |
| Container terminated with error (over last 10 min) | >= 1 | L2 Critical |
| Pod high CPU usage (%) | > 80% | L1 Warning |
| Pod high CPU usage (%) | > 90% | L2 Critical |
| Pod high memory usage (%) | > 80% | L1 Warning |
| Pod high memory usage (%) | > 90% | L2 Critical |
| Kubernetes pod crash looping (over last 5 min) | >= 1 | L2 Critical |
| Node Not Ready (duration) | > 4 min | L1 Warning |
| Node Not Ready (duration) | > 5 min | L2 Critical |
| Node high CPU utilization (%) | > 80% | L1 Warning |
| Node high CPU utilization (%) | > 90% | L2 Critical |
| Kubernetes PVC available space (%) | < 20% | L1 Warning |
| Kubernetes PVC available space (%) | < 10% | L2 Critical |
| Kubernetes PVC Pending / Lost (over last 10 min) | >= 1 | L2 Critical |
| Full GC events on pods (over last 5 min) | > 0 | L2 Critical |
| Free IP addresses in subnet | 0 | L2 Critical |
| 401 status code | > 50 hits within 5 minutes | L1 Warning |
| 500 status code | > 10 hits within 1 minute | L1 Warning |
| Status codes above 500 | > 10 hits within 1 minute | L1 Warning |
| SQL routine load errors | > 10 hits within 5 minutes | L1 Warning |
| gRPC exceptions | > 5 hits within 5 minutes | L1 Warning |
| NullPointerException errors | > 5 hits within 5 minutes | L1 Warning |
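
As an example of how one row of this table could translate into a Prometheus query, the sketch below checks the "Pod high CPU usage" thresholds over the Prometheus HTTP API. The Prometheus address and the metric names (from cAdvisor and kube-state-metrics) are assumptions that depend on the exporters actually deployed; in practice the same expression would live in a Prometheus or Grafana alert rule rather than a script.

```python
# Checks the "Pod high CPU usage > 80%" threshold via the Prometheus HTTP API.
# URL and metric names are assumptions for illustration only.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed address

# CPU usage as a percentage of each pod's CPU limit, averaged over 5 minutes.
POD_CPU_PERCENT = (
    'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
    ' / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})'
    " * 100"
)

def pods_over_threshold(threshold: float) -> list[tuple[str, str, float]]:
    """Return (namespace, pod, cpu_percent) for pods above the threshold."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f"({POD_CPU_PERCENT}) > {threshold}"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [
        (r["metric"]["namespace"], r["metric"]["pod"], float(r["value"][1]))
        for r in result
    ]

if __name__ == "__main__":
    for namespace, pod, pct in pods_over_threshold(80.0):  # L1 Warning threshold
        print(f"{namespace}/{pod}: {pct:.1f}% CPU")
```

The same expression with a `> 90` comparison would cover the L2 Critical threshold; the other table rows would follow the same pattern with their respective metrics.
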