
Set Up Watcher

Ask Monk to set up monitoring after deploying to a cluster:
set up watcher
Monk configures autonomous monitoring and asks for a Slack webhook URL to send alerts.

What Watcher Monitors

The Watcher AI agent continuously monitors your cluster for:
Peer (node) health:
  • CPU, memory, and disk usage across all nodes
  • Sustained high resource usage (80%+ for 5+ minutes)
  • Resource pressure and capacity issues
Runnable (container) health:
  • Container crashes and restart loops
  • Failed liveness checks
  • Service unavailability
Crash detection:
  • Repeated crashes within thresholds
  • Crash loop patterns

Slack Integration

Getting a Slack Webhook

  1. Go to your Slack workspace settings
  2. Navigate to Apps → Add apps
  3. Search for “Incoming Webhooks” and add it
  4. Create a webhook for your alerts channel (e.g., #monk-alerts)
  5. Copy the webhook URL (starts with https://hooks.slack.com/...)
  6. Provide it to Monk when prompted during setup
Alternatively, tell Monk:
configure Slack alerts for watcher
Monk will request the webhook URL via a secure form.
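
Before handing the URL to Monk, you can confirm the webhook works by posting a test message yourself. A minimal sketch in Python (the requests library and the placeholder URL are assumptions; Slack incoming webhooks accept a JSON body with a "text" field and reply with "ok"):

# Send a test message to a Slack Incoming Webhook to confirm the URL works.
# Replace WEBHOOK_URL with the URL copied from Slack (placeholder shown).
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

response = requests.post(
    WEBHOOK_URL,
    json={"text": "Test: Monk Watcher alerts will arrive in this channel."},
    timeout=10,
)
response.raise_for_status()  # a valid webhook returns HTTP 200 with body "ok"
print(response.status_code, response.text)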

How Watcher Alerts Work

Watcher sends two types of Slack messages: alerts when issues are detected, and recovery notifications when issues resolve.

Alert: Issue Detected

When Watcher detects a problem, it sends a detailed Slack message.
Example - Sustained High CPU:
⚠️ AI Assessment

Sustained high CPU usage on templates/local/calc/worker 
(25f2489db3582169426dde8ce142f556-ocal-calc-worker-worker) 
is driving node CPU to ~80%+, with repeated in-app warnings 
but no crashes yet.

Recommendation:
First, confirm this is a true capacity issue rather than a 
transient spike by continuing to monitor CPU usage and runtime 
behavior for templates/local/calc/worker and its container 
25f2489db3582169426dde8ce142f556-ocal-calc-worker-worker, 
paying attention to any emerging errors, timeouts, or increased 
latency in dependent services.

If high CPU remains sustained, adjust the capacity for 
templates/local/calc/worker by either increasing the CPU 
resources available to the worker container...

Target: templates/local/calc/worker/25f2489...
Type: agent_notice
Severity: warning
Source: ai-agent

Recovery: Issue Resolved

When the issue resolves, Watcher sends a recovery notification:
✅ AI Assessment

Recovery: CPU usage on peer rrs-do and container 
25f2489db3582169426dde8ce142f556-ocal-calc-worker-worker 
for templates/local/calc/worker has normalized to ~78-83%, 
with workloads running normally and no new errors beyond 
prior high-CPU warnings.

Recommendation:
Treat this as an informational recovery event and keep the 
current configuration unchanged for now, but continue to 
observe CPU usage on peer rrs-do and the 
templates/local/calc/worker runnable to confirm it remains 
stable under typical load.

Target: templates/local/calc/worker/25f2489...
Severity: recovery
Source: ai-agent

Alert Components

Each Watcher alert includes:
Context:
  • Assessment Type: Warning, Alert, or Recovery (with appropriate emoji)
  • Problem Description: What’s happening in plain language
  • Technical Details: Affected service/container identifiers
AI Analysis:
  • Diagnosis: Root cause based on logs, metrics, and patterns
  • Recommendations: Specific, actionable steps to address the issue
  • Context: Distinguishes real issues from transient spikes
Metadata:
  • Target: Affected service/container
  • Severity: info, warning, alert, recovery
  • Source: watcher-agent (metrics) or ai-agent (AI analysis)
  • Type: agent_notice, crash_alert, resource_alert, etc.
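
As an illustration only, these fields can be pictured as a small record. A hypothetical sketch in Python (the class and field names are assumptions that mirror the alert examples above, not Watcher's actual schema):

# Hypothetical shape of a Watcher alert, mirroring the fields listed above.
# Illustration only; not Watcher's actual data model.
from dataclasses import dataclass

@dataclass
class WatcherAlert:
    target: str          # affected service/container
    severity: str        # "info", "warning", "alert", or "recovery"
    source: str          # "watcher-agent" (metrics) or "ai-agent" (AI analysis)
    type: str            # "agent_notice", "crash_alert", "resource_alert", ...
    description: str     # what's happening, in plain language
    recommendation: str  # actionable steps from the AI analysis

example = WatcherAlert(
    target="templates/local/calc/worker",
    severity="warning",
    source="ai-agent",
    type="agent_notice",
    description="Sustained high CPU usage driving node CPU to ~80%+",
    recommendation="Confirm a true capacity issue, then adjust worker capacity.",
)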

Understanding Watcher’s AI Analysis

Watcher uses AI to provide intelligent monitoring, not just threshold alerts:

1. Pattern Recognition

Transient spike vs sustained issue: Watcher distinguishes between:
  • Brief CPU spike during startup → No alert
  • Sustained 80%+ CPU for 5+ minutes → Alert with analysis
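
The underlying idea is a sustained-window check rather than a single-sample threshold: a reading above the limit only matters if it stays there for the whole window. A rough sketch of that logic in Python (illustrative only, not Watcher's implementation; the 80% / 5-minute values are the defaults described below):

# Illustrative sustained-threshold check: a brief spike resets the clock;
# only CPU held above the threshold for the full window triggers an alert.
# Sketch of the idea, not Watcher's actual implementation.
import time

CPU_THRESHOLD = 0.80       # 80%
WINDOW_SECONDS = 5 * 60    # sustained for 5+ minutes

above_since: float | None = None  # when CPU first crossed the threshold

def check_sample(cpu_fraction: float, now: float | None = None) -> bool:
    """Return True once CPU has stayed above the threshold for the full window."""
    global above_since
    now = time.time() if now is None else now
    if cpu_fraction < CPU_THRESHOLD:
        above_since = None             # dip below threshold: transient spike
        return False
    if above_since is None:
        above_since = now              # start of a potential sustained period
    return now - above_since >= WINDOW_SECONDS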

2. Context-Aware Diagnosis

Not just “CPU high”: Watcher explains:
  • Which service is causing it
  • Whether crashes occurred
  • Impact on dependent services
  • Whether it’s a capacity issue or application bug

3. Actionable Recommendations

Not just “fix it”: Watcher suggests:
  • Specific actions to take
  • How to verify if it’s a real issue
  • What to monitor next
  • When to scale vs when to optimize code

4. Lifecycle Tracking

Watcher monitors the full issue lifecycle:
  • Detection: Issue identified with initial diagnosis
  • Ongoing: Updates if situation worsens
  • Recovery: Notification when issue resolves
  • Lessons: Context for future similar issues
This means fewer false alarms and more actionable insights.

Configuring Monitoring Thresholds

Watcher uses intelligent defaults.
Default thresholds:
  • CPU: 80% sustained for 5+ minutes
  • Memory: 80% sustained for 5+ minutes
  • Disk: 85% with multiple breaches
  • Crash threshold: 3 crashes trigger alert
  • Liveness failures: 3 consecutive failures
These thresholds work well for most applications. The defaults balance sensitivity (catching real issues) with specificity (avoiding false alarms).
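
As an illustration of the crash threshold, the counting logic amounts to tracking recent crashes per runnable and flagging when the count reaches the limit. A hypothetical sketch in Python (the 3-crash limit comes from the defaults above; the tracking window and all names are assumptions, not Watcher's actual code):

# Illustrative crash-loop counter: flag a runnable once it has crashed
# CRASH_THRESHOLD times within a tracking window. Hypothetical sketch only.
import time
from collections import defaultdict, deque

CRASH_THRESHOLD = 3            # default: 3 crashes trigger an alert
TRACKING_WINDOW = 10 * 60      # assumed window; the real window is internal to Watcher

crash_history: dict[str, deque] = defaultdict(deque)

def record_crash(runnable: str, now: float | None = None) -> bool:
    """Record a crash for a runnable; return True when the threshold is reached."""
    now = time.time() if now is None else now
    history = crash_history[runnable]
    history.append(now)
    while history and history[0] < now - TRACKING_WINDOW:
        history.popleft()          # forget crashes outside the window
    return len(history) >= CRASH_THRESHOLD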

Checking Watcher Status

Is Watcher Running?

is watcher running?
Monk checks if Watcher is active on your cluster.

View Recent Alerts

show me recent alerts
what did watcher detect in the last hour?
has watcher found any issues?

Check Alert History

show me all watcher alerts from yesterday
how many times did the API crash this week?

What Watcher Does Today

Currently available:
  • 24/7 autonomous monitoring of cluster health
  • AI-powered diagnosis of issues with context
  • Slack notifications with detailed analysis
  • Pattern recognition - sustained issues vs transient spikes
  • Recovery notifications when problems resolve
  • Multi-layered monitoring - peers (nodes) and runnables (containers)
  • Crash loop detection - identifies repeated failures
  • Resource pressure alerts - CPU, memory, disk

Coming Soon

“Fix with Monk” Button

COMING SOON: Slack alerts will include a “Fix with Monk” button. The flow:
  1. You click the button in Slack
  2. Your IDE opens automatically
  3. Monk loads with full context of the issue
  4. Monk is pre-prompted with the problem and recommended fix
  5. You review and approve the fix
One-click from alert to resolution.

Autonomous Auto-Fixes

COMING SOON: Future versions of Watcher will automatically fix common issues:
  • Auto-restart crashed services
  • Auto-scale when sustained resource pressure detected
  • Apply known fixes for common issues
  • Smart rollback on failed deployments
Currently, Watcher detects and alerts - future versions will also fix autonomously.

Troubleshooting Watcher

Watcher Not Sending Alerts

Check:
  1. Verify Watcher is running:
is watcher running?
  2. Verify the Slack webhook URL is correct:
check watcher configuration
  3. Ensure the cluster has internet connectivity to reach Slack
Fix:
reconfigure watcher with new Slack webhook

Too Many Alerts

If you’re getting frequent alerts, ask Monk:
why is watcher alerting so much?
Likely causes:
  • Cluster genuinely under-resourced
  • Application has performance issues
  • Thresholds too sensitive for your workload
Solutions:
  • Scale up cluster resources
  • Optimize application performance
  • Ask Monk about right-sizing your infrastructure
Watcher’s AI filters noise, so frequent alerts usually indicate real issues.

False Positives

If Watcher alerts on non-issues, tell Monk:
that CPU spike was expected during deployment
Monk learns from your feedback to improve future alerts.

Want to Temporarily Disable?

disable watcher
Monitoring stops. Re-enable anytime:
enable watcher
turn watcher back on

Managing Multiple Clusters

If you have multiple clusters, each can have its own Watcher:
set up watcher on my production cluster
Switch between clusters to configure Watcher for each:
switch to staging cluster
set up watcher
All alerts go to the same Slack channel (or configure different webhooks per cluster).

Alert Severity Levels

Watcher uses different severity levels.
Info:
  • FYI notifications
  • Non-critical events
  • Successful recoveries
Warning:
  • Issues detected but not critical
  • Sustained resource usage
  • Worth monitoring
Alert:
  • Critical issues requiring attention
  • Service crashes
  • Severe resource exhaustion
Recovery:
  • Issue has resolved
  • System back to normal
  • Informational
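
If you route Watcher’s Slack messages further (for example, into your own on-call tooling), the severity field is the natural switch point. A hypothetical sketch in Python (the routing rules here are assumptions, not part of Watcher):

# Hypothetical downstream routing keyed on Watcher's severity field.
# The rules here are assumptions, not part of Watcher itself.
PAGE_SEVERITIES = {"alert"}     # critical issues requiring attention
QUIET_SEVERITIES = {"info"}     # FYI notifications, log only

def route(severity: str, message: str) -> str:
    """Decide how to handle a Watcher message based on its severity."""
    if severity in PAGE_SEVERITIES:
        return f"page on-call: {message}"
    if severity in QUIET_SEVERITIES:
        return f"log only: {message}"
    return f"post to channel: {message}"   # warning and recovery

print(route("warning", "Sustained high CPU on templates/local/calc/worker"))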