What It Does

Watcher is Monk’s built-in cluster monitoring and alerting system. It runs on your cluster 24/7, detects crashes, resource pressure, and health check failures, then sends AI-analyzed alerts to Slack with actionable recommendations. Available on Pro and Team plans.

How to Set Up

In chat, ask Monk to set up Watcher:
set up watcher
Monk shows a configuration form with monitoring thresholds. The defaults work for most clusters. Click Deploy Watcher to start.
Alternative triggers:
  • “configure watcher”
  • “enable watcher”
  • “set up cluster monitoring”
  • “configure Slack alerts for my cluster”
Requirements:
  • Active cluster with at least one non-local node
  • Slack webhook URL (optional, for alerts)

Configuration Options

The setup form has four sections:

Crash Detection

  • Crash Threshold: Number of restarts within the window to trigger an alert (default: 3)
  • Crash Window: Time window for counting restarts (default: 5 minutes)
  • Health Check Failures: Consecutive liveness failures before alerting (default: 3)
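To make the threshold and window settings concrete, here is a minimal sketch of sliding-window restart counting. It is an illustration only, not Watcher's actual implementation; the `CrashWindow` class and its method names are hypothetical, and only the defaults (3 restarts, 5 minutes) come from the form above.

```python
from collections import deque


class CrashWindow:
    """Counts restarts inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold: int = 3, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._restarts = deque()  # restart timestamps, oldest first

    def record_restart(self, timestamp: float) -> bool:
        """Record a restart; return True once the threshold is reached."""
        self._restarts.append(timestamp)
        # Drop restarts that have fallen out of the window.
        while timestamp - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) >= self.threshold
```

With the defaults, three restarts at 0, 100, and 200 seconds would trigger an alert, while restarts at 0, 10, and 400 seconds would not, because the first two fall outside the 5-minute window by the time the third arrives.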

Peer Thresholds (Cluster Nodes)

  • CPU %: CPU usage threshold (default: 80%)
  • CPU Duration: Sustained time before alerting (default: 5 minutes)
  • Memory %: Memory usage threshold (default: 80%)
  • Memory Duration: Sustained time before alerting (default: 5 minutes)
  • Disk %: Disk usage threshold (default: 85%)
  • Disk Breaches: Consecutive polls above threshold before alerting (default: 2)

Workload Thresholds (Running Services)

  • CPU %: CPU usage threshold (default: 80%)
  • CPU Duration: Sustained time before alerting (default: 5 minutes)
  • Memory %: Memory usage threshold (default: 80%)
  • Memory Duration: Sustained time before alerting (default: 5 minutes)
  • Disk %: Disk usage threshold (default: 90%)
  • Disk Breaches: Consecutive polls above threshold before alerting (default: 3)
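Peer and workload thresholds use two different trigger styles: CPU and memory must stay high for a sustained duration, while disk must exceed its threshold on several consecutive polls. The sketch below shows one way each rule could be evaluated. The class names are hypothetical, and treating a reading exactly at the threshold as a breach is an assumption, not documented Watcher behavior.

```python
from typing import Optional


class SustainedThreshold:
    """Fires when a metric stays at or above a threshold for a duration."""

    def __init__(self, threshold_pct: float = 80.0, duration_seconds: float = 300.0):
        self.threshold_pct = threshold_pct
        self.duration_seconds = duration_seconds
        self._breach_started: Optional[float] = None

    def observe(self, timestamp: float, value_pct: float) -> bool:
        if value_pct < self.threshold_pct:
            self._breach_started = None  # any dip resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.duration_seconds


class ConsecutiveBreaches:
    """Fires after N polls in a row at or above a threshold."""

    def __init__(self, threshold_pct: float = 85.0, breaches: int = 2):
        self.threshold_pct = threshold_pct
        self.breaches = breaches
        self._count = 0

    def observe(self, value_pct: float) -> bool:
        # Assumption: a reading equal to the threshold counts as a breach.
        self._count = self._count + 1 if value_pct >= self.threshold_pct else 0
        return self._count >= self.breaches
```

The sustained rule suits metrics that spike briefly under normal load, and the consecutive-poll rule suits disk usage, which tends to change slowly and rarely recovers on its own.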

Advanced Settings

Toggle “Show Advanced Options” to access:
  • Poll Interval: How often to check cluster health (default: 15 seconds)
  • AI Only Slack: Only send AI-analyzed alerts to Slack, reduces noise (default: on)
  • Enable Fix with Monk: Include debugging links in Slack alerts (default: on)
  • Ignore Local Peer: Skip local node checks, focus on remote peers (default: on)
  • Context TTL: How long to keep alert context for debugging links (default: 24 hours)
  • Reassess Interval: How often to re-evaluate ongoing issues (default: 15 minutes)
  • Log Lines: Number of log lines to analyze per workload (default: 100)
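When tuning the poll interval, it can help to see how many polls fit into each duration setting. The helper below is plain arithmetic under the assumption that durations are evaluated once per poll; `polls_in` is a hypothetical name, not part of Watcher.

```python
import math


def polls_in(duration_seconds: float, poll_interval_seconds: float = 15.0) -> int:
    """Number of poll intervals that fit in a duration, rounded up."""
    return math.ceil(duration_seconds / poll_interval_seconds)


print(polls_in(300))  # 5-minute CPU/memory duration at a 15-second interval -> 20
print(polls_in(900))  # 15-minute reassess interval at a 15-second interval -> 60
```

Raising the poll interval reduces the number of samples behind each sustained-duration decision, so very long intervals make the duration settings coarser.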

Slack Integration

When you set up Watcher, Monk asks if you want to configure Slack alerts. If you choose yes, Monk prompts for your Slack webhook URL (collected securely, never shown in chat).
To create a Slack webhook:
  1. Go to Slack Incoming Webhooks
  2. Create a new webhook for your workspace
  3. Copy the webhook URL
  4. Paste it when Monk asks during Watcher setup
If you skip Slack configuration, Watcher still monitors your cluster, but you won’t get push notifications.
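Before pasting a webhook URL into Monk, you may want to confirm that it works. The sketch below posts a plain-text message to a Slack incoming webhook using only the Python standard library. The function names and the message text are illustrative, and this is separate from anything Watcher itself sends.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a simple {"text": ...} JSON payload."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str) -> int:
    """Send a test message and return the HTTP status code."""
    request = build_slack_request(webhook_url, "Test message before Watcher setup")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Call `send_test_message` with your own webhook URL; a 200 status and a message appearing in the target channel indicate the webhook is usable. Treat the URL as a secret and avoid committing it to source control.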

How It Works

Watcher deploys two components to your cluster:
  1. watcher-agent: Monitors cluster health, collects metrics, detects issues
  2. watcher-ai: Analyzes issues with AI, generates recommendations, sends Slack alerts
Detection flow:
  1. Continuous polling of all nodes and workloads
  2. Threshold breach or crash detected
  3. AI analyzes logs, metrics, and context
  4. Alert sent to Slack with diagnosis and recommendations
  5. Recovery notification when issue resolves
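The alert and recovery steps in this flow amount to a small state machine per target: alert once when an issue starts, stay quiet while it persists, and notify once when it clears. The sketch below illustrates that idea only. `IssueTracker` is a hypothetical name, and the sketch leaves out the AI analysis step and the periodic reassessment that Watcher performs.

```python
from typing import Optional, Set


class IssueTracker:
    """Tracks which targets have an open issue (hypothetical helper)."""

    def __init__(self) -> None:
        self._active: Set[str] = set()

    def update(self, target: str, breached: bool) -> Optional[str]:
        """Return "alert", "recovery", or None for a new poll result."""
        if breached and target not in self._active:
            self._active.add(target)
            return "alert"  # first detection: send an alert
        if not breached and target in self._active:
            self._active.remove(target)
            return "recovery"  # issue cleared: send a recovery notice
        return None  # no state change, nothing to send
```

Keeping per-target state like this is what prevents a fresh alert on every poll while an issue is ongoing.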

Slack Alert Format

Issue detected:
⚠️ AI Assessment

Sustained high CPU usage on api-server is driving node 
CPU to ~80%+, with repeated warnings but no crashes yet.

Recommendation:
Confirm this is a true capacity issue rather than a 
transient spike by continuing to monitor CPU usage. 
If sustained, increase CPU resources or scale horizontally.

Target: api-server
Severity: warning

[Fix with Monk]

Recovery:
✅ AI Assessment

Recovery: CPU usage has normalized to ~78%, with workloads 
running normally and no new errors.

Recommendation:
Keep current configuration but continue to observe.

Fix with Monk Button

Each Slack alert includes a Fix with Monk button. Clicking it:
  1. Opens VS Code with the Monk extension
  2. Loads the Monk chat panel
  3. Prefills context about the issue (affected workload, logs, metrics, AI diagnosis)
You can then ask Monk to fix the issue, and it has full context from the alert.

Managing Watcher

Check status:
is watcher running?
Update configuration by running setup again:
set up watcher
This reloads the template and applies new settings.
View Watcher logs:
show logs from system/watcher-agent

Coming Soon

Autonomous Auto-Fixes
Future Watcher capabilities will include automatic remediation:
  • Automatic restart of crashed services
  • Auto-scaling resources when sustained pressure detected
  • Applying known fixes without human intervention
  • Smart rollback on failed deployments
Currently, Watcher detects and diagnoses issues with AI-powered analysis; future versions will also fix them autonomously.