
What It Does

Watcher is Monk’s built-in cluster monitoring and alerting system. It runs on your cluster 24/7, detects crashes, resource pressure, and health check failures, then sends AI-analyzed alerts to Slack with actionable recommendations. Available on Pro and Team plans.

How to Set Up

In chat, ask Monk to set up Watcher:
set up watcher
Monk shows a configuration form with monitoring thresholds. The defaults work for most clusters. Click Deploy Watcher to start. Alternative triggers:
  • “configure watcher”
  • “enable watcher”
  • “set up cluster monitoring”
  • “configure Slack alerts for my cluster”
Requirements:
  • Active cluster with at least one non-local node
  • Slack webhook URL (optional, for alerts)

Configuration Options

The setup form has four sections:

Crash Detection

  • Crash Threshold: Number of restarts within the window to trigger an alert (default: 3)
  • Crash Window: Time window for counting restarts (default: 5 minutes)
  • Health Check Failures: Consecutive liveness failures before alerting (default: 3)
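The crash-detection settings above amount to a sliding-window count of restarts. A minimal sketch of that logic, assuming a simple per-workload timestamp window (the class and method names are illustrative, not Watcher's actual implementation):

```python
import time
from collections import deque

class CrashDetector:
    """Flags a crash loop when `threshold` restarts occur within
    `window_seconds` (Watcher defaults: 3 restarts in 5 minutes).
    Illustrative sketch only -- not Watcher's internals."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.restarts = {}  # workload name -> deque of restart timestamps

    def record_restart(self, workload, now=None):
        now = time.time() if now is None else now
        events = self.restarts.setdefault(workload, deque())
        events.append(now)
        # Drop restarts that fell outside the sliding window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold  # True -> trigger an alert
```

With the defaults, the third restart within five minutes trips the alert; a restart long after the window has drained does not.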

Peer Thresholds (Cluster Nodes)

  • CPU %: CPU usage threshold (default: 80%)
  • CPU Duration: Sustained time before alerting (default: 5 minutes)
  • Memory %: Memory usage threshold (default: 80%)
  • Memory Duration: Sustained time before alerting (default: 5 minutes)
  • Disk %: Disk usage threshold (default: 85%)
  • Disk Breaches: Consecutive polls above threshold before alerting (default: 2)

Workload Thresholds (Running Services)

  • CPU %: CPU usage threshold (default: 80%)
  • CPU Duration: Sustained time before alerting (default: 5 minutes)
  • Memory %: Memory usage threshold (default: 80%)
  • Memory Duration: Sustained time before alerting (default: 5 minutes)
  • Disk %: Disk usage threshold (default: 90%)
  • Disk Breaches: Consecutive polls above threshold before alerting (default: 3)
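Unlike the CPU and memory settings, the disk settings count consecutive polls above the threshold rather than sustained time. A sketch of that behavior, using the workload defaults (90% / 3 breaches); names are illustrative, not Watcher's internals:

```python
class BreachCounter:
    """Raises an alert after N consecutive polls above a threshold.
    Illustrative sketch of the 'Disk Breaches' behavior."""

    def __init__(self, threshold_pct=90.0, breaches_required=3):
        self.threshold = threshold_pct
        self.required = breaches_required
        self.consecutive = 0

    def observe(self, usage_pct):
        if usage_pct > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a single healthy poll resets the count
        return self.consecutive >= self.required  # True -> alert
```

A single poll back under the threshold resets the count, so short spikes between healthy polls never alert.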

Advanced Settings

Toggle “Show Advanced Options” to access:
  • Poll Interval: How often to check cluster health (default: 15 seconds)
  • AI Only Slack: Send only AI-analyzed alerts to Slack to reduce noise (default: on)
  • Enable Fix with Monk: Include debugging links in Slack alerts (default: on)
  • Ignore Local Peer: Skip local node checks, focus on remote peers (default: on)
  • Context TTL: How long to keep alert context for debugging links (default: 24 hours)
  • Reassess Interval: How often to re-evaluate ongoing issues (default: 15 minutes)
  • Log Lines: Number of log lines to analyze per workload (default: 100)
Watcher configuration
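For reference, here are the defaults from all four sections of the setup form collected into one Python dict. The key names are hypothetical and chosen for readability; they are not Watcher's actual configuration format, which is the setup form itself:

```python
# Hypothetical summary of the setup form's defaults -- key names are
# illustrative, not Watcher's real configuration schema.
WATCHER_DEFAULTS = {
    "crash": {"threshold": 3, "window_minutes": 5, "health_check_failures": 3},
    "peer": {"cpu_pct": 80, "cpu_minutes": 5, "mem_pct": 80,
             "mem_minutes": 5, "disk_pct": 85, "disk_breaches": 2},
    "workload": {"cpu_pct": 80, "cpu_minutes": 5, "mem_pct": 80,
                 "mem_minutes": 5, "disk_pct": 90, "disk_breaches": 3},
    "advanced": {
        "poll_interval_seconds": 15,
        "ai_only_slack": True,
        "enable_fix_with_monk": True,
        "ignore_local_peer": True,
        "context_ttl_hours": 24,
        "reassess_interval_minutes": 15,
        "log_lines": 100,
    },
}
```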

Slack Integration

When you set up Watcher, Monk asks if you want to configure Slack alerts. If you choose yes, Monk prompts for your Slack webhook URL (collected securely, never shown in chat). To create a Slack webhook:
  1. Go to Slack Incoming Webhooks
  2. Create a new webhook for your workspace
  3. Copy the webhook URL
  4. Paste it when Monk asks during Watcher setup
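If you want to verify your webhook works before handing it to Monk, Slack incoming webhooks accept a plain JSON POST with a "text" field and reply with the string "ok". A stdlib-only sketch (the example URL is a placeholder; this is standard Slack webhook usage, not Monk-specific):

```python
import json
import urllib.request

def build_slack_request(webhook_url, text):
    """Build the HTTP request for a Slack incoming webhook:
    a JSON POST body with a "text" field."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def post_to_slack(webhook_url, text):
    """Send the message; Slack returns the body "ok" on success."""
    with urllib.request.urlopen(build_slack_request(webhook_url, text)) as resp:
        return resp.read().decode("utf-8")

# Example (placeholder URL -- use the webhook you copied in step 3):
# post_to_slack("https://hooks.slack.com/services/T000/B000/XXXX",
#               "Watcher webhook test")
```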
If you skip Slack configuration, Watcher still monitors your cluster - you just won’t get push notifications.

How It Works

Watcher deploys two components to your cluster:
  1. watcher-agent: Monitors cluster health, collects metrics, detects issues
  2. watcher-ai: Analyzes issues with AI, generates recommendations, sends Slack alerts
Detection flow:
  1. Continuous polling of all nodes and workloads
  2. Threshold breach or crash detected
  3. AI analyzes logs, metrics, and context
  4. Alert sent to Slack with diagnosis and recommendations
  5. Recovery notification when issue resolves
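Steps 4 and 5 imply per-target alert state: fire one alert when a breach opens, then send a recovery notification when the condition clears, and stay quiet in between. A minimal sketch of that bookkeeping (illustrative names, not Watcher's code):

```python
class AlertState:
    """Tracks open alerts per target so each incident produces exactly
    one alert and one recovery notification. Illustrative sketch."""

    def __init__(self):
        self.active = set()  # targets with an open alert

    def update(self, target, breached):
        """Return "alert", "recovery", or None for this poll."""
        if breached and target not in self.active:
            self.active.add(target)
            return "alert"
        if not breached and target in self.active:
            self.active.remove(target)
            return "recovery"
        return None  # no state change: stay quiet
```

Repeated polls while the breach persists return None, so Slack only hears about transitions.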
If an issue is detected:
  • An alert is sent to the configured notification endpoint (e.g., Slack).
  • The notification includes a summary of the issue and a “Fix with Monk” button.
  • Clicking the button opens a contextual Monk chat session.
  • Monk explains the root cause and transparently performs remediation steps.
Watcher documents each action it takes, ensuring full visibility into the resolution process.

What Watcher Detects

Watcher can identify a wide range of infrastructure and application-level issues, including:
  • High CPU usage exceeding defined thresholds
  • Crash loops
  • Noisy neighbor resource contention
  • Excessive log output
  • Infrastructure instability

Alert Notification Configuration

Watcher currently routes alerts via Slack webhook notifications, which lets organizations integrate alerts directly into their existing incident management workflows.

Incident Flow

A typical incident resolution process follows these steps:
  • An alert, including an AI analysis of the root cause, is delivered to an external application (e.g., Slack).
  • The user clicks Fix with Monk.
  • Monk opens a contextual chat session and begins remediation.
  • Monk displays each action taken in real time.

Slack Alert Format

Issue detected:
⚠️ AI Assessment

Sustained high CPU usage on api-server is driving node 
CPU to ~80%+, with repeated warnings but no crashes yet.

Recommendation:
Confirm this is a true capacity issue rather than a 
transient spike by continuing to monitor CPU usage. 
If sustained, increase CPU resources or scale horizontally.

Target: api-server
Severity: warning

[Fix with Monk]
Recovery:
✅ AI Assessment

Recovery: CPU usage has normalized to ~78%, with workloads 
running normally and no new errors.

Recommendation:
Keep current configuration but continue to observe.
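The two messages above share a shape: an assessment, a recommendation, and (for active issues) a target and severity. A sketch of formatting a message in that shape, assuming hypothetical field names, not Watcher's actual rendering code:

```python
def format_alert(assessment, recommendation, target, severity, recovered=False):
    """Render a Slack alert body in the style shown above (illustrative)."""
    icon = "✅" if recovered else "⚠️"
    lines = [
        f"{icon} AI Assessment",
        "",
        assessment,
        "",
        "Recommendation:",
        recommendation,
    ]
    if not recovered:
        # Recovery notices omit the target/severity footer, matching
        # the samples above.
        lines += ["", f"Target: {target}", f"Severity: {severity}"]
    return "\n".join(lines)
```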

Fix with Monk Button

Each Slack alert includes a Fix with Monk button. Clicking it:
  1. Opens VS Code with the Monk extension
  2. Loads the Monk chat panel
  3. Prefills context about the issue (affected workload, logs, metrics, AI diagnosis)
You can then ask Monk to fix the issue, and it has full context from the alert.
Watcher CPU notification Slack alert

Managing Watcher

Check status:
is watcher running?
Update configuration: Run setup again to reconfigure:
set up watcher
This reloads the template and applies new settings.
View Watcher logs:
show logs from system/watcher-agent

Coming Soon

Autonomous Auto-Fixes (Coming Soon)

Future Watcher capabilities will include automatic remediation:
  • Automatic restart of crashed services
  • Auto-scaling resources when sustained pressure detected
  • Applying known fixes without human intervention
  • Smart rollback on failed deployments
Currently, Watcher detects and diagnoses issues with AI-powered analysis - future versions will also fix them autonomously.