
What It Does

Watcher is Monk’s built-in cluster monitoring and alerting system. It runs on your cluster 24/7, detects crashes, resource pressure, and health check failures, then sends AI-analyzed alerts to Slack with actionable recommendations. Available on Pro and Team plans.

How to Set Up

In chat, ask Monk to set up Watcher:
set up watcher
Monk shows a configuration form with monitoring thresholds. The defaults work for most clusters. Click Deploy Watcher to start. Alternative triggers:
  • “configure watcher”
  • “enable watcher”
  • “set up cluster monitoring”
  • “configure Slack alerts for my cluster”
Requirements:
  • Active cluster with at least one non-local node
  • Slack webhook URL (optional, for alerts)

Configuration Options

The setup form has four sections:

Crash Detection

  • Crash Threshold: Number of restarts within the window to trigger an alert (default: 3)
  • Crash Window: Time window for counting restarts (default: 5 minutes)
  • Health Check Failures: Consecutive liveness failures before alerting (default: 3)
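The crash-detection settings above amount to a sliding-window count of restarts. A minimal sketch of that logic, assuming a simple per-workload timestamp window (the class and method names are illustrative, not Watcher's actual implementation):

```python
import time
from collections import deque

class CrashDetector:
    """Flags a crash loop when `threshold` restarts occur within
    `window_seconds` (Watcher defaults: 3 restarts in 5 minutes).
    Illustrative sketch only -- not Watcher's internals."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.restarts = {}  # workload name -> deque of restart timestamps

    def record_restart(self, workload, now=None):
        now = time.time() if now is None else now
        events = self.restarts.setdefault(workload, deque())
        events.append(now)
        # Drop restarts that fell outside the sliding window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold  # True -> trigger an alert
```

With the defaults, the third restart within five minutes trips the alert; a restart long after the window has drained does not.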

Peer Thresholds (Cluster Nodes)

  • CPU %: CPU usage threshold (default: 80%)
  • CPU Duration: Sustained time before alerting (default: 5 minutes)
  • Memory %: Memory usage threshold (default: 80%)
  • Memory Duration: Sustained time before alerting (default: 5 minutes)
  • Disk %: Disk usage threshold (default: 85%)
  • Disk Breaches: Consecutive polls above threshold before alerting (default: 2)

Workload Thresholds (Running Services)

  • CPU %: CPU usage threshold (default: 80%)
  • CPU Duration: Sustained time before alerting (default: 5 minutes)
  • Memory %: Memory usage threshold (default: 80%)
  • Memory Duration: Sustained time before alerting (default: 5 minutes)
  • Disk %: Disk usage threshold (default: 90%)
  • Disk Breaches: Consecutive polls above threshold before alerting (default: 3)
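Unlike the CPU and memory settings, the disk settings count consecutive polls above the threshold rather than sustained time. A sketch of that behavior, using the workload defaults (90% / 3 breaches); names are illustrative, not Watcher's internals:

```python
class BreachCounter:
    """Raises an alert after N consecutive polls above a threshold.
    Illustrative sketch of the 'Disk Breaches' behavior."""

    def __init__(self, threshold_pct=90.0, breaches_required=3):
        self.threshold = threshold_pct
        self.required = breaches_required
        self.consecutive = 0

    def observe(self, usage_pct):
        if usage_pct > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a single healthy poll resets the count
        return self.consecutive >= self.required  # True -> alert
```

A single poll back under the threshold resets the count, so short spikes between healthy polls never alert.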

Advanced Settings

Toggle “Show Advanced Options” to access:
  • Poll Interval: How often to check cluster health (default: 15 seconds)
  • AI Only Slack: Send only AI-analyzed alerts to Slack to reduce noise (default: on)
  • Enable Fix with Monk: Include debugging links in Slack alerts (default: on)
  • Ignore Local Peer: Skip local node checks, focus on remote peers (default: on)
  • Context TTL: How long to keep alert context for debugging links (default: 24 hours)
  • Reassess Interval: How often to re-evaluate ongoing issues (default: 15 minutes)
  • Log Lines: Number of log lines to analyze per workload (default: 100)
Watcher configuration
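For reference, here are the defaults from all four sections of the setup form collected into one Python dict. The key names are hypothetical and chosen for readability; they are not Watcher's actual configuration format, which is the setup form itself:

```python
# Hypothetical summary of the setup form's defaults -- key names are
# illustrative, not Watcher's real configuration schema.
WATCHER_DEFAULTS = {
    "crash": {"threshold": 3, "window_minutes": 5, "health_check_failures": 3},
    "peer": {"cpu_pct": 80, "cpu_minutes": 5, "mem_pct": 80,
             "mem_minutes": 5, "disk_pct": 85, "disk_breaches": 2},
    "workload": {"cpu_pct": 80, "cpu_minutes": 5, "mem_pct": 80,
                 "mem_minutes": 5, "disk_pct": 90, "disk_breaches": 3},
    "advanced": {
        "poll_interval_seconds": 15,
        "ai_only_slack": True,
        "enable_fix_with_monk": True,
        "ignore_local_peer": True,
        "context_ttl_hours": 24,
        "reassess_interval_minutes": 15,
        "log_lines": 100,
    },
}
```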

Slack Integration

When you set up Watcher, Monk asks if you want to configure Slack alerts. If you choose yes, Monk prompts for your Slack webhook URL (collected securely, never shown in chat). To create a Slack webhook:
  1. Go to Slack Incoming Webhooks
  2. Create a new webhook for your workspace
  3. Copy the webhook URL
  4. Paste it when Monk asks during Watcher setup
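If you want to verify your webhook works before handing it to Monk, Slack incoming webhooks accept a plain JSON POST with a "text" field and reply with the string "ok". A stdlib-only sketch (the example URL is a placeholder; this is standard Slack webhook usage, not Monk-specific):

```python
import json
import urllib.request

def build_slack_request(webhook_url, text):
    """Build the HTTP request for a Slack incoming webhook:
    a JSON POST body with a "text" field."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def post_to_slack(webhook_url, text):
    """Send the message; Slack returns the body "ok" on success."""
    with urllib.request.urlopen(build_slack_request(webhook_url, text)) as resp:
        return resp.read().decode("utf-8")

# Example (placeholder URL -- use the webhook you copied in step 3):
# post_to_slack("https://hooks.slack.com/services/T000/B000/XXXX",
#               "Watcher webhook test")
```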
If you skip Slack configuration, Watcher still monitors your cluster - you just won’t get push notifications.

How It Works

Watcher deploys two components to your cluster:
  1. watcher-agent: Monitors cluster health, collects metrics, detects issues
  2. watcher-ai: Analyzes issues with AI, generates recommendations, sends Slack alerts
Detection flow:
  1. Continuous polling of all nodes and workloads
  2. Threshold breach or crash detected
  3. AI analyzes logs, metrics, and context
  4. Alert sent to Slack with diagnosis and recommendations
  5. Recovery notification when issue resolves
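Steps 4 and 5 imply per-target alert state: fire one alert when a breach opens, then send a recovery notification when the condition clears, and stay quiet in between. A minimal sketch of that bookkeeping (illustrative names, not Watcher's code):

```python
class AlertState:
    """Tracks open alerts per target so each incident produces exactly
    one alert and one recovery notification. Illustrative sketch."""

    def __init__(self):
        self.active = set()  # targets with an open alert

    def update(self, target, breached):
        """Return "alert", "recovery", or None for this poll."""
        if breached and target not in self.active:
            self.active.add(target)
            return "alert"
        if not breached and target in self.active:
            self.active.remove(target)
            return "recovery"
        return None  # no state change: stay quiet
```

Repeated polls while the breach persists return None, so Slack only hears about transitions.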
If an issue is detected:
  • An alert is sent to the configured notification endpoint (e.g., Slack).
  • The notification includes a summary of the issue and a “Fix with Monk” button.
  • Clicking the button opens a contextual Monk chat session.
  • Monk explains the root cause and transparently performs remediation steps.
Watcher documents each action it takes, ensuring full visibility into the resolution process.

What Watcher Detects

Watcher can identify a wide range of infrastructure and application-level issues, including:
  • High CPU usage exceeding defined thresholds
  • Crash loops
  • Noisy neighbor resource contention
  • Excessive log output
  • Infrastructure instability

Alert Notification Configuration

Watcher currently routes alerts via Slack webhook notifications, which lets organizations integrate alerts directly into their existing incident management workflows.

Incident Flow

A typical incident resolution process follows these steps:
  • An alert, including an AI analysis of the root cause, is delivered to an external application (e.g., Slack).
  • The user clicks Fix with Monk.
  • Monk opens a contextual chat session and begins remediation.
  • Monk displays each action taken in real time.

Slack Alert Format

Issue detected:
⚠️ AI Assessment

Sustained high CPU usage on api-server is driving node 
CPU to ~80%+, with repeated warnings but no crashes yet.

Recommendation:
Confirm this is a true capacity issue rather than a 
transient spike by continuing to monitor CPU usage. 
If sustained, increase CPU resources or scale horizontally.

Target: api-server
Severity: warning

[Fix with Monk]
Recovery:
✅ AI Assessment

Recovery: CPU usage has normalized to ~78%, with workloads 
running normally and no new errors.

Recommendation:
Keep current configuration but continue to observe.
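The two messages above share a shape: an assessment, a recommendation, and (for active issues) a target and severity. A sketch of formatting a message in that shape, assuming hypothetical field names, not Watcher's actual rendering code:

```python
def format_alert(assessment, recommendation, target, severity, recovered=False):
    """Render a Slack alert body in the style shown above (illustrative)."""
    icon = "✅" if recovered else "⚠️"
    lines = [
        f"{icon} AI Assessment",
        "",
        assessment,
        "",
        "Recommendation:",
        recommendation,
    ]
    if not recovered:
        # Recovery notices omit the target/severity footer, matching
        # the samples above.
        lines += ["", f"Target: {target}", f"Severity: {severity}"]
    return "\n".join(lines)
```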

Fix with Monk Button

Each Slack alert includes a Fix with Monk button. Clicking it:
  1. Opens VS Code with the Monk extension
  2. Loads the Monk chat panel
  3. Prefills context about the issue (affected workload, logs, metrics, AI diagnosis)
You can then ask Monk to fix the issue, and it has full context from the alert.
Watcher CPU notification Slack alert

Managing Watcher

Check status:
is watcher running?
Update configuration: Run setup again to reconfigure:
set up watcher
This reloads the template and applies new settings.
View Watcher logs:
show logs from system/watcher-agent

Coming Soon

Autonomous Auto-Fixes (Coming Soon)

Future Watcher capabilities will include automatic remediation:
  • Automatic restart of crashed services
  • Auto-scaling resources when sustained pressure detected
  • Applying known fixes without human intervention
  • Smart rollback on failed deployments
Currently, Watcher detects and diagnoses issues with AI-powered analysis - future versions will also fix them autonomously.