Auto-Investigate Datadog Alerts

For a more detailed walkthrough, see the Datadog integration guide.

Enable the Datadog MCP

Devin needs access to your Datadog account to query logs, metrics, and monitors during an investigation.

1. Go to Settings > Connections > MCP servers and find Datadog.
2. Click Enable, select your Datadog site/region, and enter your DD-API-KEY and DD-APPLICATION-KEY. Generate these in Datadog under Organization Settings (API Keys and Application Keys).
3. Click Test listing tools to verify that Devin can connect.
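Before pasting the keys into Devin, it can help to confirm that the API key is valid for the site you selected. The sketch below is one way to do that against Datadog's key-validation endpoint (/api/v1/validate). The helper names are illustrative, the default site of datadoghq.com is an assumption, and the check covers only the API key, not the application key.

```python
import json
import urllib.error
import urllib.request


def build_validate_request(api_key: str, site: str = "datadoghq.com") -> urllib.request.Request:
    """Build a GET request to Datadog's key-validation endpoint for the given site."""
    url = f"https://api.{site}/api/v1/validate"
    return urllib.request.Request(url, headers={"DD-API-KEY": api_key}, method="GET")


def api_key_is_valid(api_key: str, site: str = "datadoghq.com") -> bool:
    """Return True if Datadog accepts the API key, False if it rejects it."""
    try:
        with urllib.request.urlopen(build_validate_request(api_key, site), timeout=10) as resp:
            return bool(json.load(resp).get("valid"))
    except urllib.error.HTTPError:
        # Datadog answers with an HTTP error status when the key is not accepted
        return False


# Usage (performs a network call):
#   api_key_is_valid(os.environ["DD_API_KEY"], site="datadoghq.eu")
```

If the check fails, the most common cause is a key created in a different site/region than the one selected.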

Once enabled, Devin can query error logs, pull metric timeseries, list active monitors, and search traces — all within a session. Learn more about connecting MCP servers.

Build the alert-to-Devin bridge

You need a small service that receives alert webhooks and starts a Devin session via the Devin API. Deploy this as a serverless function (AWS Lambda, Cloudflare Worker) or a lightweight container:

from flask import Flask, request, jsonify
import requests, os

app = Flask(__name__)

@app.route("/alert", methods=["POST"])
def handle_alert():
    # Tolerate empty or non-JSON bodies instead of failing on payload.get below
    payload = request.get_json(silent=True) or {}

    # Datadog webhook payload fields
    alert_title = payload.get("title", "Unknown alert")
    tags_str = payload.get("tags", "")
    service = next(
        (t.split(":", 1)[1].strip() for t in tags_str.split(",") if t.strip().startswith("service:")),
        "unknown-service"
    )
    alert_url = payload.get("link", "")

    org_id = os.environ["DEVIN_ORG_ID"]
    response = requests.post(
        f"https://api.devin.ai/v3/organizations/{org_id}/sessions",
        headers={"Authorization": f"Bearer {os.environ['DEVIN_API_KEY']}"},
        json={
            "prompt": (
                f"Datadog alert fired: '{alert_title}'\n"
                f"Service: {service}\n"
                f"Alert link: {alert_url}\n\n"
                "Using the Datadog MCP:\n"
                "1. Pull error logs for this service from the past 30 min\n"
                "2. Identify the top error messages and stack traces\n"
                "3. Check if this correlates with a recent deploy\n"
                "4. If the root cause is clear, open a hotfix PR\n"
                "5. Post your findings to #incidents on Slack"
            ),
            "playbook_id": "14fed18b89d44713a26e673cf258f548",
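        },
        timeout=30,  # avoid hanging the webhook handler if the API is slow
    )
    # Pass the API's status code through so a failed session creation is visible
    return jsonify(response.json()), response.status_code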
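To make the tag parsing in the handler easier to reason about, here is the same logic as a standalone function with two sample inputs. It assumes, as the handler does, that tags arrive as one comma-separated string; the function name extract_service is illustrative.

```python
def extract_service(tags_str: str, default: str = "unknown-service") -> str:
    """Return the value of the first `service:` tag, or the default if none is present."""
    for tag in tags_str.split(","):
        tag = tag.strip()  # tags may be separated by ", " rather than ","
        if tag.startswith("service:"):
            return tag.split(":", 1)[1].strip()
    return default


print(extract_service("env:prod, service:payments-service,team:checkout"))  # payments-service
print(extract_service("env:prod"))  # unknown-service
```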

Create a service user in Settings > Service Users at app.devin.ai with the ManageOrgSessions permission. Copy the API token shown after creation and store it as DEVIN_API_KEY on your bridge service. Set DEVIN_ORG_ID to your organization ID, which you can get by calling GET https://api.devin.ai/v3/enterprise/organizations with your token.

The code above uses the !triage template playbook. Duplicate it, customize the investigation steps for your stack, then update the playbook_id in your bridge service.
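The organization lookup mentioned above can be sketched as follows. The endpoint and Bearer authentication come from this guide; the response shape is not documented here, so the sketch returns the raw JSON for you to read the ID from rather than assuming field names. The helper names are illustrative.

```python
import json
import urllib.request

ORGS_URL = "https://api.devin.ai/v3/enterprise/organizations"


def build_orgs_request(token: str) -> urllib.request.Request:
    """Build the GET request that lists organizations visible to the token."""
    return urllib.request.Request(
        ORGS_URL,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_organizations(token: str):
    """Call the endpoint and return the decoded JSON body as-is."""
    with urllib.request.urlopen(build_orgs_request(token), timeout=30) as resp:
        return json.load(resp)


# Usage (performs a network call):
#   print(json.dumps(fetch_organizations(os.environ["DEVIN_API_KEY"]), indent=2))
```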

Route alerts to the webhook

From Datadog directly:

1. In your Datadog dashboard, go to Integrations > Webhooks.
2. Click New Webhook, name it devin-bridge, and set the URL to your bridge endpoint (e.g., https://your-bridge.example.com/alert).
3. In any monitor’s notification message, add @webhook-devin-bridge (the handle is @webhook- followed by the webhook's name). Devin then investigates whenever that monitor fires.
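The bridge handler reads the fields title, tags, and link, so the webhook's custom payload needs to supply them. A minimal sketch of such a payload, built from Datadog's webhook template variables $EVENT_TITLE, $TAGS, and $LINK, is shown below; the choice of variables is an assumption about what suits your monitors (for example, $ALERT_TITLE may fit better than $EVENT_TITLE). The script only prints JSON to paste into the webhook's Payload field.

```python
import json

# Datadog substitutes these template variables when the monitor fires.
# $TAGS expands to a comma-separated list of tags, which is what the handler parses.
PAYLOAD_TEMPLATE = {
    "title": "$EVENT_TITLE",
    "tags": "$TAGS",
    "link": "$LINK",
}

print(json.dumps(PAYLOAD_TEMPLATE, indent=2))
```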

From PagerDuty:

1. In PagerDuty, go to Services > [your service] > Integrations.
2. Add a Generic Webhooks (v3) integration.
3. Set the webhook URL to your bridge endpoint and filter by event type incident.triggered.
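PagerDuty's v3 webhooks wrap the incident in an event envelope rather than sending the flat title, tags, and link fields the bridge handler reads, so the bridge likely needs a small translation step. The sketch below assumes the v3 layout (event.event_type, event.data.title, event.data.html_url, event.data.service.summary) and that the PagerDuty service name matches your Datadog service tag; check both against a real delivery. The function name is illustrative.

```python
from typing import Optional


def normalize_pagerduty_event(body: dict) -> Optional[dict]:
    """Map a PagerDuty v3 webhook body to the fields the bridge handler expects.

    Returns None for anything other than incident.triggered events.
    """
    event = body.get("event") or {}
    if event.get("event_type") != "incident.triggered":
        return None
    data = event.get("data") or {}
    service = (data.get("service") or {}).get("summary", "unknown-service")
    return {
        "title": data.get("title", "Unknown alert"),
        "tags": f"service:{service}",  # same "service:<name>" form the handler parses
        "link": data.get("html_url", ""),
    }


# Hypothetical sample body, trimmed to the fields used above
sample = {
    "event": {
        "event_type": "incident.triggered",
        "data": {
            "title": "High error rate",
            "html_url": "https://example.pagerduty.com/incidents/PXXXXXX",
            "service": {"summary": "payments-service"},
        },
    }
}
print(normalize_pagerduty_event(sample))
```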

Start with warning-level monitors to test the pipeline before routing critical alerts.

What Devin investigates

When an alert triggers a session, Devin uses the Datadog MCP to run a structured investigation: querying logs, correlating with deploys, and tracing the error to source code.

Example investigation Devin posts to Slack:

Alert Investigation: payments-service error rate spike

Timeline:
- 14:28 UTC — Deploy #492 released (commit abc123f)
- 14:31 UTC — Error rate jumped from 0.3% to 5.2%
- 14:32 UTC — Alert triggered

Root cause: Deploy #492 refactored the Stripe webhook handler
(src/webhooks/stripe.ts) to async/await but removed the try/catch
around handlePaymentIntent(). Unhandled rejections are returning
500s on ~4% of checkout requests.

Fix: Added error boundary with structured logging and proper 4xx
responses for client errors.

PR #493 opened → https://github.com/acme/payments/pull/493

Extend the pipeline

Once basic investigation works, layer on more automation:

- Customize the triage playbook. The bridge code already uses the !triage template playbook. Duplicate it and tailor the investigation checklist to your team’s stack: add service-specific runbooks, escalation paths, and conventions for hotfix PRs.
- Scope by severity. Route P1 alerts for immediate investigation and hotfix. Route P3 alerts for root-cause analysis only. Use different prompts or playbooks per severity level.
- Add Knowledge about your services (normal thresholds, architecture, on-call runbooks) so Devin’s investigation starts from your team’s context instead of from scratch.
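Scoping by severity could look like the sketch below, which varies the final step of the prompt by priority. It assumes the bridge receives a priority string such as "P1" or "P3" (for example, by adding Datadog's $ALERT_PRIORITY variable to the webhook payload under a key of your choosing); the function name and the set of hotfix-eligible priorities are illustrative.

```python
# Assumption: only P1 alerts may result in a hotfix PR; adjust to your policy.
HOTFIX_PRIORITIES = {"P1"}


def build_prompt(alert_title: str, service: str, priority: str) -> str:
    """Build a session prompt whose last step depends on the alert priority."""
    lines = [
        f"Datadog alert fired: '{alert_title}' (priority {priority})",
        f"Service: {service}",
        "",
        "Using the Datadog MCP:",
        "1. Pull error logs for this service from the past 30 min",
        "2. Identify the top error messages and stack traces",
        "3. Check if this correlates with a recent deploy",
    ]
    if priority.upper() in HOTFIX_PRIORITIES:
        lines.append("4. If the root cause is clear, open a hotfix PR")
    else:
        lines.append("4. Report the root cause only; do not open a PR")
    return "\n".join(lines)


print(build_prompt("Error rate spike", "payments-service", "P3"))
```

The same branch point could select a different playbook_id per priority instead of editing the prompt text.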