coder-server

Author	SHA1	Message	Date
Kyle Carberry	5b32c4d79d	fix: prevent stdio MCP server subprocess from dying after connect (#24035 ) ## Problem MCP servers configured in `.mcp.json` with stdio transport are discovered successfully (tools appear) but die immediately after connection, making all tool calls fail. ## Root Cause In `connectServer`, the subprocess is spawned with `connectCtx` — a 30-second timeout context whose `cancel()` is deferred: ```go connectCtx, cancel := context.WithTimeout(ctx, connectTimeout) defer cancel() if err := c.Start(connectCtx); err != nil { ... } ``` The mcp-go stdio transport calls `exec.CommandContext(connectCtx, ...)`. When `connectServer` returns, `cancel()` fires, and `exec.CommandContext` sends SIGKILL to the subprocess. The process immediately becomes a zombie. Confirmed by checking `/proc/<pid>/status` after context cancellation: ``` State: Z (zombie) ``` ## Fix Pass the parent `ctx` (which is `a.gracefulCtx` — the agent's long-lived context) to `c.Start()`. `connectCtx` continues to bound only the `Initialize()` handshake. The subprocess is cleaned up when the Manager is closed or the parent context is canceled. ## Regression Test Added `TestConnectServer_StdioProcessSurvivesConnect` which: - Spawns a real subprocess (re-execs the test binary as a fake MCP server) - Calls `connectServer` and lets it return (internal `connectCtx` gets canceled) - Verifies the subprocess is still alive by calling `ListTools` The test fails on the old code with `transport error: context deadline exceeded` and passes with the fix. > Generated with [Coder Agents](https://coder.com/agents)	2026-04-05 12:04:13 +00:00
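The lifetime mismatch described in this fix can be illustrated with a minimal sketch. It does not spawn a real subprocess: `fakeProc`, `startProc`, `handshake`, and `connect` are hypothetical stand-ins, and the fake process simply "dies" when the context it was started with is canceled, which is an assumption modeled on how `exec.CommandContext` kills its child.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeProc stands in for a stdio subprocess (hypothetical type). It is
// considered dead once the context it was started with is canceled.
type fakeProc struct{ dead chan struct{} }

func startProc(ctx context.Context) *fakeProc {
	p := &fakeProc{dead: make(chan struct{})}
	go func() {
		<-ctx.Done()
		close(p.dead)
	}()
	return p
}

func (p *fakeProc) alive() bool {
	select {
	case <-p.dead:
		return false
	default:
		return true
	}
}

// handshake is a placeholder for the Initialize() round trip.
func handshake(ctx context.Context) error { return ctx.Err() }

// connect starts the process on the long-lived parent context and uses
// the short timeout context only to bound the handshake.
func connect(parent context.Context, timeout time.Duration) (*fakeProc, error) {
	connectCtx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	p := startProc(parent) // the buggy variant passed connectCtx here
	if err := handshake(connectCtx); err != nil {
		return nil, fmt.Errorf("initialize: %w", err)
	}
	return p, nil
}

// demo reports whether the process is alive after connect returns and
// after the parent context is canceled.
func demo() (aliveAfterConnect, aliveAfterCancel bool) {
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()
	p, err := connect(parent, time.Second)
	if err != nil {
		return false, false
	}
	aliveAfterConnect = p.alive()
	cancel()
	<-p.dead // wait until the fake process has observed the cancellation
	return aliveAfterConnect, p.alive()
}

func main() {
	a, b := demo()
	fmt.Println("alive after connect:", a, "| alive after parent cancel:", b)
}
```

Passing `connectCtx` to `startProc` instead would close `dead` as soon as `connect` returns, which mirrors the failure mode in the problem statement.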
Kyle Carberry	919dc299fc	feat: agent reads context files and discovers skills locally (#23935 ) Piggybacks on #23878. Moves instruction file reading and skill discovery from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips through the agent connection) to the agent itself (local filesystem access). This intentionally drops backward compatibility with older agents that don't support the context-config endpoint. Agents and server are deployed together; there is no rolling-update contract to maintain here. ## What changed The agent's `GET /api/v0/context-config` response now returns `[]ChatMessagePart` directly — the same types chatd persists. This eliminates intermediate type conversions and makes the protocol extensible. \| Field \| Type \| Description \| \|---\|---\|---\| \| `parts` \| `[]ChatMessagePart` \| Context-file and skill parts, ready to persist \| \| `working_dir` \| `string` \| Agent's resolved working directory \| Removed from the response: `instructions_dirs`, `instructions_file`, `skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads files locally and returns their content as parts. Removed from chatd: all legacy `LS`/`ReadFile` fallback code (`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills` via LS, etc). ## Why The previous architecture had the agent resolve paths, serve them over HTTP, then `chatd` make N+1 round-trips back through the agent connection to read files. The agent has direct filesystem access and should just read the files. ## Key design decisions - Agent returns `ChatMessagePart` directly — same types chatd persists. No intermediate `InstructionFileEntry`/`SkillEntry` types needed. - `SkillMeta.MetaFile` — persisted via `ContextFileSkillMetaFile` on the skill part, so custom meta file names (`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns. - No pre-read body — `read_skill` always dials the workspace to fetch the skill body on demand. 
Simpler than caching the body in the response. - MCP config paths kept agent-internal — `MCPConfigFiles()` getter, not sent over the wire. - No backward compat fallback — old agents that don't support context-config get no instruction files. This is acceptable since agent and server deploy together.	2026-04-04 12:45:46 -04:00
Hugo Dutka	17dec2a70f	feat: agents desktop recordings backend (#23894 ) This PR introduces screen recording of the computer use agent using the virtual desktop. - Screen recording is triggered by a `wait_agent` tool call. Recording is stopped by a successful `wait_agent` tool call or when there hasn't been any desktop activity for 10 minutes. - Recordings are handled by the `portabledesktop` cli via the `record` command. The videos are sped up in periods of inactivity. - Recordings are saved to the database in the `chat_files` table. There's a hard limit of 100MB per recording. Larger recordings are dropped. - A successful `wait_agent` on a computer use subagent tool call returns a `recording_file_id`, later allowing the frontend to display the corresponding video.	2026-04-02 17:23:27 +00:00
Cian Johnston	cd784c755a	fix(agent): exorcise data race haunting contextConfigAPI on reconnect (#23946 ) Fixes: coder/internal#1441 - Move `contextConfigAPI` init from `handleManifest` to `init()`, matching all other API fields - Change `agentcontextconfig.NewAPI` to accept `func() string` closure (lazy directory evaluation) - `Config()` and HTTP handler now compute on demand via `a.manifest.Load().Directory` - Widen `TestAgent_Reconnect` to loop 5 reconnections with a non-empty manifest directory - Add `TestContextConfigAPI_InitOnce` internal test verifying lazy eval across manifest changes - Add `TestNewAPI_LazyDirectory` unit test for the lazy contract > 🤖 Written by a Coder Agent. Reviewed by a human.	2026-04-02 09:00:13 +01:00
Kyle Carberry	ee855f9618	feat: make agent context paths configurable via env vars (#23878 ) Replace hardcoded paths for instruction files, skills, and MCP config with values read from `CODER_AGENT_EXP_` environment variables. Template authors configure paths via the existing `coder_agent` `env` block. The agent resolves `~`, relative, and absolute paths locally, then serves the resolved config over `GET /api/v0/context-config`. `chatd` fetches this once per workspace attach and falls back to today's defaults for older agents. All path env vars are comma-separated, allowing multiple directories: \| Env Var \| Default \| Controls \| \|---\|---\|---\| \| `CODER_AGENT_EXP_INSTRUCTIONS_DIRS` \| `~/.coder` \| Dirs containing the instruction file \| \| `CODER_AGENT_EXP_INSTRUCTIONS_FILE` \| `AGENTS.md` \| Instruction file name \| \| `CODER_AGENT_EXP_SKILLS_DIRS` \| `.agents/skills` \| Skills directories \| \| `CODER_AGENT_EXP_SKILL_META_FILE` \| `SKILL.md` \| Skill metadata file name \| \| `CODER_AGENT_EXP_MCP_CONFIG_FILES` \| `.mcp.json` \| MCP config files \| ### Example ```hcl resource "coder_agent" "main" { os = "linux" arch = "amd64" env = { CODER_AGENT_EXP_INSTRUCTIONS_DIRS = "/opt/company/agent-config,~/.coder" CODER_AGENT_EXP_INSTRUCTIONS_FILE = "CLAUDE.md" CODER_AGENT_EXP_SKILLS_DIRS = "/opt/company/ai-skills,.agents/skills" CODER_AGENT_EXP_MCP_CONFIG_FILES = "/opt/company/mcp.json,.mcp.json" } } ``` <details> <summary>Implementation Details</summary> ### Architecture Follows the same pattern as MCP tool discovery: agent resolves locally → exposes via HTTP → chatd consumes. 
Agent-side (`agent/agentcontextconfig/`): - `ResolvePath` / `ResolvePaths` handle `~`, relative, and absolute path forms; `ResolvePath` returns `""` for relative paths when baseDir is empty - `Config` reads env vars, falls back to defaults, resolves all paths - `GET /api/v0/context-config` serves the resolved config as JSON chatd-side (`coderd/x/chatd/`): - Calls `conn.ContextConfig()` once on first workspace attach - Falls back to hardcoded defaults on 404 (older agents) - Iterates instruction dirs, skills dirs using resolved absolute paths - `LSRelativityRoot` everywhere, no more home/root juggling ### Key design decisions - `EXP_` prefix: env vars use `CODER_AGENT_EXP_` to indicate experimental status - Plural names: comma-separated vars use plural names (`DIRS`, `FILES`); single-value vars use singular (`FILE`) - Defaults in `workspacesdk`: default constants live in `codersdk/workspacesdk/` so both agent and server reference them without cross-layer imports - `skillMetaFile` persistence: stored on context-file parts via `ContextFileSkillMetaFile` and restored on subsequent chat turns so custom values survive across turns - Working dir dedup: `slices.Contains` guard prevents reading the same instruction file from both `InstructionsDirs` and the working directory - MCP server dedup: first-occurrence-wins dedup prevents leaking duplicate connections from overlapping config files - ResolvePath safety: returns `""` for relative paths when `baseDir` is empty, so `ResolvePaths` filters them out ### Files changed \| File \| Change \| \|---\|---\| \| `agent/agentcontextconfig/` \| New package: path resolution + HTTP endpoint \| \| `codersdk/workspacesdk/agentconn.go` \| `ContextConfigResponse` type, default constants, client method \| \| `agent/agent.go` + `agent/api.go` \| Wire up endpoint, pass config to MCP \| \| `agent/x/agentmcp/manager.go` \| Accept `[]string` MCP config paths, dedup by name \| \| `coderd/x/chatd/chatd.go` \| Fetch config, thread through, named returns \| 
\| `coderd/x/chatd/instruction.go` \| Accept configurable dir + file name, `skillMetaFileFromParts` \| \| `coderd/x/chatd/chattool/skill.go` \| Accept configurable dirs + meta file \| \| `codersdk/chats.go` \| `ContextFileSkillMetaFile` field for persistence \| ### Test coverage - `TestConfig` (4 cases): defaults, custom env vars, whitespace trimming, comma-separated dirs - `TestResolvePath` / `TestResolvePaths`: including empty baseDir edge case - `TestPersistInstructionFilesFallbackOnOlderAgent`: backward-compat path when `ContextConfig` returns 404 - `TestChatMessagePartVariantTags`: updated exclusion list for new internal field ### Backward compatibility Older agents return 404 for the new endpoint. `chatd` catches this and falls back to today's defaults via `readHomeInstructionFile` (using `LSRelativityHome`). Existing workspaces work with no changes. </details>	2026-04-01 12:28:47 -04:00
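The path-resolution rules listed for #23878 (handling of `~`, relative and absolute forms, whitespace trimming, comma-separated values, and dropping relative paths when the base directory is empty) can be sketched as follows. This is a hypothetical reimplementation for illustration, not the code in `agent/agentcontextconfig/`; the function names, the explicit `home` parameter, and the Unix-style paths are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolvePath is a hypothetical sketch of the described rules. It returns
// "" when the path cannot be resolved (empty input, or a relative path
// with no base directory).
func resolvePath(home, baseDir, p string) string {
	p = strings.TrimSpace(p)
	switch {
	case p == "":
		return ""
	case p == "~":
		return home
	case strings.HasPrefix(p, "~/"):
		return filepath.Join(home, p[2:])
	case filepath.IsAbs(p):
		return filepath.Clean(p)
	case baseDir == "":
		return "" // relative path without a base directory is unresolvable
	default:
		return filepath.Join(baseDir, p)
	}
}

// resolvePaths splits a comma-separated value and keeps only the entries
// that resolve to a non-empty path.
func resolvePaths(home, baseDir, raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if r := resolvePath(home, baseDir, part); r != "" {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	raw := "/opt/company/agent-config, ~/.coder ,.agents/skills"
	fmt.Println(resolvePaths("/home/coder", "/work", raw))
	fmt.Println(resolvePaths("/home/coder", "", raw))
}
```

With an empty base directory the relative entry `.agents/skills` is filtered out, which corresponds to the "ResolvePath safety" design decision.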
Kyle Carberry	19e44f4136	fix: target specific chat in MarkStale instead of broadcasting to all workspace chats (#23883 ) ## Problem Subagent chats were receiving git context (branch, remote origin, PR status) from their parent or sibling chats' git operations. When a git operation triggers external auth, the workspace agent sends `chat_id` identifying which chat initiated it — but this was broken at two levels: 1. Agent side: `CODER_CHAT_ID` was never injected into process environments. `chatd` sets `Coder-Chat-Id` HTTP headers and the agent extracts them for process isolation, but never propagated `CODER_CHAT_ID` to `cmd.Env`. So `gitaskpass` always sent an empty `chat_id`. 2. Server side: `workspaceAgentsExternalAuth` ignored the `chat_id` query param. `MarkStale` broadcast git context to all chats on the workspace via `filterChatsByWorkspaceID`. ## Fix - Inject `CODER_CHAT_ID` into `cmd.Env` in `agentproc` when the chat ID is known, so `gitaskpass` can read and forward it. - Read `chat_id` from query params in `workspaceAgentsExternalAuth` and thread it through `chatGitRef`. - Refactor `MarkStale` to accept a `MarkStaleParams` struct. When `ChatID` is provided, target only that specific chat. When empty (legacy agents, non-chat git operations), fall back to the existing workspace-wide broadcast. - Extract `markStaleSingle` helper to deduplicate the upsert+publish logic. 
<details><summary>Investigation notes</summary> ### Data flow before fix ``` chatd → sets Coder-Chat-Id header on agent conn agent → extracts chatID, stores on process struct agent → does NOT set CODER_CHAT_ID in cmd.Env ← gap 1 gitaskpass → reads CODER_CHAT_ID (always empty), sends chat_id="" server handler → ignores chat_id query param ← gap 2 MarkStale → broadcasts to ALL workspace chats ``` ### Data flow after fix ``` chatd → sets Coder-Chat-Id header on agent conn agent → extracts chatID, stores on process struct agent → sets CODER_CHAT_ID in cmd.Env gitaskpass → reads CODER_CHAT_ID, sends chat_id=<uuid> server handler → reads chat_id, passes to MarkStale MarkStale → targets only that specific chat ``` </details>	2026-04-01 13:04:59 +00:00
Ethan	b86161e0a6	test: fix TestServer_X11_EvictionLRU hang on fish shell (#23838 ) `TestServer_X11_EvictionLRU` hangs forever when the developer's login shell is `fish`. This is the only test in the repo that breaks on fish, and it meant I couldn't run `make test` or similar without it blocking indefinitely. The test uses `sess.Shell()` to start interactive shell sessions, which causes the SSH server to run the user's login shell directly (`fish -l`). Fish buffers all piped stdin to EOF before executing any of it, so the test's `echo ready-0\n` write never gets processed — fish sits waiting for the pipe to close, and the test sits waiting for the echo response. The fix is a one-line change: `sess.Shell()` → `sess.Start("sh")`. The test is exercising X11 LRU eviction, not shell behavior, so using `sh` explicitly is both correct and shell-agnostic. The DISPLAY environment variable is set identically either way since the x11-req handler runs before `sessionStart`.	2026-04-01 12:31:22 +11:00
Kyle Carberry	0f86c4237e	feat: add workspace MCP tool discovery and proxying for chat (#23680 ) Coder's chat (chatd) can now discover and use MCP servers configured in a workspace's `.mcp.json` file. This brings project-specific tooling (GitHub, databases, docs servers, etc.) into the chat without any manual configuration. ## How it works The workspace agent reads `.mcp.json` from the workspace directory (same format Claude Code uses), connects to the declared MCP servers — spawning child processes for stdio servers and connecting over the network for HTTP/SSE — and caches their tool lists. Two new agent HTTP endpoints expose this: - `GET /api/v0/mcp/tools` returns the cached tool list (supports `?refresh=true`) - `POST /api/v0/mcp/call-tool` proxies calls to the correct server On each chat turn, chatd calls `ListMCPTools` through the existing `AgentConn` tailnet connection, wraps each tool as a `fantasy.AgentTool`, and adds them to the LLM's tool set alongside built-in and admin-configured MCP tools. Tool names are prefixed with the server name (`github__create_issue`) to avoid collisions. Failed server connections are logged and skipped — they never block the agent or break the chat. Child stdio processes are terminated on agent shutdown.	2026-03-26 19:57:02 +00:00
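The tool-name prefixing mentioned in #23680 (`github__create_issue`) can be sketched as a pair of helpers. The helper names are hypothetical, and splitting at the first `__` is an assumption; the source only states that tool names are prefixed with the server name.

```go
package main

import (
	"fmt"
	"strings"
)

const sep = "__"

// prefixTool builds the name exposed to the LLM from a server name and a
// tool name (hypothetical helper).
func prefixTool(server, tool string) string {
	return server + sep + tool
}

// splitTool recovers the server and tool from a prefixed name. It splits
// at the first separator, so a server name containing "__" would be
// ambiguous in this sketch.
func splitTool(name string) (server, tool string, ok bool) {
	return strings.Cut(name, sep)
}

func main() {
	name := prefixTool("github", "create_issue")
	server, tool, ok := splitTool(name)
	fmt.Println(name, server, tool, ok)
}
```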
Cian Johnston	847a88c6ca	chore: clean up stale and dangerous //nolint comments (#23643 ) ## Changes - Commit 1: Remove 17 unnecessary `//nolint` directives: - `//nolint:varnamelen` — linter not active - `//nolint:unused` on exported `SlimUnsupported` - `//nolint:govet` in `coderd/httpmw/csrf` — no longer fires - `//nolint:revive` on functions refactored since the nolint was added - `//nolint:paralleltest` citing Go 1.22 loop variable capture (obsolete) - Bare `//nolint` narrowed to specific `//nolint:gocritic` with justification - Commit 2: Fix root causes behind 5 dangerous nolint suppressions: - Add `MinVersion: tls.VersionTLS12` to TLS client config (removes `gosec` G402) - Delete trivial unexported wrappers `apiKey()`/`normalizeProvider()` in chatprovider (removes `revive` confusing-naming) - Add doc comments to `StartWithAssert` and `Router` (removes `revive` exported) - Rename unused parameters to `_` in integration test helpers > 🤖 This PR was created using Coder Agents and reviewed by me.	2026-03-26 14:13:53 +00:00
Cian Johnston	c753a622ad	refactor(agent): move agentdesktop under x/ subpackage (#23610 ) - Move `agent/agentdesktop/` to `agent/x/agentdesktop/` to signal experimental/unstable status - Update import paths in `agent/agent.go` and `api_test.go` > 🤖 This mechanical refactor was performed by an agent. I made sure it didn't change anything it wasn't supposed to.	2026-03-25 18:23:52 +00:00
Mathias Fredriksson	798a6673c6	fix(agent/agentfiles): make multi-file edit_files atomic (#23493 ) When edit_files receives multiple files, each file was processed independently: read, compute edits, write. If file B failed, file A was already written to disk. The caller got an error but had no way to know which files were modified. Split editFile into prepareFileEdit (read + compute, no side effects) and a write phase. The handler runs all preparations first and writes only if every file's edits succeed. A write-phase failure (e.g. disk full) can still leave earlier files committed. True cross-file atomicity would require filesystem transactions. The prepare phase catches the common failure modes: bad paths, search misses, permission errors.	2026-03-24 19:23:57 +00:00
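The prepare-then-write split described in #23493 can be sketched against an in-memory file map. The `edit` type and `applyEdits` function are hypothetical and only model the idea: compute every result first with no side effects, and commit only if all preparations succeed.

```go
package main

import (
	"fmt"
	"strings"
)

// edit is a hypothetical single search/replace request for one file.
type edit struct{ Path, Search, Replace string }

// applyEdits prepares all edits without touching files, then writes the
// results only if every preparation succeeded.
func applyEdits(files map[string]string, edits []edit) error {
	pending := make(map[string]string, len(edits))

	// Prepare phase: read and compute, no side effects on files.
	for _, e := range edits {
		content, ok := pending[e.Path]
		if !ok {
			content, ok = files[e.Path]
			if !ok {
				return fmt.Errorf("%s: no such file", e.Path)
			}
		}
		if !strings.Contains(content, e.Search) {
			return fmt.Errorf("%s: search string not found", e.Path)
		}
		pending[e.Path] = strings.Replace(content, e.Search, e.Replace, 1)
	}

	// Write phase: reached only when every edit was prepared successfully.
	for path, content := range pending {
		files[path] = content
	}
	return nil
}

func main() {
	files := map[string]string{"a.txt": "foo", "b.txt": "bar"}
	err := applyEdits(files, []edit{{"a.txt", "foo", "FOO"}, {"b.txt", "missing", "x"}})
	fmt.Println(err, files)
}
```

As the commit message notes, a failure during the write phase of a real filesystem can still leave earlier files committed; the in-memory map in this sketch cannot show that case.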
Mathias Fredriksson	1c0442c247	fix(agent/agentfiles): fix replace_all in fuzzy matching mode (#23480 ) replace_all in fuzzy mode (passes 2 and 3 of fuzzyReplace) only replaced the first match. seekLines returned the first match, spliceLines replaced one range, and there was no loop. Extract fuzzy pass logic into fuzzyReplaceLines which: - Returns a 3-tuple (result, matched, error) for clean caller flow - When replaceAll is true, collects all non-overlapping matches then applies replacements from last to first to preserve indices - When replaceAll is false with multiple matches, returns an error Add test cases for replace_all with fuzzy trailing whitespace and fuzzy indent matching.	2026-03-24 14:41:45 +02:00
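The "collect all non-overlapping matches, then replace from last to first" approach can be sketched on a slice of lines. This is not the `fuzzyReplaceLines` implementation; the names are hypothetical, and only one fuzzy rule (ignoring trailing whitespace) is modeled.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// span is a half-open range of line indices [start, end).
type span struct{ start, end int }

// matchAt compares needle against lines starting at i, ignoring trailing
// spaces and tabs.
func matchAt(lines, needle []string, i int) bool {
	for j, n := range needle {
		if strings.TrimRight(lines[i+j], " \t") != strings.TrimRight(n, " \t") {
			return false
		}
	}
	return true
}

// findAll returns all non-overlapping matches in ascending order.
func findAll(lines, needle []string) []span {
	if len(needle) == 0 {
		return nil
	}
	var out []span
	for i := 0; i+len(needle) <= len(lines); {
		if matchAt(lines, needle, i) {
			out = append(out, span{i, i + len(needle)})
			i += len(needle)
		} else {
			i++
		}
	}
	return out
}

// replaceAllLines applies replacements from the last match to the first,
// so earlier indices stay valid even when repl has a different length.
func replaceAllLines(lines, needle, repl []string) ([]string, int) {
	matches := findAll(lines, needle)
	out := slices.Clone(lines)
	for k := len(matches) - 1; k >= 0; k-- {
		m := matches[k]
		out = slices.Replace(out, m.start, m.end, repl...)
	}
	return out, len(matches)
}

// demo joins the result so it is easy to inspect.
func demo() (string, int) {
	out, n := replaceAllLines([]string{"a  ", "x", "a", "y"}, []string{"a"}, []string{"b", "c"})
	return strings.Join(out, ","), n
}

func main() {
	fmt.Println(demo())
}
```

Replacing front to back would shift the indices of later matches whenever the replacement has a different number of lines than the match, which is why the order is reversed.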
Mathias Fredriksson	16edcbdd5b	fix(agent/agentfiles): follow symlinks in write_file and edit_files (#23478 ) Both write_file and edit_files use atomic writes (write to temp file, then rename). Since rename operates on directory entries, it replaces symlinks with regular files instead of writing through the link to the target. Add resolveSymlink() that uses afero.Lstater/LinkReader to resolve symlink chains (up to 10 levels) before the atomic write. Both writeFile and editFile resolve the path before any filesystem operations, matching the behavior of 'echo content > symlink'. Gracefully no-ops on filesystems that don't support symlinks (e.g. MemMapFs used in existing tests).	2026-03-24 12:39:55 +00:00
Mathias Fredriksson	147df5c971	refactor: replace sort.Strings with slices.Sort (#23457 ) The slices package provides type-safe generic replacements for the old typed sort convenience functions. The codebase already uses slices.Sort in 43 call sites; this finishes the migration for the remaining 29. - sort.Strings(x) -> slices.Sort(x) - sort.Float64s(x) -> slices.Sort(x) - sort.StringsAreSorted(x) -> slices.IsSorted(x)	2026-03-23 23:19:23 +02:00
Hugo Dutka	3163e74b77	fix: bump agents desktop resolution to 1920x1080 (#23425 ) This PR changes agents desktop resolution from 1366x768 to 1920x1080. Anthropic requires that the resolution of desktop screenshots fits in 1,150,000 total pixels, so we downscale screenshots to 1280x720 before sending them to the LLM provider. Resolution scaling was already implemented, but our code didn't exercise it. The resolution bump showed that there were some bugs in the scaling logic - this PR fixes these bugs too.	2026-03-23 11:51:10 +01:00
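The pixel arithmetic behind this change can be checked with a few lines. The 1,150,000 budget and the three resolutions come from the commit message; the `fits` helper is hypothetical.

```go
package main

import "fmt"

// maxPixels is the total-pixel budget quoted in the commit message.
const maxPixels = 1_150_000

// fits reports whether a w x h screenshot stays within the budget.
func fits(w, h int) bool { return w*h <= maxPixels }

func main() {
	for _, r := range [][2]int{{1366, 768}, {1920, 1080}, {1280, 720}} {
		fmt.Printf("%dx%d = %d pixels, fits: %v\n", r[0], r[1], r[0]*r[1], fits(r[0], r[1]))
	}
}
```

The old 1366x768 resolution (1,049,088 pixels) already fit the budget, while 1920x1080 (2,073,600 pixels) does not and 1280x720 (921,600 pixels) does. That is one plausible reading of why the scaling path was not exercised before the bump, though the commit message does not spell this out.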
Mathias Fredriksson	4aa94fcd4c	fix: StatusWriter Unwrap and process output error recovery (#23383 ) Add Unwrap() to StatusWriter so http.ResponseController.SetWriteDeadline can reach the underlying net.Conn through the middleware wrapper. Without this, the agent's 20s WriteTimeout killed blocking process output connections. Also add 30s headroom to the write deadline in handleProcessOutput so the response can be written after a full-duration blocking wait. On the tool layer, waitForProcess and the process_output tool now try a non-blocking snapshot on any error, not just context timeout. Transport errors (like the WriteTimeout EOF) previously returned with no process ID and no recovery path. Now if the process finished, the result is returned transparently. If still running, the error includes the process ID and tells the agent to use process_output.	2026-03-20 20:00:55 +00:00
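Why `Unwrap()` matters for `http.ResponseController` can be shown with a small sketch. The wrapper types below are hypothetical, and `Flush` is used instead of `SetWriteDeadline` only because `httptest.ResponseRecorder` supports flushing but not deadlines; the unwrapping mechanism is the same for both methods.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusWriter is a hypothetical middleware wrapper that records the
// status code and exposes the wrapped writer through Unwrap.
type statusWriter struct {
	http.ResponseWriter
	Status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.Status = code
	w.ResponseWriter.WriteHeader(code)
}

// Unwrap lets http.ResponseController reach the underlying writer.
func (w *statusWriter) Unwrap() http.ResponseWriter { return w.ResponseWriter }

// opaqueWriter wraps a writer without Unwrap, hiding optional interfaces.
type opaqueWriter struct{ http.ResponseWriter }

// flushVia tries to flush through whatever wrapper it is given.
func flushVia(w http.ResponseWriter) error {
	return http.NewResponseController(w).Flush()
}

// demo returns the flush result with and without Unwrap.
func demo() (withUnwrap, withoutUnwrap error) {
	withUnwrap = flushVia(&statusWriter{ResponseWriter: httptest.NewRecorder()})
	withoutUnwrap = flushVia(opaqueWriter{httptest.NewRecorder()})
	return withUnwrap, withoutUnwrap
}

func main() {
	a, b := demo()
	fmt.Println("with Unwrap:", a)
	fmt.Println("without Unwrap:", b)
}
```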
Mathias Fredriksson	c60a3568d7	fix: resolve flaky TestAgent_Session_TTY_MOTD_Update (#23375 ) The 5ms ServiceBannerRefreshInterval caused excessive DRPC connection churn (200 calls/s) under the race detector, creating heavy mutex contention on FakeAgentAPI and significant CPU overhead. This made the test timing-sensitive in ways that manifested as session.Wait() hangs, killing the test binary via timeout. Three changes: - Increase refresh interval from 5ms to testutil.IntervalFast (25ms), reducing DRPC connection churn and mutex contention by 5x. - Replace bare <-ready receives with testutil.TryReceive so the test fails with context expiry instead of hanging indefinitely. - Add a timeout to session.Wait() in testSessionOutput to prevent any SSH session hang from killing the entire test binary. Fixes coder/internal#1417	2026-03-20 19:33:10 +00:00
Mathias Fredriksson	f3b91b7f11	fix(agent/agentfiles): use Create-style permissions for temp files (#23339 ) Replace afero.TempFile (which uses os.CreateTemp with mode 0600) with a custom createTempFile that uses OpenFile with mode 0666. This lets the kernel apply the process umask, matching the default behavior of os.Create. New files now get ~0644 (with standard umask) instead of 0600. Extract atomicWrite(ctx, path, mode, haveMode, reader) to share the entire temp-file lifecycle between writeFile and editFile.	2026-03-20 21:30:28 +02:00
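The temp-file lifecycle described here (create with mode 0666 so the umask applies, write, then rename over the target) can be sketched with the standard library. This is a simplified, hypothetical version: the real `atomicWrite` takes `ctx`, `mode`, `haveMode`, and a reader and works through afero, and the PID-based temp name below is an assumption chosen for brevity.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWrite writes data to a temp file in the same directory and then
// renames it over path, so a failed write leaves the original intact.
func atomicWrite(path string, data []byte) error {
	dir := filepath.Dir(path)
	tmp := filepath.Join(dir, fmt.Sprintf(".%s.tmp-%d", filepath.Base(path), os.Getpid()))

	// Mode 0666 is filtered by the process umask, like os.Create.
	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o666)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		os.Remove(tmp)
		return err
	}
	return nil
}

// demo writes twice to the same path and returns the final content.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "config.txt")
	for _, v := range []string{"v1", "v2"} {
		if err := atomicWrite(path, []byte(v)); err != nil {
			return "", err
		}
	}
	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	fmt.Println(demo())
}
```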
Spike Curtis	ac51610332	fix(agent): downgrade script completion error log to warn (#23369 ) Downgrades the "reporting script completed" log in `agentscripts` from ERROR to WARN. During agent reconnects, the `scriptCompleted` RPC can race with the connection teardown, producing a "connection closed" error. Since `slogtest` treats ERROR logs as test failures, this causes `TestAgent_ReconnectNoLifecycleReemit` to flake on macOS. A failed timing report is non-fatal — the script itself has already finished, and the agent will continue operating normally. WARN is the appropriate severity, consistent with the call site in `agent.go:createDevcontainer`. Also switches from `fmt.Sprintf` to structured `slog.Error` fields for consistency with the rest of the codebase. Fixes coder/internal#1410	2026-03-20 11:34:06 -04:00
Mathias Fredriksson	41e15ae440	feat: make process output blocking-capable (#23312 ) Replace the 200ms polling loop in chatd's execute and process_output tools with server-side blocking via sync.Cond on HeadTailBuffer. The agent's GET /{id}/output endpoint accepts ?wait=true to block until the process exits or a 5-minute server cap expires. The process_output tool blocks by default for 10s (overridable via wait_timeout), and falls back to a non-blocking snapshot on timeout. The execute tool's foreground path makes a single blocking call instead of polling. Related #23316	2026-03-20 14:33:55 +02:00
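Blocking on a `sync.Cond` until a process exits, with a timeout that falls back to a snapshot, can be sketched as below. The `outputBuffer` type is a hypothetical stand-in for `HeadTailBuffer`, and using `context.AfterFunc` (Go 1.21+) to wake the waiter on timeout is an assumption about one way to combine a condition variable with a deadline.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// outputBuffer is a hypothetical buffer that lets readers block until
// the producing process has exited.
type outputBuffer struct {
	mu   sync.Mutex
	cond *sync.Cond
	data []byte
	done bool
}

func newOutputBuffer() *outputBuffer {
	b := &outputBuffer{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

func (b *outputBuffer) Write(p []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.data = append(b.data, p...)
}

// Close marks the process as exited and wakes all waiters.
func (b *outputBuffer) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.done = true
	b.cond.Broadcast()
}

// WaitExit blocks until the process exits or ctx is done, then returns
// the current output and whether the process has exited.
func (b *outputBuffer) WaitExit(ctx context.Context) (string, bool) {
	// Wake the waiter when ctx ends; the lock avoids a missed wakeup.
	stop := context.AfterFunc(ctx, func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		b.cond.Broadcast()
	})
	defer stop()

	b.mu.Lock()
	defer b.mu.Unlock()
	for !b.done && ctx.Err() == nil {
		b.cond.Wait()
	}
	return string(b.data), b.done
}

// demo covers both outcomes: a process that exits and one that times out.
func demo() (out1 string, done1 bool, out2 string, done2 bool) {
	b1 := newOutputBuffer()
	go func() {
		b1.Write([]byte("ready\n"))
		b1.Close()
	}()
	out1, done1 = b1.WaitExit(context.Background())

	b2 := newOutputBuffer()
	b2.Write([]byte("partial"))
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	out2, done2 = b2.WaitExit(ctx)
	return out1, done1, out2, done2
}

func main() {
	fmt.Println(demo())
}
```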
Mathias Fredriksson	6edcbdba7f	fix(agent/agentproc): enforce chat ID isolation on output and signal endpoints (#23316 ) handleProcessOutput and handleSignalProcess did not check the chat ID from the request. Any caller that knew a process ID could read output or signal processes belonging to other chats. handleListProcesses already filtered by chat ID. Apply the same check to the output and signal handlers. Non-chat callers (no Coder-Chat-Id header) are allowed through for backwards compatibility.	2026-03-20 11:24:45 +02:00
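The access rule described in #23316 reduces to a small predicate. The function name is hypothetical; the rule itself (non-chat callers pass, chat callers must match the owning chat) is taken from the commit message.

```go
package main

import "fmt"

// mayAccess reports whether a request may read output from or signal a
// process. requestChatID is the value of the Coder-Chat-Id header, empty
// for non-chat callers.
func mayAccess(requestChatID, processChatID string) bool {
	if requestChatID == "" {
		return true // non-chat callers are allowed for backwards compatibility
	}
	return requestChatID == processChatID
}

func main() {
	fmt.Println(mayAccess("", "chat-a"))
	fmt.Println(mayAccess("chat-a", "chat-a"))
	fmt.Println(mayAccess("chat-b", "chat-a"))
}
```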
Mathias Fredriksson	de4e568994	fix(agent/agentfiles): atomic writes and permission preservation (#23336 ) Both writeFile and editFile now use the same atomic write strategy: temp file in the same directory, write, rename. This ensures a failed write leaves the original file intact instead of truncated. editFile already used temp-and-rename but lost the original file's permissions because afero.TempFile creates with mode 0600. Both functions now Chmod after rename to preserve the original mode. writeFile also swallowed io.Copy errors (logged but returned HTTP 200). Fixed to return the error so the client knows the write failed.	2026-03-20 01:56:19 +02:00
Zach	c2bc2c5738	fix: fix data race in fakeContainerCLI test helper (#23335 ) The fakeContainerCLI struct had a sync.Mutex but it wasn't used in all methods where the shared data is accessed.	2026-03-19 23:14:46 +00:00
Cian Johnston	65b7658568	chore: extract testutil.FakeSink for slog test assertions (#23208 ) Follow-up to [review comment on #23025](https://github.com/coder/coder/pull/23025#discussion_r2930309487) from @mafredri. Extracts the repeated `logSink` / `fakeSink` test pattern into a shared `testutil.FakeSink` and migrates all existing call sites. > 🤖 This PR was created with the help of Coder Agents, and will be reviewed by my human. 🧑‍💻 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 17:02:38 +00:00
Mathias Fredriksson	119030d795	fix(agent): default process working directory to agent dir or $HOME (#23224 ) Processes started via the agent process API inherited the agent's own working directory (/tmp/coder.xxx) when no WorkDir was specified. SSH sessions already use a fallback chain: configured agent directory > $HOME. This wires the same manifest directory closure into the process manager so the priority is now: explicit req.WorkDir > agent configured dir > $HOME The resolved directory is recorded on the process struct so ProcessInfo.WorkDir and pathStore notifications reflect where the process actually ran.	2026-03-18 16:46:26 +00:00
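The working-directory priority from #23224 (explicit `req.WorkDir`, then the agent's configured directory, then `$HOME`) can be sketched as a first-non-empty lookup. The function name and parameters are hypothetical.

```go
package main

import "fmt"

// resolveWorkDir returns the first non-empty candidate in priority
// order: explicit request, agent configured directory, home directory.
func resolveWorkDir(reqDir, agentDir, home string) string {
	for _, d := range []string{reqDir, agentDir, home} {
		if d != "" {
			return d
		}
	}
	return ""
}

func main() {
	fmt.Println(resolveWorkDir("", "/workspace/app", "/home/coder"))
	fmt.Println(resolveWorkDir("", "", "/home/coder"))
}
```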
Hugo Dutka	fdb1205bdf	chore(agent): remove portabledesktop download logic (#23128 ) The new way to install portabledesktop in a workspace will be via a module: https://github.com/coder/registry/pull/805	2026-03-17 15:24:11 +01:00
Kyle Carberry	6972d073a2	fix: improve background process handling for agent tools (#23132 ) ## Problem Models frequently use shell `&` instead of `run_in_background=true` when starting long-running processes through `/agents`, causing them to die shortly after starting. This happens because: 1. No guidance in tool schema — The `ExecuteArgs` struct had zero `description` tags. The model saw `run_in_background: boolean (optional)` with no explanation of when/why to use it. 2. Shell `&` is silently broken — `sh -c "command &"` forks the process, the shell exits immediately, and the forked child becomes an orphan not tracked by the process manager. 3. No process group isolation — The SSH subsystem sets `Setsid: true` on spawned processes, but the agent process manager set no `SysProcAttr` at all. Signals only hit the top-level `sh`, not child processes. ## Investigation Compared our implementation against openai/codex and coder/mux: \| Aspect \| codex \| mux \| coder/coder (before) \| \|--------\|-------\|-----\|---------------------\| \| Background flag \| Yield/resume with `session_id` \| `run_in_background` with rich description \| `run_in_background` with no description \| \| `&` handling \| `setsid()` + `killpg()` \| `detached: true` + `killProcessTree()` \| Nothing — orphaned children escape \| \| Process isolation \| `setsid()` on every spawn \| `set -m; nohup ... setsid` for background \| No `SysProcAttr` at all \| \| Signal delivery \| `killpg(pgid, sig)` — entire group \| `kill -15 -\$pid` — negative PID \| `proc.cmd.Process.Signal()` — PID only \| ## Changes ### Fix 1: Add descriptions to `ExecuteArgs` (highest impact) The model now sees explicit guidance: "Use for long-running processes like dev servers, file watchers, or builds. Do NOT use shell & — it will not work correctly." ### Fix 2: Update tool description The top-level execute tool description now reinforces: "Use run_in_background=true for long-running processes. 
Never use shell '&' for backgrounding." ### Fix 3: Detect trailing `&` and auto-promote to background Defense-in-depth: if the model still uses `command &`, we strip the `&` and promote to `run_in_background=true` automatically. Correctly distinguishes `&` from `&&`. ### Fix 4: Process group isolation (`Setpgid`) New platform-specific files (`proc_other.go` / `proc_windows.go`) following the same pattern as `agentssh/exec_other.go`. Every spawned process gets its own process group. ### Fix 5: Process group signaling `signal()` now uses `syscall.Kill(-pid, sig)` on Unix to signal the entire process group, ensuring child processes from shell pipelines are also cleaned up. ## Testing All existing `agent/agentproc` tests pass. Both packages compile cleanly.	2026-03-16 16:22:10 -04:00
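Fix 3 (detecting a trailing `&` while leaving `&&` alone) can be sketched with string helpers. This is a hypothetical simplification: it does not parse shell quoting or escapes such as `\&`, and the function name is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTrailingAmp removes a single trailing "&" and reports whether the
// command should be promoted to a background process. A trailing "&&" is
// left untouched.
func stripTrailingAmp(cmd string) (string, bool) {
	t := strings.TrimRight(cmd, " \t\r\n")
	if !strings.HasSuffix(t, "&") || strings.HasSuffix(t, "&&") {
		return cmd, false
	}
	stripped := strings.TrimRight(strings.TrimSuffix(t, "&"), " \t")
	if stripped == "" {
		return cmd, false // nothing left to run
	}
	return stripped, true
}

func main() {
	for _, c := range []string{"npm run dev &", "make && ", "a && b &"} {
		out, bg := stripTrailingAmp(c)
		fmt.Printf("%q -> %q background=%v\n", c, out, bg)
	}
}
```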
Kyle Carberry	32a894d4a7	fix: error on ambiguous matches in edit_files tool (#23125 ) ## Problem The `edit_files` tool used `strings.ReplaceAll` for exact substring matches, silently replacing every occurrence. When an LLM's search string wasn't unique in the file, this caused unintended edits. Fuzzy matches (passes 2 and 3) only replaced the first occurrence, creating inconsistent behavior. Zero matches were also silently ignored. ## Investigation Investigated how coder/mux and openai/codex handle this: \| Tool \| Multiple matches \| No match \| Flag \| \|---\|---\|---\|---\| \| coder/mux `file_edit_replace_string` \| Error (default `replace_count=1`) \| Error \| `replace_count` (int, default 1, -1=all) \| \| openai/codex `apply_patch` \| Uses first match after cursor (structural disambiguation via context lines + `@@` markers) \| Error \| None (different paradigm) \| \| coder/coder `edit_files` (before) \| Exact: replaces all. Fuzzy: replaces first. \| Silent success \| None \| ## Solution Adopted the mux approach (error on ambiguity) with a simpler `replace_all: bool` instead of `replace_count: int`: - Default (`replace_all: false`): search string must match exactly once. Multiple matches → error with guidance: "search string matches N occurrences. Include more surrounding context to make the match unique, or set replace_all to true" - `replace_all: true`: replaces all occurrences (opt-in for intentional bulk operations like variable renames) - Zero matches: now returns an error instead of silently succeeding Chose `bool` over `int` count because: 1. LLMs are bad at counting occurrences 2. The real intent is binary (one specific spot vs. all occurrences) 3. 
Simpler error recovery loop for the LLM ## Changes \| File \| Change \| \|---\|---\| \| `codersdk/workspacesdk/agentconn.go` \| Add `ReplaceAll bool` to `FileEdit` struct \| \| `agent/agentfiles/files.go` \| Count matches before replacing; error if >1 and not opted in; error on zero matches; add `countLineMatches` helper \| \| `codersdk/toolsdk/toolsdk.go` \| Expose `replace_all` in tool schema with description \| \| `agent/agentfiles/files_test.go` \| Update existing tests, add `EditEditAmbiguous`, `EditEditReplaceAll`, `NoMatchErrors`, `AmbiguousExactMatch`, `ReplaceAllExact` \|	2026-03-16 16:17:33 +00:00
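The exact-match rule adopted in #23125 (error on zero matches, error on multiple matches unless `replace_all` is set) can be sketched as follows. The function name and error wording are hypothetical, and the fuzzy passes are not modeled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// replaceChecked counts matches before replacing, so ambiguous or missing
// search strings produce an error instead of a silent edit.
func replaceChecked(content, search, replace string, replaceAll bool) (string, error) {
	if search == "" {
		return "", errors.New("search string is empty")
	}
	n := strings.Count(content, search)
	switch {
	case n == 0:
		return "", errors.New("search string not found")
	case n > 1 && !replaceAll:
		return "", fmt.Errorf("search string matches %d occurrences; add context or set replace_all", n)
	}
	return strings.ReplaceAll(content, search, replace), nil
}

func main() {
	fmt.Println(replaceChecked("x = 1; x = 2", "x", "y", false))
	fmt.Println(replaceChecked("x = 1; x = 2", "x", "y", true))
}
```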
Mathias Fredriksson	1adc22fffd	fix(agent/reaper): skip reaper tests in CI (#23068 ) ForkReap's syscall.ForkExec and process-directed signals remain flaky in CI despite the subprocess isolation added in #22894. Restore the testutil.InCI() skip guard that was removed in that change. Fixes coder/internal#1402	2026-03-14 21:15:47 +01:00
Hugo Dutka	84527390c6	feat: chat desktop backend (#23005 ) Implement the backend for the desktop feature for agents. - Adds a new `/api/experimental/chats/$id/desktop` endpoint to coderd which exposes a VNC stream from a [portabledesktop](https://github.com/coder/portabledesktop) process running inside the workspace - Adds a new `spawn_computer_use_agent` tool to chatd, which spawns a subagent that has access to the `computer` tool which lets it interact with the `portabledesktop` process running inside the workspace - Adds the plumbing to make the above possible There's a follow up frontend PR here: https://github.com/coder/coder/pull/23006	2026-03-13 19:49:34 +01:00
Mathias Fredriksson	efe114119f	fix(agent/reaper): run reaper tests in isolated subprocesses (#22894 ) Tests that call ForkReap or send signals to their own process now re-exec as isolated subprocesses. This prevents ForkReap's syscall.ForkExec and process-directed signals from interfering with the parent test binary or other tests running in parallel. Also: - Wait for the reaper goroutine to fully exit between subtests to prevent overlapping reapers from competing on Wait4(-1). - Register signal handlers synchronously before spawning the forwarding goroutine so no signal is lost between ForkExec and the handler being ready.	2026-03-13 19:33:02 +02:00
Kyle Carberry	0e1846fe2a	fix(agent): reap exited processes and scope process list by chat ID (#22944 )	2026-03-12 14:51:05 -07:00
Jon Ayers	22a87f6cf6	fix: filter sub-agents from build duration metric (#22732 )	2026-03-10 12:17:32 -05:00
Mathias Fredriksson	beed379b1d	fix(agent): handle ignored filepath.Walk error in filefinder (#22853 ) Log a warning when filepath.Walk fails during recursive directory watching instead of silently discarding the error.	2026-03-10 15:43:24 +02:00
Mathias Fredriksson	6e9e39a4e0	fix(agent/reaper): stop reaper goroutine in tests to prevent ECHILD race (#22844 ) Each ForkReap call started a reap.ReapChildren goroutine that never stopped (done=nil). Goroutines accumulated across subtests, racing to call Wait4(-1, WNOHANG) and stealing the child's wait status before ForkReap's Wait4(pid) could collect it. Add a WithDone option to pass the done channel through to ReapChildren, and use it in tests via a withDone(t) helper.	2026-03-09 17:34:44 +00:00
Mathias Fredriksson	4957888270	fix(agent/agentssh): make X11 max port configurable to fix test timeout (#22840 ) TestServer_X11_EvictionLRU was timing out under -race because it created 190 sequential SSH shell sessions (~0.55s each = ~105s), exceeding the 90s test timeout. The session count was derived from the production X11MaxPort constant (6200). Add a configurable X11MaxPort field to Config so the test can use a small port range (5 ports instead of 190). This reduces the number of sessions from 190 to 4, completing in ~3.8s under -race.	2026-03-09 17:03:22 +02:00
Hugo Dutka	703629f5e9	fix(agentgit): close subscribe-before-listen race in handleWatch (#22747 ) ## Problem `TestE2E_WriteFileTriggersGitWatch` and `TestE2E_SubagentAncestorWatch` flake intermittently in `test-go-race-pg` with: ``` agentgit_test.go:1271: timed out waiting for server message ``` ## Root Cause In `handleWatch()`, `GetPaths(chatID)` was called before `Subscribe(chatID)` on the PathStore. If `AddPaths()` fired between those two calls: 1. `GetPaths()` returned empty (paths not added yet). 2. `AddPaths()` stored the paths and called `notifySubscribers()` — but the subscription channel didn't exist yet, so the notification was a no-op. 3. `Subscribe()` created the channel, but the notification was already lost. 4. The handler never scanned, and the mock clock never advanced the 30s fallback ticker → timeout. Both failing tests connect the WebSocket with an empty PathStore and immediately call `AddPaths()` from the test goroutine, making them vulnerable to this scheduling interleaving. ## Fix Swap the order: call `Subscribe()` first, then `GetPaths()`. This guarantees: \| `AddPaths` fires... \| `Subscribe` sees it? \| `GetPaths` sees it? \| Outcome \| \|---\|---\|---\|---\| \| Before `Subscribe` \| No \| Yes \| Picked up by `GetPaths` \| \| Between the two calls \| Yes (queued) \| Yes \| Redundant but safe (delta dedupes) \| \| After `GetPaths` \| Yes \| No \| Goroutine handles it \| No window exists where both miss it. Verified with 10,000 iterations (`-race -count=5000`) — zero failures. Fixes coder/internal#1389	2026-03-07 06:36:43 -08:00
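The ordering argument in the fix above can be illustrated with a toy store. This is a sketch under assumptions, not the real `PathStore` in `agentgit`: the method names follow the commit message, but the fields, the buffered notification channel, and the `watchOnce` helper with its `between` hook are hypothetical and exist only to make the interleaving reproducible.

```go
package main

import (
	"fmt"
	"sync"
)

// PathStore is a toy version of the store described in the commit message.
type PathStore struct {
	mu    sync.Mutex
	paths map[string][]string
	subs  map[string][]chan struct{}
}

func NewPathStore() *PathStore {
	return &PathStore{
		paths: map[string][]string{},
		subs:  map[string][]chan struct{}{},
	}
}

// Subscribe registers a notification channel. The buffer of one lets a
// notification sent before the receiver is ready stay queued.
func (s *PathStore) Subscribe(chatID string) <-chan struct{} {
	ch := make(chan struct{}, 1)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subs[chatID] = append(s.subs[chatID], ch)
	return ch
}

// AddPaths stores paths and wakes every existing subscriber without blocking.
func (s *PathStore) AddPaths(chatID string, paths ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.paths[chatID] = append(s.paths[chatID], paths...)
	for _, ch := range s.subs[chatID] {
		select {
		case ch <- struct{}{}:
		default: // a wake-up is already queued
		}
	}
}

// GetPaths returns a copy of the current paths.
func (s *PathStore) GetPaths(chatID string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.paths[chatID]...)
}

// watchOnce subscribes first and snapshots second. The between hook stands in
// for a concurrent AddPaths landing exactly between the two calls.
func watchOnce(s *PathStore, chatID string, between func()) (snapshot []string, notified bool) {
	ch := s.Subscribe(chatID)
	if between != nil {
		between()
	}
	snapshot = s.GetPaths(chatID)
	select {
	case <-ch:
		notified = true
	default:
	}
	return snapshot, notified
}

func main() {
	s := NewPathStore()
	snap, notified := watchOnce(s, "chat-1", func() {
		s.AddPaths("chat-1", "/repo/main.go")
	})
	fmt.Println(snap, notified) // prints "[/repo/main.go] true"
}
```

With the calls in this order, an `AddPaths` that lands between them is seen twice (once in the snapshot, once as a queued notification), which is redundant but safe. In the reversed order it would be seen by neither.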
Hugo Dutka	4afdfc50a5	fix(agentgit): use git cli instead of go-git (#22730 ) go-git has bugs in gitignore logic. With more complex gitignores, some paths that should be ignored aren't. That caused extra, unexpected files to appear in the git diff panel. If the git cli isn't available in a workspace, the /git/watch endpoint will still allow the frontend to connect, but no git changes will ever be transmitted.	2026-03-06 22:52:32 +01:00
Hugo Dutka	48ab492f49	feat: agents git watch backend (#22565 ) Adds real-time git status watching for workspace agents, so the frontend can subscribe over WebSocket and show git file changes in near real-time. 1. Subscription is scoped to a chat via `GET /api/experimental/chats/{chat}/git/watch`. 2. The workspace agent automatically determines which paths to watch based on tool calls made by the chat (and its ancestor chats). 3. Workspace agent polls subscribed repo working trees on a 30s interval, on tool calls, and on explicit `refresh` from the client. 4. Scans are rate-limited to at most once per second. 5. Edited paths are tracked in-memory inside the workspace agent. There is no database persistence; state is lost on agent restart. This will be addressed in a future PR. 6. Messages sent over WebSocket include a full-repo snapshot (unified diff, branch, origin). A new message is emitted only when the snapshot changes. This PR was implemented with AI with me closely controlling what it's doing. The code follows a plan file that was updated continuously during implementation. Here's the file if you'd like to see it: [project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2). It reflects the current state of the PR.	2026-03-06 10:47:55 +01:00
Spike Curtis	7cc2b22568	chore: expose UpdateAppStatus on agentsocket (#22353 ) relates to #21335 Adds UpdateAppStatus on the agentsocket, wired up to forward to Coderd over the dRPC connection the agent maintains. Disclosure: I used AI to generate significant portions of this PR, but hand-reviewed and tweaked the code. I consider it approximately indistinguishable from what I would have done by hand.	2026-03-04 21:18:17 +04:00
Zach	5b7377c375	feat: add Prometheus metrics for boundary log drop reporting (#22521 ) Add Prometheus metrics to the boundary log proxy for observability: - batches_dropped_total (reason: buffer_full, forward_failed) - logs_dropped_total (reason: buffer_full, forward_failed, boundary_channel_full, boundary_batch_full) - batches_forwarded_total Also add BoundaryStatus to the BoundaryMessage envelope so boundary can report dropped log counts as a separate wire message. The agent records these as Prometheus metrics, making boundary-side data loss visible. Backwards compatibility for older versions of boundary is maintained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 12:42:34 -07:00
Spike Curtis	56eb57caf4	chore: enable agent socket by default (#22352 ) relates to #21335 Enables the agent socket by default and updates docs to strike references to having to enable it. The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token. Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.	2026-03-03 21:23:59 +04:00
Zach	66954aead0	feat: add TagV2 BoundaryMessage envelope protocol (#22520 ) Extend the wire protocol for the boundary <-> agent unix socket with a message envelope. The envelope creates a boundary <-> agent data path that is separate from the agent <-> coderd path. This lets boundary send operational metadata (drop counts, configuration like jail type, capabilities) that the agent can act on locally (e.g. Prometheus metrics) or use to enrich outbound requests, without polluting the coderd-facing proto with fields coderd never consumes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 09:13:11 -07:00
Kyle Carberry	68e4155fed	feat(agent/filefinder): add plocate-lite file finder package (#22453 ) Adds an in-memory trigram-indexed file finder package at `agent/filefinder`, designed to power a future `FindFiles` HTTP handler on the WorkspaceAgent. ## What it does Fast fuzzy file search with VS Code-quality matching across millions of files. Sub-millisecond search latency at 100K files. ## Architecture - Index: append-only docs slice with trigram + prefix posting lists - Snapshot: lock-free reader view via frozen slice headers + shallow-copied deleted set - Search pipeline: trigram intersection → fuzzy fallback (prefix bucket + subsequence) → brute-force scan (capped at 5K docs) - Scoring: subsequence match, basename prefix, boundary hits, contiguous runs, depth/length penalties - Engine: multi-root with fsnotify watcher (50ms batch coalescing), atomic snapshot publishing ## Benchmarks (10K files) \| Query Type \| Latency \| \|---\|---\| \| exact_basename (`handler.go`) \| ~43µs \| \| short_query (`ha`) \| ~7µs \| \| fuzzy_basename (`hndlr`) \| ~50µs \| \| path_structured (`internal/handler`) \| ~29µs \| \| multi_token (`api handler`) \| ~15µs \| ## File inventory (11 files, 3273 lines) \| File \| Lines \| Purpose \| \|---\|---\|---\| \| `text.go` \| 264 \| Normalization, trigram extraction, scoring \| \| `delta.go` \| 128 \| Index, Snapshot, CRUD operations \| \| `query.go` \| 272 \| Query planning, search strategies, top-K merge \| \| `engine.go` \| 323 \| Multi-root engine, watcher integration \| \| `watcher_fs.go` \| 201 \| fsnotify wrapper with batch coalescing \| \| `*_test.go` \| 2085 \| Unit tests, integration tests, benchmarks \| --------- Co-authored-by: Coder <coder@users.noreply.github.com>	2026-02-28 23:37:07 -05:00
Kyle Carberry	fb6bf3a568	fix(agent): wire updateCommandEnv into process manager (#22451 ) ## Problem The `agentproc` process manager spawns processes with only `os.Environ()`, missing agent-level environment variables like `GIT_ASKPASS`, `CODER_*`, and `GIT_SSH_COMMAND` that are injected by the agent's `updateCommandEnv` function. This means processes started through the HTTP process API (used by chat tools) cannot authenticate git operations via the Coder gitaskpass helper. By contrast, SSH sessions get the full agent environment because the SSH server calls `updateCommandEnv` via its `UpdateEnv` config hook. ## Fix Wire the agent's `updateCommandEnv` hook into the process manager so all spawned processes receive the full agent environment. The hook is: - Passed as a parameter through `NewAPI` → `newManager` - Called in `manager.start()` with `os.Environ()` as the base, producing the same enriched env that SSH sessions get - Gracefully falls back to `os.Environ()` if the hook returns an error Request-level env vars (`req.Env`, set by chat tools) are still appended last and take precedence. ## Changes - `agent/agentproc/process.go`: Add `updateEnv` field to manager, call it when building process env - `agent/agentproc/api.go`: Accept `updateEnv` parameter in `NewAPI` - `agent/agent.go`: Pass `a.updateCommandEnv` when creating the process API - `agent/agentproc/api_test.go`: Add `UpdateEnvHook` and `UpdateEnvHookOverriddenByReqEnv` tests Co-authored-by: Coder <coder@coder.com>	2026-02-28 21:58:59 -05:00
Kyle Carberry	5945febf06	feat(agent): add fuzzy whitespace matching to edit_files tool (#22446 ) Inspired by openai/codex's `apply_patch` implementation, this changes the `edit_files` search-and-replace to use a cascading match strategy when the exact search string isn't found: 1. Exact substring match (byte-for-byte) — existing behavior, unchanged 2. Line-by-line match ignoring trailing whitespace — handles trailing spaces/tabs the LLM omits 3. Line-by-line match ignoring all leading/trailing whitespace — handles tabs-vs-spaces and wrong indentation depth ## Problem When the chat agent uses `edit_files`, it generates a search string that must match the file content exactly. LLMs frequently get whitespace wrong: - Emitting spaces when the file uses tabs (or vice versa) - Getting the indentation depth wrong by one or more levels - Omitting trailing whitespace that exists in the file When this happens, the edit silently does nothing, and the agent falls into a retry loop using `cat -A` to diagnose the exact whitespace characters. ## Solution Adopted the same cascading fuzzy match strategy that [openai/codex uses in `seek_sequence.rs`](https://github.com/openai/codex/blob/main/codex-rs/apply-patch/src/seek_sequence.rs): - Pass 1: exact match (existing behavior) - Pass 2: `TrimRight` each line before comparing (trailing whitespace tolerance) - Pass 3: `TrimSpace` each line before comparing (full indentation tolerance) When a fuzzy match is found, the matched lines in the original file are replaced with the replacement text. This preserves surrounding content exactly. 
## Changes - `agent/agentfiles/files.go`: Replaced `icholy/replace` streaming transformer with in-memory `fuzzyReplace` + helper functions (`seekLines`, `spliceLines`) - `agent/agentfiles/files_test.go`: Added 6 new test cases covering trailing whitespace, tabs-vs-spaces, different indent depths, exact match preference, no-match behavior, and mixed whitespace multiline edits - Removed `icholy/replace` dependency from go.mod/go.sum --------- Co-authored-by: Kyle Carberry <kylecarbs@users.noreply.github.com>	2026-02-28 17:02:57 -05:00
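The three-pass cascade described in this commit (exact substring, then per-line trailing-whitespace tolerance, then per-line full trim) might look roughly like the sketch below. The function names `fuzzyReplace` and `seekLines` are taken from the commit message, but their signatures and bodies here are assumptions, as is the `" \t\r"` cutset used for the trailing-whitespace pass.

```go
package main

import (
	"fmt"
	"strings"
)

// seekLines returns the index of the first window in lines whose entries
// equal pattern line by line after both sides pass through norm.
func seekLines(lines, pattern []string, norm func(string) string) (int, bool) {
	for i := 0; i+len(pattern) <= len(lines); i++ {
		match := true
		for j, p := range pattern {
			if norm(lines[i+j]) != norm(p) {
				match = false
				break
			}
		}
		if match {
			return i, true
		}
	}
	return 0, false
}

// fuzzyReplace tries progressively looser matches and reports whether any
// pass succeeded. Only the first match is replaced in this sketch.
func fuzzyReplace(content, search, replace string) (string, bool) {
	// Pass 1: exact, byte-for-byte substring match.
	if i := strings.Index(content, search); i >= 0 {
		return content[:i] + replace + content[i+len(search):], true
	}
	lines := strings.Split(content, "\n")
	pattern := strings.Split(search, "\n")
	passes := []func(string) string{
		// Pass 2: ignore trailing whitespace on each line.
		func(s string) string { return strings.TrimRight(s, " \t\r") },
		// Pass 3: ignore leading and trailing whitespace on each line.
		strings.TrimSpace,
	}
	for _, norm := range passes {
		at, ok := seekLines(lines, pattern, norm)
		if !ok {
			continue
		}
		// Splice the replacement over the matched lines; everything around
		// the match is kept exactly as it was.
		out := append([]string{}, lines[:at]...)
		out = append(out, replace)
		out = append(out, lines[at+len(pattern):]...)
		return strings.Join(out, "\n"), true
	}
	return content, false
}

func main() {
	file := "func f() {\n\treturn 1\n}\n"
	// The search string uses spaces where the file uses a tab.
	out, ok := fuzzyReplace(file, "    return 1", "\treturn 2")
	fmt.Printf("%v %q\n", ok, out)
}
```

Ordering the passes from strictest to loosest means an exact match always wins, so the looser comparisons only come into play when the strict one would have failed anyway.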
Kyle Carberry	a621c3cb13	feat(agent): add process execution API and rewrite execute tool (#22416 ) ## Summary Adds a new agent-side process management HTTP API and rewrites the chat execute tool to use it instead of SSH sessions. ## What changed ### New agent/agentproc/ package - headtail.go — Thread-safe io.Writer with bounded memory (16KB head + 16KB tail ring buffer). Provides LLM-ready output with truncation metadata and long-line truncation at 2048 bytes. - headtail_test.go — 16 tests including race detector coverage for concurrent writes. - process.go — Manager + Process types for lifecycle management using agentexec.Execer for proper OOM/nice scores. - api.go — HTTP API following the agentfiles chi router pattern. 4 endpoints: start, list, output, signal. ### Agent wiring (agent/agent.go, agent/api.go) Mounts the process API at /api/v0/processes, mirroring how agentfiles is mounted. ### SDK (codersdk/workspacesdk/agentconn.go) 4 new AgentConn interface methods + 7 request/response types: - StartProcess, ListProcesses, ProcessOutput, SignalProcess ### Execute tool rewrite (coderd/chatd/chattool/execute.go) - SSH to Agent API: conn.StartProcess() + conn.ProcessOutput() polling - New parameters: workdir, run_in_background - Structured response: success, exit_code, wall_duration_ms, error, truncated, note, background_process_id - Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1, PAGER=cat, etc. - Output truncation: HeadTailBuffer caps at 32KB for LLM consumption - File-dump detection with advisory notes suggesting read_file - Default timeout: 60s to 10s - Foreground polling: 200ms intervals until exit or timeout ## Architecture State lives on the agent, surviving coderd failover and instance changes. Any coderd replica can query any agent via HTTP over tailnet.	2026-02-28 12:33:52 -05:00
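The bounded head-plus-tail capture described for `headtail.go` can be approximated as below. This is a simplified sketch, not the package's implementation: the `HeadTail` type, its fields, and the omission marker are hypothetical, plain slice trimming stands in for a real ring buffer, and the long-line truncation mentioned in the commit is left out.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// HeadTail keeps the first max bytes and the last max bytes written to it
// and drops whatever falls in between.
type HeadTail struct {
	mu    sync.Mutex
	max   int
	head  []byte
	tail  []byte
	total int
}

// Compile-time check that HeadTail satisfies io.Writer.
var _ io.Writer = (*HeadTail)(nil)

func NewHeadTail(max int) *HeadTail { return &HeadTail{max: max} }

// Write never fails and always reports the full length, so a process writing
// to it is not interrupted when output is dropped.
func (b *HeadTail) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := len(p)
	b.total += n
	// Fill the head first.
	if room := b.max - len(b.head); room > 0 {
		k := room
		if k > len(p) {
			k = len(p)
		}
		b.head = append(b.head, p[:k]...)
		p = p[k:]
	}
	// Everything else goes to the tail, trimmed to the newest max bytes.
	b.tail = append(b.tail, p...)
	if len(b.tail) > b.max {
		b.tail = b.tail[len(b.tail)-b.max:]
	}
	return n, nil
}

// Output joins head and tail, noting how many bytes were dropped.
func (b *HeadTail) Output() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	omitted := b.total - len(b.head) - len(b.tail)
	if omitted == 0 {
		return string(b.head) + string(b.tail)
	}
	return fmt.Sprintf("%s\n... [%d bytes omitted] ...\n%s", b.head, omitted, b.tail)
}

func main() {
	b := NewHeadTail(4)
	fmt.Fprint(b, "abcdefghij")
	fmt.Println(b.Output())
}
```

Keeping both ends reflects how command output is usually read: the start shows what was run and how it began, and the end shows the final error or result.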
Kyle Carberry	b65c0766d2	feat: add line-based read_file tool with safety limits (#22400 ) ## Summary Adds a new line-based file reading endpoint to the workspace agent, replacing the unbounded byte-based approach for the `read_file` chat tool and `coder_workspace_read_file` MCP tool. Problem: The current `read_file` tool returns the entire file contents with no limits, which can blow up LLM context windows and cause OOM issues with large files. Solution: Inspired by [`coder/mux`](https://github.com/coder/mux) and [`openai/codex`](https://github.com/openai/codex), implement a line-based reader with safety limits. ## Changes ### Agent (`agent/agentfiles/`) - New `/read-file-lines` endpoint with `HandleReadFileLines` handler - Line-based `offset` (1-based line number, default: 1) and `limit` (line count, default: 2000) - Safety constants: \| Constant \| Value \| Purpose \| \|---\|---\|---\| \| `MaxFileSize` \| 1 MB \| Reject files larger than this at stat \| \| `MaxLineBytes` \| 1,024 \| Per-line truncation with `... 
[truncated]` marker \| \| `MaxResponseLines` \| 2,000 \| Max lines per response \| \| `MaxResponseBytes` \| 32 KB \| Max total response size \| \| `DefaultLineLimit` \| 2,000 \| Default when no limit specified \| - Line numbering format: `1\tcontent` (tab-separated) - Structured JSON response: `{ success, file_size, total_lines, lines_read, content, error }` - Hard errors when limits exceeded — tells the LLM to use `offset`/`limit` - Existing byte-based `/read-file` endpoint preserved (used by `instruction.go`) ### SDK (`codersdk/workspacesdk/`) - `ReadFileLinesResponse` type added - `ReadFileLines` method added to `AgentConn` interface - Mock regenerated ### Chat tool (`coderd/chatd/chattool/`) - `read_file` tool now uses `conn.ReadFileLines()` instead of `conn.ReadFile()` - Updated tool description to document line-based parameters - Response includes `file_size`, `total_lines`, `lines_read` metadata ### MCP tool (`codersdk/toolsdk/`) - `coder_workspace_read_file` updated to use line-based reading - Schema descriptions updated for line-based offset/limit - Removed `maxFileLimit` constant (agent handles limits now) ### Tests - 13 new test cases for `TestReadFileLines`: - Path validation (empty, relative, non-existent, directory, no permissions) - Empty file handling - Basic read, offset, limit, offset+limit combinations - Offset beyond file length - Long line truncation (>1024 bytes) - Large file rejection (>1MB) - All existing tests pass unchanged ## Design decisions \| Decision \| Rationale \| \|---\|---\| \| Line-based, not byte-based \| Both coder/mux and openai/codex use line-based — matches how LLMs reason about code \| \| Default limit of 2000 \| Matches codex; prevents accidental full-file dumps while being generous \| \| 32 KB response cap \| Compromise between mux (16 KB) and codex (no cap) \| \| 1024 byte/line truncation with marker \| More generous than codex (500), marker helps LLM know data is missing \| \| Hard errors on overflow \| Matches mux; 
forces LLM to paginate rather than getting partial data \| \| Preserve byte-based endpoint \| `instruction.go` needs raw byte access for AGENTS.md \|	2026-02-27 15:12:56 -05:00
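The line-based `offset`/`limit` behaviour and the `1\tcontent` numbering format described above can be sketched like this. It is an illustration under assumptions rather than the `HandleReadFileLines` handler: `readLines` and `readLinesFromString` are hypothetical helpers, the truncation cuts at a byte boundary (which may split a multi-byte character), and the file-size and response-size caps from the table are not enforced here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

const truncMarker = "... [truncated]"

// readLines returns up to limit lines starting at the 1-based line offset,
// each prefixed with its line number and a tab. It keeps scanning past the
// requested window so it can report the total line count.
func readLines(r io.Reader, offset, limit, maxLineBytes int) (content string, totalLines, linesRead int, err error) {
	if offset < 1 {
		offset = 1
	}
	var sb strings.Builder
	sc := bufio.NewScanner(r)
	// Raise the scanner's default 64 KiB token limit so long lines are
	// truncated by us rather than rejected by the scanner.
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20)
	for sc.Scan() {
		totalLines++
		if totalLines < offset || linesRead >= limit {
			continue
		}
		line := sc.Text()
		if len(line) > maxLineBytes {
			line = line[:maxLineBytes] + truncMarker
		}
		fmt.Fprintf(&sb, "%d\t%s\n", totalLines, line)
		linesRead++
	}
	return sb.String(), totalLines, linesRead, sc.Err()
}

// readLinesFromString is a convenience wrapper for in-memory content.
func readLinesFromString(s string, offset, limit, maxLineBytes int) (string, int, int, error) {
	return readLines(strings.NewReader(s), offset, limit, maxLineBytes)
}

func main() {
	content, total, read, err := readLinesFromString("a\nb\nc\nd\n", 2, 2, 1024)
	fmt.Printf("%q total=%d read=%d err=%v\n", content, total, read, err)
}
```

Reporting the total line count alongside the lines actually read lets a caller work out whether another request with a larger `offset` is needed.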
Spike Curtis	393b3874ac	feat: add UpdateAppStatus to the workspace agent API (#22219 ) <!-- If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting. --> part of https://github.com/coder/coder/issues/21335 This moves updating app status (used by Tasks) into the workspace agent API over dRPC. This will allow us to update the status without having to re-authenticate each time, like we would with an HTTP PATCH request. Further PRs in this stack will pipe these requests thru from the CLI MCP server to the agentsock and finally to this dRPC call to coderd.	2026-02-24 13:26:55 +04:00
Spike Curtis	1069ce6e19	feat: add support for agentsock on Windows (#22171 ) relates to #21335 Adds support for the agentsock and thus `coder exp sync` commands on Windows. This support was initially missing.	2026-02-20 16:27:32 +04:00


615 Commits