The chat title column is now nullable in PostgreSQL. Chats are created
with title = NULL instead of a truncated fallback string. The async LLM
title generator fires whenever title IS NULL, providing a clean
one-way transition from skeleton to real title.
This eliminates the sidebar flicker caused by:
- The fallback-to-generated title swap on first render
- Query invalidations (refetchOnWindowFocus, reconnect, follow-up
messages) briefly reverting to stale fallback titles
Backend changes:
- Migration makes chats.title nullable, drops 'New Chat' default
- InsertChat no longer accepts a title parameter
- titleInput() triggers on NULL title instead of comparing fallback
- Removed chatTitleFromMessage, fallbackChatTitle, and
subagentFallbackChatTitle helper functions
Frontend changes:
- Chat.title is now string | null in TypeScript types
- Sidebar renders a shimmer skeleton for null titles
- Search, aria-labels, and TopBar handle null gracefully
The chat API is experimental (behind `ExperimentAgents`) and not ready
for public documentation yet. This removes swagger annotations from the
chat handlers so they no longer appear in the generated API reference at
https://coder.com/docs/reference/api/chats.
## Changes
- Remove `@swagger` annotations from 5 chat handlers in
`coderd/chats.go`
- Regenerate `coderd/apidoc/swagger.json` and `docs.go`
- Delete `docs/reference/api/chats.md`
- Remove Chats entry from `docs/manifest.json`
Fixes https://github.com/coder/internal/issues/1371
## Root causes
Two independent races cause this test to flake at ~2–3/1000:
### 1. Title-generation requests racing with the streaming request counter
`maybeGenerateChatTitle` fires in a `context.WithoutCancel` goroutine
(line 2130) and makes a **non-streaming** request to the mock OpenAI
handler. The test handler was not filtering by request type, so these
title requests incremented the `requestCount` atomic — throwing off the
coordination logic that uses `requestCount == 1` to identify the first
streaming request and hold it open until shutdown.
**Fix:** Guard the test handler to return a canned response for
non-streaming requests before touching `requestCount`.
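
A minimal sketch of that guard, assuming the mock handler can tell streaming from non-streaming requests by a `stream` field in the JSON body. The handler name, the canned payload, and the `countAfter` helper are hypothetical and not taken from the test itself:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// newMockHandler counts only streaming requests. Non-streaming requests
// (such as title generation) get a canned reply before the counter is touched.
func newMockHandler(requestCount *atomic.Int32) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var body struct {
			Stream bool `json:"stream"`
		}
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if !body.Stream {
			// Canned response (hypothetical payload); requestCount is untouched.
			w.Header().Set("Content-Type", "application/json")
			_, _ = w.Write([]byte(`{"title":"canned"}`))
			return
		}
		n := requestCount.Add(1)
		_, _ = fmt.Fprintf(w, "streaming request #%d", n)
	})
}

// countAfter feeds the given JSON bodies to the handler and returns the
// final value of the streaming-request counter.
func countAfter(bodies ...string) int32 {
	var count atomic.Int32
	h := newMockHandler(&count)
	for _, b := range bodies {
		req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(b))
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
	return count.Load()
}

func main() {
	// Two title requests around one streaming request: only one is counted.
	fmt.Println(countAfter(`{"stream":false}`, `{"stream":true}`, `{"stream":false}`))
}
```

With the guard in place, a `requestCount == 1` check still identifies the first streaming request regardless of how many title requests arrive around it.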
### 2. Phantom acquire: `AcquireChat` commits in Postgres but Go sees `context.Canceled`
During `Close()`, the main loop's `select` can randomly pick
`acquireTicker.C` over `ctx.Done()` (Go spec: when multiple cases are
ready, one is chosen uniformly at random). This calls `processOnce(ctx)`
with an already-canceled context.
In the pq driver, `QueryContext` does **not** check `ctx.Err()` up
front. Instead it calls `watchCancel(ctx)` which spawns a goroutine
monitoring `ctx.Done()`, then sends the query on the existing
connection. When `ctx` is already canceled, a race ensues:
- **pq's watchCancel goroutine** immediately sees `<-done`, opens a
*new* TCP connection to Postgres, and sends a cancel request.
- **The query** is sent concurrently on the existing connection.
Because the `AcquireChat` UPDATE is fast (sub-millisecond, single row
with `SKIP LOCKED`), it often commits before the cancel arrives via the
second connection. Meanwhile in `database/sql`, `initContextClose`
spawns an `awaitDone` goroutine that fires immediately (context is
already canceled), stores `contextDone`, and calls `rs.close(ctx.Err())`
— which races with `Row.Scan` → `rows.Next()`. If `awaitDone` wins,
`Next()` sees `contextDone` is set and returns false, causing Scan to
return `context.Canceled` (or `ErrNoRows`).
**Result:** Postgres committed the UPDATE (chat is now `running` with
serverA's worker ID), but Go sees an error and never spawns a goroutine
to process it. The chat is stuck as `running` with no worker.
If the previous `processChat` cleanup already set the chat back to
`pending`, this phantom acquire flips it back to `running` — which is
exactly what the debug logs showed: after `Close()` returns, the DB
shows `status=running` with serverA's worker ID.
**Fix:** Three guards in `processOnce`:
1. Early `ctx.Err()` check — catches the common case where `select`
picked the ticker after cancellation.
2. `context.WithoutCancel(ctx)` for `AcquireChat` — prevents the pq
`watchCancel` race entirely, ensuring the driver sees the query
result if Postgres executed it.
3. Post-acquire `ctx.Err()` check — if the context was canceled while
`AcquireChat` ran (or between the early check and the call),
immediately release the chat back to `pending`.
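
The three guards can be sketched as follows. This is an illustration under stated assumptions, not the actual `processOnce`: the `store` type, its `acquire`/`release` methods, and the `onAcquire` hook are hypothetical stand-ins for the database layer, used only to simulate a cancellation that lands while the acquire is in flight.

```go
package main

import (
	"context"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the chats table.
type store struct {
	status    string // "pending" or "running"
	onAcquire func() // test hook: runs while the acquire is "in flight"
}

// acquire flips a pending chat to running, like the AcquireChat UPDATE.
func (s *store) acquire(ctx context.Context) (bool, error) {
	if s.onAcquire != nil {
		s.onAcquire()
	}
	if err := ctx.Err(); err != nil {
		return false, err
	}
	if s.status != "pending" {
		return false, nil
	}
	s.status = "running"
	return true, nil
}

func (s *store) release() { s.status = "pending" }

// processOnce reports whether a worker goroutine would be spawned.
func processOnce(ctx context.Context, s *store) bool {
	// Guard 1: select may have picked the ticker after cancellation.
	if ctx.Err() != nil {
		return false
	}
	// Guard 2: detach from cancellation so the result is always observed.
	acquired, err := s.acquire(context.WithoutCancel(ctx))
	if err != nil || !acquired {
		return false
	}
	// Guard 3: canceled while acquiring, so hand the chat back.
	if ctx.Err() != nil {
		s.release()
		return false
	}
	return true
}

// runScenario cancels the context before or during the acquire and
// returns whether a worker was spawned plus the final chat status.
func runScenario(cancelBefore, cancelDuring bool) (bool, string) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s := &store{status: "pending"}
	if cancelDuring {
		s.onAcquire = cancel
	}
	if cancelBefore {
		cancel()
	}
	spawned := processOnce(ctx, s)
	return spawned, s.status
}

func main() {
	fmt.Println(runScenario(false, false)) // true running
	fmt.Println(runScenario(true, false))  // false pending
	fmt.Println(runScenario(false, true))  // false pending
}
```

In every canceled scenario the chat ends up `pending`, so no row is left `running` without a worker.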
## Verification
Passes 2000/2000 iterations (previously flaked at ~2–3/1000):
```
go test -run "TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica" \
-count=2000 -timeout 1800s -failfast ./coderd/chatd/
```
Adds a new child page at `/docs/ai-coder/agents/early-access` describing
the Coder Agents Early Access, including what it includes, what it does
not include, feature scope, licensing, and how to provide feedback.
Fixes a race condition in `DiffHunksRenderer` where a stale async
highlight callback overwrites the render cache with an old diff, causing
a hunk count mismatch:
```
DiffHunksRenderer.renderHunks: lineHunk doesn't exist
```
## Root cause
The `DiffHunksRenderer` in `@pierre/diffs@1.0.11` caches highlighted AST
results keyed by diff object reference. When the shiki highlighter isn't
fully loaded, it fires `asyncHighlight(diff)` which captures the current
diff in a closure. If the diff changes before that promise resolves,
`onHighlightSuccess` unconditionally overwrites `renderCache` with the
stale diff/result pair. The subsequent `rerender()` then iterates the
new diff's hunks against the old result's `code.hunks` array, crashing
at an out-of-bounds index.
## Fix
Upgrades `@pierre/diffs` from `1.0.11` to `1.1.0-beta.19`, which
completely refactors the rendering pipeline:
- Replaces the per-hunk `code.hunks[hunkIndex]` lookup with flat
`additionLines`/`deletionLines` arrays indexed directly by line index
- Uses a new `iterateOverDiff` callback pattern instead of the
`renderHunks` method
- The `lineHunk doesn't exist` error is gone from the codebase entirely
The only code change on our side is adapting `extractDiffContent()` in
`FilesChangedPanel.tsx` to the new `ChangeContent`/`ContextContent`
types where `deletions`, `additions`, and `lines` are now counts with
index pointers into top-level
`FileDiffMetadata.deletionLines`/`additionLines` arrays.
Fixes several small UI issues on the agent detail and sidebar pages:
- **Sidebar lines changed indicator**: removed monospace font, matched
styling to model text (text-[13px] leading-4)
- **Git panel**: always shown instead of "No panels available" fallback
- **Git tab active state**: added `text-content-primary` so the tab
looks selected
- **Attachment button**: switched to `subtle` variant (lighter color, no
border)
- **Context indicator / attachment button**: matched sizes (`size-7`
container, `size-icon-sm` icon) and swapped positions
Adds offset and cursor-based pagination to the `GET
/api/experimental/chats` endpoint, following the exact same patterns
used by `GetUsers` and `GetTemplateVersionsByTemplateID`.
## Changes
### Database
- Add `after_id`, `offset_opt`, `limit_opt` params to
`GetChatsByOwnerID` SQL query
- Use composite `(updated_at, id) DESC` cursor for stable, deterministic
pagination
- Add migration with composite index on `chats (owner_id, updated_at
DESC, id DESC)`
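
To illustrate why the composite `(updated_at, id) DESC` cursor is deterministic, here is an in-memory sketch of the same ordering and cursor filter. The `chat` struct, the string IDs, and the `page` helper are hypothetical; the real query does this in SQL.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// chat is a hypothetical, trimmed-down row.
type chat struct {
	ID        string
	UpdatedAt int64 // unix seconds, for brevity
}

// before reports whether a sorts before b in (updated_at, id) DESC order.
func before(a, b chat) bool {
	if a.UpdatedAt != b.UpdatedAt {
		return a.UpdatedAt > b.UpdatedAt
	}
	return a.ID > b.ID
}

// page returns up to limit IDs that sort strictly after the row identified
// by afterID. An empty afterID starts from the top; limit <= 0 means no limit.
func page(chats []chat, afterID string, limit int) []string {
	sorted := append([]chat(nil), chats...)
	sort.Slice(sorted, func(i, j int) bool { return before(sorted[i], sorted[j]) })

	var cursor *chat
	for i := range sorted {
		if sorted[i].ID == afterID {
			cursor = &sorted[i]
		}
	}

	var ids []string
	for _, c := range sorted {
		if cursor != nil && !before(*cursor, c) {
			continue // at or before the cursor: already returned
		}
		if limit > 0 && len(ids) == limit {
			break
		}
		ids = append(ids, c.ID)
	}
	return ids
}

// pageString joins a page for easy printing and comparison.
func pageString(chats []chat, afterID string, limit int) string {
	return strings.Join(page(chats, afterID, limit), ",")
}

// sampleChats contains two rows sharing the same updated_at value.
func sampleChats() []chat {
	return []chat{{"a", 10}, {"b", 10}, {"c", 20}, {"d", 5}}
}

func main() {
	fmt.Println(pageString(sampleChats(), "", 2))  // c,b
	fmt.Println(pageString(sampleChats(), "b", 2)) // a,d
}
```

Rows `a` and `b` share an `updated_at` value, and the `id` tie-breaker keeps them from being skipped or repeated across pages.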
### Backend
- Use `ParsePagination()` in `listChats` handler (matches `users.go`
pattern)
- Add `Pagination` field to `ListChatsOptions` SDK struct
### Frontend
- Add `infiniteChats()` query factory using `useInfiniteQuery` with
offset-based page params (same pattern as `infiniteWorkspaceBuilds`)
- Update `AgentsPage` to use `useInfiniteQuery`
- Add "Show more" button at the bottom of the agents sidebar (matches
`HistorySidebar` pattern)
- Keep existing `chats()` query for non-paginated uses (e.g., parent
chat lookup in `AgentDetail`)
### Tests
- Add `TestListChats/Pagination` covering `limit`, `after_id` cursor,
`offset`, and no-limit behavior
_Disclaimer: implemented with Opus 4.6 and Coder Agents._
Follow-up to #22879.
## Problem
The `CODER_SESSION_TOKEN` guard added in #22879 blocks `coder login`
unconditionally when the env var is set. This conflicts with
`--use-token-as-session`, which intentionally uses the provided token
(including from the env var) directly as the session token.
## Fix
Add `&& !useTokenForSession` to the check so that `coder login
--use-token-as-session` still works when `CODER_SESSION_TOKEN` is set.
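
The resulting condition can be sketched as a small predicate. The function name and parameters are illustrative only; in the CLI the check sits inline in the login command.

```go
package main

import "fmt"

// shouldBlockLogin mirrors the guard: refuse to log in when the env var is
// set, unless the caller explicitly asked to use the token as the session.
func shouldBlockLogin(envSessionToken string, useTokenForSession bool) bool {
	return envSessionToken != "" && !useTokenForSession
}

func main() {
	fmt.Println(shouldBlockLogin("tok", false)) // true: plain `coder login` is blocked
	fmt.Println(shouldBlockLogin("tok", true))  // false: --use-token-as-session proceeds
	fmt.Println(shouldBlockLogin("", false))    // false: env var not set
}
```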
## Testing
Added `TestLogin/SessionTokenEnvVarWithUseTokenAsSession` — sets the env
var with a valid token and passes `--use-token-as-session`, verifying
login succeeds.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
## Problem
When a chat worker shuts down gracefully (e.g. Kubernetes pod SIGTERM)
while a tool is executing (like `wait_agent` polling for a subagent),
the chat gets stuck in `waiting` status forever — no other worker will
pick it up.
### Root Cause
`persistStep` in `chatd.go` unconditionally returned
`chatloop.ErrInterrupted` for **any** canceled context:
```go
if persistCtx.Err() != nil {
	return chatloop.ErrInterrupted // BUG: doesn't check WHY the context was canceled
}
```
During shutdown, the context cause is `context.Canceled` (not
`ErrInterrupted`). But because `persistStep` returned `ErrInterrupted`,
the error handling in `processChat` hit the `ErrInterrupted` check first
(line 2011) and set status to `waiting` — the `isShutdownCancellation`
check (line 2017) was never reached:
```go
// Checked FIRST: matches because persistStep returned ErrInterrupted
if errors.Is(err, chatloop.ErrInterrupted) {
	status = database.ChatStatusWaiting // Stuck forever
	return
}
// NEVER REACHED during shutdown
if isShutdownCancellation(ctx, chatCtx, err) {
	status = database.ChatStatusPending // Would have been correct
	return
}
```
### Trigger scenario (from production logs)
1. Chat spawns a subagent via `spawn_agent`, then calls `wait_agent`
2. `wait_agent` blocks in `awaitSubagentCompletion` polling loop
3. Worker pod receives SIGTERM → `Close()` cancels server context
4. Context cancellation propagates to `awaitSubagentCompletion` →
returns `context.Canceled`
5. Tool execution completes, `persistStep` is called with canceled
context
6. `persistStep` returns `ErrInterrupted` (wrong!) → status set to
`waiting` (stuck!)
## Fix
Check `context.Cause()` before deciding which error to return:
```go
if persistCtx.Err() != nil {
	if errors.Is(context.Cause(persistCtx), chatloop.ErrInterrupted) {
		return chatloop.ErrInterrupted // Intentional interruption
	}
	return persistCtx.Err() // Shutdown → context.Canceled
}
```
This preserves `context.Canceled` for shutdown, allowing
`isShutdownCancellation` to match and set status to `pending` so another
worker retries the chat.
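
A self-contained sketch of the same cause check, using `context.WithCancelCause` from the standard library. `errInterrupted` stands in for `chatloop.ErrInterrupted`, and the status strings are simplified placeholders for the database enum:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errInterrupted is a stand-in for chatloop.ErrInterrupted.
var errInterrupted = errors.New("interrupted")

// persistErr returns errInterrupted only when that was the cancellation
// cause; any other cancellation surfaces as the plain context error.
func persistErr(ctx context.Context) error {
	if ctx.Err() == nil {
		return nil
	}
	if errors.Is(context.Cause(ctx), errInterrupted) {
		return errInterrupted
	}
	return ctx.Err()
}

// classify cancels a context either as a user interrupt or as a shutdown
// and reports which status the error handling would pick.
func classify(interrupt bool) string {
	ctx, cancel := context.WithCancelCause(context.Background())
	if interrupt {
		cancel(errInterrupted)
	} else {
		cancel(nil) // cause defaults to context.Canceled
	}
	err := persistErr(ctx)
	switch {
	case errors.Is(err, errInterrupted):
		return "waiting"
	case errors.Is(err, context.Canceled):
		return "pending"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classify(true))  // waiting
	fmt.Println(classify(false)) // pending
}
```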
## Test
Added `TestRun_ShutdownDuringToolExecutionReturnsContextCanceled` which:
1. Streams a tool call to a blocking tool (simulating `wait_agent`)
2. Cancels the server context (simulating shutdown) while the tool
blocks
3. Verifies `Run` returns `context.Canceled`, NOT `ErrInterrupted`
Uses streamdown's built-in `urlTransform` prop to intercept
`http://localhost:PORT` URLs in agent chat messages and rewrite them to
port-forwarded workspace URLs.
When the agent outputs a bare URL like `http://localhost:3000` or a
markdown link like `[app](http://localhost:8080/path)`, the URL is
rewritten to the workspace's port-forward subdomain (e.g.
`https://3000--agent--workspace--user.wildcard.host`). This makes links
clickable directly from the chat without manual port-forwarding.
## How it works
The transform is built in `AgentDetail` where workspace and proxy
context are available, then threaded as an optional prop through the
component tree:
```
AgentDetail → AgentDetailView → AgentDetailTimeline → ConversationTimeline → Response → Streamdown
```
- Uses streamdown's first-class `urlTransform` API — no monkey-patching
or rehype plugins
- Reuses the existing `portForwardURL()` utility from
`utils/portForward`
- Matches the same localhost detection as the terminal page
(`localhost`, `127.0.0.1`, `0.0.0.0`)
- Preserves pathname and search params
- Gracefully degrades: when any required context is missing (no
workspace, no wildcard proxy host), URLs pass through unchanged
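
The rewrite logic can be sketched as below. The actual transform is TypeScript and delegates to `portForwardURL()`; this Go version only illustrates the matching and fallback rules, and the function name, parameter list, and the way the wildcard host suffix is passed in are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// rewriteLocalhost maps a localhost URL to a port-forward subdomain URL.
// It returns raw unchanged when context is missing or the URL is not local.
func rewriteLocalhost(raw, agent, workspace, user, wildcardSuffix string) string {
	if agent == "" || workspace == "" || user == "" || wildcardSuffix == "" {
		return raw // graceful degradation: missing context
	}
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	switch u.Hostname() {
	case "localhost", "127.0.0.1", "0.0.0.0":
	default:
		return raw // not a local address
	}
	port := u.Port()
	if port == "" {
		return raw
	}
	u.Scheme = "https"
	u.Host = fmt.Sprintf("%s--%s--%s--%s.%s", port, agent, workspace, user, wildcardSuffix)
	return u.String() // path and query string are preserved
}

func main() {
	fmt.Println(rewriteLocalhost("http://localhost:3000", "agent", "workspace", "user", "wildcard.host"))
	fmt.Println(rewriteLocalhost("http://localhost:8080/path?x=1", "agent", "workspace", "user", "wildcard.host"))
	fmt.Println(rewriteLocalhost("https://example.com", "agent", "workspace", "user", "wildcard.host"))
}
```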
## What gets transformed
| Markdown input | Transformed? |
|---|---|
| `http://localhost:8080` (bare URL, auto-linked by remark-gfm) | Yes |
| `[my app](http://localhost:3000/path)` (explicit link) | Yes |
| `\`http://localhost:8080\`` (inline code) | No (correct: code spans are literal) |
| `https://example.com` (non-localhost) | No |
## Problem
The Admin → Agents → System Prompt textarea saved only to the browser's
`localStorage`. The value was never sent to the backend, never stored in
the database, and never injected into chats. Entering text, clicking
Save, and refreshing the page showed no changes — the prompt was
effectively a no-op.
## Root Cause
Three disconnected layers:
1. **Frontend** wrote to `localStorage`, never called an API.
2. **`handleCreateChat`** never read `savedSystemPrompt`.
3. **Backend** hardcoded `chatd.DefaultSystemPrompt` on every chat
creation — no field in `CreateChatRequest` accepted a custom prompt.
## Changes
### Database
- Added `GetChatSystemPrompt` / `UpsertChatSystemPrompt` queries on the
existing `site_configs` table (no migration needed).
### API
- `GET /api/experimental/chats/system-prompt` — returns the configured
prompt (any authenticated user).
- `PUT /api/experimental/chats/system-prompt` — sets the prompt
(admin-only, `rbac: deployment_config update`).
- Input validation: max 32 KiB prompt length.
### Backend
- `resolvedChatSystemPrompt(ctx)` checks for a custom prompt in the DB,
falls back to `chatd.DefaultSystemPrompt` when empty/unset.
- Logs a warning on DB errors instead of silently swallowing them.
- Replaced the hardcoded `defaultChatSystemPrompt()` call in chat
creation.
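
A rough sketch of the fallback and length check, under these assumptions: the 32 KiB limit is measured in bytes, a DB error also falls back to the default, and all names and the placeholder default text are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	defaultSystemPrompt = "built-in default prompt" // placeholder text
	maxPromptBytes      = 32 * 1024                 // 32 KiB, assumed to be bytes
)

// validatePrompt rejects prompts over the size limit.
func validatePrompt(p string) error {
	if len(p) > maxPromptBytes {
		return errors.New("system prompt exceeds 32 KiB")
	}
	return nil
}

// resolvePrompt prefers the stored prompt and falls back to the default
// when it is empty or could not be loaded.
func resolvePrompt(custom string, dbErr error) string {
	if dbErr != nil {
		// The real code logs a warning here; omitted in this sketch.
		return defaultSystemPrompt
	}
	if custom == "" {
		return defaultSystemPrompt
	}
	return custom
}

// validSize reports whether a prompt of n bytes passes validation.
func validSize(n int) bool {
	return validatePrompt(strings.Repeat("x", n)) == nil
}

func main() {
	fmt.Println(resolvePrompt("", nil))
	fmt.Println(resolvePrompt("be terse", nil))
	fmt.Println(validSize(32*1024), validSize(32*1024+1))
}
```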
### Frontend
- Replaced `localStorage` read/write with React Query
`useQuery`/`useMutation` backed by the new endpoints.
- Fixed `useEffect` draft sync to avoid clobbering in-progress user
edits on refetch.
- Added `try/catch` error handling on save (draft stays dirty for
retry).
- Save button disabled during mutation (`isSavingSystemPrompt`).
- Query key follows kebab-case convention (`chat-system-prompt`).
### UX
- Added hint: "When empty, the built-in default prompt is used."
### Tests
- `TestChatSystemPrompt`: GET returns empty when unset, admin can set,
non-admin gets 403.
- dbauthz `TestMethodTestSuite` coverage for both new querier methods.
The Button icon variant applies [&>svg]:size-icon-sm (18px) and
the base applies [&>svg]:p-0.5, both of which silently override
h-*/w-* set directly on child SVGs. This caused the stop icon to
render at 18px instead of 12px and the send arrow to shift
off-center due to uncleared padding.
Pin each icon size via !important on the parent className so the
values are deterministic regardless of Tailwind class order:
- Attach: !size-icon-sm (18px, unchanged visual)
- Stop: !size-3 (12px, matches original intent)
- Send: !size-5 (20px, matches prior visual after padding)
Add Streaming and StreamingInterruptPending stories for the stop
button.
I keep running into the same couple of issues with subagents:
- when I request code analysis, the main agent tends to spawn subagents
to read files and output them verbatim to the main chat
- when I request to implement a feature, the main agent often spawns
subagents that edit the same files and conflict with one another,
reverting each other's changes.
This PR updates the `spawn_agent` tool description to mitigate those
issues.
The `TestGitSSH/Local_SSH_Keys` test was flaking on Windows CI with a
context deadline exceeded error when calling `client.GitSSHKey(ctx)`.
Two issues contributed to the flake:
1. `prepareTestGitSSH` called `coderdtest.AwaitWorkspaceAgents` without
passing the caller's context. This created a separate internal 25s
timeout, wasting time budget independently of the setup context.
Changed to use `NewWorkspaceAgentWaiter(...).WithContext(ctx).Wait()`
so the agent wait shares the caller's timeout.
2. The `Local SSH Keys` subtest used `WaitLong` (25s) for its setup
context, but this subtest does more work than `Dial` (runs the
command twice). Bumped to `WaitSuperLong` (60s) to give slow
Windows CI runners enough time.
Fixes coder/internal#770
Handle errors that were previously assigned to blank identifiers in the
`cli/` package.
- ssh.go: Log ExistsViaCoderConnect DNS lookup error at debug level
instead of silently discarding it. Fallthrough behavior preserved.
- exp_scaletest_llmmock.go: Log srv.Stop() error via the existing
logger instead of discarding it.
The `lint/go` recipe used `$(shell)` inside a recipe to extract the
golangci-lint version. When `MAKE_TIMED=1` (set by pre-commit/pre-push),
make expands `.SHELLFLAGS = $@ -ceu` for `$(shell)` calls, passing the
target name as the first argument to `timed-shell.sh`. Since the target
name doesn't start with `-`, the timing code path runs and its banner
output contaminates the captured value, causing intermittent failures:
```
bash: line 3: lint/go: No such file or directory
```
Replace with bash command substitution (`$$()`), which is the correct
approach under `.ONESHELL` and avoids the `SHELL`/`.SHELLFLAGS`
interaction entirely. Also replaces deprecated `egrep` with `grep -oE`.
_Disclaimer: created with Opus 4.6 and Coder Agents._
## Problem
When `CODER_SESSION_TOKEN` is set as an environment variable with an
invalid value, `coder login` fails with a confusing error:
```
error: Trace=[create api key: ]
You are signed out or your session has expired. Please sign in again to continue.
Suggestion: Try logging in using 'coder login'.
```
The suggestion to run `coder login` is what the user just did, making it
circular and unhelpful.
## Root cause
The `--token` flag is mapped to `CODER_SESSION_TOKEN` via serpent. When
the env var is set, `coder login` picks it up as the session token and
tries to use it to create a new API key, which fails because the token
is invalid. Even if login were to succeed and write a new token to disk,
subsequent commands would still use the env var (which takes precedence
over the on-disk token), so the user would remain stuck.
## Fix
Before attempting login, check if `CODER_SESSION_TOKEN` is set in the
environment. If so, return a clear error telling the user to unset it:
```
the environment variable CODER_SESSION_TOKEN is set, which takes precedence
over the session token stored on disk. Please unset it and try again.
unset CODER_SESSION_TOKEN
```
## Testing
Added `TestLogin/SessionTokenEnvVar` that verifies the error is returned
when the env var is set.
Previously, `coder login token` didn't load the server URL from config,
so it always required `--url` or `CODER_URL` when the keyring stored the
session token. The command only printed the token when the user was
already logged in to a deployment and file storage held the session
token (keyring is the default on Windows/macOS). It also printed an
incorrect token when `--url` was specified and the session token stored
on disk belonged to a different deployment that the user had logged
in to.
This change fixes all of these issues, and also errors out when using
session token file storage with a `--url` argument that doesn't match
the stored config URL, since the file only stores one token and would
silently return the wrong one.
See https://github.com/coder/coder/issues/22733 for a table of the
before/after behaviors.
pre-commit and pre-push only reported total elapsed time at the end,
making it hard to identify which jobs are slow.
Add a `MAKE_TIMED=1` mode that replaces `SHELL` with a wrapper
(`scripts/lib/timed-shell.sh`) to print wall-clock time for each
recipe. pre-commit and pre-push enable this on their sub-makes.
Ad-hoc use: `make MAKE_TIMED=1 test`
Fixes a flaky test (`TestUserTailnetTelemetry/invalid_header`) caused by
sub-microsecond precision mismatch between `time.Now()` calls on
Windows.
The server used `time.Now()` (nanosecond precision) for `ConnectedAt`
and `DisconnectedAt`, while the test compared against its own
`time.Now()`. On Windows, wall-clock jitter can cause the server
timestamp to appear slightly before the test's `predialTime`.
Switch to `dbtime.Now()` which rounds to microsecond precision (matching
Postgres), consistent with all other timestamps in `workspaceagents.go`.
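
For illustration, rounding to microsecond precision can be reproduced with the standard library. This sketch assumes `dbtime.Now()` behaves like `time.Now().UTC().Round(time.Microsecond)`, as the description above suggests; the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// roundToDBPrecision drops sub-microsecond digits, matching the precision
// that Postgres timestamps can store.
func roundToDBPrecision(t time.Time) time.Time {
	return t.UTC().Round(time.Microsecond)
}

// nanosAfterRound returns the nanosecond component left after rounding a
// fixed instant with the given sub-second nanoseconds.
func nanosAfterRound(nanos int64) int {
	return roundToDBPrecision(time.Unix(1700000000, nanos)).Nanosecond()
}

func main() {
	fmt.Println(nanosAfterRound(123456789)) // 123457000
	fmt.Println(nanosAfterRound(123456400)) // 123456000
}
```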
Relates to: https://github.com/coder/internal/issues/1390
The codex registry module v4.2.0 wires `enable_state_persistence`
through to agentapi, completing session resume support. Combined with
the `--type codex` flag added in v4.1.2, Codex now fully preserves
conversation context across pause and resume cycles.
Refs coder/registry#783
Refs coder/registry#785
The Playwright e2e `webServer` starts the Coder server via
`go run -tags embed`, which must compile before serving. The default 60s
timeout leaves no margin when the CI runner is slow.
Failed run:
https://github.com/coder/coder/actions/runs/22782592241/job/66091950715
Successful run:
https://github.com/coder/coder/actions/runs/22782107623/job/66090828826
The server started and printed its banner, but with only ~4s left on the
clock the health check (`/api/v2/deployment/config`) could not complete
before the timeout fired. The same ~2x slowdown shows in the
`make site/e2e/bin/coder` step (45s vs 67s), confirming this is runner
performance variability.
Increase timeout to 120s.
Refs #22727
Each ForkReap call started a reap.ReapChildren goroutine that never
stopped (done=nil). Goroutines accumulated across subtests, racing to
call Wait4(-1, WNOHANG) and stealing the child's wait status before
ForkReap's Wait4(pid) could collect it.
Add a WithDone option to pass the done channel through to ReapChildren,
and use it in tests via a withDone(t) helper.
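
A sketch of the functional-option shape this describes. Only the `WithDone` name comes from the change; the `options` struct and the `startReaper` loop are hypothetical stand-ins and do not call the real reaper.

```go
package main

import (
	"fmt"
	"time"
)

type options struct {
	done <-chan struct{}
}

// Option configures the background loop.
type Option func(*options)

// WithDone passes a channel that stops the loop when closed.
func WithDone(done <-chan struct{}) Option {
	return func(o *options) { o.done = done }
}

// startReaper runs a stand-in reaper loop and returns a channel that is
// closed once the loop has exited.
func startReaper(opts ...Option) <-chan struct{} {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		ticker := time.NewTicker(time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-o.done: // a nil channel never fires, so the loop never stops
				return
			case <-ticker.C:
				// child reaping would happen here
			}
		}
	}()
	return exited
}

// stopsWithinMillis reports whether the loop exits after done is closed.
func stopsWithinMillis(ms int, useDone bool) bool {
	done := make(chan struct{})
	var opts []Option
	if useDone {
		opts = append(opts, WithDone(done))
	}
	exited := startReaper(opts...)
	close(done)
	select {
	case <-exited:
		return true
	case <-time.After(time.Duration(ms) * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println(stopsWithinMillis(500, true))  // loop stops
	fmt.Println(stopsWithinMillis(50, false)) // loop keeps running (done=nil)
}
```

The second call shows the old behavior: without a done channel the goroutine outlives its caller and keeps competing for child wait statuses.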
- Fix dead docker pull retry loop (Make ate bash expansions)
- Make test-postgres-docker idempotent so Phase 2 stops restarting it
mid-test
- Run migrate-ci at recipe time, not parse time
- Install Playwright browsers before e2e tests
- Set test timeout to 20m, 5m shy of CI's 25m job limit
- Cap parallelism at nproc/4 via PARALLEL_JOBS
- Add phase banners and elapsed time
## Problem
Two network requests were blocking the initial page render with
fullscreen `<Loader fullscreen />` spinners:
1. **`POST /api/v2/authcheck`** (permissions) — blocked in `RequireAuth`
via `AuthProvider.isLoading`
2. **`GET /api/v2/organizations`** — blocked in `DashboardProvider`
All other bootstrap queries (`user`, `entitlements`, `appearance`,
`experiments`, `build-info`, `regions`) already used server-side
metadata injection via `index.html` meta tags and resolved instantly.
These two did not.
## Solution
Follow the existing `cachedQuery` + `<meta>` tag pattern to inject both
datasets server-side:
### Server-side (`site/site.go`)
- Add `Permissions` and `Organizations` fields to `htmlState`
- Fetch organizations via `GetOrganizationsByUserID` in parallel with
existing queries
- Evaluate all `permissionChecks` using the RBAC authorizer directly
- Inject results as HTML-escaped JSON into `<meta>` tags
### Frontend
- Register `permissions` and `organizations` in `useEmbeddedMetadata`
- Update `checkAuthorization()` to accept optional metadata and use
`disabledRefetchOptions` when available
- Update `organizations()` to accept optional metadata and use
`cachedQuery` when available
- Wire metadata through `AuthProvider` and `DashboardProvider`
### Note
The Go `permissionChecks` map in `site/site.go` mirrors
`site/src/modules/permissions/index.ts` and must be kept in sync.
## Summary
The backend requires `context_limit` to be a positive integer when
creating a model config, but the frontend form did not visually indicate
this to the user. This caused a confusing error after submission
("Context limit is required. context_limit must be greater than zero.").
## Changes
- Added required asterisk (`*`) to the **Context Limit** label, matching
the existing **Model Identifier** field pattern
- Added Yup `.required()` validation to the `contextLimit` field so the
form catches the missing value client-side before submission
## Before
The "Context Limit" label had no required indicator. Users could submit
the form without filling it in, only to receive a backend error.
## After
The "Context Limit" label now shows a red `*` (consistent with "Model
Identifier"), and the form validates the field as required before
allowing submission.
Created on behalf of @uzair-coder07
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
TestServer_X11_EvictionLRU was timing out under -race because it created
190 sequential SSH shell sessions (~0.55s each = ~105s), exceeding the
90s test timeout. The session count was derived from the production
X11MaxPort constant (6200).
Add a configurable X11MaxPort field to Config so the test can use a
small port range (5 ports instead of 190). This reduces the number of
sessions from 190 to 4, completing in ~3.8s under -race.
The `flush` method sets `start := b.clock.Now()` but later computes
duration with `time.Since(start)` instead of `b.clock.Since(start)` for
the `FlushDuration` metric and the debug log. Line 352 already uses
`b.clock.Since(start)` correctly — this makes the rest consistent.
Test output before fix:
```
flush complete count=100 elapsed=19166h12m30.265728663s reason=scheduled
```
After fix:
```
flush complete count=100 elapsed=0s reason=scheduled
```
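
The mismatch is easy to reproduce with any injected clock. In this sketch the `fakeClock` type is a hypothetical stand-in for the mock clock used in the test:

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a frozen clock, as a test would inject.
type fakeClock struct {
	now time.Time
}

func (c *fakeClock) Now() time.Time                  { return c.now }
func (c *fakeClock) Since(t time.Time) time.Duration { return c.now.Sub(t) }

// newFakeClock returns a clock frozen far in the past.
func newFakeClock() *fakeClock {
	return &fakeClock{now: time.Date(2000, 1, 1, 0, 0, 0, 0, time.UTC)}
}

// fakeElapsedNanos measures with the same clock that produced start.
func fakeElapsedNanos() int64 {
	c := newFakeClock()
	start := c.Now()
	return int64(c.Since(start))
}

// mixedIsHuge measures a fake-clock start against the real wall clock,
// which is the bug described above.
func mixedIsHuge() bool {
	c := newFakeClock()
	start := c.Now()
	return time.Since(start) > time.Hour
}

func main() {
	fmt.Println(time.Duration(fakeElapsedNanos())) // 0s
	fmt.Println(mixedIsHuge())                     // true on any correctly set system clock
}
```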
## Summary
Refactors the right-side panel in the Agents page into a generic tabbed
container with a unified Git panel.
### Changes
**Architecture**
- `SidebarTabView` is now a generic tabbed container with no
git-specific logic, ready for additional tabs
- All Git content lives in a new `GitPanel` component with an internal
Remote/Local segmented control
**Git Panel**
- Remote view: branch/PR diff via `FilesChangedPanel`
- Local view: working tree changes with per-repo headers, commit &
refresh actions
- Split/unified diff toggle restored in the toolbar
- `DiffStatBadge` rendered inside the Remote/Local segmented buttons
(full-height, no rounding, inactive opacity 50%)
**Visual polish**
- Active/inactive/hover states match the sidebar agent selection styles
(`bg-surface-quaternary/25`, `hover:bg-surface-tertiary/50`)
- Inactive tab text uses `text-content-secondary` (not primary)
- Tab button sizing fixed: `min-w-0` + `px-2` to prevent inflated width
- Chat title centered via absolute positioning when panel is fullscreen
- Polished empty states with boxed icons (`GitCompareArrowsIcon` for
Remote, `FileDiffIcon` for Local)
- Unified header styles between Remote and Local sections (both use
`bg-surface-secondary` with consistent icon/text sizing)
- Panel toggle always visible in top bar (not gated on having diff data)
**Cleanup**
- Removed dead code: `DiffStatsInline`, `computeDiffStats` export,
`workingDiffStats` memo, `ChatDiffStatusResponse` import
- Simplified `RepoChangesPanel` to a pure `DiffViewer` wrapper
- Simplified `TopBar` to use a generic `panel` prop instead of
diff-specific props
When the chat WebSocket reconnects, the server replays all buffered
`message_part` events in the initial snapshot. The client's `onOpen`
callback only cleared the stream error but **not** the stream state, so
replayed parts appended to the stale accumulator, doubling (or further
multiplying) the visible text with each reconnect. A page refresh would
clear the issue temporarily since it creates a fresh `ChatStore`.
This was caused by:
- **Server** (`coderd/chatd/chatd.go`): `Subscribe()` unconditionally
includes all buffered `message_part` events in the snapshot sent to
new connections. The `afterMessageID` parameter only filters durable
DB messages, not ephemeral stream parts.
- **Client** (`ChatContext.ts`): The `onOpen` callback in
`createReconnectingWebSocket` called `store.clearStreamError()` but
not `store.clearStreamState()`. When the reconnected stream replays
buffered `message_part` events, `applyMessagePartToStreamState`
blindly appends text via `appendTextBlock`.
The fix was to add `store.clearStreamState()` in the `onOpen` callback
so replayed parts build from a clean slate instead of appending to stale
content.
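
The duplication and the fix can be modeled with a plain accumulator. The client code is TypeScript; this sketch only mirrors the control flow, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// streamState accumulates text from message parts.
type streamState struct {
	text strings.Builder
}

func (s *streamState) appendPart(p string) { s.text.WriteString(p) }
func (s *streamState) clear()              { s.text.Reset() }

// render simulates the server replaying the same buffered parts on each
// connection and returns the text the client would display.
func render(parts []string, connections int, clearOnOpen bool) string {
	var s streamState
	for i := 0; i < connections; i++ {
		if clearOnOpen {
			s.clear() // the fix: start from a clean slate on every open
		}
		for _, p := range parts {
			s.appendPart(p)
		}
	}
	return s.text.String()
}

func main() {
	parts := []string{"Hel", "lo"}
	fmt.Println(render(parts, 2, false)) // HelloHello
	fmt.Println(render(parts, 2, true))  // Hello
}
```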
A red/green verification test was added to ensure the fix works as
expected.
The `test-postgres` Makefile rule was redundant — CI never used it (it
runs `test-postgres-docker` + `make test` via the `test-go-pg` action),
and `make test` auto-starts a Postgres Docker container when needed via
`dbtestutil`.
- Remove the `test-postgres` rule from Makefile
- Update `pre-push` to run `test-postgres-docker` in the first phase
(alongside gen/fmt) and `make test` in the second phase
- Fix stale comments in CI workflows referencing `make test-postgres`
- Remove redundant "Test Postgres" entries from docs since `make test`
handles Postgres automatically
Closes https://github.com/coder/internal/issues/1391
## Problem
The `test-go-pg (macos-latest)` job hit its 25m timeout without ever
running tests because `brew install google-chrome` stalled for 23+
minutes downloading from the Homebrew CDN:
```
==> Fetching downloads for: google-chrome
Error: The operation was canceled.
```
## Why this is safe to remove
`brew install google-chrome` was added in Oct 2023 (`70a4e56c0`) the day
after chromedp was integrated into the scaletest/dashboard package
(`1c48610d5`). At that time, `run.go` called `initChromeDPCtx` directly
(hardcoded), so the unit test actually launched a real Chrome process.
In Jun 2024, #13650 refactored this to accept a mock `InitChromeDPCtx`
via the `Config` struct, and the test now passes a stub that never
launches a browser.
No test file in the repo references `chromedp` directly; the only test
(`scaletest/dashboard/run_test.go`) fully mocks Chrome initialization.
The `chromedp` Go library compiles fine without Chrome installed; it
only needs the binary at runtime, and no test exercises that path.
## Impact
- Removes a ~200MB+ download from every macOS CI run
- Eliminates a fragile external dependency on Homebrew CDN availability
- Saves several minutes per run even when the download succeeds
_Generated with mux but reviewed by a human_
## Problem
Rate limiting by user is broken (#20857). The rate limit middleware runs
before API key extraction, so user ID is never in the request context.
This causes:
- Rate limiting falls back to IP address for all requests
- `X-Coder-Bypass-Ratelimit` header for Owners is ignored (can't verify
role without identity)
## Solution
Adds `PrecheckAPIKey`, a **root-level middleware** that fully validates
the API key on every request (expiry, OIDC refresh, DB updates, role
lookup) and stores the result in context. Added **once** at the root
router — not duplicated per route group.
### Architecture
```
Request → Root middleware stack:
→ ExtractRealIP, Logger, ...
→ PrecheckAPIKey(...) ← validates key, stores result, never rejects
→ HandleSubdomain(apiRateLimiter) ← workspace apps now also benefit
→ CORS, CSRF
→ /api/v2 or /api/experimental:
→ apiRateLimiter ← reads prechecked result from context
→ route handlers:
→ ExtractAPIKeyMW ← reuses prechecked data, adds route-specific logic
→ handler
```
### Key design decisions
| Decision | Rationale |
|---|---|
| **Full validation, not lightweight** | Spike's review: "the whole idea of a 'lightweight' extraction that skips security checks is fundamentally flawed." Only fully validated keys are used for rate limiting; expired/invalid keys fall back to IP. |
| **Structured error results** | `ValidateAPIKeyError` has a `Hard` flag that maps to `write` vs `optionalWrite`. Hard errors (5xx, OAuth refresh failures) surface even on optional-auth routes. Soft errors (missing/expired token) are swallowed on optional routes. |
| **Added once at the root** | Spike's review: "Why can't we add it once at the root?" Root placement means workspace app rate limiters also benefit. |
| **Skip prechecked when `SessionTokenFunc != nil`** | `workspaceapps/db.go` uses a custom `SessionTokenFunc` that extracts from `issueReq.SessionToken`. The prechecked result may have validated a different token. Falls back to `ValidateAPIKey` with the custom func. |
| **User status check stays in `ExtractAPIKey`** | Dormant activation is route-specific: `ValidateAPIKey` stores status but doesn't enforce it. |
| **Audience validation stays in `ExtractAPIKey`** | Depends on `cfg.AccessURL` and request path, uses `optionalWrite(403)` which depends on route config. |
### Changes
- **`coderd/httpmw/apikey.go`**:
- New `ValidateAPIKey` function — extracted core validation logic,
returns structured errors instead of writing HTTP responses
- New `PrecheckAPIKey` middleware — calls `ValidateAPIKey`, stores
result in `apiKeyPrecheckedContextKey`, never rejects
- New types: `ValidateAPIKeyConfig`, `ValidateAPIKeyResult`,
`ValidateAPIKeyError`, `APIKeyPrechecked`
- Refactored `ExtractAPIKey` — consumes prechecked result from context
(skipping redundant validation), falls back to `ValidateAPIKey` when no
precheck available
- Removed `ExtractAPIKeyForRateLimit` and `preExtractedAPIKey`
- **`coderd/httpmw/ratelimit.go`**: Rate limiter checks
`apiKeyPrecheckedContextKey` first, then `apiKeyContextKey` fallback
(for unit tests / workspace apps), then IP
- **`coderd/coderd.go`**: Added `PrecheckAPIKey` once at root
`r.Use(...)` block, removed `ExtractAPIKeyForRateLimit` from `/api/v2`
and `/api/experimental`
- **`coderd/coderd_test.go`**: `TestRateLimitByUser` regression test
with `BypassOwner` subtest
Fixes #20857
Fixes a regression where image attachments in user chat messages were
rendered twice, once inside the bubble container and once outside it.
- **ConversationTimeline.tsx**: Remove 43 duplicate lines (outer image
block + second fade overlay) from the `ChatMessageItem` user-message
branch.
- **ConversationTimeline.stories.tsx** (new): Add focused stories for
`ConversationTimeline` with `play` function assertions on image
thumbnail counts to guard against this class of regression.
Bumps rust from `c0a38f5` to `d6782f2`.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Agents hit short shell timeouts on `git commit` (~13s) before
`make pre-commit` finishes (~20s warm), then disable hooks via
`git config core.hooksPath /dev/null`. This bypasses all local checks
and, because it writes to shared `.git/config`, silently disables hooks
for every other worktree too.
Add explicit timing guidance to AGENTS.md, and write worktree-scoped
`core.hooksPath` in post-checkout, pre-commit, and pre-push hooks to
make the bypass ineffective.
Follow-up to #22705 (pre-commit/pre-push hooks).
Unifies `test` and `test-race` into the same structure and lets CI call
`make test-race` instead of reproducing the gotestsum command.
**Parallelism**: Extracted from `GOTEST_FLAGS` into `TEST_PARALLEL_PACKAGES`
/ `TEST_PARALLEL_TESTS` (default 8x8). `test-race` overrides to 4x4 via
target-specific Make variables. `TEST_NUM_PARALLEL_PACKAGES` and
`TEST_NUM_PARALLEL_TESTS` env vars continue to work for both targets.
**GOTEST_FLAGS**: Changed from simply-expanded (`:=`) to recursively-expanded
(`=`) so target-specific overrides take effect at recipe time.
**CI**: `.github/actions/test-go-pg/action.yaml` now calls `make test-race`
/ `make test` instead of hand-rolling the gotestsum command, eliminating
drift between local and CI configurations.
Refs #22705
Adds a guard + some unit tests to ensure that we don't try to fetch git
changes if there's no workspace agent from which to do so.
Generated by Claude Opus 4.6 but read using Cian's eyeballs.
## Bug
After compaction in the chat loop, the loop re-enters and calls the LLM
with a prompt that has **no non-system messages**. Anthropic (and most
providers) require at least one user/assistant/tool message, so the API
errors with empty messages.
## Root Cause
The compaction summary was stored as `role=system`. After compaction,
`GetChatMessagesForPromptByChatID` returns only:
- The compressed system summary (matched by the CTE)
- Original non-compressed system messages (system prompts)
All original user/assistant/tool messages are excluded (they predate the
summary). The compaction assistant/tool messages are `compressed=TRUE`
and don't match the main query's `compressed=FALSE` clauses.
So `ReloadMessages` returned only system messages. The Anthropic
provider moves system messages into a separate `system` field, leaving
the `messages` API field as `[]`.
## Fix
1. **Changed compaction summary from `role=system` to `role=user`** —
the summary now appears as a user message in the reloaded prompt, giving
the model valid conversational context to respond to.
2. **Simplified the CTE** — removed the `role = 'system'` check and
narrowed `visibility IN ('model', 'both')` to just `visibility =
'model'`. The summary is the only compressed message with
`visibility=model` (the assistant has `visibility=user`, the tool has
`visibility=both`), so the role check was redundant.
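To illustrate why the role change matters, here is a minimal sketch of a provider that lifts system messages into a separate field. The `message` type and the `splitForProvider` and `reloadedPrompt` helpers are hypothetical and only model the behavior described above; they are not the chat loop's actual code.

```go
package main

import "fmt"

type message struct {
	Role    string
	Content string
}

// splitForProvider mimics a provider that moves system messages into a
// separate field and sends everything else as the messages array.
func splitForProvider(prompt []message) (system []string, messages []message) {
	for _, m := range prompt {
		if m.Role == "system" {
			system = append(system, m.Content)
			continue
		}
		messages = append(messages, m)
	}
	return system, messages
}

// reloadedPrompt builds a post-compaction prompt with the summary stored
// under the given role.
func reloadedPrompt(summaryRole string) []message {
	return []message{
		{Role: "system", Content: "You are a coding agent."},
		{Role: summaryRole, Content: "Summary of the conversation so far."},
	}
}

func main() {
	_, before := splitForProvider(reloadedPrompt("system"))
	_, after := splitForProvider(reloadedPrompt("user"))
	fmt.Println("messages with role=system summary:", len(before))
	fmt.Println("messages with role=user summary:", len(after))
}
```

With the summary stored as `system`, nothing is left for the messages array; stored as `user`, the summary survives the split.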
## Test
`PostRunCompactionReEntryIncludesUserSummary`: verifies the re-entry
prompt contains a user message (the compaction summary) after compaction
+ reload.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Summary
Adds a line-reference and annotation system for diffs in the Agents UI.
Users can click line numbers in the Git diff panel to open an inline
prompt input, type a comment, and have a reference chip + text added to
the chat message input.
## Changes
### Backend
- Added `diff-comment` type to `ChatInputPart` and `ChatMessagePart` in
`codersdk/chats.go` with `FileName`, `StartLine`, `EndLine`, `Side`
fields
### Frontend
- **`DiffCommentContext`**: React context/provider managing pending diff
comments with `addReference`, `removeComment`, `restoreComment`,
`clearComments`
- **`DiffCommentNode`**: Lexical `DecoratorNode` rendering inline chips
in the chat input showing file:line references. Chips are clickable
(scroll to line in diff), removable, and support undo/redo via mutation
tracking
- **`InlinePromptInput`**: Textarea annotation rendered inline under
clicked lines in the diff. Supports multiline (Shift+Enter), submit
(Enter), cancel (Escape)
- **`FilesChangedPanel`**: Line click/drag-select handlers open the
inline input. On submit, a badge chip + plain text are inserted into the
Lexical editor
- **`AgentDetail`**: Bidirectional sync between DiffCommentContext and
Lexical editor. Comments are sent as `diff-comment` parts on message
submit
- **`ConversationTimeline`**: Renders `diff-comment` message parts with
file:line labels
## How it works
1. Click a line number in the diff → inline textarea appears below that
line
2. Type a comment and press Enter → reference chip appears in chat input
with your text after it
3. Send the message → diff-comment parts are included alongside the
message text
## Problem
`TestE2E_WriteFileTriggersGitWatch` and `TestE2E_SubagentAncestorWatch`
flake intermittently in `test-go-race-pg` with:
```
agentgit_test.go:1271: timed out waiting for server message
```
## Root Cause
In `handleWatch()`, `GetPaths(chatID)` was called **before**
`Subscribe(chatID)` on the PathStore. If `AddPaths()` fired between
those two calls:
1. `GetPaths()` returned empty (paths not added yet).
2. `AddPaths()` stored the paths and called `notifySubscribers()` — but
the subscription channel didn't exist yet, so the notification was a
no-op.
3. `Subscribe()` created the channel, but the notification was already
lost.
4. The handler never scanned, and the mock clock never advanced the 30s
fallback ticker → timeout.
Both failing tests connect the WebSocket with an empty PathStore and
immediately call `AddPaths()` from the test goroutine, making them
vulnerable to this scheduling interleaving.
## Fix
Swap the order: call `Subscribe()` first, then `GetPaths()`. This
guarantees:
| `AddPaths` fires... | `Subscribe` sees it? | `GetPaths` sees it? | Outcome |
|---|---|---|---|
| Before `Subscribe` | No | **Yes** | Picked up by `GetPaths` |
| Between the two calls | **Yes** (queued) | **Yes** | Redundant but safe (delta dedupes) |
| After `GetPaths` | **Yes** | No | Goroutine handles it |
No window exists where both miss it.
Verified with 10,000 iterations (`-race -count=5000`) — zero failures.
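The lost-notification window can be reproduced deterministically with a toy store. This is a sketch under assumptions: `pathStore` below is a simplified, hypothetical model (no chat IDs, one buffered channel per subscriber), and `observed` forces the racing `AddPaths` call to land between the two setup steps rather than relying on scheduling.

```go
package main

import (
	"fmt"
	"sync"
)

// pathStore is a simplified, hypothetical stand-in for the PathStore.
type pathStore struct {
	mu    sync.Mutex
	paths []string
	subs  []chan struct{}
}

// Subscribe registers a buffered channel that is signalled on every AddPaths.
func (s *pathStore) Subscribe() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan struct{}, 1)
	s.subs = append(s.subs, ch)
	return ch
}

// GetPaths returns a copy of the currently stored paths.
func (s *pathStore) GetPaths() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.paths...)
}

// AddPaths stores paths and notifies subscribers without blocking.
func (s *pathStore) AddPaths(p ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.paths = append(s.paths, p...)
	for _, ch := range s.subs {
		select {
		case ch <- struct{}{}:
		default: // a notification is already pending
		}
	}
}

// observed reports whether a watcher sees an AddPaths call that lands
// between its two setup steps, for either ordering of those steps.
func observed(subscribeFirst bool) bool {
	s := &pathStore{}
	var ch <-chan struct{}
	var initial []string
	if subscribeFirst {
		ch = s.Subscribe()
		s.AddPaths("/repo") // the racing write
		initial = s.GetPaths()
	} else {
		initial = s.GetPaths()
		s.AddPaths("/repo") // the racing write
		ch = s.Subscribe()
	}
	select {
	case <-ch:
		return true
	default:
		return len(initial) > 0
	}
}

func main() {
	fmt.Println("GetPaths then Subscribe:", observed(false))
	fmt.Println("Subscribe then GetPaths:", observed(true))
}
```

In the old ordering the write is visible to neither step; in the new ordering it is visible to both, which matches the "between the two calls" row of the table.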
Fixes coder/internal#1389
## Root Cause
The `createAndBuildTemplateVersion` mutation calls
`waitBuildToBeFinished`, which polls `getTemplateVersion` behind a real
`delay()` call:
```ts
await delay(jobStatus === "pending" ? 250 : 1000);
```
On the first iteration, `jobStatus` is `undefined` (not `"pending"`), so
the delay is **1000 ms**. The `waitFor` assertion in the test uses the
default `@testing-library` timeout, which is also **1000 ms**. The
`toast.success` call fires right at or after the timeout boundary,
making the test flaky under CI load.
## Fix
Mock `utils/delay` to resolve immediately at the top of the test file.
This eliminates the 1 s wall-clock wait in `waitBuildToBeFinished`, so
the async submit chain completes in microtasks and the `toast.success`
spy is called well within the `waitFor` window.
## Verification
- Both tests pass (`renders with variables` + `user submits the form
successfully`)
- **50/50 passes** under stress testing (sequential runs with
`--no-cache`)
- Submit test time dropped from ~2000 ms to ~1400 ms
## Problem
`offlinedocs/next-env.d.ts` was committed with content from an older
Next.js version. Next.js 15 rewrites this file on every `next build`
with two changes:
1. Adds `/// <reference path="./.next/types/routes.d.ts" />`
2. Updates the docs URL from `basic-features/typescript` to
`pages/api-reference/config/typescript`
During `make pre-commit` / `make pre-push`, the `pnpm export` step
triggers `next build`, which silently rewrites the file. The
`check-unstaged` guard then detects the diff and fails. If the hook is
interrupted, the regenerated file persists as an unstaged change,
blocking subsequent commits/pushes.
## Fix
Update the committed file to match what the current Next.js 15 produces,
making the build idempotent.
## Problem
When `message_agent` is called with `interrupt=true`, two independent
code paths race to persist messages:
1. `SendMessage` inserts the **user message** into `chat_messages` at
time T1
2. `persistInterruptedStep` saves the partial **assistant response** at
time T2 (T2 > T1)
Since `chat_messages` are ordered by `(created_at, id)`, the assistant
message ends up **after** the user message that triggered the interrupt.
On reload, this produces a broken conversation where the interrupted
response appears below the new user message — and Anthropic rejects the
trailing assistant message as unsupported prefill.
The root cause is that **two independent writers can't guarantee
ordering**. Any solution involving timestamp manipulation or
signal-then-wait coordination leaves race windows.
## Fix
Route interrupt behavior through the existing queued message mechanism:
1. `SendMessage` with `BusyBehaviorInterrupt` now inserts into
`chat_queued_messages` (not `chat_messages`) when the chat is busy
2. After queuing, `setChatWaiting` signals the running loop to stop
3. The deferred cleanup in `processChat` persists the partial assistant
response first, then auto-promotes the queued user message
This eliminates the race entirely: the assistant partial response and
user message are written by the same serialized cleanup flow, so
ordering is guaranteed by the DB's auto-incrementing `id` sequence. No
timestamp hacks, no reordering at send time.
Supersedes #22728 — fixes the root cause instead of reordering at prompt
construction time.
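As a rough model of why a single writer fixes the ordering, consider the sketch below. The `chat` type and its methods are hypothetical; they stand in for `chat_messages`, `chat_queued_messages`, and the deferred cleanup in `processChat`, and the sketch simplifies by promoting every queued message.

```go
package main

import "fmt"

type row struct {
	ID   int
	Role string
	Text string
}

// chat is a hypothetical in-memory model of the two tables; IDs come from
// a single increasing sequence, as with an auto-incrementing column.
type chat struct {
	nextID   int
	messages []row
	queued   []string
}

func (c *chat) insert(role, text string) {
	c.nextID++
	c.messages = append(c.messages, row{ID: c.nextID, Role: role, Text: text})
}

// sendInterrupt queues the user message instead of inserting it directly.
func (c *chat) sendInterrupt(text string) {
	c.queued = append(c.queued, text)
}

// cleanup is the only writer after an interrupt: it persists the partial
// assistant output first, then promotes the queued user messages.
func (c *chat) cleanup(partial string) {
	if partial != "" {
		c.insert("assistant", partial)
	}
	for _, q := range c.queued {
		c.insert("user", q)
	}
	c.queued = nil
}

func main() {
	c := &chat{}
	c.insert("user", "first question")
	c.sendInterrupt("actually, do this instead")
	c.cleanup("partial answer to the first ques")
	for _, m := range c.messages {
		fmt.Println(m.ID, m.Role)
	}
}
```

Because both inserts happen inside `cleanup`, the partial assistant row always receives the lower ID, regardless of when the interrupting message arrived.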
`Test_ProxyServer_Headers` never passed `--http-address`, so it bound to
the default `127.0.0.1:3000`. `TestWorkspaceProxy_Server_PrometheusEnabled`
used `RandomPort(t)` for `--http-address` (a drive-by from #14972 which was
fixing the Prometheus port).
Both now use `--http-address :0`. `ConfigureHTTPServers` calls
`net.Listen("tcp", ":0")` and holds the listener open, so there is no
TOCTOU window. Neither test connects to the HTTP listener, so the
assigned port is irrelevant. This matches `cli/server_test.go` where
`:0` is used throughout.
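For readers unfamiliar with the `:0` idiom, a minimal sketch follows. The `listenEphemeral` helper is hypothetical, and it binds to `127.0.0.1:0` rather than `:0` so the example only listens on loopback.

```go
package main

import (
	"fmt"
	"net"
)

// listenEphemeral asks the kernel for a free port and returns the open
// listener together with the port that was assigned.
func listenEphemeral() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return ln, ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	ln, port, err := listenEphemeral()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on port", port)
}
```

The port is chosen and bound in one step, so there is no gap between picking a port and listening on it in which another process could claim it.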
go-git has bugs in gitignore logic. With more complex gitignores, some
paths that should be ignored aren't. That caused extra, unexpected files
to appear in the git diff panel.
If the git cli isn't available in a workspace, the /git/watch endpoint
will still allow the frontend to connect, but no git changes will ever
be transmitted.
Split from #22693 per review feedback.
Fixes multiple bugs in coderd/chatd and sub-packages including race
conditions, transaction safety, stream buffer bounds, retry limits, and
enterprise relay improvements.
See commit message for full list.
Split from #22693 per review feedback.
Fixes SSE error handling and adds WebSocket reconnection with
exponential backoff to the AgentsPage chat list watcher.
The Button base styles apply `[&>svg]:p-0.5` (2px padding) to child
SVGs. In the small `size-7` rounded attach button, this extra padding
shifts the 16x16 ImageIcon off-center. Override with `[&>svg]:p-0` to
remove it.
On small viewports (below `xl`) the git/changes panel was expanding as a
bottom sheet. This changes it to always appear from the right side:
- **Mobile (<`sm`/640px):** Panel opens full-width (`w-[100vw]`) as a
right-side overlay
- **`sm`+ (640px+):** Panel uses the persisted width (`--panel-width`)
with min 360px / max 70vw, drag handle enabled
- Parent flex container is always `flex-row` instead of `flex-col
xl:flex-row`
### Changes
- `AgentDetail.tsx`: Removed `flex-col xl:flex-row` responsive switch,
always uses `flex-row`
- `RightPanel.tsx`: Replaced bottom-sheet layout (`h-[42dvh]`) with
right-side panel at all breakpoints. Full viewport width below `sm`,
resizable width at `sm`+. Drag handle activates at `sm` instead of `xl`.
This change adds support for image attachments to chat via add button
and clipboard paste. Files are stored in a new `chat_files` table and
referenced by ID in message content. File data is resolved from storage
at LLM dispatch time, keeping the message content column small.
Upload validates MIME types via content type or content sniffing against
an allowlist (png, jpeg, gif, webp). The retrieval endpoint serves files
with immutable caching headers. On the frontend, uploads start eagerly
on attach with a background fetch to pre-warm the browser HTTP cache so
the timeline renders instantly after send.
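Content sniffing against an allowlist can be sketched with the standard library's `http.DetectContentType`. The `allowedImageTypes` map and the `sniffAllowed` helper are hypothetical names, and this sketch does not claim to match the upload handler's actual implementation.

```go
package main

import (
	"fmt"
	"net/http"
)

// allowedImageTypes mirrors the allowlist described above.
var allowedImageTypes = map[string]bool{
	"image/png":  true,
	"image/jpeg": true,
	"image/gif":  true,
	"image/webp": true,
}

// sniffAllowed detects the content type from the leading bytes and checks
// it against the allowlist.
func sniffAllowed(data []byte) (string, bool) {
	ct := http.DetectContentType(data) // considers at most the first 512 bytes
	return ct, allowedImageTypes[ct]
}

func main() {
	png := []byte("\x89PNG\r\n\x1a\n") // PNG file signature
	fmt.Println(sniffAllowed(png))
	fmt.Println(sniffAllowed([]byte("<html><body>hi</body></html>")))
}
```

Sniffing the bytes rather than trusting a client-supplied header means a mislabeled file is classified by what it actually contains.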
This change adds git hooks and Makefile targets that mirror CI required
checks locally, catching issues before they reach CI.
This is for use by AI agents (documented in AGENTS.md).
- **pre-commit** (every commit): gen, fmt, lint, typos, slim binary
build. Fast checks without Docker or Playwright.
- **pre-push** (before push): full CI suite including site build, tests,
sqlc-vet, offlinedocs.
To use:
```sh
git config core.hooksPath scripts/githooks
```
Works in worktrees (where `.git` is a file). Bypass with `--no-verify`.
Replaces the single-purpose PR diff right panel with a tabbed sidebar
that shows both the existing PR diff and real-time git repository
changes from the workspace agent.
There's an accompanying backend PR
[here](https://github.com/coder/coder/pull/22565).
https://github.com/user-attachments/assets/bbd53f1c-d753-4574-a159-6dad5989e5e3
## Backend surface
One endpoint drives this feature:
- **`WS /api/experimental/workspaceagents/{id}/git/watch`** —
bidirectional WebSocket. The client sends `refresh` messages; the agent
responds with `changes` messages containing per-repo branch and unified
diff. The workspace agent also automatically pushes changes as they
occur in the workspace.
Pins the `coder/coderd` Terraform provider in `dogfood/main.tf` to `>=
0.0.13`.
Previously there was no version constraint at all. The latest release
[v0.0.13](https://github.com/coder/terraform-provider-coderd/releases/tag/v0.0.13)
includes:
- `workspace_sharing` attribute on `coderd_organization` resource
- `cors_behavior` support on template resource
- Dependency updates (coder/coder SDK bumped to v2.29.2, various
Terraform plugin framework updates)
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Summary
This PR adds a follow-up flow for paused Tasks so users can submit
another prompt as part of resuming the same task/session.
https://github.com/user-attachments/assets/eabe5e91-704c-44ad-9e28-39f55e6c5923
## What changed
- **Task page UX**
- Added a new **Follow-up** action in paused task state.
- Added `FollowUpDialog` to collect and submit follow-up input.
- **Follow-up flow behavior**
- If task is paused/stopped: call resume first, then wait for polling to
observe task `active`, then send input.
- Dialog closes itself after successful send.
- Added clear error handling for:
- resume failure
- build failure/canceled while resuming
- send failure
- **Stable API route parity**
- Added `POST /tasks/{user}/{task}/pause` and `POST
/tasks/{user}/{task}/resume` to the stable `/api/v2/tasks` router block.
- **SDK alignment**
- Updated `codersdk` pause/resume methods to use stable
`/api/v2/tasks/...` endpoints instead of `/api/experimental/...`.
- **Frontend API/query alignment**
- `site` task pause/resume/send paths are on stable `/api/v2/tasks/...`.
- Updated task query helpers accordingly.
- **Storybook coverage**
- Added follow-up dialog stories for key states:
- open dialog
- active direct send
- auto-resume then send
- resuming progress visible
- resume build failure
- send failure
- empty message disabled
- Added/updated mocks for task logs and new follow-up flows.
Closes https://github.com/coder/internal/issues/1269
## Problem
Bootstrap scripts under `provisionersdk/scripts/` are inlined into
templates via `sh -c '${init_script}'`. Any single quote (apostrophe) in
these `.sh` files silently breaks the shell quoting, causing the agent
to never start — with near-invisible error output.
## Changes
- **`scripts/check_bootstrap_quotes.sh`** — new lint script that scans
all `.sh` files under `provisionersdk/scripts/` for single quotes and
fails with a clear error if any are found. Only checks shell scripts
(not `.ps1`, which legitimately uses single quotes).
- **`Makefile`** — added `lint/bootstrap` target wired into the `lint`
dependency list.
Fixes #22062
We now use Linear for Triaging
Adds real-time git status watching for workspace agents, so the frontend
can subscribe over WebSocket and show
git file changes in near real-time.
1. Subscription is scoped to a **chat** via `GET
/api/experimental/chats/{chat}/git/watch`.
2. The workspace agent automatically determines which paths to watch
based on tool calls made by the chat (and its ancestor chats).
3. Workspace agent polls subscribed repo working trees on a 30s
interval, on tool calls, and on explicit `refresh` from the client.
4. Scans are rate-limited to at most once per second.
5. Edited paths are tracked **in-memory** inside the workspace agent.
There is no database persistence, so state is lost on agent restart. This
will be addressed in a future PR.
6. Messages sent over WebSocket include a full-repo snapshot (unified
diff, branch, origin). A new message is emitted only when the snapshot
changes.
This PR was implemented with AI with me closely controlling what it's
doing. The code follows a plan file that was updated continuously during
implementation. Here's the file if you'd like to see it:
[project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2).
It reflects the current state of the PR.
`time.Now()` has nanosecond precision while Postgres timestamps are
microsecond precision. When tests compare `time.Now()` against
DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there
is a non-zero flake risk from the precision mismatch.
This replaces `time.Now()` with `dbtime.Now()` (which rounds to
microsecond precision) in all test assertions that compare against
database timestamps.
Follows from #22684.
## Changes (11 files)
| File | Changes |
|---|---|
| `coderd/apikey_test.go` | 11 comparisons with `ExpiresAt` |
| `coderd/users_test.go` | 2 comparisons with `ExpiresAt` |
| `coderd/oauth2_test.go` | 1 comparison with `token.Expiry` |
| `coderd/workspaces_test.go` | 2 comparisons with `DormantAt` |
| `coderd/workspaceagents_test.go` | 3 comparisons with `ConnectedAt`/`DisconnectedAt` |
| `coderd/workspaceapps/db_test.go` | 1 comparison with `token.Expiry` |
| `coderd/provisionerdserver/provisionerdserver_test.go` | 1 comparison with `key.ExpiresAt` |
| `enterprise/coderd/workspaces_test.go` | 1 comparison with `DormantAt` |
| `enterprise/coderd/license/license_test.go` | 3 `NotBefore` values |
| `enterprise/coderd/licenses_test.go` | 2 `NotBefore` values |
| `enterprise/coderd/users_test.go` | 3 `Next()` comparisons |
## Not changed (intentionally)
- `scaletest/placebo/run_test.go` — compares wall-clock elapsed time,
not DB timestamps
- `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`,
`enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields,
not DB-stored
- `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not
DB
🤖 Generated by Claude Opus 4.6 but reviewed manually.
This PR does three things:
- Exports derp expvars to the pprof endpoint
- Exports the expvar metrics as prometheus metrics in both coderd and
wsproxy
- Updates our tailscale dependency to include a fix I also had to make to
avoid a data race
I generated this with mux but I also manually tested that the metrics
were getting properly emitted
## Problem
When creating a new chat from the `/agents` page and navigating back,
the initial prompt was still displayed. The draft text persisted in
`localStorage` (`agents.empty-input`) even after the chat was
successfully created.
### Root cause
After `handleSend` synchronously clears the localStorage key, the React
re-render (caused by `isCreating` flipping to `true`) triggers Lexical's
`ContentChangePlugin`, which fires `handleContentChange` with the old
editor content — re-writing the draft back to localStorage before
navigation occurs.
## Fix
- Extract the draft persistence logic into a `useCreatePageDraft` hook
with a `sentRef` guard
- Once `markSent()` is called, `handleContentChange` skips all
localStorage writes
- This prevents the Lexical editor's change events from re-persisting
the draft during the async gap between send and navigation
## Testing
Added 6 unit tests for `useCreatePageDraft`, including a regression test
that reproduces the exact bug scenario (calling `handleContentChange`
with old content after `markSent`).
Adds a docs page under /docs/ai-coder/agents describing our philosophy
on platform team control over agent behavior: admin-level configuration,
zero developer options, enforcement over defaults. Covers what's
available today (providers, models, system prompt, template routing) and
where we're headed (usage analytics, infra-level enforcement, tool
customization).
To allow tasks to be shareable, we need to share both the `task`
resource and the `workspace` resource, and their sharing state needs to
be kept in sync. We've already implemented all of the necessary ACL
functionality for workspaces, so we can just sort of proxy those ACLs
back to the task as well.
Follow-up to #22612. Running `git status --short` in a loop during `make
-B -j gen` still showed intermediate states for several files. This PR
fixes the remaining ones.
The main issues:
- `generate.sh` ran `gofmt` and `goimports` in-place after moving files
into the source tree. Now it formats in a workdir first and only `mv`s
the final result.
- `protoc` targets wrote directly to the source tree. Wrapped with
`scripts/atomic_protoc.sh` which redirects output to a tmpdir.
- Several generators used hardcoded `/tmp/` paths. On systems where
`/tmp` is tmpfs, `mv` degrades to copy+delete. Switched to a
project-local `_gen/` directory (gitignored, same filesystem).
- `apidoc/.gen` and `cli/index.md` used `cp` for final output. Replaced
with `mv`.
- `manifest.json` was written twice (unformatted, then formatted). Now
`.gen` writes to a staging file and the manifest target does one
formatted atomic write.
- `biome_format.sh` silently skipped files in gitignored dirs. Added
`--vcs-enabled=false`.
Two helpers reduce the Makefile boilerplate: `scripts/atomic_protoc.sh`
(wraps protoc) and an `atomic_write` Make define
(stdout-to-temp-to-target pattern). `.PRECIOUS` now also covers `.pb.go`
and mock files.
Verification: `make -B -j gen` x3 with `git status` polling, no changes.
Refs #22612
When a git change has zero additions and zero deletions (like adding a
binary file/image), the diff stats were hidden entirely because of
`additions === 0 && deletions === 0` early-return guards.
This changes the behavior so that `+0 -0` is always rendered when there
are changed files, ensuring visibility in both the sidebar and the Git
tab.
### Changes
**`DiffStats.tsx`**
- `DiffStatNumbers`: Removed the `null` early return — always renders
both `+N` and `−N` counters.
- `DiffStatBadge`: Now only returns `null` when there are no changed
files AND both counts are zero. Always renders both pills.
- `DiffStatsInline`: Same guard — shows `+0 −0` clickable stats when
files changed but lines are zero.
**`AgentsSidebar.tsx`**
- `hasLineStats` now also checks `changedFiles > 0`, so the sidebar
entry shows `+0 -0` for binary-only diffs.
- Removed the `additions > 0` / `deletions > 0` conditional wrappers —
both values are always rendered.
The `--parameter-default` value is now used to pre-select the default option for a coder parameter
with option blocks when prompting interactively in CLI.
Related to: https://github.com/coder/coder/issues/22078
Closes #22140
A short, simple PR that adds more detail to our `<Alert />` stacks, so
errors are surfaced in the UI instead of asking the user to read the
developer console.
- Implement `Response data` and `Stack trace` `<details />`
- Fix overflow in `ErrorAlert` debug accordions so long `Response data`
and `Stack Trace` content stays inside the alert.
- Add horizontal scroll wrappers around both `<pre>` blocks used in
debug details.
- Update `Alert` layout with `min-w-0` on flex containers so nested
content can shrink correctly and internal scrolling works as intended.
<img width="739" height="550" alt="preview-validation"
src="https://github.com/user-attachments/assets/a6f890d3-8f1f-4fd6-b9d0-882838db04a4"
/>
The `apple-touch-icon.png` had weird padding/cropping that looked wrong
when added to the iOS home screen. This replaces it with a 180×180
resize of the canonical `pwa-icon-512.png` so the icon matches the PWA
icon exactly — no extra padding.
## Summary
When a chat's workspace is stopped, the LLM previously had no way to
start it — `create_workspace` would either create a duplicate workspace
or fail. This adds a dedicated `start_workspace` tool to the agent flow.
## Changes
### New: `start_workspace` tool
(`coderd/chatd/chattool/startworkspace.go`)
- Detects if the chat's workspace is stopped and starts it via a new
build with `transition=start`
- Reuses the existing `waitForBuild` and `waitForAgent` helpers (shared
logic)
- Shares the workspace mutex with `create_workspace` to prevent races
- Idempotent: returns immediately if the workspace is already running or
building
- Returns a `no_agent` / `not_ready` status if the agent isn't available
yet (non-fatal)
### Updated: `create_workspace` stopped-workspace hint
- `checkExistingWorkspace` now returns a `stopped` status with message
`"use start_workspace to start it"` when it detects the chat's workspace
is stopped, instead of falling through to create a new workspace
### Wiring
- `chatd.Config` / `chatd.Server`: new `StartWorkspace` /
`startWorkspaceFn` field
- `coderd/chats.go`: new `chatStartWorkspace` method that calls
`postWorkspaceBuildsInternal` with proper RBAC context
- `coderd/coderd.go`: passes `chatStartWorkspace` into chatd config
- Tool registered alongside `create_workspace` for root chats only (not
subagents)
### Tests (`startworkspace_test.go`)
- `NoWorkspace`: error when chat has no workspace
- `AlreadyRunning`: idempotent return for workspace with successful
start build
- `StoppedWorkspace`: verifies StartFn is called, build is waited on,
and success response returned
## Problem
When `coder ssh --stdio` checks for Coder Connect availability, it
constructs a hostname like `agent.workspace.owner.coder` and performs a
DNS AAAA lookup via `ExistsViaCoderConnect`. Without a trailing dot,
this hostname is not a fully-qualified domain name (FQDN), so the system
DNS resolver appends each configured search domain before querying.
Go's pure-Go DNS resolver (used when `CGO_ENABLED=0`, which is the
default for CLI builds) does **not** stop after getting NXDOMAIN on the
first name. It tries all names in the search list sequentially:
1. `agent.workspace.owner.coder.` → NXDOMAIN (fast)
2. `agent.workspace.owner.coder.corp.example.com.` → timeout
3. `agent.workspace.owner.coder.internal.company.com.` → timeout
On corporate networks where the search-domain-expanded queries hit DNS
infrastructure that drops rather than responds (common for nonsensical
hostnames with deep subdomain chains), each expanded query hits the full
DNS timeout (default 5s × 2 attempts = 10s per name). With 2-3 search
domains, this compounds to 20-30+ seconds of blocking.
## Fix
Adding a trailing dot marks the hostname as an FQDN. Go's `nameList()`
in `src/net/dnsclient_unix.go` returns a single-entry list for rooted
names, completely bypassing search domain expansion.
This is consistent with how `IsCoderConnectRunning` already handles its
DNS check — `tailnet.IsCoderConnectEnabledFmtString` includes a trailing
dot for exactly this reason.
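A minimal sketch of the trailing-dot normalization described above. The helper name `ensureFQDN` is hypothetical and is not taken from the Coder codebase; it only illustrates marking a hostname as rooted before handing it to a resolver.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureFQDN appends a trailing dot so a resolver treats the name as
// rooted and skips search-domain expansion. Hypothetical helper.
func ensureFQDN(host string) string {
	if host == "" || strings.HasSuffix(host, ".") {
		return host
	}
	return host + "."
}

func main() {
	fmt.Println(ensureFQDN("agent.workspace.owner.coder"))
}
```

The function is idempotent, so callers that already pass a rooted name are unaffected.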
## Verification
Tested with a fake DNS server that responds with NXDOMAIN for `.coder`
queries but drops search-domain-expanded queries:
| Hostname | Time | Queries sent |
|---|---|---|
| `main.workstation.kevin.coder` (no trailing dot) | **~15s** | 4 (as-is + 3 search domains) |
| `main.workstation.kevin.coder.` (trailing dot) | **<1ms** | 1 (FQDN only) |
Closes https://github.com/coder/coder/issues/22581
_Generated by [mux](https://github.com/coder/mux) but reviewed by a
human_
## Problem
The chat input loses focus after sending a message. Users have to click
back into the input field to type their next message.
## Root Cause
When a message is sent:
1. `handleSubmit` in `AgentChatInput` calls `onSend(text)` then
immediately `focus()` — but this is premature
2. The async send sets `isLoading=true`, which disables the editor via
`EditableStatePlugin`
3. When the send resolves, `handleSendFromInput` calls `clear()` and
`focus()` — but the editor may still be disabled at this point (React
hasn't re-rendered yet)
4. When React re-renders with `isLoading=false`, the editor becomes
editable again but nobody restores focus
## Fix
Added a `useEffect` in `AgentChatInput` that watches for `isLoading`
transitioning from `true` to `false` and calls `focus()` on the editor.
This ensures focus is restored *after* React has re-enabled the editor,
not prematurely.
## Test
Added a test in `AgentDetail.test.ts` verifying that `focus()` is called
on the input ref after `handleSendFromInput` resolves.
## Problem
During rolling deploys, the chat stream WebSocket disconnects and the
user sees **"Chat stream disconnected."** permanently with no recovery
other than a full page refresh.
## Changes
Add automatic WebSocket reconnection with capped exponential backoff (1s
→ 2s → 4s → … → 10s max) to the chat stream `useEffect` in
`ChatContext.ts`.
**`ChatContext.ts`**
- Wrap socket creation in a `connect()` function that can be retried on
disconnect.
- On `error` or `close`, schedule a reconnect with exponential backoff
via `scheduleReconnect()`.
- Guard against double scheduling when both `error` and `close` fire for
the same disconnect (`disconnected` flag per connection).
- On successful reconnect (`open` event), reset backoff and clear
`streamError`.
- Pass the latest `lastMessageIdRef` on each reconnect so the server
replays only unseen durable messages via `after_id`.
- Add `RECONNECT_BASE_MS` (1s) and `RECONNECT_MAX_MS` (10s) constants.
**`ChatContext.test.tsx`**
- Extend mock socket with `emitOpen()` and `emitClose()` helpers.
- Update existing disconnect tests for new reconnect behavior.
- Add **"sets streamError on WebSocket disconnect and reconnects"** —
verifies error banner appears, reconnect fires, and error clears on
open.
- Add **"uses exponential backoff on consecutive disconnects"** —
verifies increasing delays between reconnects.
- Add **"passes latest message ID on reconnect for catch-up"** —
verifies `after_id` is forwarded on each reconnect.
### User experience
- **Before**: "Chat stream disconnected." — permanent, requires page
refresh
- **After**: "Chat stream disconnected. Reconnecting…" — auto-recovers
within seconds
### Limitations
`message_part` events (streaming LLM tokens) are ephemeral / in-memory
only. If the processing replica dies mid-step, those partial tokens are
lost regardless of reconnect. The backend's stale-chat recovery will
re-run the step on a new replica; the reconnect ensures the client is
connected to see that happen.
Adds progressive web app support for the agents page so it can be
installed as a standalone app on mobile/desktop.
## Changes
- **`manifest.json`** — Web app manifest with `display: standalone`,
`start_url: /agents`, Coder theme colors
- **PWA icons** — 192x192, 512x512 PNGs + 180x180 apple-touch-icon,
rendered from the existing favicon SVG
- **`index.html`** — Added manifest link, apple-touch-icon, and mobile
web app meta tags (`apple-mobile-web-app-capable`,
`mobile-web-app-capable`, `apple-mobile-web-app-status-bar-style`,
title)
- **Service worker** — `notificationclick` now focuses an existing
agents tab or opens `/agents` in a new window
## Testing
1. Open `/agents` on a mobile device
2. Use browser "Add to Home Screen" / "Install App"
3. App should launch in standalone mode pointing at the agents page
4. Push notifications should navigate to the agents page on click
Polishes the right panel UI introduced in #22633:
<img width="3138" height="1596" alt="image"
src="https://github.com/user-attachments/assets/d3947db0-6600-4469-b7e2-6eb80aadb7bc"
/>
Over 2k lines of this is just the Seti font definition.
The file tree view isn't actually adjusting much, it's just a scroll
helper. Soon I'll add a comment system so users can leave agents
feedback directly from the code.
`make gen` could not run with `-j` because inter-target dependency edges
were missing. Multiple recipes compile `coderd/rbac` (which includes
generated files like `object_gen.go`), and without explicit ordering,
parallel runs produced syntax errors from mid-write reads.
Three main changes:
**Dependency graph fixes** declare the compile-time chain through
`coderd/rbac` so that `object_gen.go` is written before anything that
imports it is compiled. The DB generation targets use a GNU Make 4.3+
grouped target (`&:`) so Make knows `generate.sh` co-produces
`querier.go`, `unique_constraint.go`, `dbmetrics`, and `dbauthz` in a
single invocation. `SKIP_DUMP_SQL=1` avoids re-entrant `make` inside
`generate.sh` when the Makefile already guarantees `dump.sql` is fresh.
**`scripts/atomicwrite` package** replaces `os.WriteFile` in all gen
scripts with a temp-file-in-same-dir + rename pattern, preventing
interrupted runs from leaving partial files.
**`.PRECIOUS` and shell atomic writes** protect git-tracked generated
files from Make's default delete-on-error behavior. Since these files
are committed, deletion is worse than staleness -- `git restore` is the
recovery path.
CI now runs `make -j --output-sync -B gen` (~32s, down from ~85s
serial).
| Scenario | Before | After |
|-----------------------------------|--------------------|----------|
| `make gen` (serial) | 95s | 95s |
| `make -j gen` (parallel) | race error | **22s** |
| CI `make -j --output-sync -B gen` | forced serial ~85s | **~32s** |
Follow-up to #22630. Addresses [review
feedback](https://github.com/coder/coder/pull/22630#pullrequestreview-2953419963)
that was missed due to auto-merge.
## Changes
Replaces three `require.Eventually` calls with `testutil.Eventually` in
`TestInterruptChatDoesNotSendWebPushNotification`, linking the condition
to the existing test context (`ctx`) created on line 1194. This ensures
the test respects context cancellation instead of using a standalone
timeout/tick pattern.
The `create_workspace` tool waited for the workspace build to succeed
and the agent to become connectable, but did not wait for the agent's
startup scripts (e.g. git clone) to finish. This caused agents to
attempt file operations on repositories that hadn't been cloned yet.
Add a `waitForStartupScripts` step that polls the agent's `lifecycle_state`
via `GetWorkspaceAgentLifecycleStateByID` until it transitions out of
`created`/`starting` into a terminal state (`ready`, `start_error`, or
`start_timeout`). The tool now only returns success once the workspace is
fully initialized.
If the scripts fail or time out, the tool still returns (non-fatal) with
an appropriate `agent_status` so the model knows something went wrong.
Created using thingies (Opus 4.6 Max)
## Description
Documents the new TLS listener support for AI Bridge Proxy.
Updates `setup.md` with a new "Proxy TLS Configuration" section covering self-signed and corporate CA certificate setup, rewrites "Security Considerations" to reflect TLS as the recommended approach for encrypting client connections, and updates "Client Configuration" with `HTTPS_PROXY` defaults and combined certificate trust instructions.
Updates `copilot.md` to default all proxy URL examples to `https://`, add TLS certificate trust guidance for each client (CLI, VS Code, JetBrains), and document the MCP server trust store requirement for Copilot CLI.
Closes: https://github.com/coder/internal/issues/1335
## Description
Adds optional TLS support for the AI Bridge Proxy listener. When TLS cert and key files are provided, the proxy serves over HTTPS instead of plain HTTP.
## Changes
* New configuration options to enable TLS on the proxy listener
* Wraps the TCP listener in `tls.NewListener` when configured
* Tests for validation errors, invalid files, and full integration (tunneled + MITM) through a TLS listener
Note: Documentation for TLS listener setup and client configuration will be handled in a follow-up PR.
Related to: https://github.com/coder/internal/issues/1335
When a user interrupts a chat, the status transitions to `waiting` which
previously triggered an "Agent has finished running." web push
notification. This is incorrect — the user interrupted it themselves, so
no notification is needed.
## Changes
### `coderd/chatd/chatd.go`
- Added `wasInterrupted` flag alongside the existing `status` variable
- Set the flag when `ErrInterrupted` is detected in the error handler
- Added `!wasInterrupted` to the web push dispatch condition
### `coderd/chatd/chatd_test.go`
- Added `TestInterruptChatDoesNotSendWebPushNotification` that creates a
chat with a mock webpush dispatcher, processes it, interrupts it, and
verifies no push notification was dispatched
- Added `mockWebpushDispatcher` implementing the `webpush.Dispatcher`
interface
## Description
Renames internal fields, variables, and comments related to the proxy's certificate/key configuration to explicitly reference their MITM CA purpose.
The AI Bridge Proxy uses a CA certificate to sign dynamically generated leaf certificates during MITM interception of HTTPS traffic from AI clients. With the upcoming introduction of TLS listener certificates (for serving the proxy itself over HTTPS, implemented upstack https://github.com/coder/coder/pull/22411), the previous generic naming would become ambiguous. This refactor makes it clear which certificate is which.
No user-facing flags, environment variables, YAML keys, or JSON fields were changed. This is purely an internal rename to avoid confusion going forward.
Related to https://github.com/coder/internal/issues/1335
## Problem
When a browser connects to the chat stream via WebSocket, it
authenticates using cookies only — the native WebSocket API cannot set
custom headers like `Coder-Session-Token`. The relay between replicas
copies the original request's `Cookie` header but did **not** set the
`Coder-Session-Token` header as a fallback.
This causes a **401 on the worker replica** when `EnableHostPrefix` is
enabled, because the `HTTPCookies.Middleware` strips bare
`coder_session_token` cookies (expecting the `__Host-` prefix). Without
a `Coder-Session-Token` header fallback, `apiKeyMiddleware` finds no
valid credentials.
### Root Cause
The data flow:
1. Browser → subscriber replica: `Cookie:
__Host-coder_session_token=xxx` (browser sends prefixed cookie)
2. Subscriber's `HTTPCookies.Middleware` normalizes: `Cookie:
coder_session_token=xxx` (strips prefix)
3. `relayHeaders()` copies `Cookie: coder_session_token=xxx` to relay
request
4. Worker replica's `HTTPCookies.Middleware` sees bare
`coder_session_token` → **strips it** (expects `__Host-` prefix)
5. `apiKeyMiddleware` → `APITokenFromRequest`: no cookie, no header →
**401**
## Fix
Modified `relayHeaders()` to extract the session token value from the
`Cookie` header and set it as the `Coder-Session-Token` header when no
explicit session token header is already present. The header is never
stripped by middleware, so the worker replica can always authenticate.
## Testing
- **`TestRelayHeaders`**: Unit tests for the updated `relayHeaders()`
function covering all scenarios (cookie-only, header+cookie, no auth,
nil source)
- **`TestExtractSessionTokenFromCookieHeader`**: Unit tests for the
helper function
- **`TestChatStreamRelay/RelayCookieOnlyAuth`**: Integration test with
plain HTTP, cookie-only WebSocket auth
- **`TestChatStreamRelay/RelayCookieOnlyAuthWithHostPrefix`**:
Integration test with `EnableHostPrefix=true`, confirming the 401 is
fixed
- **`cookieOnlySessionTokenProvider`**: Test helper that simulates
browser WebSocket behavior (sets Cookie header only on WebSocket dials,
no custom headers)
## Files Changed
- `enterprise/coderd/chatd/chatd.go` — `relayHeaders()` fix +
`extractSessionTokenFromCookieHeader()` helper
- `enterprise/coderd/chatd/relay_headers_internal_test.go` — unit tests
(new file)
- `enterprise/coderd/chats_test.go` — integration tests + test helper
type
Closes #22065
This pull request ensures that loading the `<WorkspacePage />` no longer
generates an `apiKey` immediately on every load. A key is now generated
only when the user actually clicks the VSCode link, and the request is
now a mutation, which is the correct operation for this action.
## Problem
`TestWorkspaceProvisionerdServerMetrics` flakes because metric
assertions run immediately after
`AwaitWorkspaceBuildJobCompleted` returns, but metrics are updated
**asynchronously after the
DB transaction commits** in `completeWorkspaceBuildJob`.
The timeline in the provisioner server:
1. DB transaction commits (`provisionerdserver.go:~2362`) — job marked
completed
2. Audit logging, notifications, DB queries (`~2370-2427`)
3. **Metric `.Observe()`** (`~2463`) — happens ~100 lines later
The test synchronization (`AwaitWorkspaceBuildJobCompleted`) polls for
`CompletedAt != nil`,
which fires at step 1. The metric assertion then executes before step 3,
causing the flake.
## Fix
Wrap all three metric assertions (prebuild creation, prebuild claim,
regular workspace
creation) in `require.Eventually` to poll until the metric appears, then
assert on the value.
## Test
- `go test -run TestWorkspaceProvisionerdServerMetrics -count=5` — all
pass
- `go test -race -run TestWorkspaceProvisionerdServerMetrics -count=1` —
clean
## Problem
Three bugs with chat summarization (compaction) share a single root
cause: `ReloadMessages` was never wired up in the production
`chatloop.Run()` call.
### Bug 1: Compaction never fires between steps
The inline compaction guard in `chatloop.go` requires both `Compaction`
and `ReloadMessages` to be non-nil:
```go
if opts.Compaction != nil && opts.ReloadMessages != nil {
```
Since `ReloadMessages` was only set in tests, inline compaction was
**dead code in production**. Long multi-step turns could blow through
the context window.
### Bug 2: Compaction only occurs at end of turn
The post-run safety net doesn't check `ReloadMessages`, so it was the
only compaction path that fired:
```go
if !alreadyCompacted && opts.Compaction != nil { // no ReloadMessages check
```
This meant compaction only happened once, after the entire agent turn
finished.
### Bug 3: Agent stops after summarization
After post-run compaction, `Run()` unconditionally returned `nil`.
`processChat` then set the chat status to `waiting` (done). The agent
never had a chance to continue with its fresh summarized context.
## Fix
1. **Wire up `ReloadMessages`** in `chatd.go`: reloads persisted
messages from the database and re-applies system prompts (subagent
instruction, workspace AGENTS.md).
2. **Wrap the step loop in an outer compaction loop**: when compaction
fires on the model's final step (`compactedOnFinalStep`), reload
messages and `continue` the outer loop so the agent re-enters with
summarized context.
3. **Track `compactedOnFinalStep`** to distinguish inline compaction on
the last step (needs re-entry) from inline compaction mid-loop followed
by more tool-call steps (agent already consumed the compacted context,
no re-entry needed).
4. **Add `maxCompactionRetries = 3`** to prevent infinite compaction
loops.
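The control flow in steps 2 to 4 can be sketched as follows. This is a simplified model, not the real `chatloop.Run()`: `stepResult`, `runLoop`, and the `step`/`reload` callbacks are hypothetical, and the exact retry accounting is an assumption.

```go
package main

import "fmt"

// maxCompactionRetries mirrors the limit named in the description.
const maxCompactionRetries = 3

// stepResult is a hypothetical summary of one model step.
type stepResult struct {
	toolCalls bool // the model asked for tools, so another step follows
	compacted bool // inline compaction fired during this step
}

// runLoop returns how many times the step loop was entered.
func runLoop(step func() stepResult, reload func()) int {
	entries := 0
	for retries := 0; ; retries++ {
		entries++
		compactedOnFinalStep := false
		for {
			res := step()
			if res.toolCalls {
				// Mid-loop compaction: the next step already consumes
				// the summarized context, so no re-entry is needed.
				continue
			}
			compactedOnFinalStep = res.compacted
			break
		}
		if !compactedOnFinalStep || retries >= maxCompactionRetries {
			return entries
		}
		reload() // re-enter the step loop with summarized context
	}
}

func main() {
	script := []stepResult{{toolCalls: true}, {compacted: true}, {}}
	i := 0
	entries := runLoop(func() stepResult { r := script[i]; i++; return r }, func() {})
	fmt.Println("step loop entered", entries, "times")
}
```

The retry cap bounds the outer loop even if every final step triggers compaction again.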
## Testing
- All 7 existing compaction tests pass unchanged.
- Added `PostRunCompactionReEntersStepLoop` test: verifies that when a
text-only response triggers compaction, the outer loop re-enters and the
agent makes a second stream call with fresh context.
## Problem
In multi-replica Coder deployments, the chat relay WebSocket between
replicas fails with HTTP 401 (or TLS handshake errors). The subscriber
replica cannot relay `message_part` events from the worker replica.
**Root cause:** `codersdk.Client.Dial()` does not pass `c.HTTPClient` to
`websocket.DialOptions.HTTPClient`. The websocket library
(`github.com/coder/websocket`) falls back to `http.DefaultClient`, which
lacks the mesh TLS configuration needed for inter-replica communication.
The relay code in `enterprise/coderd/chatd/chatd.go` correctly sets
`sdkClient.HTTPClient = cfg.ReplicaHTTPClient` (which has mesh TLS
certs), but that client was never used for the actual WebSocket
handshake.
## Fix
One-line fix in `codersdk/client.go`: propagate `c.HTTPClient` to
`opts.HTTPClient` when the caller hasn't already set one.
## Test
Added `TestChatStreamRelay/RelayWithTLSAndCookieAuth` which:
- Sets up two replicas with TLS certificates (simulating mesh TLS in
production)
- Authenticates via cookies (simulating browser WebSocket behavior)
- Verifies `message_part` events relay across replicas over TLS
This test times out without the fix because the WebSocket handshake
fails with `x509: certificate signed by unknown authority`
(`http.DefaultClient` rejects self-signed certs).
## Related
Follow-up to #22635, which made the `redirectToAccessURL` middleware
skip the 307 redirect for relay requests. That fix changed the relay
failure from an HTTP 200 response to HTTP 401, exposing this deeper issue.
## Problem
Users hit this error when agent tool results contain Unicode null
characters:
```
persist step: insert tool result: pq: unsupported Unicode escape sequence
```
PostgreSQL's `jsonb` type rejects `\u0000` (Unicode null, U+0000) with
that error, even though it's valid JSON per RFC 8259. Tool results from
agents can contain this sequence — e.g. binary data, C-style strings, or
certain API responses.
## Root cause
`MarshalToolResult` and `MarshalContent` in `chatprompt.go` serialize
content blocks to JSON and pass them directly to `InsertChatMessage`
which casts to `::jsonb`. Go's `json.Marshal` / `json.Valid` accept
`\u0000`, but Postgres does not.
## Fix
Added `sanitizeJSONForPG()` which strips `\u0000` escape sequences from
serialized JSON before insertion. Uses `bytes.Contains` as a fast-path
check to avoid allocation when no null bytes are present (the common
case).
Applied to both `MarshalContent` (assistant messages) and
`MarshalToolResult` (tool result messages).
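One way to strip the escape is sketched below. It is not the PR's `sanitizeJSONForPG()`; `stripNullEscapes` is a hypothetical name, and the sketch assumes the input is valid JSON so that every backslash starts an escape. It walks escapes pairwise so an escaped backslash followed by the text `u0000` is left alone.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripNullEscapes removes \u0000 escape sequences from serialized JSON.
func stripNullEscapes(data []byte) []byte {
	nullEscape := []byte(`\u0000`)
	// Fast path: no allocation when the sequence is absent.
	if !bytes.Contains(data, nullEscape) {
		return data
	}
	out := make([]byte, 0, len(data))
	for i := 0; i < len(data); {
		if data[i] != '\\' {
			out = append(out, data[i])
			i++
			continue
		}
		if bytes.HasPrefix(data[i:], nullEscape) {
			i += len(nullEscape) // drop the null escape entirely
			continue
		}
		// Copy any other escape as a pair so `\\` is never split.
		out = append(out, data[i])
		i++
		if i < len(data) {
			out = append(out, data[i])
			i++
		}
	}
	return out
}

func main() {
	fmt.Println(string(stripNullEscapes([]byte(`{"out":"a\u0000b"}`))))
}
```

A plain byte replacement would be shorter but would also rewrite an escaped backslash followed by a literal `u0000`, changing the decoded string.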
## Problem
`TestGetUserStatusCounts/OK_when_offset_is_provided_without_timezone`
fails intermittently in CI:
```
Error: Should be zero, but was 1
Test: TestGetUserStatusCounts/OK_when_offset_is_provided_without_timezone
```
## Root Cause
The `happyResponseCheck` asserts `count=0` for all 61 dates. The test
creates a first user, which inserts a `user_status_changes` row with
`new_status=active` and `changed_at=now()`.
The query computes its date range using the requested timezone/offset:
```go
nextHourInLoc = dbtime.Now().Truncate(time.Hour).Add(time.Hour).In(loc)
sixtyDaysAgo = dbtime.StartOfDay(nextHourInLoc).AddDate(0, 0, -60)
```
When the UTC time of day is earlier than the timezone offset (e.g. UTC
01:30 with offset `-2` means local time is 23:30 previous day),
`StartOfDay(nextHourInLoc)` rounds forward to start-of-today in the
target timezone, which is *after* the current UTC time. The last
`date_of_interest` in the SQL query ends up ahead of `now()` in UTC, so
the user's `changed_at` satisfies `changed_at <= date` — producing
`count=1` on the last date.
This happens ~8% of the time for offset `-2` (when UTC hour is 0 or 1)
and ~15% for `America/St_Johns` (UTC-3:30).
## Fix
Allow the last date entry to have count 0 or 1 (only 1 user exists)
while keeping all earlier dates strictly zero. This correctly accounts
for the timezone boundary without weakening the test's structural
validation.
## Summary
Fixes cross-replica chat relay failing with:
```
failed to open initial relay for chat stream
error= dial relay stream: - failed to WebSocket dial: expected handshake response status code 101 but got 200
failed to open relay for message parts
error= dial relay stream: - failed to WebSocket dial: expected handshake response status code 101 but got 200
```
Subscribers see accurate `status=running` (delivered via pubsub) but
miss all in-progress `message_part` events (delivered only via the relay
WebSocket that never connects).
## Root cause
`redirectToAccessURL` in `cli/server.go` redirects any request whose
`Host` header doesn't match the access URL. The enterprise chat relay
dials another replica directly via its DERP relay address (e.g.
`http://10.0.0.2:8080`), so the `Host` header is the pod IP — not the
access URL.
This triggers a **307 redirect** to the access URL. The WebSocket
library follows the redirect, but the second request is a plain GET —
`Connection: Upgrade` and `Upgrade: websocket` headers are **not carried
over** by HTTP redirect semantics. The load-balanced access URL routes
the plain GET to any replica, which serves the SPA catch-all handler and
returns **HTTP 200 with `index.html`**.
The WebSocket library then fails: `expected handshake response status
code 101 but got 200`.
DERP mesh already has an exemption for this exact scenario
(`isDERPPath`). Chat relay was added later and didn't get one.
## Fix
Bypass `redirectToAccessURL` for requests that carry the
`X-Coder-Relay-Source-Replica` header, which the enterprise relay
already sets on every request (`enterprise/coderd/chatd/chatd.go:573`).
## Sequence diagram
**Before (broken):**
```
Replica A (subscriber) Replica B (worker) Load Balancer
| | |
|--- WS dial pod-ip:8080 ----->| |
| |-- 307 redirect to LB --->|
| | |
|<----------- plain GET (no Upgrade headers) ------------->|
| | |-- routes to any replica
|<----------- 200 index.html -------------------------------|
| |
X 'expected 101 but got 200' |
```
**After (fixed):**
```
Replica A (subscriber) Replica B (worker)
| |
|--- WS dial pod-ip:8080 ----->|
| (X-Coder-Relay-Source- |
| Replica header set) |
| |-- bypass redirect
|<--------- 101 Upgrade ------|
|<==== message_part events ====|
```
Replace the single-purpose DiffRightPanel with a generic RightPanel
component that supports tabs, drag-resize, drag-to-snap, and
drag-to-collapse-sidebar.
## Changes
- **New `RightPanel.tsx`**: generic tabbed panel with:
- Drag handle with pointer capture for smooth resizing
- Snap thresholds: drag past max → expand, drag below min → close
- Live sidebar collapse when dragging to the left viewport edge (and
reverses if dragged back)
- Persisted width via localStorage
- `onVisualExpandedChange` callback so parent syncs sibling visibility
during drag (not just on pointer-up)
- **Deleted `DiffRightPanel.tsx`**
- **Updated `AgentDetail.tsx`**: uses `RightPanel` with `tabContent`
record, tracks `dragVisualExpanded` for live chat section hiding
- **Updated `FilesChangedPanel.tsx`**: removed border/background (now
handled by RightPanel wrapper)
## Drag behavior
| Gesture | Effect |
|---|---|
| Drag left past 70vw + 80px | Snap to expanded (fullscreen within parent) |
| Drag right below 360px - 80px | Snap to closed |
| Drag to left viewport edge (<80px) | Collapse sidebar live |
| Drag back from left edge | Uncollapse sidebar live |
| Start expanded, drag right | Live resize back to normal |
Adds a new child page under **Coder Agents**
(`/docs/ai-coder/agents-architecture`) that explains how the agent in
the control plane communicates with workspaces.
## Core message
The Coder Agent interacts with workspaces using the exact same
connection path as a developer's IDE, web terminal, or SSH session — no
special protocol, no sidecar, no new ports.
## Summary
Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:
- **OSS** owns pubsub subscription, message catch-up, queue updates,
status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
reconnection, and stale-dial discarding for cross-replica streaming.
## Architecture
### OSS `coderd/chatd/chatd.go`
`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:
- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle
Key types:
- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes
### Enterprise `enterprise/coderd/chatd/chatd.go`
`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:
- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events
## Bug fixes
### Original bug: stale relay dial after interrupt
`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.
**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.
### Additional fixes
- `statusNotifications` send-on-closed-channel race — goroutine now owns
`close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
— now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing
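Several of the fixes above concern loops that spin once a channel is closed. The two-value receive with nil-out idiom can be sketched as follows; `forwardUntilDone` and its channel names are hypothetical and only illustrate the pattern.

```go
package main

import "fmt"

// forwardUntilDone collects notifications until done is closed. A closed
// channel is always ready to receive, so once it closes the variable is
// set to nil, which makes that select case block forever instead of
// spinning.
func forwardUntilDone(notifications <-chan string, done <-chan struct{}) []string {
	var seen []string
	for {
		select {
		case n, ok := <-notifications:
			if !ok {
				notifications = nil // disable this case
				continue
			}
			seen = append(seen, n)
		case <-done:
			return seen
		}
	}
}

func main() {
	notifications := make(chan string)
	done := make(chan struct{})
	result := make(chan []string, 1)
	go func() { result <- forwardUntilDone(notifications, done) }()
	notifications <- "running"
	notifications <- "waiting"
	close(notifications)
	close(done)
	fmt.Println(<-result)
}
```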
## Tests
9 enterprise tests covering:
- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)
All pass with `-race`.
Adds a `ProcessOutputTool` component that renders `process_output` tool
calls with a clean terminal-style output block instead of falling
through to the generic JSON renderer.
## Changes
**New file:** `ProcessOutputTool.tsx`
- Output shown directly with no header
- Copy button and status indicators float top-right on hover
- Collapsible output with the same expand/collapse chevron bar used by
`ExecuteTool`
- Exit code badge shown only for non-zero exits
- Spinner shown while process is still running
**Modified files:**
- `Tool.tsx` — `ProcessOutputRenderer` + registered in `toolRenderers`
map
- `ToolIcon.tsx` — `process_output` falls through to `TerminalIcon`
- `ToolLabel.tsx` — shows "Reading process output" label
## Changes
- **User dropdown → sidebar bottom**: Moved from the TopBar into the
sidebar footer with avatar + display name, whole row clickable to open
the dropdown menu
- **Diff stats inline badge**: Compact green/red pill badge next to the
chat title showing `+additions −deletions`, clickable to toggle the diff
panel
- **Reordered TopBar actions**: Ellipsis menu first, then drawer toggle
button on the far right
- **Notification bell scoped**: Removed from individual chat pages
(remains on `/agents` listing)
- **Cleanup**: Removed unused `signOut`/`buildInfo` destructuring from
AgentsPage
### Files changed
- `site/src/pages/AgentsPage/AgentDetail/TopBar.tsx`
- `site/src/pages/AgentsPage/AgentsPage.tsx`
- `site/src/pages/AgentsPage/AgentsSidebar.tsx`
<img width="1876" height="1597" alt="image"
src="https://github.com/user-attachments/assets/8ec33955-f8b4-4064-9767-19147951b3ff"
/>
relates to #21335
Modifies our taskstatus scaletest load generator to use the dRPC connection to mimic what an actual running Task would do via the MCP server (cf. the PRs below this one in the stack).
Disclosure: I used AI to generate large portions of this PR, but hand-reviewed and tweaked.
Previously, `WorkspaceBuildBuilder.doInTX()` inserted provisioner jobs
with empty tags and used a loop in `AcquireProvisionerJob` that could
match other tests' pending jobs when parallel tests share a database.
Add a unique tag (jobID -> "true") to each provisioner job at insert
time, then use that tag in `AcquireProvisionerJob` to target only the
correct job. This follows the same pattern used in `dbgen.ProvisionerJob`.
Closes coder/internal#1367
relates to #21335
Modifies our local MCP server used in Tasks to push task status updates over the agentsocket, rather than directly dialing Coderd. This will significantly reduce pressure on the database at scale because we can avoid expensive authentication of the agent API key.
Disclosure: I used AI to generate a lot of this PR, but hand-reviewed and tweaked it.
relates to #21335
Adds UpdateAppStatus on the agentsocket, wired up to forward to Coderd over the dRPC connection the agent maintains.
Disclosure: I used AI to generate significant portions of this PR, but hand-reviewed and tweaked the code. I consider it approximately indistinguishable from what I would have done by hand.
Fixes two bugs in the agents chat input:
1. **Remove queue message button next to stop button** — The send button
(which showed a ListPlusIcon during streaming) is now hidden when
streaming and not editing a queued message. Messages are still queued
via Enter key; only the visual button is removed. The stop button
remains.
2. **Clear localStorage draft on submit** — The `agents.empty-input`
localStorage key is now cleared synchronously in `handleSend` before the
async `onCreateChat` call. Previously, the draft was only cleared inside
the async `handleCreateChat` after `mutateAsync` resolved, allowing
Lexical editor change events to re-persist the draft during the async
gap.
## Problem
Flaky e2e test: `update workspace, new required, mutable parameter
added`
```
Error: Timed out 15000ms waiting for expect(locator).toHaveValue(expected)
Locator: getByTestId('parameter-field-Sixth parameter').locator('input')
Expected string: "99"
Received string: ""
```
## Root Cause
When the workspace parameters page loads, the WebSocket sends an initial
response with template defaults. For parameters with no default (like
`sixth_parameter`), the server returns `{valid: false, value: ""}`. On
first render, `useSyncFormParameters` sees this invalid server value and
overwrites the form's correctly-autofilled value ("99" from the previous
build) with "".
## Fix
When the server value is `{valid: false}`, preserve the current form
value instead of overwriting with "". This prevents the sync hook from
clobbering autofilled values before the server has had a chance to
process them.
## Verification
- TypeScript: zero type errors
- Biome lint: clean
- Unit tests: 2/2 passing
- **E2E soak test: 849/854 passed across 854 runs (99.5% pass rate)**
- 0 occurrences of the original flake (empty value on settings page)
- 5 residual failures are a separate pre-existing race in
`fillParameters` where user input is overwritten during the 500ms
debounce window
## Changes
- Removed the Coder Agents entry from the middle of the children array
in `docs/manifest.json`.
- Added the Coder Agents entry back at the end of the children array to
improve the organization of the documentation structure.
<img width="368" height="688" alt="image"
src="https://github.com/user-attachments/assets/3117acfd-8c8a-4522-84e7-a748a7596cc6"
/>
## Problem
Title generation uses the same model the user selected for chat. This
breaks when:
1. **Thinking/extended thinking models** — `ToolChoice: None` conflicts
with extended thinking on Anthropic. The bare call has no thinking
config, so provider-level defaults can conflict.
2. **Expensive models** — User picks `o3` or `claude-opus-4`, and a
trivial 8-word title generation burns through tokens/cost unnecessarily.
3. **Provider quirks** — Different providers have different constraints
around thinking mode + tool choice combinations.
## Solution
Modeled after how `coder/mux` handles this with
`NAME_GEN_PREFERRED_MODELS` + ordered candidate fallback:
### Phase 1: Candidate model list with fallback
- New `TitleModelFunc` type returns an ordered list of candidate models
- Tries `claude-haiku-4-5` → `gpt-4o-mini` → user's model
- Gracefully skips unavailable candidates (missing API key, provider not
configured)
- Falls back to the user's chat model as last resort
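The ordered fallback above can be sketched roughly as follows. This is an illustration only: `Candidate`, `titleFunc`, `firstTitle`, and the provider names are hypothetical, and the sketch assumes any error from a candidate means "try the next one", which may be broader than the real skip conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// Candidate is a hypothetical (provider, model) pair.
type Candidate struct {
	Provider string
	Model    string
}

var errUnavailable = errors.New("candidate unavailable")

// titleFunc stands in for the call that asks a model for a title.
type titleFunc func(c Candidate) (string, error)

// firstTitle tries candidates in order and returns the first title produced,
// along with the candidate that produced it.
func firstTitle(candidates []Candidate, generate titleFunc) (string, Candidate, error) {
	if len(candidates) == 0 {
		return "", Candidate{}, errors.New("no title model candidates")
	}
	var errs []error
	for _, c := range candidates {
		title, err := generate(c)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s/%s: %w", c.Provider, c.Model, err))
			continue
		}
		return title, c, nil
	}
	return "", Candidate{}, errors.Join(errs...)
}

func main() {
	candidates := []Candidate{
		{Provider: "anthropic", Model: "claude-haiku-4-5"},
		{Provider: "openai", Model: "gpt-4o-mini"},
		{Provider: "user", Model: "user-selected-model"}, // last resort
	}
	// Pretend the Anthropic provider has no API key configured.
	generate := func(c Candidate) (string, error) {
		if c.Provider == "anthropic" {
			return "", errUnavailable
		}
		return "Fix sidebar layout", nil
	}
	title, used, err := firstTitle(candidates, generate)
	fmt.Println(title, used.Model, err)
}
```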
### Phase 2: Provider-safe call options
- Removed `ToolChoice: None` which conflicts with extended thinking on
some providers
- Added `MaxOutputTokens: 256` to cap token usage
- Improved title prompt with verb-noun format guidance (`Fix sidebar
layout`, `Add user authentication`) and explicit
no-markdown/no-code-fences instructions
### Files changed
- `coderd/chatd/title.go` — Candidate loop, improved prompt, safe call
options
- `coderd/chatd/chatd.go` — Build `TitleModelFunc` closure with
lightweight candidates
## Summary
Subagent (child) chats were previously given access to workspace
provisioning tools (`list_templates`, `read_template`,
`create_workspace`), which could lead to uncontrolled resource
consumption. This PR moves those tools behind the same
`!chat.ParentChatID.Valid` gate that already protects the subagent tools
(`spawn_agent`, `wait_agent`, etc.).
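As a rough sketch of the gate, assuming a simplified `Chat` type: `NullID`, `toolsFor`, and `example_shared_tool` are hypothetical names introduced for illustration, while the gated tool names come from the description above.

```go
package main

import "fmt"

// NullID is a stand-in for a nullable ID column.
type NullID struct {
	ID    string
	Valid bool
}

// Chat is a simplified chat record; ParentChatID is set only for subagents.
type Chat struct {
	ID           string
	ParentChatID NullID
}

// toolsFor returns the tool names registered for a chat.
func toolsFor(chat Chat) []string {
	// Hypothetical tool available to every chat, root or child.
	tools := []string{"example_shared_tool"}
	if !chat.ParentChatID.Valid {
		// Root chats only: workspace provisioning and subagent tools.
		tools = append(tools,
			"list_templates", "read_template", "create_workspace",
			"spawn_agent", "wait_agent",
		)
	}
	return tools
}

func main() {
	root := Chat{ID: "root"}
	child := Chat{ID: "child", ParentChatID: NullID{ID: "root", Valid: true}}
	fmt.Println("root: ", toolsFor(root))
	fmt.Println("child:", toolsFor(child))
}
```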
## Changes
- **`coderd/chatd/chatd.go`**: Moved `list_templates`, `read_template`,
and `create_workspace` tool registration into the root-chat-only block
alongside subagent tools.
- **`coderd/chatd/chatd_test.go`**: Added
`TestSubagentChatExcludesWorkspaceProvisioningTools` — an E2E test that
spawns a subagent via a root chat and verifies the subagent's LLM call
does not include workspace provisioning or subagent tools.
- **`coderd/chatd/chattest/openai.go`**: Added `Tools` field to
`OpenAIRequest` and supporting `OpenAITool`/`OpenAIToolFunction` types
so tests can inspect which tools are sent to the model.
## Problem
The agents admin panel (`/agents` → Admin button) is rendered inside a
Radix Dialog (`ConfigureAgentsDialog`). Deleting a model or provider
previously opened a MUI `DeleteDialog` on top, creating a modal-on-modal
situation. The two dialog systems (Radix and MUI) don't coordinate focus
trapping, scroll locking, or backdrop behavior, so the delete
confirmation was broken.
## Solution
Replace the modal `DeleteDialog` in both `ModelForm` and `ProviderForm`
with an inline confirmation strip rendered in the footer area. Clicking
"Delete" now swaps the footer to show:
- A warning message ("Are you sure? This action is irreversible.")
- Cancel and a destructive confirm button with loading spinner
This keeps everything within the existing Radix Dialog content pane — no
layering issues, no second modal.
## Changes
| File | Change |
|---|---|
| `ModelForm.tsx` | Added `isDeleting` prop, changed `onDeleteModel`
signature to async, added `confirmingDelete` state, inline confirmation
footer |
| `ProviderForm.tsx` | Removed `DeleteDialog` import/usage, replaced
with inline confirmation footer |
| `ModelsSection.tsx` | Removed `DeleteDialog` import/usage, removed
`modelToDelete` state, passes new props to `ModelForm` |
Adds a new documentation page at `docs/ai-coder/agents.md` describing
Coder Agents — the built-in chat interface, API, and lightweight AI
coding agent that runs in the Coder control plane.
## What's included
- Overview of what Coder Agents is and who it's for (regulated
industries, platform teams, existing Coder deployments)
- How the architecture works (agent loop in coderd, outbound to LLM
providers, connects to workspaces via existing daemon connection)
- Key features: automatic template/workspace selection, sub-agents, chat
persistence, message queuing
- Security benefits of the control plane architecture (no API keys in
workspaces, simpler network boundaries, centralized enforced control,
user identity attached)
- LLM provider support table (verified against
`coderd/chatd/chatprovider/chatprovider.go`)
- Built-in tools reference
- Comparison to Coder Tasks
- Product status (internal preview, early access next)
## Summary
The macOS `.dylib` is only used by Coder Desktop macOS v0.7.2 or older.
v0.7.2 was released in August 2025. v0.8.0 of Coder Desktop macOS, also
released in August 2025, uses a signed Coder slim binary from the
deployment instead.
It's unlikely customers will be using Coder Desktop macOS v0.7.2 and the
next release of Coder simultaneously, so I think we can safely remove
this process, given it slows down CI & release processes.
## Changes
- **Makefile**: Remove `DYLIB_ARCHES`, `CODER_DYLIBS` variables and
`build/coder-dylib` target
- **scripts/build_go.sh**: Remove `--dylib` flag and all dylib-specific
logic (c-shared buildmode, CGO, plist embedding, vpn/dylib entrypoint)
- **scripts/sign_darwin.sh**: Remove dylib-specific comment
- **CI (ci.yaml)**: Remove `build-dylib` job, artifact download/insert
steps, and `build-dylib` dependency from `build` job
- **Release (release.yaml)**: Remove `build-dylib` job, artifact
download/insert steps, and `build-dylib` dependency from `release` job
- **vpn/dylib/**: Delete entire directory (`lib.go` + `info.plist.tmpl`)
- **vpn/router.go, vpn/dns.go**: Clean up comments referencing dylib
The slim and fat binary builds are completely unaffected — the dylib was
an independent build target with its own CI job.
_Generated by mux but reviewed by a human_
## Problem
On the `/agents/:agentId` detail page, text typed into the chat input
was lost when navigating away and returning. The empty-state page
(`/agents`) already persisted drafts via `localStorage`, but individual
conversation pages did not.
## Solution
Adds per-conversation draft persistence to `useConversationEditingState`
in `AgentDetail.tsx`, following the same patterns used elsewhere in the
agents page:
- Drafts are stored under `agents.draft-input.<chatID>` keys
- The saved draft is read as the editor's initial value on mount
- `localStorage` is updated on every content change
- The key is removed when the input is cleared or a message is sent
successfully
Fixes coder/internal#642
We recently fixed Windows-specific flakes for this test and re-enabled
it. It then failed intermittently due to context deadline expiration.
The temporary path created on Windows contained invalid characters. This
resulted in a silent startup script failure on Windows. The test then
fruitlessly waited until context expiration. The test now uses a valid
path on Windows.
## Problem
Two bugs in the agents chat flow:
1. **Optimistic rendering glitch**: When sending a message while the
agent is busy, a fake message with a negative ID appears in the
timeline, then gets rolled back to the queued state. This causes a
jarring flash.
2. **Auto-promoted messages not appearing**: When the server
auto-promotes a queued message after finishing a task, the promoted user
message doesn't show up in the timeline until the LLM finishes its
response.
## Root Causes
**Bug 1**: The optimistic rendering system injected placeholder messages
with `id: -Date.now()` into the store. When the server responded with
`queued: true`, the optimistic message was rolled back — but the user
had already seen it flash in the timeline.
**Bug 2**: In `processChat`'s deferred cleanup, the auto-promoted
message was published via `publishEvent()`, which only delivers to local
in-process stream subscribers. The SSE subscriber goroutine only
forwards `message_part` events from the local channel — it ignores
`message` events. Durable events reach the SSE client via pubsub → DB
read, but `publishEvent` doesn't trigger a pubsub notification. The
explicit `PromoteQueued` endpoint correctly used `publishMessage()`
(which does both), but the auto-promote path did not.
## Changes
### Frontend (`site/`)
- **AgentDetail.tsx**: Remove optimistic message injection from send and
edit flows. Instead, use the `CreateChatMessageResponse.message` from
the POST response to insert the real server message into the store
immediately.
- **ChatContext.ts**: Remove the negative-ID cleanup logic from
`upsertDurableMessage` that stripped optimistic placeholders when real
messages arrived.
- **chatStore.test.ts**: Remove 2 tests for negative-ID optimistic
message behavior.
### Backend (`coderd/chatd/`)
- **chatd.go**: In `processChat` cleanup, replace `publishEvent()` with
`publishMessage()` for auto-promoted messages. This ensures the pubsub
notification (`AfterMessageID`) is sent, so SSE subscribers read the new
message from the DB immediately.
I'm having a hard time reproducing [this
Heisenbug](https://github.com/coder/internal/issues/1154) in PR CI, but
it seems to happen pretty often on main, so I would like to add some
logging for a few more page events to the ones we already have.
closes https://github.com/coder/internal/issues/464
# Summary
This PR resolves a flaky test that was sensitive to DST transitions in
various time zones. The root of the flake was:
* a bug; the query and its tests assume 24 hours per day
* the tests used local system time, which resulted in failures for dates
proximal to DST transitions
# Changes
Query:
The original query assumed 24 hour intervals between each day, which is
not a valid assumption. It now increments `1 day` at a time.
Database tests:
Database level tests for the query all assumed 24 hour days. They now
increment in DST-aware days instead. Instead of using time.Now() as a
base for testing, the test uses a series of dates over the course of an
entire year, to ensure that DST transition dates are present in every
test run.
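The difference between adding 24 hours and adding one calendar day can be illustrated in Go. This is an analogy for the SQL change, not the query itself; the zone and date are chosen only because the US spring-forward transition in 2024 fell on March 10.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database so LoadLocation works anywhere
)

const layout = "2006-01-02 15:04 MST"

// compare steps forward from local midnight on a spring-forward date, once by
// a fixed 24 hours and once by one calendar day.
func compare() (byHours, byDay string, err error) {
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		return "", "", err
	}
	start := time.Date(2024, time.March, 10, 0, 0, 0, 0, loc)
	byHours = start.Add(24 * time.Hour).Format(layout)
	byDay = start.AddDate(0, 0, 1).Format(layout)
	return byHours, byDay, nil
}

func main() {
	byHours, byDay, err := compare()
	if err != nil {
		panic(err)
	}
	fmt.Println("+24h:  ", byHours) // 2024-03-11 01:00 EDT
	fmt.Println("+1 day:", byDay)   // 2024-03-11 00:00 EDT
}
```

The 24-hour step drifts off the day boundary by an hour, which is the kind of error that only appears for dates near a DST transition.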
# API Endpoint
The endpoint that delivers the user status chart now accepts an IANA
timezone name as a parameter and passes it to the database query,
keeping the existing offset as a fallback.
API level tests were added to ensure the correct response form and error
behaviour. Correctness of content is tested at the database level.
## Summary
Replace hand-coded per-provider field components, form state types,
validation schemas, and builder functions with generic schema-driven
code that reads from the auto-generated
`chatModelOptionsGenerated.json`.
## Changes
### `ModelConfigFields.tsx` (492 → 341 lines)
- Remove 6 per-provider components (`OpenAIFields`, `AnthropicFields`,
`GoogleFields`, `OpenAICompatFields`, `OpenRouterFields`,
`VercelFields`)
- Remove exported option arrays (`modelConfigReasoningEffortOptions`,
etc.)
- Add `renderSchemaField()` that dispatches to
`InputField`/`SelectField`/`JSONField` based on `field.input_type` from
the generated schema
- `ModelConfigFields` now calls `getVisibleProviderFields()` instead of
a switch statement
- `GeneralModelConfigFields` now calls `getVisibleGeneralFields()`
instead of hard-coding 6 InputField instances
### `modelConfigFormLogic.ts` (742 → 525 lines)
- Remove 6 per-provider form state types and empty defaults
- Remove 6 per-provider Yup validation schemas
- Remove 6 per-provider builder functions (`buildOpenAIOptions`, etc.)
- Remove 2 switch-case dispatch blocks (validation + build)
- Add `buildEmptyProviderState()` that walks schema fields to create
empty form state
- Add schema-driven `extractModelConfigFormState()` and
`buildModelConfigFromForm()`
- Add `yupTestForField()` + `buildYupSchema()` generating Yup validation
from field metadata
- Lazy-cache per-provider Yup schemas for performance
### `modelConfigFormLogic.test.ts`
- All 83 tests updated for the new nested state shape
- Uses `toContain` for error message assertions since labels now come
from schema descriptions
## Motivation
The auto-generated schema (`chatModelOptionsGenerated.json`) was merged
in #22568 but not yet consumed by the UI. This PR wires it up so that
when a new provider or field is added in Go (`codersdk/chats.go`),
running `make gen` regenerates the JSON schema and the UI automatically
picks up the new fields — no manual TypeScript changes needed.
**Production code reduced from 1234 to 866 lines (-30%).**
## Problem
The subscribe flow in `useWebpushNotifications` called
`pushManager.subscribe()` without first requesting the `Notification`
permission. When the browser permission state is `"denied"` (e.g. from a
previous prompt dismissal), the browser throws:
```
DOMException: Registration failed - permission denied
```
This surfaced as a confusing error toast on the agents page. The error
has nothing to do with Coder RBAC roles — it's the browser denying the
push subscription because notification permission was previously
declined. An admin who had granted browser permission wouldn't see this;
a user who previously dismissed or denied the prompt would.
## Fix
Added an explicit `Notification.requestPermission()` call before
`pushManager.subscribe()`. This:
1. **Re-prompts** the user if the permission state is `"default"` (not
yet decided)
2. **Throws a clear, actionable error** if the permission is `"denied"`:
*"Notifications are blocked by your browser. Please allow notifications
for this site in your browser settings."*
3. **Only proceeds** to `pushManager.subscribe()` after permission is
confirmed as `"granted"`
## Tests
New test file `useWebpushNotifications.jest.ts`:
- **requests notification permission before subscribing** — verifies
`requestPermission()` is called before `pushManager.subscribe()`
- **throws a clear error when permission is denied** — verifies the
user-friendly error message
- **does not call pushManager.subscribe when permission is denied** —
verifies we bail out early
On the `/agents` page, the "View Workspace" link in the header dropdown
menu was navigating in the same tab via `navigate()`. This changes it to
`window.open(workspaceRoute, "_blank")` so it opens in a new browser
window/tab instead.
It's frustrating when I want to view my workspace and then I have to go
back and find my chat.
## Problem
There is a race condition in the chat stream reconnect path. When a
client connects (or reconnects) to `/stream`, sometimes they only see a
`status: running` event but never receive any `message_part` events —
the stream appears stuck.
## Root Cause
In `processChat`, the sequence is:
1. `publishStatus(running)` — broadcasts `status: running` to all
subscribers and via pubsub.
2. `runChat()` is called.
3. Inside `runChat`, there's significant setup work (model resolution,
DB queries, title generation, prompt building, instruction resolution).
4. Only **after** all that setup does `runChat` set `buffering = true`
on the stream state.
If a client connects to `/stream` between steps 1 and 4:
- `Subscribe()` reads `chat.Status == running` from the DB, so it
includes `status: running` in the snapshot.
- But `buffering` is still `false`, so `subscribeToStream` returns an
**empty** local snapshot (no message_parts).
- `publishToStream` **drops** all `message_part` events when `buffering`
is false.
- Result: client sees `running` but never gets any streaming content.
## Fix
Move the `buffering = true` setup (and its deferred cleanup) from
`runChat` into `processChat`, right before `publishStatus(running)`.
This guarantees the buffer is active before any subscriber can observe
`status: running`, so:
- The snapshot always includes any in-flight `message_part` events.
- `publishToStream` never drops parts because buffering is already on.
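A simplified model of the ordering issue, assuming a stream state that drops parts while buffering is off. The `streamState` type and its methods are hypothetical, and `processChat` here is a toy with a `bufferFirst` switch to contrast the old and new orderings.

```go
package main

import (
	"fmt"
	"sync"
)

// streamState is a toy version of per-chat stream state.
type streamState struct {
	mu        sync.Mutex
	buffering bool
	parts     []string
}

func (s *streamState) startBuffering() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.buffering = true
}

// publishPart records a part only while buffering is on and reports whether
// the part was kept.
func (s *streamState) publishPart(part string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.buffering {
		return false
	}
	s.parts = append(s.parts, part)
	return true
}

// snapshot returns a copy of what a newly connected subscriber would replay.
func (s *streamState) snapshot() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.parts...)
}

// processChat contrasts the two orderings. With bufferFirst, buffering is on
// before the point where "status: running" would become observable.
func processChat(s *streamState, bufferFirst bool) []string {
	if bufferFirst {
		s.startBuffering()
	}
	// The running status would be published here.
	s.publishPart("early part")
	if !bufferFirst {
		s.startBuffering()
	}
	s.publishPart("later part")
	return s.snapshot()
}

func main() {
	fmt.Println("old order:", processChat(&streamState{}, false))
	fmt.Println("new order:", processChat(&streamState{}, true))
}
```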
Despite the SDK type having an `Archived` field for chats, this data was
never fetched from the database — the `GetChatsByOwnerID` query
hardcoded `AND archived = false`, and the `convertChat` function never
mapped the field.
This PR adds an optional `archived` query parameter to `GET
/api/experimental/chats`:
| Value | Behavior |
|-------|----------|
| *(not provided)* | Returns all chats (active and archived) |
| `archived=false` | Returns only non-archived chats |
| `archived=true` | Returns only archived chats |
This follows the same pattern used by template versions
(`sqlc.narg('archived')` nullable boolean).
Also fixes `convertChat` to populate the `Archived` field in API
responses, which was never being set despite existing on the SDK type.
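The three-state filter in the table can be sketched with `sql.NullBool`, where an absent parameter maps to an invalid (NULL) value. `parseArchived` and `filterChats` are hypothetical helpers that filter in memory; the real filtering happens in the SQL query.

```go
package main

import (
	"database/sql"
	"fmt"
	"strconv"
)

// Chat is a simplified chat record.
type Chat struct {
	Title    string
	Archived bool
}

// parseArchived maps the raw query parameter to a nullable boolean.
// An empty string means the parameter was not provided.
func parseArchived(raw string) (sql.NullBool, error) {
	if raw == "" {
		return sql.NullBool{}, nil
	}
	b, err := strconv.ParseBool(raw)
	if err != nil {
		return sql.NullBool{}, fmt.Errorf("invalid archived value %q: %w", raw, err)
	}
	return sql.NullBool{Bool: b, Valid: true}, nil
}

// filterChats keeps every chat when archived is NULL, otherwise only chats
// whose Archived flag matches.
func filterChats(chats []Chat, archived sql.NullBool) []Chat {
	var out []Chat
	for _, c := range chats {
		if !archived.Valid || c.Archived == archived.Bool {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chats := []Chat{{Title: "active", Archived: false}, {Title: "old", Archived: true}}
	for _, raw := range []string{"", "false", "true"} {
		filter, err := parseArchived(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("archived=%q -> %d chat(s)\n", raw, len(filterChats(chats, filter)))
	}
}
```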
The search input was removed from `AgentsSidebar` but the
`ArchivedAgentsSearchAutoExpands` story still referenced the `Search
agents...` placeholder, causing the Storybook interaction test to fail:
```
within(<div#storybook-root>).getByPlaceholderText("Search agents...")
Unable to find an element with the placeholder text of: Search agents...
```
This PR removes the stale story.
This is an attempt to address coder/internal#1154
Tests appear to fail often on `verifyParameters`, which asserts input
visibility and value in series for all expected parameters. This change
makes the same assertions in parallel, hopefully completing before
timeout.
Add Prometheus metrics to the boundary log proxy for observability:
- batches_dropped_total (reason: buffer_full, forward_failed)
- logs_dropped_total (reason: buffer_full, forward_failed,
boundary_channel_full, boundary_batch_full)
- batches_forwarded_total
Also add BoundaryStatus to the BoundaryMessage envelope so boundary
can report dropped log counts as a separate wire message. The agent
records these as Prometheus metrics, making boundary-side data loss
visible. Backwards compatibility for older versions of boundary is maintained.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds database columns and server-side logic to track interception lineage via tool call IDs. When an interception ends, the server resolves the correlating tool call ID to find the parent interception and links them via `parent_id`.
New `provider_tool_call_id` column on `aibridge_tool_usages` and `parent_id` column on `aibridge_interceptions`, with indexes for lookup. `findParentInterceptionID` queries by tool call ID and filters out the current interception to find the parent.
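A rough in-memory sketch of the parent lookup, assuming tool usages are available as a slice. `ToolUsage` and `findParent` are hypothetical stand-ins for the `aibridge_tool_usages` rows and `findParentInterceptionID`, which runs against the database.

```go
package main

import "fmt"

// ToolUsage is a simplified tool usage row.
type ToolUsage struct {
	InterceptionID     string
	ProviderToolCallID string
}

// findParent returns the interception that recorded the given tool call ID,
// skipping the current interception so it cannot become its own parent.
func findParent(usages []ToolUsage, toolCallID, currentID string) (string, bool) {
	for _, u := range usages {
		if u.ProviderToolCallID == toolCallID && u.InterceptionID != currentID {
			return u.InterceptionID, true
		}
	}
	return "", false
}

func main() {
	usages := []ToolUsage{
		{InterceptionID: "child", ProviderToolCallID: "call-1"},
		{InterceptionID: "parent", ProviderToolCallID: "call-1"},
	}
	parentID, ok := findParent(usages, "call-1", "child")
	fmt.Println(parentID, ok)
}
```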
Adapted from the [coder/coder `dk/prompt_provenance_poc`](https://github.com/coder/coder/compare/main...dk/prompt_provenance_poc) branch.
Depends on [coder/aibridge#188](https://github.com/coder/aibridge/pull/188).
Closes https://github.com/coder/internal/issues/1334
Removes the search input and restyles the New Agent button in the agents
sidebar:
- Removed the search input box
- Replaced the outlined button with a subtle, left-aligned button
featuring a `SquarePenIcon`
- Button icon and text alignment matches the tree node items in the
sidebar
<img width="769" height="337" alt="image"
src="https://github.com/user-attachments/assets/2284c8c0-6294-4823-9ce0-5cc72b0d0054"
/>
relates to #21335
Enables the agent socket by default and updates docs to strike references to having to enable it.
The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token.
Disabling it by default was reasonable when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be on by default.
## Problem
The **Open in Cursor** and **Open in VS Code** buttons on the agent
detail page were broken. Clicking them did nothing.
### Root Cause
The `handleOpenInEditor` handler in `AgentDetail.tsx` called
`window.location.assign()` with a custom protocol URI (`vscode://` or
`cursor://`) **after** an `await API.getApiKey()` call. This creates an
async boundary that breaks the browser's user gesture chain, causing
custom protocol navigations (`vscode://`, `cursor://`) to be silently
blocked by the browser.
The handler was invoked from a Radix `DropdownMenuItem.onSelect`, which
adds another layer of event indirection that makes the gesture chain
more fragile.
In contrast, the workspace page's `VSCodeDesktopButton` works because it
uses a direct `onClick` handler on a button element.
## Fix
- **Eagerly fetch and cache the API key** via `useQuery` when workspace
and agent data is available
- **Make `handleOpenInEditor` synchronous** — it reads the cached key
instead of awaiting a network call, keeping `window.location.assign()`
within the original user gesture context
- **Disable buttons** while the API key is still loading
(`canOpenEditors` now gates on key availability)
- **Simplify** the `onOpenInEditor` callback (remove `void` async
wrapper)
Prebuilds need to be valid. Before this change, you could push a template
version whose preset would fail when creating a prebuild. This PR ensures
that all presets used for prebuilds are valid.
## Problem
Flaky test:
`TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica`
(coder/internal#1371)
The test intermittently fails because the chat ends up in `waiting`
status instead of `pending` after server shutdown.
## Root Cause
There is a race condition in `processChat` where `runChat` completes
successfully just as the server context is being canceled during
`Close()`. The sequence:
1. Server calls `Close()`, canceling the server context.
2. The LLM HTTP response has already been fully written by the mock
server (the stream closes normally before context cancellation
propagates to the HTTP client).
3. `runChat` returns `nil` (success) instead of `context.Canceled`.
4. The existing `isShutdownCancellation` check only runs when `runChat`
returns an error, so the shutdown is not detected.
5. `processChat`'s deferred cleanup marks the chat as `waiting` instead
of `pending`.
6. The test's assertion that the chat is `pending` never becomes true.
This race is timing-dependent — it only triggers when the mock server's
HTTP response completes in the narrow window between context
cancellation being initiated and it propagating through the HTTP
transport layer.
## Fix
Add a server context check after `runChat` returns successfully. If the
server is shutting down (`ctx.Err() != nil`), override the status to
`pending` so another replica can pick up the chat.
This is the same pattern already used for the error path
(`isShutdownCancellation`), extended to cover the success path.
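The decision can be sketched as a small function of the server context and the run result. The status strings `waiting` and `pending` come from the description above; `statusAfterRun`, `demo`, and the `error` status are hypothetical names used only for this illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

const (
	statusWaiting = "waiting"
	statusPending = "pending"
	statusError   = "error" // assumed name for the failure status
)

// statusAfterRun picks the chat status once the run has returned.
func statusAfterRun(serverCtx context.Context, runErr error) string {
	shuttingDown := serverCtx.Err() != nil
	switch {
	case runErr == nil && shuttingDown:
		// Success path during shutdown: hand the chat to another replica.
		return statusPending
	case runErr == nil:
		return statusWaiting
	case errors.Is(runErr, context.Canceled) && shuttingDown:
		// Error path during shutdown, which was already handled before.
		return statusPending
	default:
		return statusError
	}
}

// demo evaluates a successful run before and after the server context is
// canceled.
func demo() (running, shutdown string) {
	ctx, cancel := context.WithCancel(context.Background())
	running = statusAfterRun(ctx, nil)
	cancel()
	shutdown = statusAfterRun(ctx, nil)
	return running, shutdown
}

func main() {
	running, shutdown := demo()
	fmt.Println("server running:      ", running)
	fmt.Println("server shutting down:", shutdown)
}
```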
Extend the wire protocol for the boundary <-> agent unix socket with
a message envelope.
The envelope creates a boundary <-> agent data path that is separate
from the agent <-> coderd path. This lets boundary send operational
metadata (drop counts, configuration like jail type, capabilities)
that the agent can act on locally (e.g. Prometheus metrics) or use
to enrich outbound requests, without polluting the coderd-facing proto
with fields coderd never consumes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Removes `time.Sleep` calls in two test files by replacing them with
deterministic or event-driven alternatives.
### Changes
**`coderd/provisionerjobs_test.go`** (34.5s → 0.25s)
Replaced `time.Sleep(1500ms)` with a direct SQL `UPDATE` to bump
`created_at` by 2 seconds. The sleep existed purely to ensure different
timestamps for sort-order testing. The fix is deterministic and cannot
flake. Uses `NewDBWithSQLDB` (the test already required real Postgres
via `WithDumpOnFailure`).
**`coderd/database/pubsub/pubsub_test.go`** (2.05s → 1.3s)
Replaced `time.Sleep(1s)` with a `testutil.Eventually` retry loop that
publishes and checks for subscriber receipt. This is the idiomatic
pattern in the codebase. The old sleep waited for pq.Listener to
re-issue LISTEN after reconnect; the new code polls until it actually
works.
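The shape of such a retry loop, sketched without the test helpers: `eventually` below is a hypothetical stand-in and does not reproduce the signature of `testutil.Eventually`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// eventually calls cond immediately and then once per tick until it returns
// true or ctx is done.
func eventually(ctx context.Context, tick time.Duration, cond func() bool) bool {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		if cond() {
			return true
		}
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
		}
	}
}

// demo simulates a subscriber that only receives the third publish, as if
// LISTEN had not yet been re-issued for the first two.
func demo() (ok bool, attempts int) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	ok = eventually(ctx, 5*time.Millisecond, func() bool {
		attempts++ // stands in for "publish, then check for receipt"
		return attempts >= 3
	})
	return ok, attempts
}

func main() {
	ok, attempts := demo()
	fmt.Println(ok, attempts)
}
```

Unlike a fixed sleep, the loop returns as soon as the condition holds and fails only if the deadline passes first.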
## Summary
Change the four main `coderdtest` Await helper functions to poll at
`IntervalFast` (25ms) instead of `IntervalMedium` (250ms):
- `AwaitTemplateVersionJobCompleted`
- `AwaitWorkspaceBuildJobCompleted`
- `WorkspaceAgentWaiter.WaitFor`
- `WorkspaceAgentWaiter.Wait`
These are called **~855 times** across the test suite. Each call
previously wasted ~125ms on average waiting for the next poll tick.
`AwaitTemplateVersionJobRunning` already used `IntervalFast` — this
makes all Await helpers consistent.
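The ~125ms figure follows from assuming the awaited event lands uniformly at random within a poll interval, so the expected wait is half the interval. A small sketch of that arithmetic, with hypothetical helper names and the interval values and call count quoted above:

```go
package main

import (
	"fmt"
	"time"
)

const (
	intervalMedium = 250 * time.Millisecond
	intervalFast   = 25 * time.Millisecond
	awaitCalls     = 855
)

// expectedWait is the average delay until the next poll tick, assuming the
// awaited event is uniformly distributed within an interval.
func expectedWait(interval time.Duration) time.Duration {
	return interval / 2
}

// totalSaved estimates the cumulative idle time removed across all calls.
func totalSaved() time.Duration {
	return awaitCalls * (expectedWait(intervalMedium) - expectedWait(intervalFast))
}

func main() {
	fmt.Println("per call before:", expectedWait(intervalMedium))
	fmt.Println("per call after: ", expectedWait(intervalFast))
	fmt.Println("cumulative:     ", totalSaved())
}
```

This is cumulative idle time, roughly 96 seconds under these assumptions. It is not a wall-clock prediction, since tests run in parallel and each package's runtime depends on its own critical path.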
## Measured Impact
Local benchmarks (postgres, `-short -count=1 -p 8 -parallel 8
-tags=testsmallbatch`):
| Package | Before | After | Delta |
|---|---|---|---|
| enterprise/coderd | 90.8s | 76.0s | **-16.3%** |
| coderd | 65.6s | 56.5s | **-13.8%** |
| cli | 57.9s | 37.8s | **-34.7%** |
| enterprise (root) | 41.1s | 39.9s | -2.9% |
| **Sum of all packages** | **623s** | **543s** | **-12.8%** |
Zero test failures across all 199 packages.
The pause/resume endpoints were only registered under /api/experimental
but the frontend and Go SDK were calling /api/v2, resulting in 404s.
Register the routes in the v2 group, update the SDK client paths, and
fix swagger annotations (Accept → Produce) since these POST endpoints
have no request body.
This pull-request cleans up various issues I noticed while using the
`<Checkbox />` element.
* Margins were missing around the outsides of the `<Checkbox />`
components.
* Resolved the story of the `WithLabel` so things are lined up.
* Lowered the `borderRadius` down to `2px` (This can be cleaned up by
Tailwind 4 later).
* Refined checkbox styling across focus, disabled, checked, and
indeterminate states to be in line with Figma.
* Simplified checkbox indicator rendering and centered the icon with
absolute positioning.
- Unexposes port 5432 so it doesn't conflict with existing databases.
You can still hop into the DB if you need.
- Updates multiple CODER_ envs to sensible defaults, overrideable via env
This pull-request follows up #22060
It felt wrong to only make use of Geist when there is a Monospace variant
available too. It felt best to make this the default font, as it's in line
with the rest of the application. This also updates the lower line for
Workspace Statistics 🙂
This pull-request updates our icons to be in line with the Figma file.
They were slightly too small in the two variants of `--avatar-default`
and `--avatar-sm`. Now these are in line with what we have defined and
use the correct variants in the breadcrumbs.
Adds two new items to the agent chat TopBar dropdown menu:
- **Open Terminal**: opens the workspace web terminal in a new browser
window, reusing the existing `getTerminalHref`/`openAppInNewWindow`
infra.
- **Copy SSH Command**: copies the SSH command (e.g. `ssh
agent.workspace.owner.suffix`) to the clipboard with a toast
confirmation. Only shown when the deployment SSH hostname suffix is
configured.
Both items appear after a separator below the existing editor/workspace
actions.
## Changes
| File | What |
|---|---|
| `TopBar.tsx` | Added `Open Terminal` and `Copy SSH Command` dropdown
items with separator, `TerminalIcon`/`CopyIcon`, toast on copy |
| `AgentDetail.tsx` | Wired up `getTerminalHref`, `openAppInNewWindow`,
`deploymentSSHConfig` query, and passed new props to TopBar in all 3
render paths |
| `TopBar.stories.tsx` | Added new fields to default story props |
The Ubuntu Jammy `cargo` apt package provides Rust 1.75, which is too
old for transitive dependencies requiring edition 2024 (Rust 1.85+).
**Changes:**
- Replace apt `cargo` with a rustup-based install (stable channel,
minimal profile).
- Override `CARGO_HOME` to `/home/coder/.cargo` after `USER coder` so
cargo registry/cache writes go to the user's home (the rustup-installed
binaries remain on PATH via `/usr/local/cargo/bin`).
- Add `--fail` to all `curl` commands in the tool-download block so HTTP
errors fail fast with clear messages instead of silently piping error
pages into `tar`.
- Bump kube-linter 0.6.3 → 0.8.1 and trivy 0.41.0 → 0.69.2 (old releases
were removed from GitHub, causing persistent 404s).
Replace manual experiment checks in web-push handlers with the
`RequireExperimentWithDevBypass` middleware on the route group, matching
the pattern used by OAuth2, Agents, and MCP experiments.
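A sketch of what a route-group gate of this kind might look like with plain `net/http`. `requireExperimentOrDev` and `statusFor` are hypothetical; the real `RequireExperimentWithDevBypass` middleware has its own signature and response format, and the 403 status here is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireExperimentOrDev rejects requests unless the experiment is enabled
// or the binary is a dev build.
func requireExperimentOrDev(enabled, isDev bool) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !enabled && !isDev {
				// Assumed status code; the real middleware may differ.
				http.Error(w, "experiment not enabled", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// statusFor sends one request through the gate and returns the status code.
func statusFor(enabled, isDev bool) int {
	inner := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})
	handler := requireExperimentOrDev(enabled, isDev)(inner)
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/webpush", nil))
	return rec.Code
}

func main() {
	fmt.Println("experiment off, release build:", statusFor(false, false))
	fmt.Println("experiment on, release build: ", statusFor(true, false))
	fmt.Println("experiment off, dev build:    ", statusFor(false, true))
}
```

Putting the check on the route group means individual handlers no longer need to repeat it.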
## Changes
- **`coderd/coderd.go`**: Add `RequireExperimentWithDevBypass`
middleware to `/webpush` route group
- **`coderd/webpush.go`**: Remove inline
`api.Experiments.Enabled(codersdk.ExperimentWebPush)` checks from all
three handlers
- **`cli/server.go`**: Gate webpush dispatcher initialization with
`buildinfo.IsDev()` fallback so dev builds always init the real
dispatcher
- **`coderd/webpush_test.go`**: Remove experiment enablement from tests
(dev bypass handles it)
Net effect: -26 lines removed, +5 added.
Created using whatchamacallits (Opus 4.6 Max)
closes https://github.com/coder/internal/issues/642
This PR:
* re-enables `func TestReinitializeAgent(t *testing.T)`
* adjusts it to use a Windows-specific command in Windows environments
## problem
Fixes an issue where updates to docs resulted in docs links returning
HTTP 404, sometimes taking 4-12 hours before returning HTTP 200 (OK).
coder.com is deployed to Vercel from a separate Next.js repo, which has
no knowledge of when docs pages in this repo get updated.
### examples (non-exhaustive)
PR | 404 description
---|---
#19625 | URL for https://coder.com/docs/install/offline was updated to https://coder.com/docs/install/airgap, but the latter returned 404 for 3 hr 56 min after the PR was merged
#21434 | URLs https://coder.com/docs/ai-coder/nsjail and https://coder.com/docs/ai-coder/landjail were added, but both paths 404ed for 1 hr 30 min after the PR was merged. Note that these paths have changed since then, so don't be alarmed if clicking those links returns 404s while reviewing this PR
#21708 | URL https://coder.com/docs/ai-coder/boundary/agent-boundary was added, but it returned 404 for 1 hr 19 min after the PR was merged
## solution
All 3 PRs listed above modify manifest.json. This file is fetched during
coder.com's `getStaticPaths` for docs pages, defining which docs URLs
get statically generated at build time. In the latter 2 cases, the 404s
were resolved by manually triggering a redeploy of coder.com in the
Vercel dashboard.
The new CI workflow in this PR automatically triggers a Vercel deploy
hook ([see
docs](https://vercel.com/docs/deploy-hooks#triggering-a-deploy-hook))
with a POST request that runs whenever commits are pushed to main that
modify manifest.json. The deploy hook initiates a new build+deploy of
the coder.com Next.js app, which reruns `getStaticPaths`, updating docs
pages' URLs.
**Note:** I have not tested this workflow yet. I will verify that it
works after this PR is merged. I confirmed in a local terminal that the
webhook URL does successfully initiate a new Vercel build. I also tested
with a malformed URL and received error JSON output, so if the action
fails for some reason, we should see error output in the workflow logs
([example](https://github.com/coder/coder/actions/runs/22361453442/job/64722503802)).
## Problem
When the git askpass flow triggered diff status refreshes, it updated
**every chat** connected to the workspace. This was wasteful and could
cause confusing status updates on unrelated chats.
## Solution
Thread the chat ID through the entire git askpass flow so only the chat
that initiated the git operation gets updated:
1. **`coderd/chatd/chattool/execute.go`** — Sets `CODER_CHAT_ID` env var
on spawned processes (alongside the existing `CODER_CHAT_AGENT`)
2. **`cli/gitaskpass.go`** — Reads `CODER_CHAT_ID` from the environment
and sends it as a `chat_id` query parameter in the `ExternalAuthRequest`
3. **`codersdk/agentsdk/agentsdk.go`** — Adds `ChatID` field to
`ExternalAuthRequest` and encodes it as a query param
4. **`coderd/workspaceagents.go`** — Parses `chat_id` query param and
passes it through to `storeChatGitRef` and
`triggerWorkspaceChatDiffStatusRefresh`
5. **`coderd/chats.go`** — `storeChatGitRef` and
`refreshWorkspaceChatDiffStatuses` now scope updates to just the
initiating chat when a chat ID is provided, falling back to
all-workspace-chats behavior for backwards compatibility (non-chat git
operations)
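The two ends of this flow can be sketched as below, under stated assumptions: `externalAuthQuery` and `chatsToRefresh` are hypothetical helpers, and the `match` parameter and chat IDs are illustrative. In the real CLI the chat ID would come from `os.Getenv("CODER_CHAT_ID")`.

```go
package main

import (
	"fmt"
	"net/url"
)

// externalAuthQuery builds the query string for a hypothetical external-auth
// request. chat_id is only added when the process was spawned by a chat.
func externalAuthQuery(match, chatID string) string {
	q := url.Values{}
	q.Set("match", match)
	if chatID != "" {
		q.Set("chat_id", chatID)
	}
	return q.Encode()
}

// chatsToRefresh scopes the refresh to the initiating chat when one is
// known, and falls back to every chat on the workspace otherwise.
func chatsToRefresh(workspaceChats []string, chatID string) []string {
	if chatID != "" {
		return []string{chatID}
	}
	return workspaceChats
}

func main() {
	fmt.Println(externalAuthQuery("github.com", "chat-1"))
	fmt.Println(chatsToRefresh([]string{"chat-1", "chat-2"}, "chat-1"))
	fmt.Println(chatsToRefresh([]string{"chat-1", "chat-2"}, ""))
}
```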
Previously the sharing UI was hidden only under certain circumstances,
rather than based on the user's permissions. This change makes its
visibility permission-based and adjusts the backend so that the
permissions it relies on are correct.
## Problem
Subscribers connecting to a different replica than the one running the
chat see full messages appear but no streaming partials (`message_part`
events). The relay mechanism that forwards ephemeral parts across
replicas had several bugs.
## Root Causes
1. **`openRelay()` blocked the event loop** — The WebSocket dial (TCP +
TLS + HTTP upgrade) to the worker replica ran synchronously inside the
select loop. While dialing, no events could be processed, channels
filled up, and parts were silently dropped.
2. **Relay drops were permanent** — When the relay WebSocket closed
mid-stream, `relayParts` was set to nil and never reopened. No status
notification would re-trigger it since the chat was still running on the
same worker.
3. **`drainInitial` snapshot race** — The `default` case in the initial
drain loop caused the snapshot to be empty if the remote hadn't flushed
data yet (common immediately after WebSocket connect).
4. **Duplicate event delivery** — The `preloaded` slice caused snapshot
events to be sent both in the return value and re-sent through the
channel goroutine.
## Fixes
### `coderd/chatd/chatd.go` (Subscribe method)
- **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial
the remote replica. The result (channel + cancel func) is delivered on a
`relayReadyCh` channel that the select loop reads without blocking.
- **Relay reconnection**: When the relay channel closes, a 500ms timer
fires. The handler re-checks chat status from the DB and reopens the
relay if the chat is still running on a remote worker.
- **Snapshot parts via channel**: Relay snapshot + live parts are
wrapped into a single channel so they flow through the same path,
avoiding races with the select loop.
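A minimal sketch of the async-dial idea, assuming simplified types: `relayResult`, `subscribe`, and the `dial` callback are hypothetical, and the loop returns when the relay closes only to keep the demo finite (the change described above schedules a reconnect instead).

```go
package main

import "fmt"

// relayResult carries the outcome of a background dial.
type relayResult struct {
	parts <-chan string
	err   error
}

// openRelayAsync dials in a goroutine so the caller's select loop never
// blocks on the dial. The buffered channel lets the goroutine exit even
// if nobody reads the result.
func openRelayAsync(dial func() (<-chan string, error)) <-chan relayResult {
	ready := make(chan relayResult, 1)
	go func() {
		parts, err := dial()
		ready <- relayResult{parts: parts, err: err}
	}()
	return ready
}

// subscribe keeps handling local events while the dial is in flight.
func subscribe(events <-chan string, dial func() (<-chan string, error)) []string {
	var seen []string
	relayReady := openRelayAsync(dial)
	var relayParts <-chan string // nil until the dial completes; nil channels never fire
	for {
		select {
		case ev := <-events:
			seen = append(seen, "event:"+ev)
		case res := <-relayReady:
			relayReady = nil
			if res.err != nil {
				return seen
			}
			relayParts = res.parts
			seen = append(seen, "relay-ready")
		case part, ok := <-relayParts:
			if !ok {
				return seen // demo only: real code would schedule a reconnect
			}
			seen = append(seen, "part:"+part)
		}
	}
}

// demo delivers two events while the dial is still blocked, then lets the
// dial finish.
func demo() []string {
	events := make(chan string)
	release := make(chan struct{})
	dial := func() (<-chan string, error) {
		<-release // simulate a slow TCP + TLS + upgrade handshake
		parts := make(chan string, 1)
		parts <- "hello"
		close(parts)
		return parts, nil
	}
	done := make(chan []string, 1)
	go func() { done <- subscribe(events, dial) }()
	events <- "a"
	events <- "b"
	close(release)
	return <-done
}

func main() { fmt.Println(demo()) }
```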
### `enterprise/coderd/chats.go` (newRemotePartsProvider)
- **Timer-based drain**: Replaced `default` with a 1-second timer. After
the first event, `Reset(0)` switches to non-blocking drain for remaining
buffered events.
- **Remove preloaded duplication**: The goroutine now only forwards new
events; snapshot events are returned to the caller directly.
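The drain behavior can be approximated as follows. Note the swap: this sketch waits on a timer for the first event and then uses a plain `select` with `default` for the remaining buffered events, rather than the `Reset(0)` call described above, because that keeps the example deterministic. The function name and timeouts are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// drainInitial collects the initial snapshot. It blocks up to wait for the
// first event, then takes whatever else is already buffered without blocking.
func drainInitial(events <-chan string, wait time.Duration) []string {
	var snapshot []string
	timer := time.NewTimer(wait)
	defer timer.Stop()

	// Phase 1: give the remote time to flush its first event.
	select {
	case ev, ok := <-events:
		if !ok {
			return snapshot
		}
		snapshot = append(snapshot, ev)
	case <-timer.C:
		return snapshot
	}

	// Phase 2: non-blocking drain of anything already buffered.
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return snapshot
			}
			snapshot = append(snapshot, ev)
		default:
			return snapshot
		}
	}
}

// drainBuffered: three events are already waiting.
func drainBuffered() int {
	events := make(chan string, 3)
	events <- "a"
	events <- "b"
	events <- "c"
	return len(drainInitial(events, time.Second))
}

// drainLate: the first event arrives shortly after the drain starts, which
// is the case an immediate default branch would have missed.
func drainLate() int {
	events := make(chan string)
	go func() {
		time.Sleep(20 * time.Millisecond)
		events <- "late"
	}()
	return len(drainInitial(events, 2*time.Second))
}

// drainSilent: nothing ever arrives, so the timer ends the drain.
func drainSilent() int {
	return len(drainInitial(make(chan string), 20*time.Millisecond))
}

func main() { fmt.Println(drainBuffered(), drainLate(), drainSilent()) }
```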
## Testing
All existing tests pass:
- `TestInterruptChatBroadcastsStatusAcrossInstances`
- `TestSubscribeSnapshotIncludesStatusEvent`
- `TestSubscribeNoPubsubNoDuplicateMessageParts`
- `TestSubscribeAfterMessageID`
- `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`
When archiving a chat that has an attached workspace, a dialog now pops
up asking whether to also delete the associated workspace.
## Changes
### New file: `ArchiveAgentDialog.tsx`
A Radix-based dialog component that appears when archiving a chat that
has a `workspace_id`. It provides:
- A checkbox to opt into deleting the associated workspace
- **Cancel** — closes without archiving
- **Archive only** — archives the chat, leaves the workspace intact
- **Archive & Delete Workspace** — archives the chat and triggers
workspace deletion (enabled only when checkbox is checked)
### Modified: `AgentsPage.tsx`
- Extracted archive logic into a `performArchive` helper
- `requestArchiveAgent` now checks if the chat has a `workspace_id`:
- If yes, opens the `ArchiveAgentDialog`
- If no, proceeds with archiving directly (existing behavior)
- Added `handleArchiveOnly`, `handleArchiveAndDeleteWorkspace`, and
`handleCloseArchiveDialog` handlers
- Renders the `<ArchiveAgentDialog>` at the page level
Chats without a workspace are archived immediately as before — no UX
change for those.
## Problem
When a user sends a message while the agent is busy, the message appears
in the chat timeline as if it was sent and being processed (with the
"Thinking..." shimmer), instead of appearing in the queued messages list
above the input.
## Root Cause
`handleSend` in `AgentDetail.tsx` unconditionally injects an optimistic
user message into the conversation timeline and sets chat status to
`"pending"` **before** awaiting the server response. However, the server
can respond with `{ queued: true, queued_message: {...} }` (via
`CreateChatMessageResponse`) when the agent is already busy — meaning
the message was queued, not processed.
The client never inspected `response.queued` after the request
succeeded, so the optimistic message stayed in the timeline even though
the server queued it.
## Fix
After `sendMutation.mutateAsync(request)` resolves, check
`response.queued`. If true, roll back the optimistic message and restore
the previous chat status. The `queue_update` SSE event from the
WebSocket stream handles adding it to the queued messages list.
## Changes
- **`site/src/pages/AgentsPage/AgentDetail.tsx`**: Capture the response
from `sendMutation.mutateAsync` and roll back the optimistic message +
status when `response.queued === true`.
## Problem
The pubsub notification handler in `chatd` re-fetched **all** messages
from the DB on every new message notification, then filtered in Go with
`msg.ID > lastMessageID`. The cost grows linearly with conversation
length: every new message triggers a full re-read of that chat's history.
The `AfterMessageID` field in the pubsub notification payload was
clearly designed for cursor-based fetching, but no matching query
existed.
## Fix
- Add `GetChatMessagesByChatIDAfter` SQL query with `WHERE id >
@after_id`, so the database does the filtering instead of Go.
- Use it in the pubsub notification handler in `chatd.go`, passing
`lastMessageID` as the cursor.
- Implement the dbauthz wrapper (was a `panic("not implemented")` stub
from codegen) with the same read-check-on-parent-chat pattern as
adjacent methods.
- Add dbauthz test coverage for the new method.
**Not changed:** The initial snapshot in `Subscribe()` still loads all
messages — that's correct, since a newly-connecting client needs the
full conversation state. The waste was only in the ongoing notification
path.
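As a rough in-memory illustration of the cursor pattern (not the generated query code), the sketch below uses a hypothetical `messagesAfter` helper to stand in for `GetChatMessagesByChatIDAfter`; it assumes the history is ordered by ID.

```go
package main

import (
	"fmt"
	"sort"
)

type message struct {
	ID   int64
	Text string
}

// messagesAfter is an in-memory stand-in for a "WHERE id > @after_id" query.
// It seeks to the cursor instead of scanning and filtering every row.
func messagesAfter(history []message, afterID int64) []message {
	i := sort.Search(len(history), func(i int) bool { return history[i].ID > afterID })
	return history[i:]
}

func main() {
	history := []message{{1, "hi"}, {2, "hello"}, {3, "how are you"}}

	// The subscriber remembers the last ID it delivered and only asks for
	// newer rows on each notification.
	lastMessageID := int64(1)
	for _, m := range messagesAfter(history, lastMessageID) {
		fmt.Println(m.ID, m.Text)
		lastMessageID = m.ID
	}
	fmt.Println(len(messagesAfter(history, lastMessageID)))
}
```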
The gradient mask overlay was positioned at the top of the parent
container (`absolute top-0`), causing it to overlap the title bar
instead of fading the scroll content beneath it.
**Changes:**
- Wrap the TopBar, archived banner, and gradient in a `relative z-10
shrink-0 overflow-visible` container
- Change the gradient from `top-0` to `top-full` so it anchors to the
bottom of the title bar and fades downward over the message area
## Summary
`deleteUserWebpushSubscription` in `coderd/webpush.go` had incorrect
error handling that masked database errors as 404 responses.
## Bug
`GetWebpushSubscriptionsByUserID` is a `:many` query — it returns `([],
nil)` when no rows match, never `sql.ErrNoRows`. The previous `if/else
if` chain:
```go
if existing, err := api.Database.GetWebpushSubscriptionsByUserID(ctx, user.ID); err != nil && errors.Is(err, sql.ErrNoRows) {
// dead code — :many queries never return sql.ErrNoRows
} else if idx := slices.IndexFunc(existing, ...); idx == -1 {
// real DB errors fall through here, existing is nil, idx is -1 → 404
}
```
Any real database error (connection failure, timeout, authorization
error) fell through to the `else if` branch where `slices.IndexFunc(nil,
...)` returns `-1`, returning 404 "subscription not found" instead of
500.
## Fix
Split into two separate checks so database errors properly return 500:
```go
existing, err := api.Database.GetWebpushSubscriptionsByUserID(ctx, user.ID)
if err != nil {
// 500
}
if idx := slices.IndexFunc(existing, ...); idx == -1 {
// 404
}
```
## Testing
Added `TestDeleteWebpushSubscription/database_error_returns_500` which
wraps the DB store to inject an error into
`GetWebpushSubscriptionsByUserID` and asserts the handler returns 500
(not 404).
## Problem
LLM responses currently stream in bulk chunks: multiple `message_part`
events arrive per WebSocket frame, get batched into a single
`startTransition` state update, and render as a visual jump. This looks
janky compared to a smooth character-by-character reveal.
## Solution
Port the jitter-buffer approach from
[coder/mux](https://github.com/coder/mux) into a single self-contained
file: `SmoothText.ts`.
### What's in the file
| Component | Purpose |
|---|---|
| `STREAM_SMOOTHING` constants | Tuning knobs (72–420 cps adaptive rate,
120 char max visual lag, 48 char frame cap) |
| `SmoothTextEngine` class | Pure state machine — two-clock model
(ingestion vs presentation) with budget-gated adaptive reveal |
| `useSmoothStreamingText` hook | React bridge via
`requestAnimationFrame` loop, single `useState<number>`, grapheme-safe
slicing |
### How the engine works
- **Adaptive rate:** Linear interpolation from 72 → 420 chars/sec based
on backlog pressure (how far the display lags behind the ingested text)
- **Budget accumulation:** Fractional character budget accrues per RAF
tick. Only reveals when ≥1 whole character is ready. This makes it
frame-rate invariant — 60Hz and 240Hz displays reveal the same amount
over wall-clock time (tested to ≤2 char deviation)
- **Max visual lag:** Hard cap of 120 chars. If the gap exceeds this,
the visible pointer jumps forward immediately
- **Clean flush:** When streaming ends, remaining buffer appears
instantly — no trailing animation
- **Grapheme safety:** Uses `Intl.Segmenter` (with codepoint fallback)
to never split emoji mid-animation
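The rules above can be sketched as a small state machine. This is an illustration in Go, not the TypeScript in `SmoothText.ts`: the type and function names are hypothetical, pressure is assumed to be `backlog / maxLag`, and lengths are plain character counts (grapheme handling is omitted).

```go
package main

import "fmt"

const (
	minCPS   = 72.0  // slowest reveal rate, chars/sec
	maxCPS   = 420.0 // fastest reveal rate, chars/sec
	maxLag   = 120   // hard cap on how far the display may trail
	frameCap = 48    // most characters revealed in a single tick
)

type engine struct {
	visible int     // characters currently shown
	budget  float64 // fractional characters accrued but not yet revealed
}

// rate interpolates linearly between minCPS and maxCPS. Treating pressure
// as backlog/maxLag is an assumption of this sketch.
func rate(backlog int) float64 {
	p := float64(backlog) / float64(maxLag)
	if p < 0 {
		p = 0
	}
	if p > 1 {
		p = 1
	}
	return minCPS + (maxCPS-minCPS)*p
}

// tick advances the engine by dt seconds and returns the visible length.
func (e *engine) tick(fullLen int, dt float64, streaming bool) int {
	if !streaming || e.visible > fullLen {
		// Stream ended or content shrank: show everything immediately.
		e.visible, e.budget = fullLen, 0
		return e.visible
	}
	if fullLen-e.visible > maxLag {
		e.visible = fullLen - maxLag // jump forward to respect the lag cap
	}
	backlog := fullLen - e.visible
	if backlog == 0 {
		e.budget = 0
		return e.visible
	}
	e.budget += rate(backlog) * dt
	n := int(e.budget) // only whole characters are revealed
	if n > frameCap {
		n = frameCap
	}
	if n > backlog {
		n = backlog
	}
	e.visible += n
	e.budget -= float64(n)
	return e.visible
}

func main() {
	var e engine
	fmt.Println(e.tick(10, 0.001, true))    // budget below one char: nothing revealed
	fmt.Println(e.tick(1000, 0.016, true))  // lag cap jump, then budgeted reveal
	fmt.Println(e.tick(1000, 0.016, false)) // stream ended: flush
}
```

Because the budget is driven by elapsed time rather than tick count, shorter ticks at a higher refresh rate accrue proportionally less per frame.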
### Integration
To wire this up, wrap the `<Response>` component in
`ConversationTimeline.tsx` with the hook:
```tsx
const SmoothedResponse: FC<{text: string; isStreaming: boolean; streamKey: string}> =
({ text, isStreaming, streamKey }) => {
const { visibleText } = useSmoothStreamingText({
fullText: text,
isStreaming,
bypassSmoothing: false,
streamKey,
});
return <Response>{visibleText}</Response>;
};
```
### Tests
8 engine tests covering: steady reveal, adaptive acceleration, max lag
cap, immediate flush on stream end, bypass mode, content shrink,
sub-char budget gating, and frame-rate invariance.
---------
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
When archiving a chat, the frontend no longer navigates away to a
different chat. Instead it stays on the current chat and shows an
archived state.
## Changes
**AgentsPage.tsx** — Removed the redirect logic from
`requestArchiveAgent`. After a successful archive, invalidates the
individual chat query so the detail view picks up the `archived` flag
immediately.
**AgentDetail.tsx** — Detects `chatRecord.archived` and:
- Disables the chat input
- Shows a banner: "This agent has been archived and is read-only."
- Passes `isArchived` to the top bar
- Guards `handleArchiveAgentAction` against double-archiving
**AgentDetail/TopBar.tsx** — When `isArchived`:
- Shows an "Archived" badge next to the chat title
- Hides the "Archive Agent" dropdown menu item
**AgentDetail/TopBar.stories.tsx** — Added an `Archived` story variant.
## Problem
The drag handle (resize slider) on the diff right panel and the sticky
file headers inside `FilesChangedPanel` both had `z-index: 10`. Because
the sticky headers render later in the DOM and are positioned, they
painted on top of the drag handle — making it appear to go "below" the
headers when dragging.
## Fix
Bump the drag handle from `z-10` to `z-20` so it always stays above the
sticky `[data-diffs-header]` elements (`z-index: 10`).
## Problem
Production logs frequently show:
```
[debu] coderd.chats.chat-processor: failed to generate chat title
error= generate title text: context deadline exceeded
```
## Root Cause
The title generation timeout in `maybeGenerateChatTitle` is 10 seconds.
Many LLM providers routinely exceed this under load (cold starts, rate
limits, large models). Since `chatretry` classifies `context deadline
exceeded` as non-retryable, the first timeout kills the entire attempt
with no retry.
## Fix
Increase the timeout from 10s to 30s. Title generation is async and
best-effort — it runs in a background goroutine and doesn't block the
chat response — so a longer timeout has no user-facing impact.
Fixes https://github.com/coder/internal/issues/1371
## Problem
`TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` flakes
intermittently in CI. The observed failure is that the chat never
reaches `pending` status after `serverA.Close()`.
## Root cause
Race between context cancellation and the mock OpenAI server's stream
completion marker.
When `Close()` cancels the server context, the in-flight HTTP streaming
request is canceled. The mock server's handler detects this via
`req.Context().Done()` and closes its chunks channel. The mock's
`writeChatCompletionsStreaming` then writes `data: [DONE]` — the SSE
completion marker. On a loopback connection, this marker can reach the
client **before** the client's HTTP transport honors the context
cancellation.
When this happens:
1. The client sees a successful stream completion (not an error)
2. `chatloop.Run` returns `nil`
3. `processChat` falls through without error → status stays `waiting`
(the default)
4. The test expects `pending` → **flake**
## Fix
Skip writing the `[DONE]` marker when the request context is already
canceled, in both `writeChatCompletionsStreaming` and
`writeResponsesAPIStreaming`.
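A simplified sketch of the guard, assuming a channel-fed writer: `writeStreaming` here is a hypothetical reduction of the mock's streaming writers, not their actual code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// writeStreaming emits each chunk as an SSE data line. The completion
// marker is only written when the request was not canceled, so a canceled
// stream cannot be mistaken for a successful one.
func writeStreaming(ctx context.Context, w io.Writer, chunks <-chan string) {
	for chunk := range chunks {
		fmt.Fprintf(w, "data: %s\n\n", chunk)
	}
	if ctx.Err() != nil {
		return // canceled: leave the stream visibly incomplete
	}
	fmt.Fprint(w, "data: [DONE]\n\n")
}

// streamEndsWithDone runs one stream and reports whether the marker was sent.
func streamEndsWithDone(canceled bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if canceled {
		cancel()
	}
	chunks := make(chan string, 1)
	chunks <- "partial"
	close(chunks)

	var out strings.Builder
	writeStreaming(ctx, &out, chunks)
	return strings.HasSuffix(out.String(), "data: [DONE]\n\n")
}

func main() {
	fmt.Println(streamEndsWithDone(false), streamEndsWithDone(true))
}
```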
Our `AGENTS.md` previously contained this directive:
> When adding tests for new behavior, add new test cases instead of
modifying existing ones. This preserves coverage for the original
behavior and makes it clear what the new test covers.
This leads to inflated diffs and test explosions. This change updates
the directive to favor modifying existing tests where applicable.
---------
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
## Problem
When archiving an agent with subagents, the children briefly flash in
the sidebar as root-level items before disappearing. Two issues:
1. **Backend:** Archive used N+1 queries — a recursive DFS
(`archiveChatTree`, no transaction) or BFS loop (`chatd.ArchiveChat`,
N+1 queries in a tx) to walk the tree and archive each chat
individually.
2. **Frontend:** The SSE `deleted` event handler only filtered out the
parent chat from the cache. Children remained briefly, got promoted to
root-level by `buildChatTree`, then disappeared on the next re-fetch.
## Fix
**Backend:** Replace both tree-walk implementations with a single SQL
query:
```sql
UPDATE chats SET archived = true, updated_at = NOW()
WHERE id = @id OR root_chat_id = @id;
```
This leverages the existing `root_chat_id` column (already indexed) to
archive the entire tree atomically.
**Frontend:** When a `deleted` event arrives, also filter out any chats
whose `root_chat_id` matches the deleted chat, so children vanish from
the sidebar immediately with the parent.
## Changes
- `coderd/database/queries/chats.sql` — Added `ArchiveChatTreeByID`
query
- `coderd/chats.go` — Use single query, delete `archiveChatTree`
function
- `coderd/chatd/chatd.go` — Simplify `ArchiveChat` to use single query
- `coderd/database/dbauthz/dbauthz.go` — Auth wrapper for new query
- `coderd/chats_test.go` — Added `TestArchiveChat/ArchivesChildren`
subtest
- `site/src/pages/AgentsPage/AgentsPage.tsx` — Filter children in SSE
handler
- Generated files updated via `make gen`
Add a new SubjectTypeChatd RBAC subject with minimal permissions:
- Chat: CRUD
- Workspace: Read
- DeploymentConfig: Read
Replace all 10 AsSystemRestricted calls in coderd/chatd/chatd.go:
- Line 890: Use AsChatd instead of AsSystemRestricted for the background
processor context.
- Subscribe() path (5 calls): Remove system escalation entirely; these
run under the authenticated user's context from the HTTP handler.
- processChat path (4 calls): Remove redundant per-call wraps; the
context already carries AsChatd from the processor start.
Add TestAsChatd verifying allowed and denied actions.
Created using Mux (Opus 4.6)
## Flake Fix
Resolves https://github.com/coder/internal/issues/1301
`TestAIBridgeListInterceptions/Pagination/offset` flakes with a 500
caused by `runtime error: integer divide by zero` in `pq.ParseTimestamp`
(encode.go:430) during `GetAPIKeyByID` in the auth middleware.
### Root Cause
**PostgreSQL historical timezone formatting + fragile pq parser:**
1. **Year-0001 timestamps trigger unusual PostgreSQL formatting.** New
API keys were initialized with `LastUsed: time.Time{}` (year
0001-01-01). When the PostgreSQL server timezone is non-UTC, it applies
historical Local Mean Time (LMT) offsets for pre-1900 dates. For year
0001, this can produce timestamps with seconds in the timezone offset
like `0001-12-31 19:03:58-04:56:02`, a format the pq parser was never
designed to handle.
2. **The pq parser panics on unexpected formats.** The
fractional-seconds parser at encode.go:430 computes `fracOff` via
`strings.IndexAny`. When the timestamp has an unusual LMT format, index
arithmetic can produce `fracOff ≤ 0`, causing `int(math.Pow(10,
float64(negative))) = 0` → divide-by-zero panic.
3. **Why it is intermittent:** CI Postgres instances may have varying
timezone configs across runs. The pagination test makes 80+ API calls,
each reading `last_used` via `GetAPIKeyByID`, increasing the probability
of hitting the edge case.
4. **Ruled out pq race condition.** The decode path copies bytes to a Go
string via `string(s)` before `ParseTimestamp`, so buffer reuse cannot
corrupt the input.
### Fix
Initialize `LastUsed` to `time.Unix(0, 0).UTC()` (Unix epoch,
1970-01-01) instead of `time.Time{}` (year 0001). This avoids the entire
class of historical timestamp formatting edge cases.
**Why not `dbtime.Now()`?** The auth middleware debounces `LastUsed`
updates — it only writes when `now.Sub(key.LastUsed) > time.Hour`. Using
`dbtime.Now()` makes the key appear freshly used so the debounce never
triggers, breaking `TestPostUsers/LastSeenAt` and
`TestUsersFilter/LastSeenBeforeNow`. Unix epoch is always >1 hour in the
past, so debounce works correctly.
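The debounce interaction can be checked with a small sketch. `shouldUpdateLastUsed` is a hypothetical reduction of the middleware condition quoted above, and `firstWriteAllowed` exists only for the demo.

```go
package main

import (
	"fmt"
	"time"
)

// shouldUpdateLastUsed mirrors the debounce condition: only write when the
// stored value is more than an hour old.
func shouldUpdateLastUsed(now, lastUsed time.Time) bool {
	return now.Sub(lastUsed) > time.Hour
}

// firstWriteAllowed reports whether the first authenticated request after
// key creation would update LastUsed, for each candidate initial value.
func firstWriteAllowed(useEpoch bool) bool {
	now := time.Now()
	lastUsed := now // what initializing with the current time would store
	if useEpoch {
		lastUsed = time.Unix(0, 0).UTC() // 1970-01-01, the value chosen in this fix
	}
	return shouldUpdateLastUsed(now, lastUsed)
}

func main() {
	fmt.Println(firstWriteAllowed(true), firstWriteAllowed(false))
}
```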
### Follow-up
A defensive fix should also be added to the `coder/pq` fork (guard
`fracOff ≤ 0` before the division in `ParseTimestamp`). Other year-0001
sentinel values exist across the codebase (`workspace_builds.deadline`,
`users.last_seen_at`, `workspaces.last_used_at`, etc.) and remain
theoretically vulnerable until the pq fork is hardened.
Follow-up to #22452. The previous fix only checked the chat's own
status, so a root chat in `waiting` status with actively running
sub-agents still showed the expand/collapse chevron on full-row hover.
## Problem
A root chat that's idle (`waiting`/`completed`) but has running
sub-agents would still swap its status icon for the `>` chevron on row
hover. The fix in #22452 only gated on `chat.status` being
`pending`/`running`, which doesn't cover the parent when sub-agents are
the ones executing.
## Fix
`isExecuting` now also checks whether **any direct child** is
`pending`/`running`:
```ts
const isExecuting =
chat.status === "pending" ||
chat.status === "running" ||
(hasChildren &&
childIDs.some((id) => {
const c = chatById.get(id);
return c?.status === "pending" || c?.status === "running";
}));
```
When `isExecuting` is true, the chevron only appears on hover of the
icon area itself (`group-hover/icon`), not the entire row.
## New story
Added `IdleParentWithRunningChild` — verifies a `waiting` parent with a
`running` child uses icon-only hover scope for the toggle.
Co-authored-by: Coder <coder@users.noreply.github.com>
Bumps rust from `7e6fa79` to `c0a38f5`.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Context
This commit is part of the fix for a downstream provider outage observed
during `coderd_template` updates.
Observed downstream symptoms (terraform-provider-coderd):
- Template-version websocket log stream requests returned `401`:
`GET /api/v2/templateversions/<id>/logs`.
- In older provider code (`waitForJob`), stream-init errors could
produce `(nil, nil, err)` and then trigger a nil dereference when
`closer.Close()` was deferred before checking `err`.
- Net effect: template update path crashed instead of returning a
controlled provisioning error.
That provider panic is being hardened in the provider repo separately
(https://github.com/coder/terraform-provider-coderd/pull/308). This
commit addresses the upstream SDK auth mismatch that caused the
websocket `401` side of the chain.
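For context, the provider-side failure mode comes down to statement ordering. The sketch below is hypothetical (`openLogs` and this `waitForJob` are stand-ins, not the provider's code) and shows the safe order: check `err` first, then defer the close.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

type nopCloser struct{}

func (nopCloser) Close() error { return nil }

// openLogs is a hypothetical stream opener. On failure it returns
// (nil, nil, err), the shape described above.
func openLogs(fail bool) (<-chan string, io.Closer, error) {
	if fail {
		return nil, nil, errors.New("401 unauthorized")
	}
	logs := make(chan string, 1)
	logs <- "job complete"
	close(logs)
	return logs, nopCloser{}, nil
}

// waitForJob checks err before deferring Close. Deferring closer.Close()
// first would panic on the failure path, because closer is a nil interface.
func waitForJob(fail bool) error {
	logs, closer, err := openLogs(fail)
	if err != nil {
		return fmt.Errorf("stream logs: %w", err)
	}
	defer closer.Close()
	for range logs {
		// Consume log lines until the stream ends.
	}
	return nil
}

func main() {
	fmt.Println(waitForJob(true))
	fmt.Println(waitForJob(false))
}
```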
## Root cause
On deployments with host-prefixed cookie handling enabled
(`--host-prefix-cookie` / `EnableHostPrefix=true`), such as
dev.coder.com, middleware rewrites cookie state to enforce prefixed
auth cookies.
For non-browser websocket clients that still sent unprefixed
`coder_session_token` via cookie jars, this created an auth mismatch:
- cookie-based credential expected by the client path,
- but cookie normalization/stripping applied server-side,
- resulting in no usable token at auth extraction time.
## Fix in this commit
Apply the #22226 non-browser auth principle to remaining websocket
callsites in `codersdk` by replacing cookie-jar session auth with
header-token auth.
_Generated with mux but reviewed by a human_
The mux module's input variable was renamed from `add-project` to
`add_project`. This updates the dogfood template to use the new name.
Ref:
https://github.com/coder/registry/blob/main/registry/coder/modules/mux/main.tf
(variable `add_project`)
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Summary
Wire VAPID web push notifications into the Agents (chat) system so users
get desktop notifications when an agent finishes running.
### Backend
- Add `webpush.Dispatcher` to `chatd.Server` and pass it through from
`coderd.Options.WebPushDispatcher`
- In `processChat()`'s deferred cleanup, dispatch a web push
notification when the chat reaches a terminal state:
- **`waiting`** (success): "Agent has finished running."
- **`error`** (failure): the error message, or "Agent encountered an
error."
- Sub-agent chats (`ParentChatID.Valid`) are skipped to avoid
notification spam from internal delegation
- Gracefully no-ops when the dispatcher is nil (web push disabled)
### Frontend
- New `WebPushButton` component — a bell icon that uses the existing
`useWebpushNotifications` hook
- Returns `null` when the `web-push` experiment is off
- Three states: loading spinner, green bell (subscribed), muted bell-off
(unsubscribed)
- Tooltip + toast feedback on toggle
- Added to both the Agents page empty state top bar and the AgentDetail
top bar
- The Agents page has its own layout (no standard Navbar), so it needs
its own subscribe button
### End-to-end flow
1. User clicks the bell icon on `/agents` → browser subscribes via VAPID
2. User starts an agent chat → chat enters `running` status
3. Agent finishes → `processChat` defer sets status to `waiting`/`error`
→ dispatches web push
4. Browser service worker shows a desktop notification with the chat
title and status
---------
Co-authored-by: Coder <coder@users.noreply.github.com>
## Summary
Replaces the plain `<TextareaAutosize>` in the Agent chat input
(`AgentChatInput`) with a Lexical-based editor component, matching the
pattern used in [coder/blink](https://github.com/coder/blink).
## What changed
### New component: `ChatMessageInput`
`site/src/components/ChatMessageInput/ChatMessageInput.tsx`
A Lexical-powered text input that behaves as a plain-text editor with:
- **Enter** submits, **Shift+Enter** inserts newline
- Rich-text formatting disabled (Cmd+B/I/U blocked)
- Paste sanitization (strips formatting, inserts plain text)
- Undo/redo via HistoryPlugin
- Imperative ref API: `insertText()`, `clear()`, `focus()`, `getValue()`
### Updated components
- **`AgentChatInput.tsx`** — Swapped `<TextareaAutosize>` for
`<ChatMessageInput>`. Moved from controlled `value`/`onChange` to
ref-based pattern with `initialValue`/`onContentChange`.
- **`AgentDetail.tsx`** — Updated to use `useRef` for input value
tracking and `editorInitialValue` state for editor resets (edit/cancel
flows).
- **`AgentsPage.tsx`** — Updated to use `useRef` + `initialValue`
pattern.
- **`AgentChatInput.stories.tsx`** — Updated prop names.
### Why Lexical?
This lays the groundwork for features that a native `<textarea>` can't
support:
- Ghost text / inline autocomplete suggestions
- @-mentions and slash commands
- Programmatic text insertion (e.g. from speech-to-text)
- Custom inline decorators (chips, pills, badges)
- Syntax-highlighted code blocks
No adornments are added in this PR — it's a drop-in replacement that
matches existing behavior.
---------
Co-authored-by: Coder <coder@coder.com>
When hovering over a running/pending chat in the agents sidebar, the
spinning status icon was being replaced by the expand/collapse chevron
button. This was disorienting because the spinner conveys important "in
progress" state.
## Changes
**`AgentsSidebar.tsx`**:
- Added `group/icon` scoped hover group to the icon container div
- When a chat is executing (`pending`/`running`), the chevron toggle
only appears on hover of the icon area itself, not the entire row
- Non-executing chats retain the original whole-row hover behavior (no
UX change)
**`AgentsSidebar.stories.tsx`**:
- Added `RunningChatPreservesSpinner` story verifying the spinner is
present and the toggle button starts invisible for running chats with
children
Co-authored-by: Coder <coder@users.noreply.github.com>
## Problem
The `agentproc` process manager spawns processes with only
`os.Environ()`, missing agent-level environment variables like
`GIT_ASKPASS`, `CODER_*`, and `GIT_SSH_COMMAND` that are injected by the
agent's `updateCommandEnv` function. This means processes started
through the HTTP process API (used by chat tools) cannot authenticate
git operations via the Coder gitaskpass helper.
By contrast, SSH sessions get the full agent environment because the SSH
server calls `updateCommandEnv` via its `UpdateEnv` config hook.
## Fix
Wire the agent's `updateCommandEnv` hook into the process manager so all
spawned processes receive the full agent environment. The hook is:
- Passed as a parameter through `NewAPI` → `newManager`
- Called in `manager.start()` with `os.Environ()` as the base, producing
the same enriched env that SSH sessions get
- Gracefully falls back to `os.Environ()` if the hook returns an error
Request-level env vars (`req.Env`, set by chat tools) are still appended
last and take precedence.
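The layering described above can be sketched roughly as follows. This is a minimal illustration, not the actual `agentproc` code: `updateEnvFunc`, `buildEnv`, `addAskpass`, and `failingHook` are hypothetical names, and the hook signature is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// updateEnvFunc stands in for the agent's hook (hypothetical signature):
// it receives a base environment and returns an enriched copy.
type updateEnvFunc func(base []string) ([]string, error)

// buildEnv layers the environment in the order described above:
// base env, then the hook's result (or the base env if the hook fails),
// then request-level variables last.
func buildEnv(base []string, hook updateEnvFunc, reqEnv []string) []string {
	env := base
	if hook != nil {
		if enriched, err := hook(base); err == nil {
			env = enriched
		}
		// On error we keep the base environment instead of failing the spawn.
	}
	// Copy so that appending never mutates the caller's slice.
	out := make([]string, 0, len(env)+len(reqEnv))
	out = append(out, env...)
	return append(out, reqEnv...)
}

// addAskpass is an illustrative hook that adds one agent-level variable.
func addAskpass(base []string) ([]string, error) {
	return append(append([]string{}, base...), "GIT_ASKPASS=/path/to/coder"), nil
}

// failingHook is an illustrative hook that always fails.
func failingHook([]string) ([]string, error) {
	return nil, errors.New("hook failed")
}

func main() {
	base := []string{"PATH=/usr/bin", "TERM=xterm"}
	fmt.Println(buildEnv(base, addAskpass, []string{"TERM=dumb"}))
	fmt.Println(buildEnv(base, failingHook, nil))
}
```

Appending request variables last is enough for them to win when the slice is handed to `os/exec`, since `Cmd.Env` uses the last value for a duplicated key.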
## Changes
- `agent/agentproc/process.go`: Add `updateEnv` field to manager, call
it when building process env
- `agent/agentproc/api.go`: Accept `updateEnv` parameter in `NewAPI`
- `agent/agent.go`: Pass `a.updateCommandEnv` when creating the process
API
- `agent/agentproc/api_test.go`: Add `UpdateEnvHook` and
`UpdateEnvHookOverriddenByReqEnv` tests
Co-authored-by: Coder <coder@coder.com>
## Problem
Chat titles sometimes don't update in the UI. The generated AI title
gets stuck as the fallback (first 6 words of the message) even though
the backend successfully generates a proper title.
## Root Causes
### 1. Cancelable context used during cleanup DB read (P0)
In `processChat`, the deferred cleanup re-reads the chat from the DB to
pick up the AI-generated title for the `status_change` pubsub event. But
it used the cancelable `ctx` instead of `cleanupCtx`:
```go
// Before — ctx may already be canceled here
if freshChat, readErr := p.db.GetChatByID(ctx, chat.ID); readErr == nil {
```
When the context is canceled, the DB read fails silently and the
`status_change` event carries the stale fallback title.
### 2. Title goroutine not tracked by inflight WaitGroup (P2)
The `maybeGenerateChatTitle` goroutine was fire-and-forget — not tracked
by `p.inflight`. During graceful shutdown, the server could exit before
the goroutine completes its DB write or pubsub publish.
### 3. No recovery when watchChats() WebSocket misses events
The frontend relies entirely on the `watchChats()` WebSocket connection for
title updates. If the connection drops or misses events, titles never
recover — the only fix was a full page reload.
## Fixes
1. **Use `cleanupCtx`** for the `GetChatByID` call and logger in the
deferred cleanup block.
2. **Track the title goroutine** with `p.inflight.Add(1)` / `defer
p.inflight.Done()` so shutdown waits for it.
3. **Invalidate chats query** on WebSocket open/close/error events so
missed updates are recovered via refetch. Also enable
`refetchOnWindowFocus` for the chats query.
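Fix 2 follows the usual `sync.WaitGroup` pattern for background work that shutdown must wait on. A minimal sketch, assuming a simplified `processor` type; `generateTitleAsync` and the `written` counter are hypothetical stand-ins for the real title goroutine and its DB write:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processor is a simplified stand-in for the chat processor.
type processor struct {
	inflight sync.WaitGroup
	written  atomic.Int32 // counts completed "title writes" for the demo
}

// generateTitleAsync runs title generation in the background while
// registering it with the inflight WaitGroup.
func (p *processor) generateTitleAsync(generate func() string) {
	p.inflight.Add(1) // must happen before the goroutine starts
	go func() {
		defer p.inflight.Done()
		if title := generate(); title != "" {
			p.written.Add(1) // stands in for the DB write and pubsub publish
		}
	}()
}

// Close blocks until every tracked goroutine has finished.
func (p *processor) Close() { p.inflight.Wait() }

func main() {
	p := &processor{}
	p.generateTitleAsync(func() string { return "Fix flaky test" })
	p.Close()
	fmt.Println("titles written:", p.written.Load())
}
```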
Co-authored-by: Coder <coder@users.noreply.github.com>
When a chatd server shuts down (`Close()`), the server context is
canceled. Previously, in-flight chats would be marked as `error` because
the `context.Canceled` error was not distinguished from actual
processing failures.
This adds `isShutdownCancellation()` to detect when the error is caused
by the server context being canceled (as opposed to a chat-specific
cancellation like `ErrInterrupted`). When detected, the chat status is
set to `pending` with no `last_error`, allowing another replica to pick
it up and retry.
Extracted from #22440 — only the context cancellation bug fix, no
chattest changes.
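One way such a check could look, as a sketch only: the signature of `isShutdownCancellation` below is an assumption, and `statusForError` and `demo` are hypothetical helpers added for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// isShutdownCancellation reports whether err is a context cancellation
// that coincides with the server context being done (assumed signature).
func isShutdownCancellation(serverCtx context.Context, err error) bool {
	return errors.Is(err, context.Canceled) && serverCtx.Err() != nil
}

// statusForError picks the chat status to persist after a failed run.
func statusForError(serverCtx context.Context, err error) string {
	if isShutdownCancellation(serverCtx, err) {
		return "pending" // let another replica pick the chat up
	}
	return "error"
}

// demo returns the status chosen for a real failure while the server is
// running, and for a wrapped cancellation after shutdown has begun.
func demo() (whileRunning, duringShutdown string) {
	serverCtx, cancel := context.WithCancel(context.Background())
	whileRunning = statusForError(serverCtx, errors.New("provider returned 500"))
	cancel() // simulates Close() canceling the server context
	duringShutdown = statusForError(serverCtx, fmt.Errorf("run chat: %w", context.Canceled))
	return whileRunning, duringShutdown
}

func main() {
	fmt.Println(demo())
}
```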
Inspired by openai/codex's `apply_patch` implementation, this changes
the `edit_files` search-and-replace to use a cascading match strategy
when the exact search string isn't found:
1. **Exact substring match** (byte-for-byte) — existing behavior,
unchanged
2. **Line-by-line match ignoring trailing whitespace** — handles
trailing spaces/tabs the LLM omits
3. **Line-by-line match ignoring all leading/trailing whitespace** —
handles tabs-vs-spaces and wrong indentation depth
## Problem
When the chat agent uses `edit_files`, it generates a search string that
must match the file content exactly. LLMs frequently get whitespace
wrong:
- Emitting spaces when the file uses tabs (or vice versa)
- Getting the indentation depth wrong by one or more levels
- Omitting trailing whitespace that exists in the file
When this happens, the edit silently does nothing, and the agent falls
into a retry loop using `cat -A` to diagnose the exact whitespace
characters.
## Solution
Adopted the same cascading fuzzy match strategy that [openai/codex uses
in
`seek_sequence.rs`](https://github.com/openai/codex/blob/main/codex-rs/apply-patch/src/seek_sequence.rs):
- Pass 1: exact match (existing behavior)
- Pass 2: `TrimRight` each line before comparing (trailing whitespace
tolerance)
- Pass 3: `TrimSpace` each line before comparing (full indentation
tolerance)
When a fuzzy match is found, the matched lines in the original file are
replaced with the replacement text. This preserves surrounding content
exactly.
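The three passes can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the code in `files.go`: the signatures of `fuzzyReplace` and `seekLines` are guesses, and replacing only the first exact occurrence is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// seekLines returns the first index in lines where every pattern line
// matches after both sides are normalized with norm, or -1.
func seekLines(lines, pattern []string, norm func(string) string) int {
	if len(pattern) == 0 {
		return -1
	}
	for i := 0; i+len(pattern) <= len(lines); i++ {
		match := true
		for j, p := range pattern {
			if norm(lines[i+j]) != norm(p) {
				match = false
				break
			}
		}
		if match {
			return i
		}
	}
	return -1
}

// fuzzyReplace tries an exact match first, then progressively looser
// line-by-line matches, and reports whether anything was replaced.
func fuzzyReplace(content, search, replace string) (string, bool) {
	if search == "" {
		return content, false
	}
	// Pass 1: exact substring match.
	if strings.Contains(content, search) {
		return strings.Replace(content, search, replace, 1), true
	}
	lines := strings.Split(content, "\n")
	pattern := strings.Split(strings.TrimSuffix(search, "\n"), "\n")
	repl := strings.Split(strings.TrimSuffix(replace, "\n"), "\n")
	passes := []func(string) string{
		// Pass 2: ignore trailing whitespace.
		func(s string) string { return strings.TrimRight(s, " \t\r") },
		// Pass 3: ignore leading and trailing whitespace.
		strings.TrimSpace,
	}
	for _, norm := range passes {
		if i := seekLines(lines, pattern, norm); i >= 0 {
			// Splice the replacement over the matched lines only.
			out := append([]string{}, lines[:i]...)
			out = append(out, repl...)
			out = append(out, lines[i+len(pattern):]...)
			return strings.Join(out, "\n"), true
		}
	}
	return content, false
}

func main() {
	src := "func f() {\n\treturn 1\n}\n"
	out, ok := fuzzyReplace(src, "    return 1", "\treturn 2")
	fmt.Printf("%q %v\n", out, ok)
}
```

Note that the replacement text is inserted as given, so the caller is still responsible for supplying correct indentation in the replacement.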
## Changes
- `agent/agentfiles/files.go`: Replaced `icholy/replace` streaming
transformer with in-memory `fuzzyReplace` + helper functions
(`seekLines`, `spliceLines`)
- `agent/agentfiles/files_test.go`: Added 6 new test cases covering
trailing whitespace, tabs-vs-spaces, different indent depths, exact
match preference, no-match behavior, and mixed whitespace multiline
edits
- Removed `icholy/replace` dependency from go.mod/go.sum
---------
Co-authored-by: Kyle Carberry <kylecarbs@users.noreply.github.com>
The in-memory stream buffer accumulated message-part events for the
entire duration of a chat run. Late-joining subscribers received all
buffered parts even though the backing messages had already been
committed to the database, wasting memory and potentially duplicating
content.
Clear the buffer at the end of each `persistStep` call so that only
in-flight (uncommitted) parts remain in the buffer.
## Summary
Remove the `workspace_agent_id` column from the `chats` table and
dynamically look up the first workspace agent instead.
## Problem
When a workspace is stopped and restarted, the workspace agent gets a
new ID. The `workspace_agent_id` stored on the chat at creation time
becomes stale, making the agent unreachable. This caused chats to break
after workspace restarts.
## Solution
Instead of persisting the agent ID, dynamically look up the first agent
from the workspace's latest build via
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` whenever an agent
connection is needed. The `workspace_id` on the chat remains stable
across restarts.
This behavior may be refined later (e.g., agent selection heuristics),
but picking the first agent resolves the immediate breakage.
## Changes
- **Migration 000425**: Drop `workspace_agent_id` column from `chats`
- **SQL queries**: Remove `workspace_agent_id` from `InsertChat` and
`UpdateChatWorkspace`
- **chatd.go**: `getWorkspaceConn` and `resolveInstructions` now look up
agents dynamically from workspace ID
- **chatd.go**: Remove `refreshChatWorkspaceSnapshot` (no longer needed)
- **createworkspace.go**: Stop persisting agent ID when associating
workspace with chat
- **subagent.go**: Stop passing agent ID to child chats
- **SDK/frontend**: Remove `WorkspaceAgentID` / `workspace_agent_id`
from Chat type
---------
Co-authored-by: Kyle Carberry <kylecarbs@gmail.com>
Two changes:
1. **Gate subagent tools behind `!chat.ParentChatID.Valid`** so child
agents never receive `spawn_agent`, `wait_agent`, `message_agent`, or
`close_agent`. Previously all 4 tools were given to every chat.
`spawn_agent` would fail at runtime ("delegated chats cannot create
child subagents") but the other 3 had no guard at all — meaning a child
could theoretically operate on sibling chats. Removing the tools
entirely is cleaner and saves context window.
2. **Rewrite tool descriptions to explain *when* to use them**, not just
what they do. `spawn_agent` now says to use it for clearly scoped,
independent, self-contained tasks (e.g. fixing a specific bug, writing a
single module, running a migration) and explicitly says *not* to use it
for simple operations you can handle with
`execute`/`read_file`/`write_file`. It also states that child agents
cannot spawn their own subagents. The other 3 tools get similar
guidance-oriented descriptions.
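The gating in change 1 amounts to a single check at tool-assembly time. A rough sketch, assuming a simplified `chat` type (the real `ParentChatID` field type may differ) and a hypothetical `toolsFor` helper:

```go
package main

import (
	"database/sql"
	"fmt"
)

// chat is a simplified stand-in for the real chat record.
type chat struct {
	ID           string
	ParentChatID sql.NullString // assumption; the real field type may differ
}

var (
	baseTools     = []string{"execute", "read_file", "write_file"}
	subagentTools = []string{"spawn_agent", "wait_agent", "message_agent", "close_agent"}
)

// toolsFor returns the tool names a chat should receive. Only root chats
// (no parent) get the subagent tools.
func toolsFor(c chat) []string {
	tools := append([]string{}, baseTools...)
	if !c.ParentChatID.Valid {
		tools = append(tools, subagentTools...)
	}
	return tools
}

// Illustrative fixtures.
var (
	rootChat  = chat{ID: "root"}
	childChat = chat{ID: "child", ParentChatID: sql.NullString{String: "root", Valid: true}}
)

func main() {
	fmt.Println(toolsFor(rootChat))
	fmt.Println(toolsFor(childChat))
}
```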
Co-authored-by: Coder <coder@users.noreply.github.com>
The shimmer component has an infinitely repeating animation that causes
Chromatic snapshot diffs on every run. Adding `data-chromatic="ignore"`
to prevent false positives, consistent with how other animated
components in the codebase handle this (e.g. `Spinner`, `Alert`,
`SyntaxHighlighter`).
Co-authored-by: Coder <coder@users.noreply.github.com>
## Summary
Fixes four frontend↔backend discrepancies in chat stream state
management that could cause duplicate content, UI flicker, and stale
stream state.
### Backend fixes (`coderd/chatd/chatd.go`)
**1. No-pubsub path double-replayed message_part events**
`Subscribe()` built an `initialSnapshot` containing `message_part`
events from `localSnapshot`, then the no-pubsub goroutine replayed the
same `localSnapshot` into the `mergedEvents` channel. Since `streamChat`
sends the snapshot first then reads the channel, the frontend received
every `message_part` twice. `applyMessagePartToStreamState` doesn't
deduplicate — text gets concatenated, so content appeared doubled.
Fix: Only forward live `localParts` in the no-pubsub goroutine; the
snapshot already contains the historical events.
**2. Snapshot missing status event**
The initial snapshot never included a `status` event. The frontend's
`shouldApplyMessagePart()` gates on status (`pending`/`waiting`), but
the initial status came from a separate REST query via `useEffect`.
During the race window between snapshot arrival and REST resolution,
`message_part` events could be incorrectly accepted or rejected.
Fix: Prepend a `status` event to the snapshot after loading the chat
from DB, so the frontend has the authoritative status from the very
first batch.
### Frontend fixes (`ChatContext.ts`)
**3. Scheduled stream reset not canceled by subsequent message_parts**
When a `message` event arrived, `scheduleStreamReset()` queued
`clearStreamState` via `requestAnimationFrame`. If new `message_part`
events arrived in the next WebSocket frame before the rAF fired, they
were pushed to `pendingMessageParts` without canceling the scheduled
reset. The rAF would fire between frames, clearing stream state, then
the next flush would re-populate it — causing a visible flash.
Fix: Call `cancelScheduledStreamReset()` when accumulating
`message_part` events.
**4. startTransition race with synchronous clearStreamState**
`flushMessageParts` wrapped `applyMessageParts` in `startTransition`,
which React can defer. If a `status: "waiting"` event arrived in the
same batch after `message_part` events, the status handler cleared
stream state synchronously, but the deferred `applyMessageParts`
callback could fire afterward and re-populate it.
Fix: Re-check `shouldApplyMessagePart()` inside the `startTransition`
callback at execution time.
### Tests added
- **Go**: `TestSubscribeSnapshotIncludesStatusEvent` — asserts the first
snapshot event is a status event
- **Go**: `TestSubscribeNoPubsubNoDuplicateMessageParts` — asserts the
events channel doesn't replay snapshot events
- **TS**: `cancels scheduled stream reset when message_part arrives
after message` — verifies stream state survives a [message,
message_part] batch
- **TS**: `does not apply message parts after status changes to waiting`
— verifies deferred applyMessageParts respects status transitions
## Summary
Adds a new agent-side process management HTTP API and rewrites the chat
execute tool to use it instead of SSH sessions.
## What changed
### New agent/agentproc/ package
- **headtail.go** — Thread-safe io.Writer with bounded memory (16KB head
+ 16KB tail ring buffer). Provides LLM-ready output with truncation
metadata and long-line truncation at 2048 bytes.
- **headtail_test.go** — 16 tests including race detector coverage for
concurrent writes.
- **process.go** — Manager + Process types for lifecycle management
using agentexec.Execer for proper OOM/nice scores.
- **api.go** — HTTP API following the agentfiles chi router pattern. 4
endpoints: start, list, output, signal.
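To illustrate the head-plus-tail idea behind `headtail.go`, here is a simplified sketch. It is not the package's implementation: the `headTail` type, its fields, and the omission marker are hypothetical, it trims a slice instead of using a true ring buffer, and it leaves out long-line truncation.

```go
package main

import (
	"fmt"
	"sync"
)

// headTail keeps the first max bytes and the last max bytes written,
// dropping everything in between.
type headTail struct {
	mu    sync.Mutex
	max   int // bytes kept at each end
	head  []byte
	tail  []byte
	total int // total bytes ever written
}

// Write implements io.Writer and never returns an error.
func (b *headTail) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := len(p)
	b.total += n
	// Fill the head first.
	if room := b.max - len(b.head); room > 0 {
		take := room
		if take > len(p) {
			take = len(p)
		}
		b.head = append(b.head, p[:take]...)
		p = p[take:]
	}
	// Everything else goes to the tail, trimmed to the last max bytes.
	b.tail = append(b.tail, p...)
	if over := len(b.tail) - b.max; over > 0 {
		b.tail = append([]byte{}, b.tail[over:]...)
	}
	return n, nil
}

// Output returns the retained bytes and whether anything was dropped.
func (b *headTail) Output() (string, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	omitted := b.total - len(b.head) - len(b.tail)
	if omitted == 0 {
		return string(b.head) + string(b.tail), false
	}
	return fmt.Sprintf("%s\n... [%d bytes omitted] ...\n%s", b.head, omitted, b.tail), true
}

func main() {
	b := &headTail{max: 4}
	fmt.Fprint(b, "abcdefghijkl")
	fmt.Println(b.Output())
}
```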
### Agent wiring (agent/agent.go, agent/api.go)
Mounts the process API at /api/v0/processes, mirroring how agentfiles is
mounted.
### SDK (codersdk/workspacesdk/agentconn.go)
4 new AgentConn interface methods + 7 request/response types:
- StartProcess, ListProcesses, ProcessOutput, SignalProcess
### Execute tool rewrite (coderd/chatd/chattool/execute.go)
- Transport: switched from SSH to the agent API (conn.StartProcess() + conn.ProcessOutput() polling)
- New parameters: workdir, run_in_background
- Structured response: success, exit_code, wall_duration_ms, error,
truncated, note, background_process_id
- Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1,
PAGER=cat, etc.
- Output truncation: HeadTailBuffer caps at 32KB for LLM consumption
- File-dump detection with advisory notes suggesting read_file
- Default timeout: reduced from 60s to 10s
- Foreground polling: 200ms intervals until exit or timeout
## Architecture
State lives on the agent, surviving coderd failover and instance
changes. Any coderd replica can query any agent via HTTP over tailnet.
Adds a nullable `last_error` column to the `chats` table so error
reasons survive page reloads.
**Backend:**
- Migration adds `last_error TEXT` (nullable) to chats
- `UpdateChatStatus` writes the error reason when status transitions to
`error`, clears it (NULL) on recovery
- `convertChat` maps `sql.NullString` to `*string` in the SDK
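The nullable mapping in `convertChat` is the usual `sql.NullString` to pointer conversion. A minimal sketch, where `nullStringPtr` and `fromDB` are hypothetical helper names:

```go
package main

import (
	"database/sql"
	"fmt"
)

// nullStringPtr converts a nullable DB string into a *string, returning
// nil for SQL NULL so the field can serialize as absent or null.
func nullStringPtr(ns sql.NullString) *string {
	if !ns.Valid {
		return nil
	}
	s := ns.String
	return &s
}

// fromDB is a small wrapper for the demo and tests.
func fromDB(msg string, valid bool) *string {
	return nullStringPtr(sql.NullString{String: msg, Valid: valid})
}

func main() {
	fmt.Println(fromDB("", false) == nil)
	if p := fromDB("rate limited", true); p != nil {
		fmt.Println(*p)
	}
}
```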
**Frontend:**
- Sidebar falls back to `chat.last_error` when no stream error reason is
cached
- Chat detail page does the same for `persistedErrorReason`
- Fixtures updated for new required field
Replaces the hand-rolled LCS diffing in `buildEditDiff` and the
manual patch-string assembly in `buildWriteFileDiff` with
[`Diff.createPatch()`](https://www.npmjs.com/package/diff) from the
`diff` npm package.
Both functions now just call `Diff.createPatch()` and feed the result
straight into `parsePatchFiles()`, removing all the manual line
splitting, prefix tagging, hunk-header arithmetic, and trailing-newline
cleanup.
### Changes
- Add `diff` as a dependency
- `buildWriteFileDiff`: replaced ~20 lines of manual patch assembly
with a single `Diff.createPatch()` call
- `buildEditDiff`: replaced ~60 lines (line splitting, `Diff.diffLines`
→ prefixed strings, hunk counting) with a `Diff.createPatch()` call
per edit
- Removed the `chunkLines` helper and the `diffLines` wrapper +
its test block
Net: +21 / -157 lines across source and tests.
The diff view on the `/agents` page had no way to handle lines wider
than the panel. The `@pierre/diffs` library supports an `overflow`
option — switching it from `"scroll"` (the shared default) to `"wrap"`
for the side panel makes long lines wrap naturally instead of being
clipped.
Also adds a long import line to the Storybook sample diff so the
wrapping behavior is easy to verify visually.
## Summary
Adds a typed-confirmation step before deleting a deployment license to
reduce accidental removals.
<img width="457" height="440" alt="Screenshot 2026-02-13 at 15 31 58"
src="https://github.com/user-attachments/assets/b13320a7-4b10-43fa-ab01-56f3284435b6"
/>
## Changes
- Swapped the license removal dialog from `ConfirmDialog` to
`DeleteDialog`, requiring the admin to type the license ID before
enabling **Remove**.
- Added interaction coverage to verify the confirmation guard.
TemplateVersionEditorPage tests have been flaking since I ported them to
vitest in 99a4ecd. Turns out our test timeout on jest is 20s (presumably
for these sorts of page-level journey tests). I kinda like the current
5s timeout as it forces us to write speedy tests, but I think in this
case it's unavoidable and makes sense to lengthen the timeout just for
these tests.
Hopefully fixes coder/internal#1369
You may want the whitespaceless diff here:
https://github.com/coder/coder/pull/22412/changes?w=1
## Summary
Adds a new `diff_status_change` event kind to the `/chats/watch` pubsub
stream so the sidebar can update diff status (PR created, files changed,
branch info) without a full page reload.
### Problem
When a chat's diff status changes (e.g. PR created via GitHub, git
branch pushed), the sidebar didn't update because:
1. The backend `publishChatPubsubEvent` didn't include diff status data
2. The frontend watch handler only merged `status`, `title`, and
`updated_at` from events
### Solution
A **notify-only** approach: a new `ChatEventKindDiffStatusChange` event
kind tells the frontend "diff status changed for chat X" — the frontend
then invalidates the relevant React Query cache entries to re-fetch.
### Backend changes
- **`coderd/pubsub/chatevent.go`**: New `ChatEventKindDiffStatusChange =
"diff_status_change"` constant
- **`coderd/chatd/chatd.go`**: New `PublishDiffStatusChange(ctx,
chatID)` method on `Server`
- **`coderd/chats.go`**: New `publishChatDiffStatusEvent` helper.
Published from:
- `refreshWorkspaceChatDiffStatuses` — after each chat's diff status is
refreshed via GitHub API
- `storeChatGitRef` — after persisting git branch/origin info from
workspace agent
### Frontend changes
- **`AgentsPage.tsx`**: Handle `diff_status_change` event by
invalidating `chatDiffStatusKey` and `chatDiffContentsKey` queries
- **`ChatContext.ts`**: Remove redundant diff status invalidation that
fired on every chat status change (the new event kind handles this
properly)
## Problem
When sending a message in the agent detail chat, the text lingered in
the input textarea while the HTTP POST round-tripped to the server. Only
after the server responded did the input clear and the message appear in
the timeline (via WebSocket). This created a noticeable delay where the
user couldn't start typing their next message.
## Solution
**Optimistic input clear** (`AgentChatInput.tsx`):
- Clear the textarea and editing state *immediately* on submit, before
awaiting the network call.
- Capture the input text beforehand so it can be restored in the `catch`
block if the request fails.
**Optimistic user bubble** (`AgentDetail.tsx`):
- Inject a temporary `ChatMessage` (with a negative ID) into the chat
store so the user's message bubble appears in the timeline instantly.
- Set chat status to `pending` and clear stream state, mirroring the
existing edit-message path.
- On error, roll back: remove the optimistic message and restore the
previous chat status.
The real message arrives via the WebSocket stream and is added by
`upsertDurableMessage`. Because the server message has a positive ID, it
is inserted alongside the optimistic entry rather than replacing it. The
optimistic negative-ID message is cleaned up when `replaceMessages` is
called with the authoritative message list from the next query
invalidation.
## Testing
- Type a message and press Enter — input clears and bubble appears
immediately.
- Simulate a network error — input text is restored, optimistic bubble
is removed.
- Edit an existing message — unchanged behavior (already had optimistic
updates).
- Queue a message while streaming — unchanged behavior.
Adds two keyboard shortcuts to the agents page:
- **Escape** — Interrupts the running agent when viewing a chat detail
page. Only fires when focus is outside text inputs/textareas so it
doesn't conflict with the existing edit-cancel Escape handler in the
chat input.
- **Ctrl+N / Cmd+N** — Navigates to create a new agent. Also skipped
when focus is in a text input/textarea.
Both keybindings are implemented in a new `useAgentsPageKeybindings.ts`
hook file:
- `useAgentsPageKeybindings` — used in `AgentsPage.tsx` for Ctrl+N
- `useAgentDetailKeybindings` — used in `AgentDetail.tsx` for Escape →
interrupt
## Summary
The UI has always labeled the action as "Archive agent" but the backend
was performing a hard `DELETE`, permanently destroying chats and all
their messages.
This change replaces the hard delete with a soft archive, consistent
with the pattern used by template versions.
## Changes
### Database
- **Migration 000423**: Add `archived boolean DEFAULT false NOT NULL`
column to `chats` table
- Replace `DeleteChatByID` query with `ArchiveChatByID` (`UPDATE SET
archived = true`)
- Add `UnarchiveChatByID` query (`UPDATE SET archived = false`)
- Filter archived chats from `GetChatsByOwnerID` (`WHERE archived =
false`)
### API
- Remove `DELETE /api/experimental/chats/{chat}`
- Add `POST /api/experimental/chats/{chat}/archive` — archives a chat
and all its descendants
- Add `POST /api/experimental/chats/{chat}/unarchive` — unarchives a
single chat (API only, no UI yet)
### Backend
- `archiveChatTree()` recursively archives child chats (replaces
`deleteChatTree()` which hard-deleted)
- Chat daemon's `ArchiveChat()` archives the full chat tree in a
transaction
- Authorization uses `ActionUpdate` instead of `ActionDelete`
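The recursive archive can be pictured with an in-memory model. This is only a sketch: the real `archiveChatTree()` works against the database inside a transaction, and the `chat` and `store` types here are hypothetical.

```go
package main

import "fmt"

// chat is a simplified in-memory record.
type chat struct {
	ID       string
	ParentID string // empty for root chats
	Archived bool
}

// store maps chat ID to chat; it stands in for the chats table.
type store map[string]*chat

// archiveTree marks the chat and all of its descendants as archived and
// returns how many chats changed state. It assumes the data forms a tree.
func (s store) archiveTree(id string) int {
	c, ok := s[id]
	if !ok {
		return 0
	}
	n := 0
	if !c.Archived {
		c.Archived = true
		n++
	}
	for _, other := range s {
		if other.ParentID == id {
			n += s.archiveTree(other.ID)
		}
	}
	return n
}

func main() {
	s := store{
		"root":  {ID: "root"},
		"kid":   {ID: "kid", ParentID: "root"},
		"other": {ID: "other"},
	}
	fmt.Println(s.archiveTree("root"), s["other"].Archived)
}
```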
### SDK
- Replace `DeleteChat()` with `ArchiveChat()` and `UnarchiveChat()`
- Add `Archived` field to `Chat` struct
### Frontend
- `archiveChat` API call uses `POST .../archive` instead of `DELETE`
- No UI changes — the "Archive agent" button now actually archives
instead of deleting
## Design Decision
This follows the **template version archive pattern** (Pattern B in the
codebase):
- `archived boolean` column (not `deleted boolean`)
- Dedicated `POST .../archive` and `POST .../unarchive` routes (not
repurposing `DELETE`)
- Reversible — users can unarchive via the API (UI for this will come
later)
## Problem
`resolveChatGitHubAccessToken` reads the `OAuthAccessToken` directly
from the database without refreshing it. When the token expires, GitHub
returns "bad credentials" and the chat diff features break.
## Fix
Call `config.RefreshToken()` before returning the token — the same code
path used by `provisionerdserver` when handing tokens to provisioners.
- Builds a map of provider ID → `*externalauth.Config` during the
existing config iteration
- After fetching the `ExternalAuthLink` from the DB, calls
`cfg.RefreshToken()` if a matching config exists
- On refresh failure, falls through to the existing token (GitHub tokens
without expiry still work) with a debug log
## Problem
Context compaction in chatd persisted durable messages for the
`chat_summarized` tool call and result via `publishMessage`, but never
published `message_part` streaming events via `publishMessagePart`. This
meant connected clients had no streaming representation of the
compaction.
The client's `streamState` (built entirely from `message_part` events in
`streamState.ts`) never saw the compaction tool call, so:
- No **"Summarizing..."** running state was shown to the user during
summary generation (which can take up to 90s).
- The durable `message` events arrived after or interleaved with the
`status: waiting` event, causing the tool to appear as "Summarized" with
the chat appearing to just stop.
## Fix
### 1. `CompactionOptions.OnStart` callback (chatloop)
Added an `OnStart` callback to `CompactionOptions`, called in
`maybeCompact` right before `generateCompactionSummary` (the slow LLM
call). This gives `chatd` a hook to publish the tool-call `message_part`
immediately when compaction begins.
### 2. Tool-result streaming part (chatd)
`persistChatContextSummary` now publishes a tool-result `message_part`
before the durable `message` events, so clients transition from
"Summarizing..." to "Summarized" before the status change arrives.
### Event ordering is now:
1. `message_part` (tool call via `OnStart`) — client shows
"Summarizing..."
2. LLM generates summary (up to 90s)
3. `message_part` (tool result) — client shows "Summarized" in stream
state
4. `message` (assistant) — durable message persisted, stream state
resets
5. `message` (tool) — durable tool result persisted
6. `status: waiting` — chat transitions to idle
## Tests
- **`OnStartFiresBeforePersist`**: Verifies callback ordering is
`on_start` → `generate` → `persist`.
- **`OnStartNotCalledBelowThreshold`**: Verifies `OnStart` is not called
when context usage is below the compaction threshold.
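The callback ordering that these tests assert can be sketched as follows. The struct and function below are simplified, hypothetical versions of `CompactionOptions` and `maybeCompact`; the `Threshold` and `Persist` fields and the `runDemo` helper are assumptions made for illustration.

```go
package main

import "fmt"

// compactionOptions is a simplified stand-in for CompactionOptions.
type compactionOptions struct {
	Threshold float64              // fraction of the context window that triggers compaction
	OnStart   func()               // called right before the slow summary generation
	Persist   func(summary string) // called with the finished summary
}

// maybeCompact runs compaction only when usage reaches the threshold and
// reports whether it ran.
func maybeCompact(usage float64, opts compactionOptions, generate func() string) bool {
	if usage < opts.Threshold {
		return false
	}
	if opts.OnStart != nil {
		opts.OnStart()
	}
	summary := generate() // the slow LLM call in the real system
	if opts.Persist != nil {
		opts.Persist(summary)
	}
	return true
}

// runDemo records the order in which the callbacks fire.
func runDemo(usage float64) []string {
	var order []string
	opts := compactionOptions{
		Threshold: 0.8,
		OnStart:   func() { order = append(order, "on_start") },
		Persist:   func(string) { order = append(order, "persist") },
	}
	maybeCompact(usage, opts, func() string {
		order = append(order, "generate")
		return "summary"
	})
	return order
}

func main() {
	fmt.Println(runDemo(0.9))
	fmt.Println(runDemo(0.1))
}
```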
## Problem
The `update workspace, new required, mutable parameter added` e2e test
has been flaking consistently
([internal#1328](https://github.com/coder/internal/issues/1328)). The
error:
```
Error: Timed out 5000ms waiting for expect(locator).toHaveValue(expected)
Locator: getByTestId('parameter-field-Sixth parameter').locator('input')
Expected string: "99"
Received string: ""
```
## Root Cause
A race between page navigation and data hydration in `verifyParameters`:
1. The page navigates with `waitUntil: "domcontentloaded"` which does
not wait for API responses to settle
2. React Query may serve stale cached workspace data initially (from
before the update), causing the form to render with empty/old parameter
values
3. The `toHaveValue` assertion uses the default `actionTimeout` of
5000ms which isn't enough time for fresh data to arrive and the form to
re-render
## Fix
- Switch `verifyParameters` navigation to `waitUntil: "networkidle"` to
ensure API responses (workspace data, build parameters) are settled
before the form renders
- Increase the `toHaveValue` timeout to 15s to handle cases where
dynamic parameters hydrate slowly after initial render
Fixes coder/internal#1328
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Problem
When switching between chats on the agents page, stream parts could be
lost or applied to the wrong chat due to several race conditions in
`ChatContext.ts`:
1. **`startTransition` deferred parts escape cleanup** —
`startTransition(() => store.applyMessageParts(parts))` defers the state
update. If a chat switch happens between `flushMessageParts` being
called and the transition executing, old-chat parts could apply after
`resetTransientState()` has already cleared stream state for the new
chat.
2. **`message` event has no `chat_id` filter** — Unlike `message_part`,
`queue_update`, and `status` events, the `message` event handler did not
check `streamEvent.chat_id`. While the server-scoped WebSocket makes
this safe in practice, it's an inconsistency in defensive programming.
3. **Brief stale message window on switch** — Between `chatID` changing
and `replaceMessages()` firing (after the query resolves), the store
held old-chat messages while the new WebSocket was already connected.
## Changes
### `ChatContext.ts`
- Added `activeChatIDRef` to track the currently active chat ID
- Guard `startTransition` callback: check `activeChatIDRef` before
applying message parts, discarding them if the chat has switched
- Added `chat_id` filter to `message` event handler, matching the
pattern used by all other event types
- Added `store.replaceMessages([])` to the chatID-change effect so
messages are cleared immediately on switch
### `ChatContext.test.tsx`
Four new tests covering the chat-switch lifecycle:
- WebSocket closure and state reset when chatID changes
- `message` event filtering by `chat_id`
- `startTransition` deferred parts discarded after switch
- Messages cleared immediately before new query resolves
All 13 tests pass (9 existing + 4 new).
## Problem
Non-admin users of the Agents (chat) feature send `model_config_id:
"00000000-0000-0000-0000-000000000000"` (nil UUID) when creating chats,
because the `GET /api/experimental/chats/model-configs` endpoint
requires `policy.ActionRead` on `rbac.ResourceDeploymentConfig`, which
is only granted to admins.
The flow:
1. `AgentsPage.tsx` calls `useQuery(chatModelConfigs())` → hits
`listChatModelConfigs`
2. Non-admin users get a **403 Forbidden** response
3. `chatModelConfigsQuery.data` is `undefined`, so the
`modelConfigIDByModelID` map is empty
4. `handleCreateChat` falls back to `nilUUID` for `model_config_id`
5. The backend rejects the nil UUID: `"Invalid model config ID."`
## Fix
Changed `listChatModelConfigs` to allow all authenticated users to read
model configs:
- **Admin users** continue to see all configs (including disabled ones)
for management via `GetChatModelConfigs`
- **Non-admin users** now see only enabled configs via
`GetEnabledChatModelConfigs` with a system context, which is sufficient
for using the chat feature
This follows the same pattern as `listChatModels`, which already uses
`dbauthz.AsSystemRestricted(ctx)` to allow all authenticated users to
see available models.
Write endpoints (create/update/delete) retain their existing
`ResourceDeploymentConfig` authorization.
## Testing
- Updated `TestListChatModelConfigs/ForbiddenForOrganizationMember` →
`SuccessForOrganizationMember` to verify non-admin users can list
enabled model configs
- All existing chat tests continue to pass
## Problem
When coderd instances are redeployed (e.g. rolling deployment on
dogfood), in-flight chats get stuck in `running` status permanently. The
UI shows them as "thinking" with a spinning indicator, but no worker is
actually processing them. They never error or resume.
## Root Cause
Two bugs combine to cause this:
### Bug 1: Shutdown cleanup uses a canceled context
The `processChat` defer block updates the chat status in the DB when
processing completes. But it uses `ctx`, which `Close()` cancels
*before* the defer runs. The DB transaction silently fails with
`context.Canceled`, leaving the chat in `status=running` with a dead
`worker_id`.
```go
// Close() calls p.cancel() which cancels ctx
// Then the defer tries to use the now-canceled ctx:
defer func() {
err := p.db.InTx(func(tx database.Store) error {
tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS
tx.UpdateChatStatus(ctx, ...) // FAILS
}, nil)
}()
```
### Bug 2: Stale recovery runs only once at startup
`recoverStaleChats()` was called only once in `start()`, not
periodically. During a rolling deployment, the new instance starts while
the old one is still alive (fresh heartbeat). By the time the old
instance crashes, no one checks again.
## Fix
1. **Use `context.WithoutCancel(ctx)` in the processChat defer** — the
cleanup transaction now completes even during graceful shutdown.
2. **Run `recoverStaleChats` periodically** — a second ticker in the
`start()` loop checks for stale chats at `inFlightChatStaleAfter / 5`
intervals (default: every 1 minute). This catches orphaned chats even
when the instance that owns them crashes without clean shutdown.
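Fix 1 relies on `context.WithoutCancel` (Go 1.21+), which keeps the parent's values but drops its cancellation. A small sketch of the difference, where `fakeUpdateStatus`, `cleanupWith`, and `demo` are hypothetical helpers and the 5-second bound is an added assumption rather than something the PR specifies:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeUpdateStatus stands in for a DB call that fails on a dead context.
func fakeUpdateStatus(ctx context.Context) error {
	return ctx.Err()
}

// cleanupWith runs the status update either on the original context or
// on a detached one that survives cancellation of its parent.
func cleanupWith(ctx context.Context, detach bool) error {
	if detach {
		var cancel context.CancelFunc
		// Detach from the parent's cancellation, but still bound the cleanup.
		ctx, cancel = context.WithTimeout(context.WithoutCancel(ctx), 5*time.Second)
		defer cancel()
	}
	return fakeUpdateStatus(ctx)
}

// demo returns the cleanup error without and with detaching.
func demo() (before, after error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulates Close() canceling the server context
	return cleanupWith(ctx, false), cleanupWith(ctx, true)
}

func main() {
	before, after := demo()
	fmt.Println("canceled ctx:", before)
	fmt.Println("detached ctx:", after)
}
```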
## Tests
- `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned *after*
startup are recovered by the periodic loop (not just the startup check).
- `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new
replica recovers stale chats on startup.
- `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting`
chats are not incorrectly modified by recovery.
## Problem
The git diff on the `/agents` page had color issues: the editor
background followed light mode but the syntax highlighting used dark
mode (`github-dark-high-contrast`), and the filename header used
light-colored text on a light background.
The root cause was hardcoded dark theme options in the `FileDiff`
component:
```tsx
themeType: "dark",
theme: "github-dark-high-contrast",
```
## Fix
Uses the same theme-aware pattern as every other diff/file viewer in the
codebase (`WriteFileTool`, `EditFilesTool`, `ReadFileTool`, `Tool`,
`response.tsx`):
1. `useTheme()` from `@emotion/react` to read `palette.mode`
2. `getDiffViewerOptions(isDark)` from the shared `utils.ts` module —
returns `github-light` theme for light mode, `github-dark-high-contrast`
for dark mode
3. Reuses `DIFFS_FONT_STYLE` and `diffViewerCSS` constants instead of
inlining duplicates
## Storybook coverage
Added four new stories with real unified diff content:
- **WithDiffDark** — dark mode with a PR link
- **WithDiffLight** — light mode with a PR link
- **NoPullRequestDark** — dark mode, "Files Changed" header
- **NoPullRequestLight** — light mode, "Files Changed" header
The existing stories only covered empty and parse-error states with no
rendered diff.
## Summary
Adds a new line-based file reading endpoint to the workspace agent,
replacing the unbounded byte-based approach for the `read_file` chat
tool and `coder_workspace_read_file` MCP tool.
**Problem**: The current `read_file` tool returns the entire file
contents with no limits, which can blow up LLM context windows and cause
OOM issues with large files.
**Solution**: Inspired by [`coder/mux`](https://github.com/coder/mux)
and [`openai/codex`](https://github.com/openai/codex), implement a
line-based reader with safety limits.
## Changes
### Agent (`agent/agentfiles/`)
- New `/read-file-lines` endpoint with `HandleReadFileLines` handler
- Line-based `offset` (1-based line number, default: 1) and `limit`
(line count, default: 2000)
- Safety constants:
| Constant | Value | Purpose |
|---|---|---|
| `MaxFileSize` | 1 MB | Reject files larger than this at stat |
| `MaxLineBytes` | 1,024 | Per-line truncation with `... [truncated]` marker |
| `MaxResponseLines` | 2,000 | Max lines per response |
| `MaxResponseBytes` | 32 KB | Max total response size |
| `DefaultLineLimit` | 2,000 | Default when no limit specified |
- Line numbering format: `1\tcontent` (tab-separated)
- Structured JSON response: `{ success, file_size, total_lines,
lines_read, content, error }`
- Hard errors when limits exceeded — tells the LLM to use
`offset`/`limit`
- Existing byte-based `/read-file` endpoint preserved (used by
`instruction.go`)
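The line-based read with per-line truncation and a response byte cap can be sketched as follows. This is a minimal illustration and not the agent's handler: `readLines`, its signature, and the error text are hypothetical, it reads from an in-memory string rather than a file, and it omits the `MaxFileSize` and `MaxResponseLines` checks.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// Illustrative limits that mirror the constants listed above.
const (
	maxLineBytes     = 1024
	maxResponseBytes = 32 * 1024
	defaultLineLimit = 2000
)

var errResponseTooLarge = errors.New("response too large: narrow the range with offset/limit")

// readLines is a hypothetical helper. It returns up to limit lines starting
// at the 1-based offset, each prefixed with its line number and a tab.
func readLines(src string, offset, limit int) (content string, totalLines, linesRead int, err error) {
	if offset < 1 {
		offset = 1
	}
	if limit < 1 {
		limit = defaultLineLimit
	}
	var out strings.Builder
	sc := bufio.NewScanner(strings.NewReader(src))
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20) // accept lines up to 1 MB
	for sc.Scan() {
		totalLines++
		if totalLines < offset || linesRead >= limit {
			continue // keep scanning so totalLines covers the whole input
		}
		line := sc.Text()
		if len(line) > maxLineBytes {
			// Byte-based cut; a real implementation may want to avoid
			// splitting a multi-byte character here.
			line = line[:maxLineBytes] + "... [truncated]"
		}
		entry := fmt.Sprintf("%d\t%s\n", totalLines, line)
		if out.Len()+len(entry) > maxResponseBytes {
			// Hard error instead of partial data, so the caller paginates.
			return "", totalLines, linesRead, errResponseTooLarge
		}
		out.WriteString(entry)
		linesRead++
	}
	return out.String(), totalLines, linesRead, sc.Err()
}

func main() {
	content, total, read, err := readLines("alpha\nbeta\ngamma\n", 2, 1)
	fmt.Printf("%q total=%d read=%d err=%v\n", content, total, read, err)
}
```

Returning an error when the byte cap is hit, rather than a shortened result, follows the "hard errors" decision described here: the caller is pushed to request a smaller window.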
### SDK (`codersdk/workspacesdk/`)
- `ReadFileLinesResponse` type added
- `ReadFileLines` method added to `AgentConn` interface
- Mock regenerated
### Chat tool (`coderd/chatd/chattool/`)
- `read_file` tool now uses `conn.ReadFileLines()` instead of
`conn.ReadFile()`
- Updated tool description to document line-based parameters
- Response includes `file_size`, `total_lines`, `lines_read` metadata
### MCP tool (`codersdk/toolsdk/`)
- `coder_workspace_read_file` updated to use line-based reading
- Schema descriptions updated for line-based offset/limit
- Removed `maxFileLimit` constant (agent handles limits now)
### Tests
- 13 new test cases for `TestReadFileLines`:
- Path validation (empty, relative, non-existent, directory, no
permissions)
- Empty file handling
- Basic read, offset, limit, offset+limit combinations
- Offset beyond file length
- Long line truncation (>1024 bytes)
- Large file rejection (>1MB)
- All existing tests pass unchanged
## Design decisions
| Decision | Rationale |
|---|---|
| Line-based, not byte-based | Both coder/mux and openai/codex use line-based; matches how LLMs reason about code |
| Default limit of 2000 | Matches codex; prevents accidental full-file dumps while being generous |
| 32 KB response cap | Compromise between mux (16 KB) and codex (no cap) |
| 1024 byte/line truncation with marker | More generous than codex (500), marker helps LLM know data is missing |
| Hard errors on overflow | Matches mux; forces LLM to paginate rather than getting partial data |
| Preserve byte-based endpoint | `instruction.go` needs raw byte access for AGENTS.md |
## Problem
Chat titles revert to the fallback truncated title after briefly showing
the AI-generated title. Even reloading the page doesn't help — the
correct title flashes then gets overwritten.
## Root Cause
Single bug, two symptoms.
In `processChat` (`coderd/chatd/chatd.go`), the `chat` variable is
passed by value. The flow:
1. `processChat(ctx, chat)` receives `chat` with the initial fallback
title (truncated first message).
2. Inside `runChat`, `maybeGenerateChatTitle` generates an AI title,
writes it to the DB via `UpdateChatByID`, and publishes a `title_change`
event. **The DB has the correct title.** The client briefly displays it.
3. `runChat` returns. The **deferred cleanup** in `processChat`
publishes `publishChatPubsubEvent(chat, StatusChange)` — but `chat` here
is the original value copy that still has the **old fallback title**.
4. The frontend receives the `status_change` SSE event and
**unconditionally applies `title` from every event kind** (see
`AgentsPage.tsx` line ~305: `title: updatedChat.title`). This overwrites
the correct AI title with the stale fallback.
**Why reload doesn't help:** If the chat is still processing when the
page reloads, `listChats` loads the correct title from the DB, but then
the deferred `status_change` event arrives moments later and clobbers
it. The title was always in the DB — it was the pubsub event that kept
overwriting it.
## Fix
Re-read the chat from the database in the deferred cleanup before
publishing the final `status_change` event, so it carries the current
(AI-generated) title.
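The stale-copy problem and the re-read fix can be reproduced in a few lines. This is a simplified sketch, not the code in `chatd.go`: the `chat` struct, the map-backed `store`, and the `processStale`/`processFresh` functions are hypothetical stand-ins for the real types, the database, and `processChat`.

```go
package main

import "fmt"

type chat struct {
	ID    int
	Title string
}

// store stands in for the database.
type store map[int]chat

// generateTitle mimics the title generator: it writes the new title to the
// store only, leaving the caller's copy of the chat untouched.
func generateTitle(db store, id int) {
	c := db[id]
	c.Title = "Fix login bug"
	db[id] = c
}

// processStale publishes the by-value copy it received, so the final event
// carries the title from before processing.
func processStale(db store, c chat, publish func(chat)) {
	defer func() { publish(c) }()
	generateTitle(db, c.ID)
}

// processFresh re-reads the chat in the deferred cleanup before publishing.
func processFresh(db store, c chat, publish func(chat)) {
	defer func() {
		if latest, ok := db[c.ID]; ok {
			c = latest
		}
		publish(c)
	}()
	generateTitle(db, c.ID)
}

func main() {
	show := func(c chat) { fmt.Println("published title:", c.Title) }
	processStale(store{1: {ID: 1, Title: "fallback"}}, chat{ID: 1, Title: "fallback"}, show)
	processFresh(store{1: {ID: 1, Title: "fallback"}}, chat{ID: 1, Title: "fallback"}, show)
}
```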
When navigating to a specific agent on the Agents page, the browser tab
title now reflects the agent's chat title (e.g. `Fix login bug - Agents
- Coder`). When the title hasn't loaded yet or when navigating away, it
falls back to `Agents - Coder`.
**Changes:**
- Added a `useEffect` in `AgentDetail` that sets `document.title` via
the existing `pageTitle` utility whenever the chat title changes.
- The cleanup function resets the title back to `Agents - Coder` when
unmounting (navigating away from the agent).
When injecting system instructions into the chat prompt, include:
1. **Operating system** and **working directory** from the
`workspace_agents` table
2. **Home-level instructions** from `~/.coder/AGENTS.md` (existing
behavior)
3. **Project-level instructions** from `<pwd>/AGENTS.md` (new)
The XML tag is renamed from `<coder-home-instructions>` to
`<system-instructions>` since it now carries more than just the home
instruction file.
### Example output (both files present)
```xml
<system-instructions>
Operating System: linux
Working Directory: /home/coder/coder
Source: /home/coder/.coder/AGENTS.md
... home instructions ...
Source: /home/coder/coder/AGENTS.md
... project instructions ...
</system-instructions>
```
### Example output (no AGENTS.md files)
```xml
<system-instructions>
Operating System: linux
Working Directory: /home/coder/coder
</system-instructions>
```
### Changes
- **`coderd/chatd/instruction.go`**:
- Renamed types: `homeInstructionContext` → `agentContext`, added
`instructionFile` struct
- Extracted `readInstructionFileAtPath` shared helper
- Added `readWorkingDirectoryInstructionFile` to read `<pwd>/AGENTS.md`
- Replaced `formatHomeInstruction` with `formatInstructions` that
renders both files under `<system-instructions>`
- **`coderd/chatd/chatd.go`**:
- Renamed `resolveHomeInstruction` → `resolveInstructions`; now reads
both home and pwd instruction files
- `resolveAgentContext` returns `agentContext` (renamed from
`homeInstructionContext`)
- pwd file read is skipped gracefully if directory is empty or file
doesn't exist
- **`coderd/chatd/instruction_test.go`**:
- Added `TestReadWorkingDirectoryInstructionFile` (success, not-found,
empty-directory)
- Replaced `TestFormatHomeInstruction` with `TestFormatInstructions`
covering all combinations
- Added ordering test (`AgentContextBeforeFiles`) to verify OS/pwd
appear before file sources
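A rendering function that produces the two example outputs above might look like the sketch below. The function and struct names echo the description, but the fields, the signature, and the decision to skip files with empty content are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// instructionFile here is a hypothetical shape: a source path plus content.
type instructionFile struct {
	Source  string
	Content string
}

// formatInstructions writes the agent context first, then each non-empty
// file, all inside a single <system-instructions> element.
func formatInstructions(operatingSystem, workingDir string, files []instructionFile) string {
	var b strings.Builder
	b.WriteString("<system-instructions>\n")
	if operatingSystem != "" {
		fmt.Fprintf(&b, "Operating System: %s\n", operatingSystem)
	}
	if workingDir != "" {
		fmt.Fprintf(&b, "Working Directory: %s\n", workingDir)
	}
	for _, f := range files {
		content := strings.TrimSpace(f.Content)
		if content == "" {
			continue // assumption: empty files are omitted
		}
		fmt.Fprintf(&b, "Source: %s\n%s\n", f.Source, content)
	}
	b.WriteString("</system-instructions>")
	return b.String()
}

func main() {
	fmt.Println(formatInstructions("linux", "/home/coder/coder", []instructionFile{
		{Source: "/home/coder/.coder/AGENTS.md", Content: "... home instructions ..."},
		{Source: "/home/coder/coder/AGENTS.md", Content: "... project instructions ..."},
	}))
}
```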
## Summary
The `chattool` `list_templates` tool previously returned all templates
in a single response with no popularity signal. On deployments with many
templates (e.g. 71 on dogfood), this wastes tokens and makes it hard for
the AI to pick the right template for broad user questions.
## Changes
Single file: `coderd/chatd/chattool/listtemplates.go`
- **`page` parameter** — optional, 1-indexed, 10 results per page
- **Popularity sort** — queries
`GetWorkspaceUniqueOwnerCountByTemplateIDs` to get active developer
counts, then sorts descending (most popular first). The DB query returns
templates alphabetically, so this explicit sort is needed.
- **`active_developers`** — included on each template item so the agent
can see the signal
- **Pagination metadata** — `page`, `total_pages`, `total_count` in the
response so the agent knows there are more results
- **Updated tool description** — tells the agent that results are
ordered by popularity and paginated
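The sort-then-paginate step can be sketched as below. The types and the `paginateByPopularity` function are hypothetical; only the page size of 10, the 1-indexed `page`, and the metadata field names come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

const pageSize = 10

type templateItem struct {
	Name             string
	ActiveDevelopers int
}

type templatePage struct {
	Templates  []templateItem
	Page       int
	TotalPages int
	TotalCount int
}

// paginateByPopularity sorts a copy of items by active developers
// (descending) and returns the requested 1-indexed page.
func paginateByPopularity(items []templateItem, page int) templatePage {
	sorted := append([]templateItem(nil), items...)
	// Stable sort keeps the incoming (alphabetical) order among ties.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].ActiveDevelopers > sorted[j].ActiveDevelopers
	})
	if page < 1 {
		page = 1
	}
	total := len(sorted)
	start := (page - 1) * pageSize
	if start > total {
		start = total
	}
	end := start + pageSize
	if end > total {
		end = total
	}
	return templatePage{
		Templates:  sorted[start:end],
		Page:       page,
		TotalPages: (total + pageSize - 1) / pageSize,
		TotalCount: total,
	}
}

func main() {
	items := []templateItem{{"docker", 3}, {"kubernetes", 12}, {"aws-linux", 7}}
	fmt.Printf("%+v\n", paginateByPopularity(items, 1))
}
```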
## Frontend
No frontend changes needed. The renderer already reads `rec.templates`
and `rec.count` from the response — the new fields (`page`,
`total_pages`, `total_count`) are additive and safely ignored.
When switching between chats on the agents page, the scroll position was
preserved from the previous chat instead of resetting to show the most
recent messages.
## Problem
Clicking a different chat in the sidebar loaded the new chat's messages
but kept the scroll container at whatever position the user had scrolled
to in the previous chat. This meant users often landed in the middle of
a conversation instead of at the bottom where the latest messages are.
## Fix
Added a `useEffect` in `AgentDetail` that resets `scrollTop` to `0`
whenever `agentId` changes. The scroll container uses
`flex-col-reverse`, so `scrollTop = 0` corresponds to the bottom (most
recent messages).
Fixes https://github.com/coder/coder/issues/22375
Updates `stringutil.Truncate` to properly handle multi-byte UTF-8
characters.
Adds tests for multi-byte truncation with word boundary.
Created by Mux using Opus 4.6
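For context, truncating by runes instead of bytes looks roughly like the sketch below. `truncateRunes` is a hypothetical helper written for illustration; it is not the `stringutil.Truncate` implementation and does not cover the word-boundary option.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes returns at most n runes of s and never cuts a multi-byte
// character in half.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	if utf8.RuneCountInString(s) <= n {
		return s
	}
	count := 0
	for i := range s { // i is the byte offset at which each rune starts
		if count == n {
			return s[:i]
		}
		count++
	}
	return s
}

func main() {
	fmt.Println(truncateRunes("héllo wörld", 4))
	fmt.Println(truncateRunes("日本語のテキスト", 3))
}
```

A byte-based slice such as `s[:n]` can end in the middle of a character and yield invalid UTF-8, which is the failure mode this kind of change avoids.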
Resolves cases where the user is entitled to AI Governance but we don't
show them the page because it's not enabled. If for some reason the user
doesn't have AI Bridge enabled anymore but still wants to access the old
logs page, they now can.
Furthermore, we link to the docs regardless of whether they have AI Bridge
enabled; this is in line with our other settings pages.
Replaces the approach in #22061 (with a cleaner `git history`)
This now ensures that we don't attempt to cause a layout shift when the
sidebars pop-in-out of existence (when scroll locking within `radix`).
This element was receiving the provisioner key daemons and then
immediately filtering them. This led to the default state being a table
with nothing rendered rather than the `<TableEmpty />` we would
expect.
<img width="1133" height="608" alt="image"
src="https://github.com/user-attachments/assets/229edb00-b108-4ec3-ac2f-33633c3e5760"
/>
This previously let auditors view the page though they can't update
anything. In a different fashion to #22382 the user will be able to see
all of this as they're logged in to the application anyway, we can
simply tell them `Sorry, no access`.
setup-go has been sporadically failing to download Go, and we were advised
by a member of the Go team that downloading Go from `storage.googleapis.com`
is not guaranteed (which is what setup-go <= v5.6.0 does).
Also remove the use-preinstalled-go optimization for Windows runners.
setup-go v6 sets GOTOOLCHAIN=local, which prevents the pre-installed
Go from auto-downloading the toolchain specified in go.mod. The windows
optimization with v5 relied on GOTOOLCHAIN=auto. setup-go uses the runner
cache, which is a different caching path but should serve the same purpose.
This change adds user-facing feedback when opening apps in a new window
fails due to popup blocking, replacing a silent no-op with a clear
recovery message. It improves reliability and supportability across
app-launch flows by helping users immediately understand and fix the
issue.
Having to reload the entire page when a template got invalidated was a
poor UX decision. Now we simply refetch the data so that the transition
is much smoother.
## Description
- Updates `wsbuilder` to return a `BuildError` with
`http.StatusBadRequest` to signify a "validation error" on missing or
invalid parameters
- Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets
for which creating a build returns a "validation error" as "validation
failed" and skip further attempts to reconcile.
- Adds a test to verify the above
- Introduces a new Prometheus metric
`coderd_prebuilt_workspaces_preset_validation_failed` to track the above
Closes: https://github.com/coder/coder/issues/21237
---------
Co-authored-by: Cian Johnston <cian@coder.com>
State updates from setIsPublishingDialogOpen,
setLastSuccessfulPublishedVersion, and navigation were firing after
waitFor resolved, causing sporadic act() warnings and timeouts in the
publish template version tests (or so says Claude Sonnet 4.6).
Fixes coder/internal#1369
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Related to #22367
It was pointed out to me that we did mildly regress this by removing a
dividing line in the changes made in #22367. I've restored it in a
better way by taking advantage of `divide-y` and wrapping this in a
proper `<div />`.
<img width="332" height="385" alt="image"
src="https://github.com/user-attachments/assets/2827a9ae-7b54-4c48-aae9-2f6e965e7f8b"
/>
Switch to asserting only on the onChange spy, which is the actual
component contract being tested. Monaco's textarea value is always empty
regardless of model content, so the toHaveValue assertions were
unreliable anyway.
Fixes the new storybook test introduced in #22202
This was a bad smell that the frontend had been working around. The type
was generated as `nil`/`null` instead of an empty `License[]`.
It now returns an empty array, so we can check for no licenses with a
length of `0`.
This pull-request takes our icons shown in the sidebar tree and shows
them alongside the names of the files in the `Source Code` page of our
templates.
Also does a quick de-mui of this page.
<img width="637" height="345" alt="image"
src="https://github.com/user-attachments/assets/f3013eb6-9572-4d05-a683-10bb99b4e802"
/>
Adds a brief "Structured Logging" section to the [AI Bridge
Setup](https://coder.com/docs/ai-coder/ai-bridge/setup) page documenting
the `--aibridge-structured-logging` /
`CODER_AIBRIDGE_STRUCTURED_LOGGING` flag.
Covers:
- How to enable structured logging (CLI flag, env var, YAML)
- The five `record_type` values emitted (`interception_start`,
`interception_end`, `token_usage`, `prompt_usage`, `tool_usage`) and
their key fields
- How to filter for these records in a logging pipeline
Created on behalf of @dannykopping
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Fixes three bugs that caused `coder update` to always re-prompt for
multi-select (`list(string)`) parameters instead of reusing previous
build values:
1. **`isValidTemplateParameterOption` failed for multi-select values**
(`cli/parameterresolver.go`): It compared the entire JSON array string
(e.g. `["vim","emacs"]`) against individual option values, which never
matched. Now parses the JSON array and validates each element
separately.
2. **`RichParameter` ignored previous build value for multi-select**
(`cli/cliui/parameter.go`): The `list(string)` branch always used the
template's default value instead of the `defaultValue` argument (which
carries the previous build's value). Now uses `defaultValue` when
available, falling back to the template default.
3. **Pre-existing crash when `list(string)` has no default value**
(`cli/cliui/parameter.go`): `json.Unmarshal` on an empty string caused
`unexpected end of JSON input`. Now skips unmarshaling when the default
source is empty.
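The element-wise validation described in the first fix can be sketched like this. `isValidMultiSelect` is a hypothetical function written for illustration and is not the code in `cli/parameterresolver.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isValidMultiSelect parses value as a JSON array of strings and checks that
// every element is one of the allowed options.
func isValidMultiSelect(value string, options []string) bool {
	var selected []string
	if err := json.Unmarshal([]byte(value), &selected); err != nil {
		// Includes the empty-string case, which json.Unmarshal rejects.
		return false
	}
	allowed := make(map[string]struct{}, len(options))
	for _, o := range options {
		allowed[o] = struct{}{}
	}
	for _, s := range selected {
		if _, ok := allowed[s]; !ok {
			return false
		}
	}
	return true
}

func main() {
	options := []string{"vim", "emacs", "nano"}
	fmt.Println(isValidMultiSelect(`["vim","emacs"]`, options))
	fmt.Println(isValidMultiSelect(`["vim","ed"]`, options))
}
```

Comparing the whole string `["vim","emacs"]` against `vim`, `emacs`, and `nano` can never match, which is why the previous value was always treated as invalid.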
Fixes #19956
The sonner migration (https://github.com/coder/coder/pull/22258) shows
validation errors in both the inline form field and a toast. Scoping the
assertion to the form element avoids flaky matches against the toast.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The Monaco editor wrapper was only calling `onChange` if the template
file has content, but we want to allow saving an empty file.
Fixes #19721
Claude was used to port tests from jest to vitest, and for the stories.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Kayla はな <mckayla@hey.com>
Claude 3.5 Haiku (`claude-3-5-haiku-20241022`) was retired by Anthropic
on February 19th, 2026. Requests to this model now return errors.
Switch to Claude Haiku 4.5 (`claude-haiku-4-5`), which is the
[recommended
replacement](https://docs.anthropic.com/en/docs/resources/model-deprecations).
---
One-line change in `coderd/taskname/taskname.go` L25:
```diff
- defaultModel = anthropic.ModelClaude3_5HaikuLatest
+ defaultModel = anthropic.ModelClaudeHaiku4_5
```
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
The listen loop in workspaceAgentsExternalAuthListen compared
OAuthExpiry using == which compares `time.Time` internal struct fields
including the `*time.Location` pointer.
`time.LoadLocation` does not cache the returned `*Location` pointer, so
each lib/pq connection gets a distinct pointer for the same timezone.
When `pq.ParseTimestamp()` applies the connection's location to a parsed
timestamp, the resulting time.Time embeds that connection-specific
pointer. If the `sql.DB` pool hands out different connections for the
two GetExternalAuthLink reads, the identical timestamp produces
`time.Time` values where == returns false despite representing the same
instant. This is intermittent because the pool _usually_ reuses the same
connection for sequential queries.
This change uses `.Equal()` to compare instants regardless of location.
Also makes the test's validation call counter atomic to fix a possible
data race between the HTTP server and test goroutines.
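The difference between `==` and `.Equal()` can be reproduced without a database. The sketch below uses two separately allocated `time.FixedZone` locations as a stand-in for the per-connection locations described above; `sameInstantPair` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// sameInstantPair returns two time.Time values for the same instant whose
// Location pointers differ, because each named FixedZone call allocates a
// new *time.Location.
func sameInstantPair() (time.Time, time.Time) {
	locA := time.FixedZone("CET", 3600)
	locB := time.FixedZone("CET", 3600)
	a := time.Date(2026, 1, 1, 12, 0, 0, 0, locA)
	b := time.Date(2026, 1, 1, 12, 0, 0, 0, locB)
	return a, b
}

func main() {
	a, b := sameInstantPair()
	fmt.Println("==    :", a == b)     // false: struct comparison includes the pointer
	fmt.Println("Equal :", a.Equal(b)) // true: compares the instants only
}
```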
Replaces our custom `<GlobalSnackbar />` (MUI Snackbar + event emitter)
with [`sonner`](https://github.com/emilkowalski/sonner). Deletes
`GlobalSnackbar/`, the custom event emitter infra, and migrates ~80
source files to `toast.success()` / `toast.error()` from `sonner`.
- ~47 error toasts now surface API error detail via
`getErrorDetail(error)` in the toast description, not just a generic
message. Coincides with #22229.
- Toast messages follow an `{Action} "{entity}" {result}.` format (e.g.
`User "alice" suspended successfully.`) since toasts persist across
navigation now.
- 17 uses of `toast.promise()` for loading → success → error lifecycle.
- Some toasts include action buttons for quick navigation (e.g. "View
task", "View template").
- Multiple toasts can stack and display simultaneously.
---------
Co-authored-by: Kayla はな <mckayla@hey.com>
This pull-request moves our baseline CSS styles from the MUI theme
(`site/src/theme/mui.ts`) definition to `index.css`. As these are global
styles they should live in one dedicated place not two.
This pull-request removes the last instance of `@mui/material/Chip` from
the codebase and removes it from our `vite.config.mts`, so we no longer
have to cache it 🙂
This pull-request implements a simple filtering logic so that we're able
to pick which model the user actually used when logs were sent to AI
Bridge.
- Add `GET /aibridge/models` API endpoint that returns distinct model
names from AI Bridge interceptions, with pagination and search support
- New `ListAIBridgeModels` SQL query using case-sensitive prefix
matching (`LIKE model || '%'`) to allow B-tree index usage
- Hand-written `ListAuthorizedAIBridgeModels` in `modelqueries.go` for
RBAC authorization filter injection
- `AIBridgeModels` search query parser in searchquery/search.go
(defaults bare terms to the `model` field)
- dbauthz wrappers, dbmetrics, and dbmock implementations for the new
query
<img width="292" height="185" alt="image"
src="https://github.com/user-attachments/assets/134771df-2d26-4c54-acc4-27f58128b351"
/>
## Description
When multiple organizations have templates with the same name, the
Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus
rejects duplicate label combinations. The three `coderd_insights_*`
metrics (`coderd_insights_templates_active_users`,
`coderd_insights_applications_usage_seconds`,
`coderd_insights_parameters`) used only `template_name` as a
distinguishing label, so two templates named e.g. `"openstack-v1"` in
different orgs would produce duplicate metric series.
This adds `organization_name` as a label to all three insight metric
descriptors to disambiguate templates across organizations.
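The collision can be illustrated without the Prometheus client library. The sketch below is a simplified model under stated assumptions: a series is identified by its label values, and `duplicateSeries` (a hypothetical helper) reports which keys would be registered twice.

```go
package main

import "fmt"

type templateRef struct {
	Org  string
	Name string
}

// duplicateSeries builds a label key per template and returns the keys that
// occur more than once. With withOrg set, organization_name is part of the key.
func duplicateSeries(templates []templateRef, withOrg bool) []string {
	seen := make(map[string]bool)
	var dups []string
	for _, t := range templates {
		key := "template_name=" + t.Name
		if withOrg {
			key = "organization_name=" + t.Org + "," + key
		}
		if seen[key] {
			dups = append(dups, key)
		}
		seen[key] = true
	}
	return dups
}

func main() {
	templates := []templateRef{
		{Org: "coder", Name: "openstack-v1"},
		{Org: "acme", Name: "openstack-v1"},
	}
	fmt.Println("without org label:", duplicateSeries(templates, false))
	fmt.Println("with org label:   ", duplicateSeries(templates, true))
}
```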
## Changes
**`coderd/prometheusmetrics/insights/metricscollector.go`**:
- Added `organization_name` label to all three metric descriptors
- Added `organizationNames` field (template ID → org name) to the
`insightsData` struct
- In `doTick`: after fetching templates, collect unique org IDs, fetch
organizations via `GetOrganizations`, and build a
template-ID-to-org-name mapping
- In `Collect()`: pass the organization name as an additional label
value in every `MustNewConstMetric` call
**`coderd/prometheusmetrics/insights/testdata/insights-metrics.json`**:
Updated golden file to include `organization_name=coder` in all metric
label keys.
Fixes #21748
- Previously all tests shared the global `http.Transport`, so calling
`Close` in one test would close connections presumed to be idle that
belonged to other tests.
fixes https://github.com/coder/internal/issues/112
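One way to isolate tests from each other is to give each its own transport. The sketch below is an assumption about the general technique, not the change made in this PR; `newIsolatedClient` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
)

// newIsolatedClient returns a client backed by its own transport, plus a
// cleanup function that only closes that transport's idle connections.
func newIsolatedClient() (*http.Client, func()) {
	base, ok := http.DefaultTransport.(*http.Transport)
	if !ok {
		// DefaultTransport may have been replaced; fall back to a fresh one.
		base = &http.Transport{}
	}
	tr := base.Clone()
	return &http.Client{Transport: tr}, tr.CloseIdleConnections
}

func main() {
	c1, close1 := newIsolatedClient()
	c2, close2 := newIsolatedClient()
	defer close1()
	defer close2()
	fmt.Println("separate transports:", c1.Transport != c2.Transport)
}
```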
Fixes #22030
## Problem
When a template has `require_active_version = true` and a workspace is
outdated, the web UI always shows "Update and start" as the **only**
button (for all users including admins), but `coder start` starts with
the old version. For admins, this silently succeeds on the stale
version. For non-admins, it goes through a clunky 403→retry path. This
also affects the VS Code extension, which calls `coder start --yes`
under the hood.
## Root Cause
`buildWorkspaceStartRequest()` in `cli/start.go` checks
`workspace.AutomaticUpdates == "always"` but ignores
`workspace.TemplateRequireActiveVersion`. The server-side autostart
already ORs both settings together:
```go
// coderd/autobuild/lifecycle_executor.go
func useActiveVersion(opts, ws) bool {
	return opts.RequireActiveVersion || ws.AutomaticUpdates == "always"
}
```
The CLI was missing the `RequireActiveVersion` check.
## Fix
Add `workspace.TemplateRequireActiveVersion` to the existing OR
condition:
```go
// Before:
if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || action == WorkspaceUpdate {
// After:
if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || workspace.TemplateRequireActiveVersion || action == WorkspaceUpdate {
```
Now `coder start` and `coder restart` proactively use the active
template version when `require_active_version` is set, matching the web
UI and server autostart behavior. The 403→retry fallback remains as a
safety net but is no longer the primary path for any user.
## Testing
Updated `enterprise/cli/start_test.go` — all user types (owner, template
admin, ACL admin, group ACL admin, member) now expect the active version
when `require_active_version` is set, and verify the 403→retry message
does NOT appear.
When AgentAPI is configured, `WithTaskReporter` unconditionally
overrides all self-reported states to `working`. The intent was to
distrust the agent's `idle` and rely on the screen watcher, but the
override also blocks `failure` and `complete`, which only the agent can
produce (the screen watcher only knows `running`/`stable`). Tasks get
stuck as `working` or `null` forever.
Now only `idle` is overridden to `working`; `failure`, `complete`, and
`working` pass through as-is.
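The narrowed override reduces to a small mapping. `effectiveState` below is a hypothetical function for illustration; the state names are the ones mentioned above.

```go
package main

import "fmt"

// effectiveState returns the state to record for a self-reported state.
// Only "idle" is distrusted when AgentAPI is configured; every other state
// passes through unchanged.
func effectiveState(reported string, agentAPIConfigured bool) string {
	if agentAPIConfigured && reported == "idle" {
		return "working"
	}
	return reported
}

func main() {
	for _, s := range []string{"idle", "working", "failure", "complete"} {
		fmt.Printf("%-8s -> %s\n", s, effectiveState(s, true))
	}
}
```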
Also:
- Remove misplaced unconditional `"Failed to watch screen events"` log
that fired on every startup
- Add SSE reconnection with exponential backoff (1s-30s) in
`startWatcher` so it recovers from dropped connections instead of dying
silently
- Add `complete` to the `coder_report_task` tool enum, which the
`coder/claude-code` registry module already instructs agents to use but
was missing from the schema
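The reconnection delays for a 1s-30s exponential backoff can be sketched as follows. The doubling factor, the absence of jitter, and the function names are assumptions; only the 1 second and 30 second bounds come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minDelay = 1 * time.Second
	maxDelay = 30 * time.Second
)

// nextDelay doubles the previous delay and clamps it to [minDelay, maxDelay].
func nextDelay(prev time.Duration) time.Duration {
	if prev < minDelay {
		return minDelay
	}
	next := prev * 2
	if next > maxDelay {
		return maxDelay
	}
	return next
}

// schedule returns the first n reconnect delays, starting from no delay.
func schedule(n int) []time.Duration {
	delays := make([]time.Duration, 0, n)
	var d time.Duration
	for i := 0; i < n; i++ {
		d = nextDelay(d)
		delays = append(delays, d)
	}
	return delays
}

func main() {
	fmt.Println(schedule(7))
}
```

In a real watcher loop the delay would typically be reset to zero after a successful connection, so that a later drop starts again from the minimum.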
Refs coder/internal#1350
Relates to https://github.com/coder/internal/issues/1259
Adds new database queries and telemetry collection functions to gather
task lifecycle events (pause/resume cycles, idle time) for analytics.
Task events track pause/resume activity, idle duration before pausing,
paused duration, and time from resume to first app status, filtered to
recent activity based on the telemetry snapshot interval.
🤖 Created with Mux (Opus 4.6).
## Summary
Moves expired token filtering from client-side to server-side by adding
an `include_expired` parameter to the `GetAPIKeysByLoginType` and
`GetAPIKeysByUserID` database queries. This is more efficient for large
deployments with many expired/short-lived tokens.
## Changes
- Add `include_expired` parameter to SQL queries using `OR`
short-circuit
- Add `include_expired` query parameter to `GET
/users/{user}/keys/tokens`
- Add `IncludeExpired` field to `codersdk.TokensFilter`
- Remove client-side filtering from CLI `tokens list` command
- Add `TestTokensFilterExpired` test
Fixes coder/internal#1357
part of https://github.com/coder/coder/issues/21335
This moves updating app status (used by Tasks) into the workspace agent
API over dRPC. This will allow us to update the status without having to
re-authenticate each time, like we would with an HTTP PATCH request.
Further PRs in this stack will pipe these requests through from the CLI MCP
server to the agentsock and finally to this dRPC call to coderd.
## Problem
When a template adds a new immutable parameter, `coder update
--parameter param=value` fails with:
```
error: start workspace: parameter "machine_type" is immutable and cannot be updated
```
The interactive prompt handles this correctly (allows setting first-time
immutable params), but the CLI `--parameter` flag path does not.
## Root Cause
In `cli/parameterresolver.go`, `verifyConstraints()` runs before the
interactive prompt and unconditionally rejects any immutable parameter
during updates. It doesn't distinguish between **new** immutable
parameters (first-time use, should be allowed) and **existing** ones
(already set, should be blocked from changing).
## Fix
Added an `isFirstTimeUse` check to the immutable parameter constraint,
matching the logic already used by the interactive prompt path (line
323). New immutable parameters can now be set via `--parameter`, while
existing immutable parameters are still blocked from being changed.
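The distinction between new and existing immutable parameters can be sketched as below. The `templateParam` type, the `checkImmutable` function, and the use of a map of previous build values are hypothetical; only the `isFirstTimeUse` name and the error wording come from the text above.

```go
package main

import "fmt"

type templateParam struct {
	Name    string
	Mutable bool
}

// checkImmutable rejects a change to an immutable parameter during an update
// unless the parameter has no value from a previous build.
func checkImmutable(p templateParam, previous map[string]string, updating bool) error {
	_, hadValue := previous[p.Name]
	isFirstTimeUse := !hadValue
	if updating && !p.Mutable && !isFirstTimeUse {
		return fmt.Errorf("parameter %q is immutable and cannot be updated", p.Name)
	}
	return nil
}

func main() {
	previous := map[string]string{"region": "us-east"}
	fmt.Println(checkImmutable(templateParam{Name: "machine_type"}, previous, true))
	fmt.Println(checkImmutable(templateParam{Name: "region"}, previous, true))
}
```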
## Testing
Added `TestUpdateValidateRichParameters/NewImmutableParameterViaFlag`
which:
1. Creates a workspace with a mutable parameter
2. Updates the template to add a new immutable parameter
3. Runs `coder update --parameter immutable_param=value`
4. Verifies the update succeeds and the parameter is set correctly
Fixes #22164
The provisioner state for a workspace build was being loaded for every
long-lived agent rpc connection. Since this state can be anywhere from
kilobytes to megabytes this can gradually cause the `coderd` memory
footprint to grow over time. It's also a lot of unnecessary allocations
for every query that fetches a workspace build since only a few callers
ever actually reference the provisioner state.
This PR removes it from the returned workspace build and adds a query to
fetch the provisioner state explicitly.
Adds two new icons to the icon library:
- **`anthropic.svg`** — Anthropic logo
- **`gemini-monochrome.svg`** — Gemini logo, monochrome variant
Both use `monochrome` theme handling to adapt for dark and light
backgrounds.
### Changes
- Added `anthropic.svg` and `gemini-monochrome.svg` to
`site/static/icon/`
- Registered both in `site/src/theme/icons.json` (alphabetically sorted)
- Added `monochrome` theme handling for both in
`site/src/theme/externalImages.ts`
---
Created on behalf of @tracyjohnsonux
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Closes https://github.com/coder/internal/issues/1353
This does not solve the issue, but the error is currently opaque. The
test now fails when the init fails, which should surface the underlying error.
Bumps [github.com/gohugoio/hugo](https://github.com/gohugoio/hugo) from
0.155.2 to 0.156.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/gohugoio/hugo/releases">github.com/gohugoio/hugo's
releases</a>.</em></p>
<blockquote>
<h2>v0.156.0</h2>
<p>This release brings significant speedups of <a
href="https://gohugo.io/functions/collections/where/#article">collections.Where</a>
and <a
href="https://gohugo.io/functions/collections/sort/#article">collections.Sort</a>
– but this is mostly a "spring cleaning" release, to make the
API cleaner and simpler to understand/document.</p>
<h2>Deprecated</h2>
<ul>
<li>Site.AllPages is Deprecated</li>
<li>Site.BuildDrafts is Deprecated</li>
<li>Site.Languages is Deprecated</li>
<li>Site.Data is deprecated, use hugo.Data</li>
<li>Page.Sites and Site.Sites is Deprecated, use hugo.Sites</li>
</ul>
<p>See <a
href="https://discourse.gohugo.io/t/deprecations-in-v0-156-0/56732">this
topic</a> for more info.</p>
<h2>Removed</h2>
<p>These have all been deprecated at least since <code>v0.136.0</code>
and any usage have been logged as an error for a long time:</p>
<p>Template functions</p>
<ul>
<li>data.GetCSV / getCSV (use resources.GetRemote)</li>
<li>data.GetJSON / getJSON (use resources.GetRemote)</li>
<li>crypto.FNV32a (use hash.FNV32a)</li>
<li>resources.Babel (use js.Babel)</li>
<li>resources.PostCSS (use css.PostCSS)</li>
<li>resources.ToCSS (use css.Sass)</li>
</ul>
<p>Page methods:</p>
<ul>
<li>.Page.NextPage (use .Page.Next)</li>
<li>.Page.PrevPage (use .Page.Prev)</li>
</ul>
<p>Paginator:</p>
<ul>
<li>.Paginator.PageSize (use .Paginator.PagerSize)</li>
</ul>
<p>Site methods:</p>
<ul>
<li>.Site.LastChange (use .Site.Lastmod)</li>
<li>.Site.Author (use .Site.Params.Author)</li>
<li>.Site.Authors (use .Site.Params.Authors)</li>
<li>.Site.Social (use .Site.Params.Social)</li>
<li>.Site.IsMultiLingual (use hugo.IsMultilingual)</li>
<li>.Sites.First (use .Sites.Default)</li>
</ul>
<p>Site config:</p>
<ul>
<li>paginate (use pagination.pagerSize)</li>
<li>paginatePath (use pagination.path)</li>
</ul>
<p>File caches:</p>
<ul>
<li>getjson cache</li>
<li>getcsv cache</li>
</ul>
<h2>Notes</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/gohugoio/hugo/commit/9d914726dee87b0e8e3d7890d660221bde372eec"><code>9d91472</code></a>
releaser: Bump versions for release of 0.156.0</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/86aa62524f8bc36a04c8e0c0f76d1fd952585509"><code>86aa625</code></a>
hugolib: Move site.Data to hugo.Data, deprecate
Site.AllPages/BuildDrafts/Lan...</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/d8ec0eeeaf2ff078565fddbbab5565a65b86346c"><code>d8ec0ee</code></a>
build(deps): bump google.golang.org/api from 0.255.0 to 0.267.0</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/4148eded9c5f90036c47d241faac73e1d0c6ee70"><code>4148ede</code></a>
hugolib: Add Page.Sites to Site.Sites deprecation notice</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/bba2aed3527e5c6086244c0ab76192b35b6ffa73"><code>bba2aed</code></a>
hugolib: Simplify sites collection</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/29b8e17d29ad38621cf6c7c104309bcedf5c20c5"><code>29b8e17</code></a>
hugolib: Adjust hugo.Sites.Default</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/3c823408ee51bbfbad847d4b9f926ba813097185"><code>3c82340</code></a>
Move common/hugo/HugoInfo to resources/page</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/3f9d0ad2b6045849cbafe133cb9fb82ed5f5ee06"><code>3f9d0ad</code></a>
commands: Fix --panicOnWarning flag having no effect with module version
warn...</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/ab62320d6bceece0faa7029f8bd79d546d0f64be"><code>ab62320</code></a>
hugolib: Add hugo.Sites and .Site.IsDefault(), modify .Site.Sites</li>
<li><a
href="https://github.com/gohugoio/hugo/commit/21be4afd49767eb63e3a2304b4c10816c86f799d"><code>21be4af</code></a>
build(deps): bump github.com/bep/textandbinarywriter</li>
<li>Additional commits viewable in <a
href="https://github.com/gohugoio/hugo/compare/v0.155.2...v0.156.0">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps ubuntu from `c7eb020` to `3ba65aa`.
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Summary
Harden the OAuth2 provider with multiple security fixes addressing
`coder/security#121` (CSRF session takeover) and converge on OAuth 2.1
compliance.
### Security Fixes
| Fix | Description | Commits |
|-----|-------------|---------|
| **CSRF on `/oauth2/authorize`** | Enforce CSRF protection on the authorize endpoint POST (consent form submission) | `ba7d646`, `b94a64e` |
| **Clickjacking: `frame-ancestors` CSP** | Prevent consent page from being iframed (`Content-Security-Policy: frame-ancestors 'none'` + `X-Frame-Options: DENY`) | `597aeb2` |
| **Exact redirect URI matching** | Changed from prefix matching to full string exact matching per OAuth 2.1 §4.1.2.1 | `73d64b1`, `93897f1` |
| **Store & verify `redirect_uri`** | Store redirect_uri with auth code in DB, verify at token exchange matches exactly (RFC 6749 §4.1.3) | `50569b9`, `d7ca315` |
| **Mandatory PKCE** | Require `code_challenge` at authorization (for `response_type=code`) + unconditional `code_verifier` verification at token exchange | `d7ca315`, `1cda1a9` |
| **Reject implicit grant** | `response_type=token` now returns `unsupported_response_type` error page (OAuth 2.1 removes implicit flow) | `d7ca315`, `91b8863` |
### Changes by File
**`coderd/httpmw/csrf.go`** — Extended the CSRF `ExemptFunc` to enforce
CSRF on `/oauth2/authorize` in addition to `/api` routes. The consent
form POST is now CSRF-protected to prevent cross-site authorization code
theft.
**`site/site.go`** — Added `Content-Security-Policy: frame-ancestors
'none'` and `X-Frame-Options: DENY` headers to `RenderOAuthAllowPage`
(consent page only — does not affect the SPA/global CSP used by AI
tasks).
**`coderd/httpapi/queryparams.go`** — Changed `RedirectURL` from prefix
matching (`strings.HasPrefix(v.Path, base.Path)`) to full URI exact
matching (`v.String() != base.String()`), comparing scheme, host, path,
and query.
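A minimal sketch of the difference between the two checks, assuming both URIs are already parsed. The helper names `prefixMatch`, `exactMatch`, and `mustParse` and the example URLs are hypothetical and are not taken from `queryparams.go`.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// prefixMatch mirrors the old behavior described above: same host and a
// path that merely starts with the registered callback path.
func prefixMatch(v, base *url.URL) bool {
	return v.Host == base.Host && strings.HasPrefix(v.Path, base.Path)
}

// exactMatch compares the full serialized URI, so scheme, host, path,
// and query all have to agree.
func exactMatch(v, base *url.URL) bool {
	return v.String() == base.String()
}

// mustParse is a small helper for the example only.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	base := mustParse("https://app.example.com/callback")
	sub := mustParse("https://app.example.com/callback/extra")

	fmt.Println("prefix match accepts sub-path:", prefixMatch(sub, base))
	fmt.Println("exact match accepts sub-path:", exactMatch(sub, base))
}
```

With prefix matching, a sub-path of the registered callback is accepted; with exact matching, only the registered URI itself passes.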
**`coderd/oauth2provider/authorize.go`** — Added PKCE enforcement:
`code_challenge` is required when `response_type=code` (via a
conditional check, not `RequiredNotEmpty`, so `response_type=token` can
reach the explicit rejection path). `ShowAuthorizePage` (GET) validates
`response_type` before rendering and returns a 400 error page for
unsupported types. `ProcessAuthorize` (POST) stores the `redirect_uri`
with the auth code when explicitly provided.
**`coderd/oauth2provider/tokens.go`** — PKCE verification is now
unconditional (not gated on `code_challenge` being present in DB). If
the stored code has a `redirect_uri`, the token endpoint verifies it
matches exactly — mismatch returns `errBadCode` → `invalid_grant`.
Missing `code_verifier` returns `invalid_grant`.
**`codersdk/oauth2.go`** — `OAuth2ProviderResponseTypeToken` constant
and `Valid()` acceptance are **kept** so the authorize handler can parse
`response_type=token` and return the proper `unsupported_response_type`
error rather than failing at parameter validation.
**`coderd/database/migrations/000421_*`** — Added `redirect_uri text`
column to `oauth2_provider_app_codes`.
### Design Decisions
**`state` parameter remains optional** — The plan initially required
`state` via `RequiredNotEmpty`, but this was reverted in `376a753` to
avoid breaking existing clients. The `state` is still hashed and stored
when provided (via `state_hash` column), securing clients that opt in.
**`response_type=token` kept in `Valid()`** — Removing it from `Valid()`
would cause the parameter parser to reject the request before the
authorize handler can return the proper `unsupported_response_type`
error. The constant is kept for correct error handling flow.
**CSP scoped to consent page only** — `frame-ancestors 'none'` is set
only on the OAuth consent page renderer, not globally. The SPA/global
CSP was previously changed to allow framing for AI tasks
([#18102](https://github.com/coder/coder/pull/18102)); this change does
not regress that.
### Out of Scope (follow-up PRs)
- Bearer tokens in query strings (needs internal caller audit)
- Scope enforcement on OAuth2 tokens
- Rate limiting on dynamic client registration
---
<details>
<summary>📋 Implementation Plan</summary>
# Plan: Harden OAuth2 Provider — Security Fixes + OAuth 2.1 Compliance
## Context & Why
Security issue `coder/security#121` reports a critical session takeover
via CSRF on the OAuth2 provider. This plan covers all remaining security
fixes from that issue **plus** convergence on OAuth 2.1 requirements.
The goal is a single PR that closes all actionable gaps.
## Current State (already committed on branch `csrf-sjx1`)
| Fix | Status | Commits |
|-----|--------|---------|
| Fix 1: CSRF on `/oauth2/authorize` | ✅ Done | `ba7d646`, `b94a64e` |
| CSRF token in consent form HTML | ✅ Done | `b94a64e` |
| `state_hash` column + storage | ✅ Done (hash stored, but state still optional) | `9167d83`, `b94a64e` |
| Tests for CSRF + state hash | ✅ Done | `e4119b5` |
## Remaining Work
### ~~Fix 2 — Require `state` parameter~~ (DROPPED)
> **Decision:** Do not enforce `state` as required. The `state`
parameter is still hashed and stored when provided (via
`hashOAuth2State` / `state_hash` column from prior commits), but clients
are not forced to supply it. This avoids breaking existing integrations
that omit state.
**Rollback:** Remove `"state"` from the `RequiredNotEmpty` call in
`coderd/oauth2provider/authorize.go:42`:
```go
// BEFORE (current on branch)
p.RequiredNotEmpty("response_type", "client_id", "state", "code_challenge")
// AFTER
p.RequiredNotEmpty("response_type", "client_id", "code_challenge")
```
No test changes needed — tests already pass `state` voluntarily.
### Fix 4 — Exact redirect URI matching
Currently `coderd/httpapi/queryparams.go:233` uses prefix matching:
```go
// CURRENT — prefix match
if v.Host != base.Host || !strings.HasPrefix(v.Path, base.Path) {
```
OAuth 2.1 requires **exact string matching**. Change to:
```go
// AFTER — exact match (OAuth 2.1 §4.1.2.1)
if v.Host != base.Host || v.Path != base.Path {
```
**File: `coderd/httpapi/queryparams.go` — `RedirectURL` method**
Also update the error message from "must be a subset of" to "must
exactly match".
**Additionally**, store `redirect_uri` with the auth code and verify at
the token endpoint (RFC 6749 §4.1.3):
1. **New migration** (same migration file or a new `000421`): Add
`redirect_uri text` column to `oauth2_provider_app_codes`
2. **Update INSERT query** in `coderd/database/queries/oauth2.sql` to
include `redirect_uri`
3. **`coderd/oauth2provider/authorize.go`**: Store
`params.redirectURL.String()` when inserting the code
4. **`coderd/oauth2provider/tokens.go`**: After retrieving the code from
DB, verify that `redirect_uri` from the token request matches the stored
value exactly. Currently `tokens.go:103` calls `p.RedirectURL(vals,
callbackURL, "redirect_uri")` for prefix validation only — it must
compare against the stored redirect_uri from the code, not just the
app's callback URL.
<details>
<summary>Why both exact match AND store+verify?</summary>
Exact matching at the authorize endpoint prevents open redirectors
(attacker can't use a sub-path).
Storing and verifying at the token endpoint prevents code injection — an
attacker who steals a code can't exchange it with a different
redirect_uri than was originally authorized. This is required by RFC
6749 §4.1.3 and OAuth 2.1.
</details>
### Fix 7 — `frame-ancestors` CSP on consent page
The consent page can be iframed by a workspace app (same-site), which is
the attack vector. Add a `Content-Security-Policy` header to prevent
framing.
**File: `site/site.go` — `RenderOAuthAllowPage` function (~line 731)**
Before writing the response, add:
```go
func RenderOAuthAllowPage(rw http.ResponseWriter, r *http.Request, data RenderOAuthAllowData) {
	rw.Header().Set("Content-Type", "text/html; charset=utf-8")
	// Prevent the consent page from being framed to mitigate
	// clickjacking attacks (coder/security#121).
	rw.Header().Set("Content-Security-Policy", "frame-ancestors 'none'")
	rw.Header().Set("X-Frame-Options", "DENY")
	...
```
Both headers for defense-in-depth (CSP for modern browsers,
X-Frame-Options for legacy).
### OAuth 2.1 — Mandatory PKCE
Currently PKCE is checked only when `code_challenge` was provided during
authorization (`tokens.go:258`):
```go
// CURRENT — conditional check
if dbCode.CodeChallenge.Valid && dbCode.CodeChallenge.String != "" {
// verify PKCE
}
```
OAuth 2.1 requires PKCE for ALL authorization code flows. Change to:
**File: `coderd/oauth2provider/authorize.go`** — Add `"code_challenge"`
to required params:
```go
p.RequiredNotEmpty("response_type", "client_id", "code_challenge")
```
**File: `coderd/oauth2provider/tokens.go:257-265`** — Make PKCE
verification unconditional:
```go
// AFTER — PKCE always required (OAuth 2.1)
if req.CodeVerifier == "" {
	return codersdk.OAuth2TokenResponse{}, errInvalidPKCE
}
if !dbCode.CodeChallenge.Valid || dbCode.CodeChallenge.String == "" {
	// Code was issued without a challenge — should not happen
	// with the authorize endpoint enforcement, but defend in
	// depth.
	return codersdk.OAuth2TokenResponse{}, errInvalidPKCE
}
if !VerifyPKCE(dbCode.CodeChallenge.String, req.CodeVerifier) {
	return codersdk.OAuth2TokenResponse{}, errInvalidPKCE
}
```
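For readers unfamiliar with PKCE, the following is a rough sketch of what an S256 check does (RFC 7636). The body of `VerifyPKCE` is not shown in this plan, so `verifyS256` is a hypothetical stand-in that assumes the S256 method; the sample values are the test vector from RFC 7636 Appendix B.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// verifyS256 reports whether challenge equals
// BASE64URL(SHA256(verifier)) without padding, as defined for the
// S256 code challenge method. Hypothetical helper for illustration.
func verifyS256(challenge, verifier string) bool {
	sum := sha256.Sum256([]byte(verifier))
	computed := base64.RawURLEncoding.EncodeToString(sum[:])
	// Constant-time comparison avoids leaking how many bytes matched.
	return subtle.ConstantTimeCompare([]byte(computed), []byte(challenge)) == 1
}

func main() {
	// Test vector from RFC 7636, Appendix B.
	verifier := "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
	challenge := "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"

	fmt.Println("matching verifier:", verifyS256(challenge, verifier))
	fmt.Println("wrong verifier:", verifyS256(challenge, "not-the-verifier"))
}
```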
**File: `codersdk/oauth2.go`** — Remove
`OAuth2ProviderResponseTypeToken` from the enum or reject it explicitly
in the authorize handler. Currently it's defined at line 216 but the
handler ignores `response_type` and always issues a code. We should
either:
- (a) Remove the `"token"` variant from the enum and reject it with
`unsupported_response_type`, OR
- (b) Add an explicit check in `ProcessAuthorize` that rejects
`response_type=token`
Option (b) is simpler and more backwards-compatible:
```go
// In ProcessAuthorize, after extracting params:
if params.responseType != codersdk.OAuth2ProviderResponseTypeCode {
	httpapi.WriteOAuth2Error(ctx, rw, http.StatusBadRequest,
		codersdk.OAuth2ErrorCodeUnsupportedResponseType,
		"Only response_type=code is supported")
	return
}
```
### OAuth 2.1 — Bearer tokens in query strings
`coderd/httpmw/apikey.go:743` accepts `access_token` from URL query
parameters. OAuth 2.1 prohibits this. However, this may be used
internally (e.g., workspace apps, DERP). Need to audit callers before
removing.
**Approach:** This is a larger change with potential breakage. Mark as a
**separate follow-up issue** rather than including in this PR. Document
the finding.
### OAuth 2.1 — Removed flows
✅ **Already compliant.** `tokens.go` only supports `authorization_code`
and `refresh_token` grant types. The implicit grant
(`response_type=token`) will be explicitly rejected per the PKCE section
above.
### OAuth 2.1 — Refresh token rotation
✅ **Already compliant.** `tokens.go:442` deletes the old API key when a
refresh token is used.
## Migration Plan
All DB changes can go in a single new migration (or extend 000420 if the
branch is rebased before merge). Columns to add:
- `redirect_uri text` on `oauth2_provider_app_codes`
The `state_hash` column is already added by migration 000420.
## Implementation Order
1. **Fix 7** — CSP headers on consent page (isolated, no deps)
2. ~~**Fix 2** — Require `state` parameter~~ (DROPPED — state stays
optional)
3. **Fix 4** — Exact redirect URI matching + store/verify redirect_uri
4. **PKCE mandatory** — Require `code_challenge` + reject
`response_type=token`
5. **Rollback** — Remove `"state"` from `RequiredNotEmpty` in
`authorize.go`
6. **Tests** — Update/add tests for all changes
7. **`make gen`** after DB changes
## Out of Scope (separate PRs)
- Bearer tokens in query strings (needs internal caller audit)
- Scope enforcement on OAuth2 tokens
- Rate limiting / quota on dynamic client registration
</details>
---
_Generated with [`mux`](https://github.com/coder/mux) • Model:
`anthropic:claude-opus-4-6` • Thinking: `xhigh`_
This pull-request removes all the magic of `@mui/material/Alert` 🥳 We're
officially free of any alerts that are being handled by Material UI so
this is dead code.
After a PostgreSQL round-trip, job timestamps lose their monotonic
clock component, making the subtraction susceptible to wall-clock
adjustments producing a small negative delta. Floor at 1ms since
a zero or negative queue wait is meaningless. Fixes TestProvisionerJobQueueWaitMetric
flakes where small negative values (~ -2ms) are observed.
Use the server-rendered meta tag value as an intermediate fallback for
theme preference, between the JS-fetched value and the default theme.
This ensures the correct theme is applied before the API response loads.
Fixes #20050
Previously, when secret deployment options like CODER_OIDC_CLIENT_SECRET
were populated, the API correctly returned the "secret": "true"
annotation, but the UI did not indicate that these secrets were
configured. The UI would show "Not set" regardless of whether the secret
was set or not.
Now, the UI checks both the secret annotation and the value_source
field. When a secret is configured (value_source is set), it displays
"Set" to indicate the secret is populated. When a secret is not
configured, it displays "Not set".
Fixes #18913
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
`--secure-auth-cookie` now automatically sources its default value from `--access-url`.
If the access URL uses HTTPS, secure is set to `true`.
To revert to the old behavior, set the value explicitly to `false`.
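The defaulting rule can be sketched as follows. `defaultSecureCookie` is a hypothetical helper and the example URLs are placeholders; how the real flag wiring detects an explicitly set value is not shown in this description.

```go
package main

import (
	"fmt"
	"net/url"
)

// defaultSecureCookie derives the default for the secure-cookie setting
// from the access URL: true only when the scheme is https.
func defaultSecureCookie(accessURL string) (bool, error) {
	u, err := url.Parse(accessURL)
	if err != nil {
		return false, err
	}
	return u.Scheme == "https", nil
}

func main() {
	for _, raw := range []string{"https://coder.example.com", "http://localhost:3000"} {
		secure, err := defaultSecureCookie(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> secure default %v\n", raw, secure)
	}
}
```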
If a deployment has two domains, overriding the OIDC URL allows the OIDC
redirect to differ from the access_url.
In response to https://github.com/coder/coder/discussions/21500
**This config setting is hidden by default**
In relation to
[`internal#1281`](https://github.com/coder/internal/issues/1281)
Remove the `soft_limit` field from the `Feature` type and simplify
license limit handling. This change:
- Removes the `soft_limit` field from the API and SDK
- Uses the soft limit value as the single `limit` value in the UI and
API
- Simplifies warning logic to only show warnings when the limit is
exceeded
- Updates tests to reflect the new behavior
- Updates the UI to use the single limit value for display
In relation to
[`internal#1281`](https://github.com/coder/internal/issues/1281)
Managed agent workspace build limits are now advisory only. Breaching
the limit no longer blocks workspace creation — it only surfaces a
warning.
- Removed hard-limit enforcement in `checkAIBuildUsage` so AI task
builds are always permitted regardless of managed agent count.
- Updated the license warning to remove "Further managed agent builds
will be blocked." verbiage.
- Updated tests to assert builds succeed beyond the limit instead of
failing.
- Removed the "Limit" display from the `ManagedAgentsConsumption`
progress bar — the bar is now relative to the included allowance (soft
limit) only, and turns orange when usage exceeds it.
Bonus:
- De-MUI'd `LicenseBannerView` — replaced Emotion CSS and MUI `Link`
with Tailwind classes.
- Added `highlight-orange` color token to the Tailwind theme.
This pull-request implements animations for each of our `<ChevronDown />`
(and a few other chevrons) so that everything is uniform with
`<Autocomplete />`.
Based on previous PR reviews it appears we don't want to use these
components anymore. We previously deprecated the use of `<Stack />` in
this way in #20973 so it would be good to take the same approach here.
This PR stops Vite from repeatedly re-optimizing certain MUI modules
during development, which was triggering an HMR feedback loop and
crashing my dev environment on specific pages — most notably
`<LicensesSettingsPage />`.
After some digging, the culprit turned out to be:
```ts
import Paper from "@mui/material/Paper";
```
Importing components this way causes Vite to continuously re-optimize
them during HMR, which leads to the page refreshing over and over until
the dev server taps out and `504 "Outdated Optimize Dep"`'s us.
The fix ensures these modules are computed once at startup instead of
being reprocessed on every hot update. Development is now stable, and
the infinite refresh loop is gone.
I did experiment with using globs to handle this more generically, but
since they’re still early-access in this context, they ended up breaking
things 😔
In short: fewer re-optimizations, no more HMR meltdown, and a much
calmer dev experience.
Continuation of #22186 (without `vitest` addon)
Upgrades the dependency so that we can actively make use of new
features/speed/less-dependencies. Short simple sweet and lovely 🙂
## Summary
Custom roles that can create workspaces on behalf of other users need to
be able to list users to populate the owner dropdown in the workspace
creation UI. Previously, this required a separate `user:read`
permission, causing the dropdown to fail for custom roles.
## Changes
- Modified `GetUsers` in `dbauthz` to check if the user can create
workspaces for any owner (`workspace:create` with `owner_id: *`)
- If the user has this permission, they can list all users without
needing explicit `user:read` permission
- Added tests to verify the new behavior
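The authorization fallback can be pictured with the sketch below. It is illustrative only: `authorizeFunc`, `canListUsers`, and the string encoding of actions and objects are hypothetical and do not reflect the real `dbauthz` or RBAC types.

```go
package main

import "fmt"

// authorizeFunc is a stand-in for an RBAC check: it reports whether the
// caller may perform action on object.
type authorizeFunc func(action, object string) bool

// canListUsers allows listing users when the caller has user read
// permission, or when they can create workspaces for any owner.
func canListUsers(authorize authorizeFunc) bool {
	if authorize("read", "user") {
		return true
	}
	return authorize("create", "workspace:owner=*")
}

func main() {
	// A custom role that can only create workspaces on behalf of others.
	customRole := func(action, object string) bool {
		return action == "create" && object == "workspace:owner=*"
	}
	// A role with no relevant permissions.
	noPerms := func(action, object string) bool { return false }

	fmt.Println("custom role can list users:", canListUsers(customRole))
	fmt.Println("no permissions can list users:", canListUsers(noPerms))
}
```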
## Testing
- Updated mock tests to assert the new authorization check
- Added integration tests for both positive and negative cases
Fixes #18203
Parent agents were re-using AuthInstanceID when spawning child agents.
This caused GetWorkspaceAgentByInstanceID to return the most recently
created sub agent instead of the parent when the parent tried to refetch
its own manifest.
Fix by not reusing AuthInstanceID for sub agents, and updating
GetWorkspaceAgentByInstanceID to filter them out entirely.
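The lookup bug and the filter can be illustrated with an in-memory sketch. The `agent` struct and `lookupByInstanceID` are hypothetical; the real fix lives in a SQL query that is not shown here.

```go
package main

import "fmt"

// agent is a simplified, hypothetical view of a workspace agent row.
type agent struct {
	Name           string
	AuthInstanceID string
	IsSubAgent     bool
}

// lookupByInstanceID returns the most recently created agent with the
// given instance ID, skipping sub agents entirely. The slice is assumed
// to be ordered oldest first.
func lookupByInstanceID(agents []agent, instanceID string) (agent, bool) {
	var found agent
	ok := false
	for _, a := range agents {
		if a.IsSubAgent || a.AuthInstanceID != instanceID {
			continue
		}
		found, ok = a, true
	}
	return found, ok
}

func main() {
	agents := []agent{
		{Name: "parent", AuthInstanceID: "i-123"},
		// A sub agent that inherited the parent's instance ID, as
		// happened before the fix.
		{Name: "child", AuthInstanceID: "i-123", IsSubAgent: true},
	}
	a, ok := lookupByInstanceID(agents, "i-123")
	fmt.Println(a.Name, ok)
}
```

Without the `IsSubAgent` filter, the loop would return `child` because it was created last.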
The existing README for the Azure Linux starter template only mentioned
that the VM is ephemeral and the managed disk is persistent, but did not
explain that the resource group, virtual network, subnet, and network
interface also persist when a workspace is stopped.
This led to confusion where users expected all Azure resources to be
cleaned up on stop, when in reality only the VM is destroyed.
## Changes
- Added the persistent networking/infrastructure resources to the
resource list
- Added "What happens on stop" section explaining which resources
persist and why
- Added "What happens on delete" section confirming all resources are
cleaned up
- Moved the existing note about ephemeral tools/files into a "Workspace
restarts" subsection for clarity
These changes exactly mirror https://github.com/coder/registry/pull/713
since the registry is not yet linked to the starter templates in
`coder/coder`. Once the registry is linked, the starter templates will
pull from the registry and this duplication will no longer be necessary.
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Add a `TaskLogPreview` component that displays the last N messages of AI
chat logs when a task is paused or its build has failed. The preview
fetches log snapshots via a new `getTaskLogs` API method and renders
them in a scrollable panel with `[user]` and `[agent]` labels, colored
left borders on type transitions, and a snapshot timestamp tooltip.
The build-logs auto-scroll in `BuildingWorkspace` was simplified by
replacing the `useRef`/`useLayoutEffect` pattern with a `useCallback`
ref, and client-side message slicing was removed in favor of
server-side limits. `InfoTooltip` now accepts an optional `title` prop.
Updates the reference to `ANTHROPIC_API_KEY` in the Claude Code client
docs to `ANTHROPIC_AUTH_TOKEN`.
**File changed:**
- `docs/ai-coder/ai-bridge/clients/claude-code.md` — configuration
instructions
Created on behalf of @dannykopping
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Since Go 1.22, the loop variable capture issue is resolved. Variables
declared by for loops are now per-iteration rather than per-loop, making
the 'v := v' pattern unnecessary.
`coder templates version list` makes a call to determine the `active`
version:
```
➜ ~ coder templates version list aws-linux-dynamic
NAME CREATED AT CREATED BY STATUS ACTIVE
infallible_feistel2 2025-10-10T10:34:02+11:00 rowansmith Succeeded Active
mystifying_almeida1 2025-10-10T10:32:38+11:00 rowansmith Succeeded
```
but this is not carried across to the `-ojson` output version, so this
PR implements that in order to support programmatic addressing.
It is added as a top-level entry. If it should be nested under
`TemplateVersion`, let me know.
```
➜ ~ ./Downloads/coder-cli-templateversions-json-active templates version list aws-linux-dynamic -ojson | jq '.[] | select(.active == true) | { active, id: .TemplateVersion.id }'
{
"active": true,
"id": "38f66eae-ec63-49b7-a9d2-cdb79c379d19"
}
➜ ~ ./Downloads/coder-cli-templateversions-json-active templates version list aws-linux-dynamic -ojson |jq '.[] | select(.active == true)'
{
"TemplateVersion": {
"id": "38f66eae-ec63-49b7-a9d2-cdb79c379d19",
"template_id": "1a84ce78-06a6-41ad-99e4-8ea5d9b91e89",
"organization_id": "35f75f20-890e-4095-95f1-bb8f2ba02e79",
"created_at": "2025-10-10T10:34:02.254357+11:00",
"updated_at": "2025-10-10T10:34:46.594032+11:00",
"name": "infallible_feistel2",
"message": "Uploaded from the CLI",
"job": {
"id": "8afd05ca-b4be-48d5-a6b9-82dcfd12c960",
"created_at": "2025-10-10T10:34:02.251234+11:00",
"started_at": "2025-10-10T10:34:02.257301+11:00",
"completed_at": "2025-10-10T10:34:46.594032+11:00",
"status": "succeeded",
"worker_id": "a0940ade-ecdd-47c2-98c6-f2a4e5eb0733",
"file_id": "05fd653c-3a3f-4e5c-856b-29407732e1b1",
"tags": {
"owner": "",
"scope": "organization"
},
"queue_position": 0,
"queue_size": 0,
"organization_id": "35f75f20-890e-4095-95f1-bb8f2ba02e79",
"initiator_id": "d20c05ff-ecf3-4521-a99d-516c8befbaa6",
"input": {
"template_version_id": "38f66eae-ec63-49b7-a9d2-cdb79c379d19"
},
"type": "template_version_import",
"metadata": {
"template_version_name": "",
"template_id": "00000000-0000-0000-0000-000000000000",
"template_name": "",
"template_display_name": "",
"template_icon": ""
},
"logs_overflowed": false
},
"readme": "---\ndxxxxx,
"created_by": {
"id": "d20c05ff-ecf3-4521-a99d-516c8befbaa6",
"username": "rowansmith",
"name": "rowan smith"
},
"archived": false,
"has_external_agent": false
},
"active": true
}
```
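For consumers that prefer Go over `jq`, the same selection could look roughly like this. It assumes only the JSON shape visible in the output above (a top-level `active` flag next to a `TemplateVersion` object); the `row` type, `activeVersionID`, and the second ID in the sample are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// row models only the fields needed from each element of the
// `-ojson` output.
type row struct {
	TemplateVersion struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	} `json:"TemplateVersion"`
	Active bool `json:"active"`
}

// activeVersionID returns the ID of the first entry marked active.
func activeVersionID(data []byte) (string, error) {
	var rows []row
	if err := json.Unmarshal(data, &rows); err != nil {
		return "", err
	}
	for _, r := range rows {
		if r.Active {
			return r.TemplateVersion.ID, nil
		}
	}
	return "", errors.New("no active template version found")
}

func main() {
	sample := []byte(`[
	  {"TemplateVersion": {"id": "38f66eae-ec63-49b7-a9d2-cdb79c379d19", "name": "infallible_feistel2"}, "active": true},
	  {"TemplateVersion": {"id": "00000000-0000-0000-0000-000000000001", "name": "mystifying_almeida1"}, "active": false}
	]`)
	id, err := activeVersionID(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```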
Closes #21130
Adds documentation for Google Antigravity IDE integration, following the
same pattern as Cursor and Windsurf (dedicated page for desktop IDEs).
**Changes:**
- `docs/user-guides/workspace-access/antigravity.md` — New dedicated
page with install guide, Coder extension setup, and template
configuration example using the [Antigravity registry
module](https://registry.coder.com/modules/coder/antigravity)
- `docs/user-guides/workspace-access/index.md` — Added Antigravity IDE
section alongside Cursor and Windsurf
- `docs/manifest.json` — Added sidebar navigation entry after Windsurf
Antigravity uses the `antigravity://` protocol (added in #20873) and the
built-in `/icon/antigravity.svg` icon (added in #21068). The [registry
module](https://registry.coder.com/modules/coder/antigravity) wraps
`vscode-desktop-core` with `protocol = "antigravity"`.
Created on behalf of @matifali
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
### Notes
- Closes https://github.com/coder/internal/issues/558
- I closed previous attempt with `ptySemaphore`:
https://github.com/coder/coder/pull/21981
- We can consider implementing the retries proposed by Spike in:
https://github.com/coder/coder/pull/21981#pullrequestreview-3783200423,
if increasing the limit isn’t enough.
- I looked into Datadog — this particular test doesn’t seem very flaky
right now. It failed once in the Nightly gauntlet (3 weeks ago), but it
hasn’t failed again in the last 3 months (at least I couldn’t find any
other failures in Datadog).
## Fix PTY exhaustion flake on macOS CI
### Problem
macOS CI runners were experiencing PTY exhaustion during test runs,
causing flakes. The default PTY limit on macOS is 511, which can be
insufficient when running parallel tests.
### Solution
Added a CI step to increase the PTY limit on macOS runners from the
default 511 to the maximum allowed value of 999 before running tests.
### Changes
- Added `Increase PTY limit (macOS)` step in `.github/workflows/ci.yaml`
- Sets `kern.tty.ptmx_max=999` using `sysctl` (maximum value on our CI
runners)
- Runs only on macOS runners before the test-go-pg action
Description:
This PR updates the bundled Terraform binary and related version pins
from 1.14.1 to 1.14.5 (base image, installer fallback, and CI/test
fixtures). Terraform is statically built with an embedded Go runtime.
Moving to 1.14.5 updates the embedded toolchain and is intended to
address Go stdlib CVEs reported by security scanning.
Notes:
- Change is version-only; no functional Coder logic changes.
- Backport-friendly: intended to be cherry-picked to release branches
after merge.
## Summary
coder-logstream-kube and other tools that use the agent token to connect
to the RPC endpoint were incorrectly triggering connection monitoring,
causing false connected/disconnected timestamps on the agent. This led
to VSCode/JetBrains disconnections and incorrect dashboard status.
## Changes
Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`:
- `role=agent`: triggers connection monitoring (default for the agent
SDK)
- any other value (e.g. `logstream-kube`): skips connection monitoring
- omitted: triggers monitoring for backward compatibility with older
agents
The agent SDK now sends `role=agent` by default. A new `Role` field on
the `agentsdk.Client` allows non-agent callers to specify a different
role.
## Required follow-up
coder-logstream-kube needs to set `client.Role = "logstream-kube"`
before calling `ConnectRPC20()`. Without that change, it will still send
`role=agent` and trigger monitoring.
Fixes #21625
At present it is not possible to obtain the `id` of the template version
in the table output:
```
➜ ~ coder templates version list -h
coder v2.30.1+16408b1
USAGE:
coder templates versions list [flags] <template>
List all the versions of the specified template
OPTIONS:
-O, --org string, $CODER_ORGANIZATION
Select which organization (uuid or name) to use.
-c, --column [name|created at|created by|status|active|archived] (default: name,created at,created by,status,active)
Columns to display in table output.
➜ ~ coder templates version list aws-linux-dynamic
NAME CREATED AT CREATED BY STATUS ACTIVE
infallible_feistel2 2025-10-10T10:34:02+11:00 rowansmith Succeeded Active
mystifying_almeida1 2025-10-10T10:32:38+11:00 rowansmith Succeeded
```
Adding this because it is useful when wanting to programmatically
retrieve the details of the latest template version, and `-ojson` does
not include `active` details in its output.
```
➜ Downloads ./coder-cli-templateversions-list-id templates version list -h
coder v2.30.1-devel+bab99db9e7
USAGE:
coder templates versions list [flags] <template>
List all the versions of the specified template
OPTIONS:
-O, --org string, $CODER_ORGANIZATION
Select which organization (uuid or name) to use.
-c, --column [id|name|created at|created by|status|active|archived] (default: name,created at,created by,status,active)
Columns to display in table output.
--include-archived bool
Include archived versions in the result list.
-o, --output table|json (default: table)
Output format.
———
Run `coder --help` for a list of global options.
➜ Downloads ./coder-cli-templateversions-list-id templates version list aws-linux-dynamic -c id,name,'created at','created by',status,active
ID NAME CREATED AT CREATED BY STATUS ACTIVE
38f66eae-ec63-49b7-a9d2-cdb79c379d19 infallible_feistel2 2025-10-10T10:34:02+11:00 rowansmith Succeeded Active
aa797ea5-4221-461b-80b0-90c5164f8dc0 mystifying_almeida1 2025-10-10T10:32:38+11:00 rowansmith Succeeded
```
Closes #20965
This pull-request adds a quick permission check that the user is
allowed to view the `<RequestLogsPage />` under the admin panel.
Previously, users with this permission could view this page and browse
their own logs (which was fine). However, we've now decided that, as
this is an admin page, they should only be able to do this via the
API/CLI, not from the main admin panel.
The login page component incorrectly uses client-side routing to handle
redirects to /oauth2/authorize. Since this path is not defined as a
route in the react application but as a backend endpoint for the OAuth2
provider flow, the frontend displays a 404 "Route not found" error.
- resolves #22097
Relates to https://github.com/coder/internal/issues/1252
When a workspace with a TaskID hits its deadline, use
BuildReasonTaskAutoPause instead of BuildReasonAutostop. This allows
downstream systems to distinguish between regular autostop and task
workspace pauses.
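A sketch of the selection logic, with hypothetical types: the `buildReason` string values and the `deadlineBuildReason` helper are illustrative and are not the real constants or code path.

```go
package main

import "fmt"

// buildReason is a simplified stand-in for the build reason enum.
type buildReason string

const (
	// String values are placeholders for illustration.
	reasonAutostop      buildReason = "autostop"
	reasonTaskAutoPause buildReason = "task_auto_pause"
)

// deadlineBuildReason picks the reason recorded when a workspace hits
// its deadline: task workspaces are paused, others are autostopped.
func deadlineBuildReason(hasTaskID bool) buildReason {
	if hasTaskID {
		return reasonTaskAutoPause
	}
	return reasonAutostop
}

func main() {
	fmt.Println("task workspace:", deadlineBuildReason(true))
	fmt.Println("regular workspace:", deadlineBuildReason(false))
}
```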
Created by Mux using Opus 4.5.
Remove the warning about JetBrains Toolbox not persisting log level
configuration between restarts.
As of JetBrains Toolbox 3.2, log level configuration now persists
between restarts, making this warning outdated.
Created on behalf of @matifali
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Summary
> NOTE: Calling this out as a breaking change in case existing consumers
of the CLI depend on being able to see expired tokens OR being able to
delete tokens immediately.
Updates the `coder tokens rm` command to immediately expire a token by
ID, preserving the token record for audit trail purposes. Tokens can
still be deleted by passing `--delete`.
## Problem
During an incident on dev.coder.com, operators needed to urgently expire
an API key that was stuck in a hot loop. The only way to do this was via
direct database access:
```sql
UPDATE api_keys SET expires_at = NOW() WHERE id = '...';
```
This is not ideal for operators who may not have direct DB access or
want to avoid manual SQL.
## Solution
This PR adds:
- **API endpoint**: `PUT /api/v2/users/{user}/keys/{keyid}/expire` -
Sets the token's `expires_at` to now
- **SDK method**: `ExpireAPIKey(ctx, userID, keyID)`
- **Updates CLI**: `coder tokens rm <name|id|token>` now _expires_ by
default. You can still delete by passing the `--delete` flag. The `coder
tokens list` command now also hides expired tokens by default; pass
`--include-expired` to include them.
- **Audit logging**: The expire action is logged with old and new key
states
## Test plan
- Tests cover: owner expiring own token, admin expiring other user's
token, non-admin cannot expire other's token, 404 for non-existent token
Closes #21782

🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Closes #20859
This page previously wasn't rendered to the user. However, they can
still navigate to it, in which case everything ends up in `<Spinner />`s
until the requests ultimately fail. We can mitigate this problem by
showing them the `<RequirePermission />` modal.
<img width="1456" height="861" alt="image"
src="https://github.com/user-attachments/assets/57195643-ad55-4340-9c97-f8247b05a13b"
/>
Bumps rust from `760ad1d` to `9663b80`.
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Closes #21703
It doesn't make sense to have an `Activity bump` value when the
`Default autostop` is set to `0`: there is nothing to bump if we don't
have a timed stopping mechanism on the container. The backend already
behaves this way, and now we're describing it to the user on the
frontend.
## Summary
The license removal confirmation dialog always showed:
> Removing this license will disable all Premium features. You add a new
license at any time.
This is misleading when the license being removed is already expired —
an expired license isn't providing any features, so removing it won't
disable anything.
## Changes
- Extracted `isExpired` variable in `LicenseCard` (reusing the existing
expiry check)
- Made the dialog description conditional:
- **Expired license**: "This license has already expired and is not
providing any features. Removing it will not affect your current
entitlements."
- **Active license**: "Removing this license will disable all Premium
features. You can add a new license at any time."
- Also fixed a minor typo in the active license message ("You add" →
"You can add")
- Added two new tests covering both dialog variants
## Testing
All 5 `LicenseCard` tests pass, including the 2 new ones:
- `shows expired removal message for expired licenses`
- `shows disabling features warning for active licenses`
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Problem
Site-wide admins (e.g., Owners) could not use `coder create --org <org>`
to create workspaces in organizations they are not members of. The error
was:
```
$ coder create my-workspace -t docker --org data-science
error: organization "data-science" not found, are you sure you are a member of this organization?
```
This was inconsistent with the web UI, where Owners can create
workspaces in any organization.
## Root Cause
The CLI's `OrganizationContext.Selected()` function only checked the
user's membership list, ignoring site-wide RBAC permissions that grant
Owners access to all organizations.
## Solution
Added a fallback in `OrganizationContext.Selected()` that fetches the
org directly via the API when not found in the membership list. This
works because the API endpoint applies RBAC filtering, allowing Owners
to read any org.
## Impact
This fixes `coder create --org` and all other CLI commands that use
`OrganizationContext.Selected()` (29+ commands), including:
- `coder templates push --org <any-org>`
- `coder organizations members add --org <any-org>`
- `coder provisioner list --org <any-org>`
## Testing
Added `TestEnterpriseCreate/OwnerCanCreateInNonMemberOrg` which:
- Creates an Owner user who is NOT a member of a second org
- Verifies they can create a workspace there using `--org`
- Properly fails without the code fix, passes with it
---
*This PR was generated by [mux](https://mux.coder.com) but reviewed by a
human.*
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's major version (unless you unignore this specific
dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's minor version (unless you unignore this specific
dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR
and stop Dependabot creating any more for the specific dependency
(unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore
conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will
remove the ignore condition of the specified dependency and ignore
conditions
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Closes #16148
This pull-request resolves a few issues with wider displays, in
particular ensuring that the content's container centers as one would
expect and that the content of the headings isn't constrained to
`max-w-prose`.
**Background**
Reported in #17417, there is a `deleted` query parameter supported by
/api/v2/templates, but we do not respect this field on the client,
showing the "Create Workspace" button for deleted templates.
**Expected Behavior**
Don't show the "Create Workspace" button for deleted templates.
**Notes**
This PR adds a new `deleted` field to the templates API response.
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
## Description
This PR wires up the metrics scanner in the Makefile to automatically regenerate metrics documentation when source files change.
## Changes
* Add Makefile target `scripts/metricsdocgen/generated_metrics` to run the AST scanner to generate the metrics file
* Update `docs/admin/integrations/prometheus.md` Makefile target to depend on `scripts/metricsdocgen/generated_metrics`
* Add `scripts/metricsdocgen/README.md` documenting the metrics generation process
Closes: https://github.com/coder/coder/issues/13223
## Description
This PR refactors `scripts/metricsdocgen/main.go` to support merging static and generated metrics files for documentation generation.
The static `metrics` file remains necessary for metrics not defined in the coder codebase (`go_*`, `process_*`, `promhttp_*`, `coder_aibridged_*`), as well as **edge cases** the scanner cannot handle (e.g., metrics with runtime-determined labels or function-local variable references for fields). Handling these edge cases in the scanner would make it significantly more complex, so we keep this hybrid approach to accommodate them. In such cases, developers need to update the `metrics` file directly, so there is still a risk of out-of-date information in the documentation. However, this solution should already cover most cases.
Static metrics take priority over generated metrics when both files contain the same metric name, allowing manual overrides without modifying the scanner. Some of these edge cases could be easily fixed by updating the codebase to use one of the supported patterns.
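The priority rule can be sketched as a map merge in which static entries are written last. The `metric` type, `mergeMetrics`, and the sorted output are assumptions made for this illustration, not the actual code in `main.go`.

```go
package main

import "sort"

// metric is a hypothetical, trimmed-down metric description.
type metric struct {
	Name string
	Help string
}

// mergeMetrics combines both sources. Static entries are applied after
// generated ones, so they win whenever the names collide.
func mergeMetrics(static, generated []metric) []metric {
	byName := make(map[string]metric, len(static)+len(generated))
	for _, m := range generated {
		byName[m.Name] = m
	}
	for _, m := range static {
		byName[m.Name] = m // manual override
	}
	merged := make([]metric, 0, len(byName))
	for _, m := range byName {
		merged = append(merged, m)
	}
	// Sort by name so the output is stable between runs.
	sort.Slice(merged, func(i, j int) bool { return merged[i].Name < merged[j].Name })
	return merged
}
```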
## Changes
* Update `scripts/metricsdocgen/main.go` to read from two separate metrics files:
* `metrics`: static, manually maintained metrics (e.g., `go_*`, `process_*`, `promhttp_*`, `coder_aibridged_*`)
* `generated_metrics`: auto-generated by the AST scanner
* Update `metrics` file to contain only static and edge-case metrics
* Skip metrics with empty HELP descriptions in the scanner
* Update `generated_metrics` to reflect skipped metrics
* Update `docs/admin/integrations/prometheus.md` with merged metrics
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR implements extraction of metrics defined using `promauto.With()` factory patterns.
## Changes
* Add `extractPromautoMetric()` to handle:
* `promauto.With(reg).NewCounterVec(prometheus.CounterOpts{...}, labels)`
* `factory.NewGaugeVec(prometheus.GaugeOpts{...}, labels)`
* Script generates an updated `scripts/metricsdocgen/generated_metrics` file
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR implements extraction of metrics defined using `prometheus.New*()` and `prometheus.New*Vec()` patterns with `*Opts{}` structs.
## Changes
* Add `extractOptsMetric()` to handle:
* `prometheus.NewGauge(prometheus.GaugeOpts{...})`
* `prometheus.NewCounter(prometheus.CounterOpts{...})`
* `prometheus.NewHistogram(prometheus.HistogramOpts{...})`
* `prometheus.NewSummary(prometheus.SummaryOpts{...})`
* `prometheus.New*Vec(prometheus.*Opts{...}, labels)`
* Script generates an updated `scripts/metricsdocgen/generated_metrics` file
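As a rough illustration of this kind of extraction, the sketch below walks a parsed file and collects `Name` and `Help` string literals from `prometheus.New*(prometheus.*Opts{...})` calls. It is a simplified, hypothetical version (`extractOptsMetrics` and `foundMetric` are made-up names): it only handles literal strings and ignores `Namespace`, `Subsystem`, constants, and labels.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// foundMetric holds what this sketch can recover from one constructor call.
type foundMetric struct {
	Constructor string // for example "NewCounter" or "NewGaugeVec"
	Name        string
	Help        string
}

// extractOptsMetrics parses Go source and returns the metrics declared via
// prometheus.New*(prometheus.*Opts{Name: "...", Help: "..."}).
func extractOptsMetrics(src string) ([]foundMetric, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var found []foundMetric
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if !ok || pkg.Name != "prometheus" || !strings.HasPrefix(sel.Sel.Name, "New") {
			return true
		}
		// The first argument is expected to be the *Opts composite literal.
		opts, ok := call.Args[0].(*ast.CompositeLit)
		if !ok {
			return true
		}
		m := foundMetric{Constructor: sel.Sel.Name}
		for _, elt := range opts.Elts {
			kv, ok := elt.(*ast.KeyValueExpr)
			if !ok {
				continue
			}
			key, ok := kv.Key.(*ast.Ident)
			if !ok {
				continue
			}
			lit, ok := kv.Value.(*ast.BasicLit)
			if !ok || lit.Kind != token.STRING {
				continue // non-literal values are out of scope here
			}
			val, err := strconv.Unquote(lit.Value)
			if err != nil {
				continue
			}
			switch key.Name {
			case "Name":
				m.Name = val
			case "Help":
				m.Help = val
			}
		}
		if m.Name != "" {
			found = append(found, m)
		}
		return true
	})
	return found, nil
}
```

Because only the syntax tree is inspected, the scanned source does not need to compile or resolve its imports.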
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR implements extraction of metrics defined using the `prometheus.NewDesc()` pattern.
## Changes
* Add `extractNewDescMetric()` to extract metrics from `prometheus.NewDesc()` calls
* Script generates an updated `scripts/metricsdocgen/generated_metrics` file
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR adds an AST-based scanner to automatically generate Prometheus metrics documentation from the coder source code.
## Changes
* Add `scripts/metricsdocgen/scanner/scanner.go` with:
* Directory walking for `agent/`, `coderd/`, `enterprise/`, `provisionerd/`
* Go file parsing (skipping `*_test.go` files)
* AST inspection for metric extraction
* `Metric.String()` for Prometheus text exposition format rendering
* `writeMetrics()` to output metrics to stdout
* Placeholder `extractMetricFromCall()` (implemented in subsequent PRs)
* Empty `scripts/metricsdocgen/generated_metrics` placeholder (populated by subsequent PRs)
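For orientation, a `String()` method that renders the Prometheus text exposition format might look roughly like the sketch below. The field set, the empty label values, and the placeholder sample value `0` are assumptions for illustration; the real `Metric` type in the scanner may differ, and HELP-text escaping is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// Metric is a hypothetical shape for a scanned metric.
type Metric struct {
	Name   string
	Type   string // "counter", "gauge", "histogram" or "summary"
	Help   string
	Labels []string
}

// String renders HELP and TYPE comment lines followed by one sample line
// with empty label values and a placeholder value of 0.
func (m Metric) String() string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", m.Name, m.Help)
	fmt.Fprintf(&b, "# TYPE %s %s\n", m.Name, m.Type)
	b.WriteString(m.Name)
	if len(m.Labels) > 0 {
		pairs := make([]string, 0, len(m.Labels))
		for _, label := range m.Labels {
			pairs = append(pairs, label+`=""`)
		}
		b.WriteString("{" + strings.Join(pairs, ",") + "}")
	}
	b.WriteString(" 0\n")
	return b.String()
}
```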
**Note:** To facilitate the review process, this was separated into scoped stacked PRs. The division was based on the main structure, the different Prometheus patterns currently present in the codebase, and updates to the build process.
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
This pull-request refactors the `<Combobox />` component from a
monolithic design to a composable compound component pattern, providing
more flexibility and reusability across the codebase.
- Migrates `<SelectFilter />` to use the new `<Combobox />` instead of
the legacy `<SelectMenu />` components
- Updates all existing consumers of `<Combobox />` and `<SelectFilter
/>` to use the new API
<img
src="https://github.com/user-attachments/assets/a3336431-590c-48b5-adde-3fc5c16f459d"
/>
The `<Combobox />` component has been refactored to use a compound
component pattern, exposing:
- `Combobox` - Root component with context provider for open/value state
- `ComboboxTrigger` - Trigger wrapper (re-exports PopoverTrigger)
- `ComboboxButton` - Styled button with chevron and selected option
display
- `ComboboxContent` - Popover content with Command wrapper
- `ComboboxInput` - Search input (re-exports CommandInput)
- `ComboboxList` - List container (re-exports CommandList)
- `ComboboxItem` - Individual option with checkmark indicator
- `ComboboxEmpty` - Empty state (re-exports CommandEmpty)
- `useCombobox` - Hook to access combobox context
This pattern allows consumers to compose their own combobox layouts
while sharing consistent behavior and styling.
Furthermore, we had an issue with `CreateWorkspacePageView.stories.tsx`
lacking stories which would let us see the passed parameters and presets
in context. I've added stories to cover this.
### Updated Consumers
- `DynamicParameter.tsx` - Updated to use new Combobox API for parameter
options
- `CreateWorkspacePageView.tsx` - Updated preset combobox usage
- `IdpOrgSyncPageView.tsx` - Updated organization sync form
- `IdpGroupSyncForm.tsx` - Updated group sync form
- `IdpRoleSyncForm.tsx` - Updated role sync form
- `WorkspacesPage/filter/menus.tsx` - Updated workspace filter menus
---------
Co-authored-by: ケイラ <mckayla@hey.com>
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209
I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.
Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.
Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Closes #21830
Remove redundant icon sizing across the frontend. Components like
`Button`, `DropdownMenuItem`, and `CommandItem` already control child
SVG sizes via CSS selectors (e.g., `[&>svg]:size-icon-lg`), so explicit
`size` props and `className` overrides on icons nested inside them are
unnecessary. This PR strips those out and lets parent components handle
sizing consistently.
As a bonus, also migrates the `DropdownArrow` component from Emotion
CSS-in-JS to Tailwind utilities, replaces raw `<a>` tags with the `<Link
/>` component in the Premium page, and adds Storybook coverage for
`PremiumPageView`.
The AI Bridge setup docs showed `CODER_AIBRIDGE_ENABLED=true coder
server` as a single line, which can confuse users into thinking the env
var is a one-time prefix rather than a persistent setting.
Split this into `export CODER_AIBRIDGE_ENABLED=true` on its own line
followed by `coder server`, which is clearer and consistent with how the
Bedrock credentials section already handles env vars in the same file.
Created on behalf of @dannykopping
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Problem
CI failure showed 3 goroutines leaked in the prebuilds reconciler, all
stuck in `select` state:
1) `MetricsCollector.BackgroundFetch` (metrics goroutine)
2) `StoreReconciler.Run` (main reconciliation loop)
3) `StoreReconciler.Run.func3()` (provisioner job publisher goroutine)
All three goroutines were waiting for `ctx.Done()`, which likely means
`cancelFn()` was never called to trigger shutdown.
**Note:** I was unable to reproduce the flake locally. The likely cause
was a race condition between `Run()` and `Stop()` where `Stop()` could
check `running` (seeing `false`), return early, and then `Run()` would
start goroutines that never get cleaned up. This could happen in any
`coderd` test that starts a server with prebuilds enabled.
### Problems identified
1) Missing waitgroup tracking: the provisioner job publisher goroutine
was not tracked in the waitgroup, so it was not awaited for a clean
shutdown in `Run defer func()`.
2) The provisioner job publisher goroutine had a redundant `case
<-c.done` that could race with the `Stop()` select statement.
3) Race condition between `Run()` and `Stop()`: the `running` and
`stopped` fields were `atomic.Bool` values checked and set
independently, allowing a window where `Stop()` could see
`running=false` and return early, then `Run()` would set `running=true`
and start goroutines that would never be cleaned up. This could happen
in any `coderd` test that starts a server with prebuilds enabled.
## Changes
* Added `wg.Add(1)` and `defer wg.Done()` to track provisioner job
publisher goroutine in waitgroup
* Removed redundant `case <-c.done` from provisioner job publisher
goroutine to eliminate race condition
* Replaced `atomic.Bool` for `running` and `stopped` with a `sync.Mutex`
lifecycle state, also protecting `cancelFn` under the same mutex, to
eliminate the race between `Run()` and `Stop()`
* Added a guard in `Run()` to prevent double-start (`c.stopped ||
c.running`)
* Improved comments in Stop() and Run() to clarify shutdown behavior
Closes: https://github.com/coder/internal/issues/1116
### Summary
Workspaces created via mode=auto links now require explicit user
confirmation before provisioning. A warning dialog shows all prefilled
param.* values from the URL and blocks creation until the user clicks
`Confirm and Create`. Clicking `Cancel` falls back to the standard form
view.
<img width="820" height="475" alt="auto-create-consent-dialog"
src="https://github.com/user-attachments/assets/8339e3bd-434f-4a04-9385-436bf95f49d7"
/>
### Breaking behavior change
Links using `mode=auto` (e.g., "Open in Coder" buttons) will no longer
silently create workspaces. Users will now see a consent dialog and must
explicitly confirm before the workspace is provisioned. Any existing
integrations or automation relying on `mode=auto` for seamless workspace
creation will now require manual user interaction.
---------
Co-authored-by: Jake Howell <jacob@coder.com>
This change adds Linux support for Desktop VPN by aligning Linux
behavior with the existing Windows daemon implementation and adding a
Linux networking stack implementation.
### What changed
- Consolidated the daemon command implementation into a shared file:
- `cli/vpndaemon_windows_linux.go` (`//go:build windows || linux`)
- Consolidated daemon tests into a shared file:
- `cli/vpndaemon_windows_linux_test.go` (`//go:build windows || linux`)
- Removed Linux-only duplicate daemon files:
- `cli/vpndaemon_linux.go`
- `cli/vpndaemon_linux_test.go`
- Removed unsupported-platform stubs per current supported OS targets:
- `cli/vpndaemon_other.go`
- `vpn/tun.go`
- Kept Linux networking stack implementation in:
- `vpn/tun_linux.go`
### Notes
- Linux now uses the same `rpc-read-handle` / `rpc-write-handle` flags
and env vars as Windows.
- The daemon logs to stderr (via CLI logger sinks), and does not forward
logs over the RPC pipe.
## Problem
The Copilot provider was missing from the AI Bridge logs filter dropdown, so users couldn't filter interceptions by Copilot. Additionally, the `AIBridgeProviderIcon` component didn't handle the copilot provider, so it would render a fallback question mark icon.
<img width="1392" height="333" alt="Screenshot 2026-02-10 at 09 26 16" src="https://github.com/user-attachments/assets/ecb97400-a4dd-4e88-accc-68d7fdf19b2a" />
## Changes
* Added `copilot` case to `AIBridgeProviderIcon`, using the existing `/icon/github.svg`.
* Added Copilot as a provider option in the filter dropdown.
* Added `MockInterceptionAnthropic` and `MockInterceptionCopilot` mock data with sample prompts, and updated the Storybook stories to use one interception per provider.
## Problem
Previously, the AI Bridge model column icon was derived from the provider field. This worked because each provider only served its own models: OpenAI interceptions always used OpenAI models, and Anthropic interceptions always used Anthropic models.
With the introduction of the Copilot provider, this assumption no longer holds. Copilot can forward requests to both OpenAI and Anthropic models, so the provider field alone is not enough to determine the correct model icon. This caused Copilot interceptions to display a fallback question mark icon for the model.
<img width="1337" height="365" alt="Screenshot 2026-02-10 at 09 10 34" src="https://github.com/user-attachments/assets/1efd613d-16c9-4738-8337-6ccf92e610fc" />
## Changes
* Added `AIBridgeModelIcon` component that infers the model family (Claude, OpenAI) from the model name string and renders the appropriate icon.
* Updated `RequestLogsRow` to use `AIBridgeModelIcon` instead of `AIBridgeProviderIcon` in both the table row and the expanded detail view.
This PR fixes a workspace app authentication bug where requests that
include an `Authorization` header (intended for the upstream app) can
cause Coder to ignore the workspace app session cookie
(`coder_subdomain_app_session_token_*` /
`coder_path_app_session_token`). When that happens, Coder fails to mint
or renew `coder_signed_app_token` and redirects to
`/api/v2/applications/auth-redirect` instead of proxying the request to
the workspace.
This commonly shows up when users run a frontend and backend in the same
workspace and the backend requires `Authorization` (for example, `curl
-H "Authorization: bearer ..."` or browser `fetch()` calls).
Related issues / context:
* Primary bug report and repro:
[https://github.com/coder/coder/issues/21467](https://github.com/coder/coder/issues/21467)
* Related symptoms reported as CORS / redirect failures for workspace
apps:
*
[https://github.com/coder/coder/issues/20667](https://github.com/coder/coder/issues/20667)
*
[https://github.com/coder/coder/issues/19728](https://github.com/coder/coder/issues/19728)
## Root Cause
In `coderd/workspaceapps/cookies.go`, `AppCookies.TokenFromRequest`
checked `httpmw.APITokenFromRequest(r)` first. That helper returns a
token from several places, including `Authorization: Bearer ...`.
As a result, when a request included an upstream `Authorization` header,
that header value was returned as the “session token” for the app proxy,
and `coder_subdomain_app_session_token_*` was never read. Authentication
then failed and the request was treated as signed out.
## Fix
Change the precedence in `AppCookies.TokenFromRequest`:
1. First check the access-method-specific cookie:
* subdomain apps: `coder_subdomain_app_session_token_{hash}`
* path apps: `coder_path_app_session_token`
2. If not present, fall back to `httpmw.APITokenFromRequest(r)` (so
non-browser clients can still authenticate via query, header, or bearer
tokens if they really want to).
This ensures that:
* Backend requests that require `Authorization` still reach the
workspace.
* `coder_signed_app_token` can be renewed from the app session cookie
even when `Authorization` is present.
* `Authorization` is still forwarded to the upstream app (the reverse
proxy code does not strip it).
Initially, I attempted workarounds
([https://github.com/coder/coder/issues/20667#issuecomment-3868578388](https://github.com/coder/coder/issues/20667#issuecomment-3868578388),
[https://github.com/coder/coder/issues/19728#issuecomment-3868578093](https://github.com/coder/coder/issues/19728#issuecomment-3868578093)),
but adding `/auth-redirect` to the permissive CORS paths and extending
the validity of workspace app auth tokens from 1 minute to 1 hour only
partially masked the issue. After workspace restarts and token expiry, I
no longer saw CORS errors, but the tokens were still not renewed.
After patching my local Nix-based setup on Coder v1.30.0 with this
change, I can no longer observe this behavior.
When discussing the changes needed for #22032 I was complaining about
how the `overflow-hidden` didn't work correctly so we could safely
remove it.
To continue these changes, I've refactored how these triggers behave on
mobile and enabled full truncation and `max-w-`'s on each piece of
content. Everything stemmed from the `<fieldset />` having
`width: max-content`, which caused the content to extend past the bounds
of the container, with `flex` in tow.
Furthermore, the `(Default)` on `Preset` has been turned into a badge so
that we get the full truncation effect as we do with `Template Version`.
Follow-up improvements here might be to wrap the content of this input
on smaller displays.
### Preview
Top is the old, bottom is the new.
<img width="924" height="594" alt="preview"
src="https://github.com/user-attachments/assets/c1bbf152-03a6-4cad-b925-aad0549536a7"
/>
I was trying to figure out why `goleak` was complaining about a dangling
http2 connection goroutine in tests. Turns out that `taskname.Generate`
will call out to Anthropic if an API key is set, and we're calling it in
`dbgen`. Modified to use testutil method instead.
Closes #22028
This pull-request debounces the message sent to our web-socket backend
to ensure we're not overwriting the user's input as they type. As an
added bonus, this will also debounce message spam if people are going
crazy on Radio Items or similar.
An extra bit of flavour: this also resolves a good use-case for `cn()`
in diagnostic errors 🙂
This pull-request takes the MUI based components from `<AuditLogRow />`
and its subsidiaries and updates them to use the correct newer Tailwind
based components.
This reverts commit 5224387c5a.
This is causing layout shifts to `0,0` when attempting to open
dropdowns. Something more battle-tested is needed unfortunately, Radix +
Scrollgutters is really annoying.
Add the ability to pause a running task and resume a paused task directly
from the TaskPage. This includes showing contextual messages when a task
is paused (manual vs timeout) and proper error handling with dialogs for
API errors.
- Extract task action logic into reusable mutations (api/queries/tasks.ts)
- Move TaskActionButton to modules/tasks for better organization
- Add pause button to TaskStartingAgent component
- Show appropriate state messages for transitioning states (pausing,
canceling, deleting)
The "Deploy PR manually" image (`deploy-pr-manually.png`) referenced in
the contributing docs has never existed in the repository, resulting in
a broken image on the [docs
site](https://coder.com/docs/about/contributing/CONTRIBUTING#deploying-a-pr).
This PR removes the broken `<Image>` tag and ends the sentence with a
period instead. The `pr-deploy.yaml` workflow link remains intact for
users to navigate to the workflow dispatch page directly.
Created on behalf of @DavidFrawormo
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Bumps the x group with 2 updates:
[golang.org/x/oauth2](https://github.com/golang/oauth2) and
[golang.org/x/sys](https://github.com/golang/sys).
Updates `golang.org/x/oauth2` from 0.34.0 to 0.35.0
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/golang/oauth2/commit/89ff2e1ac388c1a234a687cb2735341cde3f7122"><code>89ff2e1</code></a>
google: add safer credentials JSON loading options.</li>
<li>See full diff in <a
href="https://github.com/golang/oauth2/compare/v0.34.0...v0.35.0">compare
view</a></li>
</ul>
</details>
<br />
Updates `golang.org/x/sys` from 0.40.0 to 0.41.0
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/golang/sys/commit/fc646e489fd944b6f77d327ab77f1a4bab81d5ad"><code>fc646e4</code></a>
cpu: use IsProcessorFeaturePresent to calculate ARM64 on windows</li>
<li><a
href="https://github.com/golang/sys/commit/f11c7bb268eb8a49f5a42afe15387a159a506935"><code>f11c7bb</code></a>
windows: add IsProcessorFeaturePresent and processor feature consts</li>
<li><a
href="https://github.com/golang/sys/commit/d25a7aaff8c2b056b2059fd7065afe1d4132e082"><code>d25a7aa</code></a>
unix: add IoctlSetString on all platforms</li>
<li><a
href="https://github.com/golang/sys/commit/6fb913b30f367555467f08da4d60f49996c9b17a"><code>6fb913b</code></a>
unix: return early on error in Recvmsg</li>
<li>See full diff in <a
href="https://github.com/golang/sys/compare/v0.40.0...v0.41.0">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps rust from `df6ca8f` to `760ad1d`.
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Resolves the TODO in TestPool by adding TestPool_Expiry which uses Go
1.25's testing/synctest to verify TTL-based cache eviction.
I wanted to get familiar with the new `synctest` package in Go 1.25 and
found this TODO comment, so I decided to take a stab at it 😄
Migrates `ConnectionLogRow` and `ConnectionLogDescription` off MUI and
Emotion. Replaces `@mui/material/Link` with the existing shadcn-based
`Link` component, swaps the deprecated `Stack` wrappers for plain divs
with Tailwind flex utilities, and converts all Emotion `css` prop styles
to Tailwind classes.
Also fixes a pre-existing lint issue where `tabIndex` was set on a
non-interactive div.
Replace all usages of MUI's `visuallyHidden` utility from `@mui/utils`
with Tailwind's `sr-only` class. Both produce identical CSS, so this is
a no-op behaviorally -- just removes another MUI dependency from the
codebase. Also updates the accessibility example in the frontend
contributing docs to match.
closes: https://github.com/coder/internal/issues/1331
Fixes up an issue in the test where we end up calling `FailNow` outside
the main test goroutine. Also adds the ability to name a `ptytest.PTY`
for cases like this one where we start multiple commands. This will help
debugging if we see the issue again.
This doesn't address the root cause of the failure, but I think we
should close the flake issue. I think we'd need like a stacktrace of all
goroutines at the point of failing the test, but that's way too much
effort unless we see this again.
Closes https://github.com/coder/internal/issues/1261.
This pull request adds an endpoint to pause coder tasks by stopping the
underlying workspace.
* Instead of `POST /api/v2/tasks/{user}/{task}/pause`, the endpoint is
currently experimental.
* We do not currently set the build reason to `task_manual_pause`,
because build reasons are currently only used on stop transitions.
This pull-request takes our `@mui/*` dependencies and replaces them with
shiny new Tailwind ones. Furthermore, it resolves an issue with the
`input` where `aria-invalid` wouldn't give it a red-ring like
`<InputGroup />` does.
As an added touch we've applied Formik to `<RequestOTPPage />` so that
we can render an invalid email easily.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This pull-request finds all of our previous instances of the MUI-based
Latency `color`s and updates them to use the equivalents from the
Tailwind package.
Adds a standalone command that acts as a mock telemetry server,
receiving snapshots and printing them as a JSON stream to stdout. Useful
for local development testing with scripts/develop.sh by setting
CODER_TELEMETRY_ENABLE and CODER_TELEMETRY_URL environment variables.
Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.
The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.
Previously, UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and
GetAndResetBoundaryUsageSummary (DELETE...RETURNING) could race during
telemetry period cutover. Without serialization, an upsert concurrent with the
delete could lose data (deleted right after being written) or commit after the
delete (miscounted in the next period). Both operations now acquire
LockIDBoundaryUsageStats within a transaction to ensure a clean cutover.
This pull request updates the documentation review workflow in
`.github/workflows/doc-check.yaml` to improve clarity and introduce
sticky comment logic for doc-check reviews. The changes focus on
refining the review context messages and providing detailed instructions
for updating existing doc-check comments, ensuring more consistent and
actionable documentation feedback.
**Workflow message and prompt improvements:**
* Refined the context messages for different PR trigger types to be
clearer and less repetitive, making instructions more concise for the
agent.
**Sticky comment logic and instructions:**
* Updated the task prompt to instruct the agent to look for an existing
doc-check comment containing `<!-- doc-check-sticky -->` and update it
instead of creating a new one, supporting more efficient and organized
review threads.
* Added detailed instructions for how to update sticky comments,
including checking off addressed items, striking through items no longer
needed, adding new items, and warning if changes can't be verified.
* Modified the comment format example to include sticky comment
conventions, such as strikethrough for reverted items, checkboxes for
addressed items, and warnings for unverifiable documentation changes.
* Ensured the `<!-- doc-check-sticky -->` marker is placed at the end of
the comment for easier identification and updates in future runs.
## Description
Fixes an incorrect path in the air-gapped/offline installation
documentation for publishing Coder modules to Artifactory.
The [coder/registry](https://github.com/coder/registry) repo has the
following structure:
```
registry/            # repo root
└── registry/        # subdirectory
    └── coder/
        └── modules/
```
The documentation previously instructed users to run:
```shell
cd registry/coder/modules
```
But the correct path is:
```shell
cd registry/registry/coder/modules
```
This was causing confusion for users trying to set up Coder modules in
air-gapped environments with Artifactory or similar repository managers.
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Adds a Go wrapper (`scripts/apidocgen/swaginit/main.go`) that calls
swag's Go API with `Strict: true`. The `--strict` flag isn't available
in swag's CLI in any version, so the wrapper is the only way to enable
it.
Also upgrades swag from v1.16.2 to v1.16.6 (better generics support,
precise numeric formats, `x-enum-descriptions`, CVE-2024-45338 fix).
Closes [`internal#1292`](https://github.com/coder/internal/issues/1292)
This pull-request reduces our nesting of the `View Task` button. It's
easier to jump to tasks now as we don't have to wait for the app status
to exist.
Previously we returned 400 Bad Request for all non-active states. This
was semantically incorrect for transitional and paused states where the
request is valid but conflicts with current state.
We now return 409 Conflict for pending/initializing/paused (resolvable
by waiting or resuming) and 400 for error/unknown (actual problems).
This enables client-side auto-resume orchestration per the task
lifecycle RFC.
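The split described above can be sketched as a small mapping function. This is a minimal illustration, not the actual handler: the `TaskStatus` type, its constants, and `httpStatusFor` are hypothetical names, and returning 200 for an active task is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// TaskStatus is a hypothetical stand-in for the task status enum.
type TaskStatus string

const (
	StatusActive       TaskStatus = "active"
	StatusPending      TaskStatus = "pending"
	StatusInitializing TaskStatus = "initializing"
	StatusPaused       TaskStatus = "paused"
	StatusError        TaskStatus = "error"
	StatusUnknown      TaskStatus = "unknown"
)

// httpStatusFor returns the response code for a request that needs an
// active task. Transitional and paused states are conflicts the client
// can resolve by waiting or resuming; everything else is a bad request.
func httpStatusFor(s TaskStatus) int {
	switch s {
	case StatusActive:
		return http.StatusOK // assumption: the request proceeds normally
	case StatusPending, StatusInitializing, StatusPaused:
		return http.StatusConflict
	default:
		return http.StatusBadRequest
	}
}

func main() {
	statuses := []TaskStatus{
		StatusActive, StatusPending, StatusInitializing,
		StatusPaused, StatusError, StatusUnknown,
	}
	for _, s := range statuses {
		fmt.Printf("%-12s -> %d\n", s, httpStatusFor(s))
	}
}
```

A client can then treat 409 as a signal to resume or retry, and 400 as a terminal failure.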
Closes coder/internal#1265
Task snapshots were orphaned when tasks were soft-deleted. The
`task_snapshots` table has an `ON DELETE CASCADE` foreign key, but
that only fires on hard deletes.
Modified DeleteTask to use a CTE that atomically soft-deletes the
task and removes its snapshot in a single transaction. The query now
returns just the task UUID instead of the full row.
Closes coder/internal#1283
Relates to https://github.com/coder/coder/pull/21922 /
https://github.com/coder/internal/issues/1259
* Adds `dbfake.BuilderOption func(*WorkspaceBuildBuilder)`
* Adds `BuilderOption` methods for setting various provisioner job
related fields on `WorkspaceBuildBuilder`.
* Migrates a number of existing tests that previously depended on
provisioner job timing to use these updated methods in the following
packages:
* `coderd/jobreaper`
* `coderd/notifications/reports`
* `enterprise/coderd/schedule`
* `enterprise/coderd/prebuilds`
* `scripts/workspace-runtime-audit`
🤖 Created using Mux (Opus 4.5)
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
We attempted to unify these previously in #21914, but it appears I
missed dropping this by one `font-weight` level. This pull-request makes
that very simple change; it's now in line with the Figma design!
fixes: https://github.com/coder/internal/issues/1300
Adds brotli and zstd compression to the binary cache. Also refactors coderd's streaming encoding middleware to use the same standard set of compression algorithms, so we have them in one place.
relates to: https://github.com/coder/internal/issues/1300
Refactors the options to the site handler to take the cache directory, rather than expecting the caller to call `ExtractOrReadBinFS` and pass the results.
This is important in this stack because we need direct access to the cache directory for compressed file caching.
relates to: https://github.com/coder/internal/issues/1300
Refactors the bin handler to be a `struct` instead of a handlerfunc. The reason we want this is because we are going to introduce a cache of compressed files, so we need somewhere to put this cache.
relates to: https://github.com/coder/internal/issues/1300
Refactors the site binary handler routines to their own file. The `site.go` was getting pretty long and I want to do some refactoring on how the binary handler works.
This PR is literally just moving code from file to file; at the package level nothing is changed.
relates to: https://github.com/coder/internal/issues/1300
Adds a new package called `cachecompress` which takes an `http.FileSystem` and wraps it with an on-disk cache of compressed files. We lazily compress files when they are requested over HTTP.
# Why we want this
With cached compression, we reduce CPU utilization during workspace creation significantly.

This is from a 2k scaletest run at the top of this stack of PRs, so that the cache is used to serve `/bin/` files. Previously we pegged the 4-core Coderds, with profiling showing 40% of CPU going to `zstd` compression (c.f. https://github.com/coder/internal/issues/1300).
With this change compression is reduced down to 1s of CPU time (from 7 minutes).
# Implementation details
The basic structure is taken from Chi's Compressor middleware. I've reproduced the `LICENSE` in the directory because it's MIT licensed, not AGPL like the rest of Coder.
I've structured it not as a middleware that calls an arbitrary upstream HTTP handler, but taking an explicit `http.FileSystem`. This is done for safety so we are only caching static files and not dynamically generated content with this.
One limitation is that on first request for a resource, it compresses the whole file before starting to return any data to the client. For large files like the Coder binaries, this can add 1-5 seconds to the time-to-first-byte, depending on the compression used.
I think this is reasonable: it only affects the very first download of the binary with a particular compression for a particular Coderd.
If we later find this unacceptable, we can fix it without changing interfaces. We can poll the file system to figure out how much data is available while the compression is in progress.
follows on from #21940.
The API endpoints existed for this already, so this PR just adds CLI functionality which uses those API endpoints.
Generated with the help of Mux
## Summary
Updates the AI Governance documentation to explicitly mention that both
Community and Premium deployments include 1,000 Agent Workspace Builds.
Also clarifies that Community deployments do not have access to AI
Bridge or Agent Boundaries.
This is a follow-up to #21943 which made the same clarification in the
Tasks documentation.
## Changes
- Updated the "Agent Workspace Build Limits" section in
`docs/ai-coder/ai-governance.md`
- Added explicit mention that Community deployments lack AI Bridge and
Agent Boundaries access
---
Created on behalf of @mattvollmer
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Summary
Fixes flaky `TestServer/BuiltinPostgres` test caused by port conflicts
in CI.
## Fix
Increase retry attempts from 3 to 10 for better odds when port conflicts
occur.
Fixes https://github.com/coder/internal/issues/1017
Adds additional logs for determining what signal the agent receives
prior to shut down. Also helps distinguish whether the signal originated
at the agent or reaper.
## Description
This PR adds documentation for configuring clients to work with AI
Bridge via AI Bridge Proxy, specifically GitHub Copilot.
Preview:
https://coder.com/docs/@docs-aibridge-proxy-client-config/ai-coder/ai-bridge/ai-bridge-proxy/setup#client-configuration
## Changes
* Add Client Configuration section to
`docs/ai-coder/ai-bridge/ai-bridge-proxy/setup.md` covering proxy and CA
certificate configuration
* Add `docs/ai-coder/ai-bridge/clients/copilot.md` with configuration
instructions for: Copilot CLI, VS Code Copilot Extension, JetBrains IDEs
* Update `docs/ai-coder/ai-bridge/clients/index.md`:
* Add introduction explaining base URL vs proxy-based integration
* Add GitHub Copilot to compatibility table
Related to: https://github.com/coder/internal/issues/1188
Context was created before expensive setup operations (building
workspaces, starting agents), leaving insufficient time for the actual
command execution. Split into setupCtx for setup and a fresh ctx for
the command to ensure both get the full timeout.
The API endpoints existed for this already, so this PR just adds CLI
functionality which uses those API endpoints.
closes #21891
Generated with the help of Mux
macOS runners lack GNU toolchain dependencies (bash 4+, GNU getopt, make
4+) required by `scripts/lib.sh`. When any script sources `lib.sh`, it
checks for these dependencies and fails if they're missing.
This caused consistent failures in the `test-go-pg (macos-latest)` job
in `nightly-gauntlet.yaml`, which didn't have the GNU tools setup that
`ci.yaml` had. Commit 9a417df ("ci: add retry logic for Go module
operations") added a macOS GNU tools step to `ci.yaml`, but
`nightly-gauntlet.yaml` was not updated.
This PR adds a reusable `setup-gnu-tools` action and uses it
consistently across all workflows with macOS jobs, replacing the inline
brew install steps.
Closes https://github.com/coder/internal/issues/1133
The Connection Log page has a preset filter "Active SSH connections"
that was using `status:connected`, but the only valid status enum values
are `completed` and `ongoing`. This caused the preset to generate an
invalid query.
This changes the preset to use `status:ongoing type:ssh` and adds a
typed helper function so that invalid enum values will be caught at
compile time.
---
PR generated by [mux](https://mux.coder.com), but reviewed by a human.
Adds support for filtering workspaces by health status using
healthy:true or healthy:false in the search query.
This is done by changing `has-agent` to accept a list of statuses and
aliasing `healthy:true` to `has-agent:connected` and `healthy:false` to
`has-agent:timeout,disconnected`.
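
The aliasing can be pictured as a rewrite of search terms before the query is parsed. This is a sketch under assumptions: `expandHealthy` is a hypothetical helper, and the real search parser may apply the alias at a different stage.

```go
package main

import (
	"fmt"
	"strings"
)

// expandHealthy rewrites healthy:<bool> terms into the equivalent
// has-agent filter and leaves every other term untouched.
func expandHealthy(query string) string {
	terms := strings.Fields(query)
	for i, term := range terms {
		switch strings.ToLower(term) {
		case "healthy:true":
			terms[i] = "has-agent:connected"
		case "healthy:false":
			terms[i] = "has-agent:timeout,disconnected"
		}
	}
	return strings.Join(terms, " ")
}

func main() {
	fmt.Println(expandHealthy("owner:me healthy:false"))
	fmt.Println(expandHealthy("healthy:true"))
}
```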
Fixes #21623
Add the ability to pause and resume tasks directly from the Tasks table,
allowing users to manage workspace resources without navigating to
individual task pages.
This pull-request implements various permission checks to the
`<OAuth2App* />` stories and components. We're trying to ensure that
we're actually allowed to `create`/`view`/`delete` on both Secrets and
Applications before showing them to the user/allowing action.
Furthermore, I've added various stories to catch when a user lacks these
permissions.
I noticed this particularly because I'm only an `Auditor` on our DEV
instance and can't see these fields.
---------
Co-authored-by: coder-tasks[bot] <254784001+coder-tasks[bot]@users.noreply.github.com>
The comments generated are too noisy and not of sufficiently high signal
that we should automatically opt every PR in.
This PR moves the trigger to the `code-review` label _only_.
Signed-off-by: Danny Kopping <danny@coder.com>
This pull-request implements a super simple change, essentially when we
fail to login we'd like to persist the `email` used when attempting to
sign-in. This just speeds up the flow rather than having to type the
email in again.
This PR increases the size of the schedule increment/decrement buttons
([-] [+]) to match the icon button style at size `sm` (same as the Stop,
Restart buttons).
## Changes
- Button dimensions: 20×20px → 32×32px
- Icon size: `size-icon-xs` → `size-icon-sm`
- Border radius: 4px → 6px (consistent with other icon buttons)
## Before
The [-] [+] buttons were tiny (20×20px) and difficult to click.
## After
The buttons now match the icon button style at size `sm` (32×32px),
consistent with other topbar buttons.
---
Created on behalf of @christin
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
> [!NOTE]
> Pull requests #21781, #21807, and #21809 are required before we can
> merge this. They will stop us battling the `z-index` that is provided
> by MUI.
This avoids the changes that would be required in #21819.
This pull-request removes our reliance on controlling the scroll from
within another `<div />`. This means that we can actively make use of
`<ScrollRestoration />`, where the page will return to the top of the
page when you navigate to a new URL.
Updates the multi-model support description in the Coder Research docs
to reference provider companies (Anthropic, xAI, OpenAI) instead of
specific model names (Claude sonnet-4/opus-4, Grok, GPT-5).
This makes the docs more stable as model names change frequently, while
provider names remain constant.
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Co-authored-by: Matt Vollmer <matthewjvollmer@outlook.com>
- remove beta labels
- clarify how AWB is measured
- reassurance of no downtimes when limit is reached
---------
Co-authored-by: Atif Ali <atif@coder.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Matt Vollmer <matthewjvollmer@outlook.com>
Adds `add-project` to the `mux` module in the dogfood Coder template so
Mux opens the cloned repo by default.
- Uses `local.repo_dir` (defaults to `/home/coder/coder`) so it stays
correct if the repo base dir parameter changes.
Testing:
- `terraform fmt -check dogfood/coder/main.tf`
Adds a new AI Bridge client configuration page for **Mux** and lists it
in the client compatibility table.
- Add `docs/ai-coder/ai-bridge/clients/mux.md` with a short intro, UI +
env var + `~/.mux/providers.jsonc` examples
- Add Mux to the AI Bridge client compatibility table
- Add the new page to `docs/manifest.json`
Refs: https://mux.coder.com/config/providers#environment-variables
This pull-request ensures that we're using `<DropdownMenu />` in the
`Admin Settings` button as things weren't uniform before. This is in line
with the Figma design with the darker ("black") background. This has an
added side-benefit of removing some MUI-specific code.
<img
src="https://github.com/user-attachments/assets/4eb9136b-91b3-44ac-81a0-5abd1cf2cdf2"
/>
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
## Description
Add documentation for AI Bridge Proxy.
## Changes
This PR adds documentation for AI Bridge Proxy under
`docs/ai-coder/ai-bridge/ai-bridge-proxy/`:
* `index.md`: Overview of AI Bridge Proxy, how it works (MITM vs tunnel
modes), and when to use it
* `setup.md`: Setup guide covering:
* Proxy configuration and required settings
* Security considerations and deployment options
* CA certificate generation (self-signed and organization-signed)
* Upstream proxy chaining configuration
Note: TODO comments in the documentation will be addressed in follow-up
PRs.
Related to: https://github.com/coder/internal/issues/1188
This pull-request refactors filter-related dropdown and input components
from MUI to our Tailwind-based design system. This is more in line with
the Figma design. Controversially, we are changing the button group for
canned filters and input into two separate components.
- **InputGroup**: Complete rewrite to a compound component pattern
(`InputGroup`, `InputGroupAddon`, `InputGroupInput`, `InputGroupButton`)
using Tailwind and CVA, replacing the old CSS-in-JS approach
- **SearchField**: Migrated from MUI TextField to use the new InputGroup
components, with a simplified API and proper ref forwarding
- **Filter/PresetMenu**: Replaced MUI Menu with our DropdownMenu
component, and updated icon to `SlidersHorizontal`
### Changes
| Component | Before | After |
|-----------|--------|-------|
| InputGroup | CSS-in-JS with MUI margin hacks | Compound component with Tailwind group states |
| SearchField | MUI TextField + InputAdornment | InputGroup + InputGroupAddon composition |
| PresetMenu | MUI Menu/MenuItem | DropdownMenu/DropdownMenuItem |
| MenuSearch | Complex CSS overrides | Single Tailwind class |
<img
src="https://github.com/user-attachments/assets/5b819027-2dca-4dcc-b6d6-7096fa3775c0"
/>
On Windows, `pty.New()` was creating a `ConPTY` (`PseudoConsole`) even
when no process would be attached. `ConPTY` requires a real process to
function correctly - without one, the pipe handles become invalid
intermittently, causing flaky test failures like `read |0: The handle is
invalid.`
This affected tests using the `ptytest.New()` + `Attach()` pattern for
in-process CLI testing.
The fix splits Windows PTY creation into two paths:
- `newPty()` now returns a simple pipe-based PTY for the `Attach()` use
case
- `newConPty()` creates a real `ConPTY`, called by `Start()` when a
process will be attached
AFAICT this will result in no change in behaviour outside of tests.
Fixes coder/internal#1277
_Disclaimer: investigated and implemented by Claude Opus 4.5, reviewed
by me._
---------
Signed-off-by: Danny Kopping <danny@coder.com>
* Adds support for parameter `format=text` in the following API routes:
* `/api/v2/workspaceagents/:id/logs`
* `/api/v2/workspacebuilds/:id/logs`
* `/api/v2/templateversions/:id/logs`
* `/api/v2/templateversions/:id/dry-run/:id/logs`
* Adds links to view raw logs on the following pages:
* Workspace build page
* Template editor page
* Template version page
* Refactors existing log formatting in `cli/logs.go` to live in `codersdk`.
🤖 Generated with Claude Opus 4.5, reviewed by me.
---------
Co-authored-by: Claude <noreply@anthropic.com>
The AcquireProvisionerJob query only checked started_at IS NULL, allowing
it to acquire jobs that were canceled while pending (which have
completed_at set but started_at still NULL).
Added completed_at IS NULL check to the query to prevent this.
Also fixed JobCompleteBuilder.Do() in dbfake to set started_at when
completing jobs to match production behavior.
Fixes coder/internal#1323
## Summary
Previously, `CODER_PPROF_ADDRESS` and `CODER_PROMETHEUS_ADDRESS` were
hardcoded in the Helm chart template to `0.0.0.0:6060` and
`0.0.0.0:2112` respectively. These values could not be overridden via
`coder.env` values because the hardcoded values were set first in the
template, and Kubernetes uses the first occurrence of duplicate env
vars.
This was a security concern because binding to `0.0.0.0` exposes these
endpoints to any pod in the cluster:
- **pprof** can expose sensitive runtime information (goroutine stacks,
heap profiles, CPU profiles that may contain memory contents)
- **Prometheus metrics** may contain sensitive operational data
## Changes
1. **`helm/coder/templates/_coder.tpl`**: Added logic to check if the
user has set `CODER_PPROF_ADDRESS` or `CODER_PROMETHEUS_ADDRESS` in
`coder.env` before applying the default values. If the user provides a
value, the hardcoded default is skipped.
2. **`helm/coder/values.yaml`**: Updated documentation to:
- Remove these vars from the "cannot be overridden" list
- Add them to a new "can be overridden" section with security
recommendations
3. **Tests**: Added test cases for both override scenarios with
corresponding golden files.
## Usage
Users can now restrict pprof and prometheus to localhost only:
```yaml
coder:
env:
- name: CODER_PPROF_ADDRESS
value: "127.0.0.1:6060"
- name: CODER_PROMETHEUS_ADDRESS
value: "127.0.0.1:2112"
```
## Local Testing
To verify the fix locally:
```bash
# Update helm dependencies
cd helm/coder && helm dependency update
# Test default behavior (should show 0.0.0.0)
helm template coder . -f tests/testdata/default_values.yaml --namespace default | grep -A1 'CODER_PPROF_ADDRESS\|CODER_PROMETHEUS_ADDRESS'
# Test pprof override (should show 127.0.0.1:6060)
helm template coder . -f tests/testdata/pprof_address_override.yaml --namespace default | grep -A1 'CODER_PPROF_ADDRESS'
# Test prometheus override (should show 127.0.0.1:2112)
helm template coder . -f tests/testdata/prometheus_address_override.yaml --namespace default | grep -A1 'CODER_PROMETHEUS_ADDRESS'
# Run Go tests
cd tests && go test . -v
```
Fixes #21713
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Co-authored-by: uzair-coder07 <uzair@coder.com>
## Summary
Updates the description for the "Use" role in the workspace sharing
dropdown to explicitly mention that users with this permission can start
and stop the workspace, not just read and access it.
## Changes
- Updated the "Use" role description from "Can read and access this
workspace." to "Can read, access, start, and stop this workspace."
## Context
This clarification helps users understand the full scope of the "Use"
permission, which includes `ActionWorkspaceStart` and
`ActionWorkspaceStop` as defined in `coderd/database/db2sdk/db2sdk.go`.
---
*Created on behalf of @geokat*
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Fixes the state format for Workspace Sharing in `docs/manifest.json`.
Changes `"early_access"` to `"early access"` (with space, no underscore)
to match the format used by other early access entries and to fix builds
on coder/coder.com.
Follow-up to #21797.
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
This pull request adds a new documentation file that defines the
"code-review" skill for use in the project. The document outlines a
standard workflow, severity levels, key areas to focus on during code
reviews, and Coder-specific review guidelines. This aims to standardize
and improve the quality and consistency of code reviews across the team.
Documentation and process standardization:
* Added `.claude/skills/code-review/SKILL.md`, which describes the
code-review skill, including workflow steps, severity levels, what to
look for in reviews, and what not to comment on. It also provides
Coder-specific patterns and best practices for authorization, error
handling, and shell scripting.
This PR changes the shared workspaces documentation page from Beta to
Early Access status.
Changes `docs/manifest.json` to update the state from `["beta"]` to
`["early_access"]` for the Workspace Sharing page.
Ref: https://coder.com/docs/user-guides/shared-workspaces
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
When a workspace has multiple agents (e.g., main + devcontainer), the
build timeline was showing all events duplicated under each agent
instead of filtering by the agent they belong to.
Added agentId to the Stage type and filter timings by workspace_agent_id
so each agent section only shows its own events.
Fixes #18002
These tests use dbfake to set up database state directly and don't
need a provisioner daemon. Removing it fixes a flaky failure on
Windows where the provisioner daemon acquired a job that dbfake had
already "completed", causing the task status to be "error" instead
of "paused".
Fixes coder/internal#1322
Refs coder/internal#1323
Previously there were two issues that could cause incorrect boundary
usage telemetry data.
1. Bad handling across snapshot intervals: After telemetry snapshot deleted
the DB row, the next flush would INSERT the stale cumulative data (which
included already-reported usage). This would then be overwritten by
subsequent UPDATE flushes, causing the delta between the last snapshot
and the reset to be lost (under-reporting usage). Additionally, if there
was no new usage after the reset, the tracker would carry over all usage
from the previous period into the next period (over-reporting usage).
2. Missed usage from a race condition: Track() calls between the first
mutex unlock and second mutex lock in FlushToDB() were lost. The data
wasn't included in the current flush (already snapshotted) and was wiped
by the subsequent reset. This is likely low impact to overall usage
numbers in the real world.
Fix by tracking unique workspace/user deltas separately from cumulative
values and always tracking delta allowed/denied requests. Deltas are used
for INSERT (fresh start after reset), cumulative for UPDATE (accurate unique
counts within a period). All counters reset atomically before the DB operation
so Track() calls during the operation are preserved for the next flush.
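The race fix described above can be sketched as follows. This is a minimal illustration, not the actual tracker: `usageTracker`, `takeDelta`, and the `write` callback are hypothetical names, and only the allowed/denied delta counters are modeled. The point it shows is that the snapshot and the reset happen in one critical section, so a `Track()` call that arrives during the DB write is kept for the next flush.

```go
package main

import (
	"fmt"
	"sync"
)

// usageTracker is a hypothetical, simplified stand-in for the boundary
// usage tracker; it only models the allowed/denied delta counters.
type usageTracker struct {
	mu      sync.Mutex
	allowed int64
	denied  int64
}

// Track records usage observed since the last flush.
func (t *usageTracker) Track(allowed, denied int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.allowed += allowed
	t.denied += denied
}

// takeDelta snapshots and resets the counters in one critical section,
// so no Track call can land between the snapshot and the reset.
func (t *usageTracker) takeDelta() (allowed, denied int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	allowed, denied = t.allowed, t.denied
	t.allowed, t.denied = 0, 0
	return allowed, denied
}

// Flush hands the delta to write (standing in for the DB operation)
// without holding the lock, so concurrent Track calls are not blocked.
func (t *usageTracker) Flush(write func(allowed, denied int64)) {
	allowed, denied := t.takeDelta()
	write(allowed, denied)
}

func main() {
	t := &usageTracker{}
	t.Track(3, 1)
	t.Flush(func(allowed, denied int64) {
		t.Track(2, 0) // Simulates usage arriving while the write is in flight.
		fmt.Println("flushed:", allowed, denied)
	})
	allowed, denied := t.takeDelta()
	fmt.Println("kept for next flush:", allowed, denied)
}
```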
Archiving modules attempts to save as many modules as it can before it hits the limit, so the template works as far as possible instead of failing hard.
## Description
Adds authentication support for upstream proxies in `aibridgeproxyd`.
When credentials are provided in the upstream proxy URL, the
`Proxy-Authorization` header is now included in `CONNECT` requests.
## Changes
* Extract credentials from upstream proxy URL and set
`Proxy-Authorization` header on tunneled `CONNECT` requests
* Support optional user and password
* Fail at startup if both username and password are empty
* Add tests for all auth scenarios
Follow-up: https://github.com/coder/internal/issues/1204
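As a rough sketch of the credential handling listed above, assuming HTTP Basic authentication: `proxyAuthHeader` is a hypothetical helper, not the function in `aibridgeproxyd`, and the proxy URL in `main` is a placeholder.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
)

// proxyAuthHeader is a hypothetical helper. It returns the value for a
// Proxy-Authorization header, or "" when the URL carries no credentials.
func proxyAuthHeader(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", fmt.Errorf("parse upstream proxy URL: %w", err)
	}
	if u.User == nil {
		return "", nil // No userinfo: send CONNECT without the header.
	}
	user := u.User.Username()
	pass, _ := u.User.Password()
	// Either part may be omitted, but not both.
	if user == "" && pass == "" {
		return "", errors.New("upstream proxy URL has empty credentials")
	}
	creds := base64.StdEncoding.EncodeToString([]byte(user + ":" + pass))
	return "Basic " + creds, nil
}

func main() {
	header, err := proxyAuthHeader("http://user:pass@proxy.example.com:8080")
	if err != nil {
		panic(err)
	}
	fmt.Println("Proxy-Authorization:", header)
}
```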
Apply optimizations:
* https://github.com/openai/openai-go/pull/602
* https://github.com/coder/aibridge/pull/160
These reduce CPU time and allocation count for OpenAI `chat/completions`
and `responses` APIs, making the use of OpenAI chat models through AI
Bridge more performant.
In order to test these changes, we add scaletesting support for the
responses API.
## Summary
This PR restructures the Agent Boundaries documentation to improve URL
clarity and consistency:
### Changes
- Renames `/docs/ai-coder/boundary/` to
`/docs/ai-coder/agent-boundaries/`
- Renames `agent-boundary.md` to `index.md` for cleaner URLs
- Updates all internal doc references to the new paths
- Updates `manifest.json` with new paths
- Updates prose references from "Boundary" to "Agent Boundaries"
throughout the documentation (33 changes across 4 files)
### New URL structure
| Old URL | New URL |
|---------|----------|
| `/docs/ai-coder/boundary/agent-boundary` |
`/docs/ai-coder/agent-boundaries` |
| `/docs/ai-coder/boundary/nsjail` |
`/docs/ai-coder/agent-boundaries/nsjail` |
| `/docs/ai-coder/boundary/landjail` |
`/docs/ai-coder/agent-boundaries/landjail` |
| `/docs/ai-coder/boundary/rules-engine` |
`/docs/ai-coder/agent-boundaries/rules-engine` |
| `/docs/ai-coder/boundary/version` |
`/docs/ai-coder/agent-boundaries/version` |
### Follow-up required
Redirects need to be added to `coder/coder.com` for the old URLs:
- `/docs/ai-coder/agent-boundary` → `/docs/ai-coder/agent-boundaries`
(this one is currently 404'ing from Google search results)
- `/docs/ai-coder/boundary/:path*` →
`/docs/ai-coder/agent-boundaries/:path*`
---
Created on behalf of @mattvollmer
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Co-authored-by: Matt Vollmer <matthewjvollmer@outlook.com>
Liveness checks are currently causing pods to be killed during
long-running migrations.
They are generally not advisable for our workloads; if a pod becomes
unresponsive we _need_ to know about it (due to a deadlock, etc) and not
paper over the issue by killing the pod.
I've also made all probe settings configurable.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
## Summary
The `lint/actions/zizmor` target flakes in CI due to network
connectivity issues when running on depot runners
(https://github.com/coder/internal/issues/1233). The zizmor tool needs
to reach GitHub's API but intermittently fails with "Connection refused"
errors.
## Changes
- Creates a new `lint-actions` CI job that only runs when `.github/**`
files are touched (using existing `ci` filter)
- Removes zizmor from the main `lint` job
- Uses a Makefile conditional to include actionlint in `make lint`
locally but skip it in CI (where `lint-actions` handles it)
This reduces unnecessary flake exposure for PRs that don't modify GitHub
Actions files.
## Testing
- `actionlint` passes on the modified ci.yaml
- Verified Makefile conditional works: actionlint included locally,
skipped when `CI=true`
Fixes https://github.com/coder/internal/issues/1233
Closes #21044
This pull-request addresses an issue we were seeing where we would
attempt to filter the `<UserCombobox />` by the user's username or email,
not their name (which the rendered options would show).
To highlight this, I created three different users, each with a username
that did not contain their `email` or `name`, and attempted to filter.
Attempting to search for `John` wouldn't actually show the user, as his
username was `x`. In fact, although a subset of users might be returned
from the backend for having `john` in the `email`, they would have been
filtered out by the frontend for not matching the `name` field.
| Name | Username |
| --- | --- |
| `Jake` | `z` |
| `Jeff` | `y` |
| `John` | `x` |
| Previously | Now |
| --- | --- |
| <img width="560" height="547" alt="OLD_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/a0567264-0034-42ac-aba0-95b05c4f92dd"
/> | <img width="580" height="548" alt="NEW_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/1aa0c942-d340-4b1c-8dde-b97879525bfb"
/> |
## Description
When configuring a From address with a display name (e.g., `Coder System
<system@coder.com>`), the SMTP `MAIL FROM` command was incorrectly
receiving the full address string instead of just the bare email
address, causing `501 Invalid MAIL argument` errors on some SMTP
servers.
## Changes
- Updated `validateFromAddr` to return both:
- `envelopeFrom`: bare email for SMTP `MAIL FROM` command (RFC 5321)
- `headerFrom`: original address with display name for email header (RFC
5322)
Fixes #20727
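The envelope/header split can be illustrated with the standard `net/mail` parser. This is only a sketch under that assumption: `splitFromAddr` is a hypothetical name, and the real `validateFromAddr` may validate differently.

```go
package main

import (
	"fmt"
	"net/mail"
)

// splitFromAddr is a hypothetical helper. It returns the bare address for
// the SMTP MAIL FROM command and the original string for the From header.
func splitFromAddr(from string) (envelopeFrom, headerFrom string, err error) {
	addr, err := mail.ParseAddress(from)
	if err != nil {
		return "", "", fmt.Errorf("parse from address: %w", err)
	}
	return addr.Address, from, nil
}

func main() {
	envelope, header, err := splitFromAddr("Coder System <system@coder.com>")
	if err != nil {
		panic(err)
	}
	fmt.Println("MAIL FROM:", envelope)
	fmt.Println("From:", header)
}
```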
## Description
Mark `--ssh-hostname-prefix` flag and `CODER_SSH_HOSTNAME_PREFIX` env
variable as deprecated, recommending that users use
`--workspace-hostname-suffix` / `CODER_WORKSPACE_HOSTNAME_SUFFIX`
instead for consistency with Coder Desktop.
The deprecated option is now hidden from help output and docs but
remains functional for backward compatibility. When used, it will show a
deprecation warning pointing to the recommended alternative.
## Changes
- Added `UseInstead` pointing to `workspace-hostname-suffix` option
(triggers deprecation warning)
- Set `Hidden: true` to hide from CLI help and documentation
- Updated description to mention deprecation
- Regenerated docs and help files via `make gen`
Closes #18156
---
_Originally requested by @matifali in
https://github.com/coder/coder/pull/18085#discussion_r2115594447_
This pull-request addresses the size of the iconography within the
`<SingleSignOnSection />` section component. As a side-effect of the
changes in #21347 we are now rendering this too large.
Furthermore, to catch these issues in future we've introduced two new
stories within `SecurityPageView.stories.tsx` which render both `oidc`
and `github` login routes.
| Old | New |
| --- | --- |
| <img width="520" height="399" alt="OLD_SSO_PROVIDER"
src="https://github.com/user-attachments/assets/f6687b9a-d6bc-4bca-859a-0b59a3f6ba03"
/> | <img width="520" height="398" alt="NEW_SSO_PROVIDER"
src="https://github.com/user-attachments/assets/5beb8149-3e07-4dbc-9e0f-06f9207ecc59"
/> |
## Summary
The bottom admin bar (DeploymentBannerView) was showing a thick
scrollbar when content overflowed horizontally. This change applies the
native thin scrollbar style instead.
## Changes
- Added `[scrollbar-width:thin]` Tailwind CSS arbitrary value to the
deployment banner container
This uses the native CSS `scrollbar-width: thin` property which is
supported in modern browsers (Firefox, Chrome, Edge, Safari) and
provides a less obtrusive scrollbar when horizontal scrolling is needed.
## Testing
- The change is purely CSS and was verified with lint and format checks
passing
Co-authored-by: Cursor Agent <cursor@coder.com>
This pull-request resolves a really annoying issue with the `<TasksPage
/>` switcher control. Essentially every time I navigated to this page my
eyes were drawn to this button that felt out of place. I finally figured
out why: it's breaking the first rule of nested rounded
corners.
We should be using the following math to calculate the roundedness.
```
outerRadius - gap = innerRadius
```
<img width="852" height="596" alt="button-rounding"
src="https://github.com/user-attachments/assets/89de5d98-0891-4c9d-a5aa-66f722796630"
/>
## Summary
Adds support for pre-filling the OAuth2 application creation form via
URL query parameters.
## Query Parameters
| Parameter | Description |
|-----------|-------------|
| `name` | Pre-fills the "Application name" field |
| `callback_url` | Pre-fills the "Callback URL" field |
| `icon` | Pre-fills the "Application icon" field |
## Example
```
/deployment/oauth2-provider/apps/add?name=MyApp&callback_url=https://example.com/callback&icon=/icon/github.svg
```
This allows external tools or documentation to link directly to the
OAuth2 app creation page with pre-populated values.
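For an external tool that wants to generate such a link, a sketch along these lines could work. `prefillLink` is a hypothetical helper; only the path and the three parameter names come from the description above. Note that `url.Values` percent-encodes the values, unlike the hand-written example.

```go
package main

import (
	"fmt"
	"net/url"
)

// prefillLink is a hypothetical helper that builds a relative link to the
// OAuth2 app creation form with the three supported query parameters.
func prefillLink(name, callbackURL, icon string) string {
	q := url.Values{}
	q.Set("name", name)
	q.Set("callback_url", callbackURL)
	q.Set("icon", icon)
	u := url.URL{
		Path:     "/deployment/oauth2-provider/apps/add",
		RawQuery: q.Encode(), // Keys are emitted in sorted order.
	}
	return u.String()
}

func main() {
	fmt.Println(prefillLink("MyApp", "https://example.com/callback", "/icon/github.svg"))
}
```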
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's major version (unless you unignore this specific
dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's minor version (unless you unignore this specific
dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR
and stop Dependabot creating any more for the specific dependency
(unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore
conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will
remove the ignore condition of the specified dependency and ignore
conditions
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Update provisionerdserver to handle the changes introduced to
provisionerd in https://github.com/coder/coder/pull/21602
We now create a relationship between `workspace_agent_devcontainers` and
`workspace_agents` with the newly created `subagent_id`.
## Summary
Clarifies the [AI Bridge client config authentication
section](https://coder.com/docs/ai-coder/ai-bridge/client-config#authentication)
to explicitly state that only **Coder-issued tokens** are accepted.
## Changes
- Changed "API key" to "Coder API key" throughout the Authentication
section
- Added a note clarifying that provider-specific API keys (OpenAI,
Anthropic, etc.) will not work with AI Bridge
Fixes #21790
---
Created on behalf of @dannykopping
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Previously the task logs endpoint only worked when the workspace was
running, leaving users unable to view task history after pausing.
This change adds snapshot retrieval with state-based branching: active
tasks fetch live logs from AgentAPI, paused/initializing/pending tasks
return stored snapshots (providing continuity during pause/resume), and
error/unknown states return HTTP 409 Conflict.
The response includes snapshot metadata (snapshot, snapshot_at) to
indicate whether logs are live or historical.
Closes coder/internal#1254
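The state-based branching can be sketched as a single switch. The type, constant, and field names below are hypothetical; only the grouping of states and the 409 response come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
)

// taskStatus and its values are hypothetical names for the task states.
type taskStatus string

const (
	statusActive       taskStatus = "active"
	statusPaused       taskStatus = "paused"
	statusInitializing taskStatus = "initializing"
	statusPending      taskStatus = "pending"
	statusError        taskStatus = "error"
	statusUnknown      taskStatus = "unknown"
)

// logPlan says where the logs come from and which HTTP status to return.
type logPlan struct {
	HTTPStatus int
	Live       bool // Fetch live logs from AgentAPI.
	Snapshot   bool // Return the stored snapshot.
}

// planLogs maps a task state to a log source.
func planLogs(s taskStatus) logPlan {
	switch s {
	case statusActive:
		return logPlan{HTTPStatus: http.StatusOK, Live: true}
	case statusPaused, statusInitializing, statusPending:
		return logPlan{HTTPStatus: http.StatusOK, Snapshot: true}
	default:
		// Error, unknown, and any unrecognized state.
		return logPlan{HTTPStatus: http.StatusConflict}
	}
}

func main() {
	for _, s := range []taskStatus{statusActive, statusPaused, statusError} {
		fmt.Printf("%s: %+v\n", s, planLogs(s))
	}
}
```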
Operators need to know which API key was used in HTTP requests.
For example, if a key is leaking and a DDoS is underway using that key, operators need a way to identify the key in use and take steps to expire the key (see https://github.com/coder/coder/issues/21782).
_Disclaimer: created using Claude Opus 4.5_
During development of #21659 I approved some `<Paywall />` code that had
an extensive props system; however, I wasn't a huge fan of this. This
approach attempts to take it further, like something `shadcn` would,
wherein we define the `<Paywall />` (and its subset of components) and
we wrap around those when needed for `<PaywallAIGovernance />` and
`<PaywallPremium />`.
Theoretically there are no real CSS/design changes here. However, here is a
screenshot for posterity.
| Previously | Now |
| --- | --- |
| <img width="2306" height="614" alt="CleanShot 2026-01-29 at 10 56
05@2x"
src="https://github.com/user-attachments/assets/83a4aa1b-da74-459d-ae11-fae06c1a8371"
/> | <img width="2308" height="622" alt="CleanShot 2026-01-29 at 10 55
05@2x"
src="https://github.com/user-attachments/assets/4aa43b09-6705-4af3-86cc-edc0c08e53b1"
/> |
---------
Co-authored-by: Ben Potter <me@bpmct.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Description
Removes the following deprecated Prometheus metrics:
- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead
These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).
## Changes
- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name
Fixes #12999
The test was creating two template versions without explicit names,
relying on `namesgenerator.NameDigitWith()` which can produce
collisions. When both versions got the same random name, the test failed
with a 409 Conflict error.
Fix by giving each version an explicit name (`v1`, `v2`).
Closes https://github.com/coder/internal/issues/1309
---
*Generated by [mux](https://mux.coder.com)*
Add PeriodStart and PeriodDurationMilliseconds fields to BoundaryUsageSummary
so consumers of telemetry data can understand usage within a particular time window.
## Summary
This PR updates the note on the Tasks documentation page to more clearly
explain the relationship between Premium task limits and the AI
Governance Add-On.
## Problem
The previous wording:
> "Premium Coder deployments are limited to running 1,000 tasks. Contact
us for pricing options or learn more about our AI Governance Add-On to
evaluate all of Coder's AI features."
The "or" in this sentence could be interpreted as two separate paths:
(1) contact sales for custom pricing that might not require the add-on,
OR (2) get AI Governance. This led to confusion about whether higher
task limits could be obtained without the AI Governance Add-On.
## Solution
Updated the note to be explicit about the scaling path:
> "Premium deployments include 1,000 Agent Workspace Builds for
proof-of-concept use. To scale beyond this limit, the AI Governance
Add-On provides expanded usage pools that grow with your user count.
Contact us to discuss pricing."
This makes it clear that:
1. Premium includes 1,000 builds for POC use
2. Scaling beyond that requires the AI Governance Add-On
3. Contact sales to discuss pricing for the add-on
Created on behalf of @mattvollmer
---------
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Co-authored-by: Matt Vollmer <matthewjvollmer@outlook.com>
Justification:
- Populating `members` is authorized with `group_member.read` which is
not required to be able to share a workspace
- Populating `total_member_count` is authorized with `group.read` which
is required to be able to share
- The updated helper is only used in template/workspace sharing UIs, so
other pages that might need counts of readable members are unaffected
Related to: https://github.com/coder/internal/issues/1302
## Description
Adds Prometheus metrics to the AI Bridge Proxy for observability into
proxy traffic and performance.
## Changes
* Add Metrics struct with the following metrics:
* `connect_sessions_total`: counts CONNECT sessions by type
(mitm/tunneled)
* `mitm_requests_total`: counts MITM requests by provider
* `inflight_mitm_requests`: gauge tracking in-flight requests by
provider
* `mitm_request_duration_seconds`: histogram of request latencies by
provider
* `mitm_responses_total`: counts responses by status code class
(2XX/3XX/4XX/5XX) and provider
* Register metrics with `coder_aibridgeproxyd_` prefix in CLI
* Unregister metrics on server close to prevent registry leaks
* Add `tunneledMiddleware` to track non-allowlisted CONNECT sessions
* Add tests for metric recording in both MITM and tunneled paths
Closes: https://github.com/coder/internal/issues/1185
Adds a new subcommand to print the current session token for use in
scripts and automation, similar to `gh auth token`.
## Usage
```bash
CODER_SESSION_TOKEN=$(coder login token)
```
Fixes #21515
## Description
Add exponential backoff retries to all `go install` and `go mod
download` commands across CI workflows and actions.
## Why
Fixes
[coder/internal#1276](https://github.com/coder/internal/issues/1276) -
CI fails when `sum.golang.org` returns 500 errors during Go module
verification. This is an infrastructure-level flake that can't be
controlled.
## Changes
- Created `.github/scripts/retry.sh` - reusable retry helper with
exponential backoff (2s, 4s, 8s delays, max 3 attempts), using
`scripts/lib.sh` helpers
- Wrapped all `go install` and `go mod download` commands with retry in:
- `.github/actions/setup-go/action.yaml`
- `.github/actions/setup-sqlc/action.yaml`
- `.github/actions/setup-go-tools/action.yaml`
- `.github/workflows/ci.yaml`
- `.github/workflows/release.yaml`
- `.github/workflows/security.yaml`
- Added GNU tools setup (bash 4+, GNU getopt, make 4+) for macOS in
`test-go-pg` job, since `retry.sh` uses `lib.sh` which requires these
tools
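The helper itself is a shell script, but the backoff logic can be illustrated in Go. This is a sketch only: `backoffDelays`, `retry`, `noSleep`, and `errTransient` are hypothetical names, and how `retry.sh` counts attempts against delays is not shown here, so the number of retries is left as a parameter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// baseDelay mirrors the first delay mentioned above (2s).
const baseDelay = 2 * time.Second

// errTransient is a hypothetical error used only by the demo in main.
var errTransient = errors.New("transient failure")

// backoffDelays returns the waits between attempts: base, 2*base, 4*base...
func backoffDelays(base time.Duration, retries int) []time.Duration {
	if retries <= 0 {
		return nil
	}
	delays := make([]time.Duration, 0, retries)
	for d := base; len(delays) < retries; d *= 2 {
		delays = append(delays, d)
	}
	return delays
}

// noSleep lets the demo and tests run without waiting.
func noSleep(time.Duration) {}

// retry runs fn once, then once more after each delay until it succeeds
// or the delays are exhausted. It returns the last error.
func retry(delays []time.Duration, sleep func(time.Duration), fn func() error) error {
	err := fn()
	for _, d := range delays {
		if err == nil {
			return nil
		}
		sleep(d)
		err = fn()
	}
	return err
}

func main() {
	calls := 0
	err := retry(backoffDelays(baseDelay, 3), noSleep, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Passing `time.Sleep` instead of `noSleep` gives real waits; injecting the sleep function keeps the logic testable.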
## Summary
Fixes the broken Kilo Code documentation link in the AI Bridge
client-config page.
## Changes
- Updated the Kilo Code link from the old
`/docs/features/api-configuration-profiles` (returns 404) to the current
`/docs/ai-providers/openai-compatible` page
The Kilo Code documentation was restructured and the old URL no longer
exists.
Fixes #21750
Fixes: https://github.com/coder/internal/issues/560
"Select" CLI UI component should ignore "space" when `+Add custom value`
is highlighted. Otherwise it interprets that as a potential option...
and panics.
Fixes: coder/internal#767
Adds two new Prometheus metrics for license health monitoring:
- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors
Metrics endpoint after startup of a deployment with license enabled:
```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
fixes: https://github.com/coder/internal/issues/1304
Subscribe to heartbeats synchronously on startup of PGCoordinator. This ensures tests that send heartbeats don't race with this subscription.
Closes [#1246](https://github.com/coder/internal/issues/1246)
This PR adds a new component to display AI Governance user entitlements
in the Licenses Settings page. The implementation includes:
- New `AIGovernanceUsersConsumptionChart` component that shows the
number of entitled users for AI Governance features
- Storybook stories for various states (default, disabled, error states)
- Integration with the existing license settings page
- Collapsible "Learn more" section with links to relevant documentation
- Updated the ManagedAgentsConsumption component with clearer
terminology ("Agent Workspace Builds" instead of "Managed AI Agents")
The chart displays the number of users entitled to use AI features like
AI Bridge, Boundary, and Tasks, with a note that additional analytics
are coming soon.
### Preview
<img width="3516" height="2390" alt="CleanShot 2026-01-27 at 22 44
25@2x"
src="https://github.com/user-attachments/assets/cb97a215-f054-45cb-a3e7-3055c249ef04"
/>
<img width="3516" height="2390" alt="CleanShot 2026-01-27 at 22 45
04@2x"
src="https://github.com/user-attachments/assets/d2534189-cffb-4ad2-b2e2-67eb045572e8"
/>
---------
Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com>
This pull request makes a minor update to the documentation check
workflow. It clarifies that a comment should not be posted if there are
no documentation changes needed and simplifies the comment format
instructions.
The reaper (PID 1) now returns the child's exit code instead of always
exiting 0. Signal termination uses the standard Unix convention of 128 +
signal number.
Fixes #21661
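The exit code mapping can be sketched as a small pure function. `exitCodeFor` is a hypothetical name, and how the reaper actually obtains the child's wait status is not shown here.

```go
package main

import (
	"fmt"
	"syscall"
)

// exitCodeFor is a hypothetical helper. For a child that exited normally
// it returns the child's own code; for a child killed by a signal it
// returns 128 + the signal number, per the usual Unix shell convention.
func exitCodeFor(exited bool, code int, sig syscall.Signal) int {
	if exited {
		return code
	}
	return 128 + int(sig)
}

func main() {
	fmt.Println("normal exit:", exitCodeFor(true, 3, 0))
	fmt.Println("killed by SIGTERM:", exitCodeFor(false, 0, syscall.SIGTERM))
}
```

With this convention a child terminated by SIGTERM (15) is reported as 143, and one terminated by SIGKILL (9) as 137.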
My previous change to this test did not create another **workspace**
using the template containing `coder_ai_task` resources, meaning that
this test was not actually testing the right thing. This PR addresses
this oversight.
The test occasionally times out at 15s on Windows CI runners.
Investigation of CI logs shows the HTTP request to the agent's
gitsshkey endpoint never appears in server logs, suggesting it
hangs before the request completes (possibly in connection setup,
middleware, or database queries). Increase to 60s to reduce flake
rate.
Fixes coder/internal#770
## Description
Moves the provider lookup from `handleRequest` to `authMiddleware` so
that the provider is determined during the `CONNECT` handshake and
stored in the request context. This enables provider information to be
available earlier in the request lifecycle.
## Changes
* Move `aibridgeProviderFromHost` call from `handleRequest` to
`authMiddleware`
* Store `Provider` in `requestContext` during `CONNECT` handshake
* Add provider validation in `authMiddleware` (reject if no provider
mapping)
* Keep defensive provider check in `handleRequest` for safety
Follow-up from: https://github.com/coder/coder/pull/21617
Closes #20598
This pull-request implements a very basic change to also render the
`icon` of the `Preset` when we've specifically defined one within the
template. Furthermore, there's a `ⓘ` icon with a description.
### Preview
<img width="984" height="442" alt="CleanShot 2026-01-27 at 20 15 29@2x"
src="https://github.com/user-attachments/assets/d4ceebf9-a5fe-4df4-a8b2-a8355d6bb25e"
/>
2026-01-28 18:51:22 +11:00
1379 changed files with 136346 additions and 21830 deletions
| `sql.NullString`, `sql.NullInt64`, etc. | `sql.Null[T]` | 1.22 |
| Manual `ctx, cancel := context.WithCancel(…)` + `t.Cleanup(cancel)` | `t.Context()` (auto-canceled when test ends) | 1.24 |
| `if d < 0 { d = -d }` on durations | `d.Abs()` (handles `math.MinInt64`) | 1.19 |
| Implement only `TextMarshaler` | also implement `TextAppender` for alloc-free marshaling | 1.24 |
| Custom `Unwrap() error` on multi-cause errors | `Unwrap() []error` (slice form; required for tree traversal) | 1.20 |
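Two of the rows above, the slice form of `Unwrap` and `Duration.Abs`, can be illustrated with a short sketch. All type, function, and variable names here are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors for the demo.
var (
	errDisk    = errors.New("disk full")
	errNetwork = errors.New("network down")
	errOther   = errors.New("unrelated")
)

// multiError carries several causes. The slice form of Unwrap lets
// errors.Is and errors.As walk every cause (Go 1.20+).
type multiError struct{ causes []error }

func (m *multiError) Error() string   { return fmt.Sprintf("%d errors occurred", len(m.causes)) }
func (m *multiError) Unwrap() []error { return m.causes }

// hasCause reports whether target appears anywhere in err's tree.
func hasCause(err, target error) bool { return errors.Is(err, target) }

// skew returns the absolute difference between two durations using
// Duration.Abs (Go 1.19+) instead of a manual sign flip.
func skew(a, b time.Duration) time.Duration { return (a - b).Abs() }

func main() {
	err := fmt.Errorf("sync: %w", &multiError{causes: []error{errDisk, errNetwork}})
	fmt.Println(hasCause(err, errNetwork), hasCause(err, errOther))
	fmt.Println(skew(time.Second, 3*time.Second))
}
```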
## New capabilities
These enable things that weren't practical before. Reach for them in the
described situations.
| What | Since | When to use it |
|---|---|---|
| `cmp.Or(a, b, c)` | 1.22 | Defaults/fallback chains: returns first non-zero value. Replaces verbose `if a != "" { return a }` cascades. |
| `context.WithoutCancel(ctx)` | 1.21 | Background work that must outlive the request (e.g. async cleanup after HTTP response). Derived context keeps parent's values but ignores cancellation. |
| `context.AfterFunc(ctx, fn)` | 1.21 | Register cleanup that fires on context cancellation without spawning a goroutine that blocks on `<-ctx.Done()`. |
| `context.WithCancelCause` / `Cause` | 1.20 | When callers need to know WHY a context was canceled, not just that it was. Retrieve cause with `context.Cause(ctx)`. |
| `context.WithDeadlineCause` / `WithTimeoutCause` | 1.21 | Attach a domain-specific error to deadline/timeout expiry (e.g. distinguish "DB query timed out" from "HTTP request timed out"). |
| `errors.ErrUnsupported` | 1.21 | Standard sentinel for "not supported." Use instead of per-package custom sentinels. Check with `errors.Is`. |
| `http.ResponseController` | 1.20 | Per-request flush, hijack, and deadline control without type-asserting `ResponseWriter` to `http.Flusher` or `http.Hijacker`. |
| Enhanced `ServeMux` routing | 1.22 | `"GET /items/{id}"` patterns in `http.ServeMux`. Access with `r.PathValue("id")`. Wildcards: `{name}`, catch-all: `{path...}`, exact: `{$}`. Eliminates many third-party router dependencies. |
| `os.Root` / `OpenRoot` | 1.24 | Confined directory access that prevents symlink escape. 1.25 adds `MkdirAll`, `ReadFile`, `WriteFile` for real use. |
| `os.CopyFS` | 1.23 | Copy an entire `fs.FS` to local filesystem in one call. |
| `os/signal.NotifyContext` with cause | 1.26 | Cancellation cause identifies which signal (SIGTERM vs SIGINT) triggered shutdown. |
| `io/fs.SkipAll` / `filepath.SkipAll` | 1.20 | Return from `WalkDir` callback to stop walking entirely. Cleaner than a sentinel error. |
| `GOMEMLIMIT` env / `debug.SetMemoryLimit` | 1.19 | Soft memory limit for GC. Use alongside or instead of `GOGC` in memory-constrained containers. |
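A few of the entries above in action, as a minimal sketch that needs Go 1.22 or later. The function names and the default address are hypothetical and exist only for illustration.

```go
package main

import (
	"cmp"
	"context"
	"errors"
	"fmt"
)

// errShutdown is a hypothetical cause attached to a cancellation.
var errShutdown = errors.New("server shutting down")

// listenAddr picks the first non-empty value. The final default is a
// made-up address for the example.
func listenAddr(flagVal, envVal string) string {
	return cmp.Or(flagVal, envVal, "127.0.0.1:3000")
}

// cancelWithCause shows that ctx.Err() stays context.Canceled while
// context.Cause returns the specific reason.
func cancelWithCause() (ctxErr, cause error) {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errShutdown)
	return ctx.Err(), context.Cause(ctx)
}

// detachedSurvives reports whether a context derived with WithoutCancel
// stays alive after its parent is canceled.
func detachedSurvives() bool {
	parent, cancel := context.WithCancel(context.Background())
	detached := context.WithoutCancel(parent)
	cancel()
	return parent.Err() != nil && detached.Err() == nil
}

func main() {
	fmt.Println(listenAddr("", "0.0.0.0:8080"))
	ctxErr, cause := cancelWithCause()
	fmt.Println(ctxErr, "/", cause)
	fmt.Println(detachedSurvives())
}
```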
With issues: "## 🔍 Code Review\\n\\nReviewed [5-8 words].\\n\\n**Found X issues** (Y critical, Z nitpicks).\\n\\n---\\n*AI review via [Coder Tasks](https://coder.com/docs/ai-coder/tasks)*"
No issues: "## 🔍 Code Review\\n\\nReviewed [5-8 words].\\n\\n✅ **Looks good** - no production issues found.\\n\\n---\\n*AI review via [Coder Tasks](https://coder.com/docs/ai-coder/tasks)*"
</github_api_documentation>
<critical_rules>
1. Read ENTIRE files before commenting - use read_file or grep to verify
2. Check the EXACT line you're commenting on - does the issue actually exist there?
3. Suggestion block = ONLY replacement lines (never include unchanged surrounding lines)
CONTEXT="This is a NEW PR. Perform a thorough documentation review."
CONTEXT="This is a NEW PR. Perform initial documentation review."
;;
pr_updated)
CONTEXT="This PR was UPDATED with new commits. Only comment if the changes affect documentation needs or address previous feedback."
CONTEXT="This PR was UPDATED with new commits. Check if previous feedback was addressed or if new doc needs arose."
;;
label_requested)
CONTEXT="A documentation review was REQUESTED via label. Perform a thorough documentation review."
CONTEXT="A documentation review was REQUESTED via label. Perform a thorough review."
;;
ready_for_review)
CONTEXT="This PR was marked READY FOR REVIEW (converted from draft). Perform a thorough documentation review."
CONTEXT="This PR was marked READY FOR REVIEW. Perform a thorough review."
;;
manual)
CONTEXT="This is a MANUAL review request. Perform a thorough documentation review."
CONTEXT="This is a MANUAL review request. Perform a thorough review."
;;
*)
CONTEXT="Perform a thorough documentation review."
CONTEXT="Perform a documentation review."
;;
esac
# Build task prompt with PR-specific context
# Build task prompt with sticky comment logic
TASK_PROMPT="Use the doc-check skill to review PR #${PR_NUMBER} in coder/coder.
${CONTEXT}
Use \`gh\` to get PR details, diff, and all comments. Check for previous doc-check comments (from coder-doc-check) and only post a new comment if it adds value.
Use \`gh\` to get PR details, diff, and all comments. Look for an existing doc-check comment containing \`<!-- doc-check-sticky -->\` - if one exists, you'll update it instead of creating a new one.
**Do not comment if no documentation changes are needed.**
If a sticky comment already exists, compare your current findings against it:
- Check off \`[x]\` items that are now addressed
- Strikethrough items no longer needed (e.g., code was reverted)
- Add new unchecked \`[ ]\` items for newly discovered needs
- If an item is checked but you can't verify the docs were added, add a warning note below it
- If nothing meaningful changed, don't update the comment at all
Fast checks that catch most CI failures. Allow at least 5 minutes.
- **pre-push**: `make pre-push` (full CI suite including tests).
Runs before pushing to catch everything CI would. Allow at least
15 minutes (race tests are slow without cache).
`git commit` and `git push` will appear to hang while hooks run.
This is normal. Do not interrupt, retry, or reduce the timeout.
NEVER run `git config core.hooksPath` to change or disable hooks.
If a hook fails, fix the issue and retry. Do not work around the
failure by skipping the hook.
### Git Workflow
When working on existing PRs, check out the branch first:
@@ -198,13 +230,12 @@ reviewer time and clutters the diff.
**Don't delete existing comments** that explain non-obvious behavior. These
comments preserve important context about why code works a certain way.
**When adding tests for new behavior**, add new test cases instead of modifying
existing ones. This preserves coverage for the original behavior and makes it
clear what the new test covers.
**When adding tests for new behavior**, read existing tests first to understand what's covered. Add new cases for uncovered behavior. Edit existing tests as needed, but don't change what they verify.
"file is %d bytes which exceeds the maximum of %d bytes. Use grep, sed, or awk to extract the content you need, or use offset and limit to read a portion.",
			fileSize, limits.MaxFileSize,
		))
	}

	// Read the entire file (up to MaxFileSize).
	data, err := io.ReadAll(f)
	if err != nil {
		return errResp(fmt.Sprintf("read file: %s", err))
	}

	// Split into lines.
	content := string(data)

	// Handle empty file.
	if content == "" {
		return ReadFileLinesResponse{
			Success:    true,
			FileSize:   fileSize,
			TotalLines: 0,
			LinesRead:  0,
			Content:    "",
		}
	}

	lines := strings.Split(content, "\n")
	totalLines := len(lines)

	// offset is 1-based line number.
	if offset < 1 {
		offset = 1
	}
	if offset > int64(totalLines) {
		return errResp(fmt.Sprintf(
			"offset %d is beyond the file length of %d lines",
			offset, totalLines,
		))
	}

	// Default limit.
	if limit <= 0 {
		limit = int64(limits.MaxResponseLines)
	}

	startIdx := int(offset - 1) // convert to 0-based
	endIdx := startIdx + int(limit)
	if endIdx > totalLines {
		endIdx = totalLines
	}

	var numbered []string
	totalBytesAccumulated := 0

	for i := startIdx; i < endIdx; i++ {
		line := lines[i]

		// Per-line truncation.
		if len(line) > limits.MaxLineBytes {
			line = line[:limits.MaxLineBytes] + "... [truncated]"
		}

		// Format with 1-based line number.
		numberedLine := fmt.Sprintf("%d\t%s", i+1, line)
		lineBytes := len(numberedLine)

		// Check total byte budget.
		newTotal := totalBytesAccumulated + lineBytes
		if len(numbered) > 0 {
			newTotal++ // account for \n joiner
		}
		if newTotal > limits.MaxResponseBytes {
			return errResp(fmt.Sprintf(
				"output would exceed %d bytes. Read less at a time using offset and limit parameters.",
				limits.MaxResponseBytes,
			))
		}

		// Check line count.
		if len(numbered) >= limits.MaxResponseLines {
			return errResp(fmt.Sprintf(
				"output would exceed %d lines. Read less at a time using offset and limit parameters.",