cf. https://github.com/coder/coder/pull/13192#issuecomment-2097657692
We need to wait for PGCoordinator to finish its work before returning on `Close()`, so that we delete database state (best effort -- if this fails others will filter it out based on heartbeats).
Terraform changed the default output of the `terraform graph` command. You must put `-type=plan` to keep the prior behavior.
Co-authored-by: Colin Adler <colin1adler@gmail.com>
Includes db schema and dbauthz layer for upserting custom roles. Unit tests in `customroles_test.go` verify against escalating permissions through this feature.
Verifies our built in roles are valid according to our policy.go. Working on custom roles requires the dynamic roles to adhere to these rules. Feels fair the built in ones do too.
* wip: commit progress on test revamps
* fix: update existing tests to new format
* chore: add test case for global snackbar
* refactor: consolidate files
* refactor: make http dependency more explicit
* chore: add extra test case for exposed error value
* docs: fix typos
* fix: make sure clipboard is reset between test runs
* docs: add more context to comments
* refactor: update mock console.error logic to use jest.spyOn
* docs: add more clarifying comments
* refactor: split off type alias for clarity
Removes our pseudo rbac resources like `WorkspaceApplicationConnect` in favor of additional verbs like `ssh`. This is to make more intuitive permissions for building custom roles.
The source of truth is now `policy.go`
Just moved `rbac.Action` -> `policy.Action`. This is for the stacked PR to not have circular dependencies when doing autogen. Without this, the autogen can produce broken golang code, which prevents the autogen from compiling.
So just avoiding circular dependencies. Doing this in its own PR to reduce LoC diffs in the primary PR, since this has 0 functional changes.
* wip: commit progress on code split-up
* wip: commit more progress
* wip: finish initial version of class implementation
* chore: update all import paths to go through client instance
* fix: remove temp comments
* refactor: smoooooooosh the API
* refactor: update import setup for tests
- `DERPForceWebSockets`: Test that DERP over WebSocket (as well as `DERPForceWebSockets`) works. This does not test the actual DERP failure detection code and automatic fallback.
- `DERPFallbackWebSockets`: Test that falling back to DERP over WebSocket works.
Also:
- Rearranges some test code and refactors `TestTopology.StartServer` to be `TestTopology.ServerOptions` and take a struct instead of a function
Closes #13045
One cause of #13139 is a peculiar failure mode of `WebsocketNetConn` which causes it to return `context.Canceled` in some circumstances when the underlying websocket fails. We have special processing for that error in the `agent.run()` routine, which is erroneously being triggered.
Since we don't actually need the returned context from `WebsocketNetConn`, we can simplify and just use the netConn from the `websocket` library directly.
Fixes #13139
Using a bare channel to signal dependent goroutines means that we can only signal success, not failure, which leads to deadlock if we fail in a way that doesn't cause the whole `apiConnRoutineManager` to tear down routines.
Instead, we use a new object called a `checkpoint` that signals success or failure, so that dependent routines get unblocked if the routine they depend on fails.
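A minimal sketch of the checkpoint idea, not the agent's actual code: the type shape, the method names `complete` and `wait`, and the `dependentResult` helper are all assumptions made for illustration. The point is that a dependent goroutine is unblocked on failure as well as on success.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errUpstream is a hypothetical failure reported by the routine others depend on.
var errUpstream = errors.New("upstream routine failed")

// checkpoint records a single success-or-failure result and lets any
// number of dependents wait for it.
type checkpoint struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newCheckpoint() *checkpoint {
	return &checkpoint{done: make(chan struct{})}
}

// complete stores the result; only the first call has any effect.
func (c *checkpoint) complete(err error) {
	c.once.Do(func() {
		c.err = err
		close(c.done)
	})
}

// wait blocks until complete is called or ctx is done.
func (c *checkpoint) wait(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-c.done:
		return c.err
	}
}

// dependentResult (hypothetical helper) starts a dependent goroutine, completes
// the checkpoint with upstreamErr, and returns what the dependent observed.
func dependentResult(upstreamErr error) error {
	cp := newCheckpoint()
	result := make(chan error, 1)
	go func() { result <- cp.wait(context.Background()) }()
	cp.complete(upstreamErr)
	return <-result
}

func main() {
	fmt.Println(dependentResult(nil))
	fmt.Println(dependentResult(errUpstream))
}
```

With a bare channel, the failure case would leave the dependent goroutine blocked forever; here it receives the error and can exit.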
Removes a check for `context.Canceled` inside the `handleManifest` routine. This checking is handled in the `apiConnRoutineManager`, so checking inside the handler is redundant.
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost instead. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.
* chore: allow terraform & echo built-in provisioners
Built-in provisioners serve all specified types. This allows running terraform, echo, or both in built in.
The cli flag to control the types is hidden by default, to be used primarily for testing purposes.
* fix: remove some of the jank around our core App component
* refactor: scope navigation logic more aggressively
* refactor: add explicit return type to useAuthenticated
* refactor: clean up ProxyContext code
* wip: add code for consolidating the HTML metadata
* refactor: clean up hook logic
* refactor: rename useHtmlMetadata to useEmbeddedMetadata
* fix: correct names that weren't updated
* fix: update type-safety of useEmbeddedMetadata further
* wip: switch codebase to use metadata hook
* refactor: simplify design of metadata hook
* fix: update stray type mismatches
* fix: more type fixing
* fix: resolve illegal invocation error
* fix: get metadata issue resolved
* fix: update comments
* chore: add unit tests for MetadataManager
* fix: beef up tests
* fix: update typo in tests
* docs: add island integration guide
* make: fmt
* F
omit F
* fix: naming and manifest
---------
Co-authored-by: Matt Vollmer <matthewjvollmer@outlook.com>
I initially made this change when hacking wgengine to also capture wireguard packets going into the magicsock, so that we could capture the initial wireguard handshake.
I don't think we should ship that additional capture logic, but... it seems generally useful to capture packets from the get go on speedtest, so that you can see disco and pings before the TCP speedtest session starts.
When starting a workspace, if the deadline crosses an autostart boundary, the deadline is set to autostart + TTL.
This copies the behavior in `ActivityBumpWorkspace`, but does not require activity.
* chore: dynamically determine gitlab external auth defaults
Static defaults work for github cloud, but not self hosted.
Self hosted setups will now have sane defaults if omitted.
* feat: influence parameter defaults through cli flag/env
Add a --parameter-default flag / CODER_RICH_PARAMETER_DEFAULT
environment variable which overrides default values suggested for
parameters.
This allows scripts or middleware wrapping the CLI to substitute
defaults for parameter values beyond those defined at the template
level. For example, Git repository/branch parameters can be given
defaults based on the current checkout, or default parameter values can
be parsed out of files inside the repo.
* Rename defaults arg to defaultOverrides
This PR adds a command to bump versions in docs/markdown.
This is still standalone and needs to be wired up.
For now, I'm planning on putting this in `scripts/release.sh` (checkout main -> autoversion (this command) -> commit -> submit PR).
It would be pretty neat to make this a GH Action that's triggered on release though, something for the future.
Part of #12465
* fix: update API code to use Axios instance
* docs: fix typo
* fix: update all global axios imports to use Coder instance
* fix: remove needless import
* fix: update import order
* refactor: rename coderAxiosInstance to axiosInstance
* docs: update variable reference in FE contributing docs
* chore: reduce requests the dashboard makes from seeded data
We already inject all of this content in `index.html`.
There was also a bug with displaying a loading indicator when
the workspace proxies endpoint 404s.
* Fix first user fetch
* Add util
* Add cached query for entitlements and experiments
* Fix authmethods unnecessary request
* Fix unnecessary region request
* Fix fmt
* Debug
* Fix test
Adds checks to coderd/healthcheck/derphealth for STUN issues:
- Alerts if there is not at least one healthy STUN server,
- Alerts if we see variable port mapping.
* chore: skip global.setup if first user already exists
treat test as a setup, rather than a test
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
---------
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
Fixes #12780
Adds indexes to the `tailnet_tunnels` table to speed up `GetTailnetTunnelPeerIDs` and `GetTailnetTunnelPeerBindings` queries, which match on `src_id` and `dst_id`.
When an agent receives a node, it responds with an ACK which is relayed
to the client. After the client receives the ACK, it's allowed to begin
pinging.
* chore: nix shell to support playwright e2e tests
nix is running an older version of chromium, so had to reduce the
playwright version.
* Add to e2e readme
* add enterprise test comment
* add note about install to readme
* make fmt
* remove shellhook message
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
* add link to nixos playwright package to get version
* formatting
---------
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
* chore: verify pass through external auth query params
Unit test added to verify behavior of query params set in the
auth url for external apps. This behavior is intended to specifically
support Auth0 audience query param.
fixes #12923
Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.
It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
* chore: merge apikey/token session config values
There is a confusing difference between an apikey and a token. This
difference leaks into our configs. This change does not resolve the
difference. It only groups the config values to try and manage any
bloat that occurs from adding more similar config values
A customer hit roughly 200k `ErrSessionShutdown` errors, which just duplicate any errors we would have generated when shutting down the session, e.g. for Ping failures.
* chore: remove InsertWorkspaceAgentStat query
InsertWorkspaceAgentStats (batch) exists. We only used the singular in
a single unit test place. Removing the single for the batch, reducing
the interface size.
* 2.10.0 changelog
* updated install docs for mainline/stable releases
* make fmt
* cpp icon -> C++
* added disclaimer on MAX_TTL, support bundle info
* 'release schedule'
* lowercase mainline
* Agent OOM protection info
* minor tweak
* docs: describe mutually exclusive create workspace template fields
Ideally we could do this in the OpenAPI spec, but there is no first
class "mutually exclusive" feature in OpenAPI. So in lieu of something
more complex, or changing our struct/validation, a description comment
should suffice.
* chore: Add description to code sample as well
* chore: use fork of chroma to remove unused inits
This seems fine to do since compilation errors would occur
if it were actually in use.
Everything seems fine here.
* Update validator
* feat(cli): add golden tests for errors (#11588)
Creates golden files from `coder/cli/errors.go`.
Adds a unit test to test against golden files.
Adds a make file command to regenerate golden files.
Abstracts test against golden files.
* chore: merge authorization contexts
Instead of 2 auth contexts from apikey and dbauthz, merge them to
just use dbauthz. It is annoying to have two.
* fixup authorization reference
* docs: document how to run workspace-proxy as a system service
* Update workspace-proxies.md
* Update workspace-proxies.md
Co-authored-by: Muhammad Atif Ali <me@matifali.dev>
* docs: fix duplication
---------
Co-authored-by: Muhammad Atif Ali <me@matifali.dev>
NOTE: terraform-provider-coder was updated to facilitate this change, and your template will require v0.19.0 for this feature to work. You can run `terraform init -upgrade` in your template directory. If you have a version constraint set, ensure it points to this version.
pbkdf2 is too expensive to run in init, so this change makes it load
lazily. I introduced a lazy package that I hope to use more in my
`GODEBUG=inittrace=1` adventure.
Benchmark results:
```
$ hyperfine "coder --help" "coder-new --help"
Benchmark 1: coder --help
Time (mean ± σ): 82.1 ms ± 3.8 ms [User: 93.3 ms, System: 30.4 ms]
Range (min … max): 72.2 ms … 90.7 ms 35 runs
Benchmark 2: coder-new --help
Time (mean ± σ): 52.0 ms ± 4.3 ms [User: 62.4 ms, System: 30.8 ms]
Range (min … max): 41.9 ms … 62.2 ms 52 runs
Summary
coder-new --help ran
1.58 ± 0.15 times faster than coder --help
```
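For illustration, lazy loading of an expensive value can be sketched with `sync.Once`. This is not the actual lazy package from this change: the names `Value`, `New`, and `Load`, and the stand-in for the pbkdf2 work, are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Value computes its result on the first Load and caches it afterwards.
type Value[T any] struct {
	once sync.Once
	fn   func() T
	val  T
}

// New wraps fn without running it.
func New[T any](fn func() T) *Value[T] {
	return &Value[T]{fn: fn}
}

// Load runs fn at most once, even under concurrent callers.
func (v *Value[T]) Load() T {
	v.once.Do(func() { v.val = v.fn() })
	return v.val
}

// derivations counts how often the stand-in for the expensive work ran.
var derivations int

// derivedKey stands in for a costly key derivation that used to run in init.
var derivedKey = New(func() string {
	derivations++
	return "derived-key"
})

func main() {
	fmt.Println("before first use:", derivations)
	fmt.Println(derivedKey.Load())
	derivedKey.Load()
	fmt.Println("after two loads:", derivations)
}
```

Commands such as `--help` that never call `Load` never pay for the derivation, which is where the startup saving comes from.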
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
Just upgraded to macOS 14.4 and TestAgentScript/Run fails for me with error `signal: killed`. I opened the test directory in a terminal and sure enough, when you execute the `echo` binary, it is immediately killed. The binary has no extended attributes and is byte-identical to the one in `/bin/`.
This fix uses a different strategy: instead of copying the `echo` binary from the system around, we just copy a small bash script that _calls_ the `echo` command.
This cleans up `root.go` a bit, adds tests for middleware HTTP transport
functions, and removes two HTTP requests we always performed previously
when executing *any* client command.
It should improve CLI performance (especially for users with higher latency).
This PR updates the tests in `insights_test.go` to enable commented-out scenarios. This behavior was fixed by previous PRs in this stack. Note that the updated golden files are correct since they are "second template only" meaning that the newly introduced data is considered as expected. In other golden files there is no change since "only count once" is applied.
This PR updates the `*ByTemplate` insights queries used for generating Prometheus metrics to behave the same way as the new rollup query and re-written insights queries that utilize the rolled up data.
Add `dbrollup` service that runs the `UpsertTemplateUsageStats` query
every 5 minutes, on the minute. This allows us to have fairly real-time
insights data when viewing "today".
Add `template_usage_stats` table for aggregating template usage data.
Data is rolled up by the `UpsertTemplateUsageStats` query, which fetches
data from the `workspace_agent_stats` and `workspace_app_stats` tables.
- Updates existing tests under workspaceapps/apptest to not reuse existing appDetails as assertWorkspaceLastUsed(Not)?Updated calls FlushStats() which was causing racy test behaviour and incorrect test assertions.
- Expands scope of assertWorkspaceLastUsedAtUpdated and its counterpart to ProxySubdomain tests.
This more closely aligns with GitHub's label search style. Actual search params need to be converted to allow this format, by default they will throw an error if they do not support listing.
This PR updates the coder port-forward command to periodically inform coderd that the workspace is being used:
- Adds workspaceusage.Tracker which periodically batch-updates workspace LastUsedAt
- Adds coderd endpoint to signal workspace usage
- Updates coder port-forward to periodically hit this endpoint
- Modifies BatchUpdateWorkspacesLastUsedAt to avoid overwriting with stale data
Co-authored-by: Danny Kopping <danny@coder.com>
* chore: remove max_ttl from templates
Completely removing max_ttl as a feature on template scheduling. Must use other template scheduling features to achieve autostop.
* chore: add org ID as optional param to AcquireJob
* chore: plumb through organization id to provisioner daemons
* add org id to provisioner domain key
* enforce org id argument
* dbgen provisioner jobs defaults to default org
In the peer healthcheck code, when an error pinging peers is detected we
write a "replicaErr" string with the error reason. However, if there are
no peer replicas to ping we returned early without setting the string to
empty. This would cause replicas that had peers (which were failing) and
then the peers left to permanently show an error until a new peer
appeared.
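The stale-error bug can be sketched as follows. This is a simplified illustration, not the real healthcheck code: the `replicaHealth` type, the `check` signature, and `errPing` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errPing is a hypothetical ping failure.
var errPing = errors.New("ping timed out")

// replicaHealth holds the last error seen while pinging peer replicas.
type replicaHealth struct {
	replicaErr string
}

// check pings every peer and records the first failure. The error string is
// reset up front, so the early return for "no peers" cannot leave a stale
// error behind.
func (h *replicaHealth) check(peers []string, ping func(peer string) error) {
	h.replicaErr = ""
	if len(peers) == 0 {
		return
	}
	for _, p := range peers {
		if err := ping(p); err != nil {
			h.replicaErr = fmt.Sprintf("ping replica %s: %v", p, err)
			return
		}
	}
}

func main() {
	h := &replicaHealth{}
	failing := func(string) error { return errPing }

	h.check([]string{"replica-a"}, failing)
	fmt.Printf("with failing peer: %q\n", h.replicaErr)

	// The failing peer has left; the error should not linger.
	h.check(nil, failing)
	fmt.Printf("after peers left: %q\n", h.replicaErr)
}
```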
Also demotes DERP replica checking to a "warning" rather than an "error"
which should prevent the primary from removing the proxy from the region
map if DERP meshing is non-functional. This can happen without causing
problems if the peer is shutting down so we don't want to disrupt
everything if there isn't an issue.
Fixes #10760
The coder CLI quietly accepts any subcommand arguments and silently swallows them.
Currently:
```sh
❯ coder | head -n5
coder v2.3.3+e491217
USAGE:
coder [global-flags] <subcommand>
```
```sh
❯ coder idontexist | head -n5
coder v2.3.3+e491217
USAGE:
coder [global-flags] <subcommand>
```
Now help output will not be shown when there is an unknown subcommand error. Instead users will be given the command for the help output.
```sh
❯ coder idontexist
Encountered an error running "coder", see "coder --help" for more information
error: unrecognized subcommand "idontexist"
```
```sh
❯ coder iexistbut idontexist
Encountered an error running "coder iexistbut", see "coder iexistbut --help" for more information
error: unrecognized subcommand "idontexist"
```
Also this stuff: `Encountered an error running "coder iexistbut"... ` gets written to `os.Stdout` in `prettyErrorFormatter{w: os.Stderr, verbose: r.verbose}`, not sure how to test that output.
* fix: separate signals for passive, active, and forced shutdown
`SIGTERM`: Passive shutdown stopping provisioner daemons from accepting new
jobs but waiting for existing jobs to successfully complete.
`SIGINT` (old existing behavior): Notify provisioner daemons to cancel in-flight jobs, wait 5s for jobs to be exited, then force quit.
`SIGKILL`: Untouched from before, will force-quit.
* Revert dramatic signal changes
* Rename
* Fix shutdown behavior for provisioner daemons
* Add test for graceful shutdown
Apptest requires a port without a listening server to test failure
cases. This port was chosen and had a chance of actually being
provisioned. To prevent this accident, a port <1k is chosen,
since those will never be allocated.
Every time I run `pnpm` in the project it adds the package manager attribute on package.json so I just decided to push it since it does not look like an issue and we can make sure everyone is running the same pnpm version.
fixes #11950
https://github.com/coder/coder/issues/11950#issuecomment-1987756088 explains the bug
We were also calling into `Unlisten()` and `Close()` while holding the mutex. I don't believe that `Close()` depends on the notification loop being unblocked, but it's hard to be sure, and the safest thing to do is assume it could block.
So, I added a unit test that fakes out `pq.Listener` and sends a bunch of notifies every time we call into it to hopefully prevent regression where we hold the mutex while calling into these functions.
It also removes the use of a `context.Context` to stop the PubSub -- it must be explicitly `Close()`d. This simplifies a bunch of the logic, and is how we use the pubsub anyway.
Allow `coder login` to log into existing deployment if available.
Update help and error messages to indicate that `coder login` is
available as a command.
Fixes #10925
Fixes #9551
* coderd: add test to reproduce trailing directory issue
* coderd: add trailing path separator to dir entries when converting to zip
* provisionersdk: add trailing path separator to directory entries
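A small sketch of why the trailing separator matters, using only the standard `archive/zip` package, which treats an entry as a directory when its name ends in `/`. The helpers `zipEntryName` and `readsBackAsDir` are hypothetical and not taken from coderd or provisionersdk.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"strings"
)

// zipEntryName appends a trailing slash to directory names so that zip
// readers recognize the entry as a directory.
func zipEntryName(name string, isDir bool) string {
	if isDir && !strings.HasSuffix(name, "/") {
		return name + "/"
	}
	return name
}

// readsBackAsDir writes a single empty entry into an in-memory zip archive
// and reports whether archive/zip sees it as a directory when read back.
func readsBackAsDir(entryName string) (bool, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	if _, err := zw.Create(entryName); err != nil {
		return false, err
	}
	if err := zw.Close(); err != nil {
		return false, err
	}
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return false, err
	}
	return zr.File[0].FileInfo().IsDir(), nil
}

func main() {
	for _, name := range []string{"modules", zipEntryName("modules", true)} {
		isDir, err := readsBackAsDir(name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q read back as directory: %v\n", name, isDir)
	}
}
```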
* chore: rename useTab to useSearchParamsKey and add test
* chore: mark old renderHookWithAuth as deprecated (temp)
* fix: update imports for useResourcesNav
* refactor: change API for useSearchParamsKey
* chore: let user pass in their own URLSearchParams value
* refactor: clean up comments for clarity
* fix: update import
* wip: commit progress on useWorkspaceDuplication revamp
* chore: migrate duplication test to new helper
* refactor: update code for clarity
* refactor: reorder test cases for clarity
* refactor: split off hook helper into separate file
* refactor: remove reliance on internal React Router state property
* refactor: move variables around for more clarity
* refactor: more updates for clarity
* refactor: reorganize test cases for clarity
* refactor: clean up test cases for useWorkspaceDupe
* refactor: clean up test cases for useWorkspaceDupe
* fix: do not set max deadline for workspaces on template update
When templates are updated and schedule data is changed, we update all
running workspaces to have up-to-date scheduling information that sticks
to the new policy.
When updating the max_deadline for existing running workspaces, if the
max_deadline was before now()+2h we would set the max_deadline to
now()+2h.
Builds that don't/shouldn't have a max_deadline have it set to 0, which
is always before now()+2h, and thus would always have the max_deadline
updated.
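The failure mode can be sketched like this. It assumes the fix is to leave a zero `max_deadline` untouched; the function name `nextMaxDeadline`, the `minLead` constant, and the sample times are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// minLead mirrors the two-hour minimum described above.
const minLead = 2 * time.Hour

// noDeadline is the zero time, meaning the build has no max deadline.
var noDeadline time.Time

// sampleNow is a fixed "now" used for the demonstration.
var sampleNow = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// nextMaxDeadline pushes an existing max deadline out to at least
// now+minLead, but leaves a zero (unset) deadline alone. Without the IsZero
// check, the zero time is always before now+minLead and would be overwritten.
func nextMaxDeadline(current, now time.Time) time.Time {
	if current.IsZero() {
		return current
	}
	if earliest := now.Add(minLead); current.Before(earliest) {
		return earliest
	}
	return current
}

func main() {
	fmt.Println(nextMaxDeadline(noDeadline, sampleNow).IsZero())
	fmt.Println(nextMaxDeadline(sampleNow.Add(time.Hour), sampleNow))
	fmt.Println(nextMaxDeadline(sampleNow.Add(6*time.Hour), sampleNow))
}
```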
* test: add unit test to exercise template schedule bug
---------
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
I noticed in my logs that sometimes `coder ssh` doesn't gracefully disconnect from the coordinator.
The cause is the `closerStack` construct we use in that function. It has two paths to start closing things down:
1. explicit `close()` which we do in `defer`
2. context cancellation, which happens if the cli function returns an error
Sometimes the ssh remote command returns an error, and this triggers context cancellation of the `closerStack`. That is fine in and of itself, but we still want the explicit `close()` to wait until everything is closed before returning, since that's where we do cleanup, including the graceful disconnect. Prior to this fix the `close()` just immediately exits if another goroutine is closing the stack. Here we add a wait until everything is done.
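A rough sketch of the waiting behavior, not the CLI's real `closerStack`: the fields, the `done` channel, and the `secondCloseWaits` helper are assumptions. A second caller of `close()` blocks on `done` instead of returning while cleanup is still running.

```go
package main

import (
	"fmt"
	"sync"
)

// closerStack runs registered cleanup functions in LIFO order, exactly once.
// Pushing after close has started is not handled in this sketch.
type closerStack struct {
	mu      sync.Mutex
	closers []func()
	closing bool
	done    chan struct{}
}

func newCloserStack() *closerStack {
	return &closerStack{done: make(chan struct{})}
}

func (s *closerStack) push(f func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closers = append(s.closers, f)
}

// close runs the cleanup functions. If another goroutine is already closing
// the stack, it waits for that goroutine to finish before returning.
func (s *closerStack) close() {
	s.mu.Lock()
	if s.closing {
		s.mu.Unlock()
		<-s.done
		return
	}
	s.closing = true
	closers := s.closers
	s.mu.Unlock()

	for i := len(closers) - 1; i >= 0; i-- {
		closers[i]()
	}
	close(s.done)
}

// secondCloseWaits (hypothetical helper) reports whether a second close()
// returns only after the cleanup started by the first one has finished.
func secondCloseWaits() bool {
	s := newCloserStack()
	started := make(chan struct{})
	release := make(chan struct{})
	cleaned := false
	s.push(func() {
		close(started)
		<-release // simulate slow cleanup, e.g. a graceful disconnect
		cleaned = true
	})

	go s.close() // first closer, e.g. triggered by context cancellation
	<-started

	returned := make(chan struct{})
	go func() {
		s.close() // the explicit, deferred close
		close(returned)
	}()

	select {
	case <-returned:
		return false // returned while cleanup was still in progress
	default:
	}
	close(release)
	<-returned
	return cleaned
}

func main() {
	fmt.Println("second close waited for cleanup:", secondCloseWaits())
}
```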
* fix: ensure auto-workspace creation waits until all parameters are ready
* refactor: move creation blocking logic to main callback
* fix: let creation start if experimental feature is off
Adds a `--debug` flag to `scripts/develop.sh` that will start coder under `dlv debug` instead.
You can then use e.g. the following launch snippet to connect dlv:
```
{
"name": "Delve Remote",
"type": "go",
"request": "attach",
"mode": "remote",
"port": 12345,
}
```
You can also run individual CLI commands under dlv e.g.
```
debug=1 scripts/coder-dev.sh list
```
Also sets CGO_ENABLED=0 in develop.sh by default.
- Reworks the proxy registration loop into a struct (so I can add a `RegisterNow` method)
- Changes the proxy registration loop interval to 15s (previously 30s)
- Adds test which tests bidirectional DERP meshing on all possible paths between 6 workspace proxy replicas
Related to https://github.com/coder/customers/issues/438
This fixes a vulnerability with the `CODER_OIDC_EMAIL_DOMAIN` option,
where users with a superset of the allowed email domain would be allowed
to login. For example, given `CODER_OIDC_EMAIL_DOMAIN=google.com`, a
user would be permitted entry if their email domain was
`colin-google.com`.
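The difference between a suffix-style check and an exact domain comparison can be sketched as below. Both functions are illustrative: the exact shape of the old and new checks in coderd is an assumption, and the names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixMatch shows the vulnerable pattern (assumed here): any email that
// merely ends with the allowed domain passes.
func suffixMatch(email, allowed string) bool {
	return strings.HasSuffix(strings.ToLower(email), strings.ToLower(allowed))
}

// emailDomainAllowed compares the full domain after the last "@" against the
// allow list, so "colin-google.com" does not match "google.com".
func emailDomainAllowed(email string, allowed []string) bool {
	at := strings.LastIndex(email, "@")
	if at < 0 || at == len(email)-1 {
		return false
	}
	domain := email[at+1:]
	for _, a := range allowed {
		if strings.EqualFold(domain, a) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"google.com"}
	for _, email := range []string{"alice@google.com", "mallory@colin-google.com"} {
		fmt.Printf("%s: suffix=%v exact=%v\n",
			email, suffixMatch(email, allowed[0]), emailDomainAllowed(email, allowed))
	}
}
```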
Part of #12163
- Adds a command coder support bundle <workspace> that generates a
support bundle and writes it to coder-support-$(date +%s).zip.
- Note: this is hidden currently until the rest of the functionality is fleshed out.
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
- Adds more testcases to TestAcquirer_MatchTags
- Adds functionality to generate a table from above test
- Update provisioner tag documentation with generated table
- Apply other feedback from #12315
DERP mesh key setup would do a SELECT and then an INSERT on failure, without a lock. During some testing with multiple replicas, I managed to cause a replica to crash due to them initializing simultaneously.
Fixes:
Encountered an error running "coder server"
create coder API: insert mesh key: pq: duplicate key value violates unique constraint "site_configs_key_key"
Co-authored-by: Cian Johnston <cian@coder.com>
* refactor: clean up and update API for useClipboard
* wip: commit current progress on useClipboard test
* docs: clean up wording on showCopySuccess
* chore: make sure tests can differentiate between HTTP/HTTPS
* chore: add test ID to dummy input
* wip: commit progress on useClipboard test
* wip: commit more test progress
* refactor: rewrite code for clarity
* chore: finish clipboard tests
* fix: prevent double-firing for button click aliases
* refactor: clean up test setup
* fix: rename incorrect test file
* refactor: update code to display user errors
* refactor: redesign useClipboard to be easier to test
* refactor: clean up GlobalSnackbar
* feat: add functionality for notifying user of errors (with tests)
* refactor: clean up test code
* refactor: centralize cleanup steps
Beginnings of a solution to #12297
Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.
For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.
```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
preferred DERP 10013 (Europe Fly.io (Paris))
endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```
* fix(coderd): mark provisioner daemon psk as secret
Marks provisioner daemon PSK with the secret annotation.
This ensures it will be scrubbed from API requests to
/api/v2/deployment/config.
* make gen
When viewing the Authentication page, the diagram showing the flow is a useful
resource for understanding the rest of the page.
Rather than linking to a specific version of the SVG, inline it as part of the
documentation.
* chore: drop github per user rate limit tracking
Rate limits for authenticated requests are per user.
This would be an excessive number of prometheus labels,
so we only track the unauthorized limit.
Moves healthcheck report-related structs from coderd/healthcheck to codersdk
This prevents an import cycle when adding a codersdk.Client method to hit /api/v2/debug/health.
Changes the agent to use the new v2 API for sending logs, via the logSender component.
We keep the PatchLogs function around, but deprecate it so that we can test the v1 endpoint.
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.
This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
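One way to picture the two contexts, as a sketch only: the type and method names below are hypothetical and the real routine manager does more. The graceful context is derived from the hard one, so cancelling the hard context always stops everything.

```go
package main

import (
	"context"
	"fmt"
)

// shutdownContexts holds one context that is canceled as soon as graceful
// shutdown begins and another that stays alive until shutdown completes.
type shutdownContexts struct {
	graceful     context.Context
	stopGraceful context.CancelFunc
	hard         context.Context
	stopHard     context.CancelFunc
}

func newShutdownContexts() *shutdownContexts {
	hard, stopHard := context.WithCancel(context.Background())
	graceful, stopGraceful := context.WithCancel(hard)
	return &shutdownContexts{
		graceful:     graceful,
		stopGraceful: stopGraceful,
		hard:         hard,
		stopHard:     stopHard,
	}
}

// beginShutdown stops routines that should end right away.
func (s *shutdownContexts) beginShutdown() { s.stopGraceful() }

// finish stops the remaining routines, e.g. one still flushing logs.
func (s *shutdownContexts) finish() {
	s.stopGraceful()
	s.stopHard()
}

func (s *shutdownContexts) gracefulDone() bool { return s.graceful.Err() != nil }
func (s *shutdownContexts) hardDone() bool     { return s.hard.Err() != nil }

func main() {
	s := newShutdownContexts()
	s.beginShutdown()
	fmt.Println("graceful canceled:", s.gracefulDone(), "hard canceled:", s.hardDone())
	s.finish()
	fmt.Println("graceful canceled:", s.gracefulDone(), "hard canceled:", s.hardDone())
}
```

A routine like the log sender would run on the hard context, so it can keep working between `beginShutdown` and `finish`.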
Alternative solution to #6442
Modifies the behaviour of AcquireProvisionerJob and adds a special case for 'un-tagged' jobs such that they can only be picked up by 'un-tagged' provisioners.
Also adds comprehensive test coverage for AcquireJob given various combinations of tags.
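The matching rule can be sketched in Go, although the real logic lives in the AcquireProvisionerJob query. This sketch assumes that "un-tagged" means carrying only the default `scope`/`owner` tags and that tagged jobs need their tags to be a subset of the provisioner's; the function names are hypothetical.

```go
package main

import "fmt"

// untagged is the assumed default tag set for jobs and provisioners that
// have no user-defined tags.
var untagged = map[string]string{"scope": "organization", "owner": ""}

// subset reports whether every key/value pair in sub is present in super.
func subset(sub, super map[string]string) bool {
	for k, v := range sub {
		if sv, ok := super[k]; !ok || sv != v {
			return false
		}
	}
	return true
}

// isUntagged reports whether tags equal the default tag set exactly.
func isUntagged(tags map[string]string) bool {
	return len(tags) == len(untagged) && subset(tags, untagged)
}

// canAcquire applies the special case: an un-tagged job may only be picked up
// by an un-tagged provisioner. Any other job requires its tags to be a
// subset of the provisioner's tags.
func canAcquire(provisionerTags, jobTags map[string]string) bool {
	if isUntagged(jobTags) {
		return isUntagged(provisionerTags)
	}
	return subset(jobTags, provisionerTags)
}

func main() {
	prod := map[string]string{"scope": "organization", "owner": "", "env": "prod"}
	fmt.Println(canAcquire(untagged, untagged))
	fmt.Println(canAcquire(prod, untagged))
	fmt.Println(canAcquire(prod, prod))
	fmt.Println(canAcquire(untagged, prod))
}
```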
Fixes race seen here: https://github.com/coder/coder/runs/21852483781
What happens is that the agent connects, completes the test, and then disconnects before the Eventually condition runs. The waiter then times out because it's looking for a connected agent.
Then, since it's a `require` in a goroutine, that causes the `tGo` cleanup to hang and the whole test suite to timeout after 10 minutes.
Anyway, `agenttest.New` doesn't block, and we don't actually need to wait for the agent to connect, since a successful SSH session is evidence that it connected.
* feat: convertGroups() no longer requires organization info
Removing role information from some users in the api. This info is
excessive and not required. It is costly to always include
* fix: ignore surrounding whitespace for cli config
Cli config files break if you edit them manually with any editor.
Editors add a newline at the end, and we should not break on this.
If a developer manually edits a file, it should still work.
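As a minimal sketch, trimming with `strings.TrimSpace` when reading a config value is enough to tolerate the trailing newline. The helper name `parseConfigValue` and the sample URL are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// parseConfigValue strips surrounding whitespace, including the trailing
// newline most editors append when a file is saved.
func parseConfigValue(raw []byte) string {
	return strings.TrimSpace(string(raw))
}

func main() {
	fmt.Printf("%q\n", parseConfigValue([]byte("https://coder.example.com\n")))
}
```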
* fix: move oauth2 routes
From /login/oauth2/* to /oauth2/*.
/login/oauth2 causes /login to no longer get served by the frontend,
even if nothing is actually served on /login itself.
* Add forgotten comment on delete
* feat: disable directory listings for static files
Static file server handles serving static asset files (js, css, etc).
The default file server would also list all files in a directory.
This has been changed to only serve files.
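One common way to get this behavior with the standard library is to wrap the `http.FileSystem` so that directories look like missing files. This is a sketch rather than the code in this change: `onlyFilesFS`, `demoFS`, and the helpers are hypothetical, and the wrapper as written also hides any `index.html` inside a directory.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"testing/fstest"
)

// onlyFilesFS wraps an http.FileSystem and refuses to open directories, so
// http.FileServer answers 404 instead of rendering a listing.
type onlyFilesFS struct {
	fs http.FileSystem
}

func (o onlyFilesFS) Open(name string) (http.File, error) {
	f, err := o.fs.Open(name)
	if err != nil {
		return nil, err
	}
	stat, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	if stat.IsDir() {
		f.Close()
		return nil, os.ErrNotExist
	}
	return f, nil
}

// demoFS is an in-memory file system with a single static asset.
var demoFS = http.FS(fstest.MapFS{
	"assets/app.js": &fstest.MapFile{Data: []byte("console.log('hi')")},
})

// status serves one GET request from fsys and returns the response code.
func status(fsys http.FileSystem, path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	http.FileServer(fsys).ServeHTTP(rec, req)
	return rec.Code
}

func statusDefault(path string) int   { return status(demoFS, path) }
func statusFilesOnly(path string) int { return status(onlyFilesFS{fs: demoFS}, path) }

func main() {
	for _, p := range []string{"/assets/app.js", "/assets/"} {
		fmt.Printf("%s default=%d files-only=%d\n", p, statusDefault(p), statusFilesOnly(p))
	}
}
```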
* chore: add database test fixture to insert non-unique linked_ids
* chore: create unit test to exercise failed email change bug
* fix: add postgres triggers to keep user_links clear of deleted users
* Add migrations to prevent deleted users with links
* Force soft delete of users, do not allow un-delete
* refactor: clean up tests for debounce
* refactor: clean up tests for useCustomEvent
* refactor: clean up events file
* refactor: clean up tests for hookPolyfills
- These CSS changes were for making sure there weren't layout shifts
when using the non-secure clipboard fallback, which could cause janky
UI flickers. It seems to be breaking things for some users on HTTP-only
connections, though.
This PR removes the prometheus-http port entirely from the coder service specification (originally added in #10448). It also removes the Helm value coder.service.prometheusNodePort.
Rationale: some cloud providers will helpfully expose all ports on a LoadBalancer service for you. The net effect of this is that setting CODER_PROMETHEUS_ENABLE will end up exposing port 2112 on your coderd service to the internet, which is likely undesired behaviour.
The agent is extended with a `--script-data-dir` flag, defaulting to the
OS temp dir. This dir is used for storing `coder-script-data/bin` and
`coder-script/[script uuid]`. The former is a place for all scripts to
place executable binaries that will be available by other scripts, SSH
sessions, etc. The latter is a place for the script to store files.
Since we default to OS temp dir, files are ephemeral by default. In the
future, we may consider adding new env vars or changing the default
storage location. Workspace startup speed could potentially benefit from
scripts being able to skip steps that require downloading software. We
may also extend this with more env variables (e.g. persistent storage in
HOME).
Fixes #11131
This commit refactors where custom environment variables are set in the
workspace and decouples agent specific configs from the `agentssh.Server`.
To reproduce all functionality, `agentssh.Config` is introduced.
The custom environment variables are now configured in `agent/agent.go`
and the agent retains control of the final state. This will allow for
easier extension in the future and keep other modules decoupled.
* fix: assign new oauth users to default org
This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
- prevent importing from the "monolith" lodash module. individual modules are better for tree shaking.
- prevent importing `useTheme` and types from @mui/material/styles. prefer importing from @emotion/react.
* fix: assign new oauth users to default org
This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
The first organization created is now marked as "default". This is
to allow "single org" behavior as we move to a multi org codebase.
It is intentional that the user cannot change the default org at this
stage. Only 1 default org can exist, and it is always the first org.
Closes: https://github.com/coder/coder/issues/11961
Adds a new subcomponent of the agent for queueing up logs until they can be sent over the Agent API.
Subsequent PR will change the agent to use this instead of the HTTP API for posting logs.
Relates to #10534
Fixes #12141
Fixes #11750
PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.
This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
I noticed a possible race where tailnet.Conn can try to dial the embedded region before we've set our custom dialer that sends the DERP in-memory. This closes that race and adds a test case for servertailnet with no STUN and an embedded relay.
I think this will resolve #12136, but let's get a proper test at the system level before closing.
Before this change, we only register the node callback at start of day for the server tailnet. If the coordinator changes, like we know happens when we are licensed for the PGCoordinator, we close the connection to the old coord, and open a new one to the new coord.
The callback is designed to direct the updates to the new coordinator, but there is nothing that specifically triggers it to fire after we connect to the new coordinator.
If we have STUN, then periodic re-STUNs will generally get it to fire eventually, but without STUN we could go indefinitely without a callback.
This PR changes the servertailnet to re-register the callback each time we reconnect to the coordinator. Registering a callback (even if it's the same callback) triggers an immediate call with our node information, so the new coordinator will have it.
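As a rough sketch of why re-registering helps, assume a connection type whose callback setter fires immediately with the current node. All names here (`nodeSource`, `coordinatorConn`, `reconnect`) are hypothetical and only model the behavior described above.

```go
package main

import "fmt"

// node is a hypothetical stand-in for our local node information.
type node struct{ PreferredDERP int }

// nodeSource mimics a connection whose SetNodeCallback invokes the callback
// immediately with the current node.
type nodeSource struct {
	current  node
	callback func(node)
}

func (s *nodeSource) SetNodeCallback(cb func(node)) {
	s.callback = cb
	if cb != nil {
		cb(s.current)
	}
}

// coordinatorConn records the updates a single coordinator connection receives.
type coordinatorConn struct{ updates []node }

// reconnect re-registers the callback so the new coordinator hears about us
// right away instead of waiting for the next change to our node.
func reconnect(s *nodeSource) *coordinatorConn {
	c := &coordinatorConn{}
	s.SetNodeCallback(func(n node) { c.updates = append(c.updates, n) })
	return c
}

func main() {
	src := &nodeSource{current: node{PreferredDERP: 1}}
	first := reconnect(src)
	second := reconnect(src) // e.g. after the coordinator was swapped
	fmt.Println(len(first.updates), len(second.updates))
}
```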
Adds some debug endpoints for looking into the DERP server.
The `api/v2/debug/derp/traffic` endpoint requires the `ss` utility to be present in order to function. I have *not* added the `iproute2` package to our base image as it adds 11MB, so this endpoint won't be useful by default. However, in a debugging situation, we could exec into the container and then `apk add iproute2`, or build a special debug image.
The `api/v2/debug/expvar` handler contains DERP metrics as well as commandline and memstats.
Example:
```
{
"alert_failed": 0,
"alert_generated": 0,
"cmdline": ["/Users/spike/repos/coder/build/coder_darwin_arm64","--global-config","/Users/spike/repos/coder/.coderv2","server","--http-address","0.0.0.0:3000","--swagger-enable","--access-url","http://127.0.0.1:3000","--dangerous-allow-cors-requests=true"],
"derp": {"accepts": 1, "average_queue_duration_ms": 0, "bytes_received": 0, "bytes_sent": 0, "counter_packets_dropped_reason": {"gone_disconnected": 0, "gone_not_here": 0, "queue_head": 0, "queue_tail": 0, "unknown_dest": 0, "unknown_dest_on_fwd": 0, "write_error": 0}, "counter_packets_dropped_type": {"disco": 0, "other": 0}, "counter_packets_received_kind": {"disco": 0, "other": 0}, "counter_tcp_rtt": {}, "counter_total_dup_client_conns": 0, "gauge_clients_local": 1, "gauge_clients_remote": 0, "gauge_clients_total": 1, "gauge_current_connections": 1, "gauge_current_dup_client_conns": 0, "gauge_current_dup_client_keys": 0, "gauge_current_file_descriptors": 0, "gauge_current_home_connections": 1, "gauge_memstats_sys0": 20874504, "gauge_watchers": 0, "got_ping": 0, "home_moves_in": 0, "home_moves_out": 0, "multiforwarder_created": 0, "multiforwarder_deleted": 0, "packet_forwarder_delete_other_value": 0, "packets_dropped": 0, "packets_forwarded_in": 0, "packets_forwarded_out": 0, "packets_received": 0, "packets_sent": 0, "peer_gone_disconnected_frames": 0, "peer_gone_not_here_frames": 0, "sent_pong": 0, "unknown_frames": 0, "version": "1.47.0-dev20240214-t64db8c604"},
"memstats": {"Alloc":286506256,"TotalAlloc":297594632,"Sys":310621512,"Lookups":0,"Mallocs":304204,"Frees":171570,"HeapAlloc":286506256,"HeapSys":294060032,"HeapIdle":3694592,"HeapInuse":290365440,"HeapReleased":3620864,"HeapObjects":132634,"StackInuse":3735552,"StackSys":3735552,"MSpanInuse":347256,"MSpanSys":358512,"MCacheInuse":9600,"MCacheSys":15600,"BuckHashSys":1469877,"GCSys":9434896,"OtherSys":1547043,"NextGC":551867656,"LastGC":1707892877408883000,"PauseTotalNs":1247000,"PauseNs":[200333,229375,239875,209542,106958,203792,57125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"PauseEnd":[1707892876217481000,1707892876219726000,1707892876222273000,1707892876226151000,1707892876234815000,1707892877398146000,1707892877408883000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"NumGC":7,"NumForcedGC":0,"GCCPUFraction":0.0022425810335762954,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":14396,"Frees":9143},{"Size":16,"Mallocs":89090,"Frees":50507},{"Size":24,"Mallocs":40839,"Frees":24456},{"Size":32,"Mallocs":22404,"Frees":12379},{"
Size":48,"Mallocs":51174,"Frees":23718},{"Size":64,"Mallocs":15406,"Frees":3501},{"Size":80,"Mallocs":6688,"Frees":2352},{"Size":96,"Mallocs":2567,"Frees":374},{"Size":112,"Mallocs":19371,"Frees":16883},{"Size":128,"Mallocs":2873,"Frees":1061},{"Size":144,"Mallocs":5600,"Frees":2742},{"Size":160,"Mallocs":2159,"Frees":622},{"Size":176,"Mallocs":454,"Frees":86},{"Size":192,"Mallocs":227,"Frees":128},{"Size":208,"Mallocs":1407,"Frees":732},{"Size":224,"Mallocs":1365,"Frees":1090},{"Size":240,"Mallocs":82,"Frees":48},{"Size":256,"Mallocs":310,"Frees":162},{"Size":288,"Mallocs":1945,"Frees":562},{"Size":320,"Mallocs":1200,"Frees":458},{"Size":352,"Mallocs":133,"Frees":33},{"Size":384,"Mallocs":582,"Frees":51},{"Size":416,"Mallocs":747,"Frees":200},{"Size":448,"Mallocs":113,"Frees":22},{"Size":480,"Mallocs":34,"Frees":21},{"Size":512,"Mallocs":951,"Frees":91},{"Size":576,"Mallocs":364,"Frees":122},{"Size":640,"Mallocs":532,"Frees":270},{"Size":704,"Mallocs":93,"Frees":39},{"Size":768,"Mallocs":83,"Frees":35},{"Size":896,"Mallocs":308,"Frees":175},{"Size":1024,"Mallocs":226,"Frees":122},{"Size":1152,"Mallocs":198,"Frees":100},{"Size":1280,"Mallocs":314,"Frees":171},{"Size":1408,"Mallocs":77,"Frees":47},{"Size":1536,"Mallocs":80,"Frees":54},{"Size":1792,"Mallocs":199,"Frees":107},{"Size":2048,"Mallocs":112,"Frees":48},{"Size":2304,"Mallocs":71,"Frees":32},{"Size":2688,"Mallocs":206,"Frees":81},{"Size":3072,"Mallocs":39,"Frees":15},{"Size":3200,"Mallocs":16,"Frees":7},{"Size":3456,"Mallocs":44,"Frees":29},{"Size":4096,"Mallocs":192,"Frees":83},{"Size":4864,"Mallocs":44,"Frees":25},{"Size":5376,"Mallocs":105,"Frees":43},{"Size":6144,"Mallocs":25,"Frees":5},{"Size":6528,"Mallocs":22,"Frees":7},{"Size":6784,"Mallocs":3,"Frees":0},{"Size":6912,"Mallocs":4,"Frees":2},{"Size":8192,"Mallocs":59,"Frees":10},{"Size":9472,"Mallocs":31,"Frees":12},{"Size":9728,"Mallocs":5,"Frees":2},{"Size":10240,"Mallocs":5,"Frees":0},{"Size":10880,"Mallocs":27,"Frees":11},{"Size":12288,"Mallocs":4,"
Frees":1},{"Size":13568,"Mallocs":4,"Frees":2},{"Size":14336,"Mallocs":9,"Frees":2},{"Size":16384,"Mallocs":10,"Frees":2},{"Size":18432,"Mallocs":4,"Frees":2}]},
"warning_failed": 0,
"warning_generated": 0
}
```
If we find the DERP metrics useful we could consider how to include them in Prometheus scrapes based on the tailnet `varz` package. That's for a later PR if at all.
Adds documentation on port requirements and a short overview of STUN with some example scenarios.
Co-authored-by: Dean Sheather <dean@deansheather.com>
Co-authored-by: Spike Curtis <spike@coder.com>
When we exceed the db-imposed limit of logs, we need to communicate that back to the agent. In v1 we did it with a 4xx-level HTTP status, but with dRPC, the errors are delivered as strings, which feels fragile to me for something we want to gracefully handle.
So, this PR adds the log limit exceeded as a field on the response message, and fixes the API handler to set it as appropriate instead of an error.
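A minimal sketch of the idea, assuming a response type with a boolean field. The type, field name, and the tiny byte limit are hypothetical and are not the actual protobuf definition.

```go
package main

import "fmt"

// batchCreateLogsResponse is a hypothetical response message.
type batchCreateLogsResponse struct {
	LogLimitExceeded bool
}

// maxLogBytes is a deliberately tiny, made-up limit for the example.
const maxLogBytes = 16

type logStore struct{ used int }

// batchCreateLogs reports overflow in the response rather than as an error,
// so the caller does not have to parse an error string.
func (s *logStore) batchCreateLogs(lines []string) (*batchCreateLogsResponse, error) {
	for _, l := range lines {
		if s.used+len(l) > maxLogBytes {
			return &batchCreateLogsResponse{LogLimitExceeded: true}, nil
		}
		s.used += len(l)
	}
	return &batchCreateLogsResponse{}, nil
}

func main() {
	s := &logStore{}
	for _, batch := range [][]string{{"hello"}, {"a much longer log line"}} {
		resp, err := s.batchCreateLogs(batch)
		if err != nil {
			fmt.Println("transport error:", err)
			return
		}
		if resp.LogLimitExceeded {
			fmt.Println("log limit exceeded; stop sending logs")
			return
		}
	}
}
```

The caller can then branch on a typed field and reserve `err` for genuine transport or server failures.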
* feat(provisioner): relax max terraform version constraint
* feat!(scripts/Dockerfile.base): update bundled terraform to 1.6.x
* bump terraform version in Dogfood image
* fix over-zealous rename
Fixes #12030
This is a good example of the kind of thing I'd like to address with a time-testing lib. The problem is that there is a race between the watchdog starting its timer and the test incrementing the time. What would make this easier is if the time-testing library could wait for and assert the call to start the timer before incrementing the time.
Adds a watchdog to our pubsub and runs it for Coder server.
If the watchdog times out, it triggers a graceful exit in `coder server` to give any provisioner jobs a chance to shut down.
c.f. #11950
I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one.
What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.
So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.
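One way to sketch the two-context split is shown below. It is an illustration only: `gracefulContext`, `sendDisconnect`, and `demo` are hypothetical names, and `context.WithoutCancel` requires Go 1.21 or newer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// gracefulContext returns a context that outlives parent's cancellation by up
// to grace. It keeps parent's values but not its cancellation.
func gracefulContext(parent context.Context, grace time.Duration) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.WithoutCancel(parent))
	go func() {
		select {
		case <-parent.Done():
			// Parent is done: allow the grace period, then cancel.
			select {
			case <-time.After(grace):
			case <-ctx.Done():
			}
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}

// sendDisconnect is a hypothetical RPC that fails if its context is canceled.
func sendDisconnect(ctx context.Context) error {
	return ctx.Err()
}

// demo cancels the main context and then attempts the disconnect on both.
func demo() (mainErr, gracefulErr error) {
	mainCtx, cancelMain := context.WithCancel(context.Background())
	graceCtx, cancelGrace := gracefulContext(mainCtx, time.Second)
	defer cancelGrace()

	cancelMain() // shutdown begins
	return sendDisconnect(mainCtx), sendDisconnect(graceCtx)
}

func main() {
	mainErr, gracefulErr := demo()
	fmt.Println("main context:", mainErr)
	fmt.Println("graceful context:", gracefulErr)
}
```

The disconnect message goes out on the graceful context, which is still live for a short window after the main context is canceled.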
* docs: update remote docker host docs
Adds a link to external provisioners as a method to use remote docker hosts
* `make fmt`
* Update docker.md
* fmt
Annoyingly, prometheus Registry collects metrics async, which is causing our test to be racy. They also don't export enough from the Metric interface for us to replicate a synchronous collect, so we have to use Eventually to test.
Adds prometheus metrics to PGPubsub for monitoring its health and performance in production.
Related to #11950 --- additional diagnostics to help figure out what's happening
Since we run yamux over the websocket, we don't need to ping at the websocket layer because yamux has a 30 second keepalive mechanism enabled in the default config.
The RPC() function isn't called, since Listen() was modified to do this job.
Listen() has the right signature, since it returns a drpc.Conn, rather than the Agent API. That's because tailnet v2 and agent v2 are separate APIs served over the same connection.
It might be clearer to rename `Listen()` to `RPC()` but I'll save that for a different PR.
Adds a new statsReporter subcomponent of the agent, which in a later PR will be used to report stats over the v2 API.
Refactors the logic a bit so that we can handle starting and stopping stats reporting if the agent API connection drops and reconnects.
Moves monitoring of the agent v2 API connection to the yamux layer.
Present behavior monitors this at the websocket layer, and closes the websocket on completion. This can cause yamux to hit unexpected errors since the connection is closed underneath it.
This might be the cause of yamux errors that some customers are seeing.

In any case, it's more graceful to close yamux first and let yamux close the underlying websocket. That should limit yamux error logging to truly unexpected/error cases.
The only downside is that the yamux `Close()` doesn't accept a reason, so if the agent becomes outdated and we close the API connection, the agent just sees the connection close without a reason. I'm not sure we log this at the agent anyway, but it would be nice. I think more accurate logging on Coderd is more important.
I've also added some logging when the monitor disconnects for reasons other than the context being canceled (e.g. agent outdated, failed pings).
* wip: commit progress for clipboard update
* wip: push more progress
* chore: finish initial version of useClipboard revamp
* refactor: update API query to use newer RQ patterns
* fix: update importers of useClipboard
* fix: increase clickable area of CodeExample
* fix: update styles for CliAuthPageView
* fix: resolve issue with ref re-routing
* docs: update comments for clarity
* wip: commit progress on clipboard tests
* chore: add extra test case for referential stability
* wip: disable test stub to avoid breaking CI
* wip: add test case for tab-switching
* feat: finish changes
* fix: improve styling for strong text
* fix: make sure period doesn't break onto separate line
* fix: make center styling more friendly to screen readers
* refactor: clean up mocking implementation
* fix: resolve security concern for clipboard text
* fix: update CodeExample to obscure text when appropriate
* fix: apply secret changes to relevant code examples
* refactor: simplify code for obfuscating text
* fix: partially revert clipboard changes
* fix: clean up page styling further
* fix: remove duplicate property identifier
* refactor: rename variables for clarity
* fix: simplify/revert CopyButton component design
* fix: update how dummy input is hidden from page
* fix: remove unused onClick handler prop
* fix: resolve unused import
* fix: opt code examples out of secret behavior
* fix: strip timezone information from a date in dau response
Timezone information is lost, so do not forward it to the client.
* fix: timezone offset should be flipped
* Make tests deterministic
This PR solves #10478 by auto-filling previously used template values in create and update workspace flows.
I decided against explicit user values in settings for these reasons:
* Autofill is far easier to implement
* Users benefit from autofill _by default_ — we don't need to teach them new concepts
* If we decide that autofill creates more harm than good, we can remove it without breaking compatibility
Fixes an issue where a MultiAgentConn isn't closed properly when the coordinator it is connected to is closed.
Since servertailnet checks whether the conn is closed before reinitializing, it is important that we check this, otherwise servertailnet can get stuck if the coordinator closes (e.g. when we switch from AGPL to PGCoordinator after decoding a license).
We're failing tests on error logs like this: https://github.com/coder/coder/actions/runs/7706053882/job/21000984583
Unfortunately, the error we hit, when the underlying connection is closed, is unexported, so we can't specifically ignore it.
Part of the issue is that agent.Close() doesn't wait for these goroutines to complete before returning, so the test harness proceeds to close the connection. This looks to our product code like the network connection failing. It would be possible to fix this, but just doesn't seem worth it for the extra insurance of catching other error logs in these tests.
Adds logging to yamux when used for tailnet client connections, e.g. CLI and wsproxy. This could be useful for debugging connection issues with tailnet v2 API.
`agentsdk` depends on `agent/proto` because it needs to get the version to dial.
Therefore, the conversion routines need to live in `agentsdk` so that we can convert to and from the Manifest.
I briefly considered refactoring the agent to only reference `proto.Manifest`, but decided against it because we might have multiple protocol versions in the future, and it's useful to have a protocol-independent data structure.
Fixes #8218
Removes `wsconncache` and related "is legacy?" functions and API calls that were used by it.
The only leftover is that Agents still use the legacy IP, so that back level clients or workspace proxies can dial them correctly.
We should eventually remove this: #11819
This PR updates the Agent API to use the appearance.Fetcher, which is set by entitlement code in Enterprise coderd.
This brings the agentapi into compliance with the Enterprise feature.
The new Agent API needs an interface for ServiceBanners, so this PR creates it and refactors the AGPL and Enterprise code to achieve it.
Before we depended on the fact that the HTTP endpoint was missing to serve an empty ServiceBanner on AGPL deployments, but that won't work with dRPC, so we need a real interface to call.
The original test is bugged in that it
1. creates a new AGPL coderd with a new database, so no appearance is set in the DB.
2. overwrites the agentClient so the assertion after removing the license is against the AGPL coderd
fixes #10533
refactors `codersdk` workspace agent dialer to use a single websocket connection to the tailnet v2 API for both coordination and DERPMap updates, rather than separate websockets (and the v1 API for DERPMaps).
#7439 added global caching of RBAC results.
Calls are cached based on hash(subject, object, action).
We often use dbauthz.AsSystemRestricted to handle "internal" authz calls, and these are often repeated with similar arguments and are likely to get cached.
So a transient error doing an authz check on a system function will be cached for up to a minute.
I'm just starting off with excluding context.Canceled but there's likely a whole suite of different errors we want to also exclude from the global cache.
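A minimal sketch of the exclusion, assuming a cache keyed by a precomputed string for hash(subject, object, action). The `authzCache` type and its methods are hypothetical, the cache is not safe for concurrent use, and expiry is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// authzCache is a hypothetical result cache keyed by
// hash(subject, object, action), represented here as a plain string.
type authzCache struct {
	results map[string]error
}

func newAuthzCache() *authzCache {
	return &authzCache{results: map[string]error{}}
}

// authorize returns a cached decision when there is one. Otherwise it runs
// check and caches the outcome, unless the error is context.Canceled, which
// says nothing about whether the subject is actually allowed.
func (c *authzCache) authorize(ctx context.Context, key string, check func(context.Context) error) error {
	if err, ok := c.results[key]; ok {
		return err
	}
	err := check(ctx)
	if !errors.Is(err, context.Canceled) {
		c.results[key] = err
	}
	return err
}

// demo runs the same check twice: once with a canceled context, once without.
func demo() (first, second error) {
	cache := newAuthzCache()
	canceled, cancel := context.WithCancel(context.Background())
	cancel()
	check := func(ctx context.Context) error { return ctx.Err() }

	first = cache.authorize(canceled, "system:read:workspace", check)
	second = cache.authorize(context.Background(), "system:read:workspace", check)
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first:", first)
	fmt.Println("second:", second)
}
```

Without the `errors.Is` guard, the second call would return the cached cancellation error instead of re-running the check.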
Ok, so my last attempt at a fix here failed
https://github.com/coder/coder/actions/runs/7666229961/job/20893608286
I have a new theory: it's not the `terraform` binary that's busy, it's actually `fake_cancel.sh` and it gets marked busy when we `exec` it from the script we write.
Use of `exec` also replaces the executing code in place, rather than starting a new process/shell, so that's why the error we get says `terraform` is busy.
* docs: use coder modules in offline deployments
* fix typos
* Update offline installation instructions with Artifactory support for Coder modules
* Review suggestions
- Adds column `favorite` to workspaces table
- Adds API endpoints to favorite/unfavorite workspaces
- Modifies sorting order to return owners' favorite workspaces first
Fixes 2 related issues:
1. wsconncache had incorrect logic to test whether to send DERPMap updates, sending if the maps were equivalent, instead of if they were _not equivalent_.
2. configmaps used a bugged check to test equality between DERPMaps, since it contains a map and the map entries are serialized in random order. Instead, we avoid comparing the protobufs and instead depend on the existing function that compares `tailcfg.DERPMap`. This also has the effect of reducing the number of times we convert to and from protobuf.
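Both fixes come down to "send only when the maps differ, and compare structurally". A simplified sketch, using a hypothetical `derpMap` type and `reflect.DeepEqual` in place of the existing comparison function mentioned above:

```go
package main

import (
	"fmt"
	"reflect"
)

// derpMap is a simplified, hypothetical stand-in for tailcfg.DERPMap.
type derpMap struct {
	Regions map[int]string // region ID -> hostname
}

// shouldSendUpdate reports whether next differs from last. The comparison is
// structural, so the order in which map entries were inserted or serialized
// does not matter.
func shouldSendUpdate(last, next *derpMap) bool {
	return !reflect.DeepEqual(last, next)
}

func main() {
	a := &derpMap{Regions: map[int]string{1: "nyc", 2: "sfo"}}
	b := &derpMap{Regions: map[int]string{2: "sfo", 1: "nyc"}}
	c := &derpMap{Regions: map[int]string{1: "nyc"}}
	fmt.Println(shouldSendUpdate(a, b), shouldSendUpdate(a, c))
}
```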
fixes #10531
Adds a check for `version` on connection to the Agent API websocket endpoint. This is primarily for future-proofing, so that up-level agents get a sensible error if they connect to a back-level Coderd.
It also refactors the location of the `CurrentVersion` variables, to be part of the `proto` packages, since the versions refer to the APIs defined therein.
Adds support to `ServerTailnet` to set all peers lost before attempting to reconnect to the coordinator. In practice, this only really affects `wsproxy` since coderd has a local connection to the coordinator that only goes down if we're shutting down or change licenses.
These will show up when configuring the application along with the
client ID and everything else. Should make it easier to configure the
application, otherwise you will have to go look up the URLs in the
docs (which are not yet written).
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Rather than passing all the deployment values. This is to make it
easier to generate API keys as part of the oauth flow.
I also added and fixed a test for when the lifetime is set and the
default and expiration are unset.
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Use TSMP ping for reachability, but leave Disco ping for when we call Ping() since we often use that to determine whether we have a direct connection.
Also adds unit tests to make sure Ping() returns direct connection vs DERP correctly.
Adds support to Coordination to call SetAllPeersLost() when it is closed. This ensures that when we disconnect from a Coordinator, we set all peers lost.
This covers CoderSDK (CLI client) and Agent. Next PR will cover MultiAgent (notably, `wsproxy`).
We don't have visibility into some feature usage, so this adds a lot of fields missing from `database.Template` to `telemetry.Template`. Deprecation message is not collected, just whether it's set or not.
adds setAllPeersLost to the configMaps subcomponent of tailnet.Conn --- we'll call this when we disconnect from a coordinator so we'll eventually clean up peers if they disconnect while we are retrying the coordinator connection (or we don't succeed in reconnecting to the coordinator).
This one is huge, and I'm sorry.
The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.
There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
Otherwise if for example you try to run `yarn storybook` it complains
that the version of Node is wrong.
`pnpm storybook` works fine and that is probably what we should
actually use, but as long as we are installing Yarn and not restricting
its use we might as well make it use the right version of Node.
* fix: doing a noop patch to templates resulted in 404
The patch response did not include the template. The UI required the
template to be returned to form the new page path
null is more explicit, and harder to make occur by mistake.
* fix: allow ports in wildcard url configuration
This just forwards the port to the ui that generates urls.
Our existing parsing + regex already supported ports for
subdomain app requests.
Fixes flake seen here, I think
https://github.com/coder/coder/actions/runs/7565915337/job/20602500818
golang's file processing is complex, and in at least some cases it can return from a file.Close() call without having actually closed the file descriptor.
If we're holding open the file descriptor of an executable we just wrote, and try to execute it, it will fail with "text file busy" which is what we have seen.
So, to be extra sure, I've avoided the standard library and directly called the syscalls to open, write, and close the file we intend to use in the test.
I've also added some more logging so if it's some issue of multiple tests writing to the same location, then we might have a chance to see it.
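For illustration, writing a file with direct syscalls might look like the sketch below. It is Unix-only, the helper names (`writeExecutable`, `roundTrip`) are hypothetical, and it is not the code from the test itself.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// writeExecutable writes data to path using raw open/write/close syscalls, so
// the file descriptor is closed by the time the function returns.
func writeExecutable(path string, data []byte) error {
	fd, err := syscall.Open(path, syscall.O_WRONLY|syscall.O_CREAT|syscall.O_TRUNC, 0o755)
	if err != nil {
		return fmt.Errorf("open: %w", err)
	}
	for len(data) > 0 {
		n, err := syscall.Write(fd, data)
		if err != nil {
			_ = syscall.Close(fd)
			return fmt.Errorf("write: %w", err)
		}
		data = data[n:] // handle short writes
	}
	if err := syscall.Close(fd); err != nil {
		return fmt.Errorf("close: %w", err)
	}
	return nil
}

// roundTrip writes content to a temporary file and reads it back.
func roundTrip(content string) (string, error) {
	dir, err := os.MkdirTemp("", "syscall-write")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "fake.sh")
	if err := writeExecutable(path, []byte(content)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip("#!/bin/sh\nexit 0\n")
	fmt.Printf("%q %v\n", got, err)
}
```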
wsproxy also needs to be updated to use tailnet v2 because the `tailnet.Conn` stores peers by ID, and the peerID was not being carried by the JSON protocol. This adds a query param to the endpoint to conditionally switch to the new protocol.
Fixes a flake seen here: https://github.com/coder/coder/actions/runs/7541558190/job/20528545916
```
=== FAIL: enterprise/provisionerd TestRemoteConnector_Fuzz (0.06s)
t.go:84: 2024-01-16 12:32:27.024 [info] connector: failed provisioner authentication remote_addr=[::1]:45138 ...
error= failed to receive jobID:
github.com/coder/coder/v2/enterprise/provisionerd.(*remoteConnector).authenticate
/home/runner/actions-runner/_work/coder/coder/enterprise/provisionerd/remoteprovisioners.go:438
- bufio.Scanner: token too long
t.go:84: 2024-01-16 12:32:27.024 [debu] connector: closed connection remote_addr=[::1]:45138 error=<nil>
remoteprovisioners_test.go:209:
Error Trace: /home/runner/actions-runner/_work/coder/coder/enterprise/provisionerd/remoteprovisioners_test.go:209
Error: "2992256" is not less than "2097152"
Test: TestRemoteConnector_Fuzz
Messages: should not allow more than 1 MiB
```
This was an attempt to test that malicious actors can't abuse our authentication protocol to make us allocate a bunch of memory.
However, the test asserted on the number of bytes sent by the fuzzer, not the number of bytes read (& allocated) by the service. The former is affected by network queue sizes and is thus flaky without actively managing the socket queues, which I don't think we want to do.
In actual practise, the thing that matters is how much memory the bufio Scanner allocates. By inspection, the scanner will allocate up to 64k, and testing this is true devolves into testing the go standard library, which I don't think is worth doing.
So... let's just drop the assertion because
a) it's flaky,
b) it doesn't test what we actually want to test,
c) the behavior we actually care about is part of the standard library.
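For reference, the standard library behavior in question can be seen in a few lines. This sketch uses a default `bufio.Scanner` (no call to `Buffer`), which may differ from how the service configures its scanner; the helper names are made up for the example.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// scanFirstLine reads one newline-terminated token with a default Scanner.
func scanFirstLine(input string) (string, error) {
	s := bufio.NewScanner(strings.NewReader(input))
	if s.Scan() {
		return s.Text(), nil
	}
	return "", s.Err()
}

// tooLong reports whether scanning a single line of n bytes fails with
// bufio.ErrTooLong.
func tooLong(n int) bool {
	_, err := scanFirstLine(strings.Repeat("a", n) + "\n")
	return errors.Is(err, bufio.ErrTooLong)
}

func main() {
	fmt.Println("default token limit:", bufio.MaxScanTokenSize)
	fmt.Println("36-byte line too long:", tooLong(36))
	fmt.Println("1 MiB line too long:", tooLong(1<<20))
}
```

The scanner gives up with `bufio.ErrTooLong` once its buffer reaches the limit, regardless of how many bytes the sender has pushed into the socket.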
- Adds a new query BatchUpdateLastUsedAt
- Adds calls to BatchUpdateLastUsedAt in app stats handler upon flush
- Passes a stats flush channel to apptest setup scaffolding and updates unit tests to assert modifications to LastUsedAt.
CLI errors are pretty formatted. This handles nested pretty types. Before, it found the first error it could understand and returned that. Now it will print the full error stack with more information.
To prevent information loss, a "[Trace=...]" was added to capture some extra error context for debugging.
- refactors `getFormHelpers` to accept an options object
- adds a `maxLength` option which will display a message and character counter for fields with length limits
- set `maxLength` option for template description fields
Merging since Mark is out.
* chore: add optional coder_app to faq
* applied Atif's suggestions
* make fmt again
---------
Co-authored-by: kirby <kirby@coder.com>
Co-authored-by: Stephen Kirby <58410745+stirby@users.noreply.github.com>
The `SingleTailnet` behavior only checked to see if the `MultiAgent` was
closed, but the websocket error was not being propagated into the
`MultiAgent`, causing it to never be swapped for a new working one.
Fixes https://github.com/coder/coder/issues/11401
Before:
```
Coder Workspace Proxy v0.0.0-devel+85ff030 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001
View the Web UI: http://127.0.0.1:3001
==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:11:56.376 [warn] net.workspace-proxy.servertailnet: broadcast server node to agents ...
error= write message:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*remoteMultiAgentHandler).writeJSON
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:524
- failed to write msg: WebSocket closed: failed to read frame header: EOF
```
After:
```
Coder Workspace Proxy v0.0.0-devel+12f1878 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001
View the Web UI: http://127.0.0.1:3001
==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:26:38.545 [warn] net.workspace-proxy.servertailnet: multiagent closed, reinitializing
2024-01-04 20:26:38.546 [erro] net.workspace-proxy.servertailnet: reinit multi agent ...
error= dial coordinate websocket:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
- failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:38.587 [erro] net.workspace-proxy.servertailnet: reinit multi agent ...
error= dial coordinate websocket:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
- failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:40.446 [info] net.workspace-proxy.servertailnet: successfully reinitialized multiagent agents=0 took=1.900892615s
```
It causes the sign-in page to reload whenever a user enters a page or changes the window's focus. This happens because when the "user" fetch is made, the server returns an error, which makes react-query mark the data as stale and try to reload it whenever possible.
* Shows the overall report error at the top of the page, if present.
* Shows workspaceproxy errors above warnings inside the corresponding element, if present.
* Improves unregistered proxy status
Adds a nodeUpdater component, which serves a similar role to configMaps, but tracks information from tailscale going out to the coordinator as node updates. This first PR just handles netInfo, subsequent PRs will
handle DERP forced websockets, endpoints, and addresses.
* docs: remove cloud logos from 1-click install
They were looking good and are not adding much value.
* Delete docs/images/install/render.png
* Delete docs/images/install/ec2.svg
* Delete docs/images/install/eks.svg
* Delete docs/images/install/fly.io.svg
* Delete docs/images/install/gce.svg
* Delete docs/images/install/heroku.svg
* Delete docs/images/install/railway.svg
This test case fails with an error log, showing "context canceled" when trying to send an acquired job to an in-mem provisionerd.
https://github.com/coder/coder/runs/20331469006
In this case, we don't want to suppress this error, since it could mean that we acquired a job, locked it in the database, then failed to send it to a provisioner.
(We also don't want to mark the job as failed because we don't know whether the job made it to the provisionerd or not --- in the failed test you can see that the job is actually processed just fine).
The reason we got context canceled is because the API was shutting down --- we don't want provisionerdserver to abruptly stop processing job stuff as the API shuts down as this will leave jobs in a bad state. This PR fixes up the use of contexts with provisionerdserver and the associated drpc service calls.
Work in progress on a subcomponent of the Conn which will handle configuring the wireguard engine on changes. I've implemented setAddresses as the simplest case and added unit tests of the reconfiguration loop.
Besides making the code easier to test and understand, the goal is for this component to handle disconnect and loss updates about peers, and thereby, implement the v2 Tailnet API.
Further PRs will handle peer updates, status updates, and net info updates.
Then, after the subcomponent is implemented and tested, I will refactor Conn to use it instead of the current monolithic architecture.
It looks like we updated mockgen to use Uber's fork, but the flake still
pointed to a nixos-unstable commit containing the old mockgen resulting
in an error like:
missing go.sum entry for module providing package github.com/golang/mock/mockgen/model
Fixes #11451
A refactor of the Agent API passes metrics as protobufs, which include pointers to label name/value pairs. The aggregator tested for sameness by doing a shallow compare of label values, which for different stats reports would compare unequal because the pointers would be different.
This fix does a deep compare.
While testing I also noted that we neglect to compare template names. This is unlikely to have caused any issue in practice, since the combination of username/workspace is unique, but in the context of comparing metric labels we should do the comparison.
If a user creates a workspace, deletes it, then recreates from a different template, we could in principle have reported incorrect stats for the old template.
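The difference between the two comparisons can be sketched with a hypothetical `label` type standing in for the generated protobuf message; the helper names are illustrative only.

```go
package main

import "fmt"

// label is a hypothetical stand-in for a protobuf-generated label message.
type label struct{ Name, Value string }

// shallowEqual compares the pointers themselves, so two separately decoded
// reports never match even when their contents are identical.
func shallowEqual(a, b []*label) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// deepEqual compares the values the pointers refer to.
func deepEqual(a, b []*label) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] == nil || b[i] == nil {
			if a[i] != b[i] {
				return false
			}
			continue
		}
		if *a[i] != *b[i] {
			return false
		}
	}
	return true
}

func main() {
	first := []*label{{Name: "template_name", Value: "docker"}}
	second := []*label{{Name: "template_name", Value: "docker"}}
	fmt.Println(shallowEqual(first, second), deepEqual(first, second))
}
```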
Part of #10676
- Adds a health section for provisioner daemons (mostly cannibalized from the Workspace Proxy section)
- Adds a corresponding storybook entry for provisioner daemons health section
- Fixed an issue where dismissing the provisioner daemons warnings would result in a 500 error
- Adds provisioner daemon error codes to docs
* assert provisioner daemon version and api_version in unit tests
* add build info in HTTP header, extract codersdk.BuildVersionHeader
* add api_version to codersdk.ProvisionerDaemon
* testutil.MustString -> testutil.MustRandString
Refactors the code that handles monitoring an agent websocket with pings and updating the connection times in the DB.
Consolidates v1 and v2 agent APIs under the same code for this.
One substantive change (not _just_ a refactor) is that I've made it so that we actually disconnect if the agent fails to respond to our pings, rather than the old behavior where we would update the database, but not actually tear down the websocket.
We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998
I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.
The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.
I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.
This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.
fixes: #11294
Refactors our DRPC service definitions slightly.
In the previous version, I inserted the RPCs from the tailnet proto directly into the Agent service. This makes things hard to deal with because DRPC then generates a new set of methods with new interfaces with the `DRPCAgent_` prefix. Since you can't have a single method that takes different argument types, we couldn't reuse the implementation of those RPCs without a lot of extra classes and pass-thru methods.
Instead, the "right" way to do it is to integrate at the DRPC layer. So, we have two DRPC services available over the Agent websocket, and register them both on the DRPC `mux`.
Since the tailnet proto RPC service is now for both clients and agents, I renamed some things to clarify and shorten.
This PR also removes the `TailnetAPI` implementation from the `agentapi` package, and the next PR in the stack replaces it with the implementation from the `tailnet` package.
* Add database tables for OAuth2 applications
These are applications that will be able to use OAuth2 to get an API key
from Coder.
* Add endpoints for managing OAuth2 applications
These let you add, update, and remove OAuth2 applications.
* Add frontend for managing OAuth2 applications
* feat: enable csrf token header
* Exempt external auth requests
* ensure dev server bypasses CSRF
* external auth is just get requests
* Add some more routes
* Extra assurance nothing breaks
* chore: fix flake, use time closer to actual test
The tests were queued, and the autostart time was being set
to the time the table was created, not when the test was actually
being run. This diff was causing failures in CI
* Adds UpdateProvisionerDaemonLastSeenAt
* Adds heartbeat to provisioner daemons
* Inserts provisioner daemons to database upon start
* Ensures TagOwner is an empty string and not nil
* Adds COALESCE() in idx_provisioner_daemons_name_owner_key
* chore: add unit test to exercise flake
* Implement a *fix for cron stop() before run()
This fix still has a race condition. I do not see a clean solution
without modifying the cron library. The cron library uses a boolean
to indicate running, and that boolean needs to be set to "true"
before we call "Close()". Or "Close()" should prevent "Run()"
from doing anything.
In either case, this solves the issue for a niche unit test bug
in which the test finishes, calling Close(), before there was
an opportunity to start the goroutine. It probably isn't worth
a lot of time investment, and this fix will suffice.
* setup manifest
* added okta guide from steven M
* improved index by adding children
* changed icon to notes.svg
* added meta guide, fixed profile photo fmt
closes #10532
Adds v2 support to the /coordinate endpoint via a query parameter.
v1 already has test cases, and we haven't implemented v2 at the client yet, so the only new test case is an unsupported version.
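A rough sketch of version selection via a query parameter, including the unsupported-version case. The parameter name `version`, the accepted values, and the handler itself are assumptions for illustration, not the endpoint's actual contract.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// coordinateHandler is a hypothetical handler. The "version" parameter name
// and the "1.0" default are assumptions, not taken from the real endpoint.
func coordinateHandler(w http.ResponseWriter, r *http.Request) {
	version := r.URL.Query().Get("version")
	if version == "" {
		version = "1.0" // clients that predate the parameter send nothing
	}
	switch version {
	case "1.0", "2.0":
		fmt.Fprintf(w, "serving tailnet API v%s", version)
	default:
		http.Error(w, "unsupported tailnet API version: "+version, http.StatusBadRequest)
	}
}

// statusFor issues an in-memory request and returns the response status code.
func statusFor(target string) int {
	rec := httptest.NewRecorder()
	coordinateHandler(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor("/coordinate"))
	fmt.Println(statusFor("/coordinate?version=2.0"))
	fmt.Println(statusFor("/coordinate?version=9.9"))
}
```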
Part of #10532
Adds a tailnet ClientService that accepts a net.Conn and serves v1 or v2 of the tailnet API.
Also adds a DRPCService that implements the DRPC interface for the v2 API. This component is within the ClientService, but needs to be reusable and exported so that we can also embed it in the Agent API.
Finally, includes a NewDRPCClient function that takes a net.Conn and runs dRPC in yamux over it on the client side.
Part of #10532
DRPC transport over yamux and in-mem pipes was previously only used on the provisioner APIs, but now will also be used in tailnet. Moved to subpackage of codersdk to avoid import loops.
Fixes flake https://github.com/coder/coder/runs/19639217635
AGPL coordinator used to process node updates for single_tailnet synchronously, but it's been refactored to process async, so in this test we need to wait for it to be processed.
This sends the email the license was issued to, and whether or not it's a trial in the telemetry payload. It's a bit janky since the license parsing is all enterprise licensed.
Addresses the issue in #11185 for the StringMap datatype.
There are other slice data types in our database package that also need to be fixed, but that'll be a different PR
Adds column api_version to the provisioner_daemons table.
This is distinct from the coderd version, and is used to handle breaking changes in the provisioner daemon API.
* added sharkymark FAQs page
* make fmt
* fixed typos for link
* changed FAQs icon to (i)
* satisfied review
* make fmt
* added docs links for coder_app, CODER_ACCESS_URL
* removed mentions of mark
* fixed some minor code formatting issues
* fixed numbered bullets rendering, make fmt
* chore(Makefile): use golangci-lint version from dogfood Dockerfile
* chore(dogfood/Dockerfile): update golangci-lint to latest version
* chore(coderd): address linter complaints
Fixes #10979
Testing code that listens on a specific port has created a long battle with flakes. Previous attempts to deal with this include opening a listener on a port chosen by the OS, then closing the listener, noting the port and starting the test with that port.
This still flakes, notably in macOS which has a proclivity to reuse ports quickly.
Instead of fighting with the chaos that is an OS networking stack, this PR fakes the host networking in tests.
I've taken a small step here, only faking out the Listen() calls that port-forward makes, but I think over time we should be transitioning all networking the CLI does to an abstract interface so we can fake it. This allows us to run in parallel without flakes and
presents an opportunity to test error paths as well.
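To illustrate the idea of an abstract networking interface, here is a small sketch of an in-memory fake. The `Networker` interface, `fakeNet`, and its `Dial` method are hypothetical names chosen for this example; the actual interface in the CLI may look different.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
)

// Networker abstracts Listen so tests can swap in a fake (hypothetical name).
type Networker interface {
	Listen(network, address string) (net.Listener, error)
}

// fakeNet is an in-memory Networker; it never opens an OS port.
type fakeNet struct {
	mu        sync.Mutex
	listeners map[string]*fakeListener
}

func newFakeNet() *fakeNet { return &fakeNet{listeners: map[string]*fakeListener{}} }

func (f *fakeNet) Listen(_, address string) (net.Listener, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.listeners[address]; ok {
		return nil, errors.New("address already in use") // error paths are easy to fake
	}
	l := &fakeListener{addr: fakeAddr(address), conns: make(chan net.Conn), closed: make(chan struct{})}
	f.listeners[address] = l
	return l, nil
}

// Dial hands one end of an in-memory pipe to the listener at address.
func (f *fakeNet) Dial(address string) (net.Conn, error) {
	f.mu.Lock()
	l, ok := f.listeners[address]
	f.mu.Unlock()
	if !ok {
		return nil, errors.New("connection refused")
	}
	client, server := net.Pipe()
	select {
	case l.conns <- server:
		return client, nil
	case <-l.closed:
		return nil, errors.New("connection refused")
	}
}

type fakeAddr string

func (fakeAddr) Network() string  { return "fake" }
func (a fakeAddr) String() string { return string(a) }

type fakeListener struct {
	addr   fakeAddr
	conns  chan net.Conn
	closed chan struct{}
	once   sync.Once
}

func (l *fakeListener) Accept() (net.Conn, error) {
	select {
	case c := <-l.conns:
		return c, nil
	case <-l.closed:
		return nil, net.ErrClosed
	}
}

func (l *fakeListener) Close() error {
	l.once.Do(func() { close(l.closed) })
	return nil
}

func (l *fakeListener) Addr() net.Addr { return l.addr }

func main() {
	f := newFakeNet()
	var n Networker = f
	l, err := n.Listen("tcp", "127.0.0.1:8080")
	if err != nil {
		panic(err)
	}
	defer l.Close()
	go func() {
		if c, err := l.Accept(); err == nil {
			_, _ = c.Write([]byte("hello"))
			c.Close()
		}
	}()
	c, err := f.Dial("127.0.0.1:8080")
	if err != nil {
		panic(err)
	}
	buf := make([]byte, 5)
	read, _ := c.Read(buf)
	fmt.Println(string(buf[:read]))
}
```

Because nothing touches the OS, two tests can "listen" on the same port number in separate `fakeNet` instances and still run in parallel.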
* chore: check if process is nil
We check if process is nil in the ports_supported file.
Just matching that defensive check, not sure if it can be nil.
- Adds a --name argument to provisionerd start
- Plumbs through name to integrated and external provisioners
- Defaults to hostname if not specified for external, hostname-N for integrated
- Adds cliutil.Hostname
* feat: add endpoints to list all authed external apps
Listing the apps allows users to auth to external apps without going through the create workspace flow.
* Force typegen types for some fields of derp health report
* Explicitly allocate slices for RegionReport.{Errors,Warnings} to avoid nulls in API response
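The null-versus-empty distinction comes from how `encoding/json` marshals nil slices. A small sketch, using a trimmed-down and hypothetical `RegionReport` (the real type has more fields):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RegionReport is a trimmed-down, hypothetical version of the report type.
type RegionReport struct {
	Errors   []string `json:"errors"`
	Warnings []string `json:"warnings"`
}

// newRegionReport allocates empty slices so they marshal as [] rather than null.
func newRegionReport() RegionReport {
	return RegionReport{Errors: []string{}, Warnings: []string{}}
}

// mustJSON marshals v and panics on failure; fine for a demonstration.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(RegionReport{}))    // {"errors":null,"warnings":null}
	fmt.Println(mustJSON(newRegionReport())) // {"errors":[],"warnings":[]}
}
```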
The nix image isn't used because it doesn't work, and we haven't been
updating our "pre-nix" tag since the changes were made. Reverts back to
being a regular Dockerfile.
* chore: add Pagination component, add new test, and update other pagination tests
* fix: add back temp spacing for WorkspacesPageView
* chore: update AuditPage to use Pagination
* chore: update UsersPage to use Pagination
* refactor: move parts of Pagination into WorkspacesPageView
* fix: handle empty states for pagination labels better
* docs: rewrite comment for clarity
* refactor: rename components/properties for clarity
* fix: rename component files for clarity
* chore: add story for PaginationContainer
* chore: rename story for clarity
* fix: handle undefined case better
* fix: update imports for PaginationContainer mocks
* fix: update story values for clarity
* fix: update scroll logic to go to the bottom instead of the top
* fix: update mock setup for test
* fix: update stories
* fix: remove scrolling functionality
* fix: remove deprecated property
* refactor: rename prop
* fix: remove debounce flake
Updates coder/customers#365
This PR updates our migration framework to run all migrations in a single transaction. This is the same behavior we had in v1 and ensures that failed migrations don't bring the whole deployment down. If a migration fails now, it will automatically be rolled back to the previous version, allowing the deployment to continue functioning.
Relates to #8965
* Fixes offlinedocs that broke from change in feat(coderd/healthcheck): add access URL error codes and healthcheck doc #10915 by removing the offending anchor links from the page subheadings.
* Makes offlinedocs also conditional on changes to docs
Fixes flake seen here: https://github.com/coder/coder/runs/19170327767
The goroutine that attempts to dial the socket didn't complete before the test did. Here we add an explicit wait for it to complete in each run of the loop.
Spotted during a code read. ConnIO unlocks the mutex before attempting to write to the response channel, which could allow another goroutine to call Close() and close the channel, causing a panic.
Fix is to hold the mutex. This won't cause a deadlock because the `select{}` has a `default` case, so we won't block even if the receiver isn't keeping up.
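The locking pattern can be sketched as follows. The `connIO` type here is a simplified stand-in with hypothetical field and method names; only the pattern (hold the mutex across a non-blocking send) is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// connIO is a simplified, hypothetical stand-in for the real type.
type connIO struct {
	mu        sync.Mutex
	closed    bool
	responses chan string
}

func newConnIO(buffer int) *connIO {
	return &connIO{responses: make(chan string, buffer)}
}

// Enqueue holds the mutex for the whole send, so Close cannot close the
// channel between the closed check and the write. The default case keeps the
// send non-blocking, which is why holding the lock cannot deadlock.
func (c *connIO) Enqueue(msg string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return false
	}
	select {
	case c.responses <- msg:
		return true
	default:
		return false // receiver is not keeping up; drop instead of blocking
	}
}

// Close marks the connection closed and closes the channel under the same lock.
func (c *connIO) Close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return
	}
	c.closed = true
	close(c.responses)
}

func main() {
	c := newConnIO(1)
	fmt.Println(c.Enqueue("a")) // true
	fmt.Println(c.Enqueue("b")) // false: buffer is full
	c.Close()
	fmt.Println(c.Enqueue("c")) // false: closed, and no panic
}
```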
Adds cleanup queries to clean out "lost" peer and tunnel state after 24 hours. We leave this state in the database so that anything trying to connect to the peer can see that it was lost, but clean it up after 24 hours to ensure our table doesn't grow without bounds.
Adds support for graceful disconnect to PGCoordinator. When peers gracefully disconnect, they send a disconnect message. This triggers the peer to be disconnected from all tunneled peers.
The Multi-Agent Client supports graceful disconnect, since it is in memory and we know that when it is closed, we really mean to disconnect.
The v1 agent and client Websocket connections do not support graceful disconnect, since the v1 protocol doesn't have this feature. That means that if a v1 peer connects to a v2 peer, when the v1 peer's coordinator connection is closed, the v2 peer will
see it as "lost" since we don't know whether the v1 peer meant to disconnect, or it just lost connectivity to the coordinator.
* wip: commit current progress on usePaginatedQuery
* chore: add cacheTime to users query
* chore: update cache logic for UsersPage usersQuery
* wip: commit progress on Pagination
* chore: add function overloads to prepareQuery
* wip: commit progress on usePaginatedQuery
* docs: add clarifying comment about implementation
* chore: remove optional prefetch property from query options
* chore: redefine queryKey
* refactor: consolidate how queryKey/queryFn are called
* refactor: clean up pagination code more
* fix: remove redundant properties
* refactor: clean up code
* wip: commit progress on usePaginatedQuery
* wip: commit current pagination progress
* docs: clean up comments for clarity
* wip: get type signatures compatible (breaks runtime logic slightly)
* refactor: clean up type definitions
* chore: add support for custom onInvalidPage functions
* refactor: clean up type definitions more for clarity reasons
* chore: delete Pagination component (separate PR)
* chore: remove cacheTime fixes (to be resolved in future PR)
* docs: add clarifying/intellisense comments for DX
* refactor: link users queries to same queryKey implementation
* docs: remove misleading comment
* docs: more comments
* chore: update onInvalidPage params for more flexibility
* fix: remove explicit any
* refactor: clean up type definitions
* refactor: rename query params for consistency
* refactor: clean up input validation for page changes
* refactor/fix: update hook to be aware of async data
* chore: add contravariance to dictionary
* refactor: increase type-safety of usePaginatedQuery
* docs: more comments
* chore: move usePaginatedQuery file
* fix: add back cacheTime
* chore: swap in usePaginatedQuery for users table
* chore: add goToFirstPage to usePaginatedQuery
* fix: make page redirects work properly
* refactor: clean up clamp logic
* chore: swap in usePaginatedQuery for Audits table
* refactor: move dependencies around
* fix: remove deprecated properties from hook
* refactor: clean up code more
* docs: add todo comment
* chore: update testing fixtures
* wip: commit current progress for tests
* fix: update useEffectEvent to sync via layout effects
* wip: commit more progress on tests
* wip: stub out all expected test cases
* wip: more test progress
* wip: more test progress
* wip: commit more test progress
* wip: AHHHHHHHH
* chore: finish two more test cases
* wip: add in all tests (still need to investigate prefetching)
* refactor: clean up code slightly
* fix: remove math bugs when calculating pages
* fix: wrap up all testing and clean up cases
* docs: update comments for clarity
* fix: update error-handling for invalid page handling
* fix: apply suggestions
Relates to #8965
- Added error codes for separate code paths in health checks
- Prefixed errors and warnings with error code prefixes
- Added a docs page with details on each code, cause and solution
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
The 'userOIDC' method body was getting unwieldy.
I think there is a good way to redesign the flow, but
I do not want to undertake that at this time.
The easy win is just to move some LoC to other methods
and cleanup the main method.
Drop "New" and "Builder" from the function names, in favor of the top-level resource created. This shortens tests and gives a nice syntax. Since everything is a builder, the prefix and suffix don't add much value and just make things harder to read.
I've also chosen to leave `Do()` as the function to insert into the database. Even though it's a builder pattern, I fear `.Build()` might be confusing with Workspace Builds. One other idea is `Insert()` but if we later add dbfake functions that update, this might be inconsistent.
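For readers unfamiliar with the resulting syntax, a sketch of the naming convention follows. All types and option names (`WorkspaceRow`, `store`, `Named`, `Deleted`) are hypothetical; only the shape (a function named after the resource, fluent options, and `Do()` to insert) reflects the description above.

```go
package main

import "fmt"

// WorkspaceRow is a hypothetical database row.
type WorkspaceRow struct {
	Name    string
	Deleted bool
}

// store is a stand-in for the database.
type store struct{ workspaces []WorkspaceRow }

// workspaceBuilder accumulates options until Do inserts the row.
type workspaceBuilder struct {
	db  *store
	row WorkspaceRow
}

// Workspace starts a builder; it is named after the resource it creates,
// with no "New" prefix or "Builder" suffix.
func Workspace(db *store) workspaceBuilder {
	return workspaceBuilder{db: db, row: WorkspaceRow{Name: "workspace"}}
}

// Named overrides the default name (hypothetical option).
func (b workspaceBuilder) Named(name string) workspaceBuilder {
	b.row.Name = name
	return b
}

// Deleted marks the row as deleted (hypothetical option).
func (b workspaceBuilder) Deleted() workspaceBuilder {
	b.row.Deleted = true
	return b
}

// Do inserts the row into the store and returns it.
func (b workspaceBuilder) Do() WorkspaceRow {
	b.db.workspaces = append(b.db.workspaces, b.row)
	return b.row
}

func main() {
	db := &store{}
	ws := Workspace(db).Named("dev").Deleted().Do()
	fmt.Println(ws.Name, ws.Deleted, len(db.workspaces))
}
```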
I noticed we have been overusing colors in the UI, so simplifying is better for the "look and feel" and maintaining the styles over time.

If you want to have a better sense of what it looks like, I recommend you go to the Chromatic snapshot.
- Updates plugin staleness check to check mtime instead of atime, as atime has been shown to be unreliable
- Updates existing unit test to use a real filesystem as Afero's in-memory FS doesn't support atimes at all
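A minimal sketch of an mtime-based staleness check against a real filesystem. The function name, the 24-hour threshold, and the file names are illustrative assumptions, not the values used by the plugin cache.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// isStale reports whether path was last modified more than maxAge before now.
// It reads only the modification time (mtime), never the access time.
func isStale(path string, maxAge time.Duration, now time.Time) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return now.Sub(info.ModTime()) > maxAge, nil
}

// demo writes two files in a temp dir, backdates one, and checks both.
func demo() (freshStale, oldStale bool, err error) {
	dir, err := os.MkdirTemp("", "plugins")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	fresh := filepath.Join(dir, "fresh")
	old := filepath.Join(dir, "old")
	for _, p := range []string{fresh, old} {
		if werr := os.WriteFile(p, []byte("plugin"), 0o600); werr != nil {
			return false, false, werr
		}
	}
	// Backdate the second file by 48 hours.
	past := time.Now().Add(-48 * time.Hour)
	if cerr := os.Chtimes(old, past, past); cerr != nil {
		return false, false, cerr
	}

	now := time.Now()
	if freshStale, err = isStale(fresh, 24*time.Hour, now); err != nil {
		return false, false, err
	}
	oldStale, err = isStale(old, 24*time.Hour, now)
	return freshStale, oldStale, err
}

func main() {
	fmt.Println(demo())
}
```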
Convert to builder for consistency with rest of the package. This will make it easier to use, and means we can drop "Builder" from function arguments since they are all builders in the package.
* Adds workspace proxy section to health page
* Conditionally places workspace proxy warnings in errors or warnings based on calculated severity
* Adds some more stories we were missing for HealthPage
Fixes #10799
The flake happens when we try to remote forward, but the port we've chosen is not free. In the flaked example, it's actually the SSH listener that occupies the port we try to remote forward, leading to confusing reads (c.f. the linked issue).
This fix simplifies the tests considerably by using the Go ssh client, rather than shelling out to OpenSSH. This avoids using a pseudoterminal, avoids the need for starting any local OS listeners to communicate the forwarding (go SSH just returns in-process listeners), and avoids an OS listener to wire OpenSSH up to the agentConn.
With the simplified logic, we can immediately tell if a remote forward on a random port fails, so we can do this in a loop until success or timeout.
I've also simplified and fixed up the other forwarding tests. Since we set up forwarding in-process with Go ssh, we can remove a lot of the `require.Eventually` logic.
- Adds a template_insights pseudo-resource
- Grants auditor and template admin roles read access on template_insights
- Updates existing RBAC checks to check for read template_insights, falling back to template update permissions where necessary
- Updates TemplateLayout to show Insights tab if can read template_insights or can update template
Adds a health check for workspace proxies:
- Healthy iff all proxies are healthy and the same version,
- Warning if some proxies are unhealthy,
- Error if all proxies are unhealthy, or do not all have the same version.
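The three rules above can be sketched as a single function. The `Proxy` and `Severity` types are hypothetical, and treating an empty proxy list as healthy is an assumption that the rules do not spell out.

```go
package main

import "fmt"

// Severity and Proxy are hypothetical types for this sketch.
type Severity string

const (
	SeverityOK      Severity = "ok"
	SeverityWarning Severity = "warning"
	SeverityError   Severity = "error"
)

type Proxy struct {
	Name    string
	Healthy bool
	Version string
}

// proxySeverity applies the rules in order of precedence: error wins over
// warning, and warning wins over ok.
func proxySeverity(proxies []Proxy) Severity {
	if len(proxies) == 0 {
		return SeverityOK // assumption: no proxies means nothing to report
	}
	unhealthy := 0
	sameVersion := true
	for _, p := range proxies {
		if !p.Healthy {
			unhealthy++
		}
		if p.Version != proxies[0].Version {
			sameVersion = false
		}
	}
	switch {
	case unhealthy == len(proxies) || !sameVersion:
		return SeverityError
	case unhealthy > 0:
		return SeverityWarning
	default:
		return SeverityOK
	}
}

func main() {
	fmt.Println(proxySeverity([]Proxy{
		{Name: "a", Healthy: true, Version: "v2.5.0"},
		{Name: "b", Healthy: false, Version: "v2.5.0"},
	}))
}
```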
* refactor: remove workspace error enums
* fix: add in retry button for failed workspaces
* fix: make handleBuildRetry auto-detect debug permissions
* chore: consolidate retry messaging
* chore: update renderWorkspacePage to accept parameters
* chore: make workspace test helpers take explicit workspace parameter
* refactor: update how parameters for tests are defined
* fix: update old tests to be correctly parameterized
fixes #10810
The tailnet coordinators don't depend on replicasync, so we can still enable HA coordinators even if the relay URL is unset.
The in-memory, non-HA coordinator probably has lower latency than the PG Coordinator, since we have to query the database, so enterprise customers might want to disable it for single-replica deployments, but this PR default-enables the HA coordinator. We could add support later to disable it if anyone complains. Latency setting up connections matters, but I don't believe the coordinator contributes significantly at this point for reasonable postgres round-trip-time.
Man, graceful shutdown is hard. Even after my changes, we were still hitting a graceful shutdown race: https://github.com/coder/coder/runs/18886842123
The problem was that while we attempt a graceful shutdown at the SSH layer by closing the session for writing, we were not giving it a chance to complete before continuing to tear down the stack of closers, including one that closes the netstack, and thus drop the TCP connection before it closes.
I'd like to convert dbfake into a builder pattern to prevent a proliferation of XXXWithYYY methods. This is one step of the way by removing the Non-builder function.
Refactors SSH tests to skip provisionerd and instead use dbfake to insert workspaces and builds. This should make tests faster and more reliable.
dbfake.WorkspaceBuild is refactored to use a "builder" pattern with "fluent" options, as the number of options and variants was starting to get out of hand.
* fix: clarify language in orphan section of delete modal
* tinted title
* Update site/src/pages/WorkspacePage/WorkspaceDeleteDialog/WorkspaceDeleteDialog.tsx
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
* prettier
---------
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
* feat: implement deprecated flag for templates to prevent new workspaces
* Add deprecated filter to template fetching
* Add deprecated to template table
* Add deprecated notice to template page
* Add ui to deprecate a template
* Remove Typography from NavbarView
* Remove Typography from EmptyState
* Remove Typography from Paywall
* Fix font size
* Remove Typography from CliAuthPage
* Remove Typography from Single SignOn
* Remove Typography from file dialog
* Remove from not found
* Remove from Section
* Remove from global snackbar
* Remove Typography component
* Add eslint rule
re: #10528
Refactors PG Coordinator to work with the Tailnet v2 API, including wrappers for the existing v1 API.
The debug endpoint functions, but doesn't return sensible data, that will be in another stacked PR.
> Can someone help me understand the differences between these env variables:
>
> CODER_REDIRECT_TO_ACCESS_URL
> CODER_TLS_REDIRECT_HTTP_TO_HTTPS
> CODER_TLS_REDIRECT_HTTP
Oh man, what a mess. It looks like `CODER_TLS_REDIRECT_HTTP` appears in our config docs. Maybe that was the initial name for the environment variable?
At some point, both the flag and the environment variable were `--tls-redirect-http-to-https` and `CODER_TLS_REDIRECT_HTTP_TO_HTTPS`. `CODER_TLS_REDIRECT_HTTP` did nothing.
However, when we introduced `CODER_REDIRECT_TO_ACCESS_URL`, we put in some deprecation code that was maybe fat-fingered such that we accept the environment variable `CODER_TLS_REDIRECT_HTTP` but the flag `--tls-redirect-http-to-https`. Our docs still refer to `CODER_TLS_REDIRECT_HTTP` at https://coder.com/docs/v2/latest/admin/configure#address
So, I think what we gotta do is still accept `CODER_TLS_REDIRECT_HTTP` since it was working and in an example doc, but also fix the deprecation code to accept `CODER_TLS_REDIRECT_HTTP_TO_HTTPS` environment variable.
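As a rough illustration of accepting both names, here is a sketch that resolves the setting from either environment variable. This is not how the CLI option library is wired up; the helper names and the precedence (the longer name wins when both are set) are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	envCurrent = "CODER_TLS_REDIRECT_HTTP_TO_HTTPS"
	envLegacy  = "CODER_TLS_REDIRECT_HTTP"
)

// redirectHTTP resolves the deprecated setting from either variable name.
// lookup has the same signature as os.LookupEnv so tests can inject a map.
// Assumption: when both are set, the longer name takes precedence.
func redirectHTTP(lookup func(string) (string, bool)) (bool, error) {
	for _, name := range []string{envCurrent, envLegacy} {
		raw, ok := lookup(name)
		if !ok {
			continue
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return false, fmt.Errorf("parse %s: %w", name, err)
		}
		return enabled, nil
	}
	return false, nil
}

// fromMap adapts a map to the lookup signature.
func fromMap(env map[string]string) func(string) (string, bool) {
	return func(name string) (string, bool) {
		v, ok := env[name]
		return v, ok
	}
}

func main() {
	// In a real program, pass os.LookupEnv instead of fromMap(...).
	legacy, err := redirectHTTP(fromMap(map[string]string{envLegacy: "true"}))
	fmt.Println(legacy, err)
}
```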
Re-enables TestSSH/RemoteForward_Unix_Signal and addresses the underlying race: we were not closing the remote forward on context expiry, only the session and connection.
However, there is still a more fundamental issue in that we don't have the ability to ensure that TCP sessions are properly terminated before tearing down the Tailnet conn. This is due to the assumption in the sockets API, that the underlying IP interface is long
lived compared with the TCP socket, and thus closing a socket returns immediately and does not wait for the TCP termination handshake --- that is handled async in the tcpip stack. However, this assumption does not hold for us and tailnet, since on shutdown,
we also tear down the tailnet connection, and this can race with the TCP termination.
Closing the remote forward explicitly should prevent forward state from accumulating, since the Close() function waits for a reply from the remote SSH server.
I've also attempted to work around the TCP/tailnet issue for `--stdio` by using `CloseWrite()` instead of `Close()`. By closing the write side of the connection, we half-close the TCP connection, and the server detects this and closes the other direction, which then
triggers our read loop to exit only after the server has had a chance to process the close.
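The half-close sequence can be demonstrated with plain TCP on loopback. This sketch uses the standard library's `(*net.TCPConn).CloseWrite`; the helper names are hypothetical, and the real code operates on SSH sessions over tailnet rather than a raw loopback socket.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// exchange sends req, half-closes the connection, then reads until the
// server closes its side. The read only finishes after the server has seen
// our EOF and had a chance to process it.
func exchange(addr, req string) (string, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := io.WriteString(conn, req); err != nil {
		return "", err
	}
	tcp, ok := conn.(*net.TCPConn)
	if !ok {
		return "", errors.New("not a TCP connection")
	}
	// CloseWrite shuts down only the write side; reads still work.
	if err := tcp.CloseWrite(); err != nil {
		return "", err
	}
	reply, err := io.ReadAll(conn)
	return string(reply), err
}

// serveOnce accepts one connection, reads until the client's half-close,
// replies, and then closes the other direction.
func serveOnce(l net.Listener) {
	conn, err := l.Accept()
	if err != nil {
		return
	}
	defer conn.Close()
	req, _ := io.ReadAll(conn) // returns once the client half-closes
	_, _ = io.WriteString(conn, "processed: "+string(req))
}

// demo wires the two halves together on an OS-chosen loopback port.
func demo() (string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer l.Close()
	go serveOnce(l)
	return exchange(l.Addr().String(), "bye")
}

func main() {
	fmt.Println(demo())
}
```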
TODO in a stacked PR is to implement this logic for `vscodessh` as well.
Marked as a breaking change as the previous activity bump was always the TTL duration of the workspace/template.
This change is more cost conservative, only bumping by 1 hour for workspace activity. To accommodate wrap around, e.g. bumping a workspace into the next autostart, the deadline is bumped by the TTL if the workspace crosses the autostart threshold.
This is a niche case that is most likely caused by an idle terminal keeping a workspace alive overnight. The next morning, the activity bump at autostart extends the deadline by the default TTL, much as if the workspace had been autostarted again.
In practice, a good way to avoid this is to set a max_deadline of <24hrs to avoid wrap around entirely.
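A sketch of the bump arithmetic described above. The function name and signature are hypothetical, and the interpretation that a bump crossing the next autostart yields "autostart plus TTL" is an assumption; the sketch also ignores the existing deadline, which a real implementation would never shorten.

```go
package main

import (
	"fmt"
	"time"
)

const activityBump = time.Hour

// nextDeadline computes the deadline after workspace activity at now.
// A zero nextAutostart means no autostart schedule; a zero maxDeadline
// means no cap. Both conventions are assumptions of this sketch.
func nextDeadline(now, nextAutostart, maxDeadline time.Time, ttl time.Duration) time.Time {
	deadline := now.Add(activityBump)
	// Wrap around: the 1 hour bump would cross the next autostart, so treat
	// the workspace as if it had been autostarted and give it a full TTL.
	if !nextAutostart.IsZero() && deadline.After(nextAutostart) {
		deadline = nextAutostart.Add(ttl)
	}
	if !maxDeadline.IsZero() && deadline.After(maxDeadline) {
		deadline = maxDeadline
	}
	return deadline
}

// demo returns how far the deadline lies from now in three scenarios.
func demo() (plain, wrapped, capped time.Duration) {
	now := time.Date(2024, time.January, 1, 22, 0, 0, 0, time.UTC)
	ttl := 8 * time.Hour
	autostart := now.Add(30 * time.Minute)
	limit := now.Add(4 * time.Hour)

	plain = nextDeadline(now, time.Time{}, time.Time{}, ttl).Sub(now)
	wrapped = nextDeadline(now, autostart, time.Time{}, ttl).Sub(now)
	capped = nextDeadline(now, autostart, limit, ttl).Sub(now)
	return plain, wrapped, capped
}

func main() {
	fmt.Println(demo())
}
```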
* Adds an annotation format_duration_ns to all deployment values of type clibase.Duration
* Adds a unit test that complains if you forget to add the above annotation to a clibase.Duration
* Modifies optionValue() to check for the presence of format_duration_ns when displaying an option.
Adds a Logger to cli Invocation and standardizes CLI commands to use it. clitest creates a test logger by default so that CLI command logs are captured in the test logs.
CLI commands that do their own log configuration are modified to add sinks to the existing logger, rather than create a new one. This ensures we still capture logs in CLI tests.
Bumps [github.com/coder/retry](https://github.com/coder/retry) from 1.4.0 to 1.5.1.
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/coder/retry/commit/f5ccc4d2d45135bf65c7ccc5e78942dd7df19c84"><code>f5ccc4d</code></a> Fix double-scaling bug</li>
<li><a href="https://github.com/coder/retry/commit/14c7c27e14e40827a36754dd2071b09249d426f8"><code>14c7c27</code></a> Add support for Jitter (<a href="https://redirect.github.com/coder/retry/issues/28">#28</a>)</li>
<li><a href="https://github.com/coder/retry/commit/12627b155ff59e5f62c15d262ba1ba06f17daa90"><code>12627b1</code></a> Update README to give a goto example</li>
<li><a href="https://github.com/coder/retry/commit/a8710231a1a7a7f884eb894aca0bee24c5caf21c"><code>a871023</code></a> Make minor format improvements to README</li>
<li>See full diff in <a href="https://github.com/coder/retry/compare/v1.4.0...v1.5.1">compare view</a></li>
</ul>
</details>
* fix: add focus styling to checkboxes
* fix: add focus styling to icon buttons
* fix: add focus styling to switches
* fix: swap outlines for box-shadows for more styling control
Fixes an issue where remote forwards are not correctly torn down when using OpenSSH with `coder ssh --stdio`. OpenSSH sends a disconnect signal, but then also sends SIGHUP to `coder`. Previously, we just exited when we got SIGHUP, and this raced against properly disconnecting.
Fixes https://github.com/coder/customers/issues/327
* coder list: adds information about next start / stop to available columns (not default)
* coder schedule: show is now essentially coder list with a different set of columns
* Updates cli schedule unit tests to use new dbfake
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* chore: revamp Page Utility tests
* refactor: simplify component design for PageButton
* chore: beef up isNonInitialPage and add tests
* docs: clean up comments
* chore: quick refactor for buildPagedList
* refactor: clean up math calculations for buildPagedList
* chore: rename PageButtons file
* chore: revamp how nav buttons are defined
* fix: remove test disabled state
* chore: clean up base nav button
* chore: rename props for clarity
* refactor: clean up logic for isNonInitialPage
* chore: add more tests and catch bugs
* docs: fix confusing typo in comments
* chore: add one more test case for pagination buttons
* refactor: update props definition for PaginationNavButton
* fix: remove possible state sync bugs
* Updated testingWithOwnerUser ruleguard rule to detect:
a) Passing client from coderdenttest.New() to clitest.SetupConfig() similar to what already exists for AGPL code
b) Usage of any method of the owner client from coderdenttest.New() - all usages of the owner client must be justified with a `//nolint:gocritic` comment.
* Fixed resulting linter complaints.
* Added new coderdtest helpers CreateGroup and UpdateTemplateMeta.
* Modified check_enterprise_import.sh to ignore scripts/rules.go.
This change tests that the patch request is cancelled instead of hoping
that there's no race between context cancellations leading to patch
never being called.
* chore: add query for a user's groups
* chore: integrate user groups into UI
* refactor: split UI card into separate component
* chore: enforce alt text for AvatarCard
* chore: add proper alt text support for Avatar
* fix: update props for Avatar call sites
* finish AccountPage changes
* wip: commit progress on AvatarCard
* fix: add better UI error handling
* fix: update theme setup for AvatarCard
* fix: update styling for AccountPage
* fix: make error message conditional
* chore: update styling for AvatarCard
* chore: finish AvatarCard
* fix: add maxWidth support to AvatarCard
* chore: update how no max width is defined
* chore: add AvatarCard stories
* fix: remove incorrect semantics for AvatarCard
* docs: add comment about flexbox behavior
* docs: add clarifying text about prop
* fix: fix grammar for singular groups
* refactor: split off AccountUserGroups and add story
* fix: differentiate mock groups more
* Adds a Contains() method on MockAuditor to help with asserting the presence of an audit log with specific fields.
* Updates existing usages of verifyAuditWorkspaceCreated to use the new helper
* Updates test referenced in PR#10396.
* feat: add dbfakedata for workspace builds and resources
This creates `coderdtest.NewWithDatabase` and adds a series of
helper functions to `dbfake` that insert structured fake data
for resources into the database.
It allows us to remove provisionerd from a significant amount of
tests which should speed them up and reduce flakes.
* Rename dbfakedata to dbfake
* Migrate workspaceagents_test.go to use the new dbfake
* Migrate agent_test.go to use the new fakes
* Fix comments
* feat: add observability configuration values to deployment page
- Moved audit logging to this page
- Logging, prometheus, tracing, debug, and pprof settings
Fixes flake seen here: https://github.com/coder/coder/actions/runs/6716682414/job/18253279654
The test used a cron schedule to compute autobuild ticks, with ticks every hour on the hour. The default TTL was set to an hour. Usually, the next tick is less than one hour in the future, unless the test runs at :00 past the hour, which it did in my flake'd
run. But, given that this is an autostop test, the cron schedule is irrelevant (such schedules are used for auto_start_). So, I've removed it from the test and compute the build ticks directly.
Also, the test originally had the workspace TTL set to longer than the default template TTL, and then tested that no build happened when the tick was prior to both. This seems odd to me, as we want to demonstrate that the executor disregards the workspace TTL.
So, I changed the test to set the workspace TTL shorter, and then send in a tick between the two, verify that we don't autostop, then a tick after the template TTL and verify that we do.
I've said it before, I'll say it again: you can't create a timed context before calling `t.Parallel()` and then use it after.
Fixes flakes like https://github.com/coder/coder/actions/runs/6716682414/job/18253279157
I've chosen just to drop `t.Parallel()` entirely rather than create a second context after the parallel call, since the vast majority of the test time happens before where the parallel call was. It does all the tailnet setup before `t.Parallel()`.
Leaving a call to `t.Parallel()` is a bug risk for future maintainers to come in and use the wrong context in the latter part of the test by accident.
* Fit once during creation
This does not fix any bugs (that I know of) but we only need to fit once
when the terminal is created, not every time we reconnect. Granted,
currently we do not support reconnecting without refreshing anyway so it
does not really matter, but this just seems more correct.
Plus now we will not have to pass the fit addon around.
* Pass size when connecting web socket URL
I think this will solve an issue where screen does not correctly
handle an immediate resize. It seems to ignore the resize, but even if
you send it again nothing changes, seemingly thinking it is already at
that size?
* Use new struct for decoding reconnecting pty requests
Decoding a JSON message does not touch omitted (or null) fields so once
a message with a resize comes in, every single message from that point
will cause a resize.
I am not sure if this is an actual problem in practice but at the very
least it seems unintentional.
* Remove terminalXService
This is a prelude to the change I actually want to make, which is to
send the size of the terminal on the web socket URL after we do a fit.
I have found xstate so confusing that it was easier to just rewrite it.
* Fix hanging tests
I am not really sure what ws.connected is doing but it seems to somehow
block updates. Something to do with `act()` maybe?
Basically, the useEffect creating the terminal never updates once the
config query finishes, so the web socket is never created, and the test
hangs forever.
It might have been working before only because the web socket was
created using xstate rather than useEffect and once it connected it
would unblock and React could update again but this is just a guess.
* Ignore other config changes
The terminal only cares about the renderer specifically, no need to
recreate the terminal if something else changes.
* Break out port forward URL open to util
Felt like this could be broken out to reduce the component size. Also
trying to figure out why it is causing the terminal to create multiple
times.
* Prevent handleWebLink change from recreating terminal
Depending on the timing, handleWebLink was causing the terminal to get
recreated. We only need to create the terminal once unless the render
type changes.
Recreating the terminal was also recreating the web socket pointlessly.
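The "Use new struct for decoding reconnecting pty requests" item above rests on how `encoding/json` treats omitted fields: decoding into an existing value leaves fields that are absent from the message unchanged. A minimal sketch of that behavior follows, assuming a hypothetical `ptyRequest` message shape (the real request type is not shown in this commit message).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ptyRequest is a hypothetical message shape used only for this sketch.
type ptyRequest struct {
	Data   string `json:"data"`
	Height uint16 `json:"height"`
	Width  uint16 `json:"width"`
}

// decodeReused decodes every message into the same struct value, so fields
// omitted from a later message keep the values set by an earlier one.
func decodeReused(msgs []string) ([]ptyRequest, error) {
	var req ptyRequest
	out := make([]ptyRequest, 0, len(msgs))
	for _, m := range msgs {
		if err := json.Unmarshal([]byte(m), &req); err != nil {
			return nil, err
		}
		out = append(out, req)
	}
	return out, nil
}

// decodeFresh decodes each message into a new zero-valued struct, so
// omitted fields stay at their zero value.
func decodeFresh(msgs []string) ([]ptyRequest, error) {
	out := make([]ptyRequest, 0, len(msgs))
	for _, m := range msgs {
		var req ptyRequest
		if err := json.Unmarshal([]byte(m), &req); err != nil {
			return nil, err
		}
		out = append(out, req)
	}
	return out, nil
}

func main() {
	msgs := []string{`{"height":24,"width":80}`, `{"data":"ls"}`}
	reused, _ := decodeReused(msgs)
	fresh, _ := decodeFresh(msgs)
	// With the reused struct, the second message still carries the old size.
	fmt.Println(reused[1].Height, reused[1].Width)
	// With a fresh struct, the second message carries no size.
	fmt.Println(fresh[1].Height, fresh[1].Width)
}
```

If a non-zero size is what triggers a resize, the reused-struct variant would resize on every message after the first one that carried a size, which matches the behavior the commit message describes.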
* refactor: extract UserRoleCell into separate component
* wip: add placeholder Groups column
* fix: remove redundant css styles
* refactor: update EditRolesButton to use Sets to detect selections
* wip: commit progress for updated roles column
* wip: commit current role pill progress
* fix: update state sync logic
* chore: add groupsByUserId query options factory
* fix: update return value of select function
* chore: drill groups data down to cell component
* wip: commit current cell progress
* fix: remove redundant classes
* wip: commit current styling progress
* fix: update line height for CTA
* fix: update spacing
* chore: add tooltip for Groups column header
* fix: remove tsbuild file
* refactor: consolidate tooltip components
* fix: update font size defaults inside theme
* fix: expand hoverable/clickable area of groups cell
* fix: remove possible undefined cases from HelpTooltip
* chore: add popover functionality to groups
* wip: commit progress on groups tooltip
* fix: remove zero-height group name visual bug
* feat: get basic version of user group tooltips done
* perf: move sort order callback outside loop
* fix: update spacing for tooltip
* feat: make popovers entirely hover-based
* fix: disable scroll locking for popover
* docs: add comments explaining some pitfalls with Popover component
* refactor: simplify userRoleCell implementation
* feat: complete main feature
* fix: prevent scroll lock for role tooltips
* fix: change import to type import
* refactor: simplify how groups are clustered
* refactor: update UserRoleCell to use Popover
* refactor: remove unnecessary fragment
* chore: add id/aria support for Popover
* refactor: update UserGroupsCell to use Popover
* chore: redo visual design for UserGroupsCell
* fix: shrink UserGroupsCell text
* fix: update UsersTable test to include groups info
if: (github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request_target'
uses: contributor-assistant/github-action@v2.3.1
uses: contributor-assistant/github-action@v2.4.0
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# the below token should have repo scope and must be manually added by you in the repository's secret
workflow_dispatch: # allows to run manually for testing
pull_request:
branches:
- main
paths:
- "docs/**"
jobs:
check-docs:
@@ -24,7 +29,7 @@ jobs:
file-path: "./README.md"
- name: Send Slack notification
if: failure()
if: failure() && github.event_name == 'schedule'
run: |
curl -X POST -H 'Content-type: application/json' -d '{"msg":"Broken links found in the documentation. Please check the logs at ${{ env.LOGS_URL }}"}' ${{ secrets.DOCS_LINK_SLACK_WEBHOOK }}
if [[ "$${COMMIT_FROM}" == "$${COMMIT_TO}" ]]; then echo "Nothing to do!"; exit 0; fi
echo "DROP DATABASE IF EXISTS migrate_test_$${COMMIT_FROM}; CREATE DATABASE migrate_test_$${COMMIT_FROM};" | psql 'postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable'
go run ./scripts/migrate-test/main.go --from="$$COMMIT_FROM" --to="$$COMMIT_TO" --postgres-url="postgresql://postgres:postgres@localhost:5432/migrate_test_$${COMMIT_FROM}?sslmode=disable"
# NOTE: we set --memory to the same size as a GitHub runner.
test-postgres-docker:
docker rm -f test-postgres-docker || true
docker run \
@@ -697,6 +825,7 @@ test-postgres-docker:
--name test-postgres-docker \
--restart no \
--detach \
--memory 16GB \
gcr.io/coder-dev-1/postgres:13 \
-c shared_buffers=1GB \
-c work_mem=1GB \
@@ -715,9 +844,21 @@ test-postgres-docker:
# Make sure to keep this in sync with test-go-race from .github/workflows/ci.yaml.
[Coder](https://coder.com) enables organizations to set up development environments in the cloud. Environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and are automatically shut down when not in use to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads that are most beneficial to them.
[Coder](https://coder.com) enables organizations to set up development environments in their public or private cloud infrastructure. Cloud development environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and are automatically shut down when not in use to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads that are most beneficial to them.
- Define development environments in Terraform
- Define cloud development environments in Terraform
- EC2 VMs, Kubernetes Pods, Docker Containers, etc.
- Automatically shutdown idle resources to save on costs
- Onboard developers in seconds instead of days
@@ -44,7 +43,7 @@
## Quickstart
The most convenient way to try Coder is to install it on your local machine and experiment with provisioning development environments using Docker (works on Linux, macOS, and Windows).
The most convenient way to try Coder is to install it on your local machine and experiment with provisioning cloud development environments using Docker (works on Linux, macOS, and Windows).
```
# First, install Coder
@@ -53,8 +52,8 @@ curl -L https://coder.com/install.sh | sh
# Start the Coder server (caches data in ~/.cache/coder)
coder server
# Navigate to http://localhost:3000 to create your initial user
# Create a Docker template, and provision a workspace
# Navigate to http://localhost:3000 to create your initial user,
# create a Docker template, and provision a workspace
```
## Install
@@ -68,11 +67,11 @@ Releases.
curl -L https://coder.com/install.sh | sh
```
You can run the install script with `--dry-run` to see the commands that will be used to install without executing them. You can modify the installation process by including flags. Run the install script with `--help` for reference.
You can run the install script with `--dry-run` to see the commands that will be used to install without executing them. Run the install script with `--help` for additional flags.
> See [install](docs/install) for additional methods.
> See [install](https://coder.com/docs/v2/latest/install) for additional methods.
Once installed, you can start a production deployment<sup>1</sup> with a single command:
Once installed, you can start a production deployment with a single command:
```shell
# Automatically sets up an external access URL on *.try.coder.app
@@ -82,8 +81,6 @@ coder server
coder server --postgres-url <url> --access-url <url>
```
> <sup>1</sup> For production deployments, set up an external PostgreSQL instance for reliability.
Use `coder --help` to get a list of flags and environment variables. Use our [install guides](https://coder.com/docs/v2/latest/install) for a full walkthrough.
## Documentation
@@ -96,19 +93,13 @@ Browse our docs [here](https://coder.com/docs/v2) or visit a specific section be
- [**Administration**](https://coder.com/docs/v2/latest/admin): Learn how to operate Coder
- [**Enterprise**](https://coder.com/docs/v2/latest/enterprise): Learn about our paid features built for large teams
## Community and Support
## Support
Feel free to [open an issue](https://github.com/coder/coder/issues/new) if you have questions, run into bugs, or have a feature request.
[Join our Discord](https://discord.gg/coder) to provide feedback on in-progress features, and chat with the community using Coder!
## Contributing
Contributions are welcome! Read the [contributing docs](https://coder.com/docs/v2/latest/CONTRIBUTING) to get started.
Find our list of contributors [here](https://github.com/coder/coder/graphs/contributors).
## Related
## Integrations
We are always working on new integrations. Feel free to open an issue to request an integration. Contributions are welcome in any official or community repositories.
@@ -116,10 +107,22 @@ We are always working on new integrations. Feel free to open an issue to request
- [**VS Code Extension**](https://marketplace.visualstudio.com/items?itemName=coder.coder-remote): Open any Coder workspace in VS Code with a single click
- [**JetBrains Gateway Extension**](https://plugins.jetbrains.com/plugin/19620-coder): Open any Coder workspace in JetBrains Gateway with a single click
- [**Dev Container Builder**](https://github.com/coder/envbuilder): Build development environments using `devcontainer.json` on Docker, Kubernetes, and OpenShift
- [**Module Registry**](https://registry.coder.com): Extend development environments with common use-cases
- [**Kubernetes Log Stream**](https://github.com/coder/coder-logstream-kube): Stream Kubernetes Pod events to the Coder startup logs
- [**Self-Hosted VS Code Extension Marketplace**](https://github.com/coder/code-marketplace): A private extension marketplace that works in restricted or airgapped networks integrating with [code-server](https://github.com/coder/code-server).
### Community
- [**Provision Coder with Terraform**](https://github.com/ElliotG/coder-oss-tf): Provision Coder on Google GKE, Azure AKS, AWS EKS, DigitalOcean DOKS, IBMCloud K8s, OVHCloud K8s, and Scaleway K8s Kapsule with Terraform
- [**Coder GitHub Action**](https://github.com/marketplace/actions/update-coder-template): A GitHub Action that updates Coder templates
- [**Various Templates**](./examples/templates/community-templates.md): Hetzner Cloud, Docker in Docker, and other templates the community has built.
- [**Coder Template GitHub Action**](https://github.com/marketplace/actions/update-coder-template): A GitHub Action that updates Coder templates
## Contributing
We are always happy to see new contributors to Coder. If you are new to the Coder codebase, we have
[a guide on how to get started](https://coder.com/docs/v2/latest/CONTRIBUTING). We'd love to see your
contributions!
## Hiring
Apply [here](https://cdr.co/github-apply) if you're interested in joining our team.