Every time I run `pnpm` in the project it adds the package manager attribute on package.json so I just decided to push it since it does not look like an issue and we can make sure everyone is running the same pnpm version.
fixes#11950https://github.com/coder/coder/issues/11950#issuecomment-1987756088 explains the bug
We were also calling into `Unlisten()` and `Close()` while holding the mutex. I don't believe that `Close()` depends on the notification loop being unblocked, but it's hard to be sure, and the safest thing to do is assume it could block.
So, I added a unit test that fakes out `pq.Listener` and sends a bunch of notifies every time we call into it to hopefully prevent regression where we hold the mutex while calling into these functions.
It also removes the use of a `context.Context` to stop the PubSub -- it must be explicitly `Closed()`. This simplifies a bunch of the logic, and is how we use the pubsub anyway.
Allow `coder login` to log into existing deployment if available.
Update help and error messages to indicate that `coder login` is
available as a command.
Fixes#10925Fixes#9551
* coderd: add test to reproduce trailing directory issue
* coderd: add trailing path separator to dir entries when converting to zip
* provisionersdk: add trailing path separator to directory entries
* chore: rename useTab to useSearchParamsKey and add test
* chore: mark old renderHookWithAuth as deprecated (temp)
* fix: update imports for useResourcesNav
* refactor: change API for useSearchParamsKey
* chore: let user pass in their own URLSearchParams value
* refactor: clean up comments for clarity
* fix: update import
* wip: commit progress on useWorkspaceDuplication revamp
* chore: migrate duplication test to new helper
* refactor: update code for clarity
* refactor: reorder test cases for clarity
* refactor: split off hook helper into separate file
* refactor: remove reliance on internal React Router state property
* refactor: move variables around for more clarity
* refactor: more updates for clarity
* refactor: reorganize test cases for clarity
* refactor: clean up test cases for useWorkspaceDupe
* refactor: clean up test cases for useWorkspaceDupe
* fix: do not set max deadline for workspaces on template update
When templates are updated and schedule data is changed, we update all
running workspaces to have up-to-date scheduling information that sticks
to the new policy.
When updating the max_deadline for existing running workspaces, if the
max_deadline was before now()+2h we would set the max_deadline to
now()+2h.
Builds that don't/shouldn't have a max_deadline have it set to 0, which
is always before now()+2h, and thus would always have the max_deadline
updated.
* test: add unit test to excercise template schedule bug
---------
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
I noticed in my logs that sometimes `coder ssh` doesn't gracefully disconnect from the coordinator.
The cause is the `closerStack` construct we use in that function. It has two paths to start closing things down:
1. explicit `close()` which we do in `defer`
2. context cancellation, which happens if the cli function returns an error
sometimes the ssh remote command returns an error, and this triggers context cancellation of the `closerStack`. That is fine in and of itself, but we still want the explicit `close()` to wait until everything is closed before returning, since that's where we do cleanup, including the graceful disconnect. Prior to this fix the `close()` just immediately exits if another goroutine is closing the stack. Here we add a wait until everything is done.
* fix: ensure auto-workspace creation waits until all parameters are ready
* refactor: move creation blocking logic to main callback
* fix: let creation start if experimental feature is off
Adds a `--debug` flag to `scripts/develop.sh` that will start coder under `dlv debug` instead.
You can then use e.g. the following launch snippet to connect dlv:
```
{
"name": "Delve Remote",
"type": "go",
"request": "attach",
"mode": "remote",
"port": 12345,
}
```
You can also run invididual CLI commands under dlv e.g.
```
debug=1 scripts/coder-dev.sh list
```
Also sets CGO_ENABLED=0 in develop.sh by default.
- Reworks the proxy registration loop into a struct (so I can add a `RegisterNow` method)
- Changes the proxy registration loop interval to 15s (previously 30s)
- Adds test which tests bidirectional DERP meshing on all possible paths between 6 workspace proxy replicas
Related to https://github.com/coder/customers/issues/438
This fixes a vulnerability with the `CODER_OIDC_EMAIL_DOMAIN` option,
where users with a superset of the allowed email domain would be allowed
to login. For example, given `CODER_OIDC_EMAIL_DOMAIN=google.com`, a
user would be permitted entry if their email domain was
`colin-google.com`.
Part of #12163
- Adds a command coder support bundle <workspace> that generates a
support bundle and writes it to coder-support-$(date +%s).zip.
- Note: this is hidden currently until the rest of the functionality is fleshed out.
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
- Adds more testcases to TestAcquirer_MatchTags
- Adds functionality to generate a table from above test
- Update provisioner tag documentation with generated table
- Apply other feedback from #12315
DERP mesh key setup would do a SELECT and then an INSERT on failure, without a lock. During some testing with multiple replicas, I managed to cause a replica to crash due to them initializing simultaneously.
Fixes:
Encountered an error running "coder server"
create coder API: insert mesh key: pq: duplicate key value violates unique constraint "site_configs_key_key"
Co-authored-by: Cian Johnston <cian@coder.com>
* refactor: clean up and update API for useClipboard
* wip: commit current progress on useClipboard test
* docs: clean up wording on showCopySuccess
* chore: make sure tests can differentiate between HTTP/HTTPS
* chore: add test ID to dummy input
* wip: commit progress on useClipboard test
* wip: commit more test progress
* refactor: rewrite code for clarity
* chore: finish clipboard tests
* fix: prevent double-firing for button click aliases
* refactor: clean up test setup
* fix: rename incorrect test file
* refactor: update code to display user errors
* refactor: redesign useClipboard to be easier to test
* refactor: clean up GlobalSnackbar
* feat: add functionality for notifying user of errors (with tests)
* refactor: clean up test code
* refactor: centralize cleanup steps
Beginnings of a solution to #12297
Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.
For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.
```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
preferred DERP 10013 (Europe Fly.io (Paris))
endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```
* fix(coderd): mark provisioner daemon psk as secret
Marks provisioner daemon PSK with the secret annotation.
This ensures it will be scrubbed from API requests to
/api/v2/deployment/config.
* make gen
When viewing the Authentication page, the diagram showing the flow is a useful
resource for understanding the rest of the page.
Rather than linking to a specific version of the SVG, inline it as part of the
documentation.
* chore: drop github per user rate limit tracking
Rate limits for authenticated requests are per user.
This would be an excessive number of prometheus labels,
so we only track the unauthorized limit.
Moves healthcheck report-related structs from coderd/healthcheck to codersdk
This prevents an import cycle when adding a codersdk.Client method to hit /api/v2/debug/health.
Changes the agent to use the new v2 API for sending logs, via the logSender component.
We keep the PatchLogs function around, but deprecate it so that we can test the v1 endpoint.
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.
This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
Alternative solution to #6442
Modifies the behaviour of AcquireProvisionerJob and adds a special case for 'un-tagged' jobs such that they can only be picked up by 'un-tagged' provisioners.
Also adds comprehensive test coverage for AcquireJob given various combinations of tags.
Fixes race seen here: https://github.com/coder/coder/runs/21852483781
What happens is that the agent connects, completes the test, and then disconnects before the Eventually condition runs. The waiter then times out because it's looking for a connected agent.
Then, since it's a `require` in a goroutine, that causes the `tGo` cleanup to hang and the whole test suite to timeout after 10 minutes.
Anyway, `agenttest.New` doesn't block, and we don't actually need to wait for the agent to connect, since a successful SSH session is evidence that it connected.
* feat: convertGroups() no longer requires organization info
Removing role information from some users in the api. This info is
excessive and not required. It is costly to always include
* fix: ignore surronding whitespace for cli config
Cli config files break if you edit them manually with any editor.
Editors drop a newline at the end, and we not break on this.
If a developer manually edits a file, it should still work
* fix: move oauth2 routes
From /login/oauth2/* to /oauth2/*.
/login/oauth2 causes /login to no longer get served by the frontend,
even if nothing is actually served on /login itself.
* Add forgotten comment on delete
* feat: disable directory listings for static files
Static file server handles serving static asset files (js, css, etc).
The default file server would also list all files in a directory.
This has been changed to only serve files.
* chore: add database test fixture to insert non-unique linked_ids
* chore: create unit test to exercise failed email change bug
* fix: add postgres triggers to keep user_links clear of deleted users
* Add migrations to prevent deleted users with links
* Force soft delete of users, do not allow un-delete
* refactor: clean up tests for debounce
* refactor: clean up tests for useCustomEvent
* refactor: clean up events file
* refactor: clean up tests for hookPolyfills
- These CSS changes were for making sure there weren't layout shifts
when using the non-secure clipboard fallback, which could cause janky
UI flickers. It seems to be breaking things for some users on HTTP-only
connections, though.
This PR removes the prometheus-http port entirely from the coder service specification (originally added in #10448). It also removes the Helm value coder.service.prometheusNodePort.
Rationale: some cloud providers will helpfully expose all ports on a LoadBalancer service for you. The net effect of this is that setting CODER_PROMETHEUS_ENABLE will end up exposing port 2112 on your coderd service to the internet, which is likely undesired behaviour.
The agent is extended with a `--script-data-dir` flag, defaulting to the
OS temp dir. This dir is used for storing `coder-script-data/bin` and
`coder-script/[script uuid]`. The former is a place for all scripts to
place executable binaries that will be available by other scripts, SSH
sessions, etc. The latter is a place for the script to store files.
Since we default to OS temp dir, files are ephemeral by default. In the
future, we may consider adding new env vars or changing the default
storage location. Workspace startup speed could potentially benefit from
scripts being able to skip steps that require downloading software. We
may also extend this with more env variables (e.g. persistent storage in
HOME).
Fixes#11131
This commit refactors where custom environment variables are set in the
workspace and decouples agent specific configs from the `agentssh.Server`.
To reproduce all functionality, `agentssh.Config` is introduced.
The custom environment variables are now configured in `agent/agent.go`
and the agent retains control of the final state. This will allow for
easier extension in the future and keep other modules decoupled.
* fix: assign new oauth users to default org
This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
- prevent importing from the "monolith" lodash module. individual modules are better for tree shaking.
- prevent importing `useTheme` and types from @mui/material/styles. prefer importing from @emotion/react.
* fix: assign new oauth users to default org
This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
The first organization created is now marked as "default". This is
to allow "single org" behavior as we move to a multi org codebase.
It is intentional that the user cannot change the default org at this
stage. Only 1 default org can exist, and it is always the first org.
Closes: https://github.com/coder/coder/issues/11961
Adds a new subcomponent of the agent for queueing up logs until they can be sent over the Agent API.
Subsequent PR will change the agent to use this instead of the HTTP API for posting logs.
Relates to #10534
Fixes#12141Fixes#11750
PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.
This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
I noticed a possible race where tailnet.Conn can try to dial the embedded region before we've set our custom dialer that send the DERP in-memory. This closes that race and adds a test case for servertailnet with no STUN and an embedded relay
I think this will resolve#12136 but lets get a proper test at the system level before closing.
Before this change, we only register the node callback at start of day for the server tailnet. If the coordinator changes, like we know happens when we are licensed for the PGCoordinator, we close the connection to the old coord, and open a new one to the new coord.
The callback is designed to direct the updates to the new coordinator, but there is nothing that specifically triggers it to fire after we connect to the new coordinator.
If we have STUN, then period re-STUNs will generally get it to fire eventually, but without STUN it we could go indefinitely without a callback.
This PR changes the servertailnet to re-register the callback each time we reconnect to the coordinator. Registering a callback (even if it's the same callback) triggers an immediate call with our node information, so the new coordinator will have it.
Adds some debug endpoints for looking into the DERP server.
The `api/v2/debug/derp/traffic` endpoint requires the `ss` utility to be present in order to function. I have *not* added the `iproute2` package to our base image as it adds 11MB, so this endpoint won't be useful by default. However, in a debugging situation, we could exec into the container and then `apk add iproute2`, or build a special debug image.
The `api/v2/debug/expvar` handler contains DERP metrics as well as commandline and memstats.
Example:
```
{
"alert_failed": 0,
"alert_generated": 0,
"cmdline": ["/Users/spike/repos/coder/build/coder_darwin_arm64","--global-config","/Users/spike/repos/coder/.coderv2","server","--http-address","0.0.0.0:3000","--swagger-enable","--access-url","http://127.0.0.1:3000","--dangerous-allow-cors-requests=true"],
"derp": {"accepts": 1, "average_queue_duration_ms": 0, "bytes_received": 0, "bytes_sent": 0, "counter_packets_dropped_reason": {"gone_disconnected": 0, "gone_not_here": 0, "queue_head": 0, "queue_tail": 0, "unknown_dest": 0, "unknown_dest_on_fwd": 0, "write_error": 0}, "counter_packets_dropped_type": {"disco": 0, "other": 0}, "counter_packets_received_kind": {"disco": 0, "other": 0}, "counter_tcp_rtt": {}, "counter_total_dup_client_conns": 0, "gauge_clients_local": 1, "gauge_clients_remote": 0, "gauge_clients_total": 1, "gauge_current_connections": 1, "gauge_current_dup_client_conns": 0, "gauge_current_dup_client_keys": 0, "gauge_current_file_descriptors": 0, "gauge_current_home_connections": 1, "gauge_memstats_sys0": 20874504, "gauge_watchers": 0, "got_ping": 0, "home_moves_in": 0, "home_moves_out": 0, "multiforwarder_created": 0, "multiforwarder_deleted": 0, "packet_forwarder_delete_other_value": 0, "packets_dropped": 0, "packets_forwarded_in": 0, "packets_forwarded_out": 0, "packets_received": 0, "packets_sent": 0, "peer_gone_disconnected_frames": 0, "peer_gone_not_here_frames": 0, "sent_pong": 0, "unknown_frames": 0, "version": "1.47.0-dev20240214-t64db8c604"},
"memstats": {"Alloc":286506256,"TotalAlloc":297594632,"Sys":310621512,"Lookups":0,"Mallocs":304204,"Frees":171570,"HeapAlloc":286506256,"HeapSys":294060032,"HeapIdle":3694592,"HeapInuse":290365440,"HeapReleased":3620864,"HeapObjects":132634,"StackInuse":3735552,"StackSys":3735552,"MSpanInuse":347256,"MSpanSys":358512,"MCacheInuse":9600,"MCacheSys":15600,"BuckHashSys":1469877,"GCSys":9434896,"OtherSys":1547043,"NextGC":551867656,"LastGC":1707892877408883000,"PauseTotalNs":1247000,"PauseNs":[200333,229375,239875,209542,106958,203792,57125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"PauseEnd":[1707892876217481000,1707892876219726000,1707892876222273000,1707892876226151000,1707892876234815000,1707892877398146000,1707892877408883000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"NumGC":7,"NumForcedGC":0,"GCCPUFraction":0.0022425810335762954,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":14396,"Frees":9143},{"Size":16,"Mallocs":89090,"Frees":50507},{"Size":24,"Mallocs":40839,"Frees":24456},{"Size":32,"Mallocs":22404,"Frees":12379},{"Size":48,"Mallocs":51174,"Frees":23718},{"Size":64,"Mallocs":15406,"Frees":3501},{"Size":80,"Mallocs":6688,"Frees":2352},{"Size":96,"Mallocs":2567,"Frees":374},{"Size":112,"Mallocs":19371,"Frees":16883},{"Size":128,"Mallocs":2873,"Frees":1061},{"Size":144,"Mallocs":5600,"Frees":2742},{"Size":160,"Mallocs":2159,"Frees":622},{"Size":176,"Mallocs":454,"Frees":86},{"Size":192,"Mallocs":227,"Frees":128},{"Size":208,"Mallocs":1407,"Frees":732},{"Size":224,"Mallocs":1365,"Frees":1090},{"Size":240,"Mallocs":82,"Frees":48},{"Size":256,"Mallocs":310,"Frees":162},{"Size":288,"Mallocs":1945,"Frees":562},{"Size":320,"Mallocs":1200,"Frees":458},{"Size":352,"Mallocs":133,"Frees":33},{"Size":384,"Mallocs":582,"Frees":51},{"Size":416,"Mallocs":747,"Frees":200},{"Size":448,"Mallocs":113,"Frees":22},{"Size":480,"Mallocs":34,"Frees":21},{"Size":512,"Mallocs":951,"Frees":91},{"Size":576,"Mallocs":364,"Frees":122},{"Size":640,"Mallocs":532,"Frees":270},{"Size":704,"Mallocs":93,"Frees":39},{"Size":768,"Mallocs":83,"Frees":35},{"Size":896,"Mallocs":308,"Frees":175},{"Size":1024,"Mallocs":226,"Frees":122},{"Size":1152,"Mallocs":198,"Frees":100},{"Size":1280,"Mallocs":314,"Frees":171},{"Size":1408,"Mallocs":77,"Frees":47},{"Size":1536,"Mallocs":80,"Frees":54},{"Size":1792,"Mallocs":199,"Frees":107},{"Size":2048,"Mallocs":112,"Frees":48},{"Size":2304,"Mallocs":71,"Frees":32},{"Size":2688,"Mallocs":206,"Frees":81},{"Size":3072,"Mallocs":39,"Frees":15},{"Size":3200,"Mallocs":16,"Frees":7},{"Size":3456,"Mallocs":44,"Frees":29},{"Size":4096,"Mallocs":192,"Frees":83},{"Size":4864,"Mallocs":44,"Frees":25},{"Size":5376,"Mallocs":105,"Frees":43},{"Size":6144,"Mallocs":25,"Frees":5},{"Size":6528,"Mallocs":22,"Frees":7},{"Size":6784,"Mallocs":3,"Frees":0},{"Size":6912,"Mallocs":4,"Frees":2},{"Size":8192,"Mallocs":59,"Frees":10},{"Size":9472,"Mallocs":31,"Frees":12},{"Size":9728,"Mallocs":5,"Frees":2},{"Size":10240,"Mallocs":5,"Frees":0},{"Size":10880,"Mallocs":27,"Frees":11},{"Size":12288,"Mallocs":4,"Frees":1},{"Size":13568,"Mallocs":4,"Frees":2},{"Size":14336,"Mallocs":9,"Frees":2},{"Size":16384,"Mallocs":10,"Frees":2},{"Size":18432,"Mallocs":4,"Frees":2}]},
"warning_failed": 0,
"warning_generated": 0
}
```
If we find the DERP metrics useful we could consider how to include them in Prometheus scrapes based on the tailnet `varz` package. That's for a later PR if at all.
Adds documentation on port requirements and a short overview of STUN with some example scenarios.
Co-authored-by: Dean Sheather <dean@deansheather.com>
Co-authored-by: Spike Curtis <spike@coder.com>
When we exceed the db-imposed limit of logs, we need to communicate that back to the agent. In v1 we did it with a 4xx-level HTTP status, but with dRPC, the errors are delivered as strings, which feels fragile to me for something we want to gracefully handle.
So, this PR adds the log limit exceeded as a field on the response message, and fixes the API handler to set it as appropriate instead of an error.
* feat(provisioner): relax max terraform version constraint
* feat!(scripts/Dockerfile.base): update bundled terraform to 1.6.x
* bump terraform version in Dogfood image
* fix over-zealous rename
Fixes#12030
This is a good example of the kind of thing I'd like to address with a time-testing lib. The problem is that there is a race between the watchdog starting it's timer and the test incrementing the time. What would make this easier is if the time-testing library could wait for and assert the call to start the timer before incrementing the time.
adds a watchdog to our pubsub and runs it for Coder server.
If the watchdog times out, it triggers a graceful exit in `coder server` to give any provisioner jobs a chance to shut down.
c.f. #11950
I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one.
What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.
So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.
* docs: update remote docker host docs
Adds a link to external provisioners as a method to use remote docker hosts
* `make fmt`
* Update docker.md
* fmt
Annoyingly, prometheus Registry collects metrics async, which is causing our test to be racy. They also don't export enough from the Metric interface for us to replicate a synchronous collect, so we have to use Eventually to test.
Adds prometheus metrics to PGPubsub for monitoring its health and performance in production.
Related to #11950 --- additional diagnostics to help figure out what's happening
Since we run yamux over the websocket, we don't need to ping at the websocket layer because yamux has a 30 second keepalive mechanism enabled in the default config.
The RPC() function isn't called, since Listen() was modified to do this job.
Listen() has the right signature, since it returns a drpc.Conn, rather than the Agent API. That's because tailnet v2 and agent v2 are separate APIs served over the same connection.
It might be clearer to rename `Listen()` to `RPC()` but I'll save that for a different PR.
Adds a new statsReporter subcomponent of the agent, which in a later PR will be used to report stats over the v2 API.
Refactors the logic a bit so that we can handle starting and stopping stats reporting if the agent API connection drops and reconnects.
Moves monitoring of the agent v2 API connection to the yamux layer.
Present behavior monitors this at the websocket layer, and closes the websocket on completion. This can cause yamux to hit unexpected errors since the connection is closed underneath it.
This might be the cause of yamux errors that some customers are seeing

In any case, it's more graceful to close yamux first and let yamux close the underlying websocket. That should limit yamux error logging to truly unexpected/error cases.
The only downside is that the yamux `Close()` doesn't accept a reason, so if the agent becomes outdated and we close the API connection, the agent just sees the connection close without a reason. I'm not sure we log this at the agent anyway, but it would be nice. I think more accurate logging on Coderd are more important.
I've also added some logging when the monitor disconnects for reasons other than the context being canceled (e.g. agent outdated, failed pings).
* wip: commit progress for clipboard update
* wip: push more progress
* chore: finish initial version of useClipboard revamp
* refactor: update API query to use newer RQ patterns
* fix: update importers of useClipboard
* fix: increase clickable area of CodeExample
* fix: update styles for CliAuthPageView
* fix: resolve issue with ref re-routing
* docs: update comments for clarity
* wip: commit progress on clipboard tests
* chore: add extra test case for referential stability
* wip: disable test stub to avoid breaking CI
* wip: add test case for tab-switching
* feat: finish changes
* fix: improve styling for strong text
* fix: make sure period doesn't break onto separate line
* fix: make center styling more friendly to screen readers
* refactor: clean up mocking implementation
* fix: resolve security concern for clipboard text
* fix: update CodeExample to obscure text when appropriate
* fix: apply secret changes to relevant code examples
* refactor: simplify code for obfuscating text
* fix: partially revert clipboard changes
* fix: clean up page styling further
* fix: remove duplicate property identifier
* refactor: rename variables for clarity
* fix: simplify/revert CopyButton component design
* fix: update how dummy input is hidden from page
* fix: remove unused onClick handler prop
* fix: resolve unused import
* fix: opt code examples out of secret behavior
* fix: strip timezone information from a date in dau response
Timezone information is lost, so do not forward it to the client.
* fix: timezone offset should be flipped
* Make tests deterministic
This PR solves #10478 by auto-filling previously used template values in create and update workspace flows.
I decided against explicit user values in settings for these reasons:
* Autofill is far easier to implement
* Users benefit from autofill _by default_ — we don't need to teach them new concepts
* If we decide that autofill creates more harm than good, we can remove it without breaking compatibility
Fixes an issue where a MultiAgentConn isn't closed properly when the coordinator it is connected to is closed.
Since servertailnet checks whether the conn is closed before reinitializing, it is important that we check this, otherwise servertailnet can get stuck if the coordinator closes (e.g. when we switch from AGPL to PGCoordinator after decoding a license).
We're failing tests on error logs like this: https://github.com/coder/coder/actions/runs/7706053882/job/21000984583
Unfortunately, the error we hit, when the underlying connection is closed, is unexported, so we can't specifically ignore it.
Part of the issue is that agent.Close() doesn't wait for these goroutines to complete before returning, so the test harness proceeds to close the connection. This looks to our product code like the network connection failing. It would be possible to fix this, but just doesn't seem worth it for the extra insurance of catching other error logs in these tests.
Adds logging to yamux when used for tailnet client connections, e.g. CLI and wsproxy. This could be useful for debugging connection issues with tailnet v2 API.
`agentsdk` depends on `agent/proto` because it needs to get the version to dial.
Therefore, the conversion routines need to live in `agentsdk` so that we can convert to and from the Manifest.
I briefly considered refactoring the agent to only reference `proto.Manifest`, but decided against it because we might have multiple protocol versions in the future, its useful to have a protocol-independent data structure.
Fixes#8218
Removes `wsconncache` and related "is legacy?" functions and API calls that were used by it.
The only leftover is that Agents still use the legacy IP, so that back level clients or workspace proxies can dial them correctly.
We should eventually remove this: #11819
This PR updates the Agent API to use the appearance.Fetcher, which is set by entitlement code in Enterprise coderd.
This brings the agentapi into compliance with the Enterprise feature.
The new Agent API needs an interface for ServiceBanners, so this PR creates it and refactors the AGPL and Enterprise code to achieve it.
Before we depended on the fact that the HTTP endpoint was missing to serve an empty ServiceBanner on AGPL deployments, but that won't work with dRPC, so we need a real interface to call.
The original test is bugged in that it
1. creates a new AGPL coderd with a new database, so no appearance is set in the DB.
2. overwrites the agentClient so the assertion after removing the license is against the AGPL coderd
fixes#10533
refactors `codersdk` workspace agent dialer to use a single websocket connection to the tailnet v2 API for both coordination and DERPMap updates, rather than separate websockets (and the v1 API for DERPMaps).
#7439 added global caching of RBAC results.
Calls are cached based on hash(subject, object, action).
We often use dbauthz.AsSystemRestricted to handle "internal" authz calls, and these are often repeated with similar arguments and are likely to get cached.
So a transient error doing an authz check on a system function will be cached for up to a minute.
I'm just starting off with excluding context.Canceled but there's likely a whole suite of different errors we want to also exclude from the global cache.
Ok, so my last attempt at a fix here failed
https://github.com/coder/coder/actions/runs/7666229961/job/20893608286
I have a new theory: it's not the `terraform` binary that's busy, it's actually `fake_cancel.sh` and it gets marked busy when we `exec` it from the script we write.
Use of `exec` also replaces the executing code in place, rather than starting a new process/shell, so that's why the error we get says `terraform` is busy.
* docs: use coder modules in offline deployments
* fix typos
* Update offline installation instructions with Artifactory support for Coder modules
* Review suggestions
- Adds column `favorite` to workspaces table
- Adds API endpoints to favorite/unfavorite workspaces
- Modifies sorting order to return owners' favorite workspaces first
Fixes 2 related issues:
1. wsconncache had incorrect logic to test whether to send DERPMap updates, sending if the maps were equivalent, instead of if they were _not equivalent_.
2. configmaps used a bugged check to test equality between DERPMaps, since it contains a map and the map entries are serialized in random order. Instead, we avoid comparing the protobufs and instead depend on the existing function that compares `tailcfg.DERPMap`. This also has the effect of reducing the number of times we convert to and from protobuf.
fixes#10531
Adds a check for `version` on connection to the Agent API websocket endpoint. This is primarily for future-proofing, so that up-level agents get a sensible error if they connect to a back-level Coderd.
It also refactors the location of the `CurrentVersion` variables, to be part of the `proto` packages, since the versions refer to the APIs defined therein.
Adds support to `ServerTailnet` to set all peers lost before attempting to reconnect to the coordinator. In practice, this only really affects `wsproxy` since coderd has a local connection to the coordinator that only goes down if we're shutting down or change licenses.
These will show up when configuring the application along with the
client ID and everything else. Should make it easier to configure the
application, otherwise you will have to go look up the URLs in the
docs (which are not yet written).
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Rather than passing all the deployment values. This is to make it
easier to generate API keys as part of the oauth flow.
I also added and fixed a test for when the lifetime is set and the
default and expiration are unset.
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Use TSMP ping for reachability, but leave Disco ping for when we call Ping() since we often use that to determine whether we have a direct connection.
Also adds unit tests to make sure Ping() returns direct connection vs DERP correctly.
Adds support to Coordination to call SetAllPeersLost() when it is closed. This ensure that when we disconnect from a Coordinator, we set all peers lost.
This covers CoderSDK (CLI client) and Agent. Next PR will cover MultiAgent (notably, `wsproxy`).
We don't have visibility into some feature usage, so this adds a lot of fields missing from `database.Template` to `telemetry.Template`. Deprecation message is not collected, just whether it's set or not.
adds setAllPeersLost to the configMaps subcomponent of tailnet.Conn --- we'll call this when we disconnect from a coordinator so we'll eventually clean up peers if they disconnect while we are retrying the coordinator connection (or we don't succeed in reconnecting to the coordinator).
This one is huge, and I'm sorry.
The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.
There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
Otherwise if for example you try to run `yarn storybook` it complains
that the version of Node is wrong.
`pnpm storybook` works fine and that is probably what we should
actually use, but as long as we are installing Yarn and not restricting
its use we might as well make it use the right version of Node.
* fix: doing a noop patch to templates resulted in 404
The patch response did not include the template. The UI required the
template to be returned to form the new page path
null is more explicit, and harder to make occur by mistake.
* fix: allow ports in wildcard url configuration
This just forwards the port to the ui that generates urls.
Our existing parsing + regex already supported ports for
subdomain app requests.
Fixes flake seen here, I think
https://github.com/coder/coder/actions/runs/7565915337/job/20602500818
golang's file processing is complex, and in at least some cases it can return from a file.Close() call without having actually closed the file descriptor.
If we're holding open the file descriptor of an executable we just wrote, and try to execute it, it will fail with "text file busy" which is what we have seen.
So, to be extra sure, I've avoided the standard library and directly called the syscalls to open, write, and close the file we intend to use in the test.
I've also added some more logging so if it's some issue of multiple tests writing to the same location, the we might have a chance to see it.
wsproxy also needs to be updated to use tailnet v2 because the `tailnet.Conn` stores peers by ID, and the peerID was not being carried by the JSON protocol. This adds a query param to the endpoint to conditionally switch to the new protocol.
Fixes a flake seen here: https://github.com/coder/coder/actions/runs/7541558190/job/20528545916
```
=== FAIL: enterprise/provisionerd TestRemoteConnector_Fuzz (0.06s)
t.go:84: 2024-01-16 12:32:27.024 [info] connector: failed provisioner authentication remote_addr=[::1]:45138 ...
error= failed to receive jobID:
github.com/coder/coder/v2/enterprise/provisionerd.(*remoteConnector).authenticate
/home/runner/actions-runner/_work/coder/coder/enterprise/provisionerd/remoteprovisioners.go:438
- bufio.Scanner: token too long
t.go:84: 2024-01-16 12:32:27.024 [debu] connector: closed connection remote_addr=[::1]:45138 error=<nil>
remoteprovisioners_test.go:209:
Error Trace: /home/runner/actions-runner/_work/coder/coder/enterprise/provisionerd/remoteprovisioners_test.go:209
Error: "2992256" is not less than "2097152"
Test: TestRemoteConnector_Fuzz
Messages: should not allow more than 1 MiB
```
This was an attempt to test that malicious actors can't abuse our authentication protocol to make us allocate a bunch of memory.
However, the test asserted on the number of bytes sent by the fuzzer, not the number of bytes read (& allocated) by the service. The former is affected by network queue sizes and is thus flaky without actively managing the socket queues, which I don't think we want to do.
In actual practise, the thing that matters is how much memory the bufio Scanner allocates. By inspection, the scanner will allocate up to 64k, and testing this is true devolves into testing the go standard library, which I don't think is worth doing.
So... let's just drop the assertion because
a) its flaky,
b) it doesn't test what we actually want to test,
c) the behavior we actually care about is part of the standard library.
- Adds a new query BatchUpdateLastUsedAt
- Adds calls to BatchUpdateLastUsedAt in app stats handler upon flush
- Passes a stats flush channel to apptest setup scaffolding and updates unit tests to assert modifications to LastUsedAt.
Cli errors are pretty formatted. This handles nested pretty types. Before it found the first error it could understand and return that. Now it will print the full error stack with more information.
To prevent information loss, a "[Trace=...]" was added to capture some extra error context for debugging.
- refactors`getFormHelpers` to accept an options object
- adds a `maxLength` option which will display a message and character counter for fields with length limits
- set `maxLength` option for template description fields
Merging since Mark is out.
* chore: add optional coder_app to faq
* applied Atif's suggestions
* make fmt again
---------
Co-authored-by: kirby <kirby@coder.com>
Co-authored-by: Stephen Kirby <58410745+stirby@users.noreply.github.com>
The `SingleTailnet` behavior only checked to see if the `MultiAgent` was
closed, but the websocket error was not being propogated into the
`MultiAgent`, causing it to never be swapped for a new working one.
Fixes https://github.com/coder/coder/issues/11401
Before:
```
Coder Workspace Proxy v0.0.0-devel+85ff030 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001
View the Web UI: http://127.0.0.1:3001
==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:11:56.376 [warn] net.workspace-proxy.servertailnet: broadcast server node to agents ...
error= write message:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*remoteMultiAgentHandler).writeJSON
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:524
- failed to write msg: WebSocket closed: failed to read frame header: EOF
```
After:
```
Coder Workspace Proxy v0.0.0-devel+12f1878 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001
View the Web UI: http://127.0.0.1:3001
==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:26:38.545 [warn] net.workspace-proxy.servertailnet: multiagent closed, reinitializing
2024-01-04 20:26:38.546 [erro] net.workspace-proxy.servertailnet: reinit multi agent ...
error= dial coordinate websocket:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
- failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:38.587 [erro] net.workspace-proxy.servertailnet: reinit multi agent ...
error= dial coordinate websocket:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
- failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refusedhandshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:40.446 [info] net.workspace-proxy.servertailnet: successfully reinitialized multiagent agents=0 took=1.900892615s
```
It causes the sign-in page to reload whenever a user enters a page or changes the window's focus. This is happening because when the "user" fetch is made, the server returns an error, making the react-query mark the data as stale and try to load it whenever possible.
* Shows the overall report error at the top of the page, if present.
* Shows workspaceproxy errors above warnings inside the corresponding element, if present.
* Improves unregistered proxy status
Adds a nodeUpdater component, which serves a similar role to configMaps, but tracks information from tailscale going out to the coordinator as node updates. This first PR just handles netInfo, subsequent PRs will
handle DERP forced websockets, endpoints, and addresses.
* docs: remove cloud logos from 1-click install
They were looking good and are not adding much value.
* Delete docs/images/install/render.png
* Delete docs/images/install/ec2.svg
* Delete docs/images/install/eks.svg
* Delete docs/images/install/fly.io.svg
* Delete docs/images/install/gce.svg
* Delete docs/images/install/heroku.svg
* Delete docs/images/install/railway.svg
This test case fails with an error log, showing "context canceled" when trying to send an acquired job to an in-mem provisionerd.
https://github.com/coder/coder/runs/20331469006
In this case, we don't want to supress this error, since it could mean that we acquired a job, locked it in the database, then failed to send it to a provisioner.
(We also don't want to mark the job as failed because we don't know whether the job made it to the provisionerd or not --- in the failed test you can see that the job is actually processed just fine).
The reason we got context canceled is because the API was shutting down --- we don't want provisionerdserver to abruptly stop processing job stuff as the API shuts down as this will leave jobs in a bad state. This PR fixes up the use of contexts with provisionerdserver and the associated drpc service calls.
Work in progress on a subcomponent of the Conn which will handle configuring the wireguard engine on changes. I've implemented setAddresses as the simplest case and added unit tests of the reconfiguration loop.
Besides making the code easier to test and understand, the goal is for this component to handle disconnect and loss updates about peers, and thereby, implement the v2 Tailnet API.
Further PRs will handle peer updates, status updates, and net info updates.
Then, after the subcomponent is implemented and tested, I will refactor Conn to use it instead of the current monolithic architecture.
It looks like we updated mockgen to use Uber's fork, but the flake still
pointed to a nixos-unstable commit containing the old mockgen resulting
in an error like:
missing go.sum entry for module providing package github.com/golang/mock/mockgen/model
Fixes#11451
A refactor of the Agent API passes metrics as protobufs, which include pointers to label name/value pairs. The aggregator tested for sameness by doing a shallow compare of label values, which for different stats reports would compare unequal because the pointers would be different.
This fix does a deep compare.
While testing I also noted that we neglect to compare template names. This is unlikely to have caused any issue in practice, since the combination of username/workspace is unique, but in the context of comparing metric labels we should do the comparison.
If a user creates a workspace, deletes it, then recreates from a different template, we could in principle have reported incorrect stats for the old template.
Part of #10676
- Adds a health section for provisioner daemons (mostly cannibalized from the Workspace Proxy section)
- Adds a corresponding storybook entry for provisioner daemons health section
- Fixed an issue where dismissing the provisioner daemons warnings would result in a 500 error
- Adds provisioner daemon error codes to docs
* assert provisioner daemon version and api_version in unit tests
* add build info in HTTP header, extract codersdk.BuildVersionHeader
* add api_version to codersdk.ProvisionerDaemon
* testutil.MustString -> testutil.MustRandString
Refactors the code that handles monitoring an agent websocket with pings and updating the connection times in the DB.
Consolidates v1 and v2 agent APIs under the same code for this.
One substantive change (not _just_ a refactor) is that I've made it so that we actually disconnect if the agent fails to respond to our pings, rather than the old behavior where we would update the database, but not actually tear down the websocket.
We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998
I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.
The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.
I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.
This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.
fixes: #11294
Refactors our DRPC service definitions slightly.
In the previous version, I inserted the RPCs from the tailnet proto directly into the Agent service. This makes things hard to deal with because DRPC then generates a new set of methods with new interfaces with the `DRPCAgent_` prefixed. Since you can't have a single method that takes different argument types, we couldn't reuse the implementation of those RFCs without a lot of extra classes and pass-thru methods.
Instead, the "right" way to do it is to integrate at the DRPC layer. So, we have two DRPC services available over the Agent websocket, and register them both on the DRPC `mux`.
Since the tailnet proto RPC service is now for both clients and agents, I renamed some things to clarify and shorten.
This PR also removes the `TailnetAPI` implementation from the `agentapi` package, and the next PR in the stack replaces it with the implementation from the `tailnet` package.
* Add database tables for OAuth2 applications
These are applications that will be able to use OAuth2 to get an API key
from Coder.
* Add endpoints for managing OAuth2 applications
These let you add, update, and remove OAuth2 applications.
* Add frontend for managing OAuth2 applications
* feat: enable csrf token header
* Exempt external auth requets
* ensure dev server bypasses CSRF
* external auth is just get requests
* Add some more routes
* Extra assurance nothing breaks
* chore: fix flake, use time closer to actual test
The tests were queued, and the autostart time was being set
to the time the table was created, not when the test was actually
being run. This diff was causing failures in CI
* Adds UpdateProvisionerDaemonLastSeenAt
* Adds heartbeat to provisioner daemons
* Inserts provisioner daemons to database upon start
* Ensures TagOwner is an empty string and not nil
* Adds COALESCE() in idx_provisioner_daemons_name_owner_key
* chore: add unit test to excercise flake
* Implement a *fix for cron stop() before run()
This fix still has a race condition. I do not see a clean solution
without modifying the cron libary. The cron library uses a boolean
to indicate running, and that boolean needs to be set to "true"
before we call "Close()". Or "Close()" should prevent "Run()"
from doing anything.
In either case, this solves the issue for a niche unit test bug
in which the test finishes, calling Close(), before there was
an oppertunity to start the go routine. It probably isn't worth
a lot of time investment, and this fix will suffice
* setup manifest
* added okta guide from steven M
* improved index by adding children
* changed icon to notes.svg
* added meta guide, fixed profile photo fmt
closes#10532
Adds v2 support to the /coordinate endpoint via a query parameter.
v1 already has test cases, and we haven't implemented v2 at the client yet, so the only new test case is an unsupported version.
Part of #10532
Adds a tailnet ClientService that accepts a net.Conn and serves v1 or v2 of the tailnet API.
Also adds a DRPCService that implements the DRPC interface for the v2 API. This component is within the ClientService, but needs to be reusable and exported so that we can also embed it in the Agent API.
Finally, includes a NewDRPCClient function that takes a net.Conn and runs dRPC in yamux over it on the client side.
Part of #10532
DRPC transport over yamux and in-mem pipes was previously only used on the provisioner APIs, but now will also be used in tailnet. Moved to subpackage of codersdk to avoid import loops.
Fixes flake https://github.com/coder/coder/runs/19639217635
AGPL coordinator used to process node updates for single_tailnet synchronously, but it's been refactored to process async, so in this test we need to wait for it to be processed.
This sends the email the license was issued to, and whether or not it's a trial in the telemetry payload. It's a bit janky since the license parsing is all enterprise licensed.
Addresses the issue in #11185 for the StringMap datatype.
There are other slice data types in our database package that also need to be fixed, but that'll be a different PR
Adds column api_version to the provisioner_daemons table.
This is distinct from the coderd version, and is used to handle breaking changes in the provisioner daemon API.
* added sharkymark FAQs page
* make fmt
* fixed typos for link
* changed FAQs icon to (i)
* satisfied review
* make fmt
* added docs links for coder_app, CODER_ACCESS_URL
* removed mentions of mark
* fixed some minor code formatting issues
* fixed numbered bullets rendering, make fmt
* chore(Makefile): use golangci-lint version from dogfood Dockerfile
* chore(dogfood/Dockerfile): update golangci-lint to latest version
* chore(coderd): address linter complaints
Fixes#10979
Testing code that listens on a specific port has created a long battle with flakes. Previous attempts to deal with this include opening a listener on a port chosen by the OS, then closing the listener, noting the port and starting the test with that port.
This still flakes, notably in macOS which has a proclivity to reuse ports quickly.
Instead of fighting with the chaos that is an OS networking stack, this PR fakes the host networking in tests.
I've taken a small step here, only faking out the Listen() calls that port-forward makes, but I think over time we should be transitioning all networking the CLI does to an abstract interface so we can fake it. This allows us to run in parallel without flakes and
presents an opportunity to test error paths as well.
* chore: check if process is nil
We check if process is nil in the ports_supported file.
Just matching that defensive check, not sure if it can be nil.
- Adds a --name argument to provisionerd start
- Plumbs through name to integrated and external provisioners
- Defaults to hostname if not specified for external, hostname-N for integrated
- Adds cliutil.Hostname
* feat: add endpoints to list all authed external apps
Listing the apps allows users to auth to external apps without going through the create workspace flow.
* Force typegen types for some fields of derp health report
* Explicitly allocate slices for RegionReport.{Errors,Warnings} to avoid nulls in API response
The nix image isn't used because it doesn't work, and we haven't been
updating our "pre-nix" tag since the changes were made. Reverts back to
being a regular Dockerfile.
* chore: add Pagination component, add new test, and update other pagination tests
* fix: add back temp spacing for WorkspacesPageView
* chore: update AuditPage to use Pagination
* chore: update UsersPage to use Pagination
* refactor: move parts of Pagination into WorkspacesPageView
* fix: handle empty states for pagination labels better
* docs: rewrite comment for clarity
* refactor: rename components/properties for clarity
* fix: rename component files for clarity
* chore: add story for PaginationContainer
* chore: rename story for clarity
* fix: handle undefined case better
* fix: update imports for PaginationContainer mocks
* fix: update story values for clarity
* fix: update scroll logic to go to the bottom instead of the top
* fix: update mock setup for test
* fix: update stories
* fix: remove scrolling functionality
* fix: remove deprecated property
* refactor: rename prop
* fix: remove debounce flake
Updates coder/customers#365
This PR updates our migration framework to run all migrations in a single transaction. This is the same behavior we had in v1 and ensures that failed migrations don't bring the whole deployment down. If a migration fails now, it will automatically be rolled back to the previous version, allowing the deployment to continue functioning.
Relates to #8965
* Fixes offlinedocs that broke from change in feat(coderd/healthcheck): add access URL error codes and healthcheck doc #10915 by removing the offending anchor links from the page subheadings.
* Makes offlinedocs also conditional on changes to docs
Fixes flake seen here: https://github.com/coder/coder/runs/19170327767
The goroutine that attempts to dial the socket didn't complete before the test did. Here we add an explicit wait for it to complete in each run of the loop.
Spotted during a code read. ConnIO unlocks the mutex before attempting to write to the response channel, which could allow another goroutine to call Close() and close the channel, causing a panic.
Fix is to hold the mutex. This won't cause a deadlock because the `select{}` has a `default` case, so we won't block even if the receiver isn't keeping up.
Adds cleanup queries to clean out "lost" peer and tunnel state after 24 hours. We leave this state in the database so that anything trying to connect to the peer can see that it was lost, but clean it up after 24 hours to ensure our table doesn't grow without bounds.
Adds support for graceful disconnect to PGCoordinator. When peers gracefully disconnect, they send a disconnect message. This triggers the peer to be disconnected from all tunneled peers.
The Multi-Agent Client supports graceful disconnect, since it is in memory and we know that when it is closed, we really mean to disconnect.
The v1 agent and client Websocket connections do not support graceful disconnect, since the v1 protocol doesn't have this feature. That means that if a v1 peer connects to a v2 peer, when the v1 peer's coordinator connection is closed, the v2 peer will
see it as "lost" since we don't know whether the v1 peer meant to disconnect, or it just lost connectivity to the coordinator.
* wip: commit current progress on usePaginatedQuery
* chore: add cacheTime to users query
* chore: update cache logic for UsersPage usersQuery
* wip: commit progress on Pagination
* chore: add function overloads to prepareQuery
* wip: commit progress on usePaginatedQuery
* docs: add clarifying comment about implementation
* chore: remove optional prefetch property from query options
* chore: redefine queryKey
* refactor: consolidate how queryKey/queryFn are called
* refactor: clean up pagination code more
* fix: remove redundant properties
* refactor: clean up code
* wip: commit progress on usePaginatedQuery
* wip: commit current pagination progress
* docs: clean up comments for clarity
* wip: get type signatures compatible (breaks runtime logic slightly)
* refactor: clean up type definitions
* chore: add support for custom onInvalidPage functions
* refactor: clean up type definitions more for clarity reasons
* chore: delete Pagination component (separate PR)
* chore: remove cacheTime fixes (to be resolved in future PR)
* docs: add clarifying/intellisense comments for DX
* refactor: link users queries to same queryKey implementation
* docs: remove misleading comment
* docs: more comments
* chore: update onInvalidPage params for more flexibility
* fix: remove explicit any
* refactor: clean up type definitions
* refactor: rename query params for consistency
* refactor: clean up input validation for page changes
* refactor/fix: update hook to be aware of async data
* chore: add contravariance to dictionary
* refactor: increase type-safety of usePaginatedQuery
* docs: more comments
* chore: move usePaginatedQuery file
* fix: add back cacheTime
* chore: swap in usePaginatedQuery for users table
* chore: add goToFirstPage to usePaginatedQuery
* fix: make page redirects work properly
* refactor: clean up clamp logic
* chore: swap in usePaginatedQuery for Audits table
* refactor: move dependencies around
* fix: remove deprecated properties from hook
* refactor: clean up code more
* docs: add todo comment
* chore: update testing fixtures
* wip: commit current progress for tests
* fix: update useEffectEvent to sync via layout effects
* wip: commit more progress on tests
* wip: stub out all expected test cases
* wip: more test progress
* wip: more test progress
* wip: commit more test progress
* wip: AHHHHHHHH
* chore: finish two more test cases
* wip: add in all tests (still need to investigate prefetching
* refactor: clean up code slightly
* fix: remove math bugs when calculating pages
* fix: wrap up all testing and clean up cases
* docs: update comments for clarity
* fix: update error-handling for invalid page handling
* fix: apply suggestions
Relates to #8965
- Added error codes for separate code paths in health checks
- Prefixed errors and warnings with error code prefixes
- Added a docs page with details on each code, cause and solution
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
The 'userOIDC' method body was getting unwieldy.
I think there is a good way to redesign the flow, but
I do not want to undertake that at this time.
The easy win is just to move some LoC to other methods
and cleanup the main method.
Drop "New" and "Builder" from the function names, in favor of the top-level resource created. This shortens tests and gives a nice syntax. Since everything is a builder, the prefix and suffix don't add much value and just make things harder to read.
I've also chosen to leave `Do()` as the function to insert into the database. Even though it's a builder pattern, I fear `.Build()` might be confusing with Workspace Builds. One other idea is `Insert()` but if we later add dbfake functions that update, this might be inconsistent.
I noticed we have been overusing colors in the UI, so simplifying is better for the "look and feel" and maintaining the styles over time.

If you want to have a better sense of what it looks like, I recommend you go to the Chromatic snapshot.
- Updates plugin staleness check to check mtime instead of atime, as atime has been shown to be unreliable
- Updates existing unit test to use a real filesystem as Afero's in-memory FS doesn't support atimes at all
Convert to builder for consistency with rest of the package. This will make it easier to use, and means we can drop "Builder" from function arguments since they are all builders in the package.
* Adds workspace proxy section to health page
* Conditionally places workspace proxy warnings in errors or warnings based on calculated severity
* Adds some more stories we were missing for HealthPage
Fixes#10799
The flake happens when we try to remote forward, but the port we've chosen is not free. In the flaked example, it's actually the SSH listener that occupies the port we try to remote forward, leading to confusing reads (c.f. the linked issue).
This fix simplies the tests considerably by using the Go ssh client, rather than shelling out to OpenSSH. This avoids using a pseudoterminal, avoids the need for starting any local OS listeners to communicate the forwarding (go SSH just returns in-process listeners), and avoids an OS listener to wire OpenSSH up to the agentConn.
With the simplied logic, we can immediately tell if a remote forward on a random port fails, so we can do this in a loop until success or timeout.
I've also simplified and fixed up the other forwarding tests. Since we set up forwarding in-process with Go ssh, we can remove a lot of the `require.Eventually` logic.
- Adds a template_insights pseudo-resource
- Grants auditor and template admin roles read access on template_insights
- Updates existing RBAC checks to check for read template_insights, falling back to template update permissions where necessary
- Updates TemplateLayout to show Insights tab if can read template_insights or can update template
Adds a health check for workspace proxies:
- Healthy iff all proxies are healthy and the same version,
- Warning if some proxies are unhealthy,
- Error if all proxies are unhealthy, or do not all have the same version.
* refactor: remove workspace error enums
* fix: add in retry button for failed workspaces
* fix: make handleBuildRetry auto-detect debug permissions
* chore: consolidate retry messaging
* chore: update renderWorkspacePage to accept parameters
* chore: make workspace test helpers take explicit workspace parameter
* refactor: update how parameters for tests are defined
* fix: update old tests to be correctly parameterized
fixes#10810
The tailnet coordinators don't depend on replicasync, so we can still enable HA coordinators even if the relay URL is unset.
The in-memory, non-HA coordinator probably has lower latency than the PG Coordinator, since we have to query the database, so enterprise customers might want to disable it for single-replica deployments, but this PR default-enables the HA coordinator. We could add support later to disable it if anyone complains. Latency setting up connections matters, but I don't believe the coordinator contributes significantly at this point for reasonable postgres round-trip-time.
Man, graceful shutdown is hard. Even after my changes, we were still hitting a graceful shutdown race: https://github.com/coder/coder/runs/18886842123
The problem was that while we attempt a graceful shutdown at the SSH layer by closing the session for writing, we were not giving it a chance to complete before continuing to tear down the stack of closers, including one that closes the netstack, and thus drop the TCP connection before it closes.
I'd like to convert dbfake into a builder pattern to prevent a proliferation of XXXWithYYY methods. This is one step of the way by removing the Non-builder function.
Refactors SSH tests to skip provisionerd and instead use dbfake to insert workspaces and builds. This should make tests faster and more reliable.
dbfake.WorkspaceBuild is refactored to use a "builder" pattern with "fluent" options, as the number of options and variants was starting to get out of hand.
* fix: clarify language in orphan section of delete modal
* tinted title
* Update site/src/pages/WorkspacePage/WorkspaceDeleteDialog/WorkspaceDeleteDialog.tsx
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
* prettier
---------
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
* feat: implement deprecated flag for templates to prevent new workspaces
* Add deprecated filter to template fetching
* Add deprecated to template table
* Add deprecated notice to template page
* Add ui to deprecate a template
* Remove Typography from NavbarView
* Remove Typography from EmptyState
* Remove Typography from Paywall
* Fix font size
* Remove Typography from CliAuthPage
* Remove Typography from Single SignOn
* Remove Typography from file dialog
* Remove from not found
* Remove from Section
* Remove from global snackbar
* Remove Typography component
* Add eslint role
re: #10528
Refactors PG Coordinator to work with the Tailnet v2 API, including wrappers for the existing v1 API.
The debug endpoint functions, but doesn't return sensible data, that will be in another stacked PR.
[Coder](https://coder.com) enables organizations to set up development environments in the cloud. Environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and are automatically shut down when not in use to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads that are most beneficial to them.
[Coder](https://coder.com) enables organizations to set up development environments in their public or private cloud infrastructure. Cloud development environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and are automatically shut down when not in use to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads that are most beneficial to them.
- Define development environments in Terraform
- Define cloud development environments in Terraform
- EC2 VMs, Kubernetes Pods, Docker Containers, etc.
- Automatically shutdown idle resources to save on costs
- Onboard developers in seconds instead of days
@@ -44,7 +44,7 @@
## Quickstart
The most convenient way to try Coder is to install it on your local machine and experiment with provisioning development environments using Docker (works on Linux, macOS, and Windows).
The most convenient way to try Coder is to install it on your local machine and experiment with provisioning cloud development environments using Docker (works on Linux, macOS, and Windows).
```
# First, install Coder
@@ -100,7 +100,7 @@ Browse our docs [here](https://coder.com/docs/v2) or visit a specific section be
Feel free to [open an issue](https://github.com/coder/coder/issues/new) if you have questions, run into bugs, or have a feature request.
[Join our Discord](https://discord.gg/coder) to provide feedback on in-progress features, and chat with the community using Coder!
[Join our Discord](https://discord.gg/coder) or [Slack](https://cdr.co/join-community) to provide feedback on in-progress features, and chat with the community using Coder!
Description:"Name or ID of the template. Traffic generation will be limited to workspaces created from this template.",
Value:clibase.StringOf(&template),
},
{
Flag:"target-workspaces",
Env:"CODER_SCALETEST_TARGET_WORKSPACES",
Description:"Target a specific range of workspaces in the format [START]:[END] (exclusive). Example: 0:10 will target the 10 first alphabetically sorted workspaces (0-9).",
Description:"Target a specific range of users in the format [START]:[END] (exclusive). Example: 0:10 will target the 10 first alphabetically sorted users (0-9).",
_,_=fmt.Fprintf(inv.Stdout,"Workspace %q removed from favorites.\n",ws.Name)
returnnil
},
}
returncmd
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.