Terraform changed the default output of the `terraform graph` command.
You must put `-type=plan` to keep the prior behavior.
(cherry picked from commit 3364abecdd)
Co-authored-by: Colin Adler <colin1adler@gmail.com>
This fixes an RCE in git and gets us one minor version closer to fixing
a critical Terraform vulnerability. In the next release we'll bump to
1.8.x.
(cherry picked from commit 80538c079d)
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost instead. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.
* chore: allow terraform & echo built-in provisioners
Built-in provisioners serve all specified types. This allows running terraform, echo, or both in built in.
The cli flag to control the types is hidden by default, to be used primarily for testing purposes.
* fix: remove some of the jank around our core App component
* refactor: scope navigation logic more aggressively
* refactor: add explicit return type to useAuthenticated
* refactor: clean up ProxyContext code
* wip: add code for consolidating the HTML metadata
* refactor: clean up hook logic
* refactor: rename useHtmlMetadata to useEmbeddedMetadata
* fix: correct names that weren't updated
* fix: update type-safety of useEmbeddedMetadata further
* wip: switch codebase to use metadata hook
* refactor: simplify design of metadata hook
* fix: update stray type mismatches
* fix: more type fixing
* fix: resolve illegal invocation error
* fix: get metadata issue resolved
* fix: update comments
* chore: add unit tests for MetadataManager
* fix: beef up tests
* fix: update typo in tests
* docs: add island integration guide
* make: fmt
* F
omit F
* fix: naming and manifest
---------
Co-authored-by: Matt Vollmer <matthewjvollmer@outlook.com>
I initially made this change when hacking wgengine to also capture wireguard packets going into the magicsock, so that we could capture the initial wireguard handshake.
I don't think we should ship that additional capture logic, but... it seems generally useful to capture packets from the get go on speedtest, so that you can see disco and pings before the TCP speedtest session starts.
When starting a workspace, if the deadline crosses an autostart boundary, the deadline is set to autostart + TTL.
This copies the behavior in `ActivityBumpWorkspace`, but does not require activity.
* chore: dynamically determine gitlab external auth defaults
Static defaults work for github cloud, but not self hosted.
Self hosted setups will now have sane defaults if omitted.
* feat: influence parameter defaults through cli flag/env
Add a --parameter-default flag / CODER_RICH_PARAMETER_DEFAULT
environment variable which overrides default values suggested for
parameters.
This allows scripts or middleware wrapping the CLI to substitute
defaults for parameter values beyond those defined at the template
level. For example, Git repository/branch parameters can be given
defaults based on the current checkout, or default parameter values can
be parsed out of files inside the repo.
* Rename defaults arg to defaultOverrides
This PR adds a command to bump versions in docs/markdown.
This is still standalone and needs to be wired up.
For now, I'm planning on putting this in `scripts/release.sh` (checkout main -> autoversion (this command) -> commit -> submit PR).
It would be pretty neat to make it a GH actions that's triggered on release though, something for the future.
Part of #12465
* fix: update API code to use Axios instance
* docs: fix typo
* fix: update all global axios imports to use Coder instance
* fix: remove needless import
* fix: update import order
* refactor: rename coderAxiosInstance to axiosInstance
* docs: update variable reference in FE contributing docs
* chore: reduce requests the dashboard makes from seeded data
We already inject all of this content in `index.html`.
There was also a bug with displaying a loading indicator when
the workspace proxies endpoint 404s.
* Fix first user fetch
* Add util
* Add cached query for entitlements and experiments
* Fix authmethods unnecessary request
* Fix unnecessary region request
* Fix fmt
* Debug
* Fix test
Adds checks to coderd/healthcheck/derphealth for STUN issues:
- Alerts if there is not least one healthy STUN server,
- Alerts if we see variable port mapping.
* chore: skip global.setup if first user already exists
treat test as a setup, rather than a test
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
---------
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
Fixes#12780
Adds indexes to the `tailnet_tunnels` table to speed up `GetTailnetTunnelPeerIDs` and `GetTailnetTunnelPeerBindings` queries, which match on `src_id` and `dst_id`.
When an agent receives a node, it responds with an ACK which is relayed
to the client. After the client receives the ACK, it's allowed to begin
pinging.
* chore: nix shell to support playwright e2e tests
nix is running an older version of chromium, so had to reduce the
playwright version.
* Add to e2e readme
* add enterprise test comment
* add note about install to readme
* make fmt
* remove shellhook message
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
* add link to nixos playwright package to get version
* formatting
---------
Co-authored-by: Kayla Washburn-Love <mckayla@hey.com>
* chore: verify pass through external auth query params
Unit test added to verify behavior of query params set in the
auth url for external apps. This behavior is intended to specifically
support Auth0 audience query param.
fixes#12923
Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.
It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
* chore: merge apikey/token session config values
There is a confusing difference between an apikey and a token. This
difference leaks into our configs. This change does not resolve the
difference. It only groups the config values to try and manage any
bloat that occurs from adding more similar config values
A customer hit like 200k of ErrSessionShutdown, which just dupes any errors we would have generated when shutting down the session for e.g. Ping failures.
* chore: remove InsertWorkspaceAgentStat query
InsertWorkspaceAgentStats (batch) exists. We only used the singular in
a single unit test place. Removing the single for the batch, reducing
the interface size.
* 2.10.0 changelog
* updated install docs for mainline/stable releases
* make fmt
* cpp icon -> C++
* added disclaimer on MAX_TTL, support bundle info
* 'release schedule'
* lowercase mainline
* Agent OOM protection info
* minor tweak
* docs: describe mutually exclusive create workspace template fields
Ideally we could do this in the OpenAPI spec, but there is no first
class "mutually exclusive" feature in OpenAPI. So in lieu of something
more complex, or changing our struct/validation, a description comment
should suffice.
* chore: Add description to code sample as well
* chore: use fork of chroma to remove unused inits
This seems fine to do since compilation errors would occur
if it were actually in use.
Everything seems fine here.
* Update validator
* feat(cli): add golden tests for errors (#11588)
Creates golden files from `coder/cli/errors.go`.
Adds a unit test to test against golden files.
Adds a make file command to regenerate golden files.
Abstracts test against golden files.
* chore: merge authorization contexts
Instead of 2 auth contexts from apikey and dbauthz, merge them to
just use dbauthz. It is annoying to have two.
* fixup authorization reference
* docs: document how to run workspace-proxy as a system service
* Update workspace-proxies.md
* Update workspace-proxies.md
Co-authored-by: Muhammad Atif Ali <me@matifali.dev>
* docs: fix duplication
---------
Co-authored-by: Muhammad Atif Ali <me@matifali.dev>
NOTE: terraform-provider-coder was updated to facilitate this change, and your template will require v0.19.0 for this feature to work. You can run terraform init -upgrade in your template directory. If you have a version constraint set, ensure it points to this version.
pbkdf2 is too expensive to run in init, so this change makes it load
lazily. I introduced a lazy package that I hope to use more in my
`GODEBUG=inittrace=1` adventure.
Benchmark results:
```
$ hyperfine "coder --help" "coder-new --help"
Benchmark 1: coder --help
Time (mean ± σ): 82.1 ms ± 3.8 ms [User: 93.3 ms, System: 30.4 ms]
Range (min … max): 72.2 ms … 90.7 ms 35 runs
Benchmark 2: coder-new --help
Time (mean ± σ): 52.0 ms ± 4.3 ms [User: 62.4 ms, System: 30.8 ms]
Range (min … max): 41.9 ms … 62.2 ms 52 runs
Summary
coder-new --help ran
1.58 ± 0.15 times faster than coder --help
```
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
Just upgraded to macOS 14.4 and TestAgentScript/Run fails for me with error `signal: killed`. I opened the test directory in a terminal and sure enough, when you execute the `echo` binary, it is immediately killed. The binary has no extended attributes and is byte-identical to the one in `/bin/`.
This fix uses a different strategy: instead of copying the `echo` binary from the system around, we just copy a small bash script that _calls_ the `echo` command.
This cleans up `root.go` a bit, adds tests for middleware HTTP transport
functions, and removes two HTTP requests we always always performed previously
when executing *any* client command.
It should improve CLI performance (especially for users with higher latency).
This PR updates the tests in `insights_test.go` to enable commented-out scenarios. This behavior was fixed by previous PRs in this stack. Note that the updated golden files are correct since they are "second template only" meaning that the newly introduced data is considered as expected. In other golden files there is no change since "only count once" is applied.
This PR updates the `*ByTempalte` insights queries used for generating Prometheus metrics to behave the same way as the new rollup query and re-written insights queries that utilize the rolled up data.
Add `dbrollup` service that runs the `UpsertTemplateUsageStats` query
every 5 minutes, on the minute. This allows us to have fairly real-time
insights data when viewing "today".
Add `template_usage_stats` table for aggregating tempalte usage data.
Data is rolled up by the `UpsertTemplateUsageStats` query, which fetches
data from the `workspace_agent_stats` and `workspace_app_stats` tables.
- Updates existing tests under workspaceapps/apptest to not reuse existing appDetails as assertWorkspaceLastUsed(Not)?Updated calls FlushStats() which was causing racy test behaviour and incorrect test assertions.
- Expands scope of assertWorkspaceLastUsedAtUpdated and its counterpart to ProxySubdomain tests.
This more closely aligns with GitHub's label search style. Actual search params need to be converted to allow this format, by default they will throw an error if they do not support listing.
This PR updates the coder port-forward command to periodically inform coderd that the workspace is being used:
- Adds workspaceusage.Tracker which periodically batch-updates workspace LastUsedAt
- Adds coderd endpoint to signal workspace usage
- Updates coder port-forward to periodically hit this endpoint
- Modifies BatchUpdateWorkspacesLastUsedAt to avoid overwriting with stale data
Co-authored-by: Danny Kopping <danny@coder.com>
* chore: remove max_ttl from templates
Completely removing max_ttl as a feature on template scheduling. Must use other template scheduling features to achieve autostop.
* chore: add org ID as optional param to AcquireJob
* chore: plumb through organization id to provisioner daemons
* add org id to provisioner domain key
* enforce org id argument
* dbgen provisioner jobs defaults to default org
In the peer healthcheck code, when an error pinging peers is detected we
write a "replicaErr" string with the error reason. However, if there are
no peer replicas to ping we returned early without setting the string to
empty. This would cause replicas that had peers (which were failing) and
then the peers left to permanently show an error until a new peer
appeared.
Also demotes DERP replica checking to a "warning" rather than an "error"
which should prevent the primary from removing the proxy from the region
map if DERP meshing is non-functional. This can happen without causing
problems if the peer is shutting down so we don't want to disrupt
everything if there isn't an issue.
Fixes#10760
The coder CLI quietly accepts any subcommand arguments and silently swallows them.
Currently:
```sh
❯ coder | head -n5
coder v2.3.3+e491217
USAGE:
coder [global-flags] <subcommand>
```
```sh
❯ coder idontexist | head -n5
coder v2.3.3+e491217
USAGE:
coder [global-flags] <subcommand>
```
Now help output will not be show when there is an unknown subcommand error. Instead users will be given the command for the help output.
```sh
❯ coder idontexist
Encountered an error running "coder", see "coder --help" for more information
error: unrecognized subcommand "idontexist"
```
```sh
❯ coder iexistbut idontexist
Encountered an error running "coder iexistbut", see "coder iexistbut --help" for more information
error: unrecognized subcommand "idontexist"
```
Also this stuff: `Encountered an error running "coder iexistbut"... ` gets written to `os.Stdout` in `prettyErrorFormatter{w: os.Stderr, verbose: r.verbose}`, not sure how to test that output.
* fix: separate signals for passive, active, and forced shutdown
`SIGTERM`: Passive shutdown stopping provisioner daemons from accepting new
jobs but waiting for existing jobs to successfully complete.
`SIGINT` (old existing behavior): Notify provisioner daemons to cancel in-flight jobs, wait 5s for jobs to be exited, then force quit.
`SIGKILL`: Untouched from before, will force-quit.
* Revert dramatic signal changes
* Rename
* Fix shutdown behavior for provisioner daemons
* Add test for graceful shutdown
Apptest requires a port without a listening server to test failure
cases. This port was chosen and had a chance of actually being
provisioned. To prevent this accident, a port <1k is chosen,
since those will never be allocated.
Every time I run `pnpm` in the project it adds the package manager attribute on package.json so I just decided to push it since it does not look like an issue and we can make sure everyone is running the same pnpm version.
fixes#11950https://github.com/coder/coder/issues/11950#issuecomment-1987756088 explains the bug
We were also calling into `Unlisten()` and `Close()` while holding the mutex. I don't believe that `Close()` depends on the notification loop being unblocked, but it's hard to be sure, and the safest thing to do is assume it could block.
So, I added a unit test that fakes out `pq.Listener` and sends a bunch of notifies every time we call into it to hopefully prevent regression where we hold the mutex while calling into these functions.
It also removes the use of a `context.Context` to stop the PubSub -- it must be explicitly `Closed()`. This simplifies a bunch of the logic, and is how we use the pubsub anyway.
Allow `coder login` to log into existing deployment if available.
Update help and error messages to indicate that `coder login` is
available as a command.
Fixes#10925Fixes#9551
* coderd: add test to reproduce trailing directory issue
* coderd: add trailing path separator to dir entries when converting to zip
* provisionersdk: add trailing path separator to directory entries
* chore: rename useTab to useSearchParamsKey and add test
* chore: mark old renderHookWithAuth as deprecated (temp)
* fix: update imports for useResourcesNav
* refactor: change API for useSearchParamsKey
* chore: let user pass in their own URLSearchParams value
* refactor: clean up comments for clarity
* fix: update import
* wip: commit progress on useWorkspaceDuplication revamp
* chore: migrate duplication test to new helper
* refactor: update code for clarity
* refactor: reorder test cases for clarity
* refactor: split off hook helper into separate file
* refactor: remove reliance on internal React Router state property
* refactor: move variables around for more clarity
* refactor: more updates for clarity
* refactor: reorganize test cases for clarity
* refactor: clean up test cases for useWorkspaceDupe
* refactor: clean up test cases for useWorkspaceDupe
* fix: do not set max deadline for workspaces on template update
When templates are updated and schedule data is changed, we update all
running workspaces to have up-to-date scheduling information that sticks
to the new policy.
When updating the max_deadline for existing running workspaces, if the
max_deadline was before now()+2h we would set the max_deadline to
now()+2h.
Builds that don't/shouldn't have a max_deadline have it set to 0, which
is always before now()+2h, and thus would always have the max_deadline
updated.
* test: add unit test to excercise template schedule bug
---------
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
I noticed in my logs that sometimes `coder ssh` doesn't gracefully disconnect from the coordinator.
The cause is the `closerStack` construct we use in that function. It has two paths to start closing things down:
1. explicit `close()` which we do in `defer`
2. context cancellation, which happens if the cli function returns an error
sometimes the ssh remote command returns an error, and this triggers context cancellation of the `closerStack`. That is fine in and of itself, but we still want the explicit `close()` to wait until everything is closed before returning, since that's where we do cleanup, including the graceful disconnect. Prior to this fix the `close()` just immediately exits if another goroutine is closing the stack. Here we add a wait until everything is done.
* fix: ensure auto-workspace creation waits until all parameters are ready
* refactor: move creation blocking logic to main callback
* fix: let creation start if experimental feature is off
Adds a `--debug` flag to `scripts/develop.sh` that will start coder under `dlv debug` instead.
You can then use e.g. the following launch snippet to connect dlv:
```
{
"name": "Delve Remote",
"type": "go",
"request": "attach",
"mode": "remote",
"port": 12345,
}
```
You can also run invididual CLI commands under dlv e.g.
```
debug=1 scripts/coder-dev.sh list
```
Also sets CGO_ENABLED=0 in develop.sh by default.
- Reworks the proxy registration loop into a struct (so I can add a `RegisterNow` method)
- Changes the proxy registration loop interval to 15s (previously 30s)
- Adds test which tests bidirectional DERP meshing on all possible paths between 6 workspace proxy replicas
Related to https://github.com/coder/customers/issues/438
This fixes a vulnerability with the `CODER_OIDC_EMAIL_DOMAIN` option,
where users with a superset of the allowed email domain would be allowed
to login. For example, given `CODER_OIDC_EMAIL_DOMAIN=google.com`, a
user would be permitted entry if their email domain was
`colin-google.com`.
Part of #12163
- Adds a command coder support bundle <workspace> that generates a
support bundle and writes it to coder-support-$(date +%s).zip.
- Note: this is hidden currently until the rest of the functionality is fleshed out.
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
- Adds more testcases to TestAcquirer_MatchTags
- Adds functionality to generate a table from above test
- Update provisioner tag documentation with generated table
- Apply other feedback from #12315
DERP mesh key setup would do a SELECT and then an INSERT on failure, without a lock. During some testing with multiple replicas, I managed to cause a replica to crash due to them initializing simultaneously.
Fixes:
Encountered an error running "coder server"
create coder API: insert mesh key: pq: duplicate key value violates unique constraint "site_configs_key_key"
Co-authored-by: Cian Johnston <cian@coder.com>
* refactor: clean up and update API for useClipboard
* wip: commit current progress on useClipboard test
* docs: clean up wording on showCopySuccess
* chore: make sure tests can differentiate between HTTP/HTTPS
* chore: add test ID to dummy input
* wip: commit progress on useClipboard test
* wip: commit more test progress
* refactor: rewrite code for clarity
* chore: finish clipboard tests
* fix: prevent double-firing for button click aliases
* refactor: clean up test setup
* fix: rename incorrect test file
* refactor: update code to display user errors
* refactor: redesign useClipboard to be easier to test
* refactor: clean up GlobalSnackbar
* feat: add functionality for notifying user of errors (with tests)
* refactor: clean up test code
* refactor: centralize cleanup steps
Beginnings of a solution to #12297
Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.
For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.
```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
preferred DERP 10013 (Europe Fly.io (Paris))
endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```
* fix(coderd): mark provisioner daemon psk as secret
Marks provisioner daemon PSK with the secret annotation.
This ensures it will be scrubbed from API requests to
/api/v2/deployment/config.
* make gen
When viewing the Authentication page, the diagram showing the flow is a useful
resource for understanding the rest of the page.
Rather than linking to a specific version of the SVG, inline it as part of the
documentation.
* chore: drop github per user rate limit tracking
Rate limits for authenticated requests are per user.
This would be an excessive number of prometheus labels,
so we only track the unauthorized limit.
Moves healthcheck report-related structs from coderd/healthcheck to codersdk
This prevents an import cycle when adding a codersdk.Client method to hit /api/v2/debug/health.
Changes the agent to use the new v2 API for sending logs, via the logSender component.
We keep the PatchLogs function around, but deprecate it so that we can test the v1 endpoint.
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.
This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
Alternative solution to #6442
Modifies the behaviour of AcquireProvisionerJob and adds a special case for 'un-tagged' jobs such that they can only be picked up by 'un-tagged' provisioners.
Also adds comprehensive test coverage for AcquireJob given various combinations of tags.
Fixes race seen here: https://github.com/coder/coder/runs/21852483781
What happens is that the agent connects, completes the test, and then disconnects before the Eventually condition runs. The waiter then times out because it's looking for a connected agent.
Then, since it's a `require` in a goroutine, that causes the `tGo` cleanup to hang and the whole test suite to timeout after 10 minutes.
Anyway, `agenttest.New` doesn't block, and we don't actually need to wait for the agent to connect, since a successful SSH session is evidence that it connected.
* feat: convertGroups() no longer requires organization info
Removing role information from some users in the api. This info is
excessive and not required. It is costly to always include
* fix: ignore surronding whitespace for cli config
Cli config files break if you edit them manually with any editor.
Editors drop a newline at the end, and we not break on this.
If a developer manually edits a file, it should still work
* fix: move oauth2 routes
From /login/oauth2/* to /oauth2/*.
/login/oauth2 causes /login to no longer get served by the frontend,
even if nothing is actually served on /login itself.
* Add forgotten comment on delete
* feat: disable directory listings for static files
Static file server handles serving static asset files (js, css, etc).
The default file server would also list all files in a directory.
This has been changed to only serve files.
* chore: add database test fixture to insert non-unique linked_ids
* chore: create unit test to exercise failed email change bug
* fix: add postgres triggers to keep user_links clear of deleted users
* Add migrations to prevent deleted users with links
* Force soft delete of users, do not allow un-delete
* refactor: clean up tests for debounce
* refactor: clean up tests for useCustomEvent
* refactor: clean up events file
* refactor: clean up tests for hookPolyfills
- These CSS changes were for making sure there weren't layout shifts
when using the non-secure clipboard fallback, which could cause janky
UI flickers. It seems to be breaking things for some users on HTTP-only
connections, though.
This PR removes the prometheus-http port entirely from the coder service specification (originally added in #10448). It also removes the Helm value coder.service.prometheusNodePort.
Rationale: some cloud providers will helpfully expose all ports on a LoadBalancer service for you. The net effect of this is that setting CODER_PROMETHEUS_ENABLE will end up exposing port 2112 on your coderd service to the internet, which is likely undesired behaviour.
The agent is extended with a `--script-data-dir` flag, defaulting to the
OS temp dir. This dir is used for storing `coder-script-data/bin` and
`coder-script/[script uuid]`. The former is a place for all scripts to
place executable binaries that will be available by other scripts, SSH
sessions, etc. The latter is a place for the script to store files.
Since we default to OS temp dir, files are ephemeral by default. In the
future, we may consider adding new env vars or changing the default
storage location. Workspace startup speed could potentially benefit from
scripts being able to skip steps that require downloading software. We
may also extend this with more env variables (e.g. persistent storage in
HOME).
Fixes#11131
This commit refactors where custom environment variables are set in the
workspace and decouples agent specific configs from the `agentssh.Server`.
To reproduce all functionality, `agentssh.Config` is introduced.
The custom environment variables are now configured in `agent/agent.go`
and the agent retains control of the final state. This will allow for
easier extension in the future and keep other modules decoupled.
* fix: assign new oauth users to default org
This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
- prevent importing from the "monolith" lodash module. individual modules are better for tree shaking.
- prevent importing `useTheme` and types from @mui/material/styles. prefer importing from @emotion/react.
* fix: assign new oauth users to default org
This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
The first organization created is now marked as "default". This is
to allow "single org" behavior as we move to a multi org codebase.
It is intentional that the user cannot change the default org at this
stage. Only 1 default org can exist, and it is always the first org.
Closes: https://github.com/coder/coder/issues/11961
Adds a new subcomponent of the agent for queueing up logs until they can be sent over the Agent API.
Subsequent PR will change the agent to use this instead of the HTTP API for posting logs.
Relates to #10534
Fixes#12141Fixes#11750
PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.
This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
I noticed a possible race where tailnet.Conn can try to dial the embedded region before we've set our custom dialer that send the DERP in-memory. This closes that race and adds a test case for servertailnet with no STUN and an embedded relay
I think this will resolve#12136 but lets get a proper test at the system level before closing.
Before this change, we only register the node callback at start of day for the server tailnet. If the coordinator changes, like we know happens when we are licensed for the PGCoordinator, we close the connection to the old coord, and open a new one to the new coord.
The callback is designed to direct the updates to the new coordinator, but there is nothing that specifically triggers it to fire after we connect to the new coordinator.
If we have STUN, then period re-STUNs will generally get it to fire eventually, but without STUN it we could go indefinitely without a callback.
This PR changes the servertailnet to re-register the callback each time we reconnect to the coordinator. Registering a callback (even if it's the same callback) triggers an immediate call with our node information, so the new coordinator will have it.
Adds some debug endpoints for looking into the DERP server.
The `api/v2/debug/derp/traffic` endpoint requires the `ss` utility to be present in order to function. I have *not* added the `iproute2` package to our base image as it adds 11MB, so this endpoint won't be useful by default. However, in a debugging situation, we could exec into the container and then `apk add iproute2`, or build a special debug image.
The `api/v2/debug/expvar` handler contains DERP metrics as well as commandline and memstats.
Example:
```
{
"alert_failed": 0,
"alert_generated": 0,
"cmdline": ["/Users/spike/repos/coder/build/coder_darwin_arm64","--global-config","/Users/spike/repos/coder/.coderv2","server","--http-address","0.0.0.0:3000","--swagger-enable","--access-url","http://127.0.0.1:3000","--dangerous-allow-cors-requests=true"],
"derp": {"accepts": 1, "average_queue_duration_ms": 0, "bytes_received": 0, "bytes_sent": 0, "counter_packets_dropped_reason": {"gone_disconnected": 0, "gone_not_here": 0, "queue_head": 0, "queue_tail": 0, "unknown_dest": 0, "unknown_dest_on_fwd": 0, "write_error": 0}, "counter_packets_dropped_type": {"disco": 0, "other": 0}, "counter_packets_received_kind": {"disco": 0, "other": 0}, "counter_tcp_rtt": {}, "counter_total_dup_client_conns": 0, "gauge_clients_local": 1, "gauge_clients_remote": 0, "gauge_clients_total": 1, "gauge_current_connections": 1, "gauge_current_dup_client_conns": 0, "gauge_current_dup_client_keys": 0, "gauge_current_file_descriptors": 0, "gauge_current_home_connections": 1, "gauge_memstats_sys0": 20874504, "gauge_watchers": 0, "got_ping": 0, "home_moves_in": 0, "home_moves_out": 0, "multiforwarder_created": 0, "multiforwarder_deleted": 0, "packet_forwarder_delete_other_value": 0, "packets_dropped": 0, "packets_forwarded_in": 0, "packets_forwarded_out": 0, "packets_received": 0, "packets_sent": 0, "peer_gone_disconnected_frames": 0, "peer_gone_not_here_frames": 0, "sent_pong": 0, "unknown_frames": 0, "version": "1.47.0-dev20240214-t64db8c604"},
"memstats": {"Alloc":286506256,"TotalAlloc":297594632,"Sys":310621512,"Lookups":0,"Mallocs":304204,"Frees":171570,"HeapAlloc":286506256,"HeapSys":294060032,"HeapIdle":3694592,"HeapInuse":290365440,"HeapReleased":3620864,"HeapObjects":132634,"StackInuse":3735552,"StackSys":3735552,"MSpanInuse":347256,"MSpanSys":358512,"MCacheInuse":9600,"MCacheSys":15600,"BuckHashSys":1469877,"GCSys":9434896,"OtherSys":1547043,"NextGC":551867656,"LastGC":1707892877408883000,"PauseTotalNs":1247000,"PauseNs":[200333,229375,239875,209542,106958,203792,57125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"PauseEnd":[1707892876217481000,1707892876219726000,1707892876222273000,1707892876226151000,1707892876234815000,1707892877398146000,1707892877408883000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"NumGC":7,"NumForcedGC":0,"GCCPUFraction":0.0022425810335762954,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":14396,"Frees":9143},{"Size":16,"Mallocs":89090,"Frees":50507},{"Size":24,"Mallocs":40839,"Frees":24456},{"Size":32,"Mallocs":22404,"Frees":12379},{"Size":48,"Mallocs":51174,"Frees":23718},{"Size":64,"Mallocs":15406,"Frees":3501},{"Size":80,"Mallocs":6688,"Frees":2352},{"Size":96,"Mallocs":2567,"Frees":374},{"Size":112,"Mallocs":19371,"Frees":16883},{"Size":128,"Mallocs":2873,"Frees":1061},{"Size":144,"Mallocs":5600,"Frees":2742},{"Size":160,"Mallocs":2159,"Frees":622},{"Size":176,"Mallocs":454,"Frees":86},{"Size":192,"Mallocs":227,"Frees":128},{"Size":208,"Mallocs":1407,"Frees":732},{"Size":224,"Mallocs":1365,"Frees":1090},{"Size":240,"Mallocs":82,"Frees":48},{"Size":256,"Mallocs":310,"Frees":162},{"Size":288,"Mallocs":1945,"Frees":562},{"Size":320,"Mallocs":1200,"Frees":458},{"Size":352,"Mallocs":133,"Frees":33},{"Size":384,"Mallocs":582,"Frees":51},{"Size":416,"Mallocs":747,"Frees":200},{"Size":448,"Mallocs":113,"Frees":22},{"Size":480,"Mallocs":34,"Frees":21},{"Size":512,"Mallocs":951,"Frees":91},{"Size":576,"Mallocs":364,"Frees":122},{"Size":640,"Mallocs":532,"Frees":270},{"Size":704,"Mallocs":93,"Frees":39},{"Size":768,"Mallocs":83,"Frees":35},{"Size":896,"Mallocs":308,"Frees":175},{"Size":1024,"Mallocs":226,"Frees":122},{"Size":1152,"Mallocs":198,"Frees":100},{"Size":1280,"Mallocs":314,"Frees":171},{"Size":1408,"Mallocs":77,"Frees":47},{"Size":1536,"Mallocs":80,"Frees":54},{"Size":1792,"Mallocs":199,"Frees":107},{"Size":2048,"Mallocs":112,"Frees":48},{"Size":2304,"Mallocs":71,"Frees":32},{"Size":2688,"Mallocs":206,"Frees":81},{"Size":3072,"Mallocs":39,"Frees":15},{"Size":3200,"Mallocs":16,"Frees":7},{"Size":3456,"Mallocs":44,"Frees":29},{"Size":4096,"Mallocs":192,"Frees":83},{"Size":4864,"Mallocs":44,"Frees":25},{"Size":5376,"Mallocs":105,"Frees":43},{"Size":6144,"Mallocs":25,"Frees":5},{"Size":6528,"Mallocs":22,"Frees":7},{"Size":6784,"Mallocs":3,"Frees":0},{"Size":6912,"Mallocs":4,"Frees":2},{"Size":8192,"Mallocs":59,"Frees":10},{"Size":9472,"Mallocs":31,"Frees":12},{"Size":9728,"Mallocs":5,"Frees":2},{"Size":10240,"Mallocs":5,"Frees":0},{"Size":10880,"Mallocs":27,"Frees":11},{"Size":12288,"Mallocs":4,"Frees":1},{"Size":13568,"Mallocs":4,"Frees":2},{"Size":14336,"Mallocs":9,"Frees":2},{"Size":16384,"Mallocs":10,"Frees":2},{"Size":18432,"Mallocs":4,"Frees":2}]},
"warning_failed": 0,
"warning_generated": 0
}
```
If we find the DERP metrics useful we could consider how to include them in Prometheus scrapes based on the tailnet `varz` package. That's for a later PR if at all.
Adds documentation on port requirements and a short overview of STUN with some example scenarios.
Co-authored-by: Dean Sheather <dean@deansheather.com>
Co-authored-by: Spike Curtis <spike@coder.com>
When we exceed the db-imposed limit of logs, we need to communicate that back to the agent. In v1 we did it with a 4xx-level HTTP status, but with dRPC, the errors are delivered as strings, which feels fragile to me for something we want to gracefully handle.
So, this PR adds the log limit exceeded as a field on the response message, and fixes the API handler to set it as appropriate instead of an error.
* feat(provisioner): relax max terraform version constraint
* feat!(scripts/Dockerfile.base): update bundled terraform to 1.6.x
* bump terraform version in Dogfood image
* fix over-zealous rename
Fixes#12030
This is a good example of the kind of thing I'd like to address with a time-testing lib. The problem is that there is a race between the watchdog starting it's timer and the test incrementing the time. What would make this easier is if the time-testing library could wait for and assert the call to start the timer before incrementing the time.
adds a watchdog to our pubsub and runs it for Coder server.
If the watchdog times out, it triggers a graceful exit in `coder server` to give any provisioner jobs a chance to shut down.
c.f. #11950
I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one.
What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.
So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.
* docs: update remote docker host docs
Adds a link to external provisioners as a method to use remote docker hosts
* `make fmt`
* Update docker.md
* fmt
Annoyingly, prometheus Registry collects metrics async, which is causing our test to be racy. They also don't export enough from the Metric interface for us to replicate a synchronous collect, so we have to use Eventually to test.
Adds prometheus metrics to PGPubsub for monitoring its health and performance in production.
Related to #11950 --- additional diagnostics to help figure out what's happening
Since we run yamux over the websocket, we don't need to ping at the websocket layer because yamux has a 30 second keepalive mechanism enabled in the default config.
The RPC() function isn't called, since Listen() was modified to do this job.
Listen() has the right signature, since it returns a drpc.Conn, rather than the Agent API. That's because tailnet v2 and agent v2 are separate APIs served over the same connection.
It might be clearer to rename `Listen()` to `RPC()` but I'll save that for a different PR.
Adds a new statsReporter subcomponent of the agent, which in a later PR will be used to report stats over the v2 API.
Refactors the logic a bit so that we can handle starting and stopping stats reporting if the agent API connection drops and reconnects.
Moves monitoring of the agent v2 API connection to the yamux layer.
Present behavior monitors this at the websocket layer, and closes the websocket on completion. This can cause yamux to hit unexpected errors since the connection is closed underneath it.
This might be the cause of yamux errors that some customers are seeing

In any case, it's more graceful to close yamux first and let yamux close the underlying websocket. That should limit yamux error logging to truly unexpected/error cases.
The only downside is that the yamux `Close()` doesn't accept a reason, so if the agent becomes outdated and we close the API connection, the agent just sees the connection close without a reason. I'm not sure we log this at the agent anyway, but it would be nice. I think more accurate logging on Coderd are more important.
I've also added some logging when the monitor disconnects for reasons other than the context being canceled (e.g. agent outdated, failed pings).
* wip: commit progress for clipboard update
* wip: push more progress
* chore: finish initial version of useClipboard revamp
* refactor: update API query to use newer RQ patterns
* fix: update importers of useClipboard
* fix: increase clickable area of CodeExample
* fix: update styles for CliAuthPageView
* fix: resolve issue with ref re-routing
* docs: update comments for clarity
* wip: commit progress on clipboard tests
* chore: add extra test case for referential stability
* wip: disable test stub to avoid breaking CI
* wip: add test case for tab-switching
* feat: finish changes
* fix: improve styling for strong text
* fix: make sure period doesn't break onto separate line
* fix: make center styling more friendly to screen readers
* refactor: clean up mocking implementation
* fix: resolve security concern for clipboard text
* fix: update CodeExample to obscure text when appropriate
* fix: apply secret changes to relevant code examples
* refactor: simplify code for obfuscating text
* fix: partially revert clipboard changes
* fix: clean up page styling further
* fix: remove duplicate property identifier
* refactor: rename variables for clarity
* fix: simplify/revert CopyButton component design
* fix: update how dummy input is hidden from page
* fix: remove unused onClick handler prop
* fix: resolve unused import
* fix: opt code examples out of secret behavior
* fix: strip timezone information from a date in dau response
Timezone information is lost, so do not forward it to the client.
* fix: timezone offset should be flipped
* Make tests deterministic
This PR solves #10478 by auto-filling previously used template values in create and update workspace flows.
I decided against explicit user values in settings for these reasons:
* Autofill is far easier to implement
* Users benefit from autofill _by default_ — we don't need to teach them new concepts
* If we decide that autofill creates more harm than good, we can remove it without breaking compatibility
Fixes an issue where a MultiAgentConn isn't closed properly when the coordinator it is connected to is closed.
Since servertailnet checks whether the conn is closed before reinitializing, it is important that we check this, otherwise servertailnet can get stuck if the coordinator closes (e.g. when we switch from AGPL to PGCoordinator after decoding a license).
We're failing tests on error logs like this: https://github.com/coder/coder/actions/runs/7706053882/job/21000984583
Unfortunately, the error we hit, when the underlying connection is closed, is unexported, so we can't specifically ignore it.
Part of the issue is that agent.Close() doesn't wait for these goroutines to complete before returning, so the test harness proceeds to close the connection. This looks to our product code like the network connection failing. It would be possible to fix this, but just doesn't seem worth it for the extra insurance of catching other error logs in these tests.
Adds logging to yamux when used for tailnet client connections, e.g. CLI and wsproxy. This could be useful for debugging connection issues with tailnet v2 API.
`agentsdk` depends on `agent/proto` because it needs to get the version to dial.
Therefore, the conversion routines need to live in `agentsdk` so that we can convert to and from the Manifest.
I briefly considered refactoring the agent to only reference `proto.Manifest`, but decided against it because we might have multiple protocol versions in the future, its useful to have a protocol-independent data structure.
Fixes#8218
Removes `wsconncache` and related "is legacy?" functions and API calls that were used by it.
The only leftover is that Agents still use the legacy IP, so that back level clients or workspace proxies can dial them correctly.
We should eventually remove this: #11819
This PR updates the Agent API to use the appearance.Fetcher, which is set by entitlement code in Enterprise coderd.
This brings the agentapi into compliance with the Enterprise feature.
The new Agent API needs an interface for ServiceBanners, so this PR creates it and refactors the AGPL and Enterprise code to achieve it.
Before we depended on the fact that the HTTP endpoint was missing to serve an empty ServiceBanner on AGPL deployments, but that won't work with dRPC, so we need a real interface to call.
The original test is bugged in that it
1. creates a new AGPL coderd with a new database, so no appearance is set in the DB.
2. overwrites the agentClient so the assertion after removing the license is against the AGPL coderd
fixes#10533
refactors `codersdk` workspace agent dialer to use a single websocket connection to the tailnet v2 API for both coordination and DERPMap updates, rather than separate websockets (and the v1 API for DERPMaps).
#7439 added global caching of RBAC results.
Calls are cached based on hash(subject, object, action).
We often use dbauthz.AsSystemRestricted to handle "internal" authz calls, and these are often repeated with similar arguments and are likely to get cached.
So a transient error doing an authz check on a system function will be cached for up to a minute.
I'm just starting off with excluding context.Canceled but there's likely a whole suite of different errors we want to also exclude from the global cache.
Ok, so my last attempt at a fix here failed
https://github.com/coder/coder/actions/runs/7666229961/job/20893608286
I have a new theory: it's not the `terraform` binary that's busy, it's actually `fake_cancel.sh` and it gets marked busy when we `exec` it from the script we write.
Use of `exec` also replaces the executing code in place, rather than starting a new process/shell, so that's why the error we get says `terraform` is busy.
* docs: use coder modules in offline deployments
* fix typos
* Update offline installation instructions with Artifactory support for Coder modules
* Review suggestions
- Adds column `favorite` to workspaces table
- Adds API endpoints to favorite/unfavorite workspaces
- Modifies sorting order to return owners' favorite workspaces first
Fixes 2 related issues:
1. wsconncache had incorrect logic to test whether to send DERPMap updates, sending if the maps were equivalent, instead of if they were _not equivalent_.
2. configmaps used a bugged check to test equality between DERPMaps, since it contains a map and the map entries are serialized in random order. Instead, we avoid comparing the protobufs and instead depend on the existing function that compares `tailcfg.DERPMap`. This also has the effect of reducing the number of times we convert to and from protobuf.
fixes#10531
Adds a check for `version` on connection to the Agent API websocket endpoint. This is primarily for future-proofing, so that up-level agents get a sensible error if they connect to a back-level Coderd.
It also refactors the location of the `CurrentVersion` variables, to be part of the `proto` packages, since the versions refer to the APIs defined therein.
Adds support to `ServerTailnet` to set all peers lost before attempting to reconnect to the coordinator. In practice, this only really affects `wsproxy` since coderd has a local connection to the coordinator that only goes down if we're shutting down or change licenses.
These will show up when configuring the application along with the
client ID and everything else. Should make it easier to configure the
application, otherwise you will have to go look up the URLs in the
docs (which are not yet written).
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Rather than passing all the deployment values. This is to make it
easier to generate API keys as part of the oauth flow.
I also added and fixed a test for when the lifetime is set and the
default and expiration are unset.
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Use TSMP ping for reachability, but leave Disco ping for when we call Ping() since we often use that to determine whether we have a direct connection.
Also adds unit tests to make sure Ping() returns direct connection vs DERP correctly.
Adds support to Coordination to call SetAllPeersLost() when it is closed. This ensure that when we disconnect from a Coordinator, we set all peers lost.
This covers CoderSDK (CLI client) and Agent. Next PR will cover MultiAgent (notably, `wsproxy`).
We don't have visibility into some feature usage, so this adds a lot of fields missing from `database.Template` to `telemetry.Template`. Deprecation message is not collected, just whether it's set or not.
adds setAllPeersLost to the configMaps subcomponent of tailnet.Conn --- we'll call this when we disconnect from a coordinator so we'll eventually clean up peers if they disconnect while we are retrying the coordinator connection (or we don't succeed in reconnecting to the coordinator).
This one is huge, and I'm sorry.
The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.
There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
Otherwise if for example you try to run `yarn storybook` it complains
that the version of Node is wrong.
`pnpm storybook` works fine and that is probably what we should
actually use, but as long as we are installing Yarn and not restricting
its use we might as well make it use the right version of Node.
* fix: doing a noop patch to templates resulted in 404
The patch response did not include the template. The UI required the
template to be returned to form the new page path
null is more explicit, and harder to make occur by mistake.
* fix: allow ports in wildcard url configuration
This just forwards the port to the ui that generates urls.
Our existing parsing + regex already supported ports for
subdomain app requests.
Fixes flake seen here, I think
https://github.com/coder/coder/actions/runs/7565915337/job/20602500818
golang's file processing is complex, and in at least some cases it can return from a file.Close() call without having actually closed the file descriptor.
If we're holding open the file descriptor of an executable we just wrote, and try to execute it, it will fail with "text file busy" which is what we have seen.
So, to be extra sure, I've avoided the standard library and directly called the syscalls to open, write, and close the file we intend to use in the test.
I've also added some more logging so if it's some issue of multiple tests writing to the same location, the we might have a chance to see it.
wsproxy also needs to be updated to use tailnet v2 because the `tailnet.Conn` stores peers by ID, and the peerID was not being carried by the JSON protocol. This adds a query param to the endpoint to conditionally switch to the new protocol.
Fixes a flake seen here: https://github.com/coder/coder/actions/runs/7541558190/job/20528545916
```
=== FAIL: enterprise/provisionerd TestRemoteConnector_Fuzz (0.06s)
t.go:84: 2024-01-16 12:32:27.024 [info] connector: failed provisioner authentication remote_addr=[::1]:45138 ...
error= failed to receive jobID:
github.com/coder/coder/v2/enterprise/provisionerd.(*remoteConnector).authenticate
/home/runner/actions-runner/_work/coder/coder/enterprise/provisionerd/remoteprovisioners.go:438
- bufio.Scanner: token too long
t.go:84: 2024-01-16 12:32:27.024 [debu] connector: closed connection remote_addr=[::1]:45138 error=<nil>
remoteprovisioners_test.go:209:
Error Trace: /home/runner/actions-runner/_work/coder/coder/enterprise/provisionerd/remoteprovisioners_test.go:209
Error: "2992256" is not less than "2097152"
Test: TestRemoteConnector_Fuzz
Messages: should not allow more than 1 MiB
```
This was an attempt to test that malicious actors can't abuse our authentication protocol to make us allocate a bunch of memory.
However, the test asserted on the number of bytes sent by the fuzzer, not the number of bytes read (& allocated) by the service. The former is affected by network queue sizes and is thus flaky without actively managing the socket queues, which I don't think we want to do.
In actual practise, the thing that matters is how much memory the bufio Scanner allocates. By inspection, the scanner will allocate up to 64k, and testing this is true devolves into testing the go standard library, which I don't think is worth doing.
So... let's just drop the assertion because
a) its flaky,
b) it doesn't test what we actually want to test,
c) the behavior we actually care about is part of the standard library.
- Adds a new query BatchUpdateLastUsedAt
- Adds calls to BatchUpdateLastUsedAt in app stats handler upon flush
- Passes a stats flush channel to apptest setup scaffolding and updates unit tests to assert modifications to LastUsedAt.
Cli errors are pretty formatted. This handles nested pretty types. Before it found the first error it could understand and return that. Now it will print the full error stack with more information.
To prevent information loss, a "[Trace=...]" was added to capture some extra error context for debugging.
- refactors`getFormHelpers` to accept an options object
- adds a `maxLength` option which will display a message and character counter for fields with length limits
- set `maxLength` option for template description fields
Merging since Mark is out.
* chore: add optional coder_app to faq
* applied Atif's suggestions
* make fmt again
---------
Co-authored-by: kirby <kirby@coder.com>
Co-authored-by: Stephen Kirby <58410745+stirby@users.noreply.github.com>
The `SingleTailnet` behavior only checked to see if the `MultiAgent` was
closed, but the websocket error was not being propogated into the
`MultiAgent`, causing it to never be swapped for a new working one.
Fixes https://github.com/coder/coder/issues/11401
Before:
```
Coder Workspace Proxy v0.0.0-devel+85ff030 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001
View the Web UI: http://127.0.0.1:3001
==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:11:56.376 [warn] net.workspace-proxy.servertailnet: broadcast server node to agents ...
error= write message:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*remoteMultiAgentHandler).writeJSON
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:524
- failed to write msg: WebSocket closed: failed to read frame header: EOF
```
After:
```
Coder Workspace Proxy v0.0.0-devel+12f1878 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001
View the Web UI: http://127.0.0.1:3001
==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:26:38.545 [warn] net.workspace-proxy.servertailnet: multiagent closed, reinitializing
2024-01-04 20:26:38.546 [erro] net.workspace-proxy.servertailnet: reinit multi agent ...
error= dial coordinate websocket:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
- failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:38.587 [erro] net.workspace-proxy.servertailnet: reinit multi agent ...
error= dial coordinate websocket:
github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
/home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
- failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refusedhandshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:40.446 [info] net.workspace-proxy.servertailnet: successfully reinitialized multiagent agents=0 took=1.900892615s
```
It causes the sign-in page to reload whenever a user enters a page or changes the window's focus. This is happening because when the "user" fetch is made, the server returns an error, making the react-query mark the data as stale and try to load it whenever possible.
* Shows the overall report error at the top of the page, if present.
* Shows workspaceproxy errors above warnings inside the corresponding element, if present.
* Improves unregistered proxy status
Adds a nodeUpdater component, which serves a similar role to configMaps, but tracks information from tailscale going out to the coordinator as node updates. This first PR just handles netInfo, subsequent PRs will
handle DERP forced websockets, endpoints, and addresses.
* docs: remove cloud logos from 1-click install
They were looking good and are not adding much value.
* Delete docs/images/install/render.png
* Delete docs/images/install/ec2.svg
* Delete docs/images/install/eks.svg
* Delete docs/images/install/fly.io.svg
* Delete docs/images/install/gce.svg
* Delete docs/images/install/heroku.svg
* Delete docs/images/install/railway.svg
This test case fails with an error log, showing "context canceled" when trying to send an acquired job to an in-mem provisionerd.
https://github.com/coder/coder/runs/20331469006
In this case, we don't want to supress this error, since it could mean that we acquired a job, locked it in the database, then failed to send it to a provisioner.
(We also don't want to mark the job as failed because we don't know whether the job made it to the provisionerd or not --- in the failed test you can see that the job is actually processed just fine).
The reason we got context canceled is because the API was shutting down --- we don't want provisionerdserver to abruptly stop processing job stuff as the API shuts down as this will leave jobs in a bad state. This PR fixes up the use of contexts with provisionerdserver and the associated drpc service calls.
Work in progress on a subcomponent of the Conn which will handle configuring the wireguard engine on changes. I've implemented setAddresses as the simplest case and added unit tests of the reconfiguration loop.
Besides making the code easier to test and understand, the goal is for this component to handle disconnect and loss updates about peers, and thereby, implement the v2 Tailnet API.
Further PRs will handle peer updates, status updates, and net info updates.
Then, after the subcomponent is implemented and tested, I will refactor Conn to use it instead of the current monolithic architecture.
It looks like we updated mockgen to use Uber's fork, but the flake still
pointed to a nixos-unstable commit containing the old mockgen resulting
in an error like:
missing go.sum entry for module providing package github.com/golang/mock/mockgen/model
Fixes#11451
A refactor of the Agent API passes metrics as protobufs, which include pointers to label name/value pairs. The aggregator tested for sameness by doing a shallow compare of label values, which for different stats reports would compare unequal because the pointers would be different.
This fix does a deep compare.
While testing I also noted that we neglect to compare template names. This is unlikely to have caused any issue in practice, since the combination of username/workspace is unique, but in the context of comparing metric labels we should do the comparison.
If a user creates a workspace, deletes it, then recreates from a different template, we could in principle have reported incorrect stats for the old template.
Part of #10676
- Adds a health section for provisioner daemons (mostly cannibalized from the Workspace Proxy section)
- Adds a corresponding storybook entry for provisioner daemons health section
- Fixed an issue where dismissing the provisioner daemons warnings would result in a 500 error
- Adds provisioner daemon error codes to docs
* assert provisioner daemon version and api_version in unit tests
* add build info in HTTP header, extract codersdk.BuildVersionHeader
* add api_version to codersdk.ProvisionerDaemon
* testutil.MustString -> testutil.MustRandString
Refactors the code that handles monitoring an agent websocket with pings and updating the connection times in the DB.
Consolidates v1 and v2 agent APIs under the same code for this.
One substantive change (not _just_ a refactor) is that I've made it so that we actually disconnect if the agent fails to respond to our pings, rather than the old behavior where we would update the database, but not actually tear down the websocket.
We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998
I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.
The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.
I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.
This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.
fixes: #11294
Refactors our DRPC service definitions slightly.
In the previous version, I inserted the RPCs from the tailnet proto directly into the Agent service. This makes things hard to deal with because DRPC then generates a new set of methods with new interfaces with the `DRPCAgent_` prefixed. Since you can't have a single method that takes different argument types, we couldn't reuse the implementation of those RFCs without a lot of extra classes and pass-thru methods.
Instead, the "right" way to do it is to integrate at the DRPC layer. So, we have two DRPC services available over the Agent websocket, and register them both on the DRPC `mux`.
Since the tailnet proto RPC service is now for both clients and agents, I renamed some things to clarify and shorten.
This PR also removes the `TailnetAPI` implementation from the `agentapi` package, and the next PR in the stack replaces it with the implementation from the `tailnet` package.
* Add database tables for OAuth2 applications
These are applications that will be able to use OAuth2 to get an API key
from Coder.
* Add endpoints for managing OAuth2 applications
These let you add, update, and remove OAuth2 applications.
* Add frontend for managing OAuth2 applications
* feat: enable csrf token header
* Exempt external auth requets
* ensure dev server bypasses CSRF
* external auth is just get requests
* Add some more routes
* Extra assurance nothing breaks
* chore: fix flake, use time closer to actual test
The tests were queued, and the autostart time was being set
to the time the table was created, not when the test was actually
being run. This diff was causing failures in CI
* Adds UpdateProvisionerDaemonLastSeenAt
* Adds heartbeat to provisioner daemons
* Inserts provisioner daemons to database upon start
* Ensures TagOwner is an empty string and not nil
* Adds COALESCE() in idx_provisioner_daemons_name_owner_key
* chore: add unit test to excercise flake
* Implement a *fix for cron stop() before run()
This fix still has a race condition. I do not see a clean solution
without modifying the cron libary. The cron library uses a boolean
to indicate running, and that boolean needs to be set to "true"
before we call "Close()". Or "Close()" should prevent "Run()"
from doing anything.
In either case, this solves the issue for a niche unit test bug
in which the test finishes, calling Close(), before there was
an oppertunity to start the go routine. It probably isn't worth
a lot of time investment, and this fix will suffice
* setup manifest
* added okta guide from steven M
* improved index by adding children
* changed icon to notes.svg
* added meta guide, fixed profile photo fmt
closes#10532
Adds v2 support to the /coordinate endpoint via a query parameter.
v1 already has test cases, and we haven't implemented v2 at the client yet, so the only new test case is an unsupported version.
Part of #10532
Adds a tailnet ClientService that accepts a net.Conn and serves v1 or v2 of the tailnet API.
Also adds a DRPCService that implements the DRPC interface for the v2 API. This component is within the ClientService, but needs to be reusable and exported so that we can also embed it in the Agent API.
Finally, includes a NewDRPCClient function that takes a net.Conn and runs dRPC in yamux over it on the client side.
Part of #10532
DRPC transport over yamux and in-mem pipes was previously only used on the provisioner APIs, but now will also be used in tailnet. Moved to subpackage of codersdk to avoid import loops.
Fixes flake https://github.com/coder/coder/runs/19639217635
AGPL coordinator used to process node updates for single_tailnet synchronously, but it's been refactored to process async, so in this test we need to wait for it to be processed.
This sends the email the license was issued to, and whether or not it's a trial in the telemetry payload. It's a bit janky since the license parsing is all enterprise licensed.
Addresses the issue in #11185 for the StringMap datatype.
There are other slice data types in our database package that also need to be fixed, but that'll be a different PR
Adds column api_version to the provisioner_daemons table.
This is distinct from the coderd version, and is used to handle breaking changes in the provisioner daemon API.
* added sharkymark FAQs page
* make fmt
* fixed typos for link
* changed FAQs icon to (i)
* satisfied review
* make fmt
* added docs links for coder_app, CODER_ACCESS_URL
* removed mentions of mark
* fixed some minor code formatting issues
* fixed numbered bullets rendering, make fmt
if:(github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request_target'
uses:contributor-assistant/github-action@v2.3.1
uses:contributor-assistant/github-action@v2.3.2
env:
GITHUB_TOKEN:${{ secrets.GITHUB_TOKEN }}
# the below token should have repo scope and must be manually added by you in the repository's secret
curl -X POST -H 'Content-type: application/json' -d '{"msg":"Broken links found in the documentation. Please check the logs at ${{ env.LOGS_URL }}"}' ${{ secrets.DOCS_LINK_SLACK_WEBHOOK }}
echo"DROP DATABASE IF EXISTS migrate_test_$${COMMIT_FROM}; CREATE DATABASE migrate_test_$${COMMIT_FROM};"| psql 'postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable'
go run ./scripts/migrate-test/main.go --from="$$COMMIT_FROM" --to="$$COMMIT_TO" --postgres-url="postgresql://postgres:postgres@localhost:5432/migrate_test_$${COMMIT_FROM}?sslmode=disable"
# NOTE: we set --memory to the same size as a GitHub runner.
[Coder](https://coder.com) enables organizations to set up development environments in the cloud. Environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and are automatically shut down when not in use to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads that are most beneficial to them.
[Coder](https://coder.com) enables organizations to set up development environments in their public or private cloud infrastructure. Cloud development environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and are automatically shut down when not in use to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads that are most beneficial to them.
- Define development environments in Terraform
- Define cloud development environments in Terraform
- EC2 VMs, Kubernetes Pods, Docker Containers, etc.
- Automatically shutdown idle resources to save on costs
- Onboard developers in seconds instead of days
@@ -44,7 +43,7 @@
## Quickstart
The most convenient way to try Coder is to install it on your local machine and experiment with provisioning development environments using Docker (works on Linux, macOS, and Windows).
The most convenient way to try Coder is to install it on your local machine and experiment with provisioning cloud development environments using Docker (works on Linux, macOS, and Windows).
```
# First, install Coder
@@ -53,8 +52,8 @@ curl -L https://coder.com/install.sh | sh
# Start the Coder server (caches data in ~/.cache/coder)
coder server
# Navigate to http://localhost:3000 to create your initial user
# Create a Docker template, and provision a workspace
# Navigate to http://localhost:3000 to create your initial user,
# create a Docker template, and provision a workspace
```
## Install
@@ -68,11 +67,11 @@ Releases.
curl -L https://coder.com/install.sh | sh
```
You can run the install script with `--dry-run` to see the commands that will be used to install without executing them. You can modify the installation process by including flags. Run the install script with `--help` for reference.
You can run the install script with `--dry-run` to see the commands that will be used to install without executing them. Run the install script with `--help` for additional flags.
> See [install](https://coder.com/docs/v2/latest/install) for additional methods.
Once installed, you can start a production deployment<sup>1</sup> with a single command:
Once installed, you can start a production deployment with a single command:
```shell
# Automatically sets up an external access URL on *.try.coder.app
@@ -82,8 +81,6 @@ coder server
coder server --postgres-url <url> --access-url <url>
```
> <sup>1</sup> For production deployments, set up an external PostgreSQL instance for reliability.
Use `coder --help` to get a list of flags and environment variables. Use our [install guides](https://coder.com/docs/v2/latest/install) for a full walkthrough.
## Documentation
@@ -96,19 +93,13 @@ Browse our docs [here](https://coder.com/docs/v2) or visit a specific section be
- [**Administration**](https://coder.com/docs/v2/latest/admin): Learn how to operate Coder
- [**Enterprise**](https://coder.com/docs/v2/latest/enterprise): Learn about our paid features built for large teams
## Community and Support
## Support
Feel free to [open an issue](https://github.com/coder/coder/issues/new) if you have questions, run into bugs, or have a feature request.
[Join our Discord](https://discord.gg/coder) to provide feedback on in-progress features, and chat with the community using Coder!
## Contributing
Contributions are welcome! Read the [contributing docs](https://coder.com/docs/v2/latest/CONTRIBUTING) to get started.
Find our list of contributors [here](https://github.com/coder/coder/graphs/contributors).
## Related
## Integrations
We are always working on new integrations. Feel free to open an issue to request an integration. Contributions are welcome in any official or community repositories.
@@ -116,10 +107,18 @@ We are always working on new integrations. Feel free to open an issue to request
- [**VS Code Extension**](https://marketplace.visualstudio.com/items?itemName=coder.coder-remote): Open any Coder workspace in VS Code with a single click
- [**JetBrains Gateway Extension**](https://plugins.jetbrains.com/plugin/19620-coder): Open any Coder workspace in JetBrains Gateway with a single click
- [**Dev Container Builder**](https://github.com/coder/envbuilder): Build development environments using `devcontainer.json` on Docker, Kubernetes, and OpenShift
- [**Module Registry**](https://registry.coder.com): Extend development environments with common use-cases
- [**Kubernetes Log Stream**](https://github.com/coder/coder-logstream-kube): Stream Kubernetes Pod events to the Coder startup logs
- [**Self-Hosted VS Code Extension Marketplace**](https://github.com/coder/code-marketplace): A private extension marketplace that works in restricted or airgapped networks integrating with [code-server](https://github.com/coder/code-server).
### Community
- [**Provision Coder with Terraform**](https://github.com/ElliotG/coder-oss-tf): Provision Coder on Google GKE, Azure AKS, AWS EKS, DigitalOcean DOKS, IBMCloud K8s, OVHCloud K8s, and Scaleway K8s Kapsule with Terraform
- [**Coder GitHub Action**](https://github.com/marketplace/actions/update-coder-template): A GitHub Action that updates Coder templates
- [**Various Templates**](./examples/templates/community-templates.md): Hetzner Cloud, Docker in Docker, and other templates the community has built.
- [**Coder Template GitHub Action**](https://github.com/marketplace/actions/update-coder-template): A GitHub Action that updates Coder templates
## Contributing
We are always happy to see new contributors to Coder. If you are new to the Coder codebase, we have
[a guide on how to get started](https://coder.com/docs/v2/latest/CONTRIBUTING). We'd love to see your
// Read reads the file to a string. All leading and trailing whitespace
// is removed.
func(fFile)Read()(string,error){
iff==""{
return"",xerrors.Errorf("empty file path")
}
byt,err:=read(string(f))
returnstring(byt),err
returnstrings.TrimSpace(string(byt)),err
}
// open opens a file in the configuration directory,
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.