Remember to search #kjøremiljø-support in Slack for your error message as well.
okctl forward postgres works, but connecting to database does not
Symptom
You are running okctl forward postgres, but when attempting to connect to the database on localhost, your SQL client states that the authentication is invalid ("wrong username/password" or something similar).
Example
$ okctl forward postgres \
--cluster-declaration my-cluster-dev.yaml \
--name my-database \
--username my-db-user \
--password-file /tmp/pw.txt \
--aws-credentials-type aws-profile
In another terminal, we can see the error by running:
$ psql -U my-db-user -h localhost -d my-database
psql: error: connection to server at "localhost" (127.0.0.1),
port 5432 failed: ERROR: pgbouncer cannot connect to server
Issue
okctl forward postgres does not work when the --username parameter value is the same as the database's root user, which is stated in the cluster declaration. So in this example, if my-cluster-dev.yaml -> databases -> postgres -> my-database -> user is equal to my-db-user, which is the same username chosen in the okctl forward postgres command, connecting to the database will fail.
Solution
Choose another username for the --username parameter in the okctl forward postgres command.
Continuing the example above, we would solve the issue by running
$ okctl forward postgres \
--cluster-declaration my-cluster-dev.yaml \
--name my-database \
--username foo \
--password-file /tmp/pw.txt \
--aws-credentials-type aws-profile
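Assuming the password file is unchanged, you would then connect from the other terminal using the username you gave to okctl, something like:
$ psql -U foo -h localhost -d my-database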
Background
https://oslokommune.slack.com/archives/CV9EGL9UG/p1663591208723989
Okctl complains that "state is locked"
Although okctl is supposed to release locks when it completes or fails, this sometimes doesn't happen. The next time you run an okctl command, you can get the following error:
$ okctl delete cluster -f cluster-dev.yaml
Error: acquiring lock: state is locked
If you know you're working alone and no one else is running okctl against the cluster at the same time, you can release the lock like this:
$ okctl maintenance state-release-lock
releasing state lock
successfully released lock
Logging doesn't work
It might be because your cluster is somehow running Loki 2.4.x. Check if this is the case by running:
okctl venv ... # fill in parameters matching your cluster
kubectl -n monitoring describe statefulset loki | grep -i image
Output should be like this:
Image: grafana/loki:2.1.0
or perhaps grafana/loki:2.3.0, which also works. If the version is 2.4.0 or higher, this is known not to work (see https://github.com/oslokommune/okctl/pull/979).
One way to fix this is by reinstalling Loki. In your cluster manifest, set loki: false and run the command below.
Note! This will erase your existing logs!
okctl apply cluster -f cluster-manifest.yaml
Then set loki: true and re-run the above command.
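For reference, a minimal sketch of the relevant part of the cluster manifest, assuming the loki flag sits under integrations (check your own manifest for the exact structure):
integrations:
  # Set to false for the first apply, then back to true for the second
  loki: false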
Failed to create pod sandbox
When running kubectl get event or kubectl describe pod ..., you might see the warning:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e24268322e55c8185721f52df6493684f6c2c3bf4fd59c9c121fd4cdc894579f" network for pod "my-deployment-59f5f68b58-c89wx": networkPlugin
cni failed to set up pod "my-deployment-59f5f68b58-c89wx_my-namespace" network: add cmd: failed to assign an IP address to container
This is just a warning; it is expected and can be ignored. See https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html (search for "sandbox").
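If you want to confirm that the pod came up despite the warning, you can check its status (the pod and namespace names are taken from the example above):
kubectl -n my-namespace get pod my-deployment-59f5f68b58-c89wx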
The Loki pod has stopped running, and won't start again
There can be multiple reasons, but one we have seen is too high memory usage. In one instance this caused a node to become unavailable, and then Loki would not start up again. To verify whether this is the issue, run
kubectl get node
If the output looks like this:
NAME STATUS
........
ip-xxxxxxxxxxxx.eu-west-1.compute.internal NotReady,SchedulingDisabled
and you have no nodes with status Ready, it means there is nowhere for Loki to run. To make some space for Loki, you can manually increase the number of available nodes.
Go to AWS console -> EC2 -> Auto scaling groups.
Find the three node groups named eksctl-okctl-mycluster-nodegroup-ng-generic-1-20-1b-NodeGroup-xxxxxxxxxxxxxxxx. If you know which one runs Loki, you can select only that one. If not, increase the size of all node groups. For each node group, go to "Group details", click "Edit" and increase Desired capacity and Minimum capacity by 1 (from 1 to 2, for instance).
This should give Loki some space to get working again.
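If you prefer the command line, the same change can be made with the AWS CLI. A sketch, assuming you look up the real Auto Scaling group names first (the group name below is a placeholder):
# List the Auto Scaling group names in the account
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].AutoScalingGroupName"
# Bump capacity for the relevant node group
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <nodegroup-asg-name> \
  --min-size 2 \
  --desired-capacity 2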
ArgoCD is not working, but I need to deploy stuff
If ArgoCD for whatever reason doesn't work, you can apply changes yourself, the same way ArgoCD does it.
- Install Kustomize from https://kustomize.io/
- Run:
okctl venv -c my-cluster.yaml
kustomize build infrastructure/applications/my-app/overlays/my-cluster | kubectl apply -f -
I get an error message containing "ssh" and/or "knownhosts"
Possible solution: known_hosts contains something invalid
mv ~/.ssh/known_hosts{,.bak}
ssh-keyscan github.com > ~/.ssh/known_hosts
Possible solution: SSH agent doesn't know about private key for your IAC repository
ssh-add ~/.ssh/<relevant private key>
When running okctl upgrade, I get Body: Not found
This can happen if you are running an upgrade at the same time we're in the middle of uploading a release. The solution is to just wait for 10 minutes and try again.
Specifically, if there are any instances of "goreleaser" running on https://github.com/oslokommune/okctl-upgrade/actions, they must complete before you attempt upgrading again.
This bug is tracked in issue: https://trello.com/c/xYHF2vVe/596-okctl-upgrade-fails-if-run-at-the-same-time-as-were-making-a-release
I get an error message with "getting existing cluster <cluster-name>: not found"
This can be an indication that your state.db was not found in AWS S3. If you have just updated okctl from 0.0.79 or lower, you need to:
- Run okctl maintenance state-upload <path-to-state.db> to move the state.db file to a remote location. The state.db file usually resides in infrastructure/<cluster-name>/state.db.
- Delete the relevant state.db file, then commit and push the changes.
See release notes for 0.0.80
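Put together, the steps might look like this (the paths are illustrative; adjust them to your cluster name):
okctl maintenance state-upload infrastructure/my-cluster/state.db
git rm infrastructure/my-cluster/state.db
git commit -m "Move state.db to remote storage"
git push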
Okctl keeps trying to do the Github Device Authentication Flow while trying to do <any action>
This is known to happen if pass init <gpg-key-id> has not been run after installing pass.
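To fix it, find the ID of your GPG key and initialize pass with it. A sketch (the key ID below is a placeholder):
# Find your key ID, e.g. 3AA5C34371567BD2 on the "sec" line
gpg --list-secret-keys --keyid-format long
# Initialize the password store with that key
pass init 3AA5C34371567BD2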
okctl forward postgres fails on applying security group policy
The command okctl forward postgres fails with an error when applying the security group policy.
This is due to a bug.
Workaround:
- Open ~/.okctl/conf.yml
- Make sure the username is set to your username, like this:
user:
  id: ... # (Don't edit this)
  username: ooo123456 # Replace with your username
On Okctl delete cluster, some resources are not deleted (automatic deletion is coming in a later version)
Workaround: manually delete the remaining resources, as described in Delete cluster.
It is recommended to delete the infrastructure directory in your IAC repository as the last manual step.
Okctl create cluster: Create identity pool fails / Re-create cluster within short timespan fails
If you do the following:
- Create a cluster
- Delete it
- Create a new cluster with the same domain name (e.g. whatever.oslo.systems)
This might fail if you do these steps within 15 minutes. This is due to DNS resolvers caching NS server records.
More details: https://github.com/oslokommune/okctl/pull/231
Workaround: Wait for up to 15 minutes before creating the cluster again.
15 minutes is the TTL (time to live, i.e. cache expiry) of the NS record. You can see this value in Route 53 -> Hosted zones -> [Your domain] -> [NS record for your top domain] -> Edit -> See the TTL field.
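You can also check the TTL from the command line with dig (using the example domain from above):
dig NS whatever.oslo.systems
The number in the second column of the ANSWER section is the TTL in seconds.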
Okctl create cluster: Failed to create external secrets helm chart
You get the following error (shortened):
.. creating: external-secrets (elapsed: 1 second 76 microseconds)WARN[0007] failed to process request, because: failed to create external secrets helm chart: failed to update repository: failed to fetch https://kubernetes-charts-incubator.storage.googleapis.com/index.yaml : 403 Forbidden endpoint=create service=helm/externalSecrets
✓ creating
Error:
....
request failed with Internal Server Error, because: failed to create external secrets helm chart: failed to update repository: failed to fetch https://kubernetes-charts-incubator.storage.googleapis.com/index.yaml : 403 Forbidden
This happens because Helm changed the URLs of their repositories. Open your ~/.okctl/helm/repositories.yaml and update the URLs as follows:
Name | Old Location | New Location |
---|---|---|
stable | https://kubernetes-charts.storage.googleapis.com | https://charts.helm.sh/stable |
incubator | https://kubernetes-charts-incubator.storage.googleapis.com | https://charts.helm.sh/incubator |
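After the edit, the relevant entries in ~/.okctl/helm/repositories.yaml should look roughly like this; this is a sketch of the standard Helm repository file format with other fields omitted, not the exact okctl-generated file:
repositories:
- name: stable
  url: https://charts.helm.sh/stable
- name: incubator
  url: https://charts.helm.sh/incubator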
Okctl apply cluster: Always prompts for GitHub machine authentication, even after it has been set
There is an issue with some versions of pinentry-curses where sometimes the prompt to enter a password for your PGP key will not appear. We store the authentication token in a keyring, and since it cannot be decrypted without the password Okctl just skips ahead. The solution is to export the following environment variable:
GPG_TTY=$(tty)
export GPG_TTY
This can be done in your current shell before you run Okctl commands, or it can be put in your .bashrc or similar to ensure you will always be prompted for your encryption key password. A bit more detailed explanation can be found on StackOverflow.
Okctl is expecting an oslokommune-boundary to be present, but it's missing
You're probably trying to create an okctl cluster on a Crayon account. We've yet to adapt okctl to work on the new
accounts, so until then you can run the following command to create a dummy policy in the new account.
aws iam create-policy \
--policy-name oslokommune-boundary \
--path /oslokommune/ \
--policy-document "{\"Version\": \"2012-10-17\", \"Statement\": [ {\"Sid\": \"AllowAccessToAllServices\", \"Effect\": \"Allow\", \"NotAction\": [\"iam:CreateUser\"], \"Resource\": \"*\"}]}"
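Afterwards, you can verify that the dummy policy exists (a sketch, using the path given to the create command above):
aws iam list-policies --path-prefix /oslokommune/ \
  --query "Policies[].PolicyName"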