Remember to search #kjøremiljø-support in Slack for your error message as well.

okctl forward postgres works, but connecting to database does not

Symptom

You are running okctl forward postgres, but when attempting to connect to the database on localhost, your SQL client states that the authentication is invalid ("wrong username/password" or something similar).

Example

$ okctl forward postgres \
--cluster-declaration my-cluster-dev.yaml \
--name my-database \
--username my-db-user \
--password-file /tmp/pw.txt \
--aws-credentials-type aws-profile

In another terminal, we can see the error by running:

$ psql -U my-db-user -h localhost -d my-database
psql: error: connection to server at "localhost" (127.0.0.1),
port 5432 failed: ERROR:  pgbouncer cannot connect to server

Issue

okctl forward postgres does not work when the --username parameter value is the same as the database's root user, which is specified in the cluster declaration. So in this example, if my-cluster-dev.yaml -> databases -> postgres -> my-database -> user equals my-db-user, which is the same username chosen in the okctl forward postgres command, connecting to the database will fail.
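For illustration, the relevant part of the cluster declaration could look something like this (a minimal sketch; the namespace field is included only for completeness and is an assumption):

databases:
  postgres:
  - name: my-database        # matches --name in the forward command
    user: my-db-user         # root user; must NOT equal --username in okctl forward postgres
    namespace: my-namespace  # assumed field, for illustration only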

Solution

Choose another username for the --username parameter in the okctl forward postgres command.

Continuing the example above, we would solve the issue by running

$ okctl forward postgres \
--cluster-declaration my-cluster-dev.yaml \
--name my-database \
--username foo \
--password-file /tmp/pw.txt \
--aws-credentials-type aws-profile

Background

https://oslokommune.slack.com/archives/CV9EGL9UG/p1663591208723989

Okctl complains that "state is locked"

Although okctl is supposed to release its state lock when a command completes or fails, this sometimes doesn't happen. The next time you run an okctl command, you can get the following error:

$ okctl delete cluster -f cluster-dev.yaml 
Error: acquiring lock: state is locked

If you know you're working alone and no one else is running okctl against the same cluster at the same time, you can release the lock like this:

$ okctl maintenance state-release-lock
releasing state lock
successfully released lock

Logging doesn't work

It might be because your cluster is somehow running Loki 2.4.x. Check if this is the case by running:

okctl venv ... # fill in parameters matching your cluster

kubectl -n monitoring describe statefulset loki | grep -i image

Output should be like this:

Image:      grafana/loki:2.1.0

or perhaps grafana/loki:2.3.0, which also works. If the version is 2.4.0 or higher, this is known not to work (see https://github.com/oslokommune/okctl/pull/979).

One way to fix this is by reinstalling Loki. In your cluster manifest, set loki: false and run the command below.

Note! This will erase your existing logs!

okctl apply cluster -f cluster-manifest.yaml

Then set loki: true and re-run the above command.
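For reference, the toggle could look like this in the cluster manifest (a sketch; it assumes loki lives under the integrations block of your declaration):

integrations:
  loki: false   # run okctl apply cluster, then set back to true and apply again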

Failed to create pod sandbox

When running kubectl get event or kubectl describe pod ..., you might see the warning

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e24268322e55c8185721f52df6493684f6c2c3bf4fd59c9c121fd4cdc894579f" network for pod "my-deployment-59f5f68b58-c89wx": networkPlugin
cni failed to set up pod "my-deployment-59f5f68b58-c89wx_my-namespace" network: add cmd: failed to assign an IP address to container

This is just a warning; it is expected and can be ignored. See https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html (search for "sandbox").

The Loki pod has stopped running, and won't start again

There can be multiple reasons, but one is excessive memory usage. We have seen one instance where this caused a node to become unavailable, after which Loki wouldn't start up again. To verify if this is the issue, run

kubectl get node

If the output looks like this:

NAME                                                  STATUS
........
ip-xxxxxxxxxxxx.eu-west-1.compute.internal            NotReady,SchedulingDisabled

and you have no nodes with status Ready, it means Loki has nowhere to be scheduled. To make room for Loki, you can manually increase the number of available nodes.

Go to AWS console -> EC2 -> Auto scaling groups.

Find the three node groups named eksctl-okctl-mycluster-nodegroup-ng-generic-1-20-1b-NodeGroup-xxxxxxxxxxxxxxxx. If you know which one runs Loki, you can select only that one. If not, increase the size of all node groups. You can do this by going to each node group, and under "Group details" clicking "Edit" and increasing Desired capacity and Minimum capacity by 1 (from 1 to 2, for instance).

This should give Loki some space to get working again.
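If you prefer the command line, the same change can be made with the AWS CLI (a sketch; the group name below is a placeholder, look up the exact name first):

# List the auto scaling groups to find the exact names
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].AutoScalingGroupName"

# Increase minimum and desired capacity for the relevant node group
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eksctl-okctl-mycluster-nodegroup-ng-generic-1-20-1b-NodeGroup-xxxxxxxxxxxxxxxx \
  --min-size 2 \
  --desired-capacity 2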

ArgoCD is not working, but I need to deploy stuff

If ArgoCD for whatever reason doesn't work, you can apply changes yourself, the same way ArgoCD does it.

okctl venv -c my-cluster.yaml
kustomize build infrastructure/applications/my-app/overlays/my-cluster | kubectl apply -f -

I get an error message containing "ssh" and/or "knownhosts"

Some examples:

Error: synchronizing declaration with state: reconciling nameserver delegation: initiating dns zone delegation: staging repository: cloning repository: knownhosts: /home/x/.ssh/known_hosts:10: illegal base64 data at input byte 140
Error: synchronizing declaration with state: reconciling nameserver delegation: initiating dns zone delegation: staging repository: cloning repository: ssh: handshake failed: knownhosts: key mismatch
Error: synchronizing declaration with state: reconciling nameserver delegation: initiating dns zone delegation: staging repository: cloning repository: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Possible solution: known_hosts contains something invalid

mv ~/.ssh/known_hosts{,.bak}
ssh-keyscan github.com > ~/.ssh/known_hosts

Possible solution: SSH agent doesn't know about private key for your IAC repository

ssh-add ~/.ssh/<relevant private key>
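To verify that the key is loaded and that GitHub accepts it, you can run the following (a successful run greets you with your GitHub username):

# List keys currently loaded in the SSH agent
ssh-add -l

# Test SSH authentication against GitHub
ssh -T git@github.com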

When running okctl upgrade, I get Body: Not found

Example

Error: upgrading: parsing upgrade binaries: validating release: fetching checksums: downloading checksum file: http call did not return status OK. URL: https://github.com/oslokommune/okctl-upgrade/releases/download/untagged-f95a8beee472bebe3a19/okctl-upgrade-checksums.txt. Status: 404 Not Found. Body: Not Found

This can happen if you are running an upgrade at the same time as we're in the middle of uploading a release. The solution is to wait about 10 minutes and try again.

Specifically, if there are any instances of "goreleaser" running on https://github.com/oslokommune/okctl-upgrade/actions, they must complete before you attempt upgrading again.
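If you have the GitHub CLI installed and authenticated, you can check for in-progress runs from the terminal (a sketch; the Actions page above shows the same information):

# Show the most recent workflow runs in the okctl-upgrade repository;
# wait until none of them are in progress before upgrading again
gh run list --repo oslokommune/okctl-upgrade --limit 10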

This bug is tracked in issue: https://trello.com/c/xYHF2vVe/596-okctl-upgrade-fails-if-run-at-the-same-time-as-were-making-a-release


I get an error message with "getting existing cluster <cluster-name>: not found"

This can be an indication that your state.db was not found in AWS S3. If you have just updated okctl from 0.0.79 or lower, you need to:

  1. Run okctl maintenance state-upload <path-to-state.db> to move the state.db file to a remote location.
    The state.db usually resides in /infrastructure/<cluster-name>/state.db
  2. Delete the relevant state.db file, commit and push the changes.
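For illustration, assuming a hypothetical cluster named my-cluster-dev and that you run the commands from the root of your IAC repository, the steps could look like this:

# Upload the local state file to the remote state location
okctl maintenance state-upload infrastructure/my-cluster-dev/state.db

# Remove the local copy and push the change
git rm infrastructure/my-cluster-dev/state.db
git commit -m "Remove local state.db after uploading state"
git push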

See release notes for 0.0.80


Okctl keeps trying to do the Github Device Authentication Flow while trying to do <any action>

This is known to happen if pass init <gpg-key-id> has not been run after installing pass.
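A minimal sketch of how to initialise pass (assuming gpg and pass are installed; the key ID is taken from the gpg output):

# Find the ID of your GPG key (the hex value after the key type, e.g. rsa4096/<ID>)
gpg --list-secret-keys --keyid-format LONG

# Initialise the password store with that key
pass init <gpg-key-id>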


okctl forward postgres fails on applying security group policy

The command okctl forward postgres fails with an error

Error: applying security group policy: SecurityGroupPolicy.vpcresources.k8s.aws "xxxxxxxxx-pgbouncer-" is invalid: metadata.name: Invalid value: "xxxxxxxxx-pgbouncer-": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

This is due to a bug.

Workaround:

  • Open ~/.okctl/conf.yml
  • Make sure the username is set to your username, like this:
user:
    id: ... # (Don't edit this)
    username: ooo123456 # Replace with your username

On Okctl delete cluster, some resources are not deleted (automatic deletion is coming in a later version)

Workaround: manually delete the remaining resources, as described in Delete cluster.

It is recommended to delete the infrastructure directory in your IAC-repository as the last manual step.


Okctl create cluster: Create identity pool fails / Re-create cluster within short timespan fails

If you do the following:

  • Create a cluster
  • Delete it
  • Create a new cluster with the same domain name (e.g. whatever.oslo.systems)

This might fail if you do these steps within 15 minutes. This is due to DNS resolvers caching NS server records.
More details: https://github.com/oslokommune/okctl/pull/231

Workaround: Wait for up to 15 minutes before creating the cluster again.

15 minutes is the TTL (Time to live, i.e. cache expiry) of the NS record. You can see this value in Route 53 -> Hosted zones -> [Your domain] -> [NS record for your top domain] -> Edit -> See TTL field.
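To check how long your resolver will keep the cached NS record, you can query it directly (a sketch; replace the domain with your own):

# The second field of each answer line is the remaining TTL in seconds
# as seen by your resolver
dig +noall +answer whatever.oslo.systems NS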


Okctl create cluster: Failed to create external secrets helm chart

You get the following error (shortened):

..  creating: external-secrets (elapsed: 1 second 76 microseconds)WARN[0007] failed to process request, because: failed to create external secrets helm chart: failed to update repository: failed to fetch https://kubernetes-charts-incubator.storage.googleapis.com/index.yaml : 403 Forbidden  endpoint=create service=helm/externalSecrets
✓   creating
Error:
....
request failed with Internal Server Error, because: failed to create external secrets helm chart: failed to update repository: failed to fetch https://kubernetes-charts-incubator.storage.googleapis.com/index.yaml : 403 Forbidden

This happens because Helm changed the URLs of their chart repositories. Update your ~/.okctl/helm/repositories.yaml, replacing the old URLs with the new ones:

Name       Old Location                                                 New Location
stable     https://kubernetes-charts.storage.googleapis.com             https://charts.helm.sh/stable
incubator  https://kubernetes-charts-incubator.storage.googleapis.com   https://charts.helm.sh/incubator
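After the change, the relevant entries could look something like this (a sketch; it assumes ~/.okctl/helm/repositories.yaml follows the standard Helm repositories.yaml layout, with other fields omitted):

repositories:
- name: stable
  url: https://charts.helm.sh/stable
- name: incubator
  url: https://charts.helm.sh/incubator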

Okctl apply cluster: Always prompts for GitHub machine authentication, even after it has been set

There is an issue with some versions of pinentry-curses where sometimes the prompt to enter a password for your PGP key will not appear. We store the authentication token in a keyring, and since it cannot be decrypted without the password, Okctl just skips ahead. The solution is to export the following environment variable:

GPG_TTY=$(tty)
export GPG_TTY

This can be done in your current shell before you run okctl commands, or put in your .bashrc or similar to ensure you will always be prompted for your
encryption key password. A more detailed explanation can be found on StackOverflow.


Okctl is expecting an oslokommune-boundary to be present, but it's missing

You're probably trying to create an okctl cluster on a Crayon account. We've yet to adapt okctl to work on the new
accounts, so until then you can run the following command to create a dummy policy in the new account.

aws iam create-policy \
  --policy-name oslokommune-boundary \
  --path /oslokommune/ \
  --policy-document "{\"Version\": \"2012-10-17\", \"Statement\": [ {\"Sid\": \"AllowAccessToAllServices\", \"Effect\": \"Allow\", \"NotAction\": [\"iam:CreateUser\"], \"Resource\": \"*\"}]}"