I’ve been working with Kubernetes and Helm for a while now. Today, I ran into a strange problem that is only triggered under a very specific condition. After figuring out what happened under the hood, I decided to write it down in case someone else needs it.
Also, BTW, to practice my English. :D
Let’s say we have a retail-api chart that contains some ordinary services and deployments, like any other chart. The mysql chart is one of its dependencies.
We also set up GitLab CI review apps to install and upgrade the chart using helmfile. That means every new branch creates a fresh environment, including the release and its resources. We only started working with review apps recently, so we chose to use $CI_COMMIT_REF_SLUG for the environment URL and also as part of the Helm release name, as the docs suggest.
For the first few days, it worked perfectly, until I pushed a branch named bugfix/hook-deploy-mysql-connection-timeout:
$ helmfile -e ${HELM_ENVIRONMENT} apply --suppress-secrets --concurrency=1
...
Comparing release=retail-api-bugfix-hook-deploy-mysql-connection-timeout, chart=deploy/chart
in ./helmfile.yaml: failed processing release retail-api-bugfix-hook-deploy-mysql-connection-timeout: helm exited with status 1:
Error: invalid release name, must match regex ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])+$ and the length must not be longer than 53
Error: plugin "diff" exited with error
/root/.helm/plugins/helm-tiller/scripts/tiller.sh: line 174: 144 Killed ./bin/tiller --storage=${HELM_TILLER_STORAGE} --listen=127.0.0.1:${HELM_TILLER_PORT} ${PROBE_LISTEN_FLAG} --history-max=${HELM_TILLER_HISTORY_MAX} (wd: ~/.helm/plugins/helm-tiller)
Error: plugin "tiller" exited with error
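The check that rejected this name can be reproduced offline. The following is a minimal Python sketch, not Helm's own code: the 53-character limit comes from the error message above, the pattern is a simplified one that should accept the same strings as the regex in that message, and the function name is_valid_release_name is made up for illustration.

```python
import re

# Limit taken from the error message above.
MAX_RELEASE_NAME_LENGTH = 53

# Simplified pattern (assumption: it accepts the same strings as the regex in
# the error message): the name starts and ends with an alphanumeric character
# and has only alphanumerics, '-', '_' and '.' in between.
RELEASE_NAME_PATTERN = re.compile(r"[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?")


def is_valid_release_name(name: str) -> bool:
    """Return True if the name passes both the length and the pattern check."""
    if len(name) > MAX_RELEASE_NAME_LENGTH:
        return False
    return RELEASE_NAME_PATTERN.fullmatch(name) is not None


name = "retail-api-bugfix-hook-deploy-mysql-connection-timeout"
print(len(name), is_valid_release_name(name))
# prints: 54 False
```

The name is only one character over the limit, so the pattern itself was never the problem here.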
It looked easy to fix. I edited helmfile.yaml, using the trunc and trimSuffix functions to build a valid release name:
releases:
- name: retail-api-{{ requiredEnv "CI_COMMIT_REF_SLUG" | trunc 32 | trimSuffix "-" }}
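To see what this template produces without running the pipeline, the two functions can be imitated in Python. This is only a sketch: trunc and trim_suffix below are stand-ins written for illustration (keep the first n characters, then drop one trailing suffix if present), and the slug is the one that appears in the failed release name above.

```python
def trunc(value: str, n: int) -> str:
    """Keep at most the first n characters (stand-in for the template's trunc)."""
    return value[:n]


def trim_suffix(value: str, suffix: str) -> str:
    """Drop one trailing occurrence of suffix, if present."""
    if suffix and value.endswith(suffix):
        return value[: -len(suffix)]
    return value


slug = "bugfix-hook-deploy-mysql-connection-timeout"  # value of $CI_COMMIT_REF_SLUG
release = "retail-api-" + trim_suffix(trunc(slug, 32), "-")
print(release, len(release))
# prints: retail-api-bugfix-hook-deploy-mysql-connect 43
```

The trimSuffix step matters when the cut lands right after a hyphen: truncating the same slug to 19 characters gives bugfix-hook-deploy-, which would make the release name end in an invalid character if the hyphen were kept.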
After that, the release name was truncated to retail-api-bugfix-hook-deploy-mysql-connect. This is where things got weird. See the CI logs:
Upgrading release=retail-api-bugfix-hook-deploy-mysql-connect, chart=deploy/chart
Creating tiller namespace (if missing): tiller
Release "retail-api-bugfix-hook-deploy-mysql-connect" does not exist. Installing it now.
in ./helmfile.yaml: failed processing release retail-api-bugfix-hook-deploy-mysql-connect: helm exited with status 1:
Error: release retail-api-bugfix-hook-deploy-mysql-connect failed: services "retail-api-bugfix-hook-deploy-mysql-connect" already exists
Error: plugin "tiller" exited with error
ERROR: Job failed: command terminated with exit code 1
The previous failed job misled me into the wrong hypothesis: that this failure was also due to the length. To verify it, I even shortened the release name further by reducing the truncation from 32 characters to 16:
releases:
- name: retail-api-{{ requiredEnv "CI_COMMIT_REF_SLUG" | trunc 16 | trimSuffix "-" }}
The release name is now retail-api-bugfix-hook-depl, and the deployment works.
So I put all my attention on the length issue and did the following research:
- The logs said that a service already exists. Is there any additional limitation on service names in Kubernetes? According to the naming docs, no.
- Do the DNS specs or other RFCs specify a maximum length? Yes, but it is 63 characters. Even the longest service name here is only 43 characters, well within the limit.
- Okay, no reference pointed to this issue, so I turned to diagnosis by exclusion. First, I found that the release name is only used through the retail.fullname helper function, which Helm generates by default. Fix the helper, fix the problem. Then I tried truncating, one by one, the value of every field that contains {{ include "retail.fullname" . }}. Finally, I located 2 resources, the deployment and the service, both of which use exactly that helper as their name.
- To find the critical value that precisely reproduces the issue, I tried everything from {{ include "retail.fullname" . | trunc 20 }} to {{ include "retail.fullname" . | trunc 43 }}. Every test passed except for 43, which is the original full name with nothing cut off.
After that, I started to wonder: do we really have a “dirty” leftover resource causing the conflict? It seemed not. But looking at a successfully applied release led to a new discovery. The release actually deployed 2 deployments and 2 services with similar names:
==> v1/Deployment
NAME READY UP-TO-DATE AVAILABLE AGE
retail-api-bugfix-hook-deploy-mysql-connec 0/1 1 0 5s
==> v1beta1/Deployment
NAME READY UP-TO-DATE AVAILABLE AGE
retail-api-bugfix-hook-deploy-mysql-connect 0/1 1 0 5s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
retail-api-bugfix-hook-deploy-mysql-connect ClusterIP 10.100.57.38 <none> 3306/TCP 6s
retail-api-bugfix-hook-deploy-mysql-connec ClusterIP 10.100.192.93 <none> 80/TCP 6s
That was very confusing. Did I just catch a bug in Helm?
Obviously not. Then I noticed an abnormal value in the service list:
$ kubectl get service -n review-apps
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
retail-api-bugfix-hook-deploy-mysql-connect ClusterIP 10.100.57.38 <none> 3306/TCP 18h
Port 3306? That’s the service of the mysql chart, not ours! Finally, I realized that the mysql.fullname helper generates the same value as retail.fullname does, due to the if statement in the helper:
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- printf .Release.Name | trunc 63 | trimSuffix "-" -}}
The name of the mysql service and deployment will be exactly the release name if the release name contains mysql, the chart name.
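The collision can be reproduced with a small Python imitation of the helper. This is a sketch under assumptions: only the contains branch is quoted above, so the else branch (release name plus chart name) is taken from the default helper that helm create scaffolds, nameOverride and fullnameOverride are assumed to be unset, and the parent chart's name is assumed to be retail-api.

```python
def fullname(release_name: str, chart_name: str) -> str:
    """Imitate the default fullname helper with no name overrides set."""
    if chart_name in release_name:
        # Branch quoted in the article: the release name is used as-is.
        name = release_name
    else:
        # Assumed else branch from the default `helm create` scaffold.
        name = f"{release_name}-{chart_name}"
    name = name[:63]  # trunc 63
    return name[:-1] if name.endswith("-") else name  # trimSuffix "-"


for release in (
    "retail-api-bugfix-hook-deploy-mysql-connect",  # trunc 32
    "retail-api-bugfix-hook-depl",                  # trunc 16
):
    parent = fullname(release, "retail-api")  # assumed parent chart name
    sub = fullname(release, "mysql")
    print(f"{parent} | {sub} | collide={parent == sub}")
```

Under these assumptions, the first release yields the same name for both charts, while the second yields retail-api-bugfix-hook-depl for the parent and retail-api-bugfix-hook-depl-mysql for the dependency. That would explain why shortening the name appeared to fix the problem: the shorter release name simply no longer contains mysql.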
In the end, I haven’t dug deeper into why the Helm developers chose to do it that way. In fact, I don’t really like this kind of behavior, which does not “show up” in my daily ops. Even if I read the code someday, it could easily be forgotten.
However, this debugging session taught me that it is worth paying more attention when using helpers and building resource names, and being careful when using any kind of “user-generated” value as a Helm release name.
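One way to act on this in CI is to check the release name before deploying. The Python sketch below is hypothetical: the function name and the idea of listing dependency chart names by hand are illustrative assumptions, not part of Helm or helmfile. It flags release names that contain the name of a dependency chart, which is the condition that made the mysql helper return the bare release name.

```python
from typing import List


def colliding_dependencies(release_name: str, dependency_names: List[str]) -> List[str]:
    """Return the dependency chart names that appear inside the release name."""
    return [dep for dep in dependency_names if dep in release_name]


# Hypothetical usage: fail the job early instead of at install time.
hits = colliding_dependencies("retail-api-bugfix-hook-deploy-mysql-connect", ["mysql"])
if hits:
    print("release name contains dependency chart name(s):", ", ".join(hits))
```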