Merge pull request #11 from olblak/iep-007

Add IEP-007: Kubernetes cluster upgrade
R. Tyler Croy 2017-11-27 11:51:08 -08:00 committed by GitHub
commit de212a6599
1 changed file with 158 additions and 0 deletions

iep-007/README.adoc Normal file

@@ -0,0 +1,158 @@
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
= IEP-007: Kubernetes cluster upgrade
:toc:
.Metadata
[cols="2"]
|===
| IEP
| 007
| Title
| Kubernetes cluster upgrade
| Author
| link:https://github.com/olblak[Olblak]
| Status
| :speech_balloon: In-process
| Type
| [Process]
| Created
| 2016-07-04
|===
== Abstract
The Kubernetes project releases new versions frequently, and we want to benefit from the features and bugfixes those versions introduce.
Therefore we need an easy way to upgrade our Kubernetes clusters on a regular basis.
== Specification
Currently there are two main strategies to "upgrade" a cluster: +
. Upgrading an existing cluster.
. Migrating on a second cluster.
As Azure does not (yet) provide any tool to upgrade existing clusters, we would have to upgrade them manually. +
It appears easier and safer to deploy a new cluster and re-deploy all resources on the new one.
As long as the new cluster stays in the same region as the previous one, we can keep the same blob storages and attach them to the new cluster. +
The only important element that we lose when we migrate to a new cluster is the cluster public IP. +
This means that we need to update 'nginx.azure.jenkins.io' with the new public IP.
=== Migration Process
IMPORTANT: The old cluster must be kept until the new one is ready to serve requests.
==== Step 1: Backup
Ensure that the secrets containing Letsencrypt certificates are exported (requires https://github.com/jenkins-infra/jenkins-infra/pull/819[#PR819]). +
A cron job should periodically export each Letsencrypt certificate into `~/backup/$(CLUSTER)/secret.$(APP)-tls.yaml`
.Export a secret
----
.bin/kubectl get secret $(APPLICATION)-tls --export=true --kubeconfig .kube/$(CLUSTER).conf -o yaml > ~/backup/$(CLUSTER)/secret.$(APPLICATION)-tls.yaml
----
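For illustration, that cron job could be a small shell script along these lines; the cluster name and application list below are assumptions, not the actual jenkins-infra values.
.Hypothetical periodic backup script
----
#!/bin/bash
# Export each application's TLS secret so it can be restored on the new cluster.
CLUSTER=clusterexample1                      # assumed cluster name
APPLICATIONS="plugins repo-proxy accounts"   # assumed application list
mkdir -p ~/backup/"${CLUSTER}"
for APP in ${APPLICATIONS}; do
  .bin/kubectl get secret "${APP}-tls" --export=true \
    --kubeconfig ".kube/${CLUSTER}.conf" -o yaml \
    > ~/backup/"${CLUSTER}/secret.${APP}-tls.yaml"
done
----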
==== Step 2: Deploy the new cluster
Add a second k8s resource in github.com/jenkins-infra/azure, named 'pea' (requires https://github.com/jenkins-infra/iep/pull/11[#PR11]).
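The cluster itself is declared as Terraform code in jenkins-infra/azure rather than created by hand; purely as an illustration of what that resource provisions, creating an equivalent ACS Kubernetes cluster with the Azure CLI of that era would look roughly like this (the resource group name is an assumption).
----
# Hypothetical manual equivalent of the Terraform k8s resource (Azure CLI 2.0).
az acs create \
  --orchestrator-type kubernetes \
  --resource-group jenkins-infra \
  --name pea \
  --generate-ssh-keys
----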
==== Step 3: Configure the new cluster
* Update the following hiera variables with the new k8s cluster information (requires a PR on jenkins-infra/jenkins-infra):
----
profile::kubernetes::params::clusters:
- server: https://clusterexample1.eastus.cloudapp.azure.com
username: clusterexample1-admin
clustername: clusterexample1
certificate_authority_data: ...
client_certificate_data: ...
client_key_data: ...
----
* Run the puppet agent
* Get the new public IP (Manual operation)
----
kubectl get service nginx --namespace nginx-ingress
----
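When this step is scripted, the IP can be extracted directly with kubectl's jsonpath output; a minimal sketch using the same service and namespace as above:
----
# Print only the load balancer's external IP.
kubectl get service nginx --namespace nginx-ingress \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
----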
* Restore the backed-up secrets containing Letsencrypt certificates on the new cluster (Manual operation)
----
.bin/kubectl apply -f ~/backup/$(OLD_CLUSTER)/secret.*-tls.yaml --kubeconfig .kube/$(CLUSTER).conf
----
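Note that `kubectl apply -f` expects a single path per flag, so if the glob matches several files the restore needs a loop; a minimal sketch, assuming shell variables instead of the Makefile-style ones above:
----
# Apply every backed-up TLS secret, one file at a time.
for f in ~/backup/"${OLD_CLUSTER}"/secret.*-tls.yaml; do
  .bin/kubectl apply -f "$f" --kubeconfig ".kube/${CLUSTER}.conf"
done
----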
* Validate the HTTPS endpoints (Manual operation). `--insecure` is needed because the certificates will not match the raw IP address
----
curl --insecure --header 'Host: plugins.jenkins.io' 'https://<new_public_ip>'
curl --insecure --header 'Host: repo-proxy.jenkins.io' 'https://<new_public_ip>'
curl --insecure --header 'Host: accounts.jenkins.io' 'https://<new_public_ip>'
----
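An alternative that keeps certificate verification enabled is curl's `--resolve` option, which pins a host name to the new IP so the TLS handshake uses the real host name; a sketch for one of the services:
----
# Resolve the host name to the new IP while still validating the certificate.
curl --resolve 'plugins.jenkins.io:443:<new_public_ip>' 'https://plugins.jenkins.io'
----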
==== Step 4: Update DNS Record
Update nginx.azure.jenkins.io with the new public IP (requires a PR on jenkins-infra/jenkins-infra).
[NOTE]
While the DNS change propagates, requests will be sent either to the new cluster or to the old one.
Users shouldn't notice any difference.
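DNS propagation can be checked from the command line; a quick sketch with `dig`:
----
# Check which IP the record currently resolves to.
dig +short nginx.azure.jenkins.io
----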
==== Step 5: Remove the old cluster
Remove k8s.tf from jenkins-infra/azure (requires a PR on jenkins-infra/azure).
[NOTE]
It may be safer not to automate this step and instead delete the right storage account by hand through the Azure portal
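If the removal does go through Terraform anyway, reviewing the plan before applying shows exactly what will be destroyed; a minimal sketch:
----
# After deleting k8s.tf, review what Terraform intends to destroy.
terraform plan
----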
==== Conclusion
With this scenario we shouldn't have any downtime, as HTTP/HTTPS requests will get (depending on the service) almost the same response whether they reach the old or the new cluster.
== Motivation
As testing environments are short-lived, we create them to validate deployments and then trash them, so they often use the latest version available from Azure.
This means that we may not detect issues when those versions are not aligned with production.
We would also like to benefit from bugfixes and new features.
Finally, it's easier to follow the Kubernetes documentation if we use a version close to upstream.
== Rationale
There isn't any tool to easily upgrade a Kubernetes cluster on Azure, so I tried to do it manually.
I only spent one day on this upgrade attempt, but it wasn't trivial.
I applied the following steps manually:
* Update kubelet and kubectl
* Update the Kubernetes options according to the version (some options were deprecated, others were new)
* Reproduce the production cluster in order to validate the migration
* Restart each node (master & client) after the upgrade
Even after those operations I faced weird issues, so I decided to stop there and concluded that cluster migration was the easier and safer process.
There are several open issues regarding the upgrade procedure, so I suppose that an in-place upgrade may become a viable alternative in the future:
* https://github.com/Azure/ACS/issues/5[Azure/acs]
* https://github.com/Azure/acs-engine/issues/464[Azure/acs-engine]
== Costs
A second cluster must be paid for during the migration, but once everything is switched over to the new cluster, the previous one can be decommissioned.
== Reference implementation
As of right now there is no reference implementation of the Kubernetes cluster upgrade.