Add a bunch of blog posts from the past couple months

I'm apparently a very lazy committer
R. Tyler Croy 2017-12-02 08:40:21 -08:00
parent b85175a379
commit 477df3475b
No known key found for this signature in database
GPG Key ID: 1426C7DC3F51E16F
7 changed files with 1074 additions and 0 deletions

@@ -0,0 +1,65 @@
---
layout: post
title: "They will blame you"
tags:
- sysadmin
- devops
- opinion
---
Over the past decade two things have become increasingly clear: practically
every modern industry is part of "the software industry," in one way or
another, and "the software industry" is rife with shortcuts and technical debt.
Working in an Operations or Systems Administration capacity provides a
front-row seat to many of these dysfunctional behaviors. But it's not just
sysadmins: many developers are also called upon to engage in, or allow,
half-baked product launches, poor-quality code deployments, and subpar patch
lifecycle management.

Make no mistake, if something goes wrong, **they will blame you.**

Just yesterday, I was working on my truck in the driveway and a neighbor struck
up a conversation about diesel engines. The conversation naturally led to a
discussion about Volkswagen's massive diesel emissions scandal. I mentioned to
my neighbor how infuriated I was that [Volkswagen executives blamed developers](http://www.latimes.com/business/autos/la-fi-hy-vw-hearing-20151009-story.html)
for the scandal. Prior to that news story, I naively assumed that executives
took ultimate responsibility for the successes, and failures, of their
organizations.
As the sun set, I wrapped up my work and came back inside to see [this story from Engadget](https://www.engadget.com/2017/10/03/former-equifax-ceo-blames-breach-on-one-it-employee/)
wherein the former Equifax CEO blamed IT staff for the failure. The Equifax breach
was made possible by an out-of-date Apache Struts dependency.

Setting aside for a moment that personally-identifiable information should _never_
be a single vulnerability away from exposure. Setting aside for a moment that
the majority of the Equifax business relies on **trust**, and should therefore
have been subject to vigorous and regular third-party security audits.
Setting aside for a moment that information security relies on defense in
depth, which is an organization-wide practice. Setting all of that aside: the
former CEO blamed underlings, rather than leadership, for the systemic failures
of Equifax to secure highly sensitive personal information.

Make no mistake, if something goes wrong, **they will blame you.**

---

Before I dropped out of college, while I was still pretending to study
Computer Engineering, I took an Engineering Ethics course. We discussed Space
Shuttle disasters, bridge failures, and other calamities, at length. One
recurring theme from many of the incidents was management ignoring or covering
up expert advice, or concerns, raised by engineering staff. The conclusion drawn, for
the auditorium of young engineering students, was that it was our
responsibility as "Professional Engineers" to ensure the safety and quality of
our work, and to keep solid documentation for any safety concerns we raised;
otherwise, we could be held liable.

I am starting to believe that, before the decade is over, we will start to see
developers and systems administrators held civilly liable for failures in
systems we create and for which we are responsible.
It is up to you to advocate for good patch lifecycle management practices. It
is up to you to build systems which prevent poor-quality code deployments. It
is up to you to advocate for well-designed products which defend user privacy
and personally-identifiable information. Because make no mistake, if something
goes catastrophically wrong, they will blame you.

@@ -0,0 +1,72 @@
---
layout: post
title: "Watching fire come down the mountain"
tags:
- california
- santarosa
- opinion
---
The insanely strong gusts of wind would not stop clattering the tin roof panels
over the back patio. Begrudgingly, I awoke, dressed, and tried to secure the
roof panels before the neighbors got too ornery. Stepping up the ladder, I
noticed an orange glow north of the house. It was just after midnight and I had
not heard any sirens, so I jumped into the car on the assumption that one of
those houses by the park was burning and had not yet been reported.

Wearing a flannel, jeans, and my flip-flops, I speed off into the night. Not
entirely sure what aid I could render, as a mostly-useless person wearing
inappropriate fire-fighting footwear.

Passing the park, seeing nothing, I figure it's the neighborhood behind, and
continue driving. The next neighborhood doesn't show any fire, but I smell
smoke, so I continue on towards Fountaingrove Parkway which crosses one of the
highest ridges in Santa Rosa.

Atop Fountaingrove Parkway, I see that the hills to the north, an area I later
learn is "Shiloh Ridge," are glowing.

I do not see flames, but the hills are glowing. I turn my hat backwards so the
gusts of wind don't blow it from my head. Not more than two minutes pass and
flames crest the ridge.

"Oh shit" I exclaim to nobody in particular.
Walking back to the car, I stand on the bumper for a better view and see the
flames already pushing more than halfway down Shiloh Ridge. In a matter of
minutes, the ridge glowing against the smokey night sky had erupted in flames.
"Oh fuck this!" and I scurry into the car and speed off.
---
Driving back to house, I call my wife, who is rather surprised to learn I'm not
sleeping beside her. She puts a kettle on, and starts preparing the go-bag. I
arrive home around 1:00, half the sky is clear with a full moon, the other half
smoke filled with an orange backlight.
While preparing some stuff to go, we start listening to the scanner, and begin
to watch Twitter.

Within 30 minutes, the evacuation notices are rolling out.

Within 60 minutes, the fire jumps over US Highway 101.

---

We voluntarily evacuated to Sebastopol at 3:00.

---

Between Santa Rosa and Sebastopol, the air foggy with smoke and ash, we were
able to see fires raging on the hills to the southeast of Santa Rosa. When we
arrived in Sebastopol at 3:45, everybody there had already been awoken by the
smell of smoke.

By 10:00, significant chunks of northern Santa Rosa have burnt to the ground.
The neighborhood from that glowing ridge, which I saw around midnight: gone.
The valley below, where I watched the flames flicker down the hill: gone. The
ridge I stood atop for all of five minutes is now also on fire.

It is still uncertain how the fire will develop throughout the day, how long
the fire will burn, and how scarred the beautiful Sonoma and Napa Valleys will
be when it's all over.

@@ -0,0 +1,132 @@
---
layout: post
title: "This is your reality now"
tags:
- santarosa
- sonoma
- fire
- sonomafireinfo
---
The traffic on the Bay Bridge connecting San Francisco to Oakland is one of the
most congested routes in all of Northern California. Somehow it gets
even worse on Saturday and Sunday. One weekend, a few years ago, I was driving my wife
and some of the women from her soccer team back to Berkeley from a game in
San Francisco's Golden Gate Park. On the east side of the bridge, before
inching onto I-580N, I was pretty pissed off, and half-joking, half-frustrated,
I shook the steering wheel back and forth with a "GAHHHHHHHHHHHH." The woman sitting
behind me, who was certainly the "funny one" of the group, put her hand on my
arm and gently said "Tyler, this is your reality now."
Certainly a well-delivered line, perfect timing, received with laughter all around, but
the phrase has stuck in my memory longer than the woman's name.

I wrote my [last post](/2017/10/09/fire-coming-down-the-mountain.html) as a way
to process and capture the trauma of watching fire rip into northern Santa
Rosa, a town I have adopted and which has been the subject of a number of
picturesque photos I have posted over the past three years, always titled with
my unofficial city motto: "Santa Rosa: It's nice."

The day after I wrote that post, I ended up at the [Chimera Arts and
Makerspace](http://chimeraarts.org) in Sebastopol, the little hippie town west
of Santa Rosa, where I joined a fledgling effort called [Sonoma Fire
Info](http://sonomafireinfo.org). I took the remainder of the week off from
work, and our little volunteer organization rapidly became a clearinghouse for
verified information across the county in its time of need, soaking up the
efforts of over 60 volunteers who made thousands of phone calls, scoured social
media, and captured truth amid the chaos. In a two-week period, the website was
viewed by over 100k people.

I think we did a great job of informing Sonoma County. The rest of the country,
and world, remains frustratingly less informed about an event from which my adorable
little city is going to take _years_ to recover.
The fire that I watched whip down the hillside is known as the "Tubbs
Fire". The fire that I could see from miles away on Llano Rd during our
voluntary evacuation to Sebastopol at 3:45 that morning is known as the "Nuns
Fire." While I saw both of these with my own eyes, there were **four other
fires**, of various sizes, engorged by 50-70mph winds, raging in Northern
California:
* The "Sulphur Fire" burned in Lake County to our northeast.
* The "Pocket Fire" destroyed parts of northern Sonoma county.
* The "Redwood Valley Fire" incinerated Mendocino County further to the north.
* The "Atlas Fire" tore through Napa County to our east.

At one time there were **six active fires** in the part of Northern California north of
San Francisco and west of Sacramento. To put this into historical context,
**four** of those six fires rank in the 20 most destructive (structures destroyed)
wildfires ever recorded in California history:
![The 20 most destructive fires](/images/post-images/your-reality-now/destructive-fires.jpg)
(posted by [@CALFIRE](https://twitter.com/CAL_FIRE/status/921441414981885952/photo/1) on October 20th)

The most destructive (Tubbs) and sixth most destructive (Nuns) wildfires in
the Bear Republic's history scarred Sonoma county on a scale that is difficult
to understand and difficult to process.

The impact on Santa Rosa, in particular, from this [unfathomably big fire](https://twitter.com/agentdero/status/921609069810532353)
cannot be overstated. Considered the fifth most populous city in the "Bay
Area," with just over 170k residents, it lost **5%** of its housing in less than
twelve hours. The gale-force winds which woke me up at 12:30am on October 9th
pushed the fire through neighborhoods, across 4-6 lanes of Highway 101, and
through hundreds more homes before it could be stopped, all in a matter of
about 8 hours.

---

We returned to our house the Thursday night after the fires started, exhausted.
After a full day working at Chimera on Sonoma Fire Info, and some dinner that
Friday, I holed up in my office and continued scouring the internet for news
and updates, when I was startled by the sound of water falling on the tin patio roof.
My first thought: "did a water-tanker helicopter just fly over?" Followed
quickly by "no fucking way, did it start raining!?" Bolting out the front door,
I was disappointed to learn it had not started raining, but then was bemused to
find my neighbor watering my house.

I can understand the compulsion to water down the house "just in case" in areas
near wildfires, but this wasn't a "just in case": my neighbor had caught an
ember burning on my roof earlier in the week. He had since taken to watering both our
houses a couple times a day.

I also learned from my night-owl of a neighbor that he had been sitting on my
corner-lot house's porch, and brandished his pistol a few times at some cars
which took an especially slow roll through our neighborhood, not about to let
any thieves take advantage of the situation.
The CALFIRE maps show that we are almost exactly one mile south of the last
structures completely destroyed by the Tubbs Fire.
This was close, terrifyingly close.

---

The next Monday, a week after the fires broke out, I return to work, to
questions of "are things okay?"

I lie.

Everybody in Sonoma county who didn't lose a house knows somebody who did.
Thousands of people will have to wait until early 2018 for the EPA to remove
thousands of tons of toxic ash and debris, requiring a clean-up operation of
unprecedented size, before they can begin to rebuild. Large portions of
Sugarloaf Ridge State Park are burned, the majority of Annadel State Park is
destroyed. Most of the little Sonoma Valley towns I drive through on my way to
Napa have suffered severe damage.

This region, this adopted home of mine, is scarred in ways that many Americans,
including some who live here, cannot fully appreciate.

Much as I would like to wallow in that frustration and despair, there is no
direction to go but forward. There is nothing that will undo what has been
done; nothing will make this "okay."

There is no option for Sonoma county, and Santa Rosa, but to enjoy the warmth
of the autumn sun, pick up the pieces, and rebuild.

"This is your reality now."

@@ -0,0 +1,80 @@
---
layout: post
title: "Call for Proposals: Testing and Automation @ FOSDEM 2018"
tags:
- fosdem
- testingautomation
- jenkins
---
2018 will be the sixth year for the Testing/Automation dev room at
[FOSDEM](https://fosdem.org/2018/). This room is about creating better
software through a focus on testing and automation at all layers of
the stack, from creating libraries and end-user applications all the
way down to packaging, distribution, and deployment. Testing and
automation is not isolated to a single toolchain, language, or
platform; there is much to learn and share regardless of background!

# What
Since this is the sixth year we're hosting the Testing and Automation
dev room, here are some ideas of what we would like to see, and what
worked in prior years (they're just ideas, though!). Check out the
[2013](https://archive.fosdem.org/2013/schedule/track/testing_and_automation/),
[2014](https://archive.fosdem.org/2014/schedule/track/testing_and_automation/),
[2015](https://archive.fosdem.org/2015/schedule/track/testing_and_automation/),
[2016](https://archive.fosdem.org/2016/schedule/track/testing_and_automation/),
[2017](https://archive.fosdem.org/2017/schedule/track/testing_and_automation/)
schedules for inspiration.
### Testing in the real, open source world
* War stories/strategies for testing large scale or complex projects
* Tools that extend the ability to test low-level code
* Projects that are introducing new/interesting ways of testing "systems"
### Cool Tools (good candidates for lightning talks)
* Explain/demo how your open source tool made developing quality software better
* Combining projects/plugins/tools to build amazing things: "Not enough
  people in the open source community know how to use $X, but here's a
  tutorial on how to use $X to make your project better."
# Where
FOSDEM is hosted at [Université libre de Bruxelles in Brussels,
Belgium](https://fosdem.org/2018/practical/transportation/). The
Testing and Automation dev room is likely slated for Building H, room
2213, which seats ~100.
# When
* CFP Submission Deadline: **23:59 UTC, 26 November 2017**
* Schedule Announced: **15 December 2017**
* Presentations: **3 February 2018**
# How
Please submit one (or more) 30-40 minute talk proposal(s) OR one (or
more) 10 minute lightning talk proposal(s) by **23:59 UTC on November
26th 2017**. We will notify all those submitting proposals about their
acceptance by December 15th 2017.
Submit your talk proposal (you can submit multiple proposals if you'd
like) with [Pentabarf](https://penta.fosdem.org/submission/FOSDEM18/),
the FOSDEM paper submission system. Be sure to select `Testing and
Automation`, otherwise we won't see it!

You can create an account, or use an existing account if you already have one.
Please note: FOSDEM is a
[FLOSS](https://en.wikipedia.org/wiki/Free_and_open-source_software)
community event, by and for the community, so please ensure your topic is
appropriate (i.e. this isn't the right forum for commercial product
presentations).

# Who
* [R. Tyler Croy](https://github.com/rtyler) - Jenkins hacker
* [Mark Waite](https://github.com/markewaite) - Jenkins/Git hacker

@@ -0,0 +1,206 @@
---
layout: post
title: "Running tasks with Docker and Azure Functions"
tags:
- azure
- docker
---
Months ago Microsoft announced [Azure Container
Instances](https://docs.microsoft.com/en-us/azure/container-instances/) (ACI), which
allow for rapidly provisioning containers "in the cloud." When they were first
announced, I played around with them for a bit, before realizing that the
pricing for running a container "full-time" was almost 3x what it would cost to
deploy that container on a comparable Standard A0 virtual machine. Since then,
however, Azure has added support for a "Never" restart policy, which opens the
door for using Azure Container Instances for [arbitrary task
execution](https://docs.microsoft.com/en-us/azure/container-instances/container-instances-restart-policy).
The ability to quickly run arbitrary containerized tasks is a really exciting
feature. Any Ruby, Python, or JavaScript script that I can package into a Docker
container, I can kick out to Azure Container Instances in seconds, and pay by
the second of runtime. **Very** exciting, but it's not practical for me to
always have the Azure CLI at the ready to execute something akin to:
```
az container create \
    --resource-group myResourceGroup \
    --name mycontainer \
    --image rtyler/my-silly-container:latest \
    --restart-policy Never
```
Fortunately, Microsoft publishes a number of client libraries for Azure,
including a Node.js one. This is where introducing [Azure
Functions](https://docs.microsoft.com/en-us/azure/azure-functions/) can help
make Azure Container Instances really _shine_. Similar to AWS Lambda, or
Google Cloud Functions, Azure Functions provide a light-weight computing
environment for running teeny-tiny little bits of code, typically JavaScript,
"in the cloud."
This past weekend I had an arguably good reason for combining the two in a
novel fashion: launching a (containerized) script every ten minutes.
The expensive and old fashioned way to handle this would be to just deploy a
small VM, add a crontab entry, and spend the money to keep that machine online
for what equates to approximately 6 hours of work throughout the month.
* Standard A0 virtual machine monthly cost: $14.64
* Azure Container Instance, for 6 hours a month, cost: $0.56
In this blog post I won't go too deeply into the creation of an Azure Function,
but I will focus on the code which actually provisions an Azure Container
Instance from Node.js.
### Prerequisites
In order to provision resources in Azure, we must first create the Azure
credentials objects necessary. For better or worse, Azure builds on top of
Azure Active Directory which offers an absurd amount of role-based access
controls and options. The downside of that flexibility is that it's supremely
awkward to get simple API tokens set up for what seem like otherwise mundane
tasks.
To provision resources, we will need an "Application", "Service Principal", and
"Secret". The instructions below will use the Azure CLI:
* `openssl rand -base64 24` will generate a good "client secret" to use.
* `az ad app create --display-name MyAppName --homepage http://example.com/my-app --identifier-uris http://example.com/my-app --password $CLIENT_SECRET` creates the Azure Active Directory Application, mind the "App ID" (aka client ID).
* `az ad sp create --id $CLIENT_ID` will create a Service Principal.
* And finally, I'll assign a role to that Service Principal: `az role assignment create --assignee http://example.com/my-app --role Contributor --scope /subscriptions/$SUBSCRIPTION_ID/resourceGroups/my-apps-resource-group`.
In these steps, I've isolated the Service Principal to a specific Resource
Group (`my-apps-resource-group`) to keep it away from other resources, but also
to make it easier to monitor costs.
A number of these variables will be set in the Azure Function "Application
Settings" to enable my JavaScript function to authenticate against the Azure
APIs.
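
For reference, here is a sketch of how those Application Settings might be
pushed with the Azure CLI; the function app name (`my-function-app`) is just a
placeholder, while the setting names match what `index.js` below reads from
`process.env`:

```
# Hypothetical function app name; the setting names match the env vars read in index.js
az functionapp config appsettings set \
    --name my-function-app \
    --resource-group my-apps-resource-group \
    --settings AZURE_CLIENT_ID=$CLIENT_ID \
               AZURE_CLIENT_SECRET=$CLIENT_SECRET \
               AZURE_TENANT_ID=$TENANT_ID \
               AZURE_SUBSCRIPTION_ID=$SUBSCRIPTION_ID
```
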
### Accessing Azure from Azure
Writing the JavaScript to actually launch a container instance was a little
tricky, as I couldn't find a single example in the [azure-arm-containerinstance
package](https://github.com/Azure/azure-sdk-for-node/tree/master/lib/services/containerinstanceManagement).
In the "Codes" section below is the entire Azure Function, but the only major
caveat is that in my example I've "hacked" the `apiVersion` which is used when
accessing the Azure REST APIs, as the current package hits an API which doesn't
support the "Never" restart policy for the container.
With the Azure SDK for Node authenticating properly, it's feasible to do all
kinds of interesting operations in Azure: creating, updating, or deleting
resources based on specific triggers from Azure Functions.
### Future Possibilities
The code below is among the most simplistic use-cases imaginable for
combining Azure Functions and Azure Container Instances. Thinking more broadly,
one could conceivably trigger short-lived containers "on-demand" in response to
messages coming from Event Hub, or even inbound HTTP requests from another user
or system. Imagine, for example, if you wanted to provide a quick demo of some
application to new users on your website. One Azure Function provisioning
containers for specific users, and another periodically reaping any containers
which have been running past their timeout, would be both cheap and easily
deployed.
I still wouldn't use Azure Container Instances for any "full-time" workload;
their pricing model is fundamentally flawed for those kinds of tasks. If you
have workloads which run for only seconds, minutes, or hours at a time,
they make a *lot* more sense, and with Azure Functions, they are cheaply and easily
orchestrated.

### Codes
**index.js**
```
module.exports = function (context) {
    const ACI = require('azure-arm-containerinstance');
    const AZ = require('ms-rest-azure');

    context.log('Starting a container');

    AZ.loginWithServicePrincipalSecret(
        process.env.AZURE_CLIENT_ID,
        process.env.AZURE_CLIENT_SECRET,
        process.env.AZURE_TENANT_ID,
        (err, credentials) => {
            if (err) {
                throw err;
            }
            let client = new ACI(credentials, process.env.AZURE_SUBSCRIPTION_ID);
            let container = new client.models.Container();

            context.log('Launching a container for client', client);

            container.name = 'my-container-name';
            container.environmentVariables = [
                {
                    name: 'SOME_ENV_VAR',
                    value: process.env.SOME_ENV_VAR
                }
            ];
            container.image = 'my-fancy-image-name:latest';
            container.ports = [{port: 80}];
            container.resources = {
                requests: {
                    cpu: 1,
                    memoryInGB: 1
                }
            };

            /* HACK THE PLANET */
            /* https://github.com/Azure/azure-sdk-for-node/issues/2334 */
            client.apiVersion = '2017-10-01-preview';

            context.log('Provisioning a container', container);

            client.containerGroups.createOrUpdate(
                'spyglass-containers', /* resource group */
                'some-proc', /* container group name */
                {
                    containers: [container],
                    osType: 'Linux',
                    location: 'westus',
                    restartPolicy: 'never'
                }
            ).then((r) => {
                context.log('Launched:', r);
                context.done();
            });
        });
};
```
**package.json**
```
{
  "name": "foobar-processing",
  "version": "0.0.1",
  "description": "Timer-triggered function for running an Azure Container Instance",
  "main": "index.js",
  "author": "R Tyler Croy",
  "dependencies": {
    "azure-arm-containerinstance": "^1.0.0-preview"
  }
}
```
**function.json**
```
{
  "disabled": false,
  "bindings": [
    {
      "direction": "in",
      "schedule": "0 */10 * * * *",
      "name": "tenMinuteTimer",
      "type": "timerTrigger"
    }
  ]
}
```

@@ -0,0 +1,519 @@
---
layout: post
title: "Jenkins on Kubernetes with Azure storage"
tags:
- aks
- azure
- jenkins
- kubernetes
---
_This research was funded by [CloudBees](https://cloudbees.com/) as part of my
work in the CTO's Office with the vague guideline of "ask interesting
questions and then answer them." It does not represent any specific product
direction by CloudBees and was performed with
[Jenkins](https://jenkins.io), rather than CloudBees products, and Kubernetes
1.8.1 on Azure._
At [this point](/tag/azure.html) it is certainly no secret that I am fond of the
work the Microsoft Azure team have been doing over the past couple years. While
I was excited to announce [we had
partnered](https://jenkins.io/blog/2016/05/18/announcing-azure-partnership/) to
run Jenkins project infrastructure on Azure, things didn't start to get _really_
interesting until they announced [Azure Container
Service](https://azure.microsoft.com/en-us/services/container-service/). A
mostly-turn-key Kubernetes service alone was pretty interesting, but then
"[AKS](https://azure.microsoft.com/en-us/blog/introducing-azure-container-service-aks-managed-kubernetes-and-azure-container-registry-geo-replication/)"
was announced, bringing a much-needed _managed_ Kubernetes resource into the
Azure ecosystem. Long story short, thanks to Azure, I'm quite the fan of
Kubernetes now too.
Kubernetes is brilliant at a lot of things. It's easy to use, has some really
great abstractions for common orchestration patterns, and is superb for running
stateless applications. State**ful** applications also run fairly well on
Kubernetes, but the challenge usually has _much_ more to do with the
application, rather than Kubernetes. Jenkins is one of those challenging
applications.
Since Jenkins is my jam, this post covers the ins-and-outs of deploying a
Jenkins master on Kubernetes, specifically through the lens of Azure Container
Service (AKS). This will cover the basic gist of running a Jenkins environment
on Kubernetes, evaluating the different storage options for "Persistent
Volumes" available in Azure, outlining their limitations for stateful
applications such as Jenkins, and will conclude with some recommendations.
* [Jenkins and the File System](#filesystem)
* [Kubernetes Storage](#k8s-storage)
* [Azure Disk](#azure-disk)
* [Azure File](#azure-file)
* [Conclusions](#conclusions)
<a name="filesystem"></a>
## Jenkins and the File System
To understand how Jenkins relates to storage in Kubernetes, it's useful to
first review how Jenkins utilizes its backing file system. Unlike many
contemporary web applications, Jenkins does not make use of a relational
database or any other off-host storage layer, but rather writes a number of
files to the file system of the host running the master process.
These files are not data files, or configuration files, in the traditional
sense. The Jenkins master maintains an internal tree-like object model, wherein
generally each node (object) in that tree is serialized in an XML format to the
file system. This does not mean that every single object in memory is written
to an XML file, but a non-trivial number of "live" objects representing
Credentials, Agents, Projects, and other configurations, may be periodically
written to disk at any given time.
A concrete example would be: when an administrator navigates to
`http://JENKINS_URL/manage` and changes a setting such as "Quiet Period" and
clicks "Save", the `config.xml` file (typically) in `/var/lib/jenkins` will be
rewritten.
These files aren't typically read in any periodic fashion; they're usually
only read when objects are loaded into memory during the initialization of Jenkins.
Additionally, XML files will span a number of levels in the directory
hierarchy. Each Job or Pipeline will have a directory in
`/var/lib/jenkins/jobs/<jobname>` which will have subfolders containing files
corresponding to each Run.
In short, Jenkins generates a large number of little files across a broad, and
sometimes deep, directory hierarchy. Combined with the read/write access
patterns Jenkins has, I would consider it a "worst-case scenario" for just
about any commonly used network-based storage solution.
Perhaps some future post will more thoroughly profile the file system
performance of Jenkins, but suffice it to say: it's complicated.
<a name="k8s-storage"></a>
## Kubernetes Storage
With a bit of background on Jenkins, here's a cursory overview of storage in
Kubernetes. Kubernetes itself provides a consistent, cross-platform interface
primarily via three "objects," if you will: [Persistent
Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/),
Persistent Volume Claims, and [Storage
Classes](https://kubernetes.io/docs/concepts/storage/storage-classes/). Without
diving too deep into the details, workloads such as Jenkins will typically make
a "Persistent Volume Claim", as in "hey give me something I can mount as a
persistent file system." Kubernetes then takes this and confers with the
configured Storage Classes to determine how to meet that need.
In Azure these claims are handled by one of two provisioners:
* [Azure Disk](#azure-disk): an abstraction on top of Azure's "data disks"
which are attached to a Node within the cluster. These show up on the actual
Node as if a real disk/storage device has been plugged into the machine.
* [Azure File](#azure-file): an abstraction on top of Azure Files Storage, which
is basically CIFS/SMB-as-a-Service. CIFS mounts are attached to the Node
within the cluster, but rapidly attachable/detachable like any other CIFS/SMB
mount.
Both of these can be used simultaneously to provide persistence for stateful
applications in Kubernetes running on Azure, but their performance and
capabilities are not going to be interchangeable.
<a name="azure-disk"></a>
### Azure Disk
In AKS, two Storage Classes are pre-configured by default, yet neither one is
configured to [actually **be** the default Storage
Class](https://github.com/Azure/AKS/issues/48):
* `default`: utilizes the "Standard" storage (as in, hard drive, spinning
magnetic disks) model in Azure.
* `managed-premium`: utilizes the "Premium" storage (as in, solid state
drives).
The only real distinctions between the two which I have observed are going to be
speed and cost.
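
Since neither class is marked as the cluster default, a Persistent Volume Claim
has to name the class it wants explicitly. A minimal sketch of such a claim
against the pre-configured `managed-premium` class follows; the claim name and
requested size are arbitrary:

```yaml
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jenkins-home
spec:
  # Explicitly request the SSD-backed class, since no default class is configured
  storageClassName: managed-premium
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
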
#### Limitations
Regardless of whether "Standard" or "Premium" storage is used for Azure
Disk-backed Persistent Volumes in Kubernetes (AKS or ACS) the limitations are
the same.
In my testing, the most frustrating limitation is the [fixed number of data disks which can be attached to a Virtual Machine in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general).
As of this writing, the default Virtual Machine size used when provisioning AKS
is `Standard_D1_v2`: one vCPU, 3.5GB of memory, and a data disk limit of
**four**. The default node count for AKS is currently 3, which
means that a default AKS cluster cannot currently support more than 12
Persistent Volumes backed by Azure Disk at once.

An easy way to change that is to provision larger Virtual Machine sizes with
AKS, but this **cannot be changed** once the cluster has been provisioned. For
my research clusters I have stuck with a minimum size of `Standard_D4_v2` which
provides up to 32 data disks per Virtual Machine, e.g.:
`az aks create -g my-resource-group -n aks-test-cluster -s Standard_D4_v2`
The Azure Disk Storage Class in Kubernetes also only supports the
`ReadWriteOnce` [access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes).
In effect, a Persistent Volume can only be mounted read/write by a single Node
within the Kubernetes cluster. Once you understand how Azure Disk volumes are
provisioned and attached to Virtual Machines in Azure, this makes total sense.
The impact is that the only allowable `replica` setting for any
given workload which might use this Persistent Volume is **1**.

This imposes one further limitation on scheduling and high availability for
workloads running on the cluster. Detaching and attaching disks to these
Virtual Machines is a **slow** operation. In my experimenting this varied from
approximately 1 to 5 minutes.

For a "high availability" stateful workload, this means that a Pod dying, or
being killed by a rolling update, may incur a non-trivial outage __if__
Kubernetes schedules that Pod onto a different Node in the cluster. While there
is support for [specifying node affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/)
in Kubernetes, I have not figured out a means of encouraging Kubernetes to keep
a workload scheduled on whichever Node has mounted the Persistent Volume.
Though it would be possible to explicitly pin a Persistent Volume to a specific
Node, and then pin a Pod to that Node, a lot of workload flexibility would be
lost.
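
For illustration, here is a sketch of that explicit pinning approach, using a
`nodeSelector` keyed on the built-in `kubernetes.io/hostname` label. The Pod
and the node name are hypothetical, and every future scheduling decision for
this workload becomes a manual one:

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: pinned-jenkins
spec:
  # kubernetes.io/hostname is a built-in node label; the value is a made-up AKS node name
  nodeSelector:
    kubernetes.io/hostname: aks-nodepool1-12345678-0
  containers:
    - name: jenkins
      image: jenkins/jenkins:lts
```
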
#### Benefits
It may be tempting to think at this point, "Azure Disk is not good, so
everything should just use Azure File!" But there are benefits to Azure Disk
which should be considered. Azure Disk is, for lack of a better description, a
big dumb block store. In that simplicity lie its strengths.

While Persistent Volumes backed by Azure Disk can be slow to detach or reattach
to a Node, once they're present, they're fast. Operations like disk scans,
small reads and writes, all _feel_ like trivially fast operations from the
Jenkins standpoint. In my testing the difference between a Jenkins master
running on local instance storage (the Virtual Machine's "main" disk) and
running a Jenkins master on a partition from a Data Disk, is imperceptible.
Another benefit which I didn't realize until I evaluated [Azure
File](#azure-file) backed Persistent Volumes is that, as a big dumb block
store, Azure Disks are essentially whatever file system format you want them to
be. In AKS they default to `ext4` which is perfectly happy and native to me,
meaning my Linux-based containers will make the correct assumptions about the
underlying file system's capabilities.
<a name="azure-file"></a>
### Azure File
AKS does not set up an Azure File Storage Class by default, but the Kubernetes
versions which are available (1.7.7, 1.8.1) have the support for Azure File
backed Persistent Volumes. In order to add the storage class, pass something
like the following to Kubernetes via `kubectl create -f azurefile.yaml`:
```yaml
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
  annotations:
  labels:
    kubernetes.io/cluster-service: 'true'
provisioner: kubernetes.io/azure-file
parameters:
  storageAccount: 'mygeneralpurposestorageaccount'
reclaimPolicy: 'Retain'
# mountOptions are passed into mount.cifs as `-o` options
mountOptions:
```
According to [the Azure File documentation](https://kubernetes.io/docs/concepts/storage/storage-classes/#azure-file)
it's not necessary to specify the `storageAccount` key, but I had some
difficulty coaxing AKS to provision an Azure Storage Account on its own, so I
manually provisioned one within the "hidden" AKS Resource Group
(`MC_<group>_<aks-name>_<location>`) and entered the name into
`azurefile.yaml`.

Full disclosure: I **hate** Storage Accounts in Azure. Where nearly everything
else in Azure is rather enjoyable to use, neatly tucked into Resource Groups,
and subject to reasonable naming restrictions, Storage Accounts are crummy and live
in an Azure _global namespace_, so if somebody else chooses the same name as what
you want, tough luck. The reason this is somewhat relevant to the current
discussion is that Storage Accounts _feel old_ when you use them. Everything
about them _feels_ as if it's from a by-gone era in Azure's development (ASM).
The feature used by the Azure File Storage Class is what I would describe as
"Samba/CIFS-as-a-Service." Kubernetes is basically utilizing the
Microsoft-technology-equivalent of NFS.
But it's not NFS, it's CIFS. And that is **important** to Linuxy container
folks.
#### Limitations
The biggest limitations with Azure File backed Persistent Volumes in Kubernetes
are really limitations of
[CIFS](https://technet.microsoft.com/en-us/library/cc939973.aspx), and frankly,
they are _infuriating_. An application like Jenkins will make what were, at one
point, reasonable assumptions about the operating system and underlying
file system. "If it looks like a Linux operating system, I am going to assume
the file system supports symbolic links" comes to mind. Jenkins will attempt to
create symbolic links when a Pipeline Run or Build completes, to update a
`lastSuccessfulBuild` or `lastFailedBuild` symbolic link, which are useful for
hyperlinks in the Jenkins web interface.
Jenkins should no doubt be more granular and thoughtful about file system
capabilities, but I'm willing to bet that a number of other applications which
you might consider deploying on Kubernetes are also making assumptions along
the lines of "it's a Linux, so it's probably a Linuxey file system" which Azure
File backed Persistent Volumes invalidate.
Volumes which are attached to the Node are attached [with very strict
permissions](https://github.com/kubernetes/kubernetes/issues/2630#issuecomment-344091454).
On a Linux file system level, an Azure File backed volume attached at `/mnt/az`
would be attached with `0700` permissions, allowing _only_ root access. There
are two options for working around this, as far as I am aware:

1. Adding a `uid=1000` to the `mountOptions` specified for the Storage Class in
   the `azurefile.yaml` referenced above (see the sketch after this list).
   Unfortunately this would require that every container attempting to utilize
   Azure File backed volumes use the same uid.
1. Specifying a
   [securityContext](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
   for the container with: `runAsUser: 0`. This makes me feel exceptionally
   uncomfortable, and I would absolutely not recommend running any untrusted
   workloads on a Kubernetes cluster with this setting.
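
Here is a sketch of the first option, extending the `azurefile` Storage Class
shown earlier with `mountOptions`; the uid/gid and mode values are only
examples and would need to match whatever user your containers actually run as:

```yaml
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  storageAccount: 'mygeneralpurposestorageaccount'
# Passed to mount.cifs as `-o` options; the uid/gid/mode values are examples only
mountOptions:
  - uid=1000
  - gid=1000
  - dir_mode=0700
  - file_mode=0600
```
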
The final, and for me the most important, limitation for Azure File backed
storage is the performance. Presently there is [no Premium model offered for
Azure Files Storage](https://feedback.azure.com/forums/217298-storage/suggestions/8374074-offer-premium-storage-option-for-azure-file-servic),
which I would presume means that Azure File volumes are backed by spinning hard
drives, rather than solid state.
The performance bottleneck for Jenkins is _not_ theoretical however. With a
totally fresh Persistent Volume Claim for a Jenkins application, the
initialization of the application took upwards of **5-15 minutes**, namely:
* 2-3 _seconds_ to create the Persistent Volume and bind it to a Node in the
Kubernetes cluster.
* 3-4 minutes to "extract [Jenkins] from war file". When `jenkins.war` runs the
first time, it unpacks the `.war` file into `JENKINS_HOME` (usually
`/var/lib/jenkins`) and populates `/var/lib/jenkins/war` with a number of small
static files. Basically, unzipping a 100MB archive which contains hundreds of
files.
* 5-10 minutes from "Starting Initialization" to "Jenkins is ready." In my
observation this tends to be highly variable depending on the size of Jenkins
environment, how many plugins are loaded, and what kind of configuration XML
files must be loaded at initialization time.

The performance challenges I have observed with Azure File backed storage are
similar to the challenges the CloudBees Engineering team observed with
[Amazon EFS](https://aws.amazon.com/efs/) when
it was first announced. The disk read/write patterns exhibited by Jenkins
caused trouble on EFS as well, but that has seen marked improvement over the
last 6 months, whereas Azure Files Storage doesn't appear to have had
significant performance improvements in a number of years.
#### Benefits
Despite performance challenges, Azure File backed Persistent Volumes are not
without their benefits. The most notable benefit, which is what originally
attracted me to the Azure File Storage Class, is the support for the
`ReadWriteMany` access mode.

For some workloads, of which Jenkins is not one, this would enable a
`replicas` setting greater than 1 and concurrent Persistent Volume access
between the running containers. Even for single-container workloads, this is a
valuable setting, as it allows for effectively zero-downtime rolling updates and
re-deployments when a new Pod is scheduled on a different underlying Node.
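
As a sketch, here is a claim against the `azurefile` Storage Class defined
above, requesting the `ReadWriteMany` access mode; the claim name and size are
arbitrary:

```yaml
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: shared-workspace
spec:
  storageClassName: azurefile
  # Multiple Pods, potentially on different Nodes, may mount this claim read/write
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
```
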
Additionally, Azure File volumes can be simultaneously mounted by other machines in the
resource group, or even across the internet, which can be very useful for
debugging or forensics when something goes wrong (things usually go wrong).
Compare that to an Azure Disk volume, which would require a [container to be successfully
running](https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/) in the Kubernetes environment before you could dig into the disk.
<a name="conclusions"></a>
## Conclusions
Running a highly available Jenkins environment is a non-trivial exercise, one
which requires a substantial understanding both of the nuances of how Jenkins
interacts with the file system and of how users expect to interact with the
system. While I was optimistic at the outset of this work that Kubernetes, or
more specifically AKS, might significantly change the equation, it has not.

To the best of my understanding, this work applies evenly to Azure Container
Service (ACS) and Azure Container Service (AKS) (naming is hard), since both
are using the same fundamental Kubernetes support for Azure via the Azure Disk
and Azure File Storage Classes. Unfortunately I don't have time to do a serious
performance analysis of Data Disks using Standard storage, Data Disks using
Premium Storage, and Azure File mounts. I would love to see work in that area
published by the Microsoft team though!
At this point in time, for those seeking to provision Jenkins on ACS or AKS, I
strongly recommend using the Azure Disk Storage Class with Premium storage.
That will not help with "high availability" of Jenkins, but at least once
Jenkins is running, it will be running swiftly. I also recommend using [Jenkins
Pipeline](https://jenkins.io/doc/book/pipeline) for all Jenkins-based
workloads, not just because I fundamentally think it's a better tool than
classic Freestyle Jobs, but because it has built-in **durability**. Using Jenkins in
tandem with the [Azure VM Agents](https://plugins.jenkins.io/azure-vm-agents)
plugin, workloads are kicked out to dynamically provisioned Virtual Machines,
and when the master goes down (recovery from which can take 5-ish minutes in
the worst case scenario), the outstanding Pipeline-based workloads will not be
interrupted during that window.

I still find myself excited about the potential of AKS, which is currently in
"public preview." My recommendation to Microsoft would be to spend a
significant amount of time investing in both storage and cluster performance to
strongly differentiate AKS from Kubernetes provided on other clouds.
Personally, I would love to have: faster stateful applications, auto-scaled
Nodes based on compute (or even Data Disk limits!), and cross-location
[Federation](https://kubernetes.io/docs/concepts/cluster-administration/federation/)
for AKS.
Maybe in 2018!

---

### Configuration
Below is the configuration for the Service, Namespace, Ingress, and Stateful
Set I used:
```yaml
---
apiVersion: v1
kind: "List"
items:
  - apiVersion: v1
    kind: Namespace
    metadata:
      name: "jenkins-codevalet"
  - apiVersion: v1
    kind: Service
    metadata:
      name: 'jenkins-codevalet'
      namespace: 'jenkins-codevalet'
    spec:
      ports:
        - name: 'http'
          port: 80
          targetPort: 8080
          protocol: TCP
        - name: 'jnlp'
          port: 50000
          targetPort: 50000
          protocol: TCP
      selector:
        app: 'jenkins-codevalet'
  - apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: 'http-ingress'
      namespace: 'jenkins-codevalet'
      annotations:
        kubernetes.io/tls-acme: "true"
        kubernetes.io/ingress.class: "nginx"
    spec:
      tls:
        - hosts:
            - codevalet.io
          secretName: ingress-tls
      rules:
        - host: codevalet.io
          http:
            paths:
              - path: '/u/codevalet'
                backend:
                  serviceName: 'jenkins-codevalet'
                  servicePort: 80
  - apiVersion: apps/v1beta1
    kind: StatefulSet
    metadata:
      name: "jenkins-codevalet"
      namespace: "jenkins-codevalet"
      labels:
        name: "jenkins-codevalet"
    spec:
      serviceName: 'jenkins-codevalet'
      replicas: 1
      selector:
        matchLabels:
          app: 'jenkins-codevalet'
      volumeClaimTemplates:
        - metadata:
            name: "jenkins-codevalet"
            namespace: "jenkins-codevalet"
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
      template:
        metadata:
          labels:
            app: "jenkins-codevalet"
          annotations:
        spec:
          securityContext:
            fsGroup: 1000
            # https://github.com/kubernetes/kubernetes/issues/2630#issuecomment-344091454
            runAsUser: 0
          containers:
            - name: "jenkins-codevalet"
              image: "rtyler/codevalet-master:latest"
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
                  name: http
                - containerPort: 50000
                  name: jnlp
              resources:
                requests:
                  memory: 384M
                limits:
                  memory: 1G
              volumeMounts:
                - name: "jenkins-codevalet"
                  mountPath: "/var/jenkins_home"
              env:
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      resource: requests.cpu
                - name: CPU_LIMIT
                  valueFrom:
                    resourceFieldRef:
                      resource: limits.cpu
                - name: MEM_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      resource: requests.memory
                      divisor: "1Mi"
                - name: MEM_LIMIT
                  valueFrom:
                    resourceFieldRef:
                      resource: limits.memory
                      divisor: "1Mi"
                - name: JAVA_OPTS
                  value: "-Dhudson.DNSMultiCast.disabled=true -Djenkins.CLI.disabled=true -Djenkins.install.runSetupWizard=false -Xmx$(MEM_REQUEST)m -Dhudson.slaves.NodeProvisioner.MARGIN=50 -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85"
```

Binary file not shown (image, 187 KiB)