otto/rfc/0001-otto-systems-design.adoc

= RFC-0001: Otto System Design
:toc: preamble
:toclevels: 3
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

.Metadata
[cols="1h,1"]
|===
| RFC
| 0001

| Title
| Otto System Design

| Sponsor
| link:https://github.com/rtyler/[R Tyler Croy]

| Status
| Draft :speech_balloon:

| Type
| Standards

| Created
| 2019-02-23

|===

== Abstract

Otto is a robust distributed system for scalable continuous integration and
delivery. To accomplish this Otto is multi-process oriented and distributed by
default; all system interactions must occur over process boundaries even if
executing in the same logical host. This document outlines the high level
architecture but omits specific inter-process communication formats, protocols,
and dependencies.


== Specification

The processes which compose Otto are segmented based on their role or task, and
functionally independent from one another. They must not share data on disk or
memory with one another excepting explicit data storage services.

The system is composed such that each process holds a relatively well-defined
area of responsibility. Each of the processes should be easily replaceable, or
horizontally scalable, depending on the environment. The diagram below is
incomplete, but one such example would be the "Orchestrator" service, whose
responsibilities might involve pushing workloads into Agents depending on how
the CI/CD process has been defined. For a Kubernetes-based deployment of Otto,
this should be swapped out with a Kubernetes-friendly Orchestrator which
implements the orchestration API. In an AWS or Azure-based deployment, this
might instead be provided by a "serverless" function.

[source]
----
+------------+     +-----------+        +------------+
|  Frontend  +-----> Event Bus +-------->Orchestrator|
+--------^---+     +-----------+        +----------+-+
         |                                         |
         |                                       +-v------+
         |                +---------------+------+ Agents |
         |                |               |      +---------+
         |                |               |       +--------|
         |                |               |         +-------+
         |                |               |
         |    +-----------v---+   +-------v------+
         +--->|Relational Data|   |Object Storage|
              |Storage        |   +--------------+
              +---------------+

----

Each of the above boxes represents a process as part of Otto which would have
its own API. Rather than logical components which might represent off-the-shelf
solution.

To support this, all interprocess communication must happen via HTTP and in the
JSON format.

All processes are stateless and rely on the relational and object storage for
any longer term persistence. No other system is the source of record.

[[motivation]]
== Motivation

The motivations for Otto's systems design fall into three major areas:

1. Safe extensibility, allowing different components to be added or removed
   without affecting the other parts of the system.
1. Scalability, allowing different components to scale horizontally as
   workloads demand it.
1. Cost, allowing different components to only execute as needed.

The section below contains more reasoning on how these motivations relate to
other systems and their strengths and weaknesses.

== Reasoning

When considering potential designs for Otto, the most significant influence was
the design and implementation of
link:https://jenkins.io/[Jenkins].
As the predominant open source CI/CD server, Jenkins has succeeded in gaining
adoption by being simple to use, but has struggled to adapt to larger-scale
deployments, or cloud-native environments. Design inspiration also comes from
link:https://kubernetes.io/[Kubernetes],
link:https://kafka.apache.org/[Kafka], and
commonly understood microservice architecture patterns.

Considering the motivations described <<motivations,above>>,
the reasoning expanded in the following sections.

=== Safe extensibility

Extensibility is to be deemed "_safe_" by the avoidance of side effects on
other pre-existing components of the system. A compromise, crash, or error in a
single component **must not** impact other functional areas of the system. In
Otto, this is accomplished by using process boundaries, even in cases of
single-logical-host deployments. This process-external extensibility model is
most successful and prominent in the design and implementation of Kubernetes.

The alternative is to incorporate process-internal extensibility, a model many
plugin-oriented systems such as MySQL, Redis, and to a larger extent Jenkins.
The key benefit of this process-internal extensibility model is that it appears
simpler at a superficial layer. Since the introduction of the `dlopen(3)` system
call, users and administrators have been able to augment functionality of
programs with dynamically loaded code. In **all cases**, this compromises safety
of the parent process. The program is effectively trusting the loaded code as if
it were its own, and thereby granting it access to all of its internal data
structures and ABIs. In the case of Jenkins in particular, this has resulted in
core internal APIs becoming de facto public APIs, as plugin authors discover
classes and methods which suit there purposes.

An external-only extensibility system requires that _all_ actors create explicit
public APIs, while allowing them the freedom of differing implementations

The process-internal approach typically confers implementation restrictions on
the author of the plugin or module. With native applications like MySQL or
Redis, it is possible to write a module in a language other than C or C++ so
long as the ABI is respected and the module compiles to native code. In Jenkins,
although implemented on the JVM, so long as the plugin can operate within the
confines of the Java environment it _should_ work. In practice, process-internal
extensibility systems tend to result in a plugin/module monoculture as it offers
the path of least resistance.

iOtto's process-external extensibility model allows significant flexibility as
practically every programming language and environment allows for implementing
HTTP services. Even within Otto, it may prove useful to operate services
implemented in different languages depending on the runtime environment.
Process-external extensibility allows for this flexibility for all actors in the
system, including the system itself.

=== Scalability

Scalability is a key concern for the design of Otto. As present time it is not
expected to be deployed in a truly multi-tenant environment supporting vastly
different users as a SaaS. There are important benefits to approaching the
system design as if it were to need to evolve into a SaaS model at some point
however.

By structuring Otto as separate stateless processes, each process can be
auto-scaled as workloads demand it. Additionally structuring the "edges" between the different processes to require no single central point of transit, different areas of the system will be able to grow as needed without requiring the entire system to vertically scale. In contrast, consider the implementation of
link:https://jenkins.io/doc/book/pipeline[Jenkins Pipeline] for example. Under
the hood Jenkins Pipeline is executed by a Groovy runtime running on the Jenkins
master, with portions delegated to the agents. It is effectively a very
traditional hub-and-spoke systems architecture, with the hub (master) devoting
non-zero compute resources to interpreting the Jenkins Pipeline script. Like any
hub-and-spoke architecture, this has inherent scale limitations as the hub
becomes over-utilized.

Note: The "Event Bus" in the Otto systems design runs the risk of becoming a central
hub, crucial to all other operations within the system and thereby causing some
scaling challenges.

=== Cost

Cost to operate is another key concern for the design of Otto. The separation of
distinct processes is intended to allow Otto to naturally graft onto existing
"serverless" offerings by various cloud-providers. Rather than a design which
requires a virtual machine to be running a web server to receive webhooks,
Otto's receiver can run via AWS Lambda and only "come alive" to receive the
payload, before executing whatever work it must do in response to the webhook.
As a _reactive_ system, Otto is far cheaper to operate than a fixed-footprint
infrastructure.

The isolation of system state further helps Otto bind into reactive, or
"serverless" infrastructure components. As nothing except storage is effectively
persistent, the other components can be operated only as need.


== Security

Securing this system is a sufficiently complex topic to merit it's own RFC.

== References