otto/rfc/0001-otto-systems-design.adoc

RFC-0001: Otto System Design

Table 1. Metadata

RFC:     0001
Title:   Otto System Design
Sponsor: R Tyler Croy
Status:  Draft 💬
Type:    Standards
Created: 2019-02-23

Abstract

Otto is a robust distributed system for scalable continuous integration and delivery. To accomplish this, Otto is multi-process oriented and distributed by default; all system interactions must occur across process boundaries, even when executing on the same logical host. This document outlines the high-level architecture but omits specific inter-process communication formats, protocols, and dependencies.

Specification

The processes which compose Otto are segmented by their role or task and are functionally independent of one another. They must not share data on disk or in memory with one another, except through explicit data storage services.

The system is composed such that each process holds a relatively well-defined area of responsibility. Each of the processes should be easily replaceable, or horizontally scalable, depending on the environment. The diagram below is incomplete, but one such example would be the "Orchestrator" service, whose responsibilities might involve pushing workloads to Agents depending on how the CI/CD process has been defined. For a Kubernetes-based deployment of Otto, this should be swapped out for a Kubernetes-friendly Orchestrator which implements the orchestration API. In an AWS or Azure-based deployment, this might instead be provided by a "serverless" function.

+------------+     +-----------+        +------------+
|  Frontend  +-----> Event Bus +-------->Orchestrator|
+--------^---+     +-----------+        +----------+-+
         |                                         |
         |                                       +-v------+
         |                +---------------+------+ Agents |
         |                |               |      +--------+
         |                |               |       +--------+
         |                |               |        +--------+
         |                |               |
         |    +-----------v---+   +-------v------+
         +--->|Relational Data|   |Object Storage|
              |Storage        |   +--------------+
              +---------------+

Each of the above boxes represents a process within Otto which would have its own API, rather than a logical component which might be satisfied by an off-the-shelf solution.

To support this, all inter-process communication must happen over HTTP using the JSON format.
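
For illustration only, the sketch below shows what such an exchange might look like: one process submitting a workload description to the Orchestrator as an HTTP POST with a JSON body. The endpoint, port, field names, and the choice of Go are assumptions made for this example; this RFC intentionally leaves the concrete formats and protocols of each API unspecified.

// Hypothetical sketch only: the endpoint, port, and payload shape are
// assumptions for illustration; RFC-0001 does not define them.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// WorkloadRequest is a made-up payload an Orchestrator might accept.
type WorkloadRequest struct {
	Pipeline string `json:"pipeline"`
	Revision string `json:"revision"`
}

func main() {
	payload, err := json.Marshal(WorkloadRequest{
		Pipeline: "build-and-test",
		Revision: "abc123",
	})
	if err != nil {
		panic(err)
	}

	// All inter-process communication in Otto happens over HTTP with JSON.
	resp, err := http.Post("http://orchestrator:8080/v1/workloads",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("orchestrator responded with", resp.Status)
}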

All processes are stateless and rely on the relational and object storage for any longer-term persistence. No other system is the source of record.

Motivation

The motivations for Otto's system design fall into three major areas:

  1. Safe extensibility, allowing different components to be added or removed without affecting the other parts of the system.

  2. Scalability, allowing different components to scale horizontally as workloads demand it.

  3. Cost, allowing different components to execute only as needed.

The section below contains more reasoning on how these motivations relate to other systems and their strengths and weaknesses.

Reasoning

When considering potential designs for Otto, the most significant influence was the design and implementation of Jenkins. As the predominant open source CI/CD server, Jenkins has succeeded in gaining adoption by being simple to use, but has struggled to adapt to larger-scale deployments, or cloud-native environments. Design inspiration also comes from Kubernetes, Kafka, and commonly understood microservice architecture patterns.

Considering the motivations described above, the reasoning is expanded in the following sections.

Safe extensibility

Extensibility is deemed "safe" when it avoids side effects on other pre-existing components of the system. A compromise, crash, or error in a single component must not impact other functional areas of the system. In Otto, this is accomplished by using process boundaries, even in single-logical-host deployments. This process-external extensibility model is most successfully and prominently demonstrated in the design and implementation of Kubernetes.

The alternative is to incorporate process-internal extensibility, a model used by many plugin-oriented systems such as MySQL, Redis, and, to a larger extent, Jenkins. The key benefit of this process-internal extensibility model is that it appears simpler at a superficial level. Since the introduction of the dlopen(3) library call, users and administrators have been able to augment the functionality of programs with dynamically loaded code. In all cases, this compromises the safety of the parent process. The program effectively trusts the loaded code as if it were its own, thereby granting it access to all of its internal data structures and ABIs. In the case of Jenkins in particular, this has resulted in core internal APIs becoming de facto public APIs, as plugin authors discover classes and methods which suit their purposes.

An external-only extensibility system requires that all actors create explicit public APIs, while allowing them the freedom of differing implementations.

The process-internal approach typically imposes implementation restrictions on the author of the plugin or module. With native applications like MySQL or Redis, it is possible to write a module in a language other than C or C++ so long as the ABI is respected and the module compiles to native code. Similarly, Jenkins, although implemented on the JVM, will accept any plugin that can operate within the confines of the Java environment. In practice, process-internal extensibility systems tend to result in a plugin/module monoculture, since matching the host's implementation language offers the path of least resistance.

Otto's process-external extensibility model allows significant flexibility, as practically every programming language and environment can implement HTTP services. Even within Otto, it may prove useful to operate services implemented in different languages depending on the runtime environment. Process-external extensibility allows this flexibility for all actors in the system, including the system itself.
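
To make the shape of an external extension concrete, the following is a minimal, hypothetical sketch, written in Go purely as an example: a standalone process exposing its own HTTP/JSON endpoint that other Otto components could call. The route and response are invented for illustration; any language capable of serving HTTP could provide the equivalent.

// Hypothetical sketch of an external Otto extension: a standalone process
// exposing its own HTTP/JSON API. The route and response shape are
// illustrative assumptions, not part of RFC-0001.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/v1/status", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Because the extension runs in its own process, a crash or
		// compromise here cannot corrupt the state of other components.
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}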

Scalability

Scalability is a key concern for the design of Otto. At the present time it is not expected to be deployed as a SaaS in a truly multi-tenant environment supporting vastly different users. There are, however, important benefits to approaching the system design as if it will need to evolve into a SaaS model at some point.

By structuring Otto as separate stateless processes, each process can be auto-scaled as workloads demand it. Additionally, by structuring the "edges" between the different processes to require no single central point of transit, different areas of the system will be able to grow as needed without requiring the entire system to scale vertically. In contrast, consider the implementation of Jenkins Pipeline. Under the hood, Jenkins Pipeline is executed by a Groovy runtime running on the Jenkins master, with portions delegated to the agents. It is effectively a very traditional hub-and-spoke systems architecture, with the hub (the master) devoting non-zero compute resources to interpreting the Jenkins Pipeline script. Like any hub-and-spoke architecture, this has inherent scale limitations as the hub becomes over-utilized.

Note: The "Event Bus" in the Otto systems design runs the risk of becoming a central hub, crucial to all other operations within the system and thereby causing some scaling challenges.

Cost

Cost to operate is another key concern for the design of Otto. The separation of distinct processes is intended to allow Otto to graft naturally onto existing "serverless" offerings from various cloud providers. Rather than a design which requires a virtual machine running a web server to receive webhooks, Otto's receiver can run via AWS Lambda and only "come alive" to receive the payload, before executing whatever work it must do in response to the webhook. As a reactive system, Otto is far cheaper to operate than a fixed-footprint infrastructure.
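
As a rough sketch of that deployment style, the handler below shows how a hypothetical webhook receiver might be packaged as an AWS Lambda function behind API Gateway, using Go and the aws-lambda-go library as an illustrative assumption; the same receiver could equally be a plain HTTP process in a fixed-footprint deployment.

// Hypothetical sketch of Otto's webhook receiver packaged as an AWS Lambda
// function behind API Gateway. The payload handling is an assumption for
// illustration; RFC-0001 does not define the webhook format.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	// The function only "comes alive" for the duration of the webhook;
	// any follow-up work would be handed off to other Otto processes.
	log.Printf("received webhook payload of %d bytes", len(req.Body))
	return events.APIGatewayProxyResponse{StatusCode: 202, Body: "accepted"}, nil
}

func main() {
	lambda.Start(handler)
}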

The isolation of system state further helps Otto bind into reactive, or "serverless", infrastructure components. As nothing except storage is effectively persistent, the other components can be operated only as needed.

Security

Securing this system is a sufficiently complex topic to merit its own RFC.

References
