R. Tyler Croy 267c9664b2 It doesn't seem like we need to collect events, abandoning proposal 2017-06-16 07:35:36 -07:00

<html lang="en"> <head> </head>

IEP-5: Systematic collection of project events

Table 1. Metadata

IEP       5
Title     Systematic collection of project events
Author    R. Tyler Croy
Status    🔥 Abandoned
Type      Service
Created   2016-12-28

Abstract

Considering the size and velocity of the Jenkins project, it can be difficult to systematically determine its overall health through manual checks, scraping of various APIs, or other non-automated means. Additionally, without a project-owned corpus of data, it is burdensome to answer questions such as:

  • What's the level of development activity in core and in class A/B/C plugins? How is it changing over time?

  • Who are the seasoned contributors?

  • Who are the new contributors that we can reach out to and help?

  • What's the typical journey of a plugin developer?

Specification

The primary source of project events is currently GitHub, which provides organization-wide webhooks. This specification focuses primarily on these events but can be readily extended when new sources of information arise.

Technologies Introduced

This specification introduces a few technologies which are not currently part of the Jenkins project infrastructure:

  • Azure Functions

  • Azure EventHub

  • Azure DocumentDB

The motivations for selecting each of these technologies are discussed further in the Rationale section.

Component Diagram
+--------------------+                  +----------------------------+            +--------------+
| GitHub (jenkinsci) +----webhook------>|  github-event-queue (App)  +--enqueue-->|   EventHub   |
+--------------------+           +----->+----------------+-----------+            | (all events) |
+------------------------+       |                       |                        +--------------+
| GitHub (jenkins-infra) +-------+                       |
+------------------------+                               |
                                          +----append----+
                                          |
                           +--------------v--------+
                           |  DocumentDB (github)  |
                           +-----------------------+
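
The flow in the diagram can be sketched in Python as a receiver that hands each document to two independent sinks. This is only an illustration under stated assumptions: `fan_out`, `enqueue_to_eventhub`, and `append_to_documentdb` are hypothetical names, and the sinks are in-memory lists standing in for the real Azure services.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, object]
Sink = Callable[[Document], None]


def fan_out(document: Document, sinks: Iterable[Sink]) -> int:
    """Deliver one document to every sink and return how many sinks received it."""
    delivered = 0
    for sink in sinks:
        sink(document)
        delivered += 1
    return delivered


# Hypothetical stand-ins for the EventHub and the DocumentDB collection.
eventhub_queue: List[Document] = []
documentdb_github: List[Document] = []


def enqueue_to_eventhub(document: Document) -> None:
    eventhub_queue.append(document)


def append_to_documentdb(document: Document) -> None:
    documentdb_github.append(document)


example = {"source": "github", "type": "pull_request", "event": {}}
count = fan_out(example, [enqueue_to_eventhub, append_to_documentdb])
```

Keeping the sinks as plain callables means a future consumer could be added without changing the receiver itself.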

Data Format

The following describes the JSON format to be expected by applications querying either DocumentDB or consuming from an EventHub.

JSON Event/Document Format
{
    "source"   : "github", // (1)
    "type"     : "pull_request", // (2)
    "event"    : {}, // (3)
    "received" : "2017-01-05T21:56:04.522Z" // (4)
}
  1. Original source of the event; "github" indicates the event originated as a GitHub webhook.

  2. Type of event; for the github source, this will be one of the webhook event types.

  3. The actual JSON payload of the event.

  4. The ISO-8601 timestamp indicating when the event was received by the Azure Functions app.
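
As a rough sketch of how a receiver might build this envelope, the Python below wraps a GitHub delivery in the four fields described above. The function names are hypothetical and not taken from the reference implementation. The event type is read from the `X-GitHub-Event` header that GitHub sends with each webhook delivery.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Mapping


def iso8601_millis(moment: datetime) -> str:
    """Format an aware datetime as UTC ISO-8601 with millisecond precision."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def wrap_github_event(headers: Mapping[str, str], payload: Dict[str, Any],
                      received: datetime) -> Dict[str, Any]:
    """Build the envelope described above from a GitHub webhook delivery."""
    return {
        "source": "github",
        # Assumes the header mapping uses this exact capitalisation; a real
        # HTTP framework would normally offer case-insensitive lookup.
        "type": headers["X-GitHub-Event"],
        "event": payload,
        "received": iso8601_millis(received),
    }


received_at = datetime(2017, 1, 5, 21, 56, 4, 522000, tzinfo=timezone.utc)
document = wrap_github_event({"X-GitHub-Event": "pull_request"}, {}, received_at)
print(json.dumps(document))
```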

This table should be updated when new sources and types are added:

Table 2. Event Sources
Identifier   Description                Types
github       A GitHub webhook payload   One of the webhook event types

Storage Capacity

The cost of Azure DocumentDB storage is based only on what is actually consumed rather than what is provisioned. Therefore each DocumentDB provisioned for events should be at minimum 25GB.

Time-to-live

Azure DocumentDB supports a document-level or collection-level time-to-live (TTL), which can automatically purge documents after the TTL expires. Until we better understand the volume of event data the project wishes to store, this will remain off.
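
To make the TTL behaviour concrete, here is a small Python model of how a collection-level default and a document-level override are understood to combine. It is an illustrative sketch, not DocumentDB code. It assumes documents carry a `_ts` last-modified timestamp in epoch seconds and an optional `ttl` field in seconds, and `is_expired` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def is_expired(document: Dict[str, Any], default_ttl: Optional[int], now: int) -> bool:
    """Return True if the document would be eligible for purging at time `now`.

    `default_ttl` models the collection-level setting: None means TTL is off,
    -1 means TTL is on without a default expiry, and a positive value is a
    lifetime in seconds.
    """
    if default_ttl is None:
        return False  # TTL off for the collection: nothing is purged
    ttl = document.get("ttl", default_ttl)  # a document-level value overrides
    if ttl == -1:
        return False  # this document never expires
    return now - document["_ts"] >= ttl


old_event = {"source": "github", "_ts": 1_000}
kept_with_ttl_off = is_expired(old_event, None, now=10_000_000)
purged_with_one_day = is_expired(old_event, 86_400, now=10_000_000)
```

With TTL left off, as proposed here, the first call reports that even a very old document is retained.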

Consistency

Until there is a compelling reason to change the consistency level for the DocumentDB resources, the default of session consistency should be used.

Monitoring

The monitoring facilities built into Azure Functions do not integrate with any of the existing monitoring tools in use by the Jenkins project. Azure Functions can, however, output diagnostic logs and web server logs into an Azure storage account. This is not scoped in this document because it is not yet clear whether these logs are necessary and worth the added cost of storing them.

Azure DocumentDB does not have a supported integration with DataDog.

Until there is more support in DataDog or Azure for better monitoring, these services will not be automatically monitored.

Motivation

By provisioning relatively simple webhook receivers which not only archive event data into a live-queryable datastore (DocumentDB), but also publish those events on an Azure EventHub, the Jenkins project will have an easy-to-access data store of project events. Additionally, the events enqueued into EventHub can act as an input for future services (for example, a project health dashboard) without requiring additional infrastructure to be provisioned.

Rationale

Technologies Introduced

An underlying rationale for using all of the Azure-specific technologies referenced below is that it is cheaper, easier, and faster to use platform-level services provided by Azure rather than implementing and hosting the features of each technology itself.

Azure Functions

Implementing a GitHub webhook receiver in Azure Functions is sufficiently trivial that it is one of their "Get Started" examples, and as such Azure Functions has explicit support and enhancements for receiving webhook payloads from GitHub.

Additionally, Azure Functions represent a "slice" of computation which is suitable for the purpose of receiving a JSON payload, processing it, and storing it for later. The alternative would be implementing a new web application specifically for this purpose, which would need either its own virtual machine or container infrastructure in order to execute.

Azure EventHub

The use of Azure EventHub in the architecture described above is more future-proofing than a strong requirement to solve the problem at hand. The assumption is that additional services in the future will wish to consume some or all of the events received by the deployed Azure Functions app.

The most practical means of providing this service internally is through a pub/sub mechanism which EventHubs provide in Azure. EventHubs also provide the added benefit of automatically expiring old messages along with many other valuable queueing features such as consumer groups and partitions.
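
The idea of partitions and consumer groups can be illustrated with a toy in-memory model in Python. This is not the EventHub API: `ToyEventHub`, the CRC32-based partition choice, and the consumer group names are all assumptions made for the sketch, and message expiry is left out.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyEventHub:
    """In-memory model: partitioned append-only logs plus per-group read offsets."""

    def __init__(self, partition_count: int) -> None:
        self.partitions: List[List[dict]] = [[] for _ in range(partition_count)]
        # (consumer group, partition index) -> next offset to read
        self.offsets: Dict[Tuple[str, int], int] = defaultdict(int)

    def publish(self, event: dict, partition_key: str) -> int:
        """Append an event; the same key always maps to the same partition."""
        index = zlib.crc32(partition_key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(event)
        return index

    def read(self, consumer_group: str, partition: int) -> List[dict]:
        """Return events this consumer group has not yet seen on a partition."""
        start = self.offsets[(consumer_group, partition)]
        events = self.partitions[partition][start:]
        self.offsets[(consumer_group, partition)] = start + len(events)
        return events


hub = ToyEventHub(partition_count=2)
p = hub.publish({"type": "push"}, partition_key="jenkinsci/jenkins")
dashboard_first = hub.read("dashboard", p)    # sees the event
dashboard_second = hub.read("dashboard", p)   # nothing new for this group
archiver_first = hub.read("archiver", p)      # a separate group sees it too
```

Each consumer group tracks its own position, which is what would let several future services read the same stream independently.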

Azure DocumentDB

GitHub webhook event payloads are constructed as JSON, and it is expected that any subsequent events will be consumed as JSON. As such, a document-oriented database ("NoSQL") is preferred in order to avoid time-consuming schema updates.
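
As an illustration of working with the stored JSON directly, the Python sketch below counts opened pull requests per author across a list of documents in the envelope format from the Data Format section. It runs over plain dictionaries rather than a real DocumentDB query. The function name and the sample logins are hypothetical, while `action` and `pull_request.user.login` are fields of GitHub's `pull_request` webhook payload.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def opened_pull_requests_by_author(documents: Iterable[Dict[str, Any]]) -> Counter:
    """Count opened pull requests per GitHub login across stored documents."""
    counts: Counter = Counter()
    for doc in documents:
        if doc.get("source") != "github" or doc.get("type") != "pull_request":
            continue
        event = doc.get("event", {})
        if event.get("action") != "opened":
            continue
        login = event.get("pull_request", {}).get("user", {}).get("login")
        if login:
            counts[login] += 1
    return counts


def pr(action: str, login: str) -> Dict[str, Any]:
    """Build a minimal sample document for the demonstration below."""
    payload = {"action": action, "pull_request": {"user": {"login": login}}}
    return {"source": "github", "type": "pull_request", "event": payload}


sample = [
    pr("opened", "alice"),
    pr("opened", "alice"),
    pr("closed", "alice"),
    pr("opened", "bob"),
    {"source": "github", "type": "push", "event": {}},
]
counts = opened_pull_requests_by_author(sample)
```

A tally like this is one possible starting point for questions such as identifying seasoned and new contributors.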

Costs

The pricing for Azure Functions is confusing on its own, and without an existing Functions app consuming the jenkinsci events it is difficult to evaluate what the runtime cost would be. That said, Azure provides 1 million monthly executions for free, meaning the Functions app itself will cost little or nothing.

The noticeable cost of this proposal will come from the EventHub and DocumentDB storage and transit rates.

DocumentDB

The storage rate, in East US, is $0.25 per GB / month. The throughput rate, in East US, is $0.008/hr per hundred Request Units per second.

Assuming each DocumentDB instance is provisioned with 5GB of storage, the annual storage cost will be roughly $15, though this is likely to go up as more data is stored over time.
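
The storage estimate follows directly from the quoted rate, as this short Python calculation shows. The helper name is hypothetical, and the 25GB case simply applies the same rate to the minimum size named under Storage Capacity, assuming it were fully consumed.

```python
from decimal import Decimal

STORAGE_RATE_PER_GB_MONTH = Decimal("0.25")  # East US rate quoted above


def annual_storage_cost(gigabytes: int) -> Decimal:
    """Annual cost in USD for a constant amount of consumed storage."""
    return STORAGE_RATE_PER_GB_MONTH * gigabytes * 12


cost_5gb = annual_storage_cost(5)    # 0.25 * 5 * 12
cost_25gb = annual_storage_cost(25)  # 0.25 * 25 * 12
```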

The annual cost of throughput is difficult to ascertain without real-world usage, but is expected to remain under $100 barring dramatic shifts in expected usage.

EventHub

The cost per million ingress events in East US is a paltry $0.028, so not worth discussing.

The throughput unit cost (1MB ingress, 2MB egress) comes to an annual cost of $133.92.
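
The annual figure can be reproduced with a short Python calculation, under two assumptions that are not stated in this document: an hourly throughput unit rate of $0.015 and a 744-hour (31-day) billing month. Both are back-derived from the $133.92 total and should be checked against current Azure pricing.

```python
from decimal import Decimal

INGRESS_RATE_PER_MILLION = Decimal("0.028")       # quoted above, East US
# Assumption: hourly rate back-derived from the annual figure, not quoted here.
THROUGHPUT_UNIT_RATE_PER_HOUR = Decimal("0.015")
HOURS_PER_BILLING_MONTH = 744                     # assumption: 31 days x 24 hours


def annual_throughput_unit_cost(units: int = 1) -> Decimal:
    """Annual cost in USD of keeping `units` throughput units provisioned."""
    return THROUGHPUT_UNIT_RATE_PER_HOUR * HOURS_PER_BILLING_MONTH * 12 * units


def ingress_cost(events: int) -> Decimal:
    """Cost in USD of ingesting the given number of events."""
    return INGRESS_RATE_PER_MILLION * Decimal(events) / Decimal(1_000_000)


yearly_unit_cost = annual_throughput_unit_cost()
ten_million_events = ingress_cost(10_000_000)
```

Even ten million ingress events would cost well under a dollar at the quoted rate, so the provisioned throughput unit dominates the EventHub cost.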

Reference implementation

The reference implementation of the Azure Functions app can be found in the jenkins-infra/analytics-functions repository.

The Terraform for actually provisioning the Azure Functions app in Azure can be found in this pull request.

</html>