Compare commits

...

26 Commits

Author SHA1 Message Date
Dmytro Suvorov fe3d9fec6f
Merge 9d028be1d3 into d0c64237c0 2024-03-19 13:05:24 -04:00
R Tyler Croy d0c64237c0
Merge pull request #134 from scribd/bens-mlplat
Evolution Of Machine Learning Platforms
2024-02-26 13:01:23 -08:00
ben 441d41a44a refined scribs platform section 2024-02-15 17:18:12 -08:00
Ben Shaw 9a88cbaaf3
Update 2024-02-05-evolution-of-mlplatform.md
Reduce cruft
2024-02-09 17:53:45 -08:00
Ben Shaw 698a85802a
Update 2024-02-05-evolution-of-mlplatform.md
Fix references
2024-02-09 17:43:05 -08:00
Ben Shaw 65e33686d9
Update 2024-02-05-evolution-of-mlplatform.md
fix links and add details about scribds ml platform
2024-02-09 17:25:59 -08:00
Ben Shaw 88cce8e236
Update 2024-02-05-evolution-of-mlplatform.md
[WIP] refactor links and move benefits to bottom with more specific examples
2024-02-08 19:59:01 -08:00
Ben Shaw 48b6ec515f
Update 2024-02-05-evolution-of-mlplatform.md 2024-02-06 20:22:46 -08:00
ben e628fcaf7a update date till monday will release then 2024-02-02 17:58:24 -08:00
ben 9ea9a10f9d always fixing 2024-02-02 17:55:19 -08:00
ben 817346f143 fixed tags 2024-02-02 17:44:29 -08:00
Ben Shaw d7b3e02599
Update 2024-02-01-evolution-of-mlplatform.md 2024-02-02 15:47:03 -08:00
Ben Shaw 62c25c6a5e
Update authors.yml 2024-02-02 15:45:23 -08:00
Ben Shaw 21f6da9218
Update 2024-02-01-evolution-of-mlplatform.md 2024-02-02 15:40:49 -08:00
Ben Shaw 9937bf0937
Update 2024-02-01-evolution-of-mlplatform.md 2024-02-02 10:07:54 -08:00
Ben Shaw 264713b089
Update 2024-02-01-evolution-of-mlplatform.md 2024-02-02 10:07:04 -08:00
Ben Shaw fa37acd9e3
Update 2024-02-01-evolution-of-mlplatform.md 2024-02-02 10:06:29 -08:00
Ben Shaw 91a5328a67
Update 2024-02-01-evolution-of-mlplatform.md
fixing formatting
2024-02-01 11:00:24 -08:00
ben 6875996ff4 fix formatting 2024-02-01 10:57:59 -08:00
ben 6312feceb8 remove formatting paste 2024-02-01 10:57:16 -08:00
ben c52ab4558d remove title of contents 2024-02-01 10:52:30 -08:00
ben 7551e7311e added evolution of ml platform 2024-02-01 10:49:24 -08:00
Dmitry Suvorov 9d028be1d3 rephrased some sentences and fixed image name 2023-06-15 15:17:11 +03:00
Dmytro Suvorov 6b4985e147
Update _posts/2023-06-07-airflow-upgrade-to-2-5-3.md
Co-authored-by: Maksym Dovhal <maksymd@scribd.com>
2023-06-15 11:49:43 +03:00
Dmytro Suvorov 0f58fc52bc
Update _posts/2023-06-07-airflow-upgrade-to-2-5-3.md
Co-authored-by: Maksym Dovhal <maksymd@scribd.com>
2023-06-15 11:49:35 +03:00
Dmitry Suvorov 970a5ecab2 [PE-3860]: A blog about our Airflow upgrade journey
Here I've collected all main actions we have done, issues we have faced and solutions we have made during Airflow upgrade from version `2.2.0` to `2.5.3`
2023-06-14 19:31:41 +03:00
6 changed files with 280 additions and 0 deletions

@ -3,6 +3,13 @@
# description, etc
---
bshaw:
name: Ben Shaw
github: benshaw
twitter: ben_a_shaw
about: |
Ben leads the ML Platform group, helping scale production Machine Learning at Scribd. Other times you will find him outside playing in the mountains.
alexjb:
name: Alex Bernardin
github: alexofmanytrades

@ -0,0 +1,134 @@
---
layout: post
title: "Airflow upgrade from 2.2.0 to 2.5.3 journey"
author: dimonchik-suvorov
tags:
- featured
- airflow-series
- aws
team: Data Platform
---
Let me continue QP's `airflow-series` rubric and tell you a story of one Airflow upgrade. In this post I'm going to describe what infrastructure we had before, how we improved it and what difficulties we faced along the way, so let's go!
### At the beginning there were...
Main parts of our Airflow infrastructure:
- Old but gold Airflow `2.2.0`
- Our private fork with custom changes that we didn't upstream yet
- Airflow Image based on `python3.7` Docker image stored in AWS ECR (Elastic Container Registry)
- We install Airflow in `--editable` mode each time to apply our custom changes
- Scheduler and Web Server running on AWS ECS Fargate (Elastic Container Service)
- Terraform for configuring ECS tasks and Airflow params
- Kubernetes Executor on AWS EKS `1.24` (Elastic Kubernetes Service)
- Airflow backend - MySQL `5.7` on AWS RDS (Relational Database Service)
![](/post-images/2023-06-07-airflow-upgrade-to-2-5-3/airflow-highlevel-architecture.png)
<font size="3"><center><i>High-level architecture of our Airflow infrastructure</i></center></font>
Also, we have a couple of our own Airflow plugins like custom operators, hooks and _Self-Service Backfill UI_ (you can learn more about this in our [Airflow Summit 2021 slides](https://airflowsummit.org/slides/2021/e5_6-ModernizeAirflow-Scribd.pdf)).
A Jenkins pipeline builds the Airflow Docker image and applies Terraform to restart the ECS tasks with the new configuration and Airflow image.
### Why?
We decided to upgrade for the new features and bug fixes, obviously, but what we wanted most was multiple Schedulers: our Fargate instance was close to its maximum available resources, and we suffered from constantly high CPU utilization.
This also forced a MySQL upgrade, because multiple Schedulers don't work with MySQL 5.7, which doesn't support the `SKIP LOCKED` or `NOWAIT` SQL clauses.
On top of all that, we decided it would be a good idea (which, as you may guess, it wasn't) to also bump the Python version from `3.7` to `3.10`.
### As promised - the journey!
Before anything else, we went through all the [Release Notes](https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html) from 2.2.0 to 2.5.3, simply to understand what to expect, and created a list of issues we might face. We thought we were prepared and fearlessly made the first move.
#### Database upgrade
We decided to bump our MySQL version from `5.7` to `8` ahead of the Airflow upgrade. This didn't cause any trouble, because Airflow `2.2.0` works fine with MySQL `8`. Another reason to bump the database version in advance was to reduce the complexity of the Airflow upgrade and the time it would take.
#### Testing - yes / no / maybe so?
From time to time it's good to test everything before upgrading it on Production. To test all integrations, we created a separate Terraform module that takes our development environment (ECS tasks for the Scheduler and Webserver, a database restored from a snapshot) and creates all the infrastructure Airflow needs right beside Dev, without any overlap with it, so we could test everything without fear of breaking things. The main beauty of this is quick recreation: ~5 minutes to redeploy it from scratch in case something goes irreparably wrong. Separately, we also tested some parts locally using Breeze and our own custom Airflow image build.
After testing the integrations and individual parts, our next step was `performance testing`. We didn't enable the multi-scheduler feature, because we wanted to see how a single Scheduler performs and compare the results. Spoiler: it works almost the same so far. It uses cores and memory a bit more intensively now (because we made a few additional configuration changes), but the issue we had before is gone. The issue was that the old Scheduler used 100% of its CPUs from time to time, and only restarting it helped (the big long spike on the screenshot).
For performance testing we created an autogenerated DAG with more than 1000 dummy PythonOperators (approximately the number of tasks in our main DAG). Those operators run a random `sleep` to test concurrency.
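An autogenerated load-test DAG like the one described above could look roughly like the sketch below. It is not our actual DAG: the DAG id, task ids, sleep range and the helper names `build_sleep_plan` and `make_perf_dag` are all assumptions made for illustration. Airflow is imported lazily inside `make_perf_dag` so that the plan generation can be checked without an Airflow installation.

```python
import random
import time


def build_sleep_plan(n_tasks=1000, max_sleep_s=30.0, seed=0):
    """Map hypothetical task ids to random sleep durations in seconds."""
    # Seeded, so every parse of the DAG file produces the same plan.
    rng = random.Random(seed)
    return {
        f"dummy_task_{i:04d}": round(rng.uniform(1.0, max_sleep_s), 1)
        for i in range(n_tasks)
    }


def make_perf_dag(dag_id="perf_test_dummy", n_tasks=1000):
    """Build the load-test DAG (requires Airflow 2.4+ for `schedule`)."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id=dag_id,
        start_date=datetime(2023, 1, 1),
        schedule=None,  # triggered manually for each test run
        catchup=False,
    ) as dag:
        # Independent tasks with no dependencies, so concurrency is the
        # only thing limiting how fast they run.
        for task_id, seconds in build_sleep_plan(n_tasks).items():
            PythonOperator(
                task_id=task_id,
                python_callable=time.sleep,
                op_args=[seconds],
            )
    return dag
```

In a real DAG file the result has to be assigned at module level, for example `dag = make_perf_dag()`, because Airflow only picks up DAG objects it finds in the module's global namespace.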
![](/post-images/2023-06-07-airflow-upgrade-to-2-5-3/ecs-performance-metrics.png)
<font size="3"><center><i>AWS ECS performance metrics</i></center></font>
#### Updating the code base
We started by merging `2.5-stable` into our main fork branch. After all merge conflicts were resolved (the easiest part of the upgrade, I'd say) we started testing...
This wasn't exactly what you'd call a smooth upgrade: switching to the new Airflow version caused a lot of errors in our plugins and services that rely heavily on the Airflow core/providers code.
But we had expected most of the issues, because we had gone through the Airflow Release Notes.
Issues we knew about and/or caught during the testing were:
- in our custom DAGs dump code - internal structure of the DAG class slightly changed (like `_BaseOperator__init_kwargs`, `TaskGroup`, `ParamsDict`, etc.), and it caused compilation errors
- custom DAGs validations (DummyOperator deprecated in favor of EmptyOperator)
- in custom operators' `execute` function, where we were using TaskInstance `key` and it changed to DagRun's `run_id`
- some of `timetable` functions also changed, and we had to find new ways to do what we did before merging (like `dag.timetable.infer_data_interval(your_execution_date).end` became `dag.timetable.infer_manual_data_interval(run_after=exec_dt).start`)
- `node:12.22.6` Docker image we used for building npm (`airflow/www/static/dist`) was too old for the new and shiny Airflow (switched to the `node:16.0.0`)
- `TriggerRuleDep` changed, and our custom rules that were using/overriding `_get_dep_statuses` and `_evaluate_trigger_rule` also started to fail
- [Okta integration](https://tech.scribd.com/blog/2021/integrating-airflow-and-okta.html) started to fail because of new Flask AppBuilder (simply adding `server_metadata_url` to existing configuration solved the issue)
- some Airflow configuration params were changed:
- `AIRFLOW__CORE__SQL_ALCHEMY_CONN` -> `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`
- Airflow config section `kubernetes` renamed to `kubernetes_executor` and all related environment variables configurations started with `AIRFLOW__KUBERNETES__...` changed accordingly to `AIRFLOW__KUBERNETES_EXECUTOR__...`
- `AwsLambdaHook` changed to `LambdaHook` in the `AWS` provider. The `function_name` parameter was moved from the `AwsLambdaHook.__init__()` function to the `invoke_lambda` function (putting it closer to the execution)
- BaseOperator's `task_concurrency` parameter changed to `max_active_tis_per_dag`
- `ResultProxy` and `RowProxy` classes from `sqlalchemy.engine.result` changed their names to the `Result` and `Row`
- DAG's `schedule_interval` param changed to the `schedule`
- our Unit tests also broke because we mocked TaskInstance and other classes but their internal structure changed
That wasn't so hard, agree? By the way, during the concurrency and performance testing we learned that our Kubernetes executor wasn't optimally configured: the default `AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE=1` didn't meet our needs. We bumped this param to `50`, and starting new tasks became significantly faster.
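The configuration renames from the list above can be applied mechanically. The sketch below is only an illustration: the two rename tables contain just the renames mentioned in this post, and the helper name `migrate_env` is hypothetical. It takes a mapping of environment variables (for example the ones defined for an ECS task) and returns a copy with the old Airflow names replaced.

```python
# Renames mentioned in this post; extend the tables with whatever else
# your own deployment sets.
EXACT_RENAMES = {
    "AIRFLOW__CORE__SQL_ALCHEMY_CONN": "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN",
}
PREFIX_RENAMES = {
    "AIRFLOW__KUBERNETES__": "AIRFLOW__KUBERNETES_EXECUTOR__",
}


def migrate_env(env):
    """Return a copy of env with pre-upgrade Airflow variable names renamed."""
    migrated = {}
    for name, value in env.items():
        if name in EXACT_RENAMES:
            name = EXACT_RENAMES[name]
        else:
            for old, new in PREFIX_RENAMES.items():
                if name.startswith(old):
                    # Keep the option name, swap only the section prefix.
                    name = new + name[len(old):]
                    break
        migrated[name] = value
    return migrated
```

Running it twice is harmless: the new `AIRFLOW__KUBERNETES_EXECUTOR__` prefix does not start with the old `AIRFLOW__KUBERNETES__` prefix, so already migrated names are left alone.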
#### Apply DB migrations gently
In parallel with testing the code, we tested Airflow-to-database connectivity (the `airflow db check` command) and how Airflow database migrations apply to our database (`airflow db upgrade`). How to upgrade the Airflow database is described [here](https://airflow.apache.org/docs/apache-airflow/2.5.3/installation/upgrading.html). What we learned:
- some providers changed their packages; the most significant change was that the Airflow folks introduced the `common.sql` provider and moved some functionality there from the `databricks` provider (which we also use a lot). Because of [Broken installation with --editable flag](https://github.com/apache/airflow/issues/30764) we faced an issue with Airflow's communication with the database. I've described a few workarounds in the ticket, but the root cause was that we upgraded Python (and, as a consequence, the pip version).
- the new Docker image with Python `3.10` came with the new Debian `bullseye`, and there was a change important for us: the `mysql-client` package is no longer shipped and was replaced by `default-mysql-client`
- `airflow db upgrade` generated `dangling` tables and put there all DAGs, TaskInstances and other entities that couldn't be migrated, for example a DAG without an ID or execution_date. Don't ask me how that happened, but we reviewed these tables and successfully dropped them.
- applying migrations took ~2 hours on our weak dev RDS cluster and we thought it would be faster on production... boy oh boy, were we wrong...
- our plain Airflow `User` role lost access to the custom _Self-Service Backfill UI_. Solved this by setting `AIRFLOW__WEBSERVER__UPDATE_FAB_PERMS` to `False` because otherwise it drops custom permissions each time Airflow Web Server restarts
#### Ship it!
Finally, everything was tested and fixed. The time has come to upgrade Airflow on production! All DAGs paused, ECS tasks stopped. Production database snapshot created.
###### Database migration
We started applying migrations on the production database. To do this, we prepared another ECS task that runs the `airflow db upgrade` command. We successfully deployed it via Terraform and started to wait... and wait... and wait a little bit more...
After ~5 hours we started to wonder why it was taking so long... We realized that:
- our production Airflow database RDS cluster is only twice as big as our development one
- we have a giant database with data going back to 2017 (backfills mostly). Considering we had a `sentinel` DAG that checks the Scheduler is alive every 5 minutes (it sends a gauge metric to Datadog), and our main DAG with ~1k tasks running daily, you can imagine how big the `task_instance` and `dag_run` tables were...
What we decided to do:
- stop the migration
- give our production RDS cluster more resources: 32 GB RAM and 16 CPUs, compared to the 8 GB RAM and 4 CPUs we had before
- delete all data from before 2021
- rerun the migration ECS task
That helped - migration completed in ~30 minutes!
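The cleanup step above (deleting everything from before 2021) is the kind of job that is safer to run in small batches. The sketch below shows the idea on an in-memory SQLite database, so it can be run anywhere. It is not the script we used: the one-table schema is a heavily simplified stand-in for Airflow's `dag_run` table, and the helper name `purge_before` is hypothetical. On MySQL the same loop could use `DELETE ... WHERE ... LIMIT n` directly instead of the `rowid` subquery, which is SQLite-specific.

```python
import sqlite3


def purge_before(conn, table, date_column, cutoff, batch_size=10_000):
    """Delete rows older than cutoff in small batches; return rows removed."""
    # table and date_column are interpolated, so pass trusted constants only.
    sql = (
        f"DELETE FROM {table} WHERE rowid IN ("
        f"SELECT rowid FROM {table} WHERE {date_column} < ? LIMIT ?)"
    )
    total = 0
    while True:
        deleted = conn.execute(sql, (cutoff, batch_size)).rowcount
        conn.commit()  # one short transaction per batch
        total += deleted
        if deleted < batch_size:
            return total


# Toy data: one run per year from 2017 to 2023.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?)",
    [("sentinel", f"{year}-06-01") for year in range(2017, 2024)],
)

removed = purge_before(conn, "dag_run", "execution_date", "2021-01-01", batch_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM dag_run").fetchone()[0]
print(removed, remaining)  # prints "4 3"
```

In the real schema, tables such as `task_instance` reference `dag_run`, so the order of deletion matters and should be checked against the Airflow version in use. Airflow 2.3 and later also ship an `airflow db clean` command for this purpose, which was not available to us on `2.2.0`.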
###### Airflow - your turn!
Deploying the ECS tasks via Terraform with the updated Airflow image didn't cause any issues.
After that we merged all PRs with updated DAGs and related services, and checked that there were no DAG parsing errors on production.
We started all DAGs and began to observe... and things started to fail...
- [Breaking changes](https://github.com/apache/airflow/pull/26452/files#diff-55326294fdc9ce88aee820373ed658972dff4067517a9c4f59819efbbf3e3b85R27) in the SlackWebHook. Basically, they refactored the hook and switched to the Slack SDK instead of using plain HTTP. This caused `SlackRequestError: Invalid URL detected: ***` for all our Slack hook usages. To fix it we had to go through all Slack Airflow Connections, copy the `webhook_token` extra param, change the connection type to Slack Incoming Webhook and put the copied value into the `Webhook Token` field
- [Bug](https://github.com/apache/airflow/issues/31898) in the Databricks SQL operator: it creates a new connection for each query. As I understand it, a new Airflow connection object means a new Spark Session in the Databricks Warehouse. This breaks our templated queries, because in most of them the first query is `use some_database` and the next is the actual processing query. When a new connection is created for each query in a template file, the `use` statement is simply forgotten and the second query fails because it can't find tables without a database
After these issues were fixed we finally got the chance to go to sleep, because this is the end of the story. We made it!
### Conclusions
Even if you think you have tested everything, you could be wrong. We went through all the Airflow Release Notes and prevented all possible issues in core Airflow, but didn't look at breaking changes in the providers we use. Don't forget to check what changed there as well before upgrading.
Another decision we made: we now use a constraints file during Airflow installation, because without it some Python modules could be installed with newer versions during an Airflow Docker image rebuild and cause dependency conflicts.
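For readers who haven't used constraints files: the Airflow project publishes one per release and Python minor version, and pip accepts it via `--constraint`. The sketch below only builds the URL and the pip arguments. The helper names are hypothetical, and it assumes a plain install of the `apache-airflow` package from PyPI, which differs from our fork-based `--editable` setup.

```python
def constraints_url(airflow_version, python_version):
    """Build the URL of the constraints file published for an Airflow release."""
    # Constraints files are keyed by Python major.minor, e.g. "3.10".
    major, minor = python_version.split(".")[:2]
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{major}.{minor}.txt"
    )


def pip_install_args(airflow_version, python_version, extras=()):
    """Return the pip arguments for a constrained Airflow install."""
    extra_spec = f"[{','.join(extras)}]" if extras else ""
    return [
        "install",
        f"apache-airflow{extra_spec}=={airflow_version}",
        "--constraint",
        constraints_url(airflow_version, python_version),
    ]
```

For example, `pip_install_args("2.5.3", "3.10", ["amazon"])` yields the arguments for `pip install "apache-airflow[amazon]==2.5.3" --constraint <url>`, which pins every transitive dependency to the versions the release was tested with.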
Clean up your database as much as possible in advance to reduce the number of rows processed during the database migration.
Nevertheless, the upgrade is finished and now we have Airflow 2.5.3 up and running.
### Credits
- [Maksym Dovhal](https://github.com/Maks-D) for creating separate Terraform module for testing, actual testing things, fixing bugs along the way and for being deployment commander
- [Kuntal Basu](https://github.com/kuntalkumarbasu) - our infrastructure magician and guru for his help in real-time issues detection and fixing during the deployment
- [Artur Kiiko](https://github.com/arturkii) and [Lakshmi Pernapati](https://github.com/lpernapati) for participation in local/integration testing and fixing issues

@ -0,0 +1,133 @@
---
layout: post
title: "The Evolution of the Machine Learning Platform"
team: Machine Learning Platform
author: bshaw
tags:
- mlops
- featured
- ml-platform-series
---
Machine Learning Platforms (ML Platforms) have the potential to be a key component in achieving production ML at scale without large technical debt, yet ML Platforms are often poorly understood. This document outlines the key concepts and paradigm shifts that led to the conceptualization of ML Platforms, in an effort to increase understanding of these platforms and how they can best be applied.
Technical Debt and development velocity defined
-----------------------------------------------
### Development Velocity
Machine learning development velocity refers to the speed and efficiency at which machine learning (ML) projects progress from the initial concept to deployment in a production environment. It encompasses the entire lifecycle of a machine learning project, from data collection and preprocessing to model training, evaluation, validation, deployment and testing, whether for new models or for re-training, validation and deployment of existing models.
### Technical Debt
The term "technical debt" in software engineering was coined by Ward Cunningham. Cunningham used the metaphor of financial debt to describe the trade-off between implementing a quick and dirty solution to meet immediate needs (similar to taking on financial debt for short-term gain) versus taking the time to do it properly with a more sustainable and maintainable solution (akin to avoiding financial debt but requiring more upfront investment). Just as financial debt accumulates interest over time, technical debt can accumulate and make future development more difficult and expensive.
The idea behind technical debt is to highlight the consequences of prioritizing short-term gains over long-term maintainability and the need to address and pay off this "debt" through proper refactoring and improvements. The term has since become widely adopted in the software development community to describe the accrued cost of deferred work on a software project.
### Technical Debt in Machine Learning
Originally a software engineering concept, technical debt is also relevant to Machine Learning systems. In fact, the landmark Google paper suggests that ML systems have a propensity to accumulate technical debt easily.
> Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems
>
> [Sculley et al (2021) Hidden Technical Debt in Machine Learning Systems](https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems)
> As the machine learning (ML) community continues to accumulate years of experience with live systems, a wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive
>
> [Sculley et al (2021) Hidden Technical Debt in Machine Learning Systems](https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems)
Technical debt is important to consider, especially when trying to move fast. Moving fast is easy; moving fast without acquiring technical debt is a lot more complicated.
The Evolution Of ML Platforms
-----------------------------
### DevOps -- The paradigm shift that led the way
DevOps is a methodology in software development which advocates for teams owning the entire software development lifecycle. This paradigm shift from fragmented teams to end-to-end ownership enhances collaboration and accelerates delivery. DevOps has become standard practice in modern software development, and many organizations consider it an essential part of their software development and delivery processes. Some of the principles of DevOps are:
1. **Automation**
2. **Continuous Testing**
3. **Continuous Monitoring**
4. **Collaboration and Communication**
5. **Version Control**
6. **Feedback Loops**
### Platforms -- Reducing Cognitive Load
This shift to DevOps and teams owning the entire development lifecycle introduces a new challenge: additional cognitive load. Cognitive load can be defined as
> The total amount of mental effort a team uses to understand, operate and maintain their designated systems or tasks.
>
> [Skelton & Pais (2019) Team Topologies](https://teamtopologies.com/book)
The additional load that DevOps introduces, with teams owning the entire software development lifecycle, can hinder productivity, prompting organizations to seek solutions.
Platforms emerged as a strategic solution, delicately abstracting unnecessary details of the development lifecycle. This abstraction allows engineers to focus on critical tasks, mitigating cognitive load and fostering a more streamlined workflow.
> The purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy. The stream-aligned team maintains full ownership of building, running, and fixing their application in production. The platform team provides internal services to reduce the cognitive load that would be required from stream-aligned teams to develop these underlying services.
>
> [Skelton & Pais (2019) Team Topologies](https://teamtopologies.com/book)
> Infrastructure Platform teams enable organisations to scale delivery by solving common product and non-functional requirements with resilient solutions. This allows other teams to focus on building their own things and releasing value for their users
>
> [Rowse & Shepherd (2022) Building Infrastructure Platforms](https://martinfowler.com/articles/building-infrastructure-platform.html)
### ML Ops -- Reducing technical debt of machine learning
The ability of ML systems to rapidly accumulate technical debt has given rise to the concept of MLOps. MLOps is a methodology that takes inspiration from and incorporates the best practices of DevOps, tailoring them to address the distinctive challenges inherent in machine learning. MLOps applies the established principles of DevOps to machine learning, recognizing that only a small fraction of a real-world ML system is the actual ML code, and serves as a crucial bridge between development and the ongoing intricacies of maintaining ML systems.
MLOps provides a collection of concepts and workflows designed to promote efficiency, collaboration, and sustainability of the ML lifecycle. Correctly applied, MLOps can play a pivotal role in controlling technical debt and ensuring the efficiency, reliability, and scalability of the machine learning lifecycle over time.
Scribd's ML Platform -- MLOps and Platforms in Action
-------------------------------------
At Scribd we have developed a machine learning platform which provides a curated developer experience for machine learning developers. This platform has been built with MLOps in mind which can be seen through its use of common DevOps principles.
1. **Automation:**
* Applying CI/CD strategies to model deployments through the use of Jenkins pipelines which deploy models from the Model Registry to AWS based endpoints.
* Automating model training through the use of Airflow DAGs and allowing these DAGs to trigger the deployment pipelines to deploy a model once re-training has occurred.
2. **Continuous Testing:**
* Applying continuous testing as part of a model deployment pipeline, removing the need for manual testing.
* Increased tooling to support model validation testing.
3. **Monitoring:**
* Monitoring real time inference endpoints
* Monitoring training DAGs
* Monitoring batch jobs
4. **Collaboration and Communication:**
* Feature Store which provides feature discovery and re-use
* Model Database which provides model collaboration
5. **Version Control:**
* Applying version control to experiments, machine learning models and features
References
----------
[Bottcher (2018, March 05). What I Talk About When I Talk About Platforms. https://martinfowler.com/articles/talk-about-platforms.html](https://martinfowler.com/articles/talk-about-platforms.html)
[D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison (2021) Hidden Technical Debt in Machine Learning Systems](https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems)
[Fowler (2022, October 20). Conway's Law. https://martinfowler.com/bliki/ConwaysLaw.html](https://martinfowler.com/bliki/ConwaysLaw.html)
[Galante. What is platform engineering. https://platformengineering.org/blog/what-is-platform-engineering](https://platformengineering.org/blog/what-is-platform-engineering)
[Humanitec. State of Platform Engineering Report](https://www.scribd.com/document/611845499/Whitepaper-State-of-Platform-Engineering-Report)
[Hodgson (2023, July 19). How platform teams get stuff done. https://martinfowler.com/articles/platform-teams-stuff-done.html](https://martinfowler.com/articles/platform-teams-stuff-done.html)
[Murray (2017, April 27). The Art of Platform Thinking. https://www.thoughtworks.com/insights/blog/platforms/art-platform-thinking](https://www.thoughtworks.com/insights/blog/platforms/art-platform-thinking)
[Rouse (2017, March 20). Technical Debt. https://www.techopedia.com/definition/27913/technical-debt](https://www.techopedia.com/definition/27913/technical-debt)
[Rowse & Shepherd (2022). Building Infrastructure Platforms. https://martinfowler.com/articles/building-infrastructure-platform.html](https://martinfowler.com/articles/building-infrastructure-platform.html)
[Skelton & Pais (2019) Team Topologies](https://teamtopologies.com/book)

Binary file not shown. Size: 196 KiB

Binary file not shown. Size: 291 KiB

6
tag/mlops/index.md Normal file

@ -0,0 +1,6 @@
---
layout: tag_page
title: "Tag: mlops"
tag: mlops
robots: noindex
---