Compare commits

...

7 Commits

Author SHA1 Message Date
Dmytro Suvorov bd8d6d3b07
Merge 9d028be1d3 into 5c5ed195f5 2024-02-10 12:57:14 +00:00
R Tyler Croy 5c5ed195f5
Merge pull request #135 from scribd/dependabot/bundler/nokogiri-1.16.2
Bump nokogiri from 1.14.3 to 1.16.2
2024-02-07 09:46:35 -08:00
dependabot[bot] 3fb64427ff
Bump nokogiri from 1.14.3 to 1.16.2
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.14.3 to 1.16.2.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.14.3...v1.16.2)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-02-06 03:31:41 +00:00
Dmitry Suvorov 9d028be1d3 rephrased some sentences and fixed image name 2023-06-15 15:17:11 +03:00
Dmytro Suvorov 6b4985e147
Update _posts/2023-06-07-airflow-upgrade-to-2-5-3.md
Co-authored-by: Maksym Dovhal <maksymd@scribd.com>
2023-06-15 11:49:43 +03:00
Dmytro Suvorov 0f58fc52bc
Update _posts/2023-06-07-airflow-upgrade-to-2-5-3.md
Co-authored-by: Maksym Dovhal <maksymd@scribd.com>
2023-06-15 11:49:35 +03:00
Dmitry Suvorov 970a5ecab2 [PE-3860]: A blog about our Airflow upgrade journey
Here I've collected the main actions we took, the issues we faced, and the solutions we applied during the Airflow upgrade from version `2.2.0` to `2.5.3`
2023-06-14 19:31:41 +03:00
4 changed files with 136 additions and 2 deletions

View File

@@ -231,7 +231,7 @@ GEM
jekyll-seo-tag (~> 2.1)
minitest (5.17.0)
multipart-post (2.1.1)
-nokogiri (1.14.3-x86_64-linux)
+nokogiri (1.16.2-x86_64-linux)
racc (~> 1.4)
octokit (4.22.0)
faraday (>= 0.9)
@@ -239,7 +239,7 @@ GEM
pathutil (0.16.2)
forwardable-extended (~> 2.6)
public_suffix (4.0.7)
-racc (1.6.2)
+racc (1.7.3)
rb-fsevent (0.11.1)
rb-inotify (0.10.1)
ffi (~> 1.0)

View File

@@ -0,0 +1,134 @@
---
layout: post
title: "Airflow upgrade from 2.2.0 to 2.5.3 journey"
author: dimonchik-suvorov
tags:
- featured
- airflow-series
- aws
team: Data Platform
---
Let me continue QP's `airflow-series` column and tell you the story of one Airflow upgrade. In this post I'm going to describe what infrastructure we had before, how we improved it, and what difficulties we faced along the way, so let's go!
### In the beginning there were...
The main parts of our Airflow infrastructure:
- Old but gold Airflow `2.2.0`
- Our private fork with custom changes that we hadn't upstreamed yet
- An Airflow image based on the `python3.7` Docker image, stored in AWS ECR (Elastic Container Registry)
- We install Airflow in `--editable` mode on each build to apply our custom changes
- Scheduler and Web Server running on AWS ECS Fargate (Elastic Container Service)
- Terraform for configuring ECS tasks and Airflow params
- Kubernetes Executor on AWS EKS `1.24` (Elastic Kubernetes Service)
- Airflow backend: MySQL `5.7` on AWS RDS (Relational Database Service)
![](/post-images/2023-06-07-airflow-upgrade-to-2-5-3/airflow-highlevel-architecture.png)
<font size="3"><center><i>High-level architecture of our Airflow infrastructure</i></center></font>
Also, we have a couple of Airflow plugins of our own, such as custom operators, hooks, and a _Self-Service Backfill UI_ (you can learn more about it in our [Airflow Summit 2021 slides](https://airflowsummit.org/slides/2021/e5_6-ModernizeAirflow-Scribd.pdf)).
A Jenkins pipeline builds the Airflow Docker image and applies Terraform to restart the ECS tasks with the new configuration and Airflow image.
### Why?
We decided to upgrade for the newly released features and bug fixes, obviously, but the thing we wanted most was multiple Schedulers, because our Fargate instance had come close to its maximum available resources and we suffered from constantly high CPU utilization.
This also entailed a MySQL upgrade, because multiple Schedulers don't work with MySQL 5.7, which supports neither the `SKIP LOCKED` nor the `NOWAIT` SQL clause.
On top of all that, we decided it would be a good idea (which, as you may guess, it wasn't) to also bump the Python version from `3.7` to `3.10`.
### As promised - the journey!
Before anything else we went through all the [Release Notes](https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html) from 2.2.0 to 2.5.3, simply to understand what to expect, and created a list of possible issues we might face. We thought we were prepared and fearlessly made the first move.
#### Database upgrade
Ahead of the Airflow upgrade we decided to bump our MySQL version from `5.7` to `8` in advance. This didn't cause any trouble, because Airflow `2.2.0` works fine with MySQL `8`. Another reason to bump the database version in advance was to reduce the complexity of the Airflow upgrade itself and the time it would take.
#### Testing - yes / no / maybe so?
From time to time it's good to test everything before upgrading it on production. To test all the integrations, we created a separate Terraform module that takes our development environment - ECS tasks for the Scheduler and Webserver, a database restored from a snapshot - and creates all the infrastructure Airflow needs right beside Dev, without any intersection with it, so we could test everything without fear of breaking things. The main beauty of this setup is quick recreation: ~5 minutes to redeploy it from scratch in case something went irreparably wrong. Separately, we also tested some parts locally using Breeze and our own custom Airflow image build.
After testing the integrations, our next step was performance testing. We didn't enable the multi-scheduler feature yet; we wanted to see how the new version works with just a single Scheduler and compare results. Spoiler: it performs almost the same so far. It uses cores and memory a bit more intensively now (because of a few additional configurations we made), but the issue we had before is gone: the old Scheduler would hit 100% CPU from time to time, and only a restart helped (the big, long spike on the screenshot).
For the performance testing we created an autogenerated DAG with more than 1,000 dummy PythonOperators (roughly the number of tasks in our main DAG). Each operator sleeps for a random interval to exercise concurrency and the like; a minimal sketch follows.
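Here's a minimal sketch of what such a load-testing DAG can look like (the DAG id, task naming, and sleep range are illustrative, not our exact production code):
```python
import random
import time

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator


def _random_sleep():
    # A random duration simulates uneven task runtimes and exercises concurrency.
    time.sleep(random.uniform(1, 30))


with DAG(
    dag_id="perf_test_dag",  # hypothetical name
    start_date=pendulum.datetime(2023, 6, 1, tz="UTC"),
    schedule=None,  # triggered manually for load tests
) as dag:
    for i in range(1000):
        PythonOperator(
            task_id=f"dummy_task_{i}",
            python_callable=_random_sleep,
        )
```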
![](/post-images/2023-06-07-airflow-upgrade-to-2-5-3/ecs-performance-metrics.png)
<font size="3"><center><i>AWS ECS performance metrics</i></center></font>
#### Updating the code base
We started by merging `2.5-stable` into our fork's main branch. After all merge conflicts were resolved (the easiest part of the upgrade, I'd say) we started testing...
This wasn't exactly what you'd call a smooth upgrade - switching to the new Airflow version caused a lot of errors in our plugins and in the services that rely heavily on Airflow core/provider code.
That said, we had expected most of the issues, because we really had gone through the Airflow Release Notes.
Issues we knew about and/or caught during testing (a before/after sketch of several of the renames follows the list):
- our custom DAG-dump code broke: the internal structure of the DAG class changed slightly (e.g. `_BaseOperator__init_kwargs`, `TaskGroup`, `ParamsDict`), which caused errors
- our custom DAG validations broke (`DummyOperator` is deprecated in favor of `EmptyOperator`)
- in our custom operators' `execute` functions we were using the TaskInstance `key`, which changed to the DagRun's `run_id`
- some `timetable` functions also changed, and we had to find new ways to do what we did before the merge (e.g. `dag.timetable.infer_data_interval(your_execution_date).end` became `dag.timetable.infer_manual_data_interval(run_after=exec_dt).start`)
- the `node:12.22.6` Docker image we used to build the npm assets (`airflow/www/static/dist`) was too old for the new and shiny Airflow (we switched to `node:16.0.0`)
- `TriggerRuleDep` changed, and our custom rules that used/overrode `_get_dep_statuses` and `_evaluate_trigger_rule` started failing too
- our [Okta integration](https://tech.scribd.com/blog/2021/integrating-airflow-and-okta.html) started failing because of the new Flask AppBuilder (simply adding `server_metadata_url` to the existing configuration solved the issue)
- some Airflow configuration params changed:
  - `AIRFLOW__CORE__SQL_ALCHEMY_CONN` -> `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`
  - the `kubernetes` config section was renamed to `kubernetes_executor`, and all related environment variables starting with `AIRFLOW__KUBERNETES__...` changed accordingly to `AIRFLOW__KUBERNETES_EXECUTOR__...`
- `AwsLambdaHook` became `LambdaHook` in the AWS provider, and the `function_name` parameter moved from the hook's `__init__()` to the `invoke_lambda` function (closer to the actual invocation)
- BaseOperator's `task_concurrency` parameter was renamed to `max_active_tis_per_dag`
- the `ResultProxy` and `RowProxy` classes from `sqlalchemy.engine.result` were renamed to `Result` and `Row`
- the DAG `schedule_interval` param was renamed to `schedule`
- our unit tests also broke, because we mocked TaskInstance and other classes whose internal structure had changed
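To make several of the renames above more concrete, here is a small before/after sketch (the DAG and task names are made up for illustration):
```python
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # 2.2.0: airflow.operators.dummy.DummyOperator

with DAG(
    dag_id="example_dag",  # hypothetical
    start_date=pendulum.datetime(2023, 6, 1, tz="UTC"),
    schedule="@daily",  # 2.2.0: schedule_interval="@daily"
) as dag:
    EmptyOperator(  # 2.2.0: DummyOperator
        task_id="placeholder",
        max_active_tis_per_dag=1,  # 2.2.0: task_concurrency=1
    )

# Inferring a data interval for a manually triggered run:
#   2.2.0: dag.timetable.infer_data_interval(your_execution_date).end
#   2.5.3: dag.timetable.infer_manual_data_interval(run_after=exec_dt).start

# Renamed configuration environment variables:
#   AIRFLOW__CORE__SQL_ALCHEMY_CONN -> AIRFLOW__DATABASE__SQL_ALCHEMY_CONN
#   AIRFLOW__KUBERNETES__...        -> AIRFLOW__KUBERNETES_EXECUTOR__...
```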
Not that hard, right? By the way, during the concurrency and performance testing we learned that our Kubernetes executor configuration wasn't optimal: the default `AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE=1` didn't match our needs, so we bumped this param to `50`, which made starting new tasks significantly faster.
#### Apply DB migrations gently
In parallel with testing the code, we tested Airflow's connectivity to the database (the `airflow db check` command) and how the Airflow database migrations apply to our database (`airflow db upgrade`). Instructions for upgrading the Airflow database can be found [here](https://airflow.apache.org/docs/apache-airflow/2.5.3/installation/upgrading.html). What we learned:
- some providers changed their packaging; most significantly, the Airflow folks introduced the `common.sql` provider and moved some functionality into it from the `databricks` provider (which we also use a lot). Because of [Broken installation with --editable flag](https://github.com/apache/airflow/issues/30764) we faced an issue with Airflow's communication with the database. I've described a few workarounds in the ticket, but the root cause was that we had upgraded Python (and, as a consequence, the pip version).
- the new Docker image with Python `3.10` came with the new Debian `bullseye`, which brought a change important for us: the `mysql-client` package is no longer shipped and became `default-mysql-client`
- `airflow db upgrade` generated `dangling` tables and put into them all the DAGs, TaskInstances, and other entities that couldn't be migrated - for example, a DAG without an ID or an `execution_date`. Don't ask me how that happened, but we reviewed these tables and successfully dropped them.
- applying the migrations took ~2 hours on our weak dev RDS cluster, and we assumed it would be faster on production... boy oh boy, were we wrong...
- our plain Airflow `User` role lost access to the custom _Self-Service Backfill UI_. We solved this by setting `AIRFLOW__WEBSERVER__UPDATE_FAB_PERMS` to `False`, because otherwise custom permissions are dropped each time the Airflow Web Server restarts
#### Ship it!
Finally, everything was tested and fixed. The time had come to upgrade Airflow on production! All DAGs paused, ECS tasks stopped, production database snapshot created.
###### Database migration
We started applying the migrations to the production database. To do this we prepared another ECS task that runs the `airflow db upgrade` command. We deployed it via Terraform and started to wait... and wait... and wait a little bit more...
After ~5 hours we started wondering why it was taking so long... We realized that:
- our production Airflow RDS cluster is only twice as big as our development one
- we have a giant database with data going back to 2017 (mostly backfills). Considering we had a `sentinel` DAG that checks every 5 minutes that the Scheduler is alive (by sending a gauge metric to Datadog - roughly like the sketch below), and our main DAG with ~1k tasks running daily, you can imagine how big the `task_instance` and `dag_run` tables were...
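For reference, such a sentinel DAG can be as simple as the following sketch (hypothetical names; it assumes the `datadog` Python package and a Datadog agent reachable from the worker):
```python
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from datadog import statsd  # assumes the `datadog` package and a local agent


def _report_alive():
    # If the Scheduler is alive, this task keeps getting scheduled and the
    # gauge keeps arriving; a gap in the metric trips a Datadog monitor.
    statsd.gauge("airflow.scheduler.sentinel", 1)


with DAG(
    dag_id="sentinel",  # hypothetical name
    start_date=pendulum.datetime(2023, 6, 1, tz="UTC"),
    schedule="*/5 * * * *",  # every 5 minutes
    catchup=False,
) as dag:
    PythonOperator(task_id="report_alive", python_callable=_report_alive)
```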
What we decided to do:
- stop the migration
- give our production RDS cluster more resources: 32 GB RAM and 16 CPUs, compared to the 8 GB RAM and 4 CPUs we had before
- delete all data older than 2021
- rerun the migration ECS task
That helped: the migration completed in ~30 minutes!
###### Airflow - your turn!
Deploying the ECS tasks with the updated Airflow image via Terraform didn't cause any issues.
After that we merged all the PRs with updated DAGs and related services, and checked that there were no DAG parsing errors on production.
We unpaused all the DAGs and started to observe... and things started to fail...
- [Breaking changes](https://github.com/apache/airflow/pull/26452/files#diff-55326294fdc9ce88aee820373ed658972dff4067517a9c4f59819efbbf3e3b85R27) in the Slack webhook hook. Basically, the hook was refactored to use the Slack SDK instead of plain HTTP, which caused `SlackRequestError: Invalid URL detected: ***` for all our Slack hook usages. To fix it we had to go through all the Slack Airflow Connections, copy the `webhook_token` extra param, change the connection type to Slack Incoming Webhook, and paste the copied value into the `Webhook Token` field
- a [bug](https://github.com/apache/airflow/issues/31898) in the Databricks SQL operator: it creates a new connection for each query. As I understand it, a new Airflow connection object means a new Spark session in the Databricks warehouse. This broke our templated queries, because in most of them the first statement is `use some_database` and the next one is the actual processing query. Since each query in a template file gets its own connection, the `use` statement is simply forgotten, and the second query fails because it can't find the tables without a database (see the sketch below)
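To illustrate, here is roughly what the broken pattern and a possible workaround look like (a sketch with made-up names; it assumes the Databricks connection carries the warehouse HTTP path, and fully qualifying table names is our workaround, not official guidance):
```python
import pendulum
from airflow import DAG
from airflow.providers.databricks.operators.databricks_sql import DatabricksSqlOperator

with DAG(
    dag_id="databricks_example",  # hypothetical
    start_date=pendulum.datetime(2023, 6, 1, tz="UTC"),
    schedule=None,
) as dag:
    # Broken with the bug: each statement runs on its own connection,
    # so the USE no longer applies to the SELECT that follows it.
    broken = DatabricksSqlOperator(
        task_id="templated_query",
        databricks_conn_id="databricks_default",
        sql=[
            "USE some_database",
            "SELECT count(*) FROM some_table",  # fails: no database selected
        ],
    )
    # Workaround: one self-contained, fully qualified statement.
    workaround = DatabricksSqlOperator(
        task_id="qualified_query",
        databricks_conn_id="databricks_default",
        sql="SELECT count(*) FROM some_database.some_table",
    )
```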
After these issues were fixed we could finally go to sleep, because this is the end of the story - we made it!
### Conclusions
Even if you think you have tested everything, you could be wrong. We went through all the Airflow release notes and headed off every possible issue in core Airflow, but didn't look at the breaking changes in the providers we use. Don't forget to check what changed there as well before you upgrade.
Another decision we made was to use a constraints file during Airflow installation, because without it some Python modules could resolve to newer versions during an Airflow Docker image rebuild and cause dependency conflicts.
Clean up your database as much as possible in advance to reduce the number of rows that have to be processed during the database migration.
Nevertheless, the upgrade is finished, and we now have Airflow 2.5.3 up and running.
### Credits
- [Maksym Dovhal](https://github.com/Maks-D) for creating the separate Terraform module for testing, actually testing things, fixing bugs along the way, and being our deployment commander
- [Kuntal Basu](https://github.com/kuntalkumarbasu) - our infrastructure magician and guru - for his help detecting and fixing issues in real time during the deployment
- [Artur Kiiko](https://github.com/arturkii) and [Lakshmi Pernapati](https://github.com/lpernapati) for participating in local/integration testing and fixing issues

Binary file not shown (new image, 196 KiB).

Binary file not shown (new image, 291 KiB).