Apache Airflow - challenges (2024)

Ranganath S

Senior Engineering Manager | Data Engineering | Data science

Published Jul 15, 2023

Airflow has been the go-to tool for workflow orchestration and scheduling. There are various tools in the Big Data space, such as Prefect, Dagster, Airflow, Azkaban, and Apache DolphinScheduler. However, Airflow and Dagster have led in adoption.

Limitations of Airflow

  • Airflow is highly complex and non-intuitive: Airflow's "pipeline/workflow as code" model is quite powerful, but it violates a fundamental principle, "separation of concerns". It takes a decent Python programmer to create jobs as tasks in a DAG, sequence them into a workflow, add more workflows for housekeeping work such as compaction of production-ready data, handle failure scenarios, and then test and debug the DAG paths. This can get extremely complex to debug in production. My point is: why would I need my workflow to be code? It could be configuration. I see no reason for someone to code a workflow, and debugging code in production is counterintuitive.
  • Apache Airflow (the open-source version) doesn't support multi-tenancy out of the box: This is fundamental for any project going to production, and logical segregation of resources is a very practical requirement for adoption. It can be mitigated by using Kubernetes (k8s) as the deployment strategy, but there are rough edges.
  • Adoption and a steep learning curve: Airflow requires a data engineer to understand its scheduler, executor, and workers. The framework is extensive, which means spending a significant amount of time learning the framework and its API. Airflow also has a very large number of configuration parameters; this can be seen as flexibility, but working through them is time-consuming.
  • Airflow doesn't come with pipeline versioning: Airflow doesn't store changes to workflow metadata, so there is no way to track workflow history or roll back pipelines when required, which can be tricky in production.
  • Debugging with Airflow is a nightmare: It's easy to code workflows and keep things moving, but as the flows get complex, debugging becomes extremely difficult, with many interdependencies both upstream and downstream. Pipelines can easily turn into spaghetti, because an issue can arise anywhere. For example, operators and templates are the building blocks of a workflow that is defined declaratively in a DAG, yet the roles blur: operators can execute code as well as orchestrate workflows, which further complicates debugging. Executors can be either local or remote. Debugging a local executor is time-consuming but fairly straightforward; the same can't be said of remote executors. Root cause analysis (RCA) can be daunting in environments as complex as these.

  • Data lineage is still rudimentary: Tracking and monitoring are important in any pipeline. Airflow's documentation explicitly flags lineage support as experimental and subject to change, which means that as your pipeline evolves you need to recheck the DAG and perhaps even rework it. Airflow doesn't natively collect information about jobs, owners, or repos, so it takes a herculean effort to understand what is affected downstream by a change upstream.
  • Airflow may not trigger exactly on time: Airflow monitors all tasks and DAGs and triggers a task once its dependencies are complete. By design, an Airflow DAG run starts at the end of its schedule_interval, not at the beginning. Airflow also operates in UTC by default.

Suppose you write a DAG and schedule it to run every hour, starting from 10 AM today. Everything looks neat so far. You check back at 11 AM and find that the DAG has just triggered, yet the log shows a run stamped 10 AM. Don't be surprised: this is expected behavior, because Airflow runs a DAG at the end of its schedule_interval, which means the 10 AM run is triggered at 11 AM.
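The timing rule can be sketched with plain date arithmetic. This is only a model of the interval-based behavior described above, assuming an hourly schedule, and not Airflow's own code; the helper name trigger_time and the sample date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def trigger_time(interval_start: datetime, interval: timedelta) -> datetime:
    """Return when the run covering the interval starting at interval_start fires.

    Hypothetical helper: a run is stamped with the START of its interval
    but only fires once that interval has ended.
    """
    return interval_start + interval


# Airflow works in UTC by default, so the sketch uses UTC timestamps too.
start = datetime(2023, 7, 15, 10, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)

for n in range(3):
    stamped = start + n * hourly
    fires_at = trigger_time(stamped, hourly)
    print(f"run stamped {stamped:%H:%M} UTC fires at {fires_at:%H:%M} UTC")
```

The first line printed pairs the 10:00 stamp with an 11:00 trigger, which is the one-interval lag that surprises people reading the logs.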

  • Tasks become a bottleneck: There are multiple reasons why tasks can back up. One is too many concurrent DAGs, or too many concurrent tasks within a DAG. This can be tuned in airflow.cfg by raising parallelism, the cap on how many tasks run at once. If you plan to execute more concurrent DAG runs, increase max_active_runs_per_dag, and for more concurrent tasks per Celery worker, raise worker_concurrency in the [celery] section (environment variable AIRFLOW__CELERY__WORKER_CONCURRENCY) above its default limit.
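As a rough sketch of where those three settings live, the snippet below parses an airflow.cfg fragment and prints the matching environment-variable overrides, which follow Airflow's AIRFLOW__SECTION__KEY naming. The numeric values are placeholders chosen for illustration, not recommendations, and to_env_vars is a hypothetical helper, not part of Airflow.

```python
import configparser

# Illustrative values only; tune them to your own workload and hardware.
SAMPLE_CFG = """
[core]
parallelism = 64
max_active_runs_per_dag = 32

[celery]
worker_concurrency = 32
"""


def to_env_vars(cfg_text: str) -> dict[str, str]:
    """Map each [section] key to its AIRFLOW__SECTION__KEY override name."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": value
        for section in parser.sections()
        for key, value in parser[section].items()
    }


for name, value in to_env_vars(SAMPLE_CFG).items():
    print(f"{name}={value}")
```

Setting the environment variables is handy in containerized deployments, where editing airflow.cfg inside the image is less convenient.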

So Airflow is a beast of its own. It's flexible, but it's counterintuitive to adopt, and in my opinion, in 2023 it's not worth coding your DAGs or workflows when you have better alternatives such as Apache DolphinScheduler, which offers DAG creation by simple drag and drop, scales easily without much tuning, and provides multi-tenancy out of the box.
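To make the "workflow as configuration" idea from the first limitation concrete, here is a minimal sketch using only the Python standard library. It is not how Airflow or DolphinScheduler represent workflows; the task names and the execution_order helper are invented for illustration. The workflow is plain data, and a generic routine derives a valid run order from it.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow definition: each task lists the tasks it depends on.
# This could just as well be loaded from a YAML or JSON file.
WORKFLOW = {
    "extract": [],
    "transform": ["extract"],
    "compact": ["transform"],
    "load": ["transform"],
    "report": ["compact", "load"],
}


def execution_order(workflow: dict[str, list[str]]) -> list[str]:
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(" -> ".join(order))
```

Because the definition is data, it can be diffed, versioned, and validated without executing any pipeline code, which is the separation of concerns argued for earlier.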
