Scaling Airflow is appealing when we want to grow and run as many tasks in parallel as possible. Here I present an Airflow setup for scaling out, along with some Spark ETL pipelines on Airflow. The source code for this article can be found here.

There are several ways to scale Airflow workers; one of the best is to use Celery queues. Another way is the Kubernetes Executor, although Argo is a better solution if Kubernetes is going to be used for workflow management.
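As a rough illustration of the Celery approach, the sketch below routes a heavy task to a dedicated worker queue. This is a minimal, hypothetical DAG: the queue name, worker command, and Spark job path are my own placeholders, assuming CeleryExecutor is already configured in airflow.cfg.

```python
# Hypothetical DAG routing a task to a dedicated Celery queue.
# Assumes CeleryExecutor is configured and a worker was started with:
#   airflow celery worker --queues spark_queue
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="celery_queue_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The "queue" argument sends this task only to workers listening on
    # spark_queue, so heavy Spark jobs do not starve the default queue.
    submit_spark_job = BashOperator(
        task_id="submit_spark_job",
        bash_command="spark-submit --master local[2] /opt/jobs/etl_job.py",
        queue="spark_queue",
    )
```

Adding more workers that listen on the same queue is then enough to run more of these tasks in parallel.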

Here I am going to explain how to scale Spark-on-Kubernetes tasks on Airflow:
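A minimal sketch of such a DAG is shown below, assuming the Spark operator is deployed in the Kubernetes cluster and the cncf.kubernetes Airflow provider is installed; the namespace, manifest file, and connection id are placeholders, not part of the original setup.

```python
# Sketch: submit a SparkApplication to Kubernetes from Airflow and wait for it.
# "spark_job.yaml" and the connection id "kubernetes_default" are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submits the SparkApplication manifest; Kubernetes schedules the driver
    # and executors, so scaling out means running more of these in parallel.
    submit = SparkKubernetesOperator(
        task_id="spark_etl_submit",
        namespace="spark",
        application_file="spark_job.yaml",
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,
    )

    # Waits for the SparkApplication created above to reach a terminal state.
    monitor = SparkKubernetesSensor(
        task_id="spark_etl_monitor",
        namespace="spark",
        application_name="{{ task_instance.xcom_pull(task_ids='spark_etl_submit')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
    )

    submit >> monitor
```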

--

Here I am going to talk about doing CRUD using Delta Lake and Spark. I had heard a lot of good things about Delta Lake, so I tried it out and I am sharing my experience. Delta Lake also brings ACID transactions to Spark.

Here are the technologies that are going to be used:

  • Delta Lake: Delta tables as the file format
  • Spark: Processing and doing ETL
  • Hive metastore: Creating tables and querying them
  • Presto: Running distributed queries on Delta tables
  • Airflow: Workflow management
  • MinIO (S3): Storage and Delta Lake file system
  • Superset: Creating dashboards using Presto and Hive on Delta files
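To make the CRUD part concrete, here is a minimal PySpark sketch of create, update, delete, and merge on a Delta table. The path and column names are made up for illustration, and with MinIO the path would be an s3a:// bucket configured through the usual fs.s3a.* Hadoop settings.

```python
# A minimal Delta Lake CRUD sketch with PySpark, assuming the delta-spark
# package is installed; the path and columns below are placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-crud-sketch")
    # These two settings enable Delta Lake's SQL extensions and catalog.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Local path for the sketch; with MinIO this would be an s3a:// bucket path.
path = "/tmp/delta/users"

# Create: write a DataFrame as a Delta table.
spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["id", "name"]
).write.format("delta").mode("overwrite").save(path)

users = DeltaTable.forPath(spark, path)

# Update: change rows in place; Delta records this as a new transactional commit.
users.update(condition="id = 1", set={"name": "'alice_updated'"})

# Delete: remove rows matching a predicate.
users.delete("id = 2")

# Upsert (merge): insert new rows and update existing ones atomically.
updates = spark.createDataFrame([(2, "bob_v2"), (3, "carol")], ["id", "name"])
(
    users.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Read: the table reflects all committed changes.
spark.read.format("delta").load(path).show()
```

Because every one of these operations is an atomic commit to the Delta transaction log, Hive, Presto, and Superset only ever see consistent snapshots of the table.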

More details and a walking tour can be found here:
