Data Engineering Zoomcamp by DataTalks.Club: Module 2
This module covers workflow orchestration using Kestra.
Kestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML.
[!NOTE] The files I referenced and used for the course are hosted in this Git repository.
[!NOTE] You can find all videos for this module in this YouTube Playlist.
Build Data Pipelines with Kestra
This module involved building ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC) for the years 2019 & 2020 using Kestra both locally and in GCP.
Learning Points
- Extracting data from CSV files.
- Loading it into Postgres locally and into Google Cloud (GCS + BigQuery).
- Exploring scheduling and backfilling Kestra workflows both locally and on Google Cloud (GCS + BigQuery).
Notes
Following along with this module was mostly straightforward, with a few gotchas. Some of these were:
- Running Kestra, Postgres and PgAdmin using Docker and accessing the Kestra and PgAdmin dashboards locally didn’t work with the `docker-compose.yml` provided for the module.
  - This was partly because Kestra and PgAdmin were both set to use port 8080, so one of them needed to be updated. Thankfully, the module’s FAQs came in handy and provided a sample Docker Compose file with guidance on resolving this and the Connection Refused errors for Linux users.
- Linux users will encounter Connection Refused errors when connecting to the Postgres DB from within Kestra. This is because `host.docker.internal` works differently on Linux.
  - Updating the pluginDefaults connection info in the flows that reference it (except for the `03_postgres_dbt.yaml` flow) to the name of the Postgres service defined in the `docker-compose.yml` file resolved this.
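The two Compose-related fixes above can be sketched as a fragment like the one below. It is a minimal sketch, not the course's or the FAQ's actual file: the service names (`kestra`, `postgres_zoomcamp`, `pgadmin`), image tags, credentials and the host port `8085` are all assumptions chosen for illustration, and Kestra's own command, volumes and configuration are left out for brevity.

```yaml
# Hypothetical docker-compose.yml fragment; names, images and ports are illustrative.
services:
  kestra:
    image: kestra/kestra:latest
    ports:
      - "8080:8080"   # Kestra UI stays on host port 8080
    # Kestra's command, volumes and configuration are omitted here

  postgres_zoomcamp:  # the service name doubles as the hostname on the Compose network
    image: postgres:16
    environment:
      POSTGRES_USER: example_user          # placeholder
      POSTGRES_PASSWORD: example_password  # placeholder
      POSTGRES_DB: example_db              # placeholder
    ports:
      - "5432:5432"

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com    # placeholder
      PGADMIN_DEFAULT_PASSWORD: example_password  # placeholder
    ports:
      - "8085:80"     # host port moved off 8080 so it no longer clashes with Kestra
```

With a layout like this, Kestra would be reachable at `localhost:8080` and PgAdmin at `localhost:8085`, while containers on the same Compose network could reach Postgres at `postgres_zoomcamp:5432`.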
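As a rough illustration of the pluginDefaults change described above, the connection URL might move from `host.docker.internal` to the Postgres service name. This is a sketch under assumptions: the `postgres_zoomcamp` service name, the database name and the credentials are placeholders and should be matched to your own flow and `docker-compose.yml`.

```yaml
# Sketch of a Kestra flow's pluginDefaults block; values are placeholders.
pluginDefaults:
  - type: io.kestra.plugin.jdbc.postgresql
    values:
      # Previous (Connection Refused on Linux):
      # url: jdbc:postgresql://host.docker.internal:5432/example_db
      # Updated: point at the Postgres service name on the Compose network
      url: jdbc:postgresql://postgres_zoomcamp:5432/example_db
      username: example_user      # placeholder
      password: example_password  # placeholder
```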
- The fix for the `03_postgres_dbt.yaml` connection error was slightly different and required setting the `host` variable defined in the file to the IP address of my local computer. I got this by running the `hostname -I` command in a terminal and selecting the first IP.
If you prefer a GUI, you can also find it through the overview tab of the Systems Monitor (on Fedora it’s in the Network & Systems Chart) or under Network Settings on Ubuntu.
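The change for the dbt flow might look roughly like the fragment below. It assumes the flow embeds a dbt Postgres profile with a `host` key; the address `192.168.1.50` is a made-up stand-in for the first IP printed by `hostname -I`, and the credentials, database name and thread count are placeholders.

```yaml
# Sketch of a dbt Postgres profile as it might appear inside the flow; values are placeholders.
default:
  target: dev
  outputs:
    dev:
      type: postgres
      # Previous (assumed): host: host.docker.internal
      # Updated: first IP address reported by `hostname -I`
      host: 192.168.1.50
      port: 5432
      user: example_user          # placeholder
      password: example_password  # placeholder
      dbname: example_db          # placeholder
      schema: public
      threads: 4
```

Because a machine's IP address can change when it joins a different network, a value set this way may need updating again later.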
P.S: There’s an updated fix for this in the dbt flow.