Data Engineering Zoomcamp by DataTalks.Club: Module 1
After hours of bumbling about, I finally understood the structure of the self-paced course: the follow-along videos are hosted on DTC’s YouTube channel, while the notes on GitHub serve as supplementary material rather than the primary source.
P.S: I’m on a Linux (Fedora 40) machine and working from a virtual environment . I also created a folder for the course where I save the files. All commands are run from that folder.
Docker Installation Instructions for Fedora are here .
[!NOTE] The files I reference/used for the course are hosted in this git repository .
Module 1: Containerization and Infrastructure as Code
- Introduction to GCP
- Docker and Docker Compose
- Running PostgreSQL with Docker
- Infrastructure setup with Terraform
- Homework
DE Zoomcamp 1.2.1 - Introduction to Docker
Description | Command |
---|---|
Run the Ubuntu container and initialize the Bash Shell. (-it means it opens a terminal with an interactive shell when running the Ubuntu container) |
docker run -it ubuntu bash |
Run the python v3.9 container | docker run -it python:3.9 |
Run the python container with a bash entrypoint so you can install pandas onto the container | docker run -it --entrypoint=bash python:3.9 |
Build the Dockerfile from the pwd. Said file uses the Python 3.9 image, installs Pandas using pip and sets Bash as the entry point of the container. | docker build -t test:pandas . (test is the image name) |
Run container built from the Dockerfile | docker run -it test:pandas |
After updating the Dockerfile’s entrypoint to [ "python" , "pipeline.py"] , you can pass the python file the date arg directly when running the container |
docker run -it test:pandas 2023-05-08 |
DE Zoomcamp 1.2.2 - Ingesting NY Taxi Data to Postgres
Description | Command |
---|---|
List the running docker containers | docker ps |
Run PostgreSQL container with the necessary environment variables (-e ), mounted volume (-v ) & port mapping (-p ) |
docker run -it -e POSTGRES_USER="root" -e POSTGRES_PASSWORD="root" -e POSTGRES_DB="ny_taxi" -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data -p 5432:5432 postgres:13 |
Adjust the permissions of the ny_taxi_postgres_data folder. In Linux, the folder doesn’t show up in VS Code and you won’t have access to it in your File Manager. |
sudo chmod a+rwx ny_taxi_postgres_data |
Install pgcli using pip. This didn’t work out of the box for me and I had to manually install the psycopg-binary library using pip afterwards |
pip install pgcli psycopg-binary |
Connect to the database. It won’t have any data on it yet | pgcli -h localhost -p 5432 -u root -d ny_taxi |
Optional: Install Jupyter to have access to Jupyter Notebooks; an interactive python editor | pip install jupyter |
Open Jupyter | jupyter notebook (create a new Python 3 file when it opens) |
Download NYC Taxi Data | wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz |
Decompress NYC Taxi Data using Gzip | gzip -d yellow_tripdata_2021-01.csv.gz |
View first 100 lines of the CSV file in the terminal | less yellow_tripdata_2021-01.csv or head -n 100 yellow_tripdata_2021-01.csv |
Split CSV file into first 100 records | head -n 100 yellow_tripdata_2021-01.csv > yellow_tripdata_head_100.csv |
Count number of records in the file | wc -l yellow_tripdata_2021-01.csv |
Install sqlalchemy & psycopg2-binary using pip. I had to install psycopg2-binary because of the failed automatic pip installation from earlier. |
pip install sqlalchemy psycopg2-binary |
List PostgreSQL tables | In pgcli: \dt |
View specific table information | In pgcli: \d yellow_taxi_data |
DE Zoomcamp 1.2.3 - Connecting pgAdmin and Postgres
Description | Command |
---|---|
Run pgAdmin as a Docker container | docker run -it -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" -e PGADMIN_DEFAULT_PASSWORD="root" -p 8080:80 dpage/pgadmin4 |
Create a network | docker network create pg-network |
Run Postgres container with just created pg-network . The container name i.e. pg-database will be used as the Host name/address in the pgAdmin connection settings. |
docker run -it -e POSTGRES_USER="root" -e POSTGRES_PASSWORD="root" -e POSTGRES_DB="ny_taxi" -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data -p 5432:5432 --network=pg-network --name pg-database postgres:13 |
Run pgAdmin as a container with just created pg-network |
docker run -it -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" -e PGADMIN_DEFAULT_PASSWORD="root" -p 8080:80 --network=pg-network --name pgadmin-2 dpage/pgadmin |
DE Zoomcamp 1.2.4 - Dockerizing the Ingestion Script
Description | Command |
---|---|
Convert Jupyter notebook to python script | jupyter nbconvert --to=script PostgreSQL_Data_Injestion.ipynb |
Run the python script after conversion and cleanup | python ingest_data.py --user=root --password=root --host=localhost --port=5432 --db=ny_taxi --table_name=yellow_taxi_trips --url=${URL} |
DE Zoomcamp 1.2.5 - Running Postgres and pgAdmin with Docker-Compose
Note: On Fedora, the command is docker compose
and was installed when I installed the Docker Engine following the instructions here
.
| A Youtube comment recommended using docker compose start/stop
as opposed to docker compose up/down
if you want your database settings persisted in PgAdmin. Else running docker compose up
will delete the local host server you created from lesson 1.2.3.
Description | Command |
---|---|
Start docker services defined in docker-compose.yaml |
docker compose start |
Stop docker services defined in docker-compose.yaml |
docker compose stop |
Create and start docker containers. Add the -d option to run in detached mode |
docker compose up |
Stop and remove docker containers & networks | docker compose down |
DE Zoomcamp 1.2.6 - SQL Refresher
Note: Avoid using a SELECT inside the IN(). The IN() construct is more useful when you pass a list of values as “IN(1, 2, 100, 250)”.
If there is any NULL value returned by the SELECT inside the IN(), the result will be empty, the IN() will “fail” and it will cause a lot of confusion and errors. It is a very difficult thing to debug.
Use EXISTS instead:
|
|
Terraform Install on Fedora
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/fedora/hashicorp.repo
sudo dnf -y install terraform