Data Engineering

Data Engineering Process

Data pipelines are the backbone of any business intelligence or data analytics project. In this part of the process, the goal is to combine data from different sources and ensure it stays up to date for downstream tools. This is the part I enjoy the most: finding clean solutions to messy problems and designing ETL processes that are simple, maintainable, and efficient.

  • Efficient Pipelines: Building ETL automations, generally in Python and Bash, that are easy to read, understand, and modify. Avoiding over-engineering, since more often than not simple tools get the job done best.
  • Using the Right Tools: Sticking to Python and SQL whenever possible has proven to be the most effective approach. It improves readability, simplifies debugging, and makes onboarding new people much easier.
  • ELT vs ETL: I strongly believe that ELT is usually the better approach, especially for small and medium-sized enterprises (SMEs) where data volumes aren't massive. Loading raw data into a centralized data lake first, then transforming it with tools like dbt, makes development faster and keeps pipelines easier to maintain (the first sketch after this list shows a minimal load step in this style).
  • Testing and Monitoring: Centralizing logging, testing, and monitoring keeps pipelines reliable and easy to manage. Structured logging and automated checks make it easier to catch issues early and track performance over time (the second sketch after this list shows a simple example).
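
To illustrate the ELT pattern above, here is a minimal sketch of the load step: raw records are pulled from a source and appended to a DuckDB landing table without any reshaping, leaving transformation to a later step (e.g. dbt). The endpoint URL, table name, and database path are placeholders for illustration, not taken from a real project.

```python
import duckdb
import pandas as pd
import requests


def extract_orders(api_url: str) -> pd.DataFrame:
    """Pull raw records from the source system as-is, with no reshaping."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def load_raw(df: pd.DataFrame, db_path: str = "warehouse.duckdb") -> None:
    """Append raw rows to a landing table; transformation happens downstream (ELT)."""
    con = duckdb.connect(db_path)
    con.register("df", df)
    # Create the landing table on the first run, then append on every run after.
    con.execute("CREATE TABLE IF NOT EXISTS raw_orders AS SELECT * FROM df LIMIT 0")
    con.execute("INSERT INTO raw_orders SELECT * FROM df")
    con.close()


if __name__ == "__main__":
    # Hypothetical source endpoint; swap in the real system's API or export.
    orders = extract_orders("https://example.com/api/orders")
    load_raw(orders)
```

Keeping the load step this thin is what makes the pipeline easy to re-run and debug: if something looks wrong downstream, the raw data is still there untouched.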
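
And a sketch of the structured logging and automated checks mentioned above, reusing the hypothetical raw_orders table from the previous example: a row-count check that logs the result and fails the run loudly when the landing table comes back empty.

```python
import logging

import duckdb

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("pipelines.orders")


def check_row_count(db_path: str = "warehouse.duckdb", min_rows: int = 1) -> None:
    """Fail the run if the landing table has fewer rows than expected."""
    con = duckdb.connect(db_path)
    rows = con.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    con.close()
    if rows < min_rows:
        logger.error("raw_orders has %d rows, expected at least %d", rows, min_rows)
        raise ValueError("row-count check failed for raw_orders")
    logger.info("raw_orders check passed with %d rows", rows)
```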

Tech Stack

My approach to data engineering is focused on technologies that are flexible and widely used, avoiding vendor lock-in whenever possible. I use tools that are simple but powerful, ensuring that pipelines are easy to understand, debug, and maintain.

  • Data Orchestration: Primarily use Dagster to manage and schedule workflows. Like Apache Airflow, which I've used in the past, it defines everything in Python code (see the Dagster sketch after this list).
  • ELT & Data Processing: Heavy use of Python (pandas, DuckDB) for transforming, cleaning, and integrating data across multiple sources. Extensive experience with dbt (data build tool) for automating transformations within the data warehouse.
  • Database & Storage: Extensive work with PostgreSQL and MySQL for structured data storage. Comfortable with cloud-based storage solutions, including S3 for object storage, DynamoDB for NoSQL, and EFS for shared filesystem storage.
  • Containerization & Deployment: Years of experience deploying data engineering solutions in Docker containers on self-hosted VPSs. Familiar with deploying to cloud platforms and managed services.
  • Monitoring & Logging: Experienced with both self-hosted monitoring stacks (Grafana, Prometheus, Loki) and cloud-based solutions (AWS CloudWatch). Comfortable setting up alerting and health checks to proactively monitor pipeline performance and system stability (see the heartbeat sketch at the end of this list).
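
As an example of the orchestration layer, here is a minimal Dagster sketch with two assets and a daily schedule. The asset names, sample data, and cron expression are illustrative only, not taken from a real deployment.

```python
import pandas as pd
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


@asset
def raw_orders() -> pd.DataFrame:
    """Land raw data; in a real pipeline this would wrap the extract/load step."""
    return pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.5]})


@asset
def cleaned_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    """Light cleanup step; heavier modelling is left to dbt in the warehouse."""
    return raw_orders.dropna().drop_duplicates(subset="order_id")


# One job that materializes all assets, refreshed every morning at 06:00.
daily_job = define_asset_job("daily_refresh", selection="*")

defs = Definitions(
    assets=[raw_orders, cleaned_orders],
    jobs=[daily_job],
    schedules=[ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")],
)
```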
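
On the alerting side, one pattern that works well with a Prometheus/Grafana stack is to push a "last successful run" heartbeat at the end of each pipeline, so an alert fires when the metric goes stale. The sketch below uses the prometheus_client library with a hypothetical Pushgateway address and job name.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def report_success(job_name: str, gateway: str = "localhost:9091") -> None:
    """Push a heartbeat metric that an alert rule can watch for staleness."""
    registry = CollectorRegistry()
    last_success = Gauge(
        "pipeline_last_success_unixtime",
        "Unix timestamp of the last successful pipeline run",
        registry=registry,
    )
    last_success.set_to_current_time()
    push_to_gateway(gateway, job=job_name, registry=registry)


# Example: called as the final step of a successful pipeline run.
# report_success("daily_orders_refresh")
```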