Bytes, Pipelines & Queries #5 – Transforming Data with dbt & BigQuery

February 26, 2025 • 3 min read


Welcome back to Bytes, Pipelines & Queries, my ongoing dive into Data Engineering. If you're new here, this is a series of blog posts documenting my experiences, wins, and the occasional head-scratchers as I navigate this field through the Data Engineering Zoomcamp.

Before jumping into dbt and Analytics Engineering, which make up module 4 of the zoomcamp, I took a quick detour and joined a workshop on DLT (Data Load Tool). There, I learned a great deal about data ingestion from APIs into data warehouses and data lakes. Let's talk a bit about that.

DLT Workshop: A Smooth Ride into Data Ingestion

Having worked a lot with API response data in the past, I found DLT refreshingly simple and powerful. Instead of getting tangled in scripts and scheduling headaches, DLT streamlines the process of turning JSON data into structured tables.

What stood out?

🔹 Schema evolution – DLT doesn’t panic when new columns appear; it just adapts. If you've ever had a pipeline break because a new field showed up unexpectedly, you know how valuable this is.

🔹 Declarative pipelines – Instead of manually orchestrating ETL steps, you declare how data should flow, and DLT takes care of the rest.

🔹 Tight integration with BigQuery and other data warehouses – No unnecessary complexity. It just works.

It’s one of those tools that makes you wonder why you were overcomplicating things in the first place. You can find more information about DLT here. It handles ingestion really well, but once the data is inside BigQuery, you need something to transform it. Enter dbt!

Analytics Engineering: Shaping Raw Data into Something Useful

Analytics Engineering is about turning raw data into structured, query-friendly models. The approach shifts from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). Instead of transforming data before loading it, we first dump raw data into BigQuery and then use dbt to handle transformations inside the database.

This makes scaling data transformations way easier, and dbt is the tool that made it all click.

Why dbt?

dbt is like version control, testing, and transformation wrapped into one smooth workflow:

  • Write SQL transformations like code – Modular, reusable, and clean.
  • Version control your SQL – Treat data models like software. Seamless integration with Git.
  • Ensure data quality – Run automated tests to catch issues before they break dashboards.
  • Generate clear documentation – No more “mystery tables” floating around in your warehouse.

Getting Hands-On with dbt & BigQuery

Setting up dbt with BigQuery was straightforward:
1️⃣ Create a free dbt Cloud account.
2️⃣ Connect it to BigQuery.
3️⃣ Initialize a Git repository for your project.

Once that was done, I started defining data models.

Building Data Models with dbt

1️⃣ Staging and Core Models

Typically in dbt, transformations are structured in layers:

  • Staging Models (stg_): Clean raw data, standardize column names, and remove duplicates.
  • Core Models (models/core/): Business-logic transformations that define fact and dimension tables (typically prefixed fact_ or dim_).

A typical pipeline starts with staging models, which prepare the raw data for further transformation.

-- models/staging/stg_yellow_taxi.sql
SELECT 
    trip_id,
    passenger_count,
    fare_amount,
    trip_distance,
    pickup_datetime,
    dropoff_datetime,
    vendor_id
FROM raw_dataset.yellow_taxi_data
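
In a real project, the raw table would usually be declared as a dbt source and referenced with the source() macro instead of being hard-coded. Here's a minimal sketch of that variant, assuming raw_dataset.yellow_taxi_data has been registered as a source in a schema .yml file:

-- models/staging/stg_yellow_taxi.sql (source-based variant)
SELECT *
FROM {{ source('raw_dataset', 'yellow_taxi_data') }}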

The core models then apply business logic to create fact and dimension tables.

2️⃣ Fact and Dimension Tables

A key part of Analytics Engineering is structuring data properly. dbt makes it easy to define:

  • Fact tables: Store transactional events (e.g., taxi trips, sales records).
  • Dimension tables: Contain descriptive attributes (e.g., customer info, driver details).

Here’s a simplified fact table (fact_trips) built from the staging taxi trip data:

-- models/core/fact_trips.sql
SELECT 
    trip_id,
    passenger_count,
    fare_amount,
    trip_distance,
    pickup_datetime,
    dropoff_datetime,
    vendor_id
FROM {{ ref('stg_yellow_taxi') }}

This pulls from the staging model (stg_yellow_taxi) via ref(), which resolves the table name and tells dbt to build the staging model first, so the fact table always works from clean, structured data.
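
On the dimension side, a model can be as small as a lookup that decodes a code column. Here is a minimal sketch; the dim_vendors name and the vendor labels are illustrative, not part of the course project:

-- models/core/dim_vendors.sql (illustrative dimension model)
SELECT DISTINCT
    vendor_id,
    CASE vendor_id
        WHEN 1 THEN 'Creative Mobile Technologies'
        WHEN 2 THEN 'VeriFone'
        ELSE 'Unknown'
    END AS vendor_name
FROM {{ ref('stg_yellow_taxi') }}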

3️⃣ Materializations: Choosing the Right Approach

dbt provides different ways to store transformed data, called materializations. Each option balances performance and cost differently; a short configuration sketch follows the list:

  • View: A virtual table that always runs fresh queries, saving storage but increasing compute costs.
  • Table: A static table that is fully rebuilt when dbt run is executed—great for stable datasets.
  • Incremental: Updates only new or changed records, making it ideal for large datasets that grow over time.
  • Ephemeral: A temporary transformation that exists only within a query, reducing unnecessary table creation.
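
Materializations are set in dbt_project.yml or per model with a config() block. As a sketch, here is what an incremental version of fact_trips could look like, filtering on the pickup_datetime column from the staging model above:

-- models/core/fact_trips.sql (incremental variant, illustrative)
{{ config(materialized='incremental') }}

SELECT
    trip_id,
    passenger_count,
    fare_amount,
    trip_distance,
    pickup_datetime,
    dropoff_datetime,
    vendor_id
FROM {{ ref('stg_yellow_taxi') }}
{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already in the table
WHERE pickup_datetime > (SELECT MAX(pickup_datetime) FROM {{ this }})
{% endif %}

On the first dbt run the table is built in full; later runs only process the rows that pass the incremental filter.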

Testing and Documenting Models

dbt makes testing data easy. Want to make sure trip_id is always unique and fare_amount is never NULL? Just define it:

# models/core/schema.yml
version: 2

models:
  - name: fact_trips
    description: "Aggregated taxi trip data."
    columns:
      - name: trip_id
        tests:
          - not_null
          - unique
      - name: fare_amount
        tests:
          - not_null

Running dbt test then checks every one of these constraints on each build. Simple, but game-changing when working with live pipelines.

Final Thoughts

After the first three modules, this one changed the way I think about data transformation. dbt turns SQL into something scalable, tested, and version-controlled, which feels like a natural evolution for analytics workflows.

dbt + BigQuery = a killer combo for building clean, efficient data models.

You can read more about dbt here. There is also a free dbt fundamentals course that I highly recommend if you want to explore it further.