Personal Projects

Formula1 Racing Databricks Project

Formula 1 is motorsport at its highest level, combining engineering, physics, human skill, and team strategy at the limits of performance, and featuring the world's most innovative constructors and talented drivers. It is the pinnacle of automotive innovation, and with the rise of F1 in mainstream media, it is easy to see why the thrill of racing at 200+ mph captivates so many of us.

This project focuses on building an end-to-end analytics pipeline on F1 racing data using Databricks and Azure Cloud. Data is ingested from the Ergast API and processed through a Medallion Architecture (Bronze -> Silver -> Gold) to build curated models for visualization and insights in Power BI.

Project Goals

I laid out the following goals before building this project:

Architecture Overview

F1 Data Architecture Diagram

The data for this project was sourced from the Ergast API. Databricks served as the analytics platform, where ELT pipelines were built following the Medallion Architecture. Unity Catalog was used to organize and govern the data within a single catalog, formula1, containing three databases: Bronze, Silver, and Gold. Each database was stored in a separate container in Azure Data Lake Storage Gen2.

The Bronze layer stores raw data as External Tables referencing JSON and CSV files. The Silver and Gold layers store structured and curated Delta Tables as Managed Tables.
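As a sketch of the external-table pattern, a Bronze table can be declared with a CREATE TABLE ... USING ... LOCATION statement. The helper below only assembles the SQL string (the manually defined column list is omitted for brevity), and the `abfss://` path uses a hypothetical placeholder storage account:

```python
# Sketch of the Bronze external-table DDL pattern used in this project.
# The storage-account name in the LOCATION path is a placeholder.

def bronze_external_table_ddl(table: str, file_format: str, location: str) -> str:
    """Build a CREATE TABLE statement for an external Bronze table
    (column definitions omitted to keep the sketch short)."""
    return (
        f"CREATE TABLE IF NOT EXISTS formula1.bronze.{table}\n"
        f"USING {file_format}\n"
        f"LOCATION '{location}'"
    )

ddl = bronze_external_table_ddl(
    "circuits", "CSV",
    "abfss://bronze@<storage-account>.dfs.core.windows.net/circuits.csv",
)
print(ddl)
```

Silver and Gold tables, by contrast, are created without a LOCATION clause, so Unity Catalog manages their storage as Managed Tables.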

Azure Data Factory orchestrates the data pipeline on a weekly refresh schedule, and Power BI connects to the Gold layer to deliver analytical visualizations and insights.

Technologies Used

Storage: Azure Data Lake Storage Gen2
Compute: Databricks / PySpark
Governance: Unity Catalog
Orchestration: Azure Data Factory
Visualization: Power BI

Medallion Architecture Overview

Medallion Architecture was implemented to incrementally improve data quality and structure as data moves through the pipeline:

Bronze: Stores raw data ingested from the Ergast API (JSON/CSV). External Tables were built on top of the raw files with manually defined schemas.
Silver: Refines data from the Bronze layer by standardizing formats, removing duplicates, renaming columns to snake_case, flattening nested structures, and adding metadata fields such as ingestion_date and data_source. This layer provides clean, analytics-ready tables.
Gold: Produces business-level curated models used for reporting and dashboarding. Tables from the Silver layer were joined and aggregated into unified tables, such as race results, driver standings, and constructor performance over seasons.
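The Silver-layer renaming and metadata logic can be sketched in plain Python. The actual pipeline runs these steps on PySpark DataFrames; dicts stand in for rows here so the sketch is self-contained:

```python
# Simplified illustration of the Bronze -> Silver cleanup described above.
import re
from datetime import date

def to_snake_case(name: str) -> str:
    """Rename a camelCase Bronze column, e.g. circuitRef -> circuit_ref."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

def to_silver(record: dict, data_source: str) -> dict:
    """Apply snake_case renames and add the Silver metadata fields."""
    clean = {to_snake_case(k): v for k, v in record.items()}
    clean["data_source"] = data_source
    clean["ingestion_date"] = date.today().isoformat()
    return clean

row = to_silver({"circuitId": 1, "circuitRef": "albert_park"}, "Ergast API")
```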

Table Inventory

Bronze

Circuits
  • circuitId
  • circuitRef
  • name
  • location
  • country
  • lat
  • lng
  • alt
  • url
Races
  • raceId
  • year
  • round
  • circuitId
  • name
  • date
  • time
  • url
Constructors
  • constructorId
  • constructorRef
  • name
  • nationality
  • url
Drivers
  • driverId
  • driverRef
  • number
  • code
  • name: forename, surname
  • dob
  • nationality
  • url
Results
  • resultId
  • raceId
  • driverId
  • constructorId
  • number
  • grid
  • position
  • positionText
  • positionOrder
  • points
  • laps
  • time
  • milliseconds
  • fastestLap
  • rank
  • fastestTime
  • fastestLapSpeed
  • statusId
Pitstops
  • raceId
  • driverId
  • stop
  • lap
  • time
  • duration
  • milliseconds
Lap Times
  • raceId
  • driverId
  • lap
  • position
  • time
  • milliseconds
Qualifying
  • qualifyId
  • raceId
  • driverId
  • constructorId
  • number
  • position
  • q1
  • q2
  • q3

Silver

Circuits
  • circuit_id
  • circuit_ref
  • name
  • location
  • country
  • latitude
  • longitude
  • altitude
  • data_source
  • file_date
  • ingestion_date
Races
  • race_id
  • race_year
  • round
  • circuit_id
  • name
  • race_timestamp
  • ingestion_date
  • data_source
  • file_date
Constructors
  • constructor_id
  • constructor_ref
  • name
  • nationality
  • data_source
  • ingestion_date
  • file_date
Drivers
  • driver_id
  • driver_ref
  • number
  • code
  • name
  • dob
  • nationality
  • data_source
  • file_date
  • ingestion_date
Results
  • result_id
  • race_id
  • driver_id
  • constructor_id
  • number
  • grid
  • position
  • position_text
  • position_order
  • points
  • laps
  • time
  • milliseconds
  • fastest_lap
  • rank
  • fastest_lap_time
  • fastest_lap_speed
  • data_source
  • file_date
  • ingestion_date
Pitstops
  • race_id
  • driver_id
  • stop
  • lap
  • time
  • duration
  • milliseconds
  • data_source
  • ingestion_date
Lap Times
  • race_id
  • driver_id
  • lap
  • position
  • time
  • milliseconds
  • data_source
  • ingestion_date
Qualifying
  • qualify_id
  • race_id
  • driver_id
  • constructor_id
  • number
  • position
  • q1
  • q2
  • q3
  • data_source
  • ingestion_date

Gold

Race Results
  • race_year
  • race_name
  • race_date
  • circuit_location
  • driver_name
  • driver_number
  • driver_nationality
  • team
  • grid
  • fastest_lap
  • race_time
  • points
  • position
  • race_id
  • file_date
  • ingestion_date
Driver Standings
  • race_year
  • driver_name
  • driver_nationality
  • total_points
  • wins
  • rank
Constructor Standings
  • race_year
  • team
  • total_points
  • wins
  • rank
Calculated Race Results
  • race_year
  • team_name
  • driver_id
  • driver_name
  • race_id
  • position
  • points
  • calculated_points
  • created_date
  • updated_date

Key Technical Highlights

Delta Lake

Delta Lake was used as the core storage layer on top of ADLS. Delta Lake enables efficient query performance while still leveraging open formats (Parquet for data, JSON for the transaction log). It provides ACID transactions, which make it possible to update and delete individual records. The MERGE functionality in Delta Lake plays a critical role in the incremental load process.

The Delta transaction log provides a full audit history of table changes and supports versioning, allowing tables to be rolled back or queried as of a specific point in time. This ensures data reproducibility and reliability across pipeline runs.

Screenshot showing Delta Lake capability to describe the history of a table
Delta Lake automatically maintains a transaction log for each table, recording every operation with details such as version number, user, operation type, and timestamp. This enables full auditability and traceability of data changes.
Screenshot showing Delta Lake capability to query different versions of a table
Delta Lake supports time travel, allowing tables to be queried at a specific version or timestamp. This enables reproducibility of analytical results and simplifies debugging and rollback.
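The commands behind these screenshots are standard Delta SQL. The snippet below assembles them as strings for illustration; the table name, version number, and timestamp are placeholder values, and in a Databricks notebook they would be executed via spark.sql(...):

```python
# Standard Delta SQL for table history and time travel, built as strings.
# Table name, version, and timestamp are illustrative placeholders.

table = "formula1.silver.results"

# Full audit trail of operations on the table.
describe_history = f"DESCRIBE HISTORY {table}"

# Time travel: query an earlier state of the table by version or timestamp.
as_of_version = f"SELECT * FROM {table} VERSION AS OF 3"
as_of_time = f"SELECT * FROM {table} TIMESTAMP AS OF '2021-11-01'"
```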
Incremental Load

Races in Formula1 usually occur weekly during the season. The pipeline was designed to idempotently load and process only new data instead of reloading the entire dataset, reducing compute cost and time.

A full load was implemented for the following tables, as their records are added or updated slowly over time and they are relatively small:

  • Circuits
  • Races
  • Constructors
  • Drivers

On the other hand, an incremental load was implemented for the following tables, as thousands of new records are added on a weekly basis:

  • Results
  • Pitstops
  • Lap Times
  • Qualifying

Moreover, additional logic had to be implemented to incrementally load data for the following curated tables in the Gold Layer:

  • Race Results
  • Driver Standings
  • Constructor Standings
  • Calculated Race Results

The MERGE operation supported by Delta tables was used to perform the incremental load:

Diagram showing the incremental load logic
Tables are partitioned by race_id. The MERGE condition matches on both race_id and the table's primary key, ensuring fast, idempotent incremental updates using Delta Lake's transactional MERGE capability.
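A minimal, dict-based sketch of these MERGE semantics. The real pipeline uses Delta Lake's MERGE INTO on Spark tables; the column names here follow the Results table above, and the point values are made up:

```python
# Dict-based sketch of MERGE INTO semantics for the incremental load:
# matched keys are updated, unmatched keys are inserted, and re-running
# the same batch leaves the target unchanged (idempotent).

def merge_results(target: dict, updates: list) -> dict:
    """Upsert rows keyed on (race_id, result_id), mimicking MERGE INTO."""
    for row in updates:
        key = (row["race_id"], row["result_id"])
        # WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT
        target[key] = row
    return target

table = {(1, 10): {"race_id": 1, "result_id": 10, "points": 25}}
new_batch = [
    {"race_id": 1, "result_id": 10, "points": 26},  # existing row: update
    {"race_id": 2, "result_id": 11, "points": 18},  # new row: insert
]
merge_results(table, new_batch)
```

Because the key includes race_id, Delta can prune to the matching partition, which is what keeps the weekly merge fast.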

Unity Catalog

Unity Catalog, the built-in data governance solution provided by Databricks, was used to manage and secure the data assets for this project. It provides strong capabilities for data discovery, lineage, access control, and auditing without requiring third-party products.

Features such as Global Search and Catalog Explorer simplify discovery of tables, columns, notebooks, and dashboards. Unity Catalog also offers fine-grained access control and data auditing: access control allows privileges to be granted on every object in Databricks, while auditing helps ensure that data usage complies with governance and security policies.

Screenshot showing Unity Catalog Catalog Explorer features for circuits table in Bronze database
Unity Catalog offers capabilities to provide semantic meaning to data. Data stewards can add descriptions and comments to tables and columns, and can enforce column masking for sensitive fields. This enhances data understanding and discovery.
Screenshot showing Unity Catalog capability to view data lineage
Unity Catalog offers capabilities to trace the journey of data as it moves through a pipeline. This improves trust, transparency, and reliability in data. While Unity Catalog does not offer sophisticated column-level lineage, it provides valuable visibility into table-level data movement.

Although Unity Catalog is a relatively new feature, it already covers many foundational governance capabilities. However, it still falls short of enterprise data catalog tools like Collibra, which provide more robust governance features.

ADF Data Orchestration

Azure Data Factory was used to orchestrate the data pipelines for this project. Databricks provides built-in orchestration features such as Delta Live Tables, Workflows, and Jobs; however, ADF offers more flexible orchestration, allowing multiple notebooks to run in parallel. ADF also integrates seamlessly with many systems, including Databricks, which is why it was selected for this project.

A tumbling window of one week, starting from the cut-over file date, was selected to run the pipeline at the end of each Saturday. If new data files for the week are present, ADF runs the pipeline to update the dataset.
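The window-boundary arithmetic can be illustrated with a small helper. The actual schedule lives in the ADF tumbling window trigger configuration, so this is only a sketch of the cadence:

```python
# Sketch of the weekly tumbling-window boundary: given any date, find
# the Saturday that closes its window. Dates are illustrative.
from datetime import date, timedelta

def window_end_saturday(d: date) -> date:
    """Return the first Saturday on or after d (Monday=0 ... Saturday=5)."""
    return d + timedelta(days=(5 - d.weekday()) % 7)

race_day = date(2021, 3, 28)          # a Sunday race
end = window_end_saturday(race_day)   # -> 2021-04-03, end of that week's window
```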

ADF pipeline for ingestion (Bronze -> Silver)
ADF pipeline to transform (Silver -> Gold)
ADF master pipeline executing ingestion and transformation sequentially
Pipelines were created in Azure Data Factory by connecting to the Databricks cluster and notebooks. The first pipeline supports the ingestion (Bronze -> Silver) process and runs multiple notebooks in parallel. The second pipeline supports the transformation (Silver -> Gold) process. The third pipeline combines the two pipelines sequentially. (Not captured in the first and second pipelines: an If Exists condition that checks whether new data exists.)

Using ADF as the orchestration layer ensured a fully automated, maintainable, and scalable data pipeline that integrated cleanly with Databricks, Unity Catalog, and Azure Storage.

Analytical Insights

The core ROI of any data initiative is enabling better data-driven insights and decisions. Many analytical insights can be extracted from the processed data in this project, but I focused on two analyses that capture long-term performance trends in Formula1.

By calculating the average points of drivers and constructors across different seasons, I identified the drivers and constructors that have performed exceptionally well throughout history. Average points offer a normalized view of competitive strength, allowing different eras, regulations, and team structures to be compared more fairly than cumulative season totals would.
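The metric itself is simple. Below is a toy recreation of it, assuming rows shaped like the Gold-layer race results; the drivers and point values are made up for illustration, and the real computation runs as an aggregation over the Gold tables:

```python
# Toy version of the season-average metric: mean points per driver per season.
from collections import defaultdict

def average_points_per_season(rows):
    """Average points per (driver_name, race_year) across that season's races."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of points, race count]
    for r in rows:
        acc = totals[(r["driver_name"], r["race_year"])]
        acc[0] += r["points"]
        acc[1] += 1
    return {k: pts / n for k, (pts, n) in totals.items()}

rows = [
    {"driver_name": "Hamilton", "race_year": 2020, "points": 25},
    {"driver_name": "Hamilton", "race_year": 2020, "points": 18},
    {"driver_name": "Vettel",   "race_year": 2020, "points": 10},
]
avg = average_points_per_season(rows)  # Hamilton 2020 -> 21.5
```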

It is interesting to observe that the most competitive seasons are often characterized by sharp rises in a driver's or constructor's average points, while flatter, overlapping trends indicate more dominant eras.

Dominant F1 drivers over time, showing yearly average points for the top performers. Each colored area represents a single driver, with gaps filled to reflect seasons where the driver did not compete. Since the Ergast API has been deprecated, the analysis is unfortunately limited to data through 2021.

We can look at the above graph to identify dominant drivers across various eras in Formula1 racing, including Schumacher's peak in the early 2000s, Vettel's surge between 2010 and 2013, and Hamilton's continued dominance after he joined Mercedes in 2013. Unfortunately, since the Ergast API has been deprecated, the graph does not capture the more recent dominance of drivers such as Max Verstappen.

These patterns emerge naturally through the data cleaning and modeling process, and they validate the integrity of the pipeline used to compute yearly performance.

Dominant F1 constructors over time, showing yearly average points for the top-performing teams. Zero values are used for seasons where teams were inactive to provide an accurate historical timeline.

Constructor performance trends reveal long-running engineering excellence: Ferrari's presence across decades, Red Bull's ascent in the 2010s, and Mercedes' overwhelming dominance during the turbo-hybrid era. Together, the driver and constructor dominance trends showcase the importance of both driver skill and engineering infrastructure in shaping championship outcomes.

Formula1 fans, or rather fans of any sport, love sports analytics like these, as they provide historical context for their favorite athletes and teams and show how the competitive landscape has evolved over the years.

Overall, these analytics demonstrate a few of the many meaningful insights that can be obtained from well-modeled data.

Resources

This project closely follows this course on Udemy by Ramesh Retnasamy. I used the course as a guideline and expanded its scope by building the project with Unity Catalog. I would recommend it to anyone trying to get into Databricks.