Slowly Changing Dimension (SCD) Type 2 is a data warehousing technique used to track historical changes in a dimension table. Rather than overwriting old records, it inserts a new row for each change and marks older versions as inactive, preserving a full audit trail over time. For example, if an employee moves from Sales to Marketing, the old Sales row is kept but flagged inactive and given an end date, while a new Marketing row becomes the active record. This differs from SCD Type 1, which overwrites old records so that only the most recent data is kept.
The Challenge - Implement SCD Type 2 in Lakehouse
We will use PySpark (an Apache Spark interface in Python) to implement an SCD Type 2 in a Microsoft Fabric Lakehouse to track changes in employee records over time.
For this project, we will utilize the scd_data.csv file provided by Francis Folaranmi for the Fabric User Group Nigeria Saturday challenges. This data is an employee master containing employee records. Over time, employees may move between departments or get promoted. We want to track every change, avoid overwriting data, and ensure there is only one active record per employee.
Learn more about the project from the official GitHub.
Step Zero
Navigate to the Lakehouse environment in your workspace and save the data, as received, inside the Files folder of OneLake. Set that up before moving on.
Step One
Load the data from your file and set any null values in the load date to today's date.
To do this, create a new notebook and paste the PySpark code below. It attempts to replace null values in LoadDate with the current date.
from pyspark.sql.functions import col, current_date, when
from pyspark.sql.types import DateType

df_raw = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("Files/scd_data.csv")
)

df_cleaned = df_raw.withColumn(
    "LoadDate",
    when(col("LoadDate").isNull(), current_date()).otherwise(col("LoadDate").cast(DateType()))
)

display(df_cleaned)
However, this didn’t work: there were still null values. To resolve this, let’s first trim the whitespace and ensure the column is actually of type Date, then drop duplicate rows and replace the remaining nulls with today’s date.
from pyspark.sql.functions import col, current_date, trim, when
from pyspark.sql.types import DateType

df_casted = df_raw.withColumn(
    "LoadDate",
    trim(col("LoadDate")).cast(DateType())
)

df_deduped = df_casted.dropDuplicates()

df_final = df_deduped.withColumn(
    "LoadDate",
    when(col("LoadDate").isNull(), current_date()).otherwise(col("LoadDate"))
)

df_final.select("LoadDate").distinct().show()
Step Two
Add the SCD Type 2 tracking columns (StartDate, EndDate, IsActive) to all records.
from pyspark.sql.functions import col, lit

df_addcolumn = (
    df_final
    .withColumn("StartDate", col("LoadDate"))
    .withColumn("EndDate", lit(None).cast("date"))
    .withColumn("IsActive", lit(True))
)
Step Three
Use window functions partitioned by EmpID and ordered by LoadDate to track historical changes. Add row numbers and use lag() to compare current vs previous values.
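Here is a minimal sketch of that step. It builds on the df_addcolumn DataFrame from Step Two and assumes the tracked attributes are named Department and JobTitle; the actual column names in scd_data.csv may differ.

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, row_number

# Window over each employee's history, ordered by when the record was loaded
emp_window = Window.partitionBy("EmpID").orderBy("LoadDate")

df_tracked = (
    df_addcolumn
    .withColumn("RowNum", row_number().over(emp_window))
    # Previous values of the attributes we want to track (assumed column names)
    .withColumn("PrevDepartment", lag("Department").over(emp_window))
    .withColumn("PrevJobTitle", lag("JobTitle").over(emp_window))
    # Flag rows that are either the first version or differ from the previous load
    .withColumn(
        "IsChanged",
        (col("RowNum") == 1)
        | (col("Department") != col("PrevDepartment"))
        | (col("JobTitle") != col("PrevJobTitle"))
    )
)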
Step Four
Use lead() function to get the next LoadDate for each employee, then calculate EndDate and set IsActive status.
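A sketch of this step, continuing from the df_tracked DataFrame above: lead() pulls forward the next LoadDate for the same employee, so the current version can be closed the day before the next one starts, and only the latest version stays active.

from pyspark.sql import Window
from pyspark.sql.functions import col, date_sub, lead, lit, when

emp_window = Window.partitionBy("EmpID").orderBy("LoadDate")

df_scd2 = (
    df_tracked
    # The next load for the same employee closes the current version
    .withColumn("NextLoadDate", lead("LoadDate").over(emp_window))
    # EndDate is the day before the next version starts; null means still open
    .withColumn(
        "EndDate",
        when(col("NextLoadDate").isNotNull(), date_sub(col("NextLoadDate"), 1))
        .otherwise(lit(None).cast("date"))
    )
    # Only the most recent version of each employee remains active
    .withColumn("IsActive", col("NextLoadDate").isNull())
)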
The Final Step
Select the final columns and save to the lakehouse as dim_employee table. Each employee will have multiple records showing their history over time.
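As a rough sketch, the write could look like the following. The column list is an assumption based on the steps above; adjust it to match the actual columns in scd_data.csv.

# Keep only the dimension columns and write the result as a Lakehouse table
dim_employee = df_scd2.select(
    "EmpID", "Department", "JobTitle",
    "StartDate", "EndDate", "IsActive"
)

(
    dim_employee.write
    .mode("overwrite")
    .format("delta")
    .saveAsTable("dim_employee")
)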
And this is the architectural diagram for the project implementation. You can find the rest of the code on GitHub.
Automate with a Data Pipeline
Now, you can use Data Pipelines in Fabric to orchestrate your notebook on a schedule.
To do this, head over to Data Pipelines in your Fabric workspace. Create a new pipeline, add a Notebook activity, and link it to your Notebook.
Then save and publish the pipeline.
Sharing with Friends!
This is the first stackless challenge, and it is a Fabric challenge, a platform I am still getting the hang of.
If you have a better solution, please share it in the comments section. Alternatively, you can send me a message via Twitter or LinkedIn, or just post about it there and tag me.
Learn More!
Check out
Implementing SCD Type 2 with Microsoft Fabric— PySpark / DataflowGen2 by Nermeen Abdel-Aziz
A Deep Dive into SCD with Microsoft Fabric: Leveraging Notebooks, Pipelines, Dataflows, and Advanced ETL Techniques by Abdelghani Bouaddi
Pyspark Tutorial: Getting Started with Pyspark by Natassha Selvaraj