Slowly Changing Dimension (SCD) Type 2 is a data warehousing technique used to track historical changes in a dimension table. Rather than overwriting old records, it inserts a new row for each change and marks older versions as inactive, preserving a full audit trail over time. For example, if an employee moves from Sales to Marketing, the old Sales row is kept but flagged inactive and given an end date, while a new Marketing row becomes the active record. This differs from SCD Type 1, which overwrites old records so that only the most recent data is kept.
The Challenge - Implement SCD Type 2 in Lakehouse
We will use PySpark (an Apache Spark interface in Python) to implement an SCD Type 2 in a Microsoft Fabric Lakehouse to track changes in employee records over time.
For this project, we will utilize the scd_data.csv file provided by Francis Folaranmi for the Fabric User Group Nigeria Saturday challenges. This data is an employee master containing employee records. Over time, employees may move between departments or get promoted. We want to track every change, avoid overwriting data, and ensure there is only one active record per employee.
Learn more about the project from the official GitHub.
Step Zero
Navigate to the Lakehouse environment in your workspace and save the data, as received, inside the Files folder of OneLake. Set that up before moving on.
Step One
Load the data from your file and set any null values in the load date to today's date.
To do this, create a new notebook and paste the PySpark code below. It attempts to replace null values in LoadDate with the current date.
from pyspark.sql.functions import col, current_date, when
from pyspark.sql.types import DateType

df_raw = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("Files/scd_data.csv")
)

df_cleaned = df_raw.withColumn(
    "LoadDate",
    when(col("LoadDate").isNull(), current_date()).otherwise(col("LoadDate").cast(DateType()))
)

display(df_cleaned)
However, this didn’t work: there were still null values. To resolve this, let’s first trim the whitespace and ensure the column is actually of type Date, then drop duplicate rows and replace the remaining nulls with today’s date.
from pyspark.sql.functions import col, current_date, trim, when
from pyspark.sql.types import DateType

df_casted = df_raw.withColumn(
    "LoadDate",
    trim(col("LoadDate")).cast(DateType())
)

df_deduped = df_casted.dropDuplicates()

df_final = df_deduped.withColumn(
    "LoadDate",
    when(col("LoadDate").isNull(), current_date()).otherwise(col("LoadDate"))
)

df_final.select("LoadDate").distinct().show()
Step Two
Add the SCD Type 2 tracking columns (StartDate, EndDate, IsActive) to all records.
from pyspark.sql.functions import col, lit

df_addcolumn = (
    df_final
    .withColumn("StartDate", col("LoadDate"))
    .withColumn("EndDate", lit(None).cast("date"))
    .withColumn("IsActive", lit(True))
)
Step Three
Use window functions partitioned by EmpID and ordered by LoadDate to track historical changes. Add row numbers and use lag() to compare current vs previous values.
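Here is a minimal sketch of that step. It builds on the df_addcolumn DataFrame from Step Two and assumes the tracked attributes are named Department and JobTitle; the actual column names in scd_data.csv may differ.

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, row_number

# Window over each employee's history, ordered by when the record was loaded
emp_window = Window.partitionBy("EmpID").orderBy("LoadDate")

df_tracked = (
    df_addcolumn
    .withColumn("RowNum", row_number().over(emp_window))
    # Previous values of the attributes we want to track (assumed column names)
    .withColumn("PrevDepartment", lag("Department").over(emp_window))
    .withColumn("PrevJobTitle", lag("JobTitle").over(emp_window))
    # Flag rows that are either the first version or differ from the previous load
    .withColumn(
        "IsChanged",
        (col("RowNum") == 1)
        | (col("Department") != col("PrevDepartment"))
        | (col("JobTitle") != col("PrevJobTitle"))
    )
)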
Step Four
Use lead() function to get the next LoadDate for each employee, then calculate EndDate and set IsActive status.
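A sketch of this step, continuing from the df_tracked DataFrame above: lead() pulls forward the next LoadDate for the same employee, so the current version can be closed the day before the next one starts, and only the latest version stays active.

from pyspark.sql import Window
from pyspark.sql.functions import col, date_sub, lead, lit, when

emp_window = Window.partitionBy("EmpID").orderBy("LoadDate")

df_scd2 = (
    df_tracked
    # The next load for the same employee closes the current version
    .withColumn("NextLoadDate", lead("LoadDate").over(emp_window))
    # EndDate is the day before the next version starts; null means still open
    .withColumn(
        "EndDate",
        when(col("NextLoadDate").isNotNull(), date_sub(col("NextLoadDate"), 1))
        .otherwise(lit(None).cast("date"))
    )
    # Only the most recent version of each employee remains active
    .withColumn("IsActive", col("NextLoadDate").isNull())
)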
The Final Step
Select the final columns and save to the lakehouse as dim_employee table. Each employee will have multiple records showing their history over time.
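As a rough sketch, the write could look like the following. The column list is an assumption based on the steps above; adjust it to match the actual columns in scd_data.csv.

# Keep only the dimension columns and write the result as a Lakehouse table
dim_employee = df_scd2.select(
    "EmpID", "Department", "JobTitle",
    "StartDate", "EndDate", "IsActive"
)

(
    dim_employee.write
    .mode("overwrite")
    .format("delta")
    .saveAsTable("dim_employee")
)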
And this is the architectural diagram for the project implementation. You can find the rest of the code on GitHub.
Automate with a Data Pipeline
Now, you can use Data Pipelines in Fabric to orchestrate your notebook on a schedule.
To do this, head over to Data Pipelines in your Fabric workspace. Create a new pipeline, add a Notebook activity, and link it to your Notebook.
Then save and publish the pipeline.
Sharing with Friends!
This is the first stackless challenge, and it is a Fabric challenge, a platform I am still getting the hang of.
If you have a better solution, please share it in the comments section. Alternatively, you can send me a message via Twitter or LinkedIn, or just post about it there and tag me.
Learn More!
Check out
Implementing SCD Type 2 with Microsoft Fabric— PySpark / DataflowGen2 by Nermeen Abdel-Aziz
A Deep Dive into SCD with Microsoft Fabric: Leveraging Notebooks, Pipelines, Dataflows, and Advanced ETL Techniques by Abdelghani Bouaddi
Pyspark Tutorial: Getting Started with Pyspark by Natassha Selvaraj