Thousands of customers rely on Amazon Redshift to build data warehouses that accelerate time to insights with fast, simple, and secure analytics at scale, running complex analytical queries over terabytes to petabytes of data. Organizations create data marts, which are subsets of the data warehouse, usually oriented toward gaining analytical insights specific to a business unit or team. The star schema is a popular data model for building data marts.
In this post, we show how to simplify data loading into a Type 2 slowly changing dimension in Amazon Redshift.
Star schema and slowly changing dimension overview
A star schema is the simplest type of dimensional model, in which the center of the star can have one fact table and a number of associated dimension tables. A dimension is a structure that captures reference data along with associated hierarchies, while a fact table captures different values and metrics that can be aggregated by dimensions. Dimensions provide answers to exploratory business questions by allowing end-users to slice and dice data in a variety of ways using familiar SQL commands.
Whereas operational source systems contain only the latest version of master data, the star schema enables time travel queries to reproduce dimension attribute values on past dates when the fact transaction or event actually happened. The star schema data model allows analytical users to query historical data tying metrics to corresponding dimensional attribute values over time. Time travel is possible because dimension tables contain the exact version of the associated attributes at different time ranges. Relative to the metrics data that keeps changing on a daily or even hourly basis, the dimension attributes change less frequently. Therefore, dimensions in a star schema that keep track of changes over time are referred to as slowly changing dimensions (SCDs).
Data loading is one of the key aspects of maintaining a data warehouse. In a star schema data model, the central fact table is dependent on the surrounding dimension tables. This is captured in the form of primary key-foreign key relationships, where the dimension table primary keys are referenced by foreign keys in the fact table. In the case of Amazon Redshift, uniqueness, primary key, and foreign key constraints are not enforced. However, declaring them will help the optimizer arrive at optimal query plans, provided that the data loading processes enforce their integrity. As part of data loading, the dimension tables, including SCD tables, get loaded first, followed by the fact tables.
SCD population challenge
Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record’s validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record as of the previous day. During each data load, incoming change records are matched against existing active records by comparing each attribute value to determine whether an existing record has changed, was deleted, or is a new record coming in.
In this post, we demonstrate how to simplify data loading into a dimension table with the following methods:
- Using Amazon Simple Storage Service (Amazon S3) to host the initial and incremental data files from source system tables
- Accessing S3 objects using Amazon Redshift Spectrum to carry out data processing to load native tables within Amazon Redshift
- Creating views with window functions to replicate the source system version of each table within Amazon Redshift
- Joining source table views to project attributes matching with dimension table schema
- Applying incremental data to the dimension table, bringing it up to date with source-side changes
In a real-world scenario, records from source system tables are ingested on a periodic basis to an Amazon S3 location before being loaded into star schema tables in Amazon Redshift.
For this demonstration, data from two source tables, customer_master and customer_address, are combined to populate the target dimension table dim_customer, which is the customer dimension table.
The source tables customer_master and customer_address share the same primary key, customer_id, and are joined on it to fetch one record per customer_id along with attributes from both tables.
row_audit_ts contains the latest timestamp at which the particular source record was inserted or last updated. This column helps identify the change records since the last data extraction.
rec_source_status is an optional column that indicates if the corresponding source record was inserted, updated, or deleted. This is applicable in cases where the source system itself provides the changes and populates this column accordingly.
The following figure provides the schema of the source and target tables.
Let’s look closer at the schema of the target table, dim_customer. It contains different categories of columns:
- Keys – It contains two types of keys:
  - customer_sk is the primary key of this table. It is also called the surrogate key and has a unique value that is monotonically increasing.
  - customer_id is the source primary key and provides a reference back to the source system record.
- SCD2 metadata – rec_eff_dt and rec_exp_dt indicate the state of the record. These two columns together define the validity of the record. The value in rec_exp_dt will be set as ‘9999-12-31’ for presently active records.
- Attributes – Includes the customer attributes carried over from the customer_master and customer_address source tables.
Data loading into a SCD table involves a first-time bulk data loading, referred to as the initial data load. This is followed by continuous or regular data loading, referred to as an incremental data load, to keep the records up to date with changes in the source tables.
To demonstrate the solution, we walk through the following steps for initial data load (1–7) and incremental data load (8–12):
- Land the source data files in an Amazon S3 location, using one subfolder per source table.
- Use an AWS Glue crawler to parse the data files and register tables in the AWS Glue Data Catalog.
- Create an external schema in Amazon Redshift to point to the AWS Glue database containing these tables.
- In Amazon Redshift, create one view per source table to fetch the latest version of the record for each primary key (customer_id) value.
- Create the dim_customer table in Amazon Redshift, which contains attributes from all relevant source tables.
- Create a view in Amazon Redshift joining the source table views from Step 4 to project the attributes modeled in the dimension table.
- Populate the initial data from the view created in Step 6 into the dim_customer table.
- Land the incremental data files for each source table in their respective Amazon S3 locations.
- In Amazon Redshift, create a temporary table to accommodate the change-only records.
- Join the view from Step 6 and dim_customer, and identify change records by comparing the combined hash value of attributes. Populate the change records into the temporary table with an I, U, or D indicator.
- Update dim_customer, setting rec_exp_dt to the previous date, for all U and D records from the temporary table.
- Insert records into dim_customer, querying all I and U records from the temporary table.
Before you get started, make sure you have access to an AWS account with permissions to create and use Amazon Redshift, AWS Glue, and Amazon S3 resources.
Land data from source tables
Create separate subfolders for each source table in an S3 bucket and place the initial data files within the respective subfolders. In the following image, the initial data files for customer_master and customer_address are made available within two different subfolders. To try out the solution, you can use customer_master_with_ts.csv and customer_address_with_ts.csv as initial data files.
It’s important to include an audit timestamp (row_audit_ts) column that indicates when each record was inserted or last updated. As part of incremental data loading, rows with the same primary key value (customer_id) can arrive more than once. The row_audit_ts column helps identify the latest version of such records for a given customer_id to be used for further processing.
Register source tables in the AWS Glue Data Catalog
We use an AWS Glue crawler to infer metadata from delimited data files like the CSV files used in this post. For instructions on getting started with an AWS Glue crawler, refer to Tutorial: Adding an AWS Glue crawler.
Create an AWS Glue crawler and point it to the Amazon S3 location that contains the source table subfolders, within which the associated data files are placed. When you’re creating the AWS Glue crawler, create a new database named
rs-dimension-blog. The following screenshots show the AWS Glue crawler configuration chosen for our data files.
Note that for the Set output and scheduling section, the advanced options are left unchanged.
Running this crawler should create the following tables within the rs-dimension-blog database: customer_master and customer_address.
Create schemas in Amazon Redshift
First, create an AWS Identity and Access Management (IAM) role named
rs-dim-blog-spectrum-role. For instructions, refer to Create an IAM role for Amazon Redshift.
The IAM role has Amazon Redshift as the trusted entity, and the permissions policy includes
AWSGlueConsoleFullAccess, because we’re using the AWS Glue Data Catalog. Then associate the IAM role with the Amazon Redshift cluster or endpoint.
Alternatively, you can set the IAM role as the default for your Amazon Redshift cluster or endpoint. If you do so, in the following create external schema command, pass the iam_role parameter as default.
Now, open Amazon Redshift Query Editor V2 and create an external schema passing the newly created IAM role and specifying the database as
rs-dimension-blog. The database name
rs-dimension-blog is the one created in the Data Catalog as part of configuring the crawler in the preceding section. See the following code:
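A minimal sketch of the command follows. The external schema name rs_dim_blog_spectrum and the placeholder account ID are assumptions; substitute the actual ARN of the IAM role you created:

```sql
-- External schema pointing to the Glue Data Catalog database
-- (schema name and account ID are illustrative)
create external schema rs_dim_blog_spectrum
from data catalog
database 'rs-dimension-blog'
iam_role 'arn:aws:iam::<aws-account-id>:role/rs-dim-blog-spectrum-role';
```

If you set the IAM role as the default for your cluster, you can pass iam_role default instead of the full ARN.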
Check if the tables registered in the Data Catalog in the preceding section are visible from within Amazon Redshift:
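For example, assuming the external schema is named rs_dim_blog_spectrum as in the earlier sketch:

```sql
-- Verify that the Data Catalog tables are queryable through Redshift Spectrum
select * from rs_dim_blog_spectrum.customer_master limit 10;
select * from rs_dim_blog_spectrum.customer_address limit 10;
```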
Each of these queries will return 10 rows from the respective Data Catalog tables.
Create another schema in Amazon Redshift, named rs_dim_blog, to host the dim_customer table:
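The internal schema can be created with a single statement:

```sql
-- Internal schema that will hold the dim_customer dimension table
create schema rs_dim_blog;
```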
Create views to fetch the latest records from each source table
Create a view over the customer_master table that fetches the latest record for each customer_id:
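A possible definition is shown below. The view name vw_cust_mstr_latest, the external schema name rs_dim_blog_spectrum, and the attribute columns (first_name, last_name, email_id) are illustrative assumptions; adjust them to your own schema. Because the view references an external table, it is created as a late-binding view:

```sql
-- Latest record per customer_id from the customer_master source table
create or replace view rs_dim_blog.vw_cust_mstr_latest as
select customer_id,
       first_name,
       last_name,
       email_id,
       row_audit_ts
from (
    select customer_id,
           first_name,
           last_name,
           email_id,
           row_audit_ts,
           -- rank versions of each customer_id, newest first
           row_number() over (partition by customer_id
                              order by row_audit_ts desc) as rnum
    from rs_dim_blog_spectrum.customer_master
) t
where rnum = 1
with no schema binding;
```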
The preceding query uses row_number, which is a window function provided by Amazon Redshift. Using window functions enables you to create analytic business queries more efficiently. Window functions operate on a partition of a result set, and return a value for every row in that window. The row_number window function determines the ordinal number of the current row within a group of rows, counting from 1, based on the ORDER BY expression in the OVER clause. By including the PARTITION BY clause as customer_id, groups are created for each value of customer_id and ordinal numbers are reset for each group.
Create a view over the customer_address table in the same manner, again fetching the latest record for each customer_id:
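A sketch analogous to the customer_master view follows. The view name vw_cust_addr_latest, the external schema name rs_dim_blog_spectrum, and the address columns (city, country) are illustrative assumptions:

```sql
-- Latest record per customer_id from the customer_address source table
create or replace view rs_dim_blog.vw_cust_addr_latest as
select customer_id,
       city,
       country,
       row_audit_ts
from (
    select customer_id,
           city,
           country,
           row_audit_ts,
           row_number() over (partition by customer_id
                              order by row_audit_ts desc) as rnum
    from rs_dim_blog_spectrum.customer_address
) t
where rnum = 1
with no schema binding;
```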
Both view definitions use the row_number window function of Amazon Redshift, ordering the records by descending order of the row_audit_ts column (the audit timestamp column). The condition rnum=1 fetches the latest record for each customer_id.
Create the dim_customer table in Amazon Redshift
Create dim_customer as an internal table in Amazon Redshift within the rs_dim_blog schema. The dimension table includes the column customer_sk, which acts as the surrogate key column and enables us to capture a time-sensitive version of each customer record. The validity period for each record is defined by the columns rec_eff_dt and rec_exp_dt, representing record effective date and record expiry date, respectively. See the following code:
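A minimal sketch of the DDL follows. The attribute columns and their data types (first_name, last_name, email_id, city, country) are assumptions for illustration; replace them with the attributes from your source tables:

```sql
-- Type 2 SCD customer dimension table
create table rs_dim_blog.dim_customer (
    customer_sk  bigint,        -- surrogate key, monotonically increasing
    customer_id  bigint,        -- source primary key
    first_name   varchar(100),  -- attributes (illustrative)
    last_name    varchar(100),
    email_id     varchar(100),
    city         varchar(100),
    country      varchar(100),
    rec_eff_dt   date,          -- record effective date
    rec_exp_dt   date           -- record expiry date; '9999-12-31' = active
);
```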
Create a view to consolidate the latest version of source records
Create the view vw_dim_customer_src, which consolidates the latest records from both source tables using a left outer join, keeping them ready to be populated into the Amazon Redshift dimension table. This view fetches data from the latest views defined in the section “Create views to fetch the latest records from each source table”:
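A possible definition is sketched below, assuming the per-table latest views are named vw_cust_mstr_latest and vw_cust_addr_latest and the attribute columns are as assumed earlier:

```sql
-- Consolidated latest source records, shaped to match dim_customer attributes
create or replace view rs_dim_blog.vw_dim_customer_src as
select m.customer_id,
       m.first_name,
       m.last_name,
       m.email_id,
       a.city,
       a.country
from rs_dim_blog.vw_cust_mstr_latest m
left outer join rs_dim_blog.vw_cust_addr_latest a
  on m.customer_id = a.customer_id
with no schema binding;
```

The left outer join keeps customers that exist in customer_master even when no matching customer_address record has arrived yet.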
At this point, this view fetches the initial data for loading into the dim_customer table that we are about to populate. In your use case, use a similar approach to create and join the required source table views to populate your target dimension table.
Populate initial data into dim_customer
Populate the initial data into the dim_customer table by querying the view vw_dim_customer_src. Because this is the initial data load, running row numbers generated by the row_number window function will suffice to populate a unique value in the customer_sk column starting from 1:
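A sketch of the initial load follows, again assuming the illustrative attribute columns used earlier:

```sql
-- Initial load: every source record becomes an active dimension record
insert into rs_dim_blog.dim_customer
select row_number() over (order by customer_id) as customer_sk,
       customer_id,
       first_name,
       last_name,
       email_id,
       city,
       country,
       cast('2022-07-01' as date) as rec_eff_dt,
       cast('9999-12-31' as date) as rec_exp_dt
from rs_dim_blog.vw_dim_customer_src;
```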
In this query, we have specified ’2022-07-01’ as the value in rec_eff_dt for all initial data records. For your use case, you can modify this date value as appropriate to your situation.
The preceding steps complete the initial data loading into the
dim_customer table. In the next steps, we proceed with populating incremental data.
Land ongoing change data files in Amazon S3
After the initial load, the source systems provide data files on an ongoing basis, either containing only new and change records or a full extract containing all records for a particular table.
You can use the sample files customer_master_with_ts_incr.csv and customer_address_with_ts_incr.csv, which contain changed as well as new records. Place these incremental files in the same Amazon S3 locations where the initial data files were placed (see the section “Land data from source tables”); the corresponding Redshift Spectrum tables will then automatically read the additional rows.
If you used the sample file for customer_master, after adding the incremental files, the following query shows the initial as well as incremental records:
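For example, assuming the external schema name rs_dim_blog_spectrum and the illustrative attribute columns used earlier:

```sql
-- Shows every landed version of each customer record, oldest first
select customer_id,
       first_name,
       last_name,
       email_id,
       row_audit_ts
from rs_dim_blog_spectrum.customer_master
order by customer_id, row_audit_ts;
```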
In case of full extracts, we can identify deletes occurring in the source system tables by comparing the previous and current versions and looking for missing records. In case of change-only extracts where the
rec_source_status column is present, its value will help us identify deleted records. In either case, land the ongoing change data files in the respective Amazon S3 locations.
For this example, we have uploaded the incremental data for the customer_master and customer_address source tables, with a few customer_id records receiving updates and a few new records being added.
Create a temporary table to capture change records
Create the temporary table temp_dim_customer to store all changes that need to be applied to the target dim_customer table:
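A minimal sketch of the temporary table follows; the attribute columns mirror the illustrative dim_customer DDL assumed earlier, plus a hash column and a change-type indicator:

```sql
-- Staging table for change records; dropped automatically at session end
create temp table temp_dim_customer (
    customer_sk  bigint,         -- pre-computed surrogate key for inserts
    customer_id  bigint,
    first_name   varchar(100),   -- attributes (illustrative)
    last_name    varchar(100),
    email_id     varchar(100),
    city         varchar(100),
    country      varchar(100),
    hash_value   varchar(1024),  -- sha2 checksum over tracked attributes
    iud_op       char(1)         -- 'I' = insert, 'U' = update, 'D' = delete
);
```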
Populate the temporary table with new and changed records
This is a multi-step process that can be combined into a single complex SQL statement. Complete the following steps:
- Fetch the latest version of all customer attributes by querying the view vw_dim_customer_src.
Amazon Redshift offers hashing functions such as sha2, which converts a variable length string input into a fixed length character output. The output string is a text representation of the hexadecimal value of the checksum with the specified number of bits. In this case, we pass a concatenated set of customer attributes whose change we want to track, specifying the number of bits as 512. We’ll use the output of the hash function to determine if any of the attributes have undergone a change. This dataset will be called
newver (new version).
Because we landed the ongoing change data in the same location as the initial data files, the records retrieved from the preceding query (in newver) include all records, even the unchanged ones. But because of the definition of the view vw_dim_customer_src, we get only one record per customer_id, which is its latest version based on row_audit_ts.
- In a similar manner, retrieve the latest version of all customer records from dim_customer, which are identified by rec_exp_dt=‘9999-12-31’. While doing so, also retrieve the sha2 value of all customer attributes available in dim_customer. This dataset will be called oldver (old or existing version).
- Identify the current maximum surrogate key value from the dim_customer table. This value (maxval) will be added to the row_number before being used as the customer_sk value for the change records that need to be inserted.
- Perform a full outer join of the old version of records (oldver) and the new version (newver) of records on the customer_id column. Then compare the old and new hash values generated by the sha2 function to determine if the change record is an insert, update, or delete:

We tag the records as follows:
- If the customer_id is non-existent in the oldver dataset (oldver.customer_id is null), it’s tagged as an insert (‘I’).
- Otherwise, if the customer_id is non-existent in the newver dataset (newver.customer_id is null), it’s tagged as a delete (‘D’).
- Otherwise, if the old and new hash_value are different, these records represent an update (‘U’).
- Otherwise, it indicates that the record has not undergone any change and therefore can be ignored or marked as not-to-be-processed (‘N’).
Make sure to modify the preceding logic if the source extract contains rec_source_status to identify deleted records.
Because sha2 maps a possibly infinite set of input strings to a finite set of output strings, a hash collision between the original row values and the changed row values is theoretically possible but very unlikely. Instead of individually comparing each column value before and after, we compare the hash values generated by sha2 to conclude if there has been a change in any of the attributes of the customer record. For your use case, we recommend you choose a hash function that works for your data conditions after adequate testing. Alternatively, you can compare individual column values if none of the hash functions satisfactorily meet your expectations.
- Combining the outputs from the preceding steps, let’s create the INSERT statement that captures only change records to populate the temporary table:
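The combined statement can be sketched as follows. The dataset aliases (newver, oldver, maxsk) and the attribute columns being hashed are illustrative assumptions consistent with the earlier sketches; adapt the concatenation list to the attributes you track:

```sql
-- Capture only change records (I/U/D) into the temporary table
insert into temp_dim_customer
with newver as (
    -- latest version of each customer from the consolidated source view
    select customer_id, first_name, last_name, email_id, city, country,
           sha2(coalesce(first_name, '') || coalesce(last_name, '') ||
                coalesce(email_id, '')   || coalesce(city, '') ||
                coalesce(country, ''), 512) as hash_value
    from rs_dim_blog.vw_dim_customer_src
),
oldver as (
    -- presently active records in the dimension table
    select customer_id,
           sha2(coalesce(first_name, '') || coalesce(last_name, '') ||
                coalesce(email_id, '')   || coalesce(city, '') ||
                coalesce(country, ''), 512) as hash_value
    from rs_dim_blog.dim_customer
    where rec_exp_dt = '9999-12-31'
),
maxsk as (
    -- current maximum surrogate key; 0 if the table is empty
    select coalesce(max(customer_sk), 0) as maxval
    from rs_dim_blog.dim_customer
)
select maxsk.maxval
         + row_number() over (order by coalesce(newver.customer_id,
                                                oldver.customer_id)) as customer_sk,
       coalesce(newver.customer_id, oldver.customer_id) as customer_id,
       newver.first_name,
       newver.last_name,
       newver.email_id,
       newver.city,
       newver.country,
       newver.hash_value,
       case when oldver.customer_id is null then 'I'   -- new customer
            when newver.customer_id is null then 'D'   -- deleted at source
            else 'U'                                   -- attributes changed
       end as iud_op
from newver
full outer join oldver
  on newver.customer_id = oldver.customer_id
cross join maxsk
where oldver.customer_id is null
   or newver.customer_id is null
   or oldver.hash_value <> newver.hash_value;  -- drop unchanged ('N') rows
```

The WHERE clause filters out unchanged records, so only I, U, and D rows land in temp_dim_customer.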
Expire updated customer records
With the temp_dim_customer table now containing only the change records (either ‘I’, ‘U’, or ‘D’), the same can be applied on the target dim_customer table.
Let’s first fetch all records with values ‘U’ or ‘D’ in the iud_op column. These are records that have either been deleted or updated in the source system. Because dim_customer is a slowly changing dimension, it needs to reflect the validity period of each customer record. In this case, we expire the presently active records that have been updated or deleted. We expire these records as of yesterday (by setting rec_exp_dt=current_date-1), matching on the customer_id column:
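A sketch of the expiry step, consistent with the earlier assumed table definitions:

```sql
-- Expire presently active records that were updated or deleted at source
update rs_dim_blog.dim_customer
set rec_exp_dt = current_date - 1
where rec_exp_dt = '9999-12-31'
  and customer_id in (
        select customer_id
        from temp_dim_customer
        where iud_op in ('U', 'D')
  );
```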
Insert new and changed records
As the last step, we need to insert the newer version of updated records along with all first-time inserts. These are indicated by ‘U’ and ‘I’, respectively, in the iud_op column of the temp_dim_customer table:
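A sketch of the final insert, again using the illustrative column list; the surrogate keys were already computed (maxval plus running row number) when the temporary table was populated:

```sql
-- Insert first-time records and new versions of updated records as active
insert into rs_dim_blog.dim_customer
select customer_sk,
       customer_id,
       first_name,
       last_name,
       email_id,
       city,
       country,
       current_date               as rec_eff_dt,
       cast('9999-12-31' as date) as rec_exp_dt
from temp_dim_customer
where iud_op in ('U', 'I');
```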
Depending on the SQL client setting, you might want to run a commit transaction; command to make sure that the preceding changes are persisted successfully in Amazon Redshift.
Check the final output
You can run the following query and see that the dim_customer table now contains both the initial data records plus the incremental data records, capturing multiple versions for those customer_id values that got changed as part of incremental data loading. The output also indicates that each record has been populated with appropriate values in rec_eff_dt and rec_exp_dt corresponding to the record validity period:
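For example, with the illustrative attribute columns assumed throughout:

```sql
-- All versions of each customer, ordered by validity period
select customer_id,
       first_name,
       last_name,
       email_id,
       city,
       country,
       rec_eff_dt,
       rec_exp_dt,
       customer_sk
from rs_dim_blog.dim_customer
order by customer_id, rec_eff_dt;
```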
For the sample data files provided in this post, the preceding query returns the following records. Note that the values in customer_sk on your system may not match what is shown in the following table.
In this post, we only show the important SQL statements; the complete SQL code is available in load_scd2_sample_dim_customer.sql.
Clean up
If you no longer need the resources you created, delete them to prevent incurring additional charges.
Conclusion
In this post, you learned how to simplify data loading into Type-2 SCD tables in Amazon Redshift, covering both initial data loading and incremental data loading. The approach deals with multiple source tables populating a target dimension table, capturing the latest version of source records as of each run.
Refer to Amazon Redshift data loading best practices for further materials and additional best practices, and see Updating and inserting new data for instructions to implement updates and inserts.
About the Author
Vaidy Kalpathy is a Senior Data Lab Solution Architect at AWS, where he helps customers modernize their data platforms and define end-to-end data strategy, including data ingestion, transformation, security, and visualization. He is passionate about working backwards from business use cases, creating scalable and custom-fit architectures to help customers innovate using data analytics services on AWS.