My Datastage Notes: Slowly changing dimesion

Introduction

The Slowly Changing Dimension stage was added in the 8.0 release of InfoSphere Information Server DataStage. It is designed specifically to populate and maintain records in star schema data models, specifically dimension tables. The Slowly Changing Dimension stage encapsulates all of the dimension maintenance logic — finding existing records, generating surrogate keys, checking for changes, and what action to take when changes occur. In addition, it allows associating dimension record surrogate key values with source records, which eliminates the need for additional lookups in later processing.

Basic Concepts

Facts and Dimensions

Dimensional Modeling is a logical design technique used in data warehouses intended to support end-user queries oriented around understandability. A Star schema is a method of Dimensional Modeling in which the data that is being measured, called the facts, are stored in one table, called the Fact table and the business objects/entities that are involved in the events being measured are stored in tables called Dimension tables. The facts in the fact table are linked to the business objects in the associated dimension tables using foreign keys. A Star schema has one fact table and several dimension tables.

Because fact tables record the measurements generated from business events, they tend to grow rapidly. For example, Sales fact table records information about sales transactions, where every transaction results in insertion of a new row in the fact table. Dimension tables, on the other hand, tend to grow or change less frequently, like records in a Branch dimension are inserted when a new Branch is introduced, or if information about an existing Branch is changed; The ways these changes are made to the dimension table depend on the types of changes and what information must be preserved in the data warehouse.

Changes to attribute information are handled in one of two ways:

Overwrite/Correction of Errors — the existing row in the dimension table is updated to contain the new attribute values; the old values are no longer available. This is commonly referred to as a Type1 change. These changes usually relate to the corrections of certain types of data errors, such as the spelling of a name. This approach will be chosen if the primary purpose is correction of data or if there is no interest in keeping the history of previous values and no need to run prior reports.

Tracking History/Partitioning History — the existing row in the dimension table is modified to indicate that it is no longer current (that is, it has been expired), and a new row is inserted with the current attribute values. This is commonly referred to as a Type2 change. The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. For example, change in the marital status of the employee. Type 2 SCD perfectly partitions history because each detailed version of a dimensional identity is correctly connected to the span of fact table records for which that version is exactly correct.

Surrogate Keys

Surrogate Keys are values that are generated specifically for the purpose of uniquely identifying dimension table rows. The primary reasons you would use a surrogate key rather than the usual business key of the object in the dimension table are:

· When tracking history in the dimension table, there will be multiple rows in the dimension table for the same business key. Therefore, it is not possible to use the business key as the primary key.

· Typical fields that are used as business keys generally don’t change, but situations can arise where they do change. For example, US citizens can be assigned a new social security number, or account numbers may be reassigned after a merger.

Surrogate keys provide a way for the dimension table to have a reliable, unique, and never-changing primary key.

Implementation

SCD stages support both SCD Type 1 and SCD Type 2 processing. Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table.

Scenario schema

In this document updation of one Dimension table will be illustrated using an SCD stage. Branch dimension table will be updated using a source file. The source file contains Student Details records. Let us have a look at the data in these tables.

Source: The source data file is extracted from file named StudentDetail.dat using sequential file stage and it contains five records that, when processed, apply changes to the dimension table. Table 1 shows the contents of the file.

Table 1: Source data

CollegeId	CollegeName	Principal	BrCode	BrName	Descr	MaleStud	FemaleStud
AAA	AAA COLLEGE OF ENGG	Jefferson	111	ECE	Electronics and Communications Subjects are dealt	340	157
BBB	BBB COLLEGE OF ENGG	Adams	222	CSE	Computer Science and Information Technology subjects are dealt	400	250
CCC	CCC COLLEGE OF ENGG	Mitchell	333	EEE	Electrical and Electronics Engineering Subjects are dealt	150	120
DDD	DDD COLLEGE OF ENGG	Madison	444	ME	Mechanical Engineering subjects are dealt	250	175
EEE	EEE COLLEGE OF ENGG	Monroe	555	CE	Civil Engineering Subjects are dealt	420	350

Branch dimension: The Branch dimension is a table in the target database. Initially this table contains records for three Branches. When the source data is processed, the table is updated to contain new Branch records, and to track the history of changed Branch information.

Table 2: Initial Branch dimension data

BranchSK	BrCode	BrName	Descr	Curr	EffDate	ExpDate
1	111	ECE	Electronics and Communications Subjects are dealt	Y	01-03-2004
2	222	CSE	Electrical and Electronics Engineering Subjects are dealt	Y	21-02-2002
10	333	EEEE	Electronics and Communications Subjects are dealt	Y	01-07-2008

Roadmap:

This example uses surrogate key generators that use state files to record the key values that have been used. This ensures that unique values are always generated. A job “SurrogateKey_File.dsx” is developed to initialize the state files. This job reads the Branch dimension table then creates and updates the respective surrogate key generator state files.Initializing the surrogate keys:

Figure 2: Job to update SK values

Building the Slowly Changing Dimensions job:

In this step we build a job that reads the StudentDetail.dat source file, updates the Branch dimension, and inserts records into a dataset which in turn is used to load Fact table.

Figure 3: Job Design

The primary flow of records is from left to right in the job design. The source records are read from StudentDetail, passed to the SCD stage to process the Branch dimension and then passed to a dataset which is later processed to load the fact table. As part of the processing in the SCD stage, the surrogate key values that are associated with the source records are obtained from the dimension table and added to the data being passed to the fact table.

The Branch dimension table acts as a reference source to the SCD stage. This table is used to initialize the lookup cache. Only records that are considered current are stored in the lookup cache. Any historical records in the dimension tables are automatically filtered out during initial processing. The SCD stage uses the data values from the primary input link to lookup into the cache and check for changes. If any changes are required to the dimension table, they are written to the secondary output link of the SCD stage, which is called the dimension update link. Target database stages are connected to the dimension update link to apply the changes to the actual dimension table in the database.

Each record on the primary input link of the SCD stage will go out on the primary output link, and may produce zero, one, or two records on the dimension update link. The number of records produced depends on what, if any, action needs to be taken on the dimension table.

No Action: Unchanged records require no action to the dimension table, so no records are written on the dimension update link.
Type 1 Change: New records and overwriting updates (Type1) require a one row change to the dimension table. The change is either an insert or an update. One record is written on the dimension update link to reflect these types of changes.
Type 2 Change: Changed records that are tracking history (Type2) require a two row change to the dimension table. The existing record must be updated to reflect that it is no longer current, and a new record must be inserted for the new set of values. Two records are written to the dimension update link to reflect these changes.

Configure the primary source stage

The source stage must be configured to read the StudentDetail.dat file. Complete the following steps to configure the StudentDetail sequential file stage:

On the Output|Properties tab, set the File property to C:SCDStudentDetail.dat.
On the Output|Format tab, add the Record delimiter string property and set it to DOS Format.
On the Output|Format tab, remove the Final delimiter property.
Load the DataStageSlowly Changing DimensionsTableDefsStudentDetail table definition onto the output link.

Figure 4: Source stage

Configure the stages to process the Branch dimension

Three stages are used to process the Branch dimension. Reading the job design:

Ø The first stage specifies how to read the data from the dimension table.

Ø The SCD stage determines what changes need to be made to the dimension table and those changes are written to the dimension update link.

Ø The dimension update link is connected to the dimension update target stage, which specifies how to update the actual database table with the data produced by the SCD stage.

Configure the Branch dimension source stage

Complete the following steps to configure the Branch dimension Oracle Enterprise stage:

Ø On the Output|Properties tab, set the Read method property to Table.

Ø On the Output|Properties tab, set the Table property to BranchDim.

Ø On the Output|Properties tab, set the DB Options (specify user and password).

Ø On the Output|Properties tab, set the Remote Server property from Connection.

Ø Load the DataStageSlowly Changing DimensionsTableDefsBranchDim table definition onto the output link.

Figure 5: Branch dimension source

Configure the Branch dimension SCD stage

The SCD stage has two input links and two output links. This results in a high number of property link-tab combinations. The Fast Path control of the SCD stage editor lets you navigate directly to the tabs that require input in order to complete the stage configuration.

Fast Path control

Open the Branch dimension SCD stage editor and use the Fast Path control to set the properties as shown:

Fast Path page 1: Setting the output link

Use the drop down list to select the output link that is leading to the next SCD stage. This is the primary output of the stage. By default, the first output link connected to the stage is used as the primary output link. The other link automatically becomes the dimension update link.

Figure 6: Branch dimension SCD stage, Fast Path page 1

Fast Path page 2: Define the lookup condition and purpose codes

SCD Stage provides 9 purpose codes to track changes in a dimension. This information is used in a number of ways in the SCD processing. The first task on this page is to define what the various columns of the dimension table are used for. The choices for purpose codes are:

o Surrogate Key — this column is the primary key of the dimension table and is populated with a surrogate key value.

o Business Key — this column is the natural key of the dimension table not necessarily unique. This is used for lookup purpose, to find the dimension table row that corresponds to a source data row.

o Type 2 — check this column for a change in value. If the value has changed, perform a history tracking change to the dimension table.

o Type 1 — check this column for a change in value. If the value has changed, perform an overwriting change to the dimension table.

o Current Indicator — this column is used as a flag to indicate whether it is the most current record for a particular business key.

o Effective Date — this column is used to specify when a record first became the most current record, that is, when it became the active record.

o Expiration Date — this column is used to specify the ending date of when a record was the active record. For currently active records, this value is typically a future date or NULL.

o SK Chain — this column is used to store the surrogate key of the previous or next record in the history for a particular business key.

o (blank) — This column is not used for anything with respect to SCD processing. Data for this field is inserted into the table when a new row is inserted, but this column will not be checked for changes against the source data.

Set purpose codes for the columns as shown below in Figure 7. Because this dimension table is tracking history, it contains columns to track whether a row is current and the date range for when it was current.

Click on the BrCode source field and drag it to the BrCode dimension column to create the lookup condition.

Figure 7: Branch dimension SCD stage, Fast Path page 2

This tab is used to define the lookup keys from the source record to the dimension record. Any source column can be associated with any one dimension column. This creates an equality lookup condition between those columns. If more than one source column is associated with a dimension column, then those equality conditions are AND’ed together. In this manner, multi-column lookup keys can be used.

Fast Path page 3: Configuring the surrogate key generator

This tab specifies how surrogate keys are generated for this stage. Surrogate key generation can use DataStage’s file based surrogate generation or database sequence object based generation. Here we use file based method.

Set the Source name property C:SCDSKGBranchDim as shown in Figure 8. This is the surrogate key state file that is created by running the ‘SurrogateKey_File’ job. Leave the defaults for the other properties unchanged.

Figure 8: Branch dimension SCD stage, Fast Path page 3

Fast Path page 4: Defining the slowly changing dimension behavior and derivations

The DimUpdate tab is used to define several critical elements of SCD processing. The Derivation column is used to specify how to map elements of a source row to elements of the dimension table. The Expire column is used to specify what values need to change if an existing record needs to be expired. Expire expressions are only enabled when there are Type2 columns specified, and are only available for Current Indicator and Expiration Date columns.

If no matching record is found when the lookup is performed, the derivation expressions are applied and a record is written on the dimension update link to indicate a new record needs to be added to the dimension table. If a matching record is found, the derivation expressions are applied to the source columns, and then the results are compared to the corresponding columns of the dimension table. Columns specified as Type2 are compared first. If there is a change, two records are written on the dimension update link. The first record is an update record, to expire the matched row. The Expire expressions are used to calculate the values for the update row. The second record is a new record that contains all of the new values for all columns. If no Type2 columns have changed, the Type1 columns are compared. If there are any changes, one record is written on the dimension update link that indicates an update to the dimension table. The derivation expressions are used to calculate the values for the update record.

Set the Derivation expressions and the Expire expressions as shown below in Figure.

Figure 9: Branch dimension SCD stage, Fast Path page 4

Note that you are specifying these properties on the dimension update link. The output columns for this link were automatically propagated with their purpose codes from the dimension input link. The SCD stage only does this when the set of columns on the dimension update link is empty. It is possible to load a set of columns directly on the dimension update link; however, they must exactly match those specified on the dimension input link.

Fast Path page 5: Selecting the columns for Output Link

The Output Map tab is used to define what columns will leave this stage on the primary output link. This tab operates much like the Mapping tab of other stages. The only difference is that you can select columns from the primary input link and columns from the reference link to output. The columns coming from the primary source have the same values they entered the stage with. The columns coming from the reference link represent the values from the dimension table that correspond to the source row. Note that because the SCD processing has been done by the stage, every record from the primary source data will have a corresponding record in the dimension.

Figure 10: Branch dimension SCD stage, Fast Path page 5

Select the columns for output as shown below in Figure 10. The output link is initially empty. Create and map the output columns by dragging and dropping from the source to the target. Because the Branch dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the primary key associated with the source row is appended because that is the value that is required to be inserted into the fact table.

The stage is now configured to perform the dimension maintenance on the Branch dimension table.

Configure the Branch dimension target stage

This stage processes the dimension update link records produced by the Branch dimension SCD stage to update the actual dimension table in the database. Because incoming records represent both inserts and updates to the table, a Upsert write method must be used. Auto-generated update and insert statements take the purpose codes specified in the SCD stage into account to generate the correct update statement for this usage.

Complete the following steps to configure the Branch dimension update DB2 Enterprise stage:

Ø On the Input|Properties tab, set the Write Method property to Upsert.

Ø On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert.

Ø On the Input|Properties tab, set the Table property to BranchDim.

Ø On the Input|Properties tab, set the set the DB Options (specify user and password). On the Input|Properties tab, set the Remote Server property from Connection.

Figure 11: Branch dimension target

The stage is now configured to write to the BranchDim dimension table.

Final steps

Click the Compile button to start the compile. The SCD stage processing makes use of the transform operator. Run the job by clicking the Run button in the DataStage Designer.

Summary of changes to database tables

The Branch dimension has two update records, and four new records. Two of the new records are new objects to the dimension table, and two existing records had Type2 changes, resulting in the two updates and two of the new records.

Change	BranchSK	BrCode	BrName	Descr	Curr		EffDate	ExpDate
No Change	1	111	ECE	Electronics and Communications Subjects are dealt		Y	1/3/2004	31/12/3500
Expired (Type2)	2	222	CSE	Electrical and Electronics Engineering Subjects are dealt		N	21/2/2002	{Today’s Date}
Expired (Type2)	10	333	EEEE	Electronics and Communications Subjects are dealt		N	1/7/2008	{Today’s Date}
New Record	3	444	ME	Mechanical Engineering subjects are dealt		Y	{Today’s Date}	31/12/3500
New Record	4	555	CE	Civil Engineering Subjects are dealt		Y	{Today’s Date}	31/12/3500
New Record (Type2)	5	222	CSE	Computer Science and Information Technology subjects are dealt		Y	{Today’s Date}	31/12/3500
New Record (Type2)	6	333	EEE	Electrical and Electronics Engineering Subjects are dealt		Y	{Today’s Date}	31/12/3500

Conclusion

Thus the Slowly Changing Dimension stage can be used:

a) To greatly reduce the time spent on creating jobs for processing star schemas.

b) To process history-tracking changes and in-place changes to dimension tables.

c) To reduce fact table processing by augmenting the source data with associated dimension table surrogate keys that eliminate the need for an additional lookup.

My Datastage Notes

Pages

Sunday, July 8, 2018

Slowly changing dimesion

Fast Path control

11 comments: