Read the incoming records through any input stage like sequential file/dataset/table.
• Do the required processing for the incoming data.
• After the above processing step, pass the data into the change capture stage.
•
The change capture should be having two input links- one is the before
dataset and the other is the after dataset. For our job, the before
dataset should be the active records present in the table. The active
records are all those records which are having EXPR_DT=’2999-12-31’.
The after dataset will be the incoming data passed into change capture
after all the necessary processing.
•
The change capture stage compared the before dataset and after dataset
and produces 4 change_codes for each of the records. The 4 change codes
are as follows:
“0” – Copy code (The code indicates the after record is a copy of the before record)
“1”-Insert code (The code indicates a new record has been inserted in the after set that did not exist in the before set.)
“2”-Delete code(The code indicates that a record in the before set has been deleted from the after set)
“3”-Edit code(the code indicates the after record is an edited version of the before record)
The
copy records are not passed in the change captured stage as since we
need only edited, insert records fro SCD II implementation.
• Use a filter stage to separate the records that needs to be expired and inserted.
•
Filter the records with change_code = “1 or 3” into the insert records
link. Filter the records with change_code=” 3” into update/expiry link.
•
The records with change_code=3 are edited records. So the original
records corresponding to these edited records are to be made in-active
(expired). We can make the records inactive by changing the
EXPR_DT<> ‘2999-12-31’.So to make the record inactive change the
EXPR_DT with a valid date. For e.g. you can use make the EXPR_DT as the
date one less than the date on which you are loading the data into the
table. We will assume that we are loading the data on 2008-08-15.So
the EXPR_DT for inactive records would become ‘2008-08-14’. The date
2008-08-15 can be made as the EFCT_DT for records to be inserted.
• To
get the original records which needs to be expired, “look-up” the
target table for all the records with change_code=3 which are filtered
out separately. Get the original record along with the EFCT_DT of the
original record. Then update the records EXPR_DT to ‘2008-08-14’ in the
table. Now the original records are made inactive (expired).
•
The new updated record (change_code=3) needs to be in table along with
the new insert records(change_code=1).This data is filtered out from
the “filter” stage and inserted into the table with EFCT_DT=”Data of
loading” i.e. “2008-08-15” and EXPR_DT=”2999-12-31” ---------------------------------
Datastage Implementations – Slowly Changing Dimensions
Basics of SCD
Slowly Changing Dimensions (SCDs) are dimensions that have data that
changes slowly, rather than changing on a time-based, regular schedule.
Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.
Here is an example of a database table that keeps supplier information:
-------------------------------------------------------------------
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | |||||||||||||||||||||||
123 | ABC | Acme Supply Co | CA |
--------------------------------------------------------------------
In this example, Supplier_Code is the natural key and Supplier_Key is
a surrogate key. Technically, the surrogate key is not necessary, since
the table will be unique by the natural key (Supplier_Code). However,
the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
----------------------------------------------------------------
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
123 | ABC | Acme Supply Co | IL |
---------------------------------------------------------------
Type 2
The Type 2 method tracks historical data by creating multiple records
for a given natural key in the dimensional tables with separate
surrogate keys and/or different version numbers. With Type 2, we have
unlimited history preservation as a new record is inserted each time a
change is made.
In the same example, if the supplier moves to Illinois, the table
could look like this, with incremented version numbers to indicate the
sequence of changes:
-----------------------------------------------------------------
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Version |
123 | ABC | Acme Supply Co | CA | 0 |
124 | ABC | Acme Supply Co | IL | 1 |
-----------------------------------------------------------------
Another popular method for tuple versioning is to add effective date columns.
-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date |
123 | ABC | Acme Supply Co | CA | 01-Jan-2000 | 21-Dec-2004 |
124 | ABC | Acme Supply Co | IL | 22-Dec-2004 |
------------------------------------------------------------------------------------
The null End_Date in row two indicates the current tuple version. In
some cases, a standardized surrogate high date (e.g. 9999-12-31) may be
used as an end date, so that the field can be included in an index, and
so that null-value substitution is not required when querying.
How to Implement SCD using DataStage 8.1 –SCD stage?
Step 1: Create a datastage job with the below structure-
- Source file that comes from the OLTP sources
- Old dimesion refernce table link
- The SCD stage
- Target Fact Table
- Dimesion Update/Insert link
Step 2: To set up the SCD properties in the SCD stage ,open the stage and access the Fast Path
Step 3: The tab 2 of SCD stage is used specify the purpose of each of the pulled keys from the referenced dimension tables.
Step 4: Tab 3 is
used to provide the seqence generator file/table name which is used to
generate the new surrogate keys for the new or latest dimesion
records.These are keys which also get passed to the fact tables for
direct load.
Step 5: The Tab 4
is used to set the properties for configuring the data population logic
for the new and old dimension rows. The type of activies that we can
configure as a part of this tab are:
- Generation the new Surrogate key values to be passed to the dimension and fact table
- Mapping the source columns with the source column
- Setting up of the expired values for the old rows
- Defining the values to mark the current active rows out of multiple type rows
Step 6: Set the derivation logic for the fact as a part of the last tab.
Step 7: Complete the remaining set up, run the job