1. What type of SCD you have used so far?
Slowly Changing Dimensions (SCD) is dimensions that have data that slowly changes.
For example, we may have a Dimension in our database that tracks the sales records of your company's salesmen and when sales person is transferred from one regional office to another.
Dealing with these issues involves SCD management methodologies referred to as Type 0, 1, 2, 3, 4, and 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
-The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered.
-The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming we won't ever need to know how it used to be misspelled in the past.)
-The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
-The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:
- The Type 4 method is usually just referred to as using "history tables", where one table keeps the current data and an additional table is used to keep a record of some or all changes.
- The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.
2. How did you implement Type2 SCD in Datastage?
The following steps are required to implement SCD type2 in Datastage.
1) Need to take a snap shot of the WareHouse final target dimensional table and store in a DataSet or Temp Table or directly from the Data warehouse target table itself.
2) Retrieve the new records from the source (Source table/ Flat file/ view/ or any other source) and lookup the snap shot with the help of lookup or Join stage and based on Primary Key
3) Allow to pass both the values for these particular SCD columns (the columns that are affected with change) with a different column name (like SalesTerritory Source and SalesTerritoryLkp.
4) In the next step, in a transformer, compare these two values for every single primary key using stage variables. If these values are different then close the previous record by choosing SalesTerritoryLkp value (coming from target snap shot) for SalesTerritory column ,updating the CURRENT_RECORD='N' and setting the END_DATE as Current Time ( when the records are being processed) in first link from the transformer. Simultaneously with the new value (Sales_TeritorySource ) in the SCD columns insert One more records into the target table with START_DATE as current time, CURRENT_RECORD=‘Y’ and setting the END_DATE as Null.
5) To maintain the uniqueness of these two records generate a surrogate key and use this as the primary key in the target for dimensional table
SCD Type 1
Type 1 Slowly Changing Dimension data warehouse architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often with data which change over the time and it is caused by correcting data quality errors (misspells, data consolidations, trimming spaces, language specific characters).
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.
SCD 1 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 1 in Datastage. It is one of many possible designs which can implement this dimension. The example is based on the customers load into a data warehouse
Datastage SCD1 job design
The most important facts and stages of the CUST_SCD2 job processing:
• There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
• A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD1 Transformer mapping
http://etl-tools.info/images/scd1-transformer-map-new-data.jpg
• A T002 transformer updates old values with new ones without concerning about the overwritten data.
SCD1 Transformer update old entries
http://etl-tools.info/images/scd1-transformer-overwrite-data.jpg
• The database is updated in a target ODBC stage (with the 'update existing rows' update action)
SCD Type 2
Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact table usually stores dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of many possible designs which can implement this dimension.
For this example, we will use a table with customers data (it's name is D_CUSTOMER_SCD2) which has the following structure and data:
Datastage SCD2 job design
http://etl-tools.info/images/scd2-job-design.jpg
The most important facts and stages of the CUST_SCD2 job processing:
· The dimension table with customers is refreshed daily and one of the data sources is a text file. For the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:
SCD 2 - Customers file extract
· There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
· A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD 2 lookup transformer
http://etl-tools.info/images/scd2-transformer-lookup.jpg
· A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes through only records that differ.
SCD 2 check discrepancies transformer
http://etl-tools.info/images/scd2-transformer-find-discrepancies.jpg
· A T003 transformer handles the UPDATE and INSERT actions of a record. The old record is updated with current indictator flag set to no and the new record is inserted with current indictator flag set to yes, increased record version by 1 and the current date.
SCD 2 insert-update record transformer
http://etl-tools.info/images/scd2-transformer-insert-update-record.jpg
· ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed where part of an SQL statement.
· ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and the key column is CUST_ID.
SCD TYPE 3
In the Type 3 Slowly Changing Dimension only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex.
To implement SCD Type 3 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to update the old value with a new one and update the previous value field.
SCD TYPE 4
The Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.
To implement SCD Type 4 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to insert an old value into the destionation stage connected to the historical data table (D_CUSTOMER_HIST for example) and update the old value with a new one.
SCD 2 implementation in Datastage Parallel Jobs 7.5X2.
Change Capture Stage:
“It is processing stage, that it capture whether a record from table is copy or edited or insert or to delete by keeping the code column name”.
Simple example of change capture:
Text Box:
Change_capture
Properties of Change Capture:
Ø Change keys
o Key = EID (key column name)
§ Sort order = ascending order
Ø Change valves
o Values =? \\ ENAME
o Values =? \\ ADD
Ø Options
o Change mode = (explicit keys & values / explicit keys, values)
o Drop output for copy = (false/ true) “false – default ”
o Drop output for delete = (false/ true) “false – default”
o Drop output for edit = (false/ true) “false – default”
o Drop output for insert = (false/ true) “false – default”
§ Copy code = 0
§ Delete code = 2
§ Edit code = 3
§ Insert code = 1
§ Code column name =
o Log statistics = (false/ true) “false – default”
Change Apply Stage:
“It is processing stage, that it applies the changes of records of a table”.
Text Box:
Change Apply
Properties of Change Apply:
Ø Change keys
o Key = EID
§ Sort order = ascending order
Ø Options
o Change mode = explicit key & values
o Check value columns on delete = (false/ true) “true - default”
o Log statistics = false
o Code column name =
SCD II in version 7.5.x2
Design of that
ESDATE=current date ()
EEDATE= “9999-12-31”
Text Box:
Key=EID ACF= “Y”
Text Box: Text Box: -option: e k & v
Before.txt c=3
Text Box: Text Box: Text Box: c=all
after.txt
Text Box: Text Box: Text Box: Text Box: key= EID -option: e k & v
before.txt
ESDATE- current date ()
EEDATE- if c=3 then DFJD(JDFD(CD())-1)
else EEDATE = “9999-12-31” ACF- if(c=3) then “N” else “Y”
Example table of SCD data:
SID
CID
CNAME
ADD
AF
ESDATE
EEDATE
RV
UID
1
11
A
HYD
N
03-06-06
29-11-10
1
1
2
22
B
SEC
N
03-06-06
07-09-07
1
2
3
33
C
DEL
Y
03-06-06
9999-12-31
1
3
4
22
B
DEL
N
08-09-07
29-11-10
2
2
5
44
D
MCI
Y
08-09-07
9999-12-31
1
5
6
11
A
GDK
Y
30-11-10
9999-12-31
2
1
7
22
B
RAJ
Y
30-11-10
9999-12-31
3
2
8
55
E
CUL
Y
30-11-10
9999-12-31
1
8
Table: this table is describing the SCD six types and the description is shown above.
DAY 44
SCD I & SCD II (Design and Properties)
SCD – I: Type1 (Design and Properties):
Transfer job Load job
Text Box: Text Box: 10,20,30
Text Box: Text Box: OE_DIM before fact DS_FACT 10, 20, 40 10, 20, 40
Text Box:
DS_TRG_DIM OE_UPSERT
Text Box: Text Box: 10, 20, 40 After dim 10,20, 40 -update and insert
OE_SRC DS_TRG_DIM
In oracle we have to create table1 and table2,
Table1:
Ø Create table SRC(SNO number, SNAME varchar2(25));
o Insert into src values(111, ‘naveen’);
o Insert into src values(222, ‘munna’);
o Insert into src values(333, ‘kumar’);
Table2:
Ø Create table DIM(SKID number, SNO number, SNAME varchar2(25));
o No records to display;
Processes of transform job SCD1:
Step 1: Load plug-in Meta data from oracle of before and after data as shown in the above links that coming from different sources.
Step 2: “SCD1 properties”
fact
Fast path 1 of 5: select output link as:
Fast path 2 of 5: navigating the key column value between before and after tables
Fast path 3 of 5: selecting source type and source name.
Source type: source name:
NOTE: for every time of running the program we should empty the source name i.e., empty.txt, else surrogate key will continue with last stored value.
Fast path 4 of 5: select output in DIM.
For path 5 of 5: setting the output paths to FACT data set.
Step 3: In the Next job, i.e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are, -update n insert \\ if key column value is already.
-insert n update \\ if key column value is new.
Here SCD I result is for the below input
SCD – II: (Design and Properties):
Text Box: Transfer job Load job
Text Box:
Text Box: 10,20,30
before
Text Box: OE_DIM fact DS_FACT 10, 20, 20, 30, 40 10, 20, 20, 30, 40
Text Box:
DS_TRG_DIM OE_UPSERT
Text Box: Text Box: 10, 20, 40 After dim 10, 20, 20, 30, 40 -update and insert
OE_SRC DS_TRG_DIM
Step 1: in transformer stage:
Adding some columns to the to before table – to covert EEDATE and ESDATE columns into time stamp transformer stage to perform SCD II
In TX properties:
In SCD II properties:
fact
Fast path 1 of 5: select output link as:
Fast path 2 of 5: navigating the key column value between before and after tables
Fast path 3 of 5: selecting source type and source name.
Source type: source name:
NOTE: for every time of running the program we should empty the source name i.e., empty.txt, else surrogate key will continue with last stored value.
Fast path 4 of 5: select output in DIM.
For path 5 of 5: setting the output paths to FACT data set.
Step 3: In the Next job, i.e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are, -update n insert \\ if key column value is already.
-insert n update \\ if key column value is new.