Pages

Tuesday, July 10, 2018

How many ways we can implement scd in Datastage



1. What type of SCD you have used so far?

Slowly Changing Dimensions (SCD) is dimensions that have data that slowly changes.
For example, we may have a Dimension in our database that tracks the sales records of your company's salesmen and when sales person is transferred from one regional office to another.
Dealing with these issues involves SCD management methodologies referred to as Type 0, 1, 2, 3, 4, and 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
-The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered.
-The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming we won't ever need to know how it used to be misspelled in the past.)
-The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
-The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:
- The Type 4 method is usually just referred to as using "history tables", where one table keeps the current data and an additional table is used to keep a record of some or all changes.
- The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6).  It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.

2. How did you implement Type2 SCD in Datastage?

The following steps are required to implement SCD type2 in Datastage.
1) Need to take a snap shot of the WareHouse final target dimensional table and store in a DataSet or Temp Table or directly from the Data warehouse target table itself.
2) Retrieve the new records from the source (Source table/ Flat file/ view/ or any other source) and lookup the snap shot with the help of lookup or Join stage and based on Primary Key
3) Allow to pass both the values for these particular SCD columns (the columns that are affected with change) with a different column name (like SalesTerritory Source and SalesTerritoryLkp.
4) In the next step, in a transformer, compare these two values for every single primary key using stage variables. If these values are different then close the previous record by choosing SalesTerritoryLkp value (coming from target snap shot) for SalesTerritory column ,updating the CURRENT_RECORD='N'  and setting the END_DATE as Current Time ( when the records are being processed) in first link  from the transformer. Simultaneously with the new value (Sales_TeritorySource ) in the SCD columns insert One more records into the target table with START_DATE as current time, CURRENT_RECORD=‘Y’ and setting the END_DATE as Null.
5) To maintain the uniqueness of these two records generate a surrogate key and use this as the primary key in the target for dimensional table


SCD Type 1
Type 1 Slowly Changing Dimension data warehouse architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often with data which change over the time and it is caused by correcting data quality errors (misspells, data consolidations, trimming spaces, language specific characters).
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.
SCD 1 implementation in Datastage

The job described and depicted below shows how to implement SCD Type 1 in Datastage. It is one of many possible designs which can implement this dimension. The example is based on the customers load into a data warehouse


Datastage SCD1 job design




The most important facts and stages of the CUST_SCD2 job processing:
•           There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
•           A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD1 Transformer mapping

http://etl-tools.info/images/scd1-transformer-map-new-data.jpg
•           A T002 transformer updates old values with new ones without concerning about the overwritten data.
SCD1 Transformer update old entries
http://etl-tools.info/images/scd1-transformer-overwrite-data.jpg
•           The database is updated in a target ODBC stage (with the 'update existing rows' update action)


SCD Type 2

Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact table usually stores dimension key and version number.
SCD 2 implementation in Datastage

The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of many possible designs which can implement this dimension.
For this example, we will use a table with customers data (it's name is D_CUSTOMER_SCD2) which has the following structure and data:






Datastage SCD2 job design
http://etl-tools.info/images/scd2-job-design.jpg

The most important facts and stages of the CUST_SCD2 job processing:
·  The dimension table with customers is refreshed daily and one of the data sources is a text file. For the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:
SCD 2 - Customers file extract

·  There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
·  A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD 2 lookup transformer
http://etl-tools.info/images/scd2-transformer-lookup.jpg
·  A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes through only records that differ.
SCD 2 check discrepancies transformer
http://etl-tools.info/images/scd2-transformer-find-discrepancies.jpg
·  A T003 transformer handles the UPDATE and INSERT actions of a record. The old record is updated with current indictator flag set to no and the new record is inserted with current indictator flag set to yes, increased record version by 1 and the current date.
SCD 2 insert-update record transformer
http://etl-tools.info/images/scd2-transformer-insert-update-record.jpg
·  ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed where part of an SQL statement.
·  ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and the key column is CUST_ID.


SCD TYPE 3
In the Type 3 Slowly Changing Dimension only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex.

To implement SCD Type 3 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to update the old value with a new one and update the previous value field.
SCD TYPE 4
The Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.

To implement SCD Type 4 in Datastage use the same processing as in the SCD-2 example, only changing the destination stages to insert an old value into the destionation stage connected to the historical data table (D_CUSTOMER_HIST for example) and update the old value with a new one.
SCD 2 implementation in Datastage Parallel Jobs 7.5X2.

Change Capture Stage:
            “It is processing stage, that it capture whether a record from table is copy or edited or insert or to delete by keeping the code column name”.

Simple example of change capture:








Text Box: 

                                                            Change_capture                   
                                 

                                                                     
Properties of Change Capture:
Ø  Change keys
o   Key = EID (key column name)
§  Sort order = ascending order
Ø  Change valves
o   Values =? \\ ENAME
o   Values =? \\ ADD
Ø  Options
o   Change mode = (explicit keys & values / explicit keys, values)
o   Drop output for copy = (false/ true)  “false – default ”
o   Drop output for delete = (false/ true) “false – default”
o   Drop output for edit = (false/ true) “false – default”
o   Drop output for insert = (false/ true) “false – default”
§  Copy code = 0
§  Delete code = 2
§  Edit code = 3
§  Insert code = 1
§  Code column name =
o   Log statistics = (false/ true) “false – default”

Change Apply Stage:
            “It is processing stage, that it applies the changes of records of a table”.

Text Box: 

                                                                                 














                                                            Change Apply           
                                 
Properties of Change Apply:
Ø  Change keys
o   Key = EID
§  Sort order = ascending order
Ø  Options
o   Change mode = explicit key & values
o   Check value columns on delete = (false/ true) “true - default”
o   Log statistics = false
o   Code column name = \\ change capture and this has to be SAME for apply operations


SCD II in version 7.5.x2
Design of that
                                                                                                ESDATE=current date ()
                                                                                                EEDATE= “9999-12-31”
Text Box: 

                                    Key=EID                               ACF= “Y”
Text Box:  Text Box:                          -option: e k & v
Before.txt                                                                 c=3         










 
Text Box:  Text Box:  Text Box:                                                                            c=all
                                                   
after.txt
Text Box:  Text Box:  Text Box:  Text Box:                                                                          key= EID                                                                                                     -option: e k & v                                 
before.txt                                                           

                                                                                                                ESDATE- current date ()
EEDATE- if c=3 then DFJD(JDFD(CD())-1)
                                                                                                                                    else EEDATE = “9999-12-31”                                                                                                                      ACF- if(c=3) then “N” else “Y”     



Example table of SCD data:
SID
CID
CNAME
ADD
AF
ESDATE
EEDATE
RV
UID
1
11
A
HYD
N
03-06-06
29-11-10
1
1
2
22
B
SEC
N
03-06-06
07-09-07
1
2
3
33
C
DEL
Y
03-06-06
9999-12-31
1
3
4
22
B
DEL
N
08-09-07
29-11-10
2
2
5
44
D
MCI
Y
08-09-07
9999-12-31
1
5
6
11
A
GDK
Y
30-11-10
9999-12-31
2
1
7
22
B
RAJ
Y
30-11-10
9999-12-31
3
2
8
55
E
CUL
Y
30-11-10
9999-12-31
1
8

Table: this table is describing the SCD six types and the description is shown above.
DAY 44
SCD I & SCD II (Design and Properties)


SCD – I: Type1 (Design and Properties):
Transfer job                                                                             Load job
Text Box:  Text Box:  10,20,30










 
Text Box:  Text Box:  OE_DIM            before              fact           DS_FACT          10, 20, 40                                            10, 20, 40








Text Box: 

 
                                                                                    DS_TRG_DIM                                     OE_UPSERT
Text Box:  Text Box:  10, 20, 40           After                     dim   10,20, 40                                             -update and insert

OE_SRC                                                DS_TRG_DIM

In oracle we have to create table1 and table2,
Table1:
Ø  Create table SRC(SNO number, SNAME varchar2(25));
o   Insert into src values(111, ‘naveen’);
o   Insert into src values(222, ‘munna’);
o   Insert into src values(333, ‘kumar’);
Table2:
Ø  Create table DIM(SKID number, SNO number, SNAME varchar2(25));
o   No records to display;

Processes of transform job SCD1:

Step 1: Load plug-in Meta data from oracle of before and after data as shown in the above links that coming from different sources.

Step 2: “SCD1 properties”

fact

Fast path 1 of 5:         select output link as:

Fast path 2 of 5:         navigating the key column value between before and after tables



Fast path 3 of 5:         selecting source type and source name.






 
Source type:                                        source name:

NOTE: for every time of running the program we should empty the source name i.e., empty.txt, else surrogate key will continue with last stored value.

Fast path 4 of 5:         select output in DIM.




For path 5 of 5:           setting the output paths to FACT data set.

Step 3: In the Next job, i.e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are, -update n insert  \\ if key column value is already.
   -insert n update        \\ if key column value is new.

Here SCD I result is for the below input






SCD – II:  (Design and Properties):

Text Box:  Transfer job                                                                             Load job
Text Box: 

Text Box:  10,20,30
                 before
Text Box:  OE_DIM                                    fact           DS_FACT          10, 20, 20, 30, 40                         10, 20, 20, 30, 40








Text Box: 

 
                                                                                    DS_TRG_DIM                                     OE_UPSERT
Text Box:  Text Box:  10, 20, 40           After                     dim   10, 20, 20, 30, 40                         -update and insert

OE_SRC                                                DS_TRG_DIM


Step 1: in transformer stage:
Adding some columns to the to before table – to covert EEDATE and ESDATE columns into time stamp transformer stage to perform SCD II







In TX properties:




In SCD II properties:

fact

Fast path 1 of 5:         select output link as:


Fast path 2 of 5:         navigating the key column value between before and after tables


Fast path 3 of 5:         selecting source type and source name.






 
Source type:                                        source name:

NOTE: for every time of running the program we should empty the source name i.e., empty.txt, else surrogate key will continue with last stored value.



Fast path 4 of 5:         select output in DIM.


For path 5 of 5:           setting the output paths to FACT data set.


Step 3: In the Next job, i.e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are, -update n insert  \\ if key column value is already.
   -insert n update        \\ if key column value is new.

Sunday, July 8, 2018

List of Environment Variables in DataStage




General Job Administration

APT_CHECKPOINT_DIR
APT_CLOBBER_OUTPUT
APT_CONFIG_FILE
APT_DISABLE_COMBINATION
APT_EXECUTION_MODE
APT_ORCHHOME
APT_STARTUP_SCRIPT
APT_NO_STARTUP_SCRIPT
APT_STARTUP_STATUS
APT_THIN_SCORE


Job Monitoring

APT_MONITOR_SIZE
APT_MONITOR_TIME
APT_NO_JOBMON
APT_PERFORMANCE_DATA

Buffering

APT_BUFFER_FREE_RUN
APT_BUFFER_MAXIMUM_MEMORY
APT_BUFFER_MAXIMUM_TIMEOUT
APT_BUFFER_DISK_WRITE_INCREMENT
APT_BUFFERING_POLICY
APT_DISABLE_ROOT_FORKJOIN
APT_SHARED_MEMORY_BUFFERS

Compiler

APT_COMPILER
APT_COMPILEOPT
APT_LINKER
APT_LINKOPT

Debugging

APT_DEBUG_OPERATOR
APT_DEBUG_MODULE_NAMES
APT_DEBUG_PARTITION
APT_DEBUG_SIGNALS
APT_DEBUG_STEP
APT_DEBUG_SUBPROC
APT_EXECUTION_MODE
APT_PM_DBX
APT_PM_GDB
APT_PM_SHOW_PIDS
APT_PM_XLDB
APT_PM_XTERM
APT_PXDEBUGGER_FORCE_SEQUENTIAL
APT_SHOW_LIBLOAD

Partitioning

APT_NO_PART_INSERTION
APT_NO_PARTSORT_OPTIMIZATION
APT_PARTITION_COUNT
APT_PARTITION_NUMBER

Reading and Writing Files

APT_DELIMITED_READ_SIZE
APT_FILE_IMPORT_BUFFER_SIZE
APT_FILE_EXPORT_BUFFER_SIZE
APT_IMPORT_PATTERN_USES_FILESET
APT_MAX_DELIMITED_READ_SIZE
APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE
APT_STRING_PADCHAR

Reporting

APT_DUMP_SCORE
APT_ERROR_CONFIGURATION
APT_MSG_FILELINE
APT_PM_PLAYER_MEMORY
APT_PM_PLAYER_TIMING
APT_RECORD_COUNTS
OSH_DUMP
OSH_ECHO
OSH_EXPLAIN
OSH_PRINT_SCHEMAS

Sorting

APT_NO_SORT_INSERTION
APT_SORT_INSERTION_CHECK_ONLY

Transport Blocks

APT_LATENCY_COEFFICIENT
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
APT_MAX_TRANSPORT_BLOCK_SIZE/ APT_MIN_TRANSPORT_BLOCK_SIZE


Disk I/O

APT_BUFFER_DISK_WRITE_INCREMENT
APT_CONSISTENT_BUFFERIO_SIZE
APT_EXPORT_FLUSH_COUNT
APT_IO_MAP/APT_IO_NOMAP and APT_BUFFERIO_MAP/APT_BUFFERIO_NOMAP
APT_PHYSICAL_DATASET_BLOCK_SIZE

Decimal Support

APT_DECIMAL_INTERM_PRECISION
APT_DECIMAL_INTERM_SCALE
APT_DECIMAL_INTERM_ROUND_MODE

Building Custom Stages

DS_OPERATOR_BUILDOP_DIR
OSH_BUILDOP_CODE
OSH_BUILDOP_HEADER
OSH_BUILDOP_OBJECT
OSH_BUILDOP_XLC_BIN
OSH_CBUILDOP_XLC_BIN

DB2 Support

APT_DB2INSTANCE_HOME
APT_DB2READ_LOCK_TABLE
APT_DBNAME
APT_RDBMS_COMMIT_ROWS
DB2DBDFT

Look Up support

APT_LUTCREATE_NO_MMAP

Miscellaneous

APT_COPY_TRANSFORM_OPERATOR
APT_EBCDIC_VERSION
APT_DATE_CENTURY_BREAK_YEAR
APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL
APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
APT_INSERT_COPY_BEFORE_MODIFY
APT_ISVALID_BACKCOMPAT
APT_OLD_BOUNDED_LENGTH
APT_OPERATOR_REGISTRY_PATH
APT_PM_NO_SHARED_MEMORY
APT_PM_NO_NAMED_PIPES
APT_PM_SOFT_KILL_WAIT
APT_PM_STARTUP_CONCURRENCY
APT_RECORD_COUNTS
APT_SAVE_SCORE
APT_SHOW_COMPONENT_CALLS
APT_STACK_TRACE
APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING
APT_TRANSFORM_LOOP_WARNING_THRESHOLD
APT_WRITE_DS_VERSION
OSH_PRELOAD_LIBS

Network

APT_IO_MAXIMUM_OUTSTANDING
APT_IOMGR_CONNECT_ATTEMPTS
APT_PM_CONDUCTOR_HOSTNAME
APT_PM_NO_TCPIP
APT_PM_NODE_TIMEOUT
APT_PM_SHOWRSH
APT_PM_STARTUP_PORT
APT_PM_USE_RSH_LOCALLY
APT_RECVBUFSIZE
APT_USE_IPV4
NLS APT_COLLATION_SEQUENCE
APT_COLLATION_STRENGTH
APT_ENGLISH_MESSAGES
APT_IMPEXP_CHARSET
APT_INPUT_CHARSET
APT_OS_CHARSET
APT_OUTPUT_CHARSET
APT_STRING_CHARSET

Oracle Support

APT_ORACLE_LOAD_OPTIONS
APT_ORACLE_NO_OPS
APT_ORACLE_PRESERVE_BLANKS
APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
APT_ORA_WRITE_FILES
APT_ORAUPSERT_COMMIT_ROW_INTERVAL APT_ORAUPSERT_COMMIT_TIME_INTERVAL

SAS Support

APT_HASH_TO_SASHASH
APT_NO_SASOUT_INSERT
APT_NO_SAS_TRANSFORMS
APT_SAS_ACCEPT_ERROR
APT_SAS_CHARSET
APT_SAS_CHARSET_ABORT
APT_SAS_COMMAND
APT_SASINT_COMMAND
APT_SAS_DEBUG
APT_SAS_DEBUG_IO
APT_SAS_DEBUG_LEVEL
APT_SAS_DEBUG_VERBOSE
APT_SAS_NO_PSDS_USTRING
APT_SAS_S_ARGUMENT
APT_SAS_SCHEMASOURCE_DUMP
APT_SAS_SHOW_INFO
APT_SAS_TRUNCATION

Teradata Support

APT_TERA_64K_BUFFERS
APT_TERA_NO_ERR_CLEANUP
APT_TERA_NO_PERM_CHECKS
APT_TERA_NO_SQL_CONVERSION
APT_TERA_SYNC_DATABASE
APT_TERA_SYNC_USER

Slowly changing dimesion

Introduction
The Slowly Changing Dimension stage was added in the 8.0 release of InfoSphere Information Server DataStage. It is designed specifically to populate and maintain records in star schema data models, specifically dimension tables. The Slowly Changing Dimension stage encapsulates all of the dimension maintenance logic — finding existing records, generating surrogate keys, checking for changes, and what action to take when changes occur. In addition, it allows associating dimension record surrogate key values with source records, which eliminates the need for additional lookups in later processing.
Basic Concepts
Facts and Dimensions
Dimensional Modeling is a logical design technique used in data warehouses intended to support end-user queries oriented around understandability. A Star schema is a method of Dimensional Modeling in which the data that is being measured, called the facts, are stored in one table, called the Fact table and the business objects/entities that are involved in the events being measured are stored in tables called Dimension tables. The facts in the fact table are linked to the business objects in the associated dimension tables using foreign keys. A Star schema has one fact table and several dimension tables.
Because fact tables record the measurements generated from business events, they tend to grow rapidly. For example, Sales fact table records information about sales transactions, where every transaction results in insertion of a new row in the fact table. Dimension tables, on the other hand, tend to grow or change less frequently, like records in a Branch dimension are inserted when a new Branch is introduced, or if information about an existing Branch is changed; The ways these changes are made to the dimension table depend on the types of changes and what information must be preserved in the data warehouse.
Changes to attribute information are handled in one of two ways:
Overwrite/Correction of Errors — the existing row in the dimension table is updated to contain the new attribute values; the old values are no longer available. This is commonly referred to as a Type1 change. These changes usually relate to the corrections of certain types of data errors, such as the spelling of a name. This approach will be chosen if the primary purpose is correction of data or if there is no interest in keeping the history of previous values and no need to run prior reports.
Tracking History/Partitioning History — the existing row in the dimension table is modified to indicate that it is no longer current (that is, it has been expired), and a new row is inserted with the current attribute values. This is commonly referred to as a Type2 change. The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. For example, change in the marital status of the employee. Type 2 SCD perfectly partitions history because each detailed version of a dimensional identity is correctly connected to the span of fact table records for which that version is exactly correct.
Surrogate Keys
Surrogate Keys are values that are generated specifically for the purpose of uniquely identifying dimension table rows. The primary reasons you would use a surrogate key rather than the usual business key of the object in the dimension table are:
· When tracking history in the dimension table, there will be multiple rows in the dimension table for the same business key. Therefore, it is not possible to use the business key as the primary key.
· Typical fields that are used as business keys generally don’t change, but situations can arise where they do change. For example, US citizens can be assigned a new social security number, or account numbers may be reassigned after a merger.
Surrogate keys provide a way for the dimension table to have a reliable, unique, and never-changing primary key.
Implementation
SCD stages support both SCD Type 1 and SCD Type 2 processing. Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table.
SCD_Schema_01
Scenario schema
In this document updation of one Dimension table will be illustrated using an SCD stage. Branch dimension table will be updated using a source file. The source file contains Student Details records. Let us have a look at the data in these tables.
Source: The source data file is extracted from file named StudentDetail.dat using sequential file stage and it contains five records that, when processed, apply changes to the dimension table. Table 1 shows the contents of the file.
Table 1: Source data
CollegeIdCollegeNamePrincipalBrCodeBrNameDescrMaleStudFemaleStud
AAAAAA COLLEGE OF ENGGJefferson111ECEElectronics and Communications Subjects are dealt340157
BBBBBB COLLEGE OF ENGGAdams222CSEComputer Science and Information Technology subjects are dealt400250
CCCCCC COLLEGE OF ENGGMitchell333EEEElectrical and Electronics Engineering Subjects are dealt150120
DDDDDD COLLEGE OF ENGGMadison444MEMechanical Engineering subjects are dealt250175
EEEEEE COLLEGE OF ENGGMonroe555CECivil Engineering Subjects are dealt420350
Branch dimensionThe Branch dimension is a table in the target database. Initially this table contains records for three Branches. When the source data is processed, the table is updated to contain new Branch records, and to track the history of changed Branch information.
Table 2: Initial Branch dimension data
BranchSKBrCodeBrNameDescrCurrEffDateExpDate
1111ECEElectronics and Communications Subjects are dealtY01-03-2004
2222CSEElectrical and Electronics Engineering Subjects are dealtY21-02-2002
10333EEEEElectronics and Communications Subjects are dealtY01-07-2008
Roadmap:
SCD_RoadMap_03
This example uses surrogate key generators that use state files to record the key values that have been used. This ensures that unique values are always generated. A job “SurrogateKey_File.dsx” is developed to initialize the state files. This job reads the Branch dimension table then creates and updates the respective surrogate key generator state files.Initializing the surrogate keys:
Figure 2: Job to update SK values
SCD_BranchDim_04
Building the Slowly Changing Dimensions job:
In this step we build a job that reads the StudentDetail.dat source file, updates the Branch dimension, and inserts records into a dataset which in turn is used to load Fact table.
Figure 3: Job Design
SCD_JobDesign_05
The primary flow of records is from left to right in the job design. The source records are read from StudentDetail, passed to the SCD stage to process the Branch dimension and then passed to a dataset which is later processed to load the fact table. As part of the processing in the SCD stage, the surrogate key values that are associated with the source records are obtained from the dimension table and added to the data being passed to the fact table.
The Branch dimension table acts as a reference source to the SCD stage. This table is used to initialize the lookup cache. Only records that are considered current are stored in the lookup cache. Any historical records in the dimension tables are automatically filtered out during initial processing. The SCD stage uses the data values from the primary input link to lookup into the cache and check for changes. If any changes are required to the dimension table, they are written to the secondary output link of the SCD stage, which is called the dimension update link. Target database stages are connected to the dimension update link to apply the changes to the actual dimension table in the database.
Each record on the primary input link of the SCD stage will go out on the primary output link, and may produce zero, one, or two records on the dimension update link. The number of records produced depends on what, if any, action needs to be taken on the dimension table.
  • No Action: Unchanged records require no action to the dimension table, so no records are written on the dimension update link.
  • Type 1 Change: New records and overwriting updates (Type1) require a one row change to the dimension table. The change is either an insert or an update. One record is written on the dimension update link to reflect these types of changes.
  • Type 2 Change: Changed records that are tracking history (Type2) require a two row change to the dimension table. The existing record must be updated to reflect that it is no longer current, and a new record must be inserted for the new set of values. Two records are written to the dimension update link to reflect these changes.
Configure the primary source stage
The source stage must be configured to read the StudentDetail.dat file. Complete the following steps to configure the StudentDetail sequential file stage:
  1. On the Output|Properties tab, set the File property to C:SCDStudentDetail.dat.
  2. On the Output|Format tab, add the Record delimiter string property and set it to DOS Format.
  3. On the Output|Format tab, remove the Final delimiter property.
  4. Load the DataStageSlowly Changing DimensionsTableDefsStudentDetail table definition onto the output link.
  Figure 4: Source stage
SCD_SrcStage_06

Configure the stages to process the Branch dimension
Three stages are used to process the Branch dimension. Reading the job design:
Ø The first stage specifies how to read the data from the dimension table.
Ø The SCD stage determines what changes need to be made to the dimension table and those changes are written to the dimension update link.
Ø The dimension update link is connected to the dimension update target stage, which specifies how to update the actual database table with the data produced by the SCD stage.
Configure the Branch dimension source stage
Complete the following steps to configure the Branch dimension Oracle Enterprise stage:
Ø On the Output|Properties tab, set the Read method property to Table.
Ø On the Output|Properties tab, set the Table property to BranchDim.
Ø On the Output|Properties tab, set the DB Options (specify user and password).
Ø On the Output|Properties tab, set the Remote Server property from Connection.
Ø Load the DataStageSlowly Changing DimensionsTableDefsBranchDim table definition onto the output link.
Figure 5: Branch dimension source
SCD_BranchDIm_07
Configure the Branch dimension SCD stage
The SCD stage has two input links and two output links. This results in a high number of property link-tab combinations. The Fast Path control of the SCD stage editor lets you navigate directly to the tabs that require input in order to complete the stage configuration.

Fast Path control

Open the Branch dimension SCD stage editor and use the Fast Path control to set the properties as shown:
  • Fast Path page 1: Setting the output link
Use the drop down list to select the output link that is leading to the next SCD stage. This is the primary output of the stage. By default, the first output link connected to the stage is used as the primary output link. The other link automatically becomes the dimension update link.
Figure 6: Branch dimension SCD stage, Fast Path page 1
SCD_FastPath1_08
  • Fast Path page 2: Define the lookup condition and purpose codes
SCD Stage provides 9 purpose codes to track changes in a dimension. This information is used in a number of ways in the SCD processing. The first task on this page is to define what the various columns of the dimension table are used for. The choices for purpose codes are:
Surrogate Key — this column is the primary key of the dimension table and is populated with a surrogate key value.
Business Key — this column is the natural key of the dimension table not necessarily unique. This is used for lookup purpose, to find the dimension table row that corresponds to a source data row.
Type 2 — check this column for a change in value. If the value has changed, perform a history tracking change to the dimension table.
Type 1 — check this column for a change in value. If the value has changed, perform an overwriting change to the dimension table.
Current Indicator — this column is used as a flag to indicate whether it is the most current record for a particular business key.
Effective Date — this column is used to specify when a record first became the most current record, that is, when it became the active record.
Expiration Date — this column is used to specify the ending date of when a record was the active record. For currently active records, this value is typically a future date or NULL.
SK Chain — this column is used to store the surrogate key of the previous or next record in the history for a particular business key.
(blank) — This column is not used for anything with respect to SCD processing. Data for this field is inserted into the table when a new row is inserted, but this column will not be checked for changes against the source data.
Set purpose codes for the columns as shown below in Figure 7. Because this dimension table is tracking history, it contains columns to track whether a row is current and the date range for when it was current.
Click on the BrCode source field and drag it to the BrCode dimension column to create the lookup condition.
Figure 7: Branch dimension SCD stage, Fast Path page 2
SCD_FastPath2_09
This tab is used to define the lookup keys from the source record to the dimension record. Any source column can be associated with any one dimension column. This creates an equality lookup condition between those columns. If more than one source column is associated with a dimension column, then those equality conditions are AND’ed together. In this manner, multi-column lookup keys can be used.
  • Fast Path page 3: Configuring the surrogate key generator
This tab specifies how surrogate keys are generated for this stage. Surrogate key generation can use DataStage’s file based surrogate generation or database sequence object based generation. Here we use file based method.
Set the Source name property C:SCDSKGBranchDim as shown in Figure 8. This is the surrogate key state file that is created by running the ‘SurrogateKey_File’ job. Leave the defaults for the other properties unchanged.
Figure 8: Branch dimension SCD stage, Fast Path page 3
SCD_FastPath3_10
  • Fast Path page 4: Defining the slowly changing dimension behavior and derivations
The DimUpdate tab is used to define several critical elements of SCD processing. The Derivation column is used to specify how to map elements of a source row to elements of the dimension table. The Expire column is used to specify what values need to change if an existing record needs to be expired. Expire expressions are only enabled when there are Type2 columns specified, and are only available for Current Indicator and Expiration Date columns.
If no matching record is found when the lookup is performed, the derivation expressions are applied and a record is written on the dimension update link to indicate a new record needs to be added to the dimension table. If a matching record is found, the derivation expressions are applied to the source columns, and then the results are compared to the corresponding columns of the dimension table. Columns specified as Type2 are compared first. If there is a change, two records are written on the dimension update link. The first record is an update record, to expire the matched row. The Expire expressions are used to calculate the values for the update row. The second record is a new record that contains all of the new values for all columns. If no Type2 columns have changed, the Type1 columns are compared. If there are any changes, one record is written on the dimension update link that indicates an update to the dimension table. The derivation expressions are used to calculate the values for the update record.
Set the Derivation expressions and the Expire expressions as shown below in Figure.
Figure 9: Branch dimension SCD stage, Fast Path page 4
SCD_FastPath4_11
Note that you are specifying these properties on the dimension update link. The output columns for this link were automatically propagated with their purpose codes from the dimension input link. The SCD stage only does this when the set of columns on the dimension update link is empty. It is possible to load a set of columns directly on the dimension update link; however, they must exactly match those specified on the dimension input link.
  • Fast Path page 5: Selecting the columns for Output Link
The Output Map tab is used to define what columns will leave this stage on the primary output link. This tab operates much like the Mapping tab of other stages. The only difference is that you can select columns from the primary input link and columns from the reference link to output. The columns coming from the primary source have the same values they entered the stage with. The columns coming from the reference link represent the values from the dimension table that correspond to the source row. Note that because the SCD processing has been done by the stage, every record from the primary source data will have a corresponding record in the dimension.
Figure 10: Branch dimension SCD stage, Fast Path page 5
SCD_FastPath5_12
Select the columns for output as shown below in Figure 10. The output link is initially empty. Create and map the output columns by dragging and dropping from the source to the target. Because the Branch dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the primary key associated with the source row is appended because that is the value that is required to be inserted into the fact table.
The stage is now configured to perform the dimension maintenance on the Branch dimension table.
Configure the Branch dimension target stage
This stage processes the dimension update link records produced by the Branch dimension SCD stage to update the actual dimension table in the database. Because incoming records represent both inserts and updates to the table, a Upsert write method must be used. Auto-generated update and insert statements take the purpose codes specified in the SCD stage into account to generate the correct update statement for this usage.
Complete the following steps to configure the Branch dimension update DB2 Enterprise stage:
Ø On the Input|Properties tab, set the Write Method property to Upsert.
Ø On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert.
Ø On the Input|Properties tab, set the Table property to BranchDim.
Ø On the Input|Properties tab, set the set the DB Options (specify user and password). On the Input|Properties tab, set the Remote Server property from Connection.
Figure 11: Branch dimension target
SCD_BranchDim_13
The stage is now configured to write to the BranchDim dimension table.
Final steps
Click the Compile button to start the compile. The SCD stage processing makes use of the transform operator. Run the job by clicking the Run button in the DataStage Designer.
Summary of changes to database tables
The Branch dimension has two update records, and four new records. Two of the new records are new objects to the dimension table, and two existing records had Type2 changes, resulting in the two updates and two of the new records.
ChangeBranchSKBrCodeBrNameDescrCurrEffDateExpDate
No Change1111ECEElectronics and Communications Subjects are dealtY1/3/200431/12/3500
Expired (Type2)2222CSEElectrical and Electronics Engineering Subjects are dealtN21/2/2002{Today’s Date}
Expired (Type2)10333EEEEElectronics and Communications Subjects are dealtN1/7/2008{Today’s Date}
New Record3444MEMechanical Engineering subjects are dealtY{Today’s Date}31/12/3500
New Record4555CECivil Engineering Subjects are dealtY{Today’s Date}31/12/3500
New Record (Type2)5222CSEComputer Science and Information Technology subjects are dealtY{Today’s Date}31/12/3500
New Record (Type2)6333EEEElectrical and Electronics Engineering Subjects are dealtY{Today’s Date}31/12/3500
Conclusion
Thus the Slowly Changing Dimension stage can be used:
a) To greatly reduce the time spent on creating jobs for processing star schemas.
b) To process history-tracking changes and in-place changes to dimension tables.
c) To reduce fact table processing by augmenting the source data with associated dimension table surrogate keys that eliminate the need for an additional lookup.