Monday, October 6, 2014

How to use the Slowly Changing Dimension stage

The Slowly Changing Dimension (SCD) stage was added in release 8.0 of IBM InfoSphere Information Server DataStage. It is designed specifically to support the activities required to populate and maintain records in star schema data models, in particular dimension table data.
The Slowly Changing Dimension stage encapsulates all of the dimension maintenance logic:

  •  Finding existing records
  •  Generating surrogate keys
  •  Checking for changes
  •  Determining what action to take when changes occur

In addition, you can associate dimension record surrogate key values with source records, which eliminates the need for additional lookups in later processing.
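The maintenance flow the stage automates can be sketched in plain Python. This is a simplified illustration of the logic, not DataStage internals; all names (process_source_row, dim_lookup, next_sk, and so on) are hypothetical:

```python
# Simplified sketch of what an SCD stage does for each source row.
# dim_lookup: existing dimension rows keyed by business key
# next_sk:    surrogate key generator
# type2_cols / type1_cols: columns marked with those purpose codes

def process_source_row(row, dim_lookup, next_sk, type2_cols, type1_cols):
    """Return the actions to apply to the dimension table for one source row."""
    existing = dim_lookup.get(row["business_key"])  # find existing record
    if existing is None:
        # New business key: generate a surrogate key and insert a new row
        return [("insert", {**row, "sk": next_sk()})]
    if any(row[c] != existing[c] for c in type2_cols):
        # History-tracking (Type 2) change: expire the old row, insert a new one
        return [("expire", existing["sk"]),
                ("insert", {**row, "sk": next_sk()})]
    if any(row[c] != existing[c] for c in type1_cols):
        # Overwriting (Type 1) change: update the existing row in place
        return [("update", existing["sk"], row)]
    return []  # no change detected
```

The stage performs all of these steps internally; the sketch only shows the decision order, which matters because a Type 2 change takes precedence over a Type 1 change when both kinds of columns differ.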


Define the lookup condition and purpose codes
The first task on this page is to define what the various columns of the dimension table are used for. This information is used in a number of ways in the SCD processing. The choices for purpose codes are:
  • Surrogate Key— This column is the primary key of the dimension table and is populated with a surrogate key value.
  • Business Key— This column is the identifier of the business objects that the dimension table is representing, but is not the primary key of the dimension table. This column is typically used as a lookup column and corresponds to a key or some other field of the source data that identifies the associated business object. The lookup is used to find the dimension table row that corresponds to a source data row.
  • Type 2— Check this column for a change in value. If the value has changed, perform a history tracking change to the dimension table.
  • Type 1— Check this column for a change in value. If the value has changed, perform an overwriting change to the dimension table.
  • Current Indicator— This column is used as a flag to indicate whether it is the most current record for a particular business key.
  • Effective Date— This column is used to specify when a record first became the most current record, that is, when it became the active record.
  • Expiration Date— This column is used to specify the ending date of when a record was the active record. For currently active records, this value is typically a future date or NULL.
  • SK Chain— This column is used to store the surrogate key of the previous or next record in the history for a particular business key.
  • (blank) — This column is not used for anything with respect to SCD processing. Data for this field is inserted into the table when a new row is inserted, but this column will not be checked for changes against the source data.
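To make the Current Indicator, Effective Date, and Expiration Date purpose codes concrete, here is a hedged sketch of what happens to those columns on a Type 2 change. The column names (CURRENT_IND, EFFECTIVE_DATE, EXPIRATION_DATE, SK) and the helper function are made up for illustration, not generated DataStage code:

```python
from datetime import date

def apply_type2_change(old_row, new_values, new_sk, change_date):
    """Expire the current dimension row and build its replacement."""
    expired = {**old_row,
               "CURRENT_IND": 0,                 # no longer the active record
               "EXPIRATION_DATE": change_date}   # closed out on the change date
    new_row = {**old_row, **new_values,
               "SK": new_sk,
               "CURRENT_IND": 1,                 # the new active record
               "EFFECTIVE_DATE": change_date,
               "EXPIRATION_DATE": None}          # open-ended: NULL (or a far-future date)
    return expired, new_row

old = {"SK": 1, "SKU": "A100", "PRICE": 9.99,
       "CURRENT_IND": 1, "EFFECTIVE_DATE": date(2014, 1, 1),
       "EXPIRATION_DATE": None}
expired, current = apply_type2_change(old, {"PRICE": 12.49}, 2, date(2014, 10, 6))
```

After the change, the old row remains in the table for history, but the flag and date columns mark exactly when it stopped being the active record.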
Set purpose codes for the columns as shown below in Figure 7. Because this dimension table is tracking history, it contains columns to track whether a row is current and the date range for when it was current.
Click on the ProdSKU source field and drag it to the SKU dimension column to create the lookup condition.

Product Dimension SCD Stage FP2

Although this tab looks similar to a mapping tab, it is actually defining the lookup keys from the source record to the dimension record. Any source column can be associated with any one dimension column; this creates an equality lookup condition between those columns. If more than one source column is associated with a dimension column, those equality conditions are AND'ed together. In this manner, multi-column lookup keys can be used.
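The AND'ed multi-column lookup described above amounts to the following check, sketched here in Python with hypothetical column names (ProdSKU, Region, SKU, REGION_CODE):

```python
def matches(source_row, dim_row, key_pairs):
    """Multi-column lookup: each (source_col, dim_col) pair is an equality
    condition, and all conditions are AND'ed together."""
    return all(source_row[s] == dim_row[d] for s, d in key_pairs)

src = {"ProdSKU": "A100", "Region": "EU"}
dim = {"SKU": "A100", "REGION_CODE": "EU", "PRICE": 9.99}

# Two source columns mapped to two dimension columns: both must match
assert matches(src, dim, [("ProdSKU", "SKU"), ("Region", "REGION_CODE")])
```

A row fails the lookup if any single pair differs, which is why adding more key pairs narrows, never widens, the match.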

Sunday, September 14, 2014

Star schema vs. snowflake schema: Which is better?

What are the key differences in snowflake and star schema? Where should they be applied?
The star schema vs snowflake schema comparison brings four fundamental differences to the fore:
1. Data optimization: 
The snowflake model uses normalized data: the data is organized inside the database to eliminate redundancy, which reduces the amount of data stored. The business hierarchy and its dimensions are preserved in the data model through referential integrity.
Figure 1 – Snowflake model
The star model, on the other hand, uses denormalized data. In the star model, dimensions refer directly to the fact table, and the business hierarchy is not implemented via referential integrity between dimensions.
Figure 2 – Star model
2. Business model:
A primary key is a single unique key (data attribute) selected to identify a particular entity. In the previous ‘advertiser’ example, Advertiser_ID would be the primary key (business key) of a dimension table. A foreign key (referential attribute) is a field in one table that matches the primary key of another table. In our example, Advertiser_ID could be a foreign key in Account_dimension.
In the snowflake model, the business hierarchy of the data model is represented by primary key – foreign key relationships between the various dimension tables.
In the star model, the fact table holds foreign keys to all of the required dimension tables.
3. Performance:
The third differentiator in this star schema vs snowflake schema face-off is performance. The snowflake model requires more joins: dimension tables must be joined to each other before they are joined to the fact table, so query performance is slower. For instance, to get advertiser details such as the advertiser name, ID, and address, the advertiser and account tables need to be joined with each other and then with the fact table.
The star model, on the other hand, has fewer joins between the dimension tables and the fact table. In this model, if you need information on the advertiser, you just have to join the Advertiser dimension table with the fact table.
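The difference in join depth can be illustrated with a toy example. The tables below are hypothetical Python dicts standing in for database tables; the point is only that the snowflake path needs two lookups (fact → account → advertiser) where the star path needs one:

```python
# Snowflake layout: the advertiser attributes live in their own table,
# reachable from the fact row only through the account dimension.
fact = [{"account_id": 10, "revenue": 500}]
account_dim = {10: {"account_name": "Acme", "advertiser_id": 7}}
advertiser_dim = {7: {"advertiser_name": "AdCo", "address": "1 Main St"}}

def advertiser_for_fact_snowflake(f):
    acct = account_dim[f["account_id"]]            # join 1: fact -> account
    return advertiser_dim[acct["advertiser_id"]]["advertiser_name"]  # join 2

# Star layout: the advertiser dimension is denormalized, so the fact row
# reaches every attribute in a single join.
advertiser_dim_star = {7: {"advertiser_name": "AdCo", "account_name": "Acme"}}
fact_star = [{"advertiser_id": 7, "revenue": 500}]

def advertiser_for_fact_star(f):
    return advertiser_dim_star[f["advertiser_id"]]["advertiser_name"]  # join 1
```

Both paths return the same answer; the snowflake simply pays one extra lookup per level of the hierarchy, which is the cost the performance comparison above describes.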

4. ETL:
In the snowflake model, the ETL job is more complex in design and cannot be fully parallelized, because the dependencies between dimension tables restrict the load order.
The star model loads each dimension table without dependencies between dimensions, so the ETL job is simpler in design and can achieve higher parallelism.
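The load-ordering point can be sketched as a tiny dependency scheduler. The table names and dependency maps below are hypothetical; the sketch groups tables into "waves", where every table in a wave can be loaded in parallel:

```python
def load_waves(deps):
    """Group tables into waves; tables within a wave can load in parallel.
    deps maps each table to the tables that must be loaded before it."""
    waves, done = [], set()
    while len(done) < len(deps):
        wave = sorted(t for t, d in deps.items()
                      if t not in done and all(x in done for x in d))
        waves.append(wave)
        done.update(wave)
    return waves

# Snowflake: Account depends on Advertiser, so dimensions load serially.
snowflake_deps = {"advertiser_dim": [],
                  "account_dim": ["advertiser_dim"],
                  "fact_table": ["account_dim", "advertiser_dim"]}

# Star: dimensions are independent, so they all load in one parallel wave.
star_deps = {"advertiser_dim": [],
             "customer_dim": [],
             "fact_table": ["advertiser_dim", "customer_dim"]}
```

Running `load_waves` on each map shows the snowflake needing three serial waves while the star needs only two, with both dimensions in the first wave.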
This brings us to the end of the Star schema vs Snowflake schema debate. But where exactly do these approaches make sense?

Where do the two methods fit in?
With the snowflake model, dimension analysis is easier. For example, ‘how many accounts or campaigns are online for a given Advertiser?’

The star schema model is useful for metrics analysis, such as: ‘What is the revenue for a given customer?’

Monday, September 8, 2014

How to use the Aggregate stage to count the number of records

Count of records:

Below is the job design:

Create a dummy column with 1 as its constant value.

In Aggregate Stage:
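The dummy-column trick can be sketched outside DataStage too: add a constant 1 to every record, then sum that column per group, which equals the record count per group. The column names (dept, dummy) are made up for illustration:

```python
from collections import defaultdict

rows = [{"dept": "A"}, {"dept": "A"}, {"dept": "B"}]

# Transformer step: add a dummy column with constant value 1 to every record
for r in rows:
    r["dummy"] = 1

# Aggregator step: group by the key and sum the dummy column = record count
counts = defaultdict(int)
for r in rows:
    counts[r["dept"]] += r["dummy"]
```

Summing a constant 1 and counting rows are equivalent, which is why the dummy column works even when the Aggregate stage is configured for a sum rather than a count.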