
Friday, October 31, 2014

Steps to be followed for implementing SCD II

• Read the incoming records through any input stage, such as a Sequential File, Data Set, or database table.
• Do the required processing on the incoming data.
• After the above processing step, pass the data into the Change Capture stage.
• The Change Capture stage should have two input links: one is the before dataset and the other is the after dataset. For our job, the before dataset is the set of active records present in the target table. The active records are all those records that have EXPR_DT = '2999-12-31'. The after dataset is the incoming data passed into Change Capture after all the necessary processing.
• The Change Capture stage compares the before dataset with the after dataset and produces a change_code for each record. The 4 change codes are as follows:
"0" – Copy code (the after record is a copy of the before record)
"1" – Insert code (a new record exists in the after set that did not exist in the before set)
"2" – Delete code (a record in the before set has been deleted from the after set)
"3" – Edit code (the after record is an edited version of the before record)
The copy records are not passed on from the Change Capture stage, since we need only the edit and insert records for the SCD II implementation.
• Use a Filter stage to separate the records that need to be expired from those that need to be inserted.
• Filter the records with change_code = 1 or 3 onto the insert link, and the records with change_code = 3 onto the update/expiry link.
• The records with change_code = 3 are edited records, so the original records corresponding to them must be made inactive (expired). A record is made inactive by setting its EXPR_DT to a valid date other than '2999-12-31'; for example, you can set EXPR_DT to the day before the date on which you are loading the data. We will assume that we are loading the data on 2008-08-15, so the EXPR_DT for the expired records becomes '2008-08-14', and 2008-08-15 becomes the EFCT_DT for the records to be inserted.
• To get the original records that need to be expired, look up the target table for all the records with change_code = 3 that were filtered out separately. Retrieve the original record along with its EFCT_DT, then update that record's EXPR_DT to '2008-08-14' in the table. Now the original records are inactive (expired).
• The edited records (change_code = 3) need to be inserted into the table along with the new records (change_code = 1). This data comes from the Filter stage and is inserted into the table with EFCT_DT = the load date, i.e. '2008-08-15', and EXPR_DT = '2999-12-31'. A sketch of this logic appears after this list.
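The steps above boil down to a simple merge routine. Below is a minimal Python sketch of the same logic, assuming a single natural-key column and the EFCT_DT/EXPR_DT conventions used above; the function names, the KEY column, and the dict-based row layout are illustrative, not part of the DataStage job itself.

from datetime import date, timedelta

HIGH_DATE = "2999-12-31"          # "active record" end date from the steps above

def change_code(before_row, after_row):
    """Mimic the Change Capture codes: 0=copy, 1=insert, 2=delete, 3=edit."""
    if before_row is None:
        return 1                   # insert: key only in the after set
    if after_row is None:
        return 2                   # delete: key only in the before set
    return 0 if before_row == after_row else 3

def apply_scd2(active, incoming, load_dt):
    """active / incoming are dicts keyed by the natural key (an assumption);
    values are dicts of the non-key, non-date columns."""
    expire_dt = (load_dt - timedelta(days=1)).isoformat()
    expired, inserted = [], []
    for key, after_row in incoming.items():
        code = change_code(active.get(key), after_row)
        if code == 3:                                  # expire the old version
            expired.append({**active[key], "KEY": key, "EXPR_DT": expire_dt})
        if code in (1, 3):                             # insert the new version
            inserted.append({**after_row, "KEY": key,
                             "EFCT_DT": load_dt.isoformat(),
                             "EXPR_DT": HIGH_DATE})
    return expired, inserted

# Example: supplier 'ABC' changes state on the 2008-08-15 load
expired, inserted = apply_scd2(
    active={"ABC": {"STATE": "CA"}},
    incoming={"ABC": {"STATE": "IL"}, "XYZ": {"STATE": "TX"}},
    load_dt=date(2008, 8, 15))
print(expired)    # old ABC row expired with EXPR_DT = 2008-08-14
print(inserted)   # new ABC version and brand-new XYZ row with EXPR_DT = 2999-12-31

With the 2008-08-15 load date from the example, the edited row is expired with EXPR_DT = '2008-08-14' and a fresh row is inserted alongside any brand-new keys, which is exactly what the filter/expiry and insert links do in the job.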





 ---------------------------------

Datastage Implementations – Slowly Changing Dimensions


Basics of SCD

Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.
Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.
Here is an example of a database table that keeps supplier information:
-------------------------------------------------------------------
Supplier_Key    Supplier_Code    Supplier_Name       Supplier_State
123             ABC              Acme Supply Co      CA
-------------------------------------------------------------------
In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
----------------------------------------------------------------
Supplier_Key    Supplier_Code    Supplier_Name       Supplier_State
123             ABC              Acme Supply Co      IL
----------------------------------------------------------------
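As a throwaway illustration of the Type 1 behaviour, here is a minimal Python sketch, assuming the dimension is held in memory as a dict keyed by the natural key (the function and variable names are made up for the example):

# Type 1: overwrite in place, so the previous state (CA) is lost.
dimension = {"ABC": {"Supplier_Key": 123,
                     "Supplier_Name": "Acme Supply Co",
                     "Supplier_State": "CA"}}

def scd_type1_update(dim, supplier_code, **changed_columns):
    """Overwrite the existing row for the natural key; no history is kept."""
    dim[supplier_code].update(changed_columns)

scd_type1_update(dimension, "ABC", Supplier_State="IL")
print(dimension["ABC"]["Supplier_State"])   # IL -- CA is gone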

Type 2
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:
-----------------------------------------------------------------
Supplier_Key    Supplier_Code    Supplier_Name       Supplier_State    Version
123             ABC              Acme Supply Co      CA                0
124             ABC              Acme Supply Co      IL                1
-----------------------------------------------------------------
Another popular method for tuple versioning is to add effective date columns.
-----------------------------------------------------------------------------------
Supplier_Key    Supplier_Code    Supplier_Name       Supplier_State    Start_Date     End_Date
123             ABC              Acme Supply Co      CA                01-Jan-2000    21-Dec-2004
124             ABC              Acme Supply Co      IL                22-Dec-2004
-----------------------------------------------------------------------------------
The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
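As a small aside, here is a plain Python sketch of why the surrogate high date makes "current row" queries simpler; the row layout mirrors the table above, and the ISO date format is an assumption made for the example.

HIGH_DATE = "9999-12-31"

rows = [
    {"Supplier_Key": 123, "Supplier_Code": "ABC", "Supplier_State": "CA",
     "Start_Date": "2000-01-01", "End_Date": "2004-12-21"},
    {"Supplier_Key": 124, "Supplier_Code": "ABC", "Supplier_State": "IL",
     "Start_Date": "2004-12-22", "End_Date": HIGH_DATE},
]

# With a surrogate high date, "current row" is a plain equality test,
# with no special handling for NULL/None end dates.
current = [r for r in rows if r["End_Date"] == HIGH_DATE]
print(current[0]["Supplier_State"])   # IL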

How to Implement SCD using the DataStage 8.1 SCD stage?

Step 1: Create a Datastage job with the below structure:
  1. Source file that comes from the OLTP sources
  2. Old dimension reference table link
  3. The SCD stage
  4. Target fact table
  5. Dimension update/insert link

    Figure 1
Step 2: To set up the SCD properties, open the SCD stage and access the Fast Path.

Figure 2
Step 3: Tab 2 of the SCD stage is used to specify the purpose of each of the keys pulled from the referenced dimension table.

Figure 3
Step 4: Tab 3 is used to provide the sequence generator file/table name that is used to generate the new surrogate keys for the new or latest dimension records. These keys also get passed to the fact table for direct load. A rough sketch of a file-based key generator appears after Figure 4.

Figure 4
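For intuition only, here is a minimal Python sketch of what a file-based surrogate key source does, assuming a plain text state file that holds the last key issued; the file name and layout are illustrative, and the SCD stage manages its own state format.

from pathlib import Path

def next_surrogate_keys(state_file, how_many):
    """Read the last key used from a state file, hand out the next block,
    and write the new high-water mark back. Illustrative only."""
    path = Path(state_file)
    last = int(path.read_text().strip()) if path.exists() else 0
    keys = list(range(last + 1, last + 1 + how_many))
    path.write_text(str(keys[-1]))
    return keys

print(next_surrogate_keys("dim_supplier.seq", 3))   # e.g. [1, 2, 3] on the first run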
Step 5: Tab 4 is used to set the properties that configure the data population logic for the new and old dimension rows. The activities that we can configure in this tab are:
  1. Generating the new surrogate key values to be passed to the dimension and fact tables
  2. Mapping the source columns to the dimension columns
  3. Setting the expired values for the old rows
  4. Defining the values that mark the current active rows among multiple rows per key

Figure 5
Step 6: Set the derivation logic for the fact table as part of the last tab.

Figure 6
Step 7: Complete the remaining setup and run the job.





Wednesday, October 22, 2014

Debug in Datastage 8.7

Step 1: Right-click on the link where you want to create a breakpoint.
Step 2: Then click on Toggle Breakpoint.






Step 4: Then, from the menu bar, click "Debug" and then click "Go".

The Job Run window and Debug window will then appear; provide the required parameter values and click OK on the Job Run window.



It will look like the below while the debugger is running.



Next, it will show the first row's values as below. We can change the number of rows to see by clicking "Edit Breakpoint" from the Debug menu, or by right-clicking on the breakpoint that we created in Step 1.



Tuesday, October 21, 2014

Datastage Sequential File Stages (Import and Export) Performance Tuning


Improving Sequential File Performance

If the source file is fixed-width or delimited, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.
If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage should match those of the Sequential File stage.
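Outside of DataStage, the same split can be sketched in plain Python: a sequential read of whole lines, followed by parallel parsing of the columns. The file name and pipe-delimited layout below are assumptions made only for the example.

from multiprocessing import Pool

def parse_line(line):
    """Column parsing only -- no file I/O happens in the workers."""
    cols = line.rstrip("\n").split("|")          # assumed pipe-delimited layout
    return {"supplier_code": cols[0], "supplier_state": cols[1]}

if __name__ == "__main__":
    # Sequential read (one big "string column" per record) ...
    with open("suppliers.dat") as f:             # illustrative file name
        raw_lines = f.readlines()
    # ... then parallel parsing, analogous to the Column Import stage.
    with Pool(processes=4) as pool:
        records = pool.map(parse_line, raw_lines)
    print(len(records), "records parsed")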
On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffer size in Kbytes, with a default of 128 (128K). Increasing this may improve performance.
Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.

Partitioning Sequential File Reads

Care must be taken to choose the appropriate partitioning method for a Sequential File read:
Don’t read from Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.
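As a rough, stand-alone illustration of what round-robin distribution does with rows read from multiple files (the data and partition count are made up):

def round_robin(rows, partitions):
    """Deal rows out evenly, the way round-robin partitioning spreads
    per-file reads across the downstream partitions."""
    buckets = [[] for _ in range(partitions)]
    for i, row in enumerate(rows):
        buckets[i % partitions].append(row)
    return buckets

# Two source files of very different sizes, redistributed across 4 partitions.
file_a = ["a1", "a2", "a3", "a4", "a5"]
file_b = ["b1"]
print(round_robin(file_a + file_b, 4))
# [['a1', 'a5'], ['a2', 'b1'], ['a3'], ['a4']]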

Sequential File (Export) Buffering

By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty associated with the increased I/O.
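To make the trade-off concrete, here is a toy Python sketch with a flush_count parameter standing in for $APT_EXPORT_FLUSH_COUNT; the class and file names are invented for the example.

class BufferedExporter:
    """Write rows through a buffer and flush every `flush_count` rows,
    mimicking the effect of a low $APT_EXPORT_FLUSH_COUNT setting."""
    def __init__(self, path, flush_count=1000):
        self.f = open(path, "w")
        self.flush_count = flush_count
        self.rows_since_flush = 0

    def write_row(self, row):
        self.f.write(row + "\n")
        self.rows_since_flush += 1
        if self.rows_since_flush >= self.flush_count:
            self.f.flush()                 # more I/O, but data hits disk sooner
            self.rows_since_flush = 0

    def close(self):
        self.f.flush()
        self.f.close()

# flush_count=1 behaves like a real-time feed: every row is flushed immediately.
exp = BufferedExporter("export.out", flush_count=1)
exp.write_row("123|ABC|Acme Supply Co|IL")
exp.close()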

 Reading from and Writing to Fixed-Length Files

Particular attention must be taken when processing fixed-length fields using the Sequential File stage:
If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.

If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.

When writing fixed-length files from variable-length fields (e.g. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed width of the output column. Double-click on the column number in the grid dialog to set these column properties.
 To display each field value, use the print_field import property. All import and export properties are listed in chapter 25, Import/Export Properties of the Orchestrate 7.0 Operators Reference.
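For intuition, here is a small Python sketch of reading and writing fixed-width records; the field names, widths, and pad character are assumptions for the example, while in DataStage they come from the column properties described above.

# Assumed layout: supplier_code is 5 chars, supplier_state is 2 chars.
FIELD_WIDTHS = [("supplier_code", 5), ("supplier_state", 2)]

def read_fixed(line):
    """Slice a fixed-length record into named fields."""
    out, pos = {}, 0
    for name, width in FIELD_WIDTHS:
        out[name] = line[pos:pos + width].rstrip()   # strip the pad characters
        pos += width
    return out

def write_fixed(record, pad=" "):
    """Pad each variable-length value out to its declared field width."""
    return "".join(str(record[name]).ljust(width, pad)[:width]
                   for name, width in FIELD_WIDTHS)

row = read_fixed("ABC  CA")
print(row)                     # {'supplier_code': 'ABC', 'supplier_state': 'CA'}
print(write_fixed({"supplier_code": "XYZ", "supplier_state": "IL"}))  # 'XYZ  IL'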

 Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, these extra characters will be silently truncated.
Starting with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS directs DataStage to reject records with strings longer than their declared maximum column length.
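A tiny Python sketch of the two behaviours, with a made-up max_len and a reject_overruns flag standing in for the environment variable:

def import_varchar(value, max_len, reject_overruns=False):
    """Default: silently truncate. With the reject flag, refuse the record
    instead (roughly what $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS enables)."""
    if len(value) <= max_len:
        return value
    if reject_overruns:
        raise ValueError(f"record rejected: value longer than {max_len} chars")
    return value[:max_len]          # extra characters are silently dropped

print(import_varchar("CALIFORNIA", max_len=2))          # 'CA' -- truncated
try:
    import_varchar("CALIFORNIA", max_len=2, reject_overruns=True)
except ValueError as e:
    print(e)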

Monday, October 20, 2014

Environment variables in Datastage

Basically, an environment variable is a predefined variable that we can use while creating a DS job. We create/declare these variables in the DS Administrator, and while designing the job we set values for them in the job properties (Parameters tab). For example, for a database username/password we declare the variables in the Administrator and then set the values (e.g. scott and tiger) in the Parameters tab of the job properties.

OLTP and OLAP difference

OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.



- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).


The following table summarizes the major differences between OLTP and OLAP system design.

Source of data
  OLTP (Operational System): Operational data; OLTPs are the original source of the data.
  OLAP (Data Warehouse): Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized, with many tables.
  OLAP: Typically de-normalized, with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
  OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.


Sunday, October 19, 2014

Change Capture job design example



The Change Capture stage has two sources: one is a database (Oracle Connector), and the other comes in from a Transformer (link Lnk_Before). The output is sent to a Filter stage where we check the change code.
Based on the change code, we send the data down the update, insert, copy, and delete links, as sketched below.
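A minimal Python sketch of routing on the change code, mirroring the Filter stage conditions; the link names and row layout are illustrative.

# Change Capture codes: 0 = copy, 1 = insert, 2 = delete, 3 = edit (update)
links = {"copy": [], "insert": [], "delete": [], "update": []}
CODE_TO_LINK = {0: "copy", 1: "insert", 2: "delete", 3: "update"}

def route(row):
    """Send each row down the link that matches its change_code."""
    links[CODE_TO_LINK[row["change_code"]]].append(row)

for row in [{"key": "ABC", "change_code": 3}, {"key": "XYZ", "change_code": 1}]:
    route(row)
print({name: len(rows) for name, rows in links.items()})
# {'copy': 0, 'insert': 1, 'delete': 0, 'update': 1}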

Oracle Connector Properties

Delete and Insert in Datastage:






Insert Example:




Delete Example: