Thursday, May 23, 2013

Star vs Snowflake Schemas



First Answer: My personal opinion is to use the star by default, but if the product you are giving the business community works better with a snowflake, then I would snowflake it. The major difference between snowflake and star is that a snowflake will have multiple tables for a “dimension” while a star has a single table. For example, your company structure might be

Corporate → Region → Department → Store

In a star schema, you would collapse those into a single "store" dimension. In a snowflake, you would keep them apart with the store connecting to the fact.
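To make the difference concrete, here is a minimal SQL sketch of both designs; all table and column names (dim_store, dim_region and so on) are hypothetical, not taken from any particular product.

-- Star: the whole hierarchy is flattened into one store dimension
CREATE TABLE dim_store (
    store_key       INT PRIMARY KEY,   -- surrogate key referenced by the fact table
    store_name      VARCHAR(100),
    department_name VARCHAR(100),
    region_name     VARCHAR(100),
    corporate_name  VARCHAR(100)
);

-- Snowflake: each level keeps its own table; only the store joins to the fact
CREATE TABLE dim_corporate (
    corporate_key  INT PRIMARY KEY,
    corporate_name VARCHAR(100)
);
CREATE TABLE dim_region (
    region_key    INT PRIMARY KEY,
    region_name   VARCHAR(100),
    corporate_key INT REFERENCES dim_corporate (corporate_key)
);
CREATE TABLE dim_department (
    department_key  INT PRIMARY KEY,
    department_name VARCHAR(100),
    region_key      INT REFERENCES dim_region (region_key)
);
CREATE TABLE dim_store_snowflake (
    store_key      INT PRIMARY KEY,
    store_name     VARCHAR(100),
    department_key INT REFERENCES dim_department (department_key)
);

In both designs the fact table carries only store_key; the difference is how many joins are needed to reach corporate-level attributes.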

Second Answer: First of all, some definitions are in order. In a star schema, dimensions that reflect a hierarchy are flattened into a single table. For example, a star schema Geography Dimension would have columns like country, state/province, city and postal code. In the source system, this hierarchy would probably be normalized across multiple tables with one-to-many relationships.

A snowflake schema does not flatten a hierarchical dimension into a single table. It would, instead, have two or more tables with one-to-many relationships. This is a more normalized structure. For example, one table may have the state/province and country columns, and a second table would have the city and postal code. The table with city and postal code would have a many-to-one relationship to the table with the state/province columns.

There are some good reasons for snowflake dimension tables. One example is a company that has many types of products. Some products have a few attributes, others have a great many, and the products are very different from each other. The thing to do here is to create a core Product dimension that has the attributes common to all the products, such as product type, manufacturer, brand, product group, etc. Then create a separate sub-dimension table for each distinct group of products where each group shares common attributes. The sub-product tables must contain a foreign key to the core Product dimension table.

One of the criticisms of using snowflake dimensions is that it is difficult for some of the multidimensional front-end presentation tools to generate a query on a snowflake dimension. However, you can create a view for each combination of the core product/sub-product dimension tables and give the view a suitably descriptive name (Frozen Food Product, Hardware Product, etc.), and then these tools will have no problem.
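As a rough illustration of that workaround, assuming hypothetical table names dim_product and dim_frozen_food and made-up attribute columns:

-- One view per core/sub-product combination, named for the business
CREATE VIEW frozen_food_product AS
SELECT p.product_key,
       p.product_type,
       p.manufacturer,
       p.brand,
       p.product_group,
       f.storage_temperature,   -- attributes specific to frozen food
       f.shelf_life_days
FROM   dim_product     p
JOIN   dim_frozen_food f
  ON   f.product_key = p.product_key;

The presentation tool then queries frozen_food_product as if it were a single flat dimension.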

Tuesday, May 14, 2013

Performance Tuning in DataStage


1. Staged the data coming from ODBC/OCI/DB2 UDB stages or any database on the server using hash/sequential files for optimum performance.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator stage.
5. Sorted the data as much as possible in the database and reduced the use of DS Sort for better performance of jobs.
6. Removed unused data from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure the required functionality is not already available in one of the standard routines supplied in the SDK or DS Utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code the overhead will be minimal.
11. Try to have the constraints in the 'Selection' criteria of the job itself. This will eliminate unnecessary records before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database (see the SQL sketch after this list).
15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
17. Minimize the usage of Transformer stages; instead, use Copy, Modify, Filter or Row Generator stages where possible.
18. Use SQL code while extracting the data.
19. Handle the nulls.
20. Minimize the warnings.
21. Reduce the number of lookups in a job design.
22. Try not to use more than 20 stages in a job.
23. Use an IPC stage between two passive stages; it reduces processing time.
24. Drop indexes before data loading and recreate them after loading data into tables (also sketched after this list).
25. Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option.
26. If the hash file is used only for lookup, then enable 'Preload to memory'. This will improve the performance. Also check the order of execution of the routines.
27. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
28. Use the 'Preload to memory' option in the hash file output.
29. Use 'Write to cache' in the hash file input.
30. Write into the error tables only after all the Transformer stages.
31. Reduce the width of the input record; remove the columns that you do not use.
32. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
33. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
34. Ideally, if the amount of data to be processed is small, configuration files with fewer nodes should be used, while if the data volume is larger, configuration files with a larger number of nodes should be used.
35. Partitioning should be set so as to have a balanced data flow, i.e. nearly equal partitioning of data, with data skew minimized.
36. In DataStage jobs where a high volume of data is processed, virtual memory settings for the job should be optimized. Jobs often abort where a single lookup has multiple reference links; this happens due to insufficient temporary memory. In such jobs $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
37. Sequential files should be used when reading a flat file (fixed width or delimited) from a UNIX environment which has been FTPed from some external system.
38. They should also be used when some UNIX operation has to be done on the file. Don't use sequential files for intermediate storage between jobs; this causes performance overhead, as data conversion is needed before writing to and reading from a UNIX file.
39. For faster reading from the stage, the number of readers per node can be increased (the default value is one).
40. Usage of Datasets results in good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
41. The Lookup stage is faster when the data volume is small. If the reference data volume is large, usage of the Lookup stage should be avoided, as all reference data is pulled into local memory.
42. The sparse lookup type should be chosen only if the primary input data volume is small.
43. Join should be used when the data volume is high. It is a good alternative to the Lookup stage when handling huge volumes of data.
44. Even though data can be sorted on a link, the Sort stage should be used when the data to be sorted is huge. When we sort data on a link (sort/unique option), once the data size goes beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort stage should be used instead of sorting on the link. The Sort stage also gives an option to increase the buffer memory used for sorting; this means lower I/O and better performance.
45. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
46. The presence of a Funnel stage reduces the performance of a job; it can increase the time taken by the job by around 30% (based on observations). When a Funnel stage is to be used in a large job, it is better to isolate it in its own job: write the outputs to Datasets and funnel them in a new job.
47. The Funnel stage should be run in “continuous” mode, without hindrance.
48. A single job should not be overloaded with stages. Each extra stage in a job means fewer resources available for every stage, which directly affects the job's performance. If possible, big jobs having a large number of stages should be logically split into smaller units.
49. Unnecessary column propagation should not be done. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.
50. A most often neglected option is “don’t sort if previously sorted” in the Sort stage; set this option to “true”. This improves Sort stage performance a great deal.
51. In the Transformer stage, “Preserve Sort Order” can be used to maintain the sort order of the data and reduce sorting in the job.
52. Reduce the number of stage variables used.
53. The Copy stage should be used instead of a Transformer for simple operations.
54. “Upsert” works well if the data is sorted on the primary key column of the table which is being loaded.
55. Don’t read from a Sequential File using SAME partitioning.
56. By using the hash file stage we can improve performance. In the hash file stage we can define the read cache size and write cache size; the default size is 128 MB.
57. Performance can also be improved on active-to-active links by enabling the row buffer; the default row buffer size is 128 KB.
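To illustrate points 14, 15, 18 and 24 above, here is a hedged SQL sketch; the table, column and index names are invented for the example.

-- Filter and sort in the source SELECT instead of a downstream
-- constraint and Sort stage (points 14, 15 and 18)
SELECT   cust_id,
         cust_name,
         region_code,
         last_order_date
FROM     customers
WHERE    region_code = 'APAC'                 -- filter at the database
  AND    last_order_date >= DATE '2013-01-01'
ORDER BY cust_id;                             -- arrives pre-sorted for Aggregator/Join stages

-- Drop and recreate an index around a bulk load (point 24)
DROP INDEX idx_sales_cust;
-- ... bulk load into the sales table here ...
CREATE INDEX idx_sales_cust ON sales (cust_id);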

ETL Project Life Cycle





Business Requirement Collection :- 
-> Business requirement gathering is started by the Business Analyst, onsite technical lead and client business users.
-> In this phase, a Business Analyst prepares the Business Requirement Document (BRD) (or) Business Requirement Specifications (BRS).
-> BR collection takes place at the client location.
-> The outputs from BR analysis are: [ BRS :- the Business Analyst gathers the business requirements and documents them in the BRS; SRS :- senior technical people (or) the ETL architect prepare the SRS, which contains the s/w and h/w requirements. ]
The SRS will include
a) O/S to be used (UNIX)
b) RDBMS required to build the database (Oracle, Teradata, etc.)
c) ETL tools required (Informatica, DataStage)
d) OLAP tools required (Cognos, BO)
The SRS is also called the Technical Requirement Specification (TRS).
Designing and Planning the Solution :-
------------------------------------------------
-> The outputs from the design and planning phase are
a) HLD (High Level Design) Document
b) LLD (Low Level Design) Document
HLD (High Level Design) Document :-
An ETL Architect and DWH Architect participate in designing a solution to build a DWH.
An HLD document is prepared based on the Business Requirement.
LLD (Low Level Design) Document :-

Based on the HLD, a senior ETL developer prepares the Low Level Design Document.
The LLD contains more technical details of the ETL system.
An LLD contains the data flow diagram (DFD) and details of the source and target of each mapping.
An LLD also contains information about full and incremental load.
After the LLD/specs, the development phase starts.
Development Phase ( Coding ) :-
--------------------------------------------------
-> Based on the LLD, the ETL team creates the mappings (ETL code).
-> After designing the mappings, the code (mappings) will be reviewed by developers.
Code Review :-
-> Code review will be done by a developer.
-> In code review, the developer reviews the code and the logic, but not the data.
-> The following activities take place in code review:
-> Check the naming standards of transformations, mappings, etc.
-> Source-to-target mapping (whether the correct logic is placed in the mapping or not).
Peer Review :-
-> The code will be reviewed by a team member (third-party developer).
Testing:-
--------------------------------
The following types of testing are carried out in the testing environment.
1) Unit Testing
2) Development Integration Testing
3) System Integration Testing
4) User Acceptance Testing
Unit Testing :-
-> A unit test for the DWH is white-box testing; it should check the ETL procedures and mappings.
-> The following test cases can be executed by an ETL developer:
1) Verify data loss
2) No. of records in the source and target (a sample reconciliation query is sketched below)
3) Data load / insert
4) Data load / update
5) Incremental load
6) Data accuracy
7) Verify naming standards
8) Verify column mapping
-> The unit test will be carried out by the ETL developer in the development phase.
-> The ETL developer has to do the data validations also in this phase.
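For test case 2, a reconciliation query of roughly the following shape can be used; src_orders and dwh_orders are hypothetical source and target table names:

-- Compare source and target row counts to spot data loss
SELECT (SELECT COUNT(*) FROM src_orders) AS source_count,
       (SELECT COUNT(*) FROM dwh_orders) AS target_count,
       (SELECT COUNT(*) FROM src_orders) -
       (SELECT COUNT(*) FROM dwh_orders) AS difference
FROM   dual;   -- Oracle-style dummy table; omit the FROM clause on databases that allow it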
Development Integration Testing -
-> Run all the mappings in the sequence order.
-> First Run the source to stage mappings.
-> Then run the mappings related to dimensions and facts.
System Integration Testing :-
-> After development phase,we have to move our code to QA environment.
-> In this environment,we are giving read-only permission to testing people.
-> They will test all the workflows.
-> And they will test our code according to their standards.
User Acceptance Testing ( UAT ) :-
-> This test is carried out in the presence of client-side technical users to verify the data migration from source to destination.
Production Environment :-
-> Migrate the code into the go-live environment from the test environment (QA environment).





Development activities:

New enhancements will be discussed with the onsite coordinator and with the internal team. Once the requirement is confirmed by the business/client, we start with the impact analysis on the existing system/process and identify the streams that are impacted. Once the analysis is done, we come up with an effort estimation that is submitted to the business/client; then we start with design and coding.

Unit testing will be done by each individual who developed the code. We take a backup of the existing (master) job and make the changes on the backed-up job. Once unit testing is completed, the results are shared and go through multiple reviews/comments/approvals.

Once your testing is done, the backed-up (modified) job replaces the original job (we take a backup of the existing master job again).

We will consolidate all the jobs from each individual and plan for sequence level coding. 

UAT/Integration testing:

We don't have admin rights to QA. To support the QA environment we have a third-party vendor (Cognizant). A new change request (CR) in the internal tool (a Remedy ticket) will be raised for each deployment in QA.
The .dsx will be placed on a shared drive, mentioning all the job names, paths and the changes made to the particular jobs; the jobs are moved after validation of the jobs is done.

We have a separate team from China who provide the data for the QA team. We monitor the jobs (in Director) and get notified of any aborts/issues.

Data quality issues will be raised through Remedy for the existing test/QA process. Developers will take them up accordingly and resolve the issues.

Once the testing team gives sign-off, we plan for production deployment. Before deploying we prepare the documentation, such as modifying the existing design document, providing test cases and test results, and checklists (all validations in DBs, naming conventions of jobs, whether existing paths are present or not, etc.).

Once the above docs are attached to the production request (CR), we go through approvals based upon the hierarchy.

After approvals, deployment will be planned; it will be scheduled on a weekend.





Saturday, May 11, 2013

Change Capture Stage (CDC)


CDC -- Change Capture 


Sample source data you can take is like below:

Before_Data 
e_id,e_name,e_Add 
11,kim,bris 
22,jim,mel 
33,pim,syd 
44,lim,canb 
55,pom,pth 

After_Data 
e_id,e_name,e_add 
11,kim,syd 
22,jim,canb 
33,pim,syd 
44,lim,canb 
55,pom,pth 

Read the data from the two different files. Then

Go to Change Capture Stage 

Select Key_id as e_id 

and click on change values. 

Select Values in Available Properties to add 

And select e_add as value 

Click Ok 

Go to Output and drag and drop all the required columns.

You can drag and drop everything here, and click OK.


Give the file name in the target file.

Compile and Run the Job 

You will get the output as 



e_id e_name e_add change_code 

11 kim syd 3 
22 jim canb 3 


3 means edited data 
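For readers more comfortable in SQL, the comparison the stage performs can be sketched roughly as follows, assuming the two files are loaded into hypothetical tables before_data and after_data and the database supports FULL OUTER JOIN:

-- Derive a change_code per key by comparing the before and after data
SELECT COALESCE(a.e_id,   b.e_id)   AS e_id,
       COALESCE(a.e_name, b.e_name) AS e_name,
       COALESCE(a.e_add,  b.e_add)  AS e_add,
       CASE
         WHEN b.e_id IS NULL     THEN 1   -- insert: key only in after data
         WHEN a.e_id IS NULL     THEN 2   -- delete: key only in before data
         WHEN a.e_add <> b.e_add THEN 3   -- edit: key matches, value changed
         ELSE 0                           -- copy: unchanged
       END                          AS change_code
FROM   after_data a
FULL OUTER JOIN before_data b
  ON   b.e_id = a.e_id;

By default the stage drops the unchanged (copy) rows, which is why only e_id 11 and 22 appear in the output above with change_code 3.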

Implementation:

1. Two input datasets are required for the Change Capture stage: one is the old dataset, the second is the new or updated dataset.

2. Give the two inputs to the Change Capture stage and set the target as a dataset.

3. Let the incoming data be sorted on the key column(s) for performance purposes in the Change Capture stage.

4. Upon executing the job, the data, when viewed from the dataset, shows a new column added apart from the output data. A change code column is generated in the Change Capture stage with values 0, 1, 2 and 3, which depict the changes found by comparing the two input datasets: copy (0), insert (1), delete (2), edit (3).

5. Decide what kind of data you need in the output target, e.g. copy, insert, delete or edit.

6. To apply SCD Type 2 we require Start Date and End Date columns.

7. The Change Capture stage output is given to a Transformer stage, where 2 new columns are generated with the Effective Start Date and End Date.

8. If you need all inserted or new data to be passed into a particular dataset, then you need to specify an appropriate condition in the Transformer stage on the outgoing link, e.g. Drop Output For Insert = true.

9. In a similar way other data can also be captured, or a Filter stage can be used after the Transformer stage to filter the data into the targets based on the requirement.
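As a rough SQL sketch of steps 6 to 8 for a row flagged as an edit (change_code = 3); dim_employee, start_date and end_date are assumed names, and the dates would normally come from the Transformer derivations rather than literals:

-- Close the currently active version of the changed row ...
UPDATE dim_employee
SET    end_date = CURRENT_DATE
WHERE  e_id = 11
  AND  end_date IS NULL;

-- ... then insert the new version with an open end date
INSERT INTO dim_employee (e_id, e_name, e_add, start_date, end_date)
VALUES (11, 'kim', 'syd', CURRENT_DATE, NULL);

In the job itself, the Transformer derivations and the link constraints play the same role.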

Tuesday, May 7, 2013

Generating a sequence number in DataStage

surrogate-key-generation in Transformer

Create a unique counter in DataStage


This entry describes various ways of creating a unique counter in DataStage jobs.
A parallel job has a Surrogate Key stage that creates unique IDs; however, it is limited in that it does not support conditional code, and it may be more efficient to add a counter to an existing Transformer rather than add a new stage.

In a server job there is a set of key increment routines, installed in the SDK routine samples, that offers a more complex counter which remembers values between job executions.
The following section outlines a Transformer-only technique.



Steps

In a DataStage job the easiest way to create a counter is within the Transformer stage with a Stage Variable.
svMyCounter = svMyCounter + 1
This simple counter adds 1 each time a row is processed.
The counter can be given a seed value by passing a value in as a job parameter and setting the initial value of svMyCounter to that job parameter.

In a parallel job this simple counter will create duplicate values on each node as the transformer is split into parallel instances. It can be turned into a unique counter by using special parallel macros.
  1. Create a stage variable for the counter, e.g. svCounter.
  2. On the Stage Properties form, set the Initial Value of the stage variable to "@PARTITIONNUM - @NUMPARTITIONS + 1".
  3. Set the derivation of the stage variable to "svCounter + @NUMPARTITIONS". You can embed this in an IF statement if it is a conditional counter.
Each instance will start at a different non-positive number (for example, -3, -2, -1 and 0 across four partitions). When the counter is incremented, each instance increments by the number of partitions, e.g. 4. This gives a sequence in instance 1 of 1, 5, 9, 13...; instance 2 gives 2, 6, 10, 14...; and so on.

Remember, this method only produces a contiguous sequence if your data is evenly balanced, i.e. an equal number of rows goes through each partition. Alternative syntax is:
@INROWNUM * @NUMPARTITIONS + @PARTITIONNUM
or, to start the sequence at 1:
(@NUMPARTITIONS * @INROWNUM) + @PARTITIONNUM - @NUMPARTITIONS + 1

Example (2 partitions):
2*1+0-2+1=1
2*1+1-2+1=2
2*2+0-2+1=3
2*2+1-2+1=4
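Purely to sanity-check that arithmetic, the same calculation can be sketched in SQL, assuming a database that accepts VALUES in the FROM clause (e.g. PostgreSQL); the staged_rows column names are made up:

-- 2 partitions; partition_num is 0-based, row_num is 1-based within the partition
SELECT partition_num,
       row_num,
       (2 * row_num) + partition_num - 2 + 1 AS surrogate_key
FROM   (VALUES (0, 1), (1, 1), (0, 2), (1, 2))
       AS staged_rows (partition_num, row_num);
-- yields surrogate keys 1, 2, 3 and 4, matching the worked example above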

Or

Sequence numbers can also be generated in DataStage using the following routines:
- KeyMgtGetNextVal
- KeyMgtGetNextValConn

Or

NextSurrogateKey()