Thursday, May 23, 2013

Star vs Snowflake Schemas



First Answer: My personal opinion is to use the star by default, but if the product you are giving the business community works better with a snowflake, then I would snowflake it. The major difference between snowflake and star is that a snowflake will have multiple tables for a “dimension” while a star has a single table. For example, your company structure might be

Corporate → Region → Department → Store

In a star schema, you would collapse those into a single "store" dimension. In a snowflake, you would keep them apart with the store connecting to the fact.
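To make the difference concrete, here is a minimal SQL sketch of both designs; all table and column names (dim_store, dim_region and so on) are hypothetical, not taken from any particular product.

-- Star: the whole hierarchy is flattened into one store dimension
CREATE TABLE dim_store (
    store_key       INT PRIMARY KEY,   -- surrogate key referenced by the fact table
    store_name      VARCHAR(100),
    department_name VARCHAR(100),
    region_name     VARCHAR(100),
    corporate_name  VARCHAR(100)
);

-- Snowflake: each level keeps its own table; only the store joins to the fact
CREATE TABLE dim_corporate (
    corporate_key  INT PRIMARY KEY,
    corporate_name VARCHAR(100)
);
CREATE TABLE dim_region (
    region_key    INT PRIMARY KEY,
    region_name   VARCHAR(100),
    corporate_key INT REFERENCES dim_corporate (corporate_key)
);
CREATE TABLE dim_department (
    department_key  INT PRIMARY KEY,
    department_name VARCHAR(100),
    region_key      INT REFERENCES dim_region (region_key)
);
CREATE TABLE dim_store_snowflake (
    store_key      INT PRIMARY KEY,
    store_name     VARCHAR(100),
    department_key INT REFERENCES dim_department (department_key)
);

In both designs the fact table carries only store_key; the difference is how many joins are needed to reach corporate-level attributes.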

Second Answer: First of all, some definitions are in order. In a star schema, dimensions that reflect a hierarchy are flattened into a single table. For example, a star schema Geography Dimension would have columns like country, state/province, city and postal code. In the source system, this hierarchy would probably be normalized across multiple tables with one-to-many relationships.

A snowflake schema does not flatten a hierarchical dimension into a single table. It would, instead, have two or more tables with one-to-many relationships. This is a more normalized structure. For example, one table may have the state/province and country columns, and a second table would have the city and postal code. The table with city and postal code would have a many-to-one relationship to the table with the state/province columns.

There are some good reasons for snowflake dimension tables. One example is a company that has many types of products. Some products have a few attributes, others have a great many, and the products are very different from each other. The thing to do here is to create a core Product dimension that has the attributes common to all the products, such as product type, manufacturer, brand, product group, etc. Then create a separate sub-dimension table for each distinct group of products where each group shares common attributes. The sub-product tables must contain a foreign key to the core Product dimension table.

One of the criticisms of using snowflake dimensions is that it is difficult for some of the multidimensional front-end presentation tools to generate a query on a snowflake dimension. However, you can create a view for each combination of the core product/sub-product dimension tables and give the view a suitably descriptive name (Frozen Food Product, Hardware Product, etc.), and then these tools will have no problem.
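As a rough illustration of that workaround, assuming hypothetical table names dim_product and dim_frozen_food and made-up attribute columns:

-- One view per core/sub-product combination, named for the business
CREATE VIEW frozen_food_product AS
SELECT p.product_key,
       p.product_type,
       p.manufacturer,
       p.brand,
       p.product_group,
       f.storage_temperature,   -- attributes specific to frozen food
       f.shelf_life_days
FROM   dim_product     p
JOIN   dim_frozen_food f
  ON   f.product_key = p.product_key;

The presentation tool then queries frozen_food_product as if it were a single flat dimension.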

Tuesday, May 14, 2013

Performance Tuning in DataStage


1. Staged the data coming from ODBC/OCI/DB2 UDB stages or any database on the server using hash/sequential files for optimum performance.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator stage.
5. Sorted the data as much as possible in the database and reduced the use of DS Sort for better performance of jobs.
6. Removed unused data from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure the required functionality is not already available in one of the standard routines supplied in the SDK or DS Utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code the overhead will be minimal.
11. Try to have the constraints in the 'Selection' criteria of the job itself. This will eliminate unnecessary records before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database (see the SQL sketch after this list).
15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
17. Minimize the usage of Transformer stages; instead, use Copy, Modify, Filter or Row Generator stages where possible.
18. Use SQL code while extracting the data.
19. Handle the nulls.
20. Minimize the warnings.
21. Reduce the number of lookups in a job design.
22. Try not to use more than 20 stages in a job.
23. Use an IPC stage between two passive stages; it reduces processing time.
24. Drop indexes before data loading and recreate them after loading data into tables (also sketched after this list).
25. Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option.
26. If the hash file is used only for lookup, then enable 'Preload to memory'. This will improve the performance. Also check the order of execution of the routines.
27. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
28. Use the 'Preload to memory' option in the hash file output.
29. Use 'Write to cache' in the hash file input.
30. Write into the error tables only after all the Transformer stages.
31. Reduce the width of the input record; remove the columns that you do not use.
32. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
33. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
34. Ideally, if the amount of data to be processed is small, configuration files with fewer nodes should be used, while if the data volume is larger, configuration files with a larger number of nodes should be used.
35. Partitioning should be set so as to have a balanced data flow, i.e. nearly equal partitioning of data, with data skew minimized.
36. In DataStage jobs where a high volume of data is processed, virtual memory settings for the job should be optimized. Jobs often abort where a single lookup has multiple reference links; this happens due to insufficient temporary memory. In such jobs $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
37. Sequential files should be used when reading a flat file (fixed width or delimited) from a UNIX environment which has been FTPed from some external system.
38. They should also be used when some UNIX operation has to be done on the file. Don't use sequential files for intermediate storage between jobs; this causes performance overhead, as data conversion is needed before writing to and reading from a UNIX file.
39. For faster reading from the stage, the number of readers per node can be increased (the default value is one).
40. Usage of Datasets results in good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
41. The Lookup stage is faster when the data volume is small. If the reference data volume is large, usage of the Lookup stage should be avoided, as all reference data is pulled into local memory.
42. The sparse lookup type should be chosen only if the primary input data volume is small.
43. Join should be used when the data volume is high. It is a good alternative to the Lookup stage when handling huge volumes of data.
44. Even though data can be sorted on a link, the Sort stage should be used when the data to be sorted is huge. When we sort data on a link (sort/unique option), once the data size goes beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort stage should be used instead of sorting on the link. The Sort stage also gives an option to increase the buffer memory used for sorting; this means lower I/O and better performance.
45. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
46. The presence of a Funnel stage reduces the performance of a job; it can increase the time taken by the job by around 30% (based on observations). When a Funnel stage is to be used in a large job, it is better to isolate it in its own job: write the outputs to Datasets and funnel them in a new job.
47. The Funnel stage should be run in “continuous” mode, without hindrance.
48. A single job should not be overloaded with stages. Each extra stage in a job means fewer resources available for every stage, which directly affects the job's performance. If possible, big jobs having a large number of stages should be logically split into smaller units.
49. Unnecessary column propagation should not be done. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.
50. A most often neglected option is “don’t sort if previously sorted” in the Sort stage; set this option to “true”. This improves Sort stage performance a great deal.
51. In the Transformer stage, “Preserve Sort Order” can be used to maintain the sort order of the data and reduce sorting in the job.
52. Reduce the number of stage variables used.
53. The Copy stage should be used instead of a Transformer for simple operations.
54. “Upsert” works well if the data is sorted on the primary key column of the table which is being loaded.
55. Don’t read from a Sequential File using SAME partitioning.
56. By using the hash file stage we can improve performance. In the hash file stage we can define the read cache size and write cache size; the default size is 128 MB.
57. Performance can also be improved on active-to-active links by enabling the row buffer; the default row buffer size is 128 KB.
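To illustrate points 14, 15, 18 and 24 above, here is a hedged SQL sketch; the table, column and index names are invented for the example.

-- Filter and sort in the source SELECT instead of a downstream
-- constraint and Sort stage (points 14, 15 and 18)
SELECT   cust_id,
         cust_name,
         region_code,
         last_order_date
FROM     customers
WHERE    region_code = 'APAC'                 -- filter at the database
  AND    last_order_date >= DATE '2013-01-01'
ORDER BY cust_id;                             -- arrives pre-sorted for Aggregator/Join stages

-- Drop and recreate an index around a bulk load (point 24)
DROP INDEX idx_sales_cust;
-- ... bulk load into the sales table here ...
CREATE INDEX idx_sales_cust ON sales (cust_id);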

ETL Project Life Cycle





Business Requirement Collection :- 
-> Business requirement gathering is started by the Business Analyst, onsite technical lead and client business users.
-> In this phase, a Business Analyst prepares the Business Requirement Document (BRD) (or) Business Requirement Specifications (BRS).
-> BR collection takes place at the client location.
-> The outputs from BR analysis are: [ BRS :- the Business Analyst gathers the business requirements and documents them in the BRS; SRS :- senior technical people (or) the ETL architect prepare the SRS, which contains the s/w and h/w requirements. ]
The SRS will include
a) O/S to be used (UNIX)
b) RDBMS required to build the database (Oracle, Teradata, etc.)
c) ETL tools required (Informatica, DataStage)
d) OLAP tools required (Cognos, BO)
The SRS is also called the Technical Requirement Specification (TRS).
Designing and Planning the Solution :-
------------------------------------------------
-> The outputs from the design and planning phase are
a) HLD (High Level Design) Document
b) LLD (Low Level Design) Document
HLD (High Level Design) Document :-
An ETL Architect and DWH Architect participate in designing a solution to build a DWH.
An HLD document is prepared based on the Business Requirement.
LLD (Low Level Design) Document :-

Based on the HLD, a senior ETL developer prepares the Low Level Design Document.
The LLD contains more technical details of the ETL system.
An LLD contains the data flow diagram (DFD) and details of the source and target of each mapping.
An LLD also contains information about full and incremental load.
After the LLD/specs, the development phase starts.
Development Phase ( Coding ) :-
--------------------------------------------------
-> Based on the LLD, the ETL team creates the mappings (ETL code).
-> After designing the mappings, the code (mappings) will be reviewed by developers.
Code Review :-
-> Code review will be done by a developer.
-> In code review, the developer reviews the code and the logic, but not the data.
-> The following activities take place in code review:
-> Check the naming standards of transformations, mappings, etc.
-> Source-to-target mapping (whether the correct logic is placed in the mapping or not).
Peer Review :-
-> The code will be reviewed by a team member (third-party developer).
Testing:-
--------------------------------
The following types of testing are carried out in the testing environment.
1) Unit Testing
2) Development Integration Testing
3) System Integration Testing
4) User Acceptance Testing
Unit Testing :-
-> A unit test for the DWH is white-box testing; it should check the ETL procedures and mappings.
-> The following test cases can be executed by an ETL developer:
1) Verify data loss
2) No. of records in the source and target (a sample reconciliation query is sketched below)
3) Data load / insert
4) Data load / update
5) Incremental load
6) Data accuracy
7) Verify naming standards
8) Verify column mapping
-> The unit test will be carried out by the ETL developer in the development phase.
-> The ETL developer has to do the data validations also in this phase.
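For test case 2, a reconciliation query of roughly the following shape can be used; src_orders and dwh_orders are hypothetical source and target table names:

-- Compare source and target row counts to spot data loss
SELECT (SELECT COUNT(*) FROM src_orders) AS source_count,
       (SELECT COUNT(*) FROM dwh_orders) AS target_count,
       (SELECT COUNT(*) FROM src_orders) -
       (SELECT COUNT(*) FROM dwh_orders) AS difference
FROM   dual;   -- Oracle-style dummy table; omit the FROM clause on databases that allow it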
Development Integration Testing -
-> Run all the mappings in the sequence order.
-> First Run the source to stage mappings.
-> Then run the mappings related to dimensions and facts.
System Integration Testing :-
-> After development phase,we have to move our code to QA environment.
-> In this environment,we are giving read-only permission to testing people.
-> They will test all the workflows.
-> And they will test our code according to their standards.
User Acceptance Testing ( UAT ) :-
-> This test is carried out in the presence of client-side technical users to verify the data migration from source to destination.
Production Environment :-
-> Migrate the code into the go-live environment from the test environment (QA environment).





Development activities:

New enhancements will be discussed with the onsite coordinator and with the internal team. Once the requirement is confirmed by the business/client, we start with the impact analysis on the existing system/process and identify the streams that are impacted. Once the analysis is done, we come up with an effort estimation that is submitted to the business/client; then we start with design and coding.

Unit testing will be done by each individual who developed the code. We take a backup of the existing (master) job and make the changes on the backed-up job. Once unit testing is completed, the results are shared and go through multiple reviews/comments/approvals.

Once your testing is done, the backed-up (modified) job replaces the original job (we take a backup of the existing master job again).

We will consolidate all the jobs from each individual and plan for sequence level coding. 

UAT/Integration testing:

We don't have admin rights to QA. To support the QA environment we have a third-party vendor (Cognizant). A new change request (CR) in the internal tool (a Remedy ticket) will be raised for each deployment in QA.
The .dsx will be placed on a shared drive, mentioning all the job names, paths and the changes made to the particular jobs; the jobs are moved after validation of the jobs is done.

We have a separate team from China who provide the data for the QA team. We monitor the jobs (in Director) and get notified of any aborts/issues.

Data quality issues will be raised through Remedy for the existing test/QA process. Developers will take them up accordingly and resolve the issues.

Once the testing team gives sign-off, we plan for production deployment. Before deploying we prepare the documentation, such as modifying the existing design document, providing test cases and test results, and checklists (all validations in DBs, naming conventions of jobs, whether existing paths are present or not, etc.).

Once the above docs are attached to the production request (CR), we go through approvals based upon the hierarchy.

After approvals, deployment will be planned; it will be scheduled on a weekend.





Saturday, May 11, 2013

Change Capture Stage (CDC)


CDC -- Change Capture 


Sample source data you can take is like below:

Before_Data 
e_id,e_name,e_Add 
11,kim,bris 
22,jim,mel 
33,pim,syd 
44,lim,canb 
55,pom,pth 

After_Data 
e_id,e_name,e_add 
11,kim,syd 
22,jim,canb 
33,pim,syd 
44,lim,canb 
55,pom,pth 

Read the data from the two different files. Then

Go to Change Capture Stage 

Select Key_id as e_id 

and click on change values. 

Select Values in Available Properties to add 

And select e_add as value 

Click Ok 

Go to Output and drag and drop all the required columns.

You can drag and drop everything here, and click OK.


Give the file name in the target file.

Compile and Run the Job 

You will get the output as 



e_id e_name e_add change_code 

11 kim syd 3 
22 jim canb 3 


3 means edited data 
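For readers more comfortable in SQL, the comparison the stage performs can be sketched roughly as follows, assuming the two files are loaded into hypothetical tables before_data and after_data and the database supports FULL OUTER JOIN:

-- Derive a change_code per key by comparing the before and after data
SELECT COALESCE(a.e_id,   b.e_id)   AS e_id,
       COALESCE(a.e_name, b.e_name) AS e_name,
       COALESCE(a.e_add,  b.e_add)  AS e_add,
       CASE
         WHEN b.e_id IS NULL     THEN 1   -- insert: key only in after data
         WHEN a.e_id IS NULL     THEN 2   -- delete: key only in before data
         WHEN a.e_add <> b.e_add THEN 3   -- edit: key matches, value changed
         ELSE 0                           -- copy: unchanged
       END                          AS change_code
FROM   after_data a
FULL OUTER JOIN before_data b
  ON   b.e_id = a.e_id;

By default the stage drops the unchanged (copy) rows, which is why only e_id 11 and 22 appear in the output above with change_code 3.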

Implementation:

1. Two input datasets are required for the Change Capture stage: one is the old dataset, the second is the new or updated dataset.

2. Give the two inputs to the Change Capture stage and set the target as a dataset.

3. Let the incoming data be sorted on the key column(s) for performance purposes in the Change Capture stage.

4. Upon executing the job, the data, when viewed from the dataset, shows a new column added apart from the output data. A change code column is generated in the Change Capture stage with values 0, 1, 2 and 3, which depict the changes found by comparing the two input datasets: copy (0), insert (1), delete (2), edit (3).

5. Decide what kind of data you need in the output target, e.g. copy, insert, delete or edit.

6. To apply SCD Type 2 we require Start Date and End Date columns.

7. The Change Capture stage output is given to a Transformer stage, where 2 new columns are generated with the Effective Start Date and End Date.

8. If you need all inserted or new data to be passed into a particular dataset, then you need to specify an appropriate condition in the Transformer stage on the outgoing link, e.g. Drop Output For Insert = true.

9. In a similar way other data can also be captured, or a Filter stage can be used after the Transformer stage to filter the data into the targets based on the requirement.
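As a rough SQL sketch of steps 6 to 8 for a row flagged as an edit (change_code = 3); dim_employee, start_date and end_date are assumed names, and the dates would normally come from the Transformer derivations rather than literals:

-- Close the currently active version of the changed row ...
UPDATE dim_employee
SET    end_date = CURRENT_DATE
WHERE  e_id = 11
  AND  end_date IS NULL;

-- ... then insert the new version with an open end date
INSERT INTO dim_employee (e_id, e_name, e_add, start_date, end_date)
VALUES (11, 'kim', 'syd', CURRENT_DATE, NULL);

In the job itself, the Transformer derivations and the link constraints play the same role.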

Tuesday, May 7, 2013

Generating a sequence number in DataStage

surrogate-key-generation in Transformer

Create a unique counter in DataStage


This entry describes various ways of creating a unique counter in DataStage jobs.
A parallel job has a Surrogate Key stage that creates unique IDs; however, it is limited in that it does not support conditional code, and it may be more efficient to add a counter to an existing Transformer rather than add a new stage.

In a server job there is a set of key increment routines, installed in the SDK routine samples, that offers a more complex counter which remembers values between job executions.
The following section outlines a Transformer-only technique.



Steps

In a DataStage job the easiest way to create a counter is within the Transformer stage with a Stage Variable.
svMyCounter = svMyCounter + 1
This simple counter adds 1 each time a row is processed.
The counter can be given a seed value by passing a value in as a job parameter and setting the initial value of svMyCounter to that job parameter.

In a parallel job this simple counter will create duplicate values on each node as the transformer is split into parallel instances. It can be turned into a unique counter by using special parallel macros.
  1. Create a stage variable for the counter, e.g. svCounter.
  2. On the Stage Properties form, set the Initial Value of the stage variable to "@PARTITIONNUM - @NUMPARTITIONS + 1".
  3. Set the derivation of the stage variable to "svCounter + @NUMPARTITIONS". You can embed this in an IF statement if it is a conditional counter.
Each instance will start at a different non-positive number (for example, -3, -2, -1 and 0 across four partitions). When the counter is incremented, each instance increments by the number of partitions, e.g. 4. This gives a sequence in instance 1 of 1, 5, 9, 13...; instance 2 gives 2, 6, 10, 14...; and so on.

Remember, this method only produces a contiguous sequence if your data is evenly balanced, i.e. an equal number of rows goes through each partition. Alternative syntax is:
@INROWNUM * @NUMPARTITIONS + @PARTITIONNUM
or, to start the sequence at 1:
(@NUMPARTITIONS * @INROWNUM) + @PARTITIONNUM - @NUMPARTITIONS + 1

Example (2 partitions):
2*1+0-2+1=1
2*1+1-2+1=2
2*2+0-2+1=3
2*2+1-2+1=4
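Purely to sanity-check that arithmetic, the same calculation can be sketched in SQL, assuming a database that accepts VALUES in the FROM clause (e.g. PostgreSQL); the staged_rows column names are made up:

-- 2 partitions; partition_num is 0-based, row_num is 1-based within the partition
SELECT partition_num,
       row_num,
       (2 * row_num) + partition_num - 2 + 1 AS surrogate_key
FROM   (VALUES (0, 1), (1, 1), (0, 2), (1, 2))
       AS staged_rows (partition_num, row_num);
-- yields surrogate keys 1, 2, 3 and 4, matching the worked example above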

Or

Sequence numbers can also be generated in DataStage using the following routines:
- KeyMgtGetNextVal
- KeyMgtGetNextValConn

Or

NextSurrogateKey()