Friday, December 6, 2013

InfoSphere DataStage job status codes returned by dsjob




Equ DSJS.RUNNING To 0        - This is the only status that means the job is actually running
Equ DSJS.RUNOK To 1          - Job finished a normal run with no warnings
Equ DSJS.RUNWARN To 2        - Job finished a normal run with warnings
Equ DSJS.RUNFAILED To 3      - Job finished a normal run with a fatal error
Equ DSJS.QUEUED To 4         - Job queued waiting for resource allocation
Equ DSJS.VALOK To 11         - Job finished a validation run with no warnings
Equ DSJS.VALWARN To 12       - Job finished a validation run with warnings
Equ DSJS.VALFAILED To 13     - Job failed a validation run
Equ DSJS.RESET To 21         - Job finished a reset run
Equ DSJS.CRASHED To 96       - Job has crashed
Equ DSJS.STOPPED To 97       - Job was stopped by operator intervention (cannot tell run type)
Equ DSJS.NOTRUNNABLE To 98   - Job has not been compiled
Equ DSJS.NOTRUNNING To 99    - Any other status
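A quick way to see these codes from the command line is the dsjob client itself. A minimal shell sketch, assuming dsjob is on the PATH and using placeholder project/job names:

dsjob -jobinfo MyProject MyJob
# the "Job Status" line of the report shows the current status name and its numeric code

dsjob -run -jobstatus MyProject MyJob
echo "Exit status: $?"     # -jobstatus waits for the run to finish; the exit status is derived from the job's finishing status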

Thursday, October 17, 2013

Warning: Ignoring duplicate entry at table record; no further warnings will be issued for this table



This warning is seen when the reference table used for the lookup contains multiple records with the same key column value. By default, the Lookup stage fetches the first record that matches and throws this warning, since it does not know which of the duplicate values is the correct one to return from the reference.
To solve this, you can select the reference link from the “Multiple rows returned from link” drop-down in the Lookup stage constraints. In this case the Lookup will return multiple rows for each input row that is matched.

Alternatively, use some method to remove the duplicate rows with the same key columns from the reference data, according to the business requirements.
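For example, if the reference data comes from a flat file, a small shell step can de-duplicate it before the lookup. This is only a sketch, assuming a comma-delimited file whose first column is the lookup key (file names are placeholders):

awk -F',' '!seen[$1]++' reference.csv > reference_dedup.csv
# keeps only the first row seen for each key value, preserving the input order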

Monday, June 24, 2013

How to replace ^M character in VI editor/sed?

^M is the DOS carriage-return character, which shows up in UNIX files when they are uploaded from a Windows file system in ASCII format.
To remove it, open your file in the vi editor and type
:%s/(ctrl-v)(ctrl-m)//g
and press the Enter key.
Important!! Press the (Ctrl-V)(Ctrl-M) combination to enter the ^M character; don't type “^” followed by “M”.
If anything goes wrong, exit without saving using :q!.

Also, your substitution command may catch more ^M characters than necessary; your file may contain a valid ^M in the middle of a line of code, for example. Use the following command instead to remove only those at the very end of lines:
:%s/(ctrl-v)(ctrl-m)*$//g


Using sed (here too, the ^M must be entered by pressing Ctrl-V followed by Ctrl-M):

sed -e "s/^M//g" old_file_name > new_file_name
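If you would rather not type the literal control character, the same cleanup can be done with tr or with GNU sed's \r escape (a sketch; file names are placeholders, and note that tr strips every carriage return, including any in the middle of a line):

tr -d '\r' < old_file_name > new_file_name
sed -e 's/\r$//' old_file_name > new_file_name     # GNU sed: removes only trailing carriage returns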

Thursday, June 6, 2013

How to convert a single row into multiple rows?

Below is our input data:

City   State   Name1   Name2   Name3
xy     FGH     Sam     Dean    Winchester

We are going to read the above data from a sequential file and transform it to look like this:

City   State   Name
xy     FGH     Sam
xy     FGH     Dean
xy     FGH     Winchester
So let's get to the job design.
Step 1: Read the input data.
Step 2: Logic for looping in the Transformer.
In the Transformer stage you will see a box called Loop Condition; this is where we control the loop variables. Expanding the Loop Condition box exposes the Loop While constraint and the loop variables.
The Loop While constraint is used to implement functionality similar to a WHILE statement in programming, so, just as with a while statement, we need a condition that determines how many times the loop is executed.
The @ITERATION system variable was introduced for this purpose. In our example we need to loop over each input row 3 times to get the name columns onto subsequent rows.
So let's set the Loop While condition to @ITERATION <= 3.
Now create a new loop variable with the name LoopName.
The derivation for this loop variable should be:
If @ITERATION=1 Then DSLink2.Name1 Else If @ITERATION=2 Then DSLink2.Name2 Else DSLink2.Name3

Now all we have to do is map this loop variable LoopName to our output column Name.

Let's map the output to a Sequential File stage and see if the output is as desired.

After running the job, we did a View Data on the output stage, and the data is as desired.

Making some tweaks to the above design, we can implement things like:
  • Adding new rows to existing rows
  • Splitting data in a single column into multiple rows, and many more such things
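As a side note, if you want to sanity-check the looping logic outside DataStage, a rough shell equivalent of the same split looks like this (purely an illustration, assuming a comma-delimited input file named input.csv laid out as City,State,Name1,Name2,Name3):

awk -F',' -v OFS=',' '{ for (i = 3; i <= 5; i++) print $1, $2, $i }' input.csv
# prints one output row per name column, repeating City and State on each row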

How to perform aggregation using a Transformer

Input: Below is the sample data for three students: their marks in two subjects, the corresponding grades, and the dates on which they were graded.

Output: Our requirement is to sum the marks obtained by each student in each subject and display the totals in the output.
Step 1: Once we have read the data from the source, we have to sort the data on our key field. In our example the key field is the student name.
Once the data is sorted, we have to implement the looping logic in the Transformer to calculate the aggregate values.
Before we get into the details, we need to know a few functions:
    • SaveInputRecord(): This function saves the entire input record in the cache and returns the number of records currently stored in the cache.
    • LastRowInGroup(input-column): When an input key column is passed to this function, it returns 1 when the last row for that key value is found and 0 in all other cases.
      To give an example, let's say our input is:
      Student   Code
      ABC       1
      ABC       2
      ABC       3
      DEF       2
      For the first two ABC records the function returns 0, but for the last ABC record (ABC, 3) it returns 1, indicating that it is the last record of the group where the student name is “ABC”.
    • GetSavedInputRecord(): This function returns the record that was stored in the cache by SaveInputRecord().

Back to the task at hand, we need 7 stage variables to perform the aggregation operation successfully.
1. LoopNumber: Holds the number of records currently stored in the cache for a student
2. LoopBreak: Identifies the last record for a particular student
3. SumSub1: Holds the final sum of marks for each student in subject 1
4. IntermediateSumSub1: Holds the running sum of marks until the final record is evaluated for a student (subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the number of times the loop has to run
We also need to define the loop variables so that the loop executes for a student until that student's final record is identified.
To explain the above use of variables:
When the first record arrives at the stage variables, it is saved in the cache by the SaveInputRecord() call in the first stage variable, LoopNumber.
The second stage variable checks whether this is the last record for this particular student; it stores 1 if it is and 0 otherwise.
The third, SumSub1, is evaluated only when the record is the last record for the student.
The fourth, IntermediateSumSub1, is evaluated when the input record is not the last record, thereby accumulating the intermediate sum of the subject for the student.
The fifth and sixth work the same way as the third and fourth stage variables, but for subject 2.
The seventh starts with the value 1 for the first record and, if the second record fetched belongs to the same student, changes to 2, and so on.
The loop variable is executed until the final record for a student is identified, and the GetSavedInputRecord() function makes sure the current record is processed before the next record is brought in for processing.
What the above logic does is send the sum of marks scored by each student to the output for every input record, but our requirement is to have only one record per student in the output.
So we simply add a Remove Duplicates stage with the student name as the key.
Run the job and the output will match our initial expectation.
We have successfully implemented AGGREGATION using the TRANSFORMER stage.
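To cross-check the Transformer output, the same aggregation can be reproduced with a quick shell one-liner (only a sketch, assuming a comma-delimited file marks.csv whose first three columns are student name, subject 1 marks and subject 2 marks):

awk -F',' -v OFS=',' '{ sum1[$1] += $2; sum2[$1] += $3 } END { for (s in sum1) print s, sum1[s], sum2[s] }' marks.csv
# one output row per student, with the summed marks for each subject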

Thursday, May 23, 2013

Star vs Snowflake Schemas



First Answer: My personal opinion is to use the star by default, but if the product you are using for the business community prefers a snowflake, then I would snowflake it. The major difference between a snowflake and a star is that a snowflake will have multiple tables for a “dimension” while a star has a single table. For example, your company structure might be

Corporate → Region → Department → Store

In a star schema, you would collapse those into a single "store" dimension. In a snowflake, you would keep them apart with the store connecting to the fact.

Second Answer: First of all, some definitions are in order. In a star schema, dimensions that reflect a hierarchy are flattened into a single table. For example, a star schema Geography Dimension would have columns like country, state/province, city and postal code. In the source system, this hierarchy would probably be normalized, with multiple tables and one-to-many relationships.

A snowflake schema does not flatten a hierarchy dimension into a single table.  It would, instead, have two or more tables with a one-to-many relationship.  This is a more normalized structure. For example, one table may have state/province and country columns and a second table would have city and postal code. The table with city and postal code would have a many-to-one relationship to the table with the state/province columns.

There are some good reasons for snowflake dimension tables. One example is a company that has many types of products. Some products have a few attributes, others have many, and the products are very different from each other. The thing to do here is to create a core Product dimension that has the attributes common to all the products, such as product type, manufacturer, brand, product group, etc. Create a separate sub-dimension table for each distinct group of products where each group shares common attributes. The sub-product tables must contain a foreign key to the core Product dimension table.

One of the criticisms of using snowflake dimensions is that it is difficult for some of the multidimensional front-end presentation tools to generate a query on a snowflake dimension. However, you can create a view for each combination of the core product/sub-product dimension tables and give each view a suitably descriptive name (Frozen Food Product, Hardware Product, etc.), and then these tools will have no problem.

Tuesday, May 14, 2013

Performance Tuning in DataStage


1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using hash/sequential files for optimum performance.
2. Tuned the OCI stage's 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better performance of jobs.
6. Removed the data not used from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
11. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate unnecessary records before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
17. Minimize the usage of Transformers (instead use Copy, Modify, Filter, Row Generator).
18. Use SQL code while extracting the data.
19. Handle the nulls.
20. Minimize the warnings.
21. Reduce the number of lookups in a job design.
22. Try not to use more than 20 stages in a job.
23. Use an IPC stage between two passive stages; it reduces processing time.
24. Drop indexes before data loading and recreate them after loading data into the tables.
25. Check the write cache of the hash file. If the same hash file is used for lookup as well as the target, disable this option.
26. If the hash file is used only for lookup, then enable 'Preload to memory'. This will improve the performance. Also check the order of execution of the routines.
27. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
28. Use the 'Preload to memory' option in the hash file output.
29. Use 'Write to cache' in the hash file input.
30. Write into the error tables only after all the Transformer stages.
31. Reduce the width of the input record: remove the columns that you will not use.
32. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
33. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
34. Ideally, if the amount of data to be processed is small, configuration files with a smaller number of nodes should be used, while if the data volume is larger, configuration files with a larger number of nodes should be used.
35. Partitioning should be set in such a way as to have a balanced data flow, i.e. nearly equal partitioning of data should occur and data skew should be minimized.
36. In DataStage jobs where a high volume of data is processed, virtual memory settings for the job should be optimized. Jobs often abort in cases where a single lookup has multiple reference links; this happens due to low temp memory space. In such jobs $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values (see the sketch after this list).
37. Sequential files should be used in the following conditions: when we are reading a flat file (fixed-width or delimited) from the UNIX environment which has been FTPed from some external system, or when some UNIX operation has to be done on the file.
38. Don't use sequential files for intermediate storage between jobs. It causes a performance overhead, as data conversion is needed before writing to and reading from a UNIX file.
39. In order to have faster reading from the stage, the number of readers per node can be increased (the default value is one).
40. Usage of Data Sets results in good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
41. The Lookup stage is faster when the data volume is small. If the reference data volume is large, usage of the Lookup stage should be avoided, as all reference data is pulled into local memory.
42. The sparse lookup type should be chosen only if the primary input data volume is small.
43. Join should be used when the data volume is high. It is a good alternative to the Lookup stage and should be used when handling huge volumes of data.
44. Even though data can be sorted on a link, the Sort stage is used when the data to be sorted is huge. When we sort data on a link (sort/unique option), once the data size goes beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort stage should be used instead of a sort on the link. The Sort stage gives an option to increase the buffer memory used for sorting; this means lower I/O and better performance.
45. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
46. The presence of a Funnel stage reduces the performance of a job; it can increase the time taken by the job by around 30% (from observations). When a Funnel stage is to be used in a large job, it is better to isolate it in a job of its own: write the outputs to Data Sets and funnel them in a new job.
47. The Funnel stage should be run in “continuous” mode, without hindrance.
48. A single job should not be overloaded with stages. Each extra stage put in a job means fewer resources available for every other stage, which directly affects the job's performance. If possible, big jobs having a large number of stages should be logically split into smaller units.
49. Unnecessary column propagation should not be done. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.
50. A most often neglected option is “don't sort if previously sorted” in the Sort stage; set this option to “true”. This improves the Sort stage performance a great deal.
51. In the Transformer stage, “Preserve Sort Order” can be used to maintain the sort order of the data and reduce sorting in the job.
52. Reduce the number of stage variables used.
53. The Copy stage should be used instead of a Transformer for simple operations.
54. The “upsert” works well if the data is sorted on the primary key column of the table being loaded.
55. Don't read from a Sequential File using SAME partitioning.
56. By using the hash file stage we can improve performance. In the hash file stage we can define the read cache size and write cache size; the default size is 128 MB.
57. We can also improve performance on active-to-active links by enabling the row buffer; the default row buffer size is 128 KB.
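As mentioned in point 36, the buffer and monitor environment variables can be raised for memory-hungry jobs. Below is a minimal sketch of exporting them at the environment level; the values are placeholders, and in practice these are usually set as job parameters or at the project level in the Administrator:

export APT_BUFFER_MAXIMUM_MEMORY=10485760    # bytes of in-memory buffering allowed per buffer operator
export APT_MONITOR_SIZE=100000               # number of rows between job monitor progress reports
export APT_MONITOR_TIME=10                   # seconds between job monitor progress reports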