Monday, September 8, 2014

Column Export Stage:

Columns to be exported:


All the columns will be exported into one column (in this case, Total):

ETL Job Design Standards


When using an off-the-shelf ETL tool, the principles of software development do not change: we want our code to be reusable, robust, flexible, and manageable. To assist development, a set of best practices should be created for the implementation to follow. Failure to implement these practices usually results in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.
Although these standards are listed as taking place in ETL Physical Design, ideally they should be established before the prototype if possible. Once established, they can be reused for future increments and only need to be reviewed. 
Listed below are some standard best practice categories that should be identified on a typical project. 

  • Naming Conventions that will be used across the ETL integration environment.
  • Release Management: The ETL version control approach that will be used; including version control within the tool itself.
  • Environments: How the ETL environment will be physically deployed in development, testing and production. This will generally be covered in the Solution Architecture.
  • Failover and Recovery: the strategy for handling load failures, including recommendations on whether milestone points and staging will be required for restarts.
  • Error Handling: proposed standards for error trapping of jobs. This should be at a standards level, with detail of the design covered explicitly in a separate section of the physical design.
  • Process Reporting: status and row counts of jobs should be retrieved for accurate process reporting.
  • Notification: the manner in which information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.
  • Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications.
  • Optimization: Standards for improving performance, such as parallelism or hash files. The more detailed design aspects of this approach are covered in a separate section of the physical design.
  • Reusability: Standards around simplified design and use of shared components.
  • Metadata Management: Standards around metadata management as they apply to the ETL design.
Listed below are some of the major standards that apply to each of these categories.
Contents
  • 1 Naming Conventions
    • 1.1 Job Naming
    • 1.2 Stage Naming
    • 1.3 Link Naming
    • 1.4 Database Action Types
  • 2 Parameter Management Standards
  • 3 Performance Optimization Design Standards
  • 4 Reuse Standards for Common Jobs
  • 5 Data Sourcing Standards
  • 6 Data Loading Standards
  • 7 Exception Handling Standards
  • 8 Process Reporting and Job Statistics Standards
  • 9 Notification Standards
Naming Conventions
There are a number of types of naming conventions to be used across the ETL environment. ETL naming conventions are important for giving all projects a consistent look and feel. A naming convention makes metadata reporting more successful by making it easy to determine data lineage and to identify ETL stages within metadata reports and job diagrams.
Typically ETL is executed as a set of jobs, each job processing a single source data entity and writing it to one or more output entities. A job is made up of stages and links. A stage carries out an action on data and a link transfers the data to the next stage.
Below is a suggested set of naming standards. Vendor-specific considerations could dictate variations from this set.
Job Naming
The job name uses underscores to identify different labels to describe the job. The following job naming template shows all the types of labels that can build a job name:
JobType_SourceSystem_TargetSystem_SourceEntity_TargetEntity_Action
The number of labels used depends on the specific requirements of the project and the nature of the particular job. 
  • JobType indicates the type of job, which depends on the ETL tool being used. Some example job types include Server, Parallel and Sequence. In this instance the job types can be abbreviated to prefixes such as ser_, par_ and seq_.
  • SourceSystem and TargetSystem indicate which database or application or database type owns the source or target entity. This is typically a code or abbreviation that uniquely identifies the system. These are optional labels and are usually included to make job names unique across all folders and projects in an enterprise.
  • SourceEntity is a plain English description of what is being extracted. If it is a table or file name, the underscores can be removed to form an unbroken entity name. If the source table has a technical encoded name, the job name should describe it more meaningfully.
  • TargetEntity is optional and is only used if one type of data entity is output from the job. When the ETL job splits data and writes to different tables, this label becomes misleading.
  • Action is used for jobs that write to databases and describes the action of the database write. The Action label is chosen from the list of Database Action Types below.
Fully qualified job name examples where the job name identifies the transition between systems:
  • par_sap_staging_customers
  • par_sap_staging_sales
  • par_staging_ods_customers_insupd
  • par_staging_ods_customers_ldr
Entity-only job name examples where the name identifies what entity transformation is occurring:
  • customers_customertemp
  • par_customers
  • customers_customerhistory
  • customers_insupd
  • customers_ldr
In these examples the name of the project and the name of the folder the job resides in indicate which source and target systems are being affected. For example, the folder might be named SAP to Staging Loads.
Stage Naming
The stage name consists of a prefix that identifies the stage type followed by a description of the stage. 
  • The prefix is the first two letters of the stage type or the first two initials of the stage type if multiple words occur.
  • For source and target stages the stage name includes the name of the table or file being used.
  • For transformation stages the stage name includes the primary objective of the stage or an important feature of the stage.
Link Naming
Links are named after the content of the data flowing down the link. For links that write to a data target, a suffix taken from the Database Action Types below indicates the type of write action.
Database Action Types
This list shows the abbreviations that describe an action against a target table or file. These abbreviations are used in job names and link names where appropriate.
  • Ins - Insert
  • Upd - Update
  • Ups - Upsert, performs either an update or an insert
  • Del - Delete
  • Aug - Augment
  • App - Append
  • Ldr - Database Load
A combination of action types can be included in a name if they are performed in different stages, e.g., Customer_InsUpd. 
 
Parameter Management Standards
This section defines standards to manage job parameters across environments. Jobs should use parameters liberally to avoid hard coding as much as possible. Some categories of parameters include: 
  • Environmental parameters, such as directory names, file names, etc.
  • Database connection parameters
  • Notification email addresses
  • Processing options, such as degree of parallelism

The purposes of Parameter Management are: 
  • Each environment’s staging files, parameter lists, etc. are isolated from other environments.
  • Components can be migrated between environments (Development, Test, Production, etc.) without requiring any modifications.
  • Environmental values, such as directory names and database connection parameters, can be changed with minimal effort.
Parameters must be stored in a manner which is easy to maintain yet easy to protect from inadvertent or malicious modification. Routines are created to read the parameters and set them for job executions.
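As a minimal sketch of such a routine, the Python snippet below assumes one parameter file per environment (e.g. config/dev.json, config/prod.json) kept outside the job code; the file names and parameter keys are illustrative assumptions only, not a prescribed layout.

import json
import os

def load_parameters(environment=None):
    """Read the parameter set for the requested environment."""
    env = environment or os.environ.get("ETL_ENV", "dev")
    with open(os.path.join("config", f"{env}.json")) as f:
        return json.load(f)

# A dev parameter file might contain:
#   {"staging_dir": "/data/dev/staging",
#    "db_dsn": "DSN=DWH_DEV",
#    "notify_email": "etl-admin@example.com",
#    "parallel_degree": 2}
# so the same job can run unchanged in test or production simply by pointing
# ETL_ENV at a different file.

The point of the design is that nothing environment-specific is hard coded in the job itself; only the parameter file changes between environments.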
Performance Optimization Design Standards
Performance Optimization defines standards for optimizing performance, such as using parallelism or in-memory lookup files. These design aspects are usually vendor-specific.
Reuse Standards for Common Jobs
Reuse standards define the approach for using shared components to simplify design. Identification of common jobs during physical design is the next iteration on the logical design task of identifying common jobs. As we move to physical design, opportunities for re-use will become more apparent. At this stage, common job opportunities should be identified and the team should be made aware of their capabilities.
Data Sourcing Standards
This section defines standards related to Data Sourcing, which involves reading data from a source file or database table or collecting data from an API or messaging queue. For database sources, it can involve a join query which combines several tables to provide a flattened source. For example, when the data source is a database table the following is recommended:
  • Try to filter out rows that are not required. Database SQL filters can be very efficient and reduce the volume of data being brought onto the ETL server.
  • Where table joins and sorts are appropriate in a source query, it may be more efficient to have the database server do this processing rather than the ETL server.
  • Do not alter the metadata of the source data during this phase. Column renames and column functions in a sourcing ETL statement will work, but this technique can hide the derivations from metadata management and reporting and break the chain of data lineage.
When the data source is a text file consider the following: 
  • Comma separated files can be unreliable where the source data contains free text fields. Data entry operators can add commas, quotes and even carriage-return characters into these fields, which disrupts the formatting of the file; the sketch after this list shows how proper quoting copes with such fields.
  • Complex flat files from a mainframe usually require a definition file in order to be readable by ETL tools. Files from COBOL applications are defined by a COBOL definition file. ETL tools use the definition file to determine the formatting of the file.
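To illustrate the quoting point, here is a small Python sketch (not DataStage code) showing how a properly quoted delimited file survives embedded commas, quotes and line breaks in a free-text field; the column names are made up for illustration.

import csv
import io

raw = io.StringIO()
writer = csv.writer(raw, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["id", "comment"])
writer.writerow([1, 'Customer said "call me, tomorrow"\nsecond line'])

raw.seek(0)
for row in csv.reader(raw):
    print(row)   # the free-text field comes back intact as a single column

A naive split on commas would have broken that second record into several fields and even several lines, which is exactly the "row leakage" risk the standards are trying to avoid.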
Data Loading Standards
Data Loading Standards define a common approach for loading data into the target environment, covering aspects that impact performance, data integrity and error handling. For database targets there are multiple types of write actions:
  • Insert/Append
  • Update
  • Insert or Update
  • Update or Insert
  • Delete
Usually the stage also has options to clear or truncate the table prior to delivery. The order of insert/update versus update/insert is important from a performance point of view: for an insert/update action the job attempts to insert the row first and, if the row is already present, performs an update instead. Which order is better depends on whether most of the incoming data is already present in the target. Database-specific options could include things like transaction size, truncation before insert, etc. For non-database targets the standards for file, messaging, API or other outputs are defined.
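As a rough illustration of that ordering choice, here is a small Python sketch using sqlite3; the table and column names are assumptions, not part of any particular target design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def insert_update(row):
    """Insert first; fall back to update if the key already exists.
    Efficient when most incoming rows are new."""
    try:
        conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        conn.execute("UPDATE customers SET name = ? WHERE id = ?", (row[1], row[0]))

def update_insert(row):
    """Update first; insert only if no row was touched.
    Efficient when most incoming rows already exist in the target."""
    cur = conn.execute("UPDATE customers SET name = ? WHERE id = ?", (row[1], row[0]))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", row)

for row in [(1, "EILEEN"), (2, "EVA"), (1, "EILEEN W.")]:
    insert_update(row)
conn.commit()
print(conn.execute("SELECT * FROM customers").fetchall())  # [(1, 'EILEEN W.'), (2, 'EVA')]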
Exception Handling Standards
This section outlines standards for the common routines and procedures used to trap and handle errors and to report process statistics. The objectives are to: 
  • Find all problems with a row, not just the first problem detected.
  • Avoid row leakage, where rows are dropped or rejected without notification.
  • Report all problems.
  • Report process statistics.
  • Interface with other processes which work to resolve exception issues.
It also defines what error detection mechanisms are required, and what is to be done when an exception is detected.
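A minimal Python sketch of the "find all problems with a row" objective is shown below; the field names and validation rules are hypothetical and only illustrate the pattern of collecting every exception before deciding to accept or reject a row.

def validate_row(row):
    """Return every problem found on the row, not just the first one."""
    problems = []
    if not row.get("FIRSTNME"):
        problems.append("FIRSTNME is missing")
    try:
        if float(row.get("SALARY", "")) < 0:
            problems.append("SALARY is negative")
    except ValueError:
        problems.append("SALARY is not numeric")
    if len(row.get("WORKDEPT", "")) != 3:
        problems.append("WORKDEPT is not a 3-character code")
    return problems

rows = [
    {"FIRSTNME": "EILEEN", "SALARY": "0089750.00", "WORKDEPT": "E11"},
    {"FIRSTNME": "", "SALARY": "abc", "WORKDEPT": "E1"},
]

accepted, rejected = [], []
for row in rows:
    problems = validate_row(row)
    if problems:
        rejected.append((row, problems))   # every problem is reported, nothing is dropped silently
    else:
        accepted.append(row)

print(f"accepted={len(accepted)} rejected={len(rejected)}")  # counts guard against row leakage
for row, problems in rejected:
    print(row, problems)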
Process Reporting and Job Statistics Standards
Process Reporting Standards define how status and row counts of jobs are to be retrieved, formatted and stored for accurate exception and process reporting.
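For example, a very simple audit routine might append one status line per job run; the job name, counts and CSV audit file below are assumptions for illustration only.

import csv
import datetime

def record_run(job_name, status, rows_in, rows_out, rows_rejected,
               audit_file="etl_audit_log.csv"):
    """Append one audit record per job run for later process reporting."""
    with open(audit_file, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            job_name, status, rows_in, rows_out, rows_rejected,
        ])

record_run("par_sap_staging_customers", "FINISHED", rows_in=100, rows_out=98, rows_rejected=2)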
Notification Standards
Notification Standards define how information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.
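One possible form of notification is a simple email on completion or failure; the sketch below assumes an SMTP relay and placeholder addresses, and is only one of many ways sites deliver these alerts (schedulers and operations consoles are common alternatives).

import smtplib
from email.message import EmailMessage

def notify(job_name, status, detail, to_addr="etl-admin@example.com"):
    """Send a run-status email to the administrator (addresses are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL job {job_name}: {status}"
    msg["From"] = "etl-noreply@example.com"
    msg["To"] = to_addr
    msg.set_content(detail)
    with smtplib.SMTP("mailhost.example.com") as smtp:   # placeholder mail relay
        smtp.send_message(msg)

# notify("par_sap_staging_customers", "ABORTED", "Job failed at 02:14 with 2 rejected rows.")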
 

Scenario: Get the max salary from data file ( Seq file )


We know that the MAX function is used to get the maximum value in a column of a table, but here you have to design a DataStage job which gets the maximum value from a sequential file.



Input Seq File :


FIRSTNME,SALARY,WORKDEPT

EILEEN, 0089750.00,E11
EVA, 0096170.00,D21
JOHN, 0080175.00,E01
CHRISTINE, 0152750.00,A00
SALLY, 0098250.00,C01
IRVING, 0072250.00,D11
THEODORE, 0086150.00,E21
MICHAEL, 0094250.00,B01



Output File
0152750.00


Solution Design :

a) Job Design :  


   Below is the design which can achieve the output we need. In this design, we read the data from the flat file, generate a dummy column in the Column Generator stage, aggregate on that column, and send the output to a sequential file.






b) Column Generator Stage :
      In the Column Generator stage, we will generate a column DUMMY having the value 'X' for all the rows which we are reading from the sequential file.

To generate the value 'X' for all rows, we need to change the extended metadata for the column DUMMY: set the generator to the Cycle algorithm with the value 'X'.

Map the SALARY and DUMMY columns to the output of the Column Generator stage.
c) Aggregator Stage :

In the Aggregator stage, we group on the DUMMY column, calculate the MAX of SALARY,
and map the Max Salary to the output.
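For reference, here is a rough Python analogue of this flow (read the file, add the DUMMY grouping column, take the maximum salary); the input file name employees.csv is just an assumption.

import csv

max_salary = None
with open("employees.csv", newline="") as f:
    for row in csv.DictReader(f):        # header: FIRSTNME,SALARY,WORKDEPT
        row["DUMMY"] = "X"               # the generated grouping column
        salary = float(row["SALARY"])
        if max_salary is None or salary > max_salary:
            max_salary = salary

with open("max_salary.out", "w") as out:
    out.write(f"{max_salary:010.2f}\n")  # e.g. 0152750.00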

Peek Stage


The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.
The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.

1. Design :

We are using the design below to demonstrate the functionality of the Peek stage in DataStage. The job design has three stages: a Row Generator as the data source, generating the dummy data for the job (we could use any data source, whether a database, dataset or flat file); a Transformer, used here simply to create negative values from the source data; and a Peek stage for peeking :-) 


2. Row Generator Stage Properties :-

We are generating 100 records via the Row Generator stage. Just set the Number of Records property to 100 as shown below,
and define the metadata.

3. Transformer Stage Properties :-

Whatever data has been generated by the source, we simply negate it by multiplying by -1. We don't really need this stage here, but for the sake of doing something I have used it. :-) 

4. PEEK Stage Properties :-

In English, the word "peek" means "to glance quickly". The Peek stage does the same in DataStage: we use this stage for debugging and for looking at the intermediate data when needed.
In our scenario, we have used the settings below for our Peek stage; a rough analogue of what they control is sketched after the list.


- ROWS -- Here we define whether we need all rows or only some rows. The two properties are mutually exclusive: if we choose all rows (All Records = True), the Number of Records property is not available, and vice versa.

- COLUMNS -- Similarly, we define whether we need all or only some columns. If "Peek All Input Columns" is set to False, we need to define which columns we want in the output.

- PARTITIONS -- The same idea applies to partitions.

- OPTIONS -- Here we can select where we want the output (a file or the DS job log) and whether we need the column names along with the data.
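As a rough Python sketch of the idea (an illustration only, not how DataStage implements it), a peek is just a tap that prints the first N records, optionally with column names, while letting every row flow on to the output:

def peek(rows, number_of_records=10, columns=None, show_column_names=True):
    """Print the first few records for debugging, then pass every row downstream."""
    for i, row in enumerate(rows):
        if i < number_of_records:
            shown = {k: v for k, v in row.items() if columns is None or k in columns}
            if show_column_names:
                print(" ".join(f"{k}:{v}" for k, v in shown.items()))
            else:
                print(" ".join(str(v) for v in shown.values()))
        yield row                      # the data still flows to the output link

data = [{"id": i, "value": -i} for i in range(100)]
downstream = list(peek(data, number_of_records=3))   # peeks 3 rows, passes all 100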


5. PEEK Stage Output

Below I have attached a screenshot from the DS job log.

Double-click any entry and voilà :), you can see the data in the job log. For our design, we are generating negative integer values, which we can see in the output.

 There is a lot more to the Peek stage, which we will cover in upcoming posts :-). Till then, keep learning and sharing.

Scenario: To get the Unique and Duplicates values from Input Data

Input :

There is an input file which contains duplicate data. Suppose:


13
22
95
37
78
87
29
33
33
13
12
87
21
32
13

In this file :
Unique values are : 22 95 37 78 29 12 21 32
Duplicate values are  : 13 33 87


Now, we need three kinds of outputs:




Job1:

We need 2 o/p files:
o/p1 --> contains the unique values
o/p2 --> contains the duplicate values (each once), i.e. 13 33 87
DataStage Scenario - Design 2 - job1


Solution Design :


a) Job Design :

Below is the design which can achieve the output we need. Here, we read the sequential file as input, then pass the data through Aggregator and Filter stages to achieve the output.
 




b) Aggregator Stage Properties

The input data contains only one column, "No". In the Aggregator stage, we group the data on the "No" column and count the rows for each key (No). 


When we use the "Count Rows" aggregation type, it generates a new column which contains the count for each key (No). Here we have named this column "count" and assigned it to the output as below.



c) Filter Stage Properties

In the Filter stage, we put two where conditions, count=1 and count>1, and assign a different output file to each condition.


Assign the data (column No) to the output tab.




d) Output File

We get two outputs from the job:

i) rows where count=1 (the unique values in the input)
ii) rows where count>1 (the duplicate values in the input)
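A rough Python analogue of this job (the Aggregator's count followed by the Filter's split), using the sample input above:

from collections import Counter

values = [13, 22, 95, 37, 78, 87, 29, 33, 33, 13, 12, 87, 21, 32, 13]
counts = Counter(values)                                # Aggregator: count per key

uniques    = [v for v, c in counts.items() if c == 1]   # o/p1: count = 1
duplicates = [v for v, c in counts.items() if c > 1]    # o/p2: count > 1, each once

print(uniques)     # [22, 95, 37, 78, 29, 12, 21, 32]
print(duplicates)  # [13, 87, 33]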


 Job2

We need 2 o/p files:
o/p1 --> contains the unique values
o/p2 --> contains the duplicate values (as many times as they appear), i.e. 13 13 13 33 33 87 87
DataStage Scenario - Design2 - job2

Solution Design :


a) Job Design :
In this job design, we use Copy, Aggregator, Filter and Join stages to get the output.






b) Copy Stage Properties :
Simply map the input to both output links: the first link goes to the Aggregator and the second link goes to the Join stage.


c) Aggregator Stage Properties :
The input data contains only one column, "No". In the Aggregator stage, we group the data on the "No" column and count the rows for each key (No).


d) Filter Stage Properties : 
In the Filter stage, we put two where conditions, count=1 and count>1, and assign a different link to each condition.



From the Filter stage, the first link (count=1) maps to an output file (which contains the unique records),
and the second link maps to the Join stage.

e) Join Stage Properties :


 In the Join stage, we join both inputs on the key column (No).


The output from the Join maps to the second output file, which contains all the duplicates as they occur in the input.
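A rough Python analogue of this job (count per key, then join the counts back to the detail rows so duplicates keep every occurrence):

from collections import Counter

values = [13, 22, 95, 37, 78, 87, 29, 33, 33, 13, 12, 87, 21, 32, 13]
counts = Counter(values)                                 # the Aggregator branch

uniques        = [v for v in values if counts[v] == 1]   # o/p1
duplicates_all = [v for v in values if counts[v] > 1]    # o/p2, the "join back" branch

print(uniques)         # [22, 95, 37, 78, 29, 12, 21, 32]
print(duplicates_all)  # [13, 87, 33, 33, 13, 87, 13]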


Job3

We need 2 o/p files:
o/p1 --> contains all values once each, i.e. 22 95 37 78 29 12 21 32 13 33 87
o/p2 --> contains the remaining values, i.e. 13 13 33 87
DataStage Scenario - Design2 - job3
 
 Solution Design :


a) Job Design :




b) Sort Stage Properties :
In the Sort stage, we sort the data on the key column (No) and generate the key change column.


c) Filter Stage Properties :
Filter the data on the "Change" column generated in the Sort stage.



In the Filter stage, the condition (change=1) gives you all values (each once) from the input, and the condition (change=0) gives all the remaining duplicate occurrences from the input.
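A rough Python analogue of this job (sort, flag the first occurrence of each key, and split on that flag):

values = [13, 22, 95, 37, 78, 87, 29, 33, 33, 13, 12, 87, 21, 32, 13]

first_occurrences, remaining = [], []
previous = object()                      # sentinel so the first row counts as a key change
for v in sorted(values):                 # the Sort stage
    if v != previous:                    # change = 1: first occurrence of this key
        first_occurrences.append(v)
    else:                                # change = 0: a repeated occurrence
        remaining.append(v)
    previous = v

print(first_occurrences)  # o/p1: each value once
print(remaining)          # o/p2: [13, 13, 33, 87]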

Scenario: Get the next column value in current row


Input file :

Sq, No
1,1000
2,2200
3,3030
4,5600


Output File  :

Sq, No,No2
1,1000,2200
2,2200,3030
3,3030,5600
4,5600,NULL


Solution Design :

a) Job Design :


Below is the design which can achieve the output we need. Here, we read the sequential file as input, then pass the data through a Sort and a Transformer stage to achieve the output.



 











b) Sort Stage Properties
In the Sort stage, we sort the data based on the column “Sq” in descending order.

 


c) Transformer Stage Properties
Here, we took 2 stage variable : StageVar, StageVar1
and their derivations are -
    StageVar1 = StageVar
    DSLink6.No = StageVar1


 














d) Output File
Before capturing the data into the sequential file, we will sort the data again in ascending order to get the output as we need it.
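Here is a rough Python equivalent of the whole design (sort descending, carry the previous row's No forward, then sort back ascending), using the sample input above:

rows = [(1, 1000), (2, 2200), (3, 3030), (4, 5600)]   # (Sq, No)

result = []
previous_no = None                        # plays the role of the stage variables
for sq, no in sorted(rows, key=lambda r: r[0], reverse=True):
    result.append((sq, no, previous_no))  # No2 = previous row's No in descending order
    previous_no = no

for sq, no, no2 in sorted(result):        # re-sort ascending for the output file
    print(f"{sq},{no},{no2 if no2 is not None else 'NULL'}")
# 1,1000,2200
# 2,2200,3030
# 3,3030,5600
# 4,5600,NULL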



Dummy Data Generation using Row Generator



By default the Row Generator stage runs sequentially, generating data in a single partition. You can, however, configure it to run in parallel and to generate meaningful data.



a) Job Design :








b) RowGenerator Stage :

- Double click on Stage
- Fill in the Properties tab: set the number of rows you want to generate (here I entered 50).





- Now click on the Columns tab and define the columns you need in the output file: column name, type, length, etc.



-- Now here comes our magic step :-)  We will edit the metadata of the column.


-- When you double-click on column 1 (Name), this window will open. As you can see, we can edit a lot of a column's metadata here.


-- We want to generate some meaningful data, so we use the Generator properties here.
Here we selected the Cycle algorithm (for data generation), which repeats the data from start to end, and we have passed in some values (names). 


-- We follow the same steps for the second column (Salary).




- Then click OK; we are done with the Row Generator stage.

c) Seq File Stage

- Define the Sequential File stage properties here: output file location, delimiter, column names, quotes, etc.
- Keep all the other tabs as they are.



- Now save the job design, compile and run.

Output File will look like below -

 
Here, as we can see, Name repeats in a cycle again and again, and the Salary column follows the same pattern. So we have some meaningful data for our dummy job test.
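For comparison, here is a rough Python sketch of the same Cycle idea (each column repeats its list of values for as many rows as requested); the value lists are made-up examples:

from itertools import cycle, islice

names    = cycle(["EILEEN", "EVA", "JOHN", "CHRISTINE", "SALLY"])
salaries = cycle([89750.00, 96170.00, 80175.00, 152750.00, 98250.00])

rows = list(islice(zip(names, salaries), 50))   # 50 generated rows, values cycling

for name, salary in rows[:7]:
    print(f"{name},{salary:010.2f}")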