My Datastage Notes: ETL Job Design Standards

When using an off-the-shelf ETL tool, principles for software development do not change: we want our code to be reusable, robust, flexible, and manageable. To assist in the development, a set of best practices should be created for the implementation to follow. Failure to implement these practices usually results in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.

Although these standards are listed as taking place in ETL Physical Design, ideally they are defined before the prototype. Once established, they can be reused for future increments and only need to be reviewed.

Listed below are some standard best practice categories that should be identified on a typical project.

Naming Conventions that will be used across the ETL integration environment.

Release Management: the ETL version control approach that will be used, including version control within the tool itself.

Environments: How the ETL environment will be physically deployed in development, testing and production. This will generally be covered in the Solution Architecture.

Failover and Recovery: the strategy for handling load failures. This will include recommendations on whether milestone points and staging will be required for restarts.

Error Handling: proposed standards for error trapping of jobs. This should be at a standards level, with detail of the design covered explicitly in a separate section of the physical design.

Process Reporting: status and row counts of jobs should be retrieved for accurate process reporting.

Notification: the manner in which information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.

Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications.

Optimization: Standards for improving performance, such as parallelism or hash files. The more detailed design aspects of this approach are covered in a separate section of the physical design.

Reusability: Standards around simplified design and use of shared components.

Metadata Management: Standards around metadata management as they apply to the ETL design.

Listed below are some of the major standards that apply to each of these categories.

Contents

1 Naming Conventions

1.1 Job Naming
1.2 Stage Naming
1.3 Link Naming
1.4 Database Action Types

2 Parameter Management Standards
3 Performance Optimisation Design Standards
4 Reuse Standards for Common Jobs
5 Data Sourcing Standards
6 Data Loading Standards
7 Exception Handling Standards
8 Process Reporting and Job Statistics Standards
9 Notification Standards

Naming Conventions

There are a number of types of naming conventions to be used across the ETL environment. ETL naming conventions are important for giving all projects a consistent look and feel. A naming convention makes metadata reporting more successful by making it easy to determine data lineage and to identify ETL stages within metadata reports and job diagrams.

Typically ETL is executed as a set of jobs, each job processing a single source data entity and writing it to one or more output entities. A job is made up of stages and links. A stage carries out an action on data and a link transfers the data to the next stage.

Below is a suggested set of naming standards. Vendor-specific considerations could dictate variations from this set.

Job Naming

The job name uses underscores to identify different labels to describe the job. The following job naming template shows all the types of labels that can build a job name:

JobType_SourceSystem_TargetSystem_SourceEntity_TargetEntity_Action

The number of labels used depends on the specific requirements of the project and the nature of the particular job.

JobType indicates the type of job; the available types depend on the ETL tool being used. Some example job types include Server, Parallel and Sequence. In this instance the job types can be abbreviated to prefixes such as ser_, par_ and seq_.

SourceSystem and TargetSystem indicate which database or application or database type owns the source or target entity. This is typically a code or abbreviation that uniquely identifies the system. These are optional labels and are usually included to make job names unique across all folders and projects in an enterprise.

SourceEntity is a plain English description of what is being extracted. If it is a table or file name, the underscores can be removed to form an unbroken entity name. If the source table has a technical, encoded name, the job name uses a more readable description instead.

TargetEntity is optional and is only used if the job outputs one type of data entity. When the ETL job splits data and writes to different tables, this label becomes misleading.

Action is used for jobs that write to databases and describes the action of the database write. The Action label is chosen from the list of Database Action Types below.

Fully qualified job name examples where the job name identifies the transition between systems:

par_sap_staging_customers
par_sap_staging_sales
par_staging_ods_customers_insupd
par_staging_ods_customers_ldr

Entity-only job name examples where the name identifies what entity transformation is occurring:

customers_customertemp
par_customers
customers_customerhistory
customers_insupd
customers_ldr

In these examples the name of the project and the name of the folder the job resides in indicate which source and target systems are affected. For example, the folder is named SAP to Staging Loads.
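The job naming template above can be sketched as a small helper. This is only an illustration, not part of any ETL tool: the function name build_job_name and its keyword arguments are hypothetical, and lowercasing every label is an assumption taken from the example names in this post.

```python
from typing import Optional


def build_job_name(
    source_entity: str,
    job_type: Optional[str] = None,
    source_system: Optional[str] = None,
    target_system: Optional[str] = None,
    target_entity: Optional[str] = None,
    action: Optional[str] = None,
) -> str:
    """Join whichever labels are supplied, in template order, with underscores."""

    def unbroken(entity: Optional[str]) -> Optional[str]:
        # Underscores inside a table or file name are removed so the
        # entity reads as a single label within the job name.
        return entity.replace("_", "") if entity else None

    labels = [job_type, source_system, target_system,
              unbroken(source_entity), unbroken(target_entity), action]
    # Optional labels that were not supplied are simply skipped.
    return "_".join(label.lower() for label in labels if label)


print(build_job_name("customers", job_type="par", source_system="staging",
                     target_system="ods", action="insupd"))
print(build_job_name("customers", target_entity="customer_history"))
```

The first call reproduces the fully qualified form used in the examples above, and the second the entity-only form.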

Stage Naming

The stage name consists of a prefix that identifies the stage type followed by a description of the stage.

The prefix is the first two letters of the stage type or the first two initials of the stage type if multiple words occur.
For source and target stages the stage name includes the name of the table or file being used.
For transformation stages the stage name includes the primary objective of the stage or an important feature of the stage.
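As a rough sketch of the prefix rule, the helper below derives a two-character prefix from a stage type. The function names are hypothetical, and the underscore between prefix and description is an assumption, since the standard above does not say how the two parts are joined.

```python
def stage_prefix(stage_type: str) -> str:
    """Return the two-character prefix for a stage type."""
    words = stage_type.split()
    if not words:
        raise ValueError("stage type must not be empty")
    if len(words) == 1:
        # Single word: first two letters, e.g. "Transformer" gives "Tr".
        return words[0][:2]
    # Multiple words: first two initials, e.g. "Sequential File" gives "SF".
    return "".join(word[0] for word in words[:2])


def stage_name(stage_type: str, description: str) -> str:
    # Assumption: prefix and description are joined with an underscore.
    return f"{stage_prefix(stage_type)}_{description}"


print(stage_name("Sequential File", "customers"))
print(stage_name("Transformer", "ValidateCustomer"))
```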

Link Naming

Links are named after the content of the data going down that link. For links that write to a data target, a suffix indicates the type of write action, taken from the Database Action Types below.

Database Action Types

This list shows the abbreviations that describe an action against a target table or file. These abbreviations are used in job names and link names where appropriate.

Ins - Insert
Upd - Update
Ups - Upsert, performs either an update or an insert
Del - Delete
Aug - Augment
App - Append
Ldr - Database Load

A combination of action types can be included in a name if they are performed in different stages, e.g., Customer_InsUpd.
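A combined suffix can be decoded mechanically if every abbreviation is three letters with a leading capital, as in the list above. The sketch below assumes that casing and assumes the action suffix is the last underscore-separated label; the function names are hypothetical.

```python
import re
from typing import List

# Abbreviations copied from the Database Action Types list.
ACTION_TYPES = {
    "Ins": "Insert",
    "Upd": "Update",
    "Ups": "Upsert",
    "Del": "Delete",
    "Aug": "Augment",
    "App": "Append",
    "Ldr": "Database Load",
}


def action_suffix(name: str) -> str:
    """Return the last underscore-separated label of a job or link name."""
    return name.rsplit("_", 1)[-1]


def expand_actions(suffix: str) -> List[str]:
    """Split a suffix such as "InsUpd" into its full action descriptions."""
    codes = re.findall(r"[A-Z][a-z]{2}", suffix)
    # Reject suffixes with leftover characters or unknown abbreviations.
    if not codes or "".join(codes) != suffix:
        raise ValueError(f"not a sequence of action codes: {suffix!r}")
    unknown = [code for code in codes if code not in ACTION_TYPES]
    if unknown:
        raise ValueError(f"unknown action codes: {unknown}")
    return [ACTION_TYPES[code] for code in codes]


print(expand_actions(action_suffix("Customer_InsUpd")))
```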

Parameter Management Standards

This section defines standards to manage job parameters across environments. Jobs should use parameters liberally to avoid hard coding as much as possible. Some categories of parameters include:

Environmental parameters, such as directory names, file names, etc.
Database connection parameters
Notification email addresses
Processing options, such as degree of parallelism

The purposes of Parameter Management are:

Each environment’s staging files, parameter lists, etc. are isolated from other environments.
Components can be migrated between environments (Development, Test, Production, etc.) without requiring any modifications.
Environmental values, such as directory names and database connection parameters, can be changed with minimal effort.

Parameters must be stored in a manner which is easy to maintain yet easy to protect from inadvertent or malicious modification. Routines are created to read the parameters and set them for job executions.
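One possible shape for such a routine is sketched below using an INI-style parameter file read with Python's configparser. The file layout, section names, parameter names and values are all hypothetical; a real project would keep the file (or table) in a protected location per environment and pass the values to the ETL tool's own job parameter mechanism.

```python
import configparser

# Hypothetical parameter file content. Values in [DEFAULT] apply to every
# environment unless an environment section overrides them.
PARAMETER_FILE = """
[DEFAULT]
notify_email = etl-admin@example.com
degree_of_parallelism = 2

[development]
staging_dir = /data/dev/staging
db_connection = DEV_DB

[production]
staging_dir = /data/prod/staging
db_connection = PROD_DB
degree_of_parallelism = 8
"""


def load_parameters(environment: str, text: str = PARAMETER_FILE) -> dict:
    """Return the parameters for one environment as a dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(environment):
        raise KeyError(f"no parameters defined for environment {environment!r}")
    # The section view already merges in the [DEFAULT] values.
    return dict(parser[environment])


params = load_parameters("production")
print(params["staging_dir"], params["degree_of_parallelism"])
```

Because the job only ever asks for parameters by name, the same job design can move from development to production with no modification; only the environment name passed to the routine changes.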

Performance Optimisation Design Standards

Performance Optimisation defines standards for optimising performance, such as using parallelism or in-memory lookup files. These design aspects are usually vendor-specific.

Reuse Standards for Common Jobs

Reuse standards define the approach for using shared components to simplify design. Identification of common jobs during physical design is the next iteration on the logical design task of identifying common jobs. As we move to physical design, opportunities for re-use will become more apparent. At this stage, common job opportunities should be identified and the team should be made aware of their capabilities.

Data Sourcing Standards

This section defines standards related to Data Sourcing, which involves reading data from a source file or database table or collecting data from an API or messaging queue. For database sources, it can involve a join query which combines several tables to provide a flattened source. For example, when the data source is a database table the following is recommended:

Try to filter out rows that are not required. Database SQL filters can be very efficient and reduce the volume of data being brought onto the ETL server.
Where table joins and sorts are appropriate in a source query, it may be more efficient to have the database server do this processing rather than the ETL server.
Do not alter the metadata of the source data during this phase. Column renames and column functions in a sourcing ETL statement work but this technique can hide the derivations from metadata management and reporting and break the chain of data lineage.
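The database sourcing recommendations can be illustrated with a small, self-contained example. It uses an in-memory SQLite database purely as a stand-in for a real source system; the customers table, its columns and the status values are hypothetical.

```python
import sqlite3

# Stand-in source database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(3, "Cy", "ACTIVE"), (1, "Ann", "ACTIVE"), (2, "Bob", "CLOSED")],
)


def extract_active_customers(connection: sqlite3.Connection) -> list:
    # The WHERE clause and ORDER BY run on the database server, so only the
    # required rows reach the ETL side. Columns are selected under their
    # source names, with no renames or functions, to keep lineage visible.
    query = (
        "SELECT customer_id, name, status "
        "FROM customers WHERE status = ? ORDER BY customer_id"
    )
    return connection.execute(query, ("ACTIVE",)).fetchall()


rows = extract_active_customers(conn)
print(rows)
```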

When the data source is a text file consider the following:

Comma separated files can be unreliable where the source data contains free text fields. Data entry operators can add commas, quotes and even carriage-return characters into these fields which disrupts the formatting of the file.
Complex flat files from a mainframe usually require a definition file in order to be readable by ETL tools. Files from COBOL applications are defined by a COBOL definition file. ETL tools use the definition file to determine the formatting of the file.
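The comma-separated file problem is easy to reproduce. In the sketch below the sample data is made up: one free text field contains a comma, quotes and a line break. A CSV-aware reader copes only because the field is correctly quoted, while splitting on line breaks and commas does not. If the source system does not quote consistently, neither approach is reliable, which is the risk described above.

```python
import csv
import io

# Made-up extract: the comment on row 1 contains a comma, doubled quotes
# and an embedded line break inside a quoted field.
raw = 'id,comment\n1,"Called, left a ""message""\nwill retry"\n2,ok\n'

# A CSV-aware reader treats the quoted field as one value.
records = list(csv.reader(io.StringIO(raw, newline="")))

# Naive parsing breaks the same record across lines and columns.
naive = [line.split(",") for line in raw.splitlines()]

print(len(records), "records from csv.reader")
print(len(naive), "lines from naive splitting")
```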

Data Loading Standards

Data Loading Standards define a common approach for loading data into the target environment, covering the choices that affect performance, data integrity and error handling. For database targets there are multiple types of write actions:

Insert/Append
Update
Insert or Update
Update or Insert
Delete

Usually the stage also has options to clear or truncate the table prior to delivery. The choice between insert/update and update/insert is important from a performance point of view. For an insert/update action the job attempts to insert the row first; if the row is already present, it does an update. An update/insert action does the reverse. Which order is faster depends on whether most incoming rows are already present in the target or absent from it. Database-specific options could include things like transaction size, truncation before insert, etc. For non-database targets the standards for file, messaging, API or other output are defined.
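The difference between the two combined write actions can be sketched in a few lines. This is not how any particular ETL stage implements them; it is a minimal illustration against an in-memory SQLite table, with a hypothetical customers table and function names.

```python
import sqlite3


def insert_then_update(conn: sqlite3.Connection, customer_id: int, name: str) -> str:
    """Insert first; fall back to an update if the key already exists."""
    try:
        conn.execute(
            "INSERT INTO customers (customer_id, name) VALUES (?, ?)",
            (customer_id, name),
        )
        return "inserted"
    except sqlite3.IntegrityError:
        conn.execute(
            "UPDATE customers SET name = ? WHERE customer_id = ?",
            (name, customer_id),
        )
        return "updated"


def update_then_insert(conn: sqlite3.Connection, customer_id: int, name: str) -> str:
    """Update first; fall back to an insert if no row was changed."""
    cursor = conn.execute(
        "UPDATE customers SET name = ? WHERE customer_id = ?",
        (name, customer_id),
    )
    if cursor.rowcount == 0:
        conn.execute(
            "INSERT INTO customers (customer_id, name) VALUES (?, ?)",
            (customer_id, name),
        )
        return "inserted"
    return "updated"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
first = insert_then_update(conn, 1, "Ann")    # new key: one statement
second = insert_then_update(conn, 1, "Anne")  # existing key: two statements
third = update_then_insert(conn, 2, "Bob")    # new key: two statements
fourth = update_then_insert(conn, 2, "Rob")   # existing key: one statement
print(first, second, third, fourth)
```

Each function issues one statement in its favourable case and two in the other, which is why a load made up mostly of new rows suits insert/update and a load made up mostly of existing rows suits update/insert.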

Exception Handling Standards

This section outlines standards for common routines and procedures for trapping and handling errors and for reporting process statistics. The objectives are to:

Find all problems with a row, not just the first problem detected.
Avoid row leakage, where rows are dropped or rejected without notification.
Report all problems.
Report process statistics.
Interface with other processes which work to resolve exception issues.

It also defines what error detection mechanisms are required, and what is to be done when an exception is detected.
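The first two objectives above can be sketched as a validation routine that runs every check against every row and then reconciles the row counts. The checks, field names and function name are hypothetical examples, not a prescribed rule set.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

# Hypothetical checks: each pairs an error message with a test that
# returns True when the row passes.
CHECKS: List[Tuple[str, Callable[[Row], bool]]] = [
    ("customer_id is missing", lambda row: bool(row.get("customer_id"))),
    ("name is missing", lambda row: bool(row.get("name"))),
    ("country must be a 2-letter code", lambda row: len(row.get("country", "")) == 2),
]


def validate_rows(rows: List[Row]):
    accepted, rejected = [], []
    for row in rows:
        # Run every check so that all problems with the row are reported,
        # not just the first one detected.
        problems = [message for message, passes in CHECKS if not passes(row)]
        if problems:
            rejected.append((row, problems))
        else:
            accepted.append(row)
    stats = {"read": len(rows), "accepted": len(accepted), "rejected": len(rejected)}
    # Reconciliation: every row read must be either accepted or rejected.
    # In a real job these counts would come from separate links.
    if stats["read"] != stats["accepted"] + stats["rejected"]:
        raise RuntimeError("row leakage detected")
    return accepted, rejected, stats


sample = [
    {"customer_id": "1", "name": "Ann", "country": "AU"},
    {"customer_id": "", "name": "", "country": "AUS"},
]
accepted, rejected, stats = validate_rows(sample)
print(stats)
print(rejected[0][1])
```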

Process Reporting and Job Statistics Standards

Process Reporting Standards define how status and row counts of jobs are to be retrieved, formatted and stored for accurate exception and process reporting.
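As one illustration of what a stored job statistics record might contain, the sketch below defines a simple structure for a single job run. The field names and the status value are assumptions; in practice the status and row counts would be retrieved from the ETL tool's own log or API.

```python
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class JobRunStats:
    job_name: str
    status: str  # hypothetical values, e.g. "FINISHED" or "ABORTED"
    started_at: datetime
    finished_at: datetime
    rows_read: int
    rows_written: int
    rows_rejected: int

    @property
    def unaccounted_rows(self) -> int:
        # Non-zero means rows were dropped without being written or rejected.
        # This only holds for jobs that write one output row per input row.
        return self.rows_read - self.rows_written - self.rows_rejected

    def to_record(self) -> dict:
        """Format the run as a flat record ready to be stored or reported."""
        record = asdict(self)
        record["started_at"] = self.started_at.isoformat()
        record["finished_at"] = self.finished_at.isoformat()
        record["elapsed_seconds"] = (self.finished_at - self.started_at).total_seconds()
        record["unaccounted_rows"] = self.unaccounted_rows
        return record


run = JobRunStats(
    job_name="par_sap_staging_customers",
    status="FINISHED",
    started_at=datetime(2014, 9, 8, 1, 0, 0),
    finished_at=datetime(2014, 9, 8, 1, 0, 30),
    rows_read=100,
    rows_written=97,
    rows_rejected=3,
)
print(run.to_record())
```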

Notification Standards

Notification Standards define how information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.
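If email is the chosen delivery channel, a notification could be assembled along the following lines. The subject format, addresses and function name are hypothetical, and the sketch only builds the message; sending it (for example through an SMTP server) is left out.

```python
from email.message import EmailMessage
from typing import List


def build_notification(
    job_name: str,
    succeeded: bool,
    rows_written: int,
    rows_rejected: int,
    recipients: List[str],
    sender: str = "etl@example.com",  # hypothetical sender address
) -> EmailMessage:
    """Build (but do not send) a run notification email."""
    outcome = "SUCCESS" if succeeded else "FAILURE"
    message = EmailMessage()
    # Putting the outcome in the subject lets recipients filter on it.
    message["Subject"] = f"[ETL {outcome}] {job_name}"
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message.set_content(
        f"Job: {job_name}\n"
        f"Outcome: {outcome}\n"
        f"Rows written: {rows_written}\n"
        f"Rows rejected: {rows_rejected}\n"
    )
    return message


note = build_notification(
    "par_sap_staging_customers",
    succeeded=False,
    rows_written=97,
    rows_rejected=3,
    recipients=["admin@example.com", "owner@example.com"],
)
print(note["Subject"])
```

The recipient list is a natural candidate for a job parameter, so that each environment notifies its own administrators.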


Monday, September 8, 2014
