My Datastage Notes: September 2015

Monday, September 28, 2015

Information Analyzer

InfoSphere Information Analyzer (IA) provides capabilities to profile and analyze data to deliver trusted information to any organization.
Data quality specialists use InfoSphere Information Analyzer to scan samples and full volumes of data to determine their quality and structure. This analysis helps to discover the inputs to the data integration project, ranging from individual fields to high-level data entities. Information analysis enables any organization to correct problems with structure or validity before they affect the data integration project.
After data is analyzed, data quality specialists create data quality rules to assess and monitor heterogeneous data sources for trends, patterns, and exception conditions. These rules help to uncover data quality issues and help the organization to align data quality metrics throughout the project lifecycle. Business analysts can use these metrics to create quality reports that track and monitor the quality of data over time. Business analysts can then use IBM InfoSphere Data Quality Console to track and browse exceptions that are generated by InfoSphere Information Analyzer.
Understanding where data originates, which data stores it lands in, and how the data changes over time is important to develop data lineage, which is a foundation of data governance. InfoSphere Information Analyzer shares lineage information with the rest of Information Server by storing it in the metadata repository. Other Information Server components can access lineage information directly to simplify the collection and management of metadata across any organization.
Here is a sample IA report showing the number of Constant, Unique and Null values in the entire input data on chosen columns with some additional information. This is one of the several reports that IA can generate.

IA_Sample_Report

Conductor Node in Datastage

Below is the sample APT CONFIG FILE ,see in bold to mention conductor node.

{
node "node0"
{
fastname "server1"
pools "conductor"
resource disk "/datastage/Ascential/DataStage/Datasets/node0" {pools "conductor"}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node0" {pools ""}
}

node "node1"
{
fastname "server2"
pools ""
resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools ""}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
}
node "node2"
{
fastname "server2"
pools ""
resource disk "/datastage/Ascential/DataStage/Datasets/node2" {pools ""}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node2" {pools ""}
}
}

Please find the below different answers :
------
For every job that starts there will be one (1) conductor process (started on the conductor node), there will be one (1) section leader for each node in the configuration file and there will be one (1) player process (may or may not be true) for each stage in your job for each node. So if you have a job that uses a two (2) node configuration file and has 3 stages then your job will have

1 conductor
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)

Your dump score may show that your job will run 9 processes on 2 nodes.

This kind of information is very helpful when determining the impact that a particular job or process will have on the underlying operating system and system resources.

-----
Conductor Node :
It is a main process to

Start up jobs
Resource assignments
Responsible to create Section leader (used to create & manage player player process which perform actual job execution).
Single coordinator for status and error messages.
manages orderly shutdown when processing completes in the event of fatal error.

-----
Jobs developed with DataStage EE and QualityStage are independent of the actual hardware and degree of parallelism used to run the job. The parallel Configuration File provides a mapping at runtime between the job and the actual runtime infrastructure and resources by defining logical processing nodes.

To facilitate scalability across the boundaries of a single server, and to maintain platform independence, the parallel framework uses a multi-process architecture.

The runtime architecture of the parallel framework uses a process-based architecture that enables scalability beyond server boundaries while avoiding platform-dependent threading calls. The actual runtime deployment for a given job design is composed of a hierarchical relationship of operating system processes, running on one or more physical servers

Conductor Node (one per job): the main process used to startup jobs, determine resource assignments, and create Section Leader processes on one or more processing nodes. Acts as a single coordinator for status and error messages, manages orderly shutdown when processing completes or in the event of a fatal error. The conductor node is run from the primary server
Section Leaders (one per logical processing node): used to create and manage player processes which perform the actual job execution. The Section Leaders also manage communication between the individual player processes and the master Conductor Node.
Players: one or more logical groups of processes used to execute the data flow logic. All players are created as groups on the same server as their managing Section Leader process.

-----

When the job is initiated the primary process (called the “conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable.

Once the execution nodes are known (from the configuration file) the conductor causes a coordinating process called a “section leader” to be
started on each; by forking a child process if the node is on the same machine as the conductor or by remote shell execution if the node is on a
different machine from the conductor (things are a little more dynamic in a grid configuration, but essentially this is what happens). Each section
leader process is passed the score and executes it on its own node, and is visible as a process running osh. Section leaders’ stdout and stderr are
redirected to the conductor, which is solely responsible for logging entries from the job.

The score contains a number of Orchestrate operators. Each of these runs in a separate process, called a “player” (the metaphor clearly is one of an
orchestra). Player processes’ stdout and stderr are redirected to their parent section leader. Player processes also run the osh executable.

Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP.

_____________________________

What are the different options a logical node can have in the configuration file?
1. fastname – The fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. Typically, you can get this name by using Unix command ‘uname -n’.
2. pools – Name of the pools to which the node is assigned to. Based on the characteristics of the processing nodes you can group nodes into set of pools.
  1. A pool can be associated with many nodes and a node can be part of many pools.
  2. A node belongs to the default pool unless you explicitly specify apools list for it, and omit the default pool name (“”) from the list.
  3. A parallel job or specific stage in the parallel job can be constrained to run on a pool (set of processing nodes).
    1. In case job as well as stage within the job are constrained to run on specific processing nodes then stage will run on the node which is common to stage as well as job.
3. resource –resource resource_type “location” [{pools “disk_pool_name”}] | resource resource_type “value” . resource_type can be canonicalhostname(Which takes quoted ethernet name of a node in cluster that is unconnected to Conductor node by the hight speed network.) or disk (To read/write persistent data to this directory.) orscratchdisk (Quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored. It is local to the processing node.) or RDBMS Specific resourses (e.g. DB2, INFORMIX, ORACLE, etc.)
How datastage decides on which processing node a stage should be run?
1. If a job or stage is not constrained to run on specific nodes then parallel engine executes a parallel stage on all nodes defined in the default node pool. (Default Behavior)
2. If the node is constrained then the constrained processing nodes are choosen while executing the parallel stage. (Refer to 2.2.3 for more detail).
When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. This is called Conductor Node. For other nodes, you do not need to specify the physical node. Also, You need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that conductor node is not connected with the high-speed network switches. However, the other nodes are connected to each other using a very high-speed network switches. How do you configure your system so that you will be able to achieve optimized parallelism?
1. Make sure that none of the stages are specified to be run on the conductor node.
2. Use conductor node just to start the execution of parallel job.
3. Make sure that conductor node is not the part of the default pool.
Although, parallelization increases the throughput and speed of the process, why maximum parallelization is not necessarily the optimal parallelization?
1. Datastage creates one process for every stage for each processing node. Hence, if the hardware resource is not available to support the maximum parallelization, the performance of overall system goes down. For example, suppose we have a SMP system with three CPU and a Parallel job with 4 stage. We have 3 logical node (one corresponding to each physical node (say CPU)). Now DataStage will start 3*4 = 12 processes, which has to be managed by a single operating system. Significant time will be spent in switching context and scheduling the process.
Since we can have different logical processing nodes, it is possible that some node will be more suitable for some stage while other nodes will be more suitable for other stages. So, when to decide which node will be suitable for which stage?
1. If a stage is performing amemory intensive task then it should be run on a node which has more disk space available for it. E.g.sorting a data is memory intensive task and it should be run on such nodes.
2. If some stage depends on licensed version of software (e.g. SAS Stage, RDBMS related stages, etc.) then you need to associate those stages with the processing node, which is physically mapped to the machine on which the licensed software is installed. (Assumption:The machine on which licensed software is installed is connected through other machines using high speed network.)
3. If a job contains stages, which exchange large amounts of data then they should be assigned to nodes where stages communicate by either shared memory (SMP) or high-speed link (MPP) in most optimized manner.
Basically nodes are nothing but set of machines (specially in MPP systems). You start the execution of parallel jobs from the conductor node. Conductor nodes creates a shell of remote machines (depending on the processing nodes) and copies the same environment on them. However, it is possible to create a startup script which will selectively change the environment on a specific node. This script has a default name of startup.apt. However, like main configuration file, we can also have many startup configuration files. The appropriate configuration file can be picked up using the environment variable APT_STARTUP_SCRIPT. What is use of APT_NO_STARTUP_SCRIPT environment variable?
1. Using APT_NO_STARTUP_SCRIPT environment variable, you can instruct Parallel engine not to run the startup script on the remote shell.
What are the generic things one must follow while creating a configuration file so that optimal parallelization can be achieved?
1. Consider avoiding the disk/disks that your input files reside on.
2. Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they’re located on a RAID (Redundant Array of Inexpensive Disks) system.
3. Know what is real and what is NFS:
  1. Real disks are directly attached, or are reachable over a SAN (storage-area network -dedicated, just for storage, low-level protocols).
  2. Never use NFS file systems for scratchdisk resources, remember scratchdisk are also used for temporary storage of file/data during processing.
  3. If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn’t mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. Better to setup a “final” disk pool, and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.
4. Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.

_----------

Wednesday, September 23, 2015

New features in Datastage 11.3

Introduction

With the release of Information Server/DataStage 11.3 a few weeks ago, most DataStage developers are interested in knowing exactly what new features have surfaced and how they can best be leveraged. With the release of version 8.7, IBM introduced the Operations Console and version 9.1 followed in-line with the release of the Workload Manager. I’m afraid that DataStage developers don’t have anything too exciting to look forward to in version 11.3. There are definitely some nifty new features tacked on the suite from the standpoint of data governance, metadata management, and administration, but this post will review just the new features in DataStage.
There might be some hidden new features or “features” which aren’t documented. Feel free to comment below on what you think they might be.

Hierarchical Data Stage

Remember how the XML stage was pretty recently introduced for all XML processing in DataStage? Well now it has been relabeled as the Hierarchical Data stage, I suppose to account for its ability to process all types of Hierarchical Data (JSON) as opposed to strictly being limited to XML. This stage also has some additional functionality which wasn’t previously available. If you are familiar with this stage (Hierarchical Data/XML) you will know it has various steps which are added in the Assembly Editor, for a sequence of processing events. There are now three new steps:

REST – Invokes a RESTful web service
JSON_Parser – Parse JSON content with a selected type
JSON_Composer – Compose JSON content with a selected type

Big Data File Stage

The Big Data File stage is used to read and write to files on Hadoop (HDFS). The Big Data File stage is now compatible with Hortonworks 2.1, Cloudera 4.5, and InfoSphere BigInsights 3.0.

Greenplum Connector Stage

You can now use the Greenplum Connector stage for a native connection for accessing data which is located in a Greenplum database. You can now also import Table Definitions using the Greenplum Connector framework.

InfoSphere Master Data Management Connector Stage

The Master Data Management Connector stage can be used to read and write data from the IBM master data management solution – InfoSphere MDM. This stage can be configured for Member read and Member write interactions from the MDM server.

Amazon S3 Connector Stage

Amazon S3 (Simple Storage Service) is a cheap cloud file storage system which offers availability through web services (REST, SOAP, and BitTorrent). It offers scalability, high availability, and low latency at extremely competitive prices. The Amazon S3 Connector stage be can used to read and write data residing in Amazon S3.

Unstructured Data Stage – Microsoft Excel (.xls and .xlsx)

The Unstructured Data stage was first introduced in DataStage v9.1 and was used to read Excel files through a native interface. Previously, Excel data was staged as a .csv file or accessed through ODBC. The stage can also now be used to write data to Excel files.
unstructured data stage write excel

Sort Stage Optimization

The Sort stage now tries to optimize your DataStage sort operations by converting length bounded columns to variable length before the sort and then converts it back to a length bounded column after the sort. When a record’s actual size of data is smaller than the defined upper bound, the optimization will result in reduced disk I/O.

Improved Flexibility in Record Delimiting

The Sequential File stage now gives developers more flexibility with how a source flat file has to be delimited. A new environment variable, APT_IMPORT_HANDLE_SHORT, can be set to enable the import operator the ability the read in records which do not contain all of the fields defined in the import schema. Previously, these records were rejected by the stage. The values assigned to any missing field depends on the data type and nullability.

Operations Console/Workload Management

IBM lists the Operations Console and Workload Management as new features of the 11.3 release documentation, even though these components have already been introduced in previous releases. Both components are now part of the base Information Server installation and Workload Management is now by default enabled.

CFF as source

As a source, the CFF stage can have multiple output links and a single
reject
link.
2. The source data can contain one or more of the following clauses:
- GROUP
- REDEFINES
- OCCURS
- OCCURS DEPENDING ON
CFF source stages run in parallel mode when they are used to read
multiple
files.
When a CFF stage is defined as a source, you must provide details about
the
file that the stage will read, create record definitions for the data,
define the
column metadata, specify record ID constraints, and select output
columns.
- If you are reading data from a file that contains multiple record
types, you
must create a separate record definition for each type