
Sunday, April 28, 2013

Configuration file



The DataStage configuration file is a master control file (a text file that sits on the server side) for jobs; it describes the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting architectures such as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or MPP (multiple CPUs across multiple nodes, with dedicated memory per node). DataStage understands the architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you change your processing configuration, or change servers or platform, you never have to worry about it affecting your jobs, since all jobs depend on this configuration file for execution. DataStage jobs determine which nodes to run a process on, where to store temporary data and where to store data set data, based on the entries provided in the configuration file. A default configuration file is available whenever the server is installed.
Configuration files have the extension ".apt". The main benefit of the configuration file is that it separates the software and hardware configuration from job design. It allows hardware and software resources to be changed without changing a job design. DataStage jobs can point to different configuration files by using job parameters, which means that a job can utilize different hardware architectures without being recompiled.
The configuration file lists the processing nodes and specifies the disk space available to each of them. These are logical processing nodes, so having more than one CPU does not mean the nodes in your configuration file correspond one-to-one to those CPUs; it is possible to have more than one logical node on a single physical node. However, you should be careful when choosing the number of logical nodes on a single physical node. Increasing the number of nodes increases the degree of parallelism, but it does not necessarily mean better performance, because it results in a larger number of processes. If your underlying system does not have the capacity to handle these loads, you will end up with a very inefficient configuration on your hands.

1.    APT_CONFIG_FILE is the environment variable through which DataStage determines which configuration file to use (a project can have many configuration files). In fact, this is what is generally used in production. However, if this environment variable is not defined, how does DataStage determine which file to use?
1.    If the APT_CONFIG_FILE environment variable is not defined, DataStage looks for the default configuration file (config.apt) in the following locations:
1.    Current working directory.
2.    INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
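For example (a hedged sketch; the path is illustrative and will differ on your installation), APT_CONFIG_FILE can be set as a project-level environment variable in the Administrator client, added to a job as the parameter $APT_CONFIG_FILE, or exported in the environment before the job is started:

APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt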

2.    Define Node in configuration file
A node is a logical processing unit. Each node in a configuration file is distinguished by a virtual name and defines the number and speed of its CPUs, memory availability, page and swap space, network connectivity details, and so on.

3.    What are the different options a logical node can have in the configuration file?
1.    fastname – The fastname is the physical node name that stages use to open connections for high-volume data transfers. The value of this option is usually the network name. Typically, you can get this name by using the Unix command 'uname -n'.
2.    pools – The name of the pools to which the node is assigned. Based on the characteristics of the processing nodes, you can group nodes into sets of pools.
1.    A pool can be associated with many nodes, and a node can be part of many pools.
2.    A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.
3.    A parallel job, or a specific stage in the parallel job, can be constrained to run on a pool (a set of processing nodes).
1.    If both the job and a stage within the job are constrained to run on specific processing nodes, the stage runs on the nodes that are common to the stage and the job constraints.
3.    resource – The syntax is resource resource_type "location" [{pools "disk_pool_name"}] or resource resource_type "value". The resource_type can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (a directory to which persistent data is read and written), scratchdisk (the quoted absolute path of a directory on a file system where intermediate data is temporarily stored; it is local to the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE, etc.). A sketch of a node definition using these options is shown below.
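(The host name, pool names and paths in this sketch are illustrative, not taken from a real system.)

node "node1"
{
    fastname "etlserver1"
    pools "" "sort"
    resource disk "/data/datasets/node1" {pools ""}
    resource scratchdisk "/data/scratch/node1" {pools "sort"}
}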

4.    How does DataStage decide on which processing nodes a stage should run?
1.    If a job or stage is not constrained to run on specific nodes, the parallel engine executes the parallel stage on all nodes defined in the default node pool (the default behavior).
2.    If the job or stage is constrained, the constrained processing nodes are chosen when the parallel stage is executed. A sketch of such a constraint is shown below.
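As a hedged illustration of such a constraint (host names, the pool name and the paths are assumed), suppose the configuration file defines a dedicated "db2" pool. A DB2 stage constrained to that pool through its Advanced stage properties would then execute only on node2:

{
    node "node1"
    {
        fastname "etlserver1"
        pools ""
        resource disk "/data/datasets/node1" {pools ""}
        resource scratchdisk "/data/scratch/node1" {pools ""}
    }
    node "node2"
    {
        fastname "dbserver1"
        pools "" "db2"
        resource disk "/data/datasets/node2" {pools ""}
        resource scratchdisk "/data/scratch/node2" {pools ""}
    }
}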

In DataStage, the degree of parallelism and the resources being used are all determined at run time, based entirely on the configuration provided in the APT configuration file. A default configuration file is created when the server is installed; you can typically find it under the <install directory>\IBM\InformationServer\Server\Configurations folder with the name default.apt. Bear in mind that you will have to optimise this configuration for your server based on your resources.
Now let's try our hand at interpreting a configuration file, using the sample below.
{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}
This is a 3-node configuration file. Let's go through the basic entries and what they represent.
Fastname – This refers to the node name on a fast network. From this we can infer that node1 and node2 are on the same physical node. Node3, however, is on a different physical node (identified by SVR2). So in node1 and node2 all the resources are shared: the disk and scratch disk specified are actually shared between those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.
Pools – Pools allow us to group processing nodes based on their functions and characteristics. So if you see an entry other than the default, such as a named pool like "node0" or a reserved node pool like "sort" or "db2", it means that the node is part of the specified pool. A node is by default associated with the default pool, which is indicated by "". If you look at node3 you can see that it is also associated with the sort pool. This ensures that the Sort stage runs only on nodes that are part of the sort pool.
Resource disk – This specifies the location on your server where the processing node will write all the data set files. As you might know, when DataStage creates a data set, the file you see does not contain the actual data; it points to the place where the actual data is stored. Where that data is stored is what this line specifies.
Resource scratchdisk – The location of temporary files created during DataStage processes, such as lookups and sorts, is specified here. If the node is part of the sort pool, the scratch disk can also be made part of the sort scratch disk pool; this ensures that the temporary files created during sorts are stored only in this location. If such a pool is not specified, DataStage determines whether there are any scratch disk resources that belong to the default scratch disk pool on the nodes that the sort is specified to run on, and if so that space is used. A sketch of such a declaration is shown below.
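As a hedged illustration (the path is assumed), node3's scratch space could be declared as part of the sort scratch disk pool like this, so that sort work files land only on that disk:

resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node3" {pools "sort"}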


(Diagram omitted: sample resource allocation for a 1-node and a 4-node configuration.)

SAMPLE CONFIGURATION FILES

 

Configuration file for a simple SMP

 

A basic configuration file for a single machine, two-node (2-CPU) server is shown below. The file defines 2 nodes (node1 and node2) on a single dev server (an IP address might be provided instead of a hostname), with 3 disk resources (d1 and d2 for data, and Scratch as scratch space).

The configuration file is shown below: 



{
node "node1"
{
    fastname "dev"
    pools ""
    resource disk "/IIS/Config/d1" { }
    resource disk "/IIS/Config/d2" { }
    resource scratchdisk "/IIS/Config/Scratch" { }
}

node "node2"
{
    fastname "dev"
    pools ""
    resource disk "/IIS/Config/d1" { }
    resource scratchdisk "/IIS/Config/Scratch" { }
}
}

 

 

Configuration file for a cluster / MPP / grid


A sample configuration file for a cluster or a grid running on 4 machines is shown below.
The configuration defines 4 nodes (node[1-4]), node pools (n[1-4] and s[1-4]), resource pools (bigdata and sort), and scratch space.



{
node "node1"
{
    fastname "dev1"
    pools "" "n1" "s1" "sort"
    resource disk "/IIS/Config1/d1" {}
    resource disk "/IIS/Config1/d2" {pools "bigdata"}
    resource scratchdisk "/IIS/Config1/Scratch" {pools "sort"}
}

node "node2"
{
    fastname "dev2"
    pools "" "n2" "s2"
    resource disk "/IIS/Config2/d1" {}
    resource disk "/IIS/Config2/d2" {pools "bigdata"}
    resource scratchdisk "/IIS/Config2/Scratch" {}
}

node "node3"
{
    fastname "dev3"
    pools "" "n3" "s3"
    resource disk "/IIS/Config3/d1" {}
    resource scratchdisk "/IIS/Config3/Scratch" {}
}

node "node4"
{
    fastname "dev4"
    pools "n4" "s4"
    resource disk "/IIS/Config4/d1" {}
    resource scratchdisk "/IIS/Config4/Scratch" {}
}
}




Resource disk: here a disk path is defined. The data files of a data set are stored on the resource disk.

Resource scratchdisk: here, too, a path to a folder is defined. This path is used by the parallel job stages to buffer data when the parallel job runs.


Saturday, April 27, 2013

Complex Flat File Stages



The Complex Flat File stage lets you extract data from complex flat files that are generated on an IBM® mainframe. A complex flat file has a hierarchical structure in its arrangement of columns. It is physically flat (that is, it has no pointers or other complicated infrastructure), but logically represents parent-child relationships. You can use multiple record types to achieve this hierarchical structure.

RECOGNIZING A HIERARCHICAL STRUCTURE

For example, you can use records with various structures for different types of information, such as an 'E' record for static employee information and an 'S' record for monthly employee payroll information, or for repeating groups of information (twelve months of revenue). You can also combine these record groupings, and in the case of repeating data, you can flatten nested OCCURS groups.

MANAGING REPEATING GROUPS AND INTERNAL STRUCTURES

You can easily load, manage, and use repeating groups and internal record structures such as GROUP fields and OCCURS. You can ignore GROUP data columns that are displayed as raw data and have no logical use for most applications. The metadata can be flattened into a normalized set of columns at load time, so that no arrays exist at run time.


SELECTING SUBSETS OF COLUMNS

You can select a subset of columns from a large COBOL File Description (CFD). This filtering process results in performance gains since the stage no longer parses and processes hundreds of columns if you only need a few.
Complex flat files can also include legacy data types.

OUTPUT LINKS

The Complex Flat File stage supports multiple outputs. An output link specifies the data you are extracting, which is a stream of rows to be read.
When using the Complex Flat File stage to process a large number of columns, for example, more than 300, use only one output link in your job. This dramatically improves the performance of the GUI when loading, saving, or building these columns. Having more than one output link causes a save or load sequence each time you change tabs.
The Complex Flat File stage does not support reference lookup capability or input links.

FastTrack Makes Your DataStage Development Faster



IBM introduced a tool called FastTrack: a source-to-target mapping tool that plugs straight into the Information Server and runs inside a browser.
The tool was introduced with the Information Server and is available in the 8.1 version.
As the name suggests, IBM are using it to help in the analysis and design stage of a data integration project, to do the source-to-target mapping and the definition of the transformation rules.  Since it is an Information Server product, it runs against the Metadata Server and can share metadata with the other products.
I have talked about it previously in New Product: IBM FastTrack for Source To Target Mapping and FastTrack Excel out of your DataStage project, but now I have had the chance to see it in action on a Data Warehouse project.  We have been using the tool for a few weeks now and we are impressed.  It's been easier to learn than other Information Server products, and it manages to fit most of what you need inside frames on a single browser screen.  It has very few bugs, and it has been in the hands of someone who doesn't know a lot about DataStage, and they have been able to complete mappings and generate DataStage jobs.
I hope to get some screenshots up in the weeks to come but here are some observations in how we have saved time with FastTrack:
  1. FastTrack provides faster access to metadata.  In an Excel/Word mapping environment you need to copy and paste your metadata into your mapping document from a tool that can show it.  FastTrack can see metadata imported through any of the Information Server import methods, such as DataStage plugin imports; the import bridges from Meta Integration Technologies Inc (MITI): ErWin, InfoSphere Data Architect, Cognos, Business Objects, Informatica, etc.; and imports via the database connectors from any other Information Server product, such as the table definitions imported and profiled by Information Analyser.  You can import an entire database in a few minutes and drag and drop it onto your various mappings.
  2. FastTrack lets you map columns from XML and Cobol Copybook hierarchies to flat file relational database targets without any metadata massaging.  In Excel you would spend days cutting and chopping XML and Cobol complex flat file definitions.  With FastTrack you can access a definition imported through the DataStage Cobol or XML importers and map away.
  3. FastTrack lets you do source to target mapping straight out of your modelling tool.  You can import your model straight into the Metadata Server via a bridge and start mapping it.  No mucking around with database DDLs and no need to get access to create database schemas.  This can be handy in the early days of a project.
  4. FastTrack has some great auto mapping functions.  There is a discovery function where you drag and drop the source or target table onto one side of the mapping and then use the discover function to find candidate matches for the other side – then choose the “Best Match” to take the first of the candidates.  If you choose multiple columns you can Discover and Best Match all the columns in your table.  It searches for matching column names against the candidate tables.
  5. FastTrack can also auto match on the Business Glossary terms attached to those columns.  It is one of the few products in the Information Server that makes productive use of the Business Glossary to speed things up.  Of course you need to create your Glossary, fill it with terms and map those terms to physical metadata first!  FastTrack lets you add Glossary terms to physical columns as you map.
  6. FastTrack lets you balance the mapping between business analysts and ETL developers.  Both can use the tool – it’s an Excel style interface – but business analysts may be faster at mapping early in the project as they gather requirements and rules.  This can help avoid bottlenecks on your team if anyone can do mapping and can push the results straight into DataStage.
  7. FastTrack creates DataStage jobs.  These jobs have the source and target connectors already loaded and configured and stages such as lookup, join and transformer already built.  It even lets you add Transformer derivations such as macros, job parameters and functions from a function list.
  8. FastTrack handles DataStage naming conventions.  FastTrack holds a set of rules for naming DataStage stages and links that you can configure to match your naming convention.  Normal DataStage development means dragging and dropping stages and links onto a canvas and renaming every one.  FastTrack does the naming for you.
  9. FastTrack lets you add the links for joins and lookups.  I don’t know if you’ve tried to map joins and lookups in Excel but it’s not pretty – you have room to map the extra columns but there is no easy way to show the key fields that join the two sources together.  Generally you make a note of it under the mapping.  In FastTrack you choose the join/lookup table, choose the key fields that do the join and bring in the extra columns for mapping to the output table and it generates the required stages in DataStage.
  10. FastTrack shows progress of mapping tasks.  Once you have created a mapping for all interfaces FastTrack will produce a report showing how much of each mapping has been finished saving you the time of tracking progress manually.
What FastTrack can do better.
  1. Better bulk export and import functions – preferably XML and Excel.  Excel for when we produce documentation.  XML for when we want to back it up or move it between repositories.  (Or export, run a search and replace to globally rename a transform value and import it again).
  2. Global search and replace on transformation values, similar to the search and replace in the DataStage Transformer, for globally renaming things like function names and job parameter values.
  3. More DataStage stages – it currently lets you configure settings for Lookup, Join, Connectors and Transformers.  Would like to see Surrogate Key, Change Data Capture and Slowly Changing Dimension support – though it’s debatable whether those are business analyst functions for FastTrack or developer functions for DataStage.  It would be cool to define Type 1 Type 2 and key fields for dimension table mapping.
  4. Let you run Discover and Best Match on Business Glossary terms so you can find terms that suit the column name you are mapping.
  5. Discover transformation rules as well as mappings … oh hang on, that’s in the next release!
  6. Reverse engineer DataStage Server jobs so you can generate DataStage Enterprise jobs from a Server job mapping.
  7. More flexible licensing.  You buy licenses in packs of 10 – and that’s too many for a lot of customers!

Friday, April 26, 2013

Sample Sql Queries


 Simple select command:

SELECT SUBPRODUCT_UID
  ,SUBPRODUCT_PROVIDER_UID
  ,SUBPRODUCT_TYPE_UID
  ,DESCRIPTION
  ,EXTERNAL_ID
  ,OPTION_ID
  ,NEGOTIABLE_OFFER_IND
  ,UPDATED_BY
  ,UPDATED_ON
  ,CREATED_ON
  ,CREATED_BY FROM schemaname.SUBPRODUCT

With Inner Join:

SELECT eft.AMOUNT AS AMOUNT,
  ceft.MERCHANT_ID AS MERCHANT_ID,
  ca.ACCOUNT_NUMBER AS ACCOUNT_NUMBER,
  bf.MBNA_CREDIT_CARD_NUMBER AS MBNA_CREDIT_CARD_NUMBER,
  ceft.CUSTOMER_FIRST_NAME AS CUSTOMER_FIRST_NAME,
  ceft.CUSTOMER_LAST_NAME AS CUSTOMER_LAST_NAME,
  btr.TRACE_ID AS TRACE_ID,
  ROWNUM
FROM schemaname.bt_fulfillment bf

INNER JOIN schemaname.balance_transfer_request btr
ON btr.bt_fulfillment_uid = bf.bt_fulfillment_uid

INNER JOIN schemaname.electronic_funds_transfer eft
ON eft.bt_fulfillment_uid = bf.bt_fulfillment_uid

INNER JOIN schemaname.creditor_eft ceft
ON ceft.ELECTRONIC_FUNDS_TRANSFER_UID = eft.ELECTRONIC_FUNDS_TRANSFER_UID

INNER JOIN schemaname.credit_account ca
ON ca.ELECTRONIC_FUNDS_TRANSFER_UID = ceft.ELECTRONIC_FUNDS_TRANSFER_UID

WHERE ((btr.TYPE ='CREATE_CREDIT' AND btr.STATUS ='PENDING')
OR (btr.TYPE ='RETRY_CREDIT' AND btr.STATUS ='PENDING'))
AND btr.RELEASE_DATE < CURRENT_TIMESTAMP

Date and time string functions




-DateFromDaysSince
Description:
Returns a date by adding an integer to a baseline date
Syntax :
DateFromDaysSince(%number%,[%"yyyy-mm-dd"%])
Example:
days = 8
date = 2009-09-01
DateFromDaysSince(days,date) = 2009-09-09
-DateFromJulianDay
    Description:
Returns a date from the given julian date
Syntax :
DateFromJulianDay(%juliandate%)
Example:
let jdate = 2552812
DateFromJulianDay(jdate) = 2009-09-09
-DaysSinceFromDate
Description:
Returns the number of days from source date to the given date
Syntax:
DaysSinceFromDate(%date%,%"yyyy-mm-dd"%)
Example:
date1 = 2009-09-10
date2 = 2009-09-01
DaysSinceFromDate(date1,date2) = 9
-HoursFromTime
Description:
Returns the hour portion of a time
Syntax :
HoursFromTime(%time%)
Example:
Time= 09:58:15
HoursFromTime(time)= 09
-JulianDayFromDate
        Description:
Returns julian day from the given date
Syntax :
JulianDayFromDate(%date%)
Example:
let date = 2009-09-09
JulianDayFromDate(date) = 2552812
-MicroSecondsFromTime
Description:
Returns the microsecond portion from a time
Syntax :
MicroSecondsFromTime(%time%)
Example:
Time= 09:58:15
MicroSecondsFromTime(time)=0
-MinutesFromTime
Description:
Returns the minute portion from a time
Syntax :
MinutesFromTime(%time%)
Example:
Time= 09:58:15
MinutesFromTime(time)=58
 
-MonthDayFromDate
 Description:
Returns the day of the month given the date
Syntax :
MonthDayFromDate(%date%)
Example:
date = 2009-08-09
MonthDayFromDate(date) = 09
 
-MonthFromDate
Description:
Returns the month number given the date
Syntax :
MonthFromDate(%date%)
Example:
date = 2009-09-10
MonthFromDate(date) = 09
 
-NextWeekdayFromDate
Description:
Returns the date of the specified day of the week soonest after the source date
Syntax :
NextWeekdayFromDate(%sourcedate%,%dayname%)
Example:
date = 2009-09-04
NextWeekdayFromDate(date,"thu") = 2009-09-10
-PreviousWeekdayFromDate
Description:
Returns the date of the specified day of the week most recent before the source date
Syntax :
PreviousWeekdayFromDate(%sourcedate%,%dayname%)
Example:
let date = 2009-09-10
PreviousWeekdayFromDate(date,"friday") = 2009-09-04
 
-SecondsFromTime
Description:
Returns the second portion from a time
Syntax :
SecondsFromTime(%time%)
Example:
time=09:58:15
SecondsFromTime(time)=15
 
-SecondsSinceFromTimestamp
Description:
Returns the number of seconds between two timestamps
Syntax :
SecondsSinceFromTimestamp(%timestamp%,%"yyyy-mm-dd hh:nn:ss"%)
Example (illustrative values; the timestamps are one day apart):
timestamp1 = 2009-09-10 09:58:15
timestamp2 = 2009-09-09 09:58:15
SecondsSinceFromTimestamp(timestamp1,timestamp2) = 86400
-TimeDate
Description:
Returns the system time and date as a formatted string
Syntax :
TimeDate()
Example:
time=09:58:09
date =15 sep 2009
TimeDate()=09:58:09   15 sep 2009
-TimeFromMidnightSeconds
Description:
Returns the time given the number of seconds since midnight
 Syntax :
TimeFromMidnightSeconds(%seconds%)
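Example (an illustrative value; 240 seconds is 4 minutes past midnight):
seconds = 240
TimeFromMidnightSeconds(seconds) = 00:04:00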
-TimestampFromDateTime
Description:
Returns a timestamp from the given date and time
Syntax :
TimestampFromDateTime(%date%,%time%)
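Example (illustrative values):
date = 2009-09-09
time = 09:58:15
TimestampFromDateTime(date,time) = 2009-09-09 09:58:15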
-TimestampFromSecondsSince
Description:
Returns a timestamp given the number of seconds since the base timestamp
Syntax :
TimestampFromSecondsSince(%seconds%,[%timestamp%])
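Example (illustrative values; 120 seconds are added to the base timestamp):
seconds = 120
timestamp = 2009-09-09 09:58:15
TimestampFromSecondsSince(seconds,timestamp) = 2009-09-09 10:00:15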
-TimestampFromTimet
 Description:
Returns a timestamp from the given unix time_t value
Syntax :
TimestampFromTimet(%timetvalue%)
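Example (illustrative value; assuming the time_t is interpreted as UTC):
timet = 1252490295
TimestampFromTimet(timet) = 2009-09-09 09:58:15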
 
-TimetFromTimestamp
Description:
Returns a unix time_t value from the given timestamp
Syntax :
TimetFromTimestamp(%timestamp%)
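Example (illustrative value; the inverse of the TimestampFromTimet example above, assuming UTC):
timestamp = 2009-09-09 09:58:15
TimetFromTimestamp(timestamp) = 1252490295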
    
-WeekdayFromDate
 Description:
Returns the day number of the week from the given date. Origin day optionally specifies the day regarded as the first in the week and is Sunday by default
Syntax :
WeekdayFromDate(%date%,[%startdayname%])
Example:
date = 2001-01-01
WeekdayFromDate(date) = 6
WeekdayFromDate(date,"Mon") = 0
let date = 2001-01-07
WeekdayFromDate(date) = 0
WeekdayFromDate(date,"Mon") = 6
-YeardayFromDate
Description:
Returns the day number in the year from the given date
Syntax :
YeardayFromDate(%date%)
Example:
date = 2009-01-01
YeardayFromDate(date) = 1
-YearFromDate
Description:
Returns the year from the given date
Syntax :
YearFromDate(%date%)
Example:
let date = 2009-09-01
YearFromDate(date) = 2009
-YearweekFromDate
Description:
Returns the week number in the year from the given date
Syntax :
YearweekFromDate(%date%)
Example:
let date = 2009-01-06
YearweekFromDate(date) = 1
Date, Time, and Timestamp functions that specify dates, times, or timestamps in the argument use strings with specific formats:
For a date, the format is %yyyy-%mm-%dd
For a time, the format is %hh:%nn:%ss, or, if extended to include microseconds, %hh:%nn:%ss.x where x gives the number of decimal places seconds is given to.
For a timestamp the format is %yyyy-%mm-%dd %hh:%nn:%ss, or, if extended to include microseconds, %yyyy-%mm-%dd %hh:%nn:%ss.x where x gives the number of decimal places seconds is given to.
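For example (following the rule above), the extended format %yyyy-%mm-%dd %hh:%nn:%ss.3 would describe a timestamp such as 2009-09-09 09:58:15.123.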