Sunday, May 11, 2014

Resource Estimation tool

Predicting hardware resources needed to run DataStage jobs in order to meet our processing time requirements can sometimes be more of an art than a science. With new sophisticated analytical information and deep understanding of the parallel framework, IBM has added Resource Estimation to DataStage (and QualityStage) 8.0.We can estimate and predict the resource utilization of parallel job runs by creating models and making projections in the Resource Estimation window. A model estimates the system resources for a job, including the amount of scratch space, disk space, and CPU time that is needed for each stage to run on each partition. A model also estimates the data set throughput in a job.

STEPS OF RESOURCE ESTIMATION:-

Step 1: Open a job from the Repository.
Step 2: Choose the option Resource Estimation which is available in the toolbar.

Figure 1 - Choose Resource Estimation from toolbar

Step 3: Create a resource model.

There are two types of resource models:

Static. The static model does not actually run the job to create the model. CPU utilization can not be estimated, but disk space can be. The record size is always fixed. The “best case” scenario is considered when the input data is propagated. The “worst case” scenario is considered when computing record size.

Dynamic. The Resource Estimation tool actually runs the job with a sample of the data. But CPU and disk space are estimated. This is a more predictable model to use for estimating.

To create a model:

Open a job in the Designer client, or select a job in the Director client.
Open the Resource Estimation window by using one of the following methods:

In the Designer, click File →Estimate Resource.
In the Director, click Job →Estimate Resource.
Click the Resource Estimationtoolbar button.
The first time that you open the Resource Estimation window for a job, static model is generated by default.

Click the Model toolbar button to display the Create Resource Model options.
Type a name in the Model Name field. The specified name must not already exist.
Select a type in the Model Type field.
If you want to specify a data sampling range for a dynamic model, use one of the following methods:

Click the Copy Previous button to copy the sampling specifications from previous models, if any exist.
Clear the Auto check box for a data source, and type values in the Fromand To fields to specify a record range.

Click Generate.

Figure 2 - Create Resource Model

Step 4: Project the resources required to execute the job based on varying data volumes for each input data source.To make a projection:

Open a job in the Designer client, or select a job in the Director client.
Open the Resource Estimation window by using one of the following methods:

In the Designer, click File →Estimate Resource.
In the Director, click Job →Estimate Resource.
Click the Resource Estimationtoolbar button.

Click the Projection toolbar button to display the Make Resource Projection options.
Type a name in the Projection Namefield. The specified name must not already exist.
Select the unit of measurement for the projection in the Input Units field.
Specify the input size upon which to base the projection by using one of the following methods:

Click the Copy Previous button to copy the specifications from previous projections, if any exist.
If the Input Units field is set to Size in Megabytes, type a value in theMegabytes (MB) field for each data source.
If the Input Units field is set to Number of Records, type a value in the Records field for each data source.

Click Generate.

Figure 3 - Make resource projection

Step 5: A projection is then executed using the model selected. The results show the total CPU needed, disk space requirements, scratch space requirements, and more.
To generate a report:

In the Resource Estimation window, select a model in the Models list.
Select a projection in the Input Projections list. If you do not select a projection, the default projection is used.
Click the Report toolbar button.
[By default, reports are saved in the following directory:
“C:\IBM\InformationServer\Clients\Classic\Estimation\server_name\project_name\job_name\html\report.html”]

Figure 4 - Resource Estimation Report

Step 6: Graphical charts are also available for analysis, which allow the user to drill into each stage and each partition.

Figure 5 - Graphical Charts Sample

Conclusion

So, here we can see through resource estimation facility of DataStage we can estimate the resources that has been used for execution of any parallel job.

Local Containers and Shared Container

A container, as its name indicates, is used to group stages and links. Containers help simplify and modularize server job designs and allow you to replacing complex areas of the diagram with a single container stage. For example, if you have a lookup that is used by multiple jobs, you can put the jobs and links that generate the lookup into a share container and use it to different jobs. In a way, you can look at it like a procedure or function in the programming term.

Containers are linked to other stages or containers in the job by input and output stages.

Two types of container:

1. Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job’s Diagram window. Local containers can be used in
server jobs or parallel jobs.

2. Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container:

(1.) Server shared container. Used in server jobs (can also be used in parallel jobs).

(2.) Parallel shared container. Used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage
(for example, you could use one to make a server plug-in stage available to a parallel job).

For here, I only tested the Server Jobs so I only put notes on the server jobs now. Parallel
Jobs works differently. Will note on it as a separate topic.

1 Local Container

The main purpose of using a DataStage local container is to simplify a complex design visually to make it easier to understand in the Diagram window. If the DataStage job has lots of stages and links, it may be easier to create additional containers to describe a particular sequence of steps.

To create a local container, from an existing job design, do the following:

(1.) Press the Shift key and using the mouse to click on the stages that you want to put into the local container.

(2.) From the Menu bar, select Edit ➤ Construct Container ➤ Local.

The group is replaced by a Local Container stage in the Diagram window. A new tab appears in the Diagram window containing the contents of the new Local Container stage. You are warned if any link naming conflicts occur when the container is constructed. The new container is opened and focus shifts onto its tab.

You can rename, move, and delete a container stage in the same way as any other stage in your job design.

To view or modify a local container, just double-click container stage in the Diagram window. You can edit the stages and links in a container in the same way you do for a job.

To create a empty container to which you can add stages and links, drag and drop the Container icon in the General group on the tool palette onto the Diagram window.

A Container stage is added to the Diagram window, double-click on the stage to open it, and add stages and links to the container the same way you do for a job.

1.1 Using Input and Output Stages in a local container

Input and output stages are used to represent the stages in the main job to which the container connects.

-- If you construct a local container from an existing group of stages and links, the input and output stages are automatically added. The link etween the input or output stage and the stage in the container has the same name as the link in the main job Diagram window.

In the example above, the input link is the link connects to the container from the main job’s Oracle_OCI stage (oracle_oci_0). The output link is the link that connects to the second container from the first container.

-- If you create a new container, it will place the input and output stages in the container without any link. You must add stages to the container Diagram window between the input and output stages. Link the stages together and edit the link names to match the ones in the main Diagram window.

You can have any number of links into and out from of a local container, all of the link names inside the container must match the link names into and out of it in the job. Once a connection is made, editing meta data on either side of the container edits the meta data on the connected stage in the job.

2 Share Container

Shared containers also help you to simplify your design but, unlike local containers, they are reusable by other jobs. You can use shared containers to make common job components available throughout the project.

Shared containers comprise groups of stages and links and are stored in the Repository like DataStage jobs. When you insert a shared container into a job, DataStage places an instance of that container into the design. When you compile the job containing an instance of a shared container, the code for the container is included in the compiled job. You can use the DataStage debugger on instances of shared containers used within jobs.

You can create a shared container from scratch, or place a set of existing stages and links within a shared container.

2.1 Create a shared container from an existing job design

(1.) Press Shift and click the other stages and links you want to add to the container.

(2.) From the Menu bar, select Edit ➤ Construct Container ➤ Shared. You will be prompted for a name for the container by the Create New dialog box. The group is replaced by a Shared Container stage of the appropriate type with the specified name in the Diagram window.

Any parameters occurring in the components are copied to the shared container as container parameters. The instance created has all its parameters assigned to corresponding job parameters.

(3.) Modify or View a Shared Contained

Select File ->Open from the Menu bar and select to Shared Container that you want to open. You can also highlight the Shared Container and use right mouse click and select property.

2.2 User A Shared Container

(1.) Dragging the Shared Container icon from the Shared Container branch in the Repository window to the job’s Diagram window.

(2.) Update the Input and Output tabs.

-- Map to Container Link.
Choose the link within the shared container to which the incoming job link will be mapped. Changing the link triggers a validation process, and you will be warned if the meta data does not match and are offered the option of reconciling the meta data as described below.

-- Columns page
Columns page shows the meta data defined for the job stage link in a standard grid. You can use the Reconcile option on the Load Shared Containers button to overwrite meta data on the job stage link with the container link meta data in the same way as described for the Validate option.

SCD Type 2

Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.

Type 1

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.

Here is an example of a database table that keeps supplier information:

Supplier_Key	Supplier_Code	Supplier_Name	Supplier_State
123	ABC	Acme Supply Co	CA

In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

Supplier_Key	Supplier_Code	Supplier_Name	Supplier_State
123	ABC	Acme Supply Co	IL

Type 2

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:

Supplier_Key	Supplier_Code	Supplier_Name	Supplier_State	Version
123	ABC	Acme Supply Co	CA	0
124	ABC	Acme Supply Co	IL	1

Another popular method for tuple versioning is to add effective date columns.

Supplier_Key	Supplier_Code	Supplier_Name	Supplier_State	Start_Date	End_Date
123	ABC	Acme Supply Co	CA	01-Jan-2000	21-Dec-2004
124	ABC	Acme Supply Co	IL	22-Dec-2004

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.

How to Implement SCD using DataStage 8.1 –SCD stage?

Step 1: Create a datastage job with the below structure-

Source file that comes from the OLTP sources
Old dimesion refernce table link
The SCD stage
Target Fact Table
Dimesion Update/Insert link

Figure 1

Step 2: To set up the SCD properties in the SCD stage ,open the stage and access the Fast Path

Figure 2

Step 3: The tab 2 of SCD stage is used specify the purpose of each of the pulled keys from the referenced dimension tables.

Figure 3

Step 4: Tab 3 is used to provide the seqence generator file/table name which is used to generate the new surrogate keys for the new or latest dimesion records.These are keys which also get passed to the fact tables for direct load.

Figure 4

Step 5: The Tab 4 is used to set the properties for configuring the data population logic for the new and old dimension rows. The type of activies that we can configure as a part of this tab are:

Generation the new Surrogate key values to be passed to the dimension and fact table
Mapping the source columns with the source column
Setting up of the expired values for the old rows
Defining the values to mark the current active rows out of multiple type rows

Figure 5

Step 6: Set the derivation logic for the fact as a part of the last tab.

Figure 6

Step 7: Complete the remaining set up, run the job

Figure 7

Dimension and Facts

Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), Region dimension
(profit by country, state, city), Product dimension (profit for product1,
product2).

A fact table is a table that contains summarized numerical and historical data
(facts) and a multipart index composed of foreign keys from the primary keys of
related dimension tables.

In data warehousing, a dimension is a collection of reference information about
a measurable event. These events are known as facts and are stored in a fact
table. Dimensions categorize and describe data warehouse facts and measures in
ways that support meaningful answers to business questions. They form the very
core of dimensional modeling.

Dimension tables are referenced by fact tables using keys. When creating a
dimension table in a data warehouse, a system-generated key is used to uniquely
identify a row in the dimension. This key is also known as a surrogate key. The
surrogate key is used as the primary key in the dimension table. The surrogate
key is placed in the fact table and a foreign key is defined between the two
tables. When the data is joined, it does so just as any other join within the
database.