Tuesday, April 23, 2013

Commonly used Unix commands

Cut

Cut out selected fields of each line of a file.

Syntax

cut [-b] [-c] [-f] list [-n] [-d delim] [-s] [file]

-b list

The list following -b specifies byte positions (for instance, -b1-72 would pass the first 72 bytes of each line). When -b and -n are used together, list is adjusted so that no multi-byte character is split. If -b is used, the input line should contain 1023 bytes or less.

-c list

The list following -c specifies character positions (for instance, -c1-72 would pass the first 72 characters of each line).

-f list

The list following -f is a list of fields assumed to be separated in the file by a delimiter character (see -d); for instance, -f1,7 copies the first and seventh field only. Lines with no field delimiters will be passed through intact (useful for table subheadings), unless -s is specified. If -f is used, the input line should contain 1023 characters or less.

list

A comma-separated or blank-character-separated list of integer field numbers (in increasing order), with optional - to indicate ranges (for instance, 1,4,7; 1-3,8; -5,10 (short for 1-5,10); or 3- (short for third through last field)).

-n

Do not split characters. When -b list and -n are used together, list is adjusted so that no multi-byte character is split.

-d delim

The character following -d is the field delimiter (-f option only). Default is tab. Space or other characters with special meaning to the shell must be quoted. delim can be a multi-byte character.

-s

Suppresses lines with no delimiter characters in case of the -f option. Unless specified, lines with no delimiters will be passed through untouched.

file

A path name of an input file. If no file operands are specified, or if a file operand is -, the standard input will be used.

Examples

name=`who am i | cut -f1 -d' '`

Set name to current login name.
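
A few more examples of the options above; the file names employees.txt and app.log and their layouts are made up purely for illustration, so adjust the delimiter and field numbers to your own data.

# Assume employees.txt is colon-delimited: name:dept:phone
cut -d':' -f1,3 employees.txt     # first and third fields (name and phone)
cut -d':' -f2- employees.txt      # second field through the last
cut -c1-10 employees.txt          # first 10 characters of each line
cut -b1-72 app.log                # first 72 bytes of each line
date | cut -c1-3                  # reading from standard input: first three characters of the date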



WC-UNIX:

$ wc filename
 X Y Z filename
  • X – Number of lines
  • Y – Number of words
  • Z – Number of bytes
  • filename – name of the file
-l : Prints the number of lines in a file.
-w : Prints the number of words in a file.
-c : Prints the count of bytes in a file.
-m : Prints the count of characters in a file.
-L : Prints only the length of the longest line in a file.
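
For example (a hedged illustration; the file name notes.txt and the counts shown are invented):

$ wc notes.txt
  12  85 512 notes.txt            (12 lines, 85 words, 512 bytes)
$ wc -l notes.txt
12 notes.txt
$ wc -L notes.txt                 (longest-line length; -L is a GNU extension)
64 notes.txt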

Local Containers and Shared Containers

A container, as its name indicates, is used to group stages and links. Containers help simplify and modularize server job designs and allow you to replace complex areas of the diagram with a single container stage. For example, if you have a lookup that is used by multiple jobs, you can put the stages and links that generate the lookup into a shared container and reuse it in different jobs. In a way, you can think of a container as a procedure or function in programming terms.

Containers are linked to other stages or containers in the job by input and output stages.

Two types of container:



1. Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job’s Diagram window. Local containers can be used in
server jobs or parallel jobs.

2. Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container:

(1.) Server shared container. Used in server jobs (can also be used in parallel jobs).

(2.) Parallel shared container. Used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage
(for example, you could use one to make a server plug-in stage available to a parallel job).

So far I have only tested containers in server jobs, so these notes cover server jobs only. Parallel jobs work differently; I will cover them as a separate topic.

1 Local Container

The main purpose of using a DataStage local container is to simplify a complex design visually to make it easier to understand in the Diagram window. If the DataStage job has lots of stages and links, it may be easier to create additional containers to describe a particular sequence of steps.

To create a local container, from an existing job design, do the following:

(1.) Press the Shift key and use the mouse to click the stages and links that you want to put into the local container.

(2.) From the Menu bar, select Edit ➤ Construct Container ➤ Local.

The group is replaced by a Local Container stage in the Diagram window. A new tab appears in the Diagram window containing the contents of the new Local Container stage. You are warned if any link naming conflicts occur when the container is constructed. The new container is opened and focus shifts onto its tab.

You can rename, move, and delete a container stage in the same way as any other stage in your job design.

To view or modify a local container, just double-click the container stage in the Diagram window. You can edit the stages and links in a container in the same way you do for a job.



To create an empty container to which you can add stages and links, drag the Container icon from the General group on the tool palette onto the Diagram window.

A Container stage is added to the Diagram window. Double-click the stage to open it, then add stages and links to the container the same way you do for a job.

1.1 Using Input and Output Stages in a local container

Input and output stages are used to represent the stages in the main job to which the container connects.


-- If you construct a local container from an existing group of stages and links, the input and output stages are automatically added. The link between the input or output stage and the stage in the container has the same name as the link in the main job Diagram window.

In the example above, the input link is the link that connects to the container from the main job’s Oracle_OCI stage (oracle_oci_0). The output link is the link that connects the first container to the second container.

-- If you create a new container, DataStage places the input and output stages in the container without any links. You must add stages to the container Diagram window between the input and output stages, link the stages together, and edit the link names to match the ones in the main Diagram window.

You can have any number of links into and out of a local container, but all of the link names inside the container must match the link names into and out of it in the job. Once a connection is made, editing meta data on either side of the container edits the meta data on the connected stage in the job.

2 Shared Containers

Shared containers also help you to simplify your design but, unlike local containers, they are reusable by other jobs. You can use shared containers to make common job components available throughout the project.

Shared containers comprise groups of stages and links and are stored in the Repository like DataStage jobs. When you insert a shared container into a job, DataStage places an instance of that container into the design. When you compile the job containing an instance of a shared container, the code for the container is included in the compiled job. You can use the DataStage debugger on instances of shared containers used within jobs.

You can create a shared container from scratch, or place a set of existing stages and links within a shared container.

2.1 Create a shared container from an existing job design

(1.) Click the first stage, then press Shift and click the other stages and links you want to add to the container.


(2.) From the Menu bar, select Edit ➤ Construct Container ➤ Shared. You will be prompted for a name for the container by the Create New dialog box. The group is replaced by a Shared Container stage of the appropriate type with the specified name in the Diagram window.




Any parameters occurring in the components are copied to the shared container as container parameters. The instance created has all its parameters assigned to corresponding job parameters.

(3.) Modify or View a Shared Container

Select File ➤ Open from the Menu bar and select the Shared Container that you want to open. You can also highlight the Shared Container, right-click, and select Properties.

2.2 Use a Shared Container

(1.) Drag the Shared Container icon from the Shared Containers branch in the Repository window to the job’s Diagram window.

(2.) Update the Input and Output tabs.

-- Map to Container Link.
Choose the link within the shared container to which the incoming job link will be mapped. Changing the link triggers a validation process; you will be warned if the meta data does not match and offered the option of reconciling the meta data as described below.

-- Columns page
Columns page shows the meta data defined for the job stage link in a standard grid. You can use the Reconcile option on the Load Shared Containers button to overwrite meta data on the job stage link with the container link meta data in the same way as described for the Validate option.

Multiple Job Compile in DataStage

Multiple job compilation is used to compile multiple jobs at a time. Follow the steps: 

Step 1: Log into DS Designer and go to 
‘Tools -> Multiple Job Compile’ 

Step 2: Then select the type of items you want to compile (jobs, routines, and so on) and, to pick individual items, select the Show manual selection page option.


Step 3: Move the selected jobs or routines from Project contents to Selected items.

Step 4: Choose Force Compile if required, then Start Compile in the last window.

Step 5: After all the selected jobs have been compiled, DataStage generates a report if that option is checked.


Advantages and disadvantages of using the Multiple Job compile are as follows: 

a) You can compile any number of jobs at a time. Just select the jobs you want to compile as shown above and you can attend to other work; in the meantime, the compilation process runs in the background.

b) One disadvantage of multiple job compile is that it takes much more time compared to compiling jobs individually, so I would suggest individual compilation if you need to compile only 2 or 3 jobs.

c) Another disadvantage is that you need to leave the jobs alone while they are being compiled. If you open a job mid-compilation, the compilation for that particular job will fail and you will need to compile it again.

I suggest using Multiple Job Compile when you need to compile jobs in bulk.

Tips & Tricks for debugging a DataStage job



The article talks about DataStage debugging techniques. These can be applied to a job that
  • is not producing proper output data, or
  • is aborting or generating warnings.
  1. Use the Data Set Management utility, which is available in the Tools menu of the DataStage Designer or the DataStage Manager, to examine the schema, look at row counts, and delete a Parallel Data Set. You can also view the data itself.
  2. Check the DataStage job log for warnings or abort messages. These may indicate an underlying logic problem or an unexpected data type conversion. Check all the messages; PX jobs almost always generate a lot of warnings in addition to the actual problem area.
  3. Run the job with message handling (both job level and project level) disabled to find out if any warnings have been unnecessarily converted to information messages or dropped from the logs.
  4. Enable APT_DUMP_SCORE to see how the different stages are combined. Some errors/logs mention that the error is in an APT_CombinedOperatorController stage; the stages that form part of the APT_CombinedOperatorController can be found in the dump score created after enabling this environment variable.
    This environment variable causes DataStage to add one log entry that shows how stages are combined into operators and what virtual datasets are used. It also shows how the operators are partitioned and how many partitions are created.
  5. One can also enable the APT_RECORD_COUNTS environment variable, as well as OSH_PRINT_SCHEMAS to ensure that the runtime schema of a job matches the design-time schema that was expected (see the shell sketch after this list).
  6. Sometimes the underlying data contains special characters (such as null characters) in the database or files, and this can also cause trouble during execution. If the data is in a table or dataset, export it to a sequential file (using a DS job), then use the command "cat -tev" or "od -xc" to find the special characters, as shown in the sketch after this list.
  7. One can also use "wc -lc filename", which displays the number of lines and characters in the specified ASCII text file. Sometimes this is also useful.
  8. Modular approach: if the job is very bulky with many stages in it and you are unable to locate the error, one option is the modular approach, where the execution is done step by step. E.g. if a job has 10 stages, create a copy of the job, keep only (say) the first 3 stages, and run it. Check the result and, if it is fine, add a few more stages (maybe one or two) and run the job again. Repeat this until the error is located.
  9. Partitioned approach with data: this approach is very useful if the job runs fine for some sets of data and fails for others, or fails for a large number of rows. Here, one runs the job on a selected number of rows and/or partitions using the DataStage @INROWNUM (and @PARTITIONNUM in PX) variables. E.g. a job runs fine with 10K rows and fails with 1M rows. One can use @INROWNUM and run the job for, say, the first 0.25 million rows; if those are fine, then from 0.26 million to 0.5 million, and so on.
    Please note, if the job is a parallel job, one also has to consider the number of partitions in the job.
  10. Another option in such a case is to run the job on only one node (for example, by setting APT_EXECUTION_MODE to sequential, or by using a configuration file with one node).
  11. Execution mode: sometimes, if the partitions are confusing, one can run the job in sequential mode. There are two ways to achieve this:
    1. Use the environment variable APT_EXECUTION_MODE and set it to sequential mode.
    2. Use a configuration file with only one node.
  12. A parallel job fails and the errors do not tell which row it failed for: in this case, if the job is simple, we can try to build an equivalent server job and run it. Server jobs can report errors along with the rows that are in error. This is very useful when DB errors such as primary/unique key violations, or any other DB error, are reported by a PX job.
  13. Sometimes, when dealing with a DB, if rows are not getting loaded as expected, adding reject links to the DB stages can help locate the rows with issues.
  14. In a big job, adding some intermediate datasets/Peek stages to find out the data values at certain points can help. E.g. if there are 10 stages and the output then goes to a dataset, there may be different operations done at different stages. After 2 or 3 stages, add Peek stages or send data to datasets using Copy stages, then check the values at these intermediate points and see if they shed some light on the issue.
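
Below is a minimal shell sketch of the environment-variable and special-character checks mentioned in points 4 to 7 above. The file name suspect_data.txt is a made-up example, and in practice these variables are normally set as job or project environment variables in the Administrator/Designer rather than exported in a shell, so treat this purely as an illustration.

# Illustrative only: normally set these as job/project environment variables.
export APT_DUMP_SCORE=True        # log how stages are combined into operators
export APT_RECORD_COUNTS=True     # log per-operator record counts
export OSH_PRINT_SCHEMAS=True     # log runtime schemas to compare with design-time schemas

# After exporting the suspect rows to a sequential file (hypothetical name),
# look for non-printing/special characters:
cat -tev suspect_data.txt | head -n 20   # tabs show as ^I, line ends as $
od -xc suspect_data.txt | head -n 20     # hex plus character dump

# Line and character counts of the exported file:
wc -lc suspect_data.txt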

How to delete DataStage jobs at the command line




1.    Log in to the DataStage Administrator. Select the Project and click the Command button. Then execute the following command:

LIST DS_JOBS <job_name>

2.    Go to the project directory and list all files with the job number returned from item 1.

On Unix/Linux execute,
"ls | grep <job_number>"

3.    On Windows use search in Windows Explorer or from command (DOS) prompt use
"dir *<job number>".

4.    This should output something like:
DS_TEMPnn, RT_BPnn, RT_BPnn.O, RT_CONFIGnn, RT_LOGnn, RT_STATUSnn, RT_SCnn
where nn is the job number.


5.    Delete all the files found in step 2.

On Unix/Linux
"rm -r <FileName>nn", e.g. "rm -r DS_TEMP51".

On Windows, delete these from Windows Explorer or from the command (DOS) prompt execute:
"del <FileName>nn", e.g. "del DS_TEMP51"

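The Unix deletions can also be scripted in one pass. This is only a sketch, using the example job number 51 from step 5; verify the file list from step 4 before deleting anything.

# Remove the runtime files for example job number 51, run from the project directory.
JOBNO=51
for f in DS_TEMP${JOBNO} RT_BP${JOBNO} RT_BP${JOBNO}.O RT_CONFIG${JOBNO} RT_LOG${JOBNO} RT_STATUS${JOBNO} RT_SC${JOBNO}; do
    [ -e "$f" ] && rm -r "$f"     # skip any name that does not exist
done
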
6.    In DataStage Administrator command window execute the commands below one by one:

 DELETE VOC DS_TEMPnn
DELETE VOC RT_BPnn
DELETE VOC RT_BPnn.O
DELETE VOC RT_CONFIGnn
DELETE VOC RT_LOGnn
DELETE VOC RT_STATUSnn
DELETE VOC RT_SCnn
DELETE DS_JOBS job_name

Difference Between The Continuous Funnel And Sort Funnel

# Continuous Funnel combines the records of the input data in no guaranteed order. It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.

# Sort Funnel combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys.



# Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.

For all methods the meta data of all input data sets must be identical.

The sort funnel method has some particular requirements about its input data. All input data sets must be sorted by the same key columns that are to be used by the Funnel operation.

Typically all input data sets for a sort funnel operation are hash-partitioned before they're sorted (choosing the auto partitioning method will ensure that this is done). Hash partitioning guarantees that all records with the same key column values are located in the same partition and so are processed on the same node. If sorting and partitioning are carried out on separate stages before the Funnel stage, this partitioning must be preserved.

The sort funnel operation allows you to set one primary key and multiple secondary keys. The Funnel stage first examines the primary key in each input record. For multiple records with the same primary key value, it then examines secondary keys to determine the order of records it will output.

Dataset


Inside an InfoSphere DataStage parallel job, data is moved around in data sets. These carry meta data with them, both column definitions and information about the configuration that was in effect when the data set was created. If, for example, you have a stage that limits execution to a subset of available nodes, and the data set was created by a stage using all nodes, InfoSphere DataStage can detect that the data will need repartitioning.

If required, data sets can be landed as persistent data sets, represented by a Data Set stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file (note that you should not attempt to manipulate these files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere DataStage).
There are two groups of Datasets - persistent and virtual.
Persistent Datasets are marked with the *.ds extension, while the *.v extension is reserved for virtual Datasets. (It is worth mentioning that no *.v files are visible in the Unix file system, since they exist only virtually, in RAM. The *.v extension itself is specific to OSH, the Orchestrate scripting language.)
The further differences are more significant. Primarily, persistent Datasets are stored in Unix files using the internal DataStage EE format, while virtual Datasets are never stored on disk - they exist within links, also in EE format, but in RAM. Finally, persistent Datasets are readable and rewritable with the Data Set stage, while virtual Datasets can only be passed through in memory.

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments.

Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single job. So a segment can contain files from many partitions, and a partition has files from many segments.

Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo the same processes and modifications; in other words, all of them must go through the same successive stages.
Secondly, it should be expected that different Datasets usually have different schemas, so they cannot be treated in the same way.

Alias names of Datasets are

1) Orchestrate File
2) Operating System file

A Dataset consists of multiple files:
a) Descriptor File
b) Data File
c) Control file
d) Header Files

In the Descriptor File, we can see the schema details and the address of the data.
In the Data File, we can see the data in native format.
The Control and Header files reside in the operating system.




Starting a Dataset Manager

Choose Tools ➤ Data Set Management; a Browse Files dialog box appears:
  1. Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.
  2. Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.
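
If you prefer the command line, the orchadmin utility on the engine tier offers similar functionality. The data set name customers.ds below is a made-up example, and the available options can vary between versions, so treat this as a hedged sketch rather than a reference:

# Run on the engine tier with the parallel environment (e.g. APT_CONFIG_FILE) set up.
orchadmin describe customers.ds    # show the schema and configuration of the data set
orchadmin dump customers.ds        # dump the records to standard output
orchadmin rm customers.ds          # delete the descriptor file and all its data files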