Thursday, April 18, 2013

Unix-Vi Editor


There are many ways to edit files in Unix, and for me one of the best is the screen-oriented text editor vi. This editor enables you to edit lines in context with the other lines in the file.
Nowadays you will often find an improved version of the vi editor called VIM, where VIM stands for Vi IMproved.
vi is generally considered the de facto standard among Unix editors because:
  • It is usually available on all flavors of Unix.
  • Its implementations are very similar across the board.
  • It requires very few resources.
  • It is more user-friendly than older editors such as ed or ex.
You can use the vi editor to edit an existing file or to create a new file from scratch. You can also use this editor simply to read a text file.

Starting the vi Editor:

You can start the vi editor in the following ways:
Command          Description
vi filename      Creates a new file if it does not already exist; otherwise opens the existing file.
vi -R filename   Opens an existing file in read-only mode.
view filename    Opens an existing file in read-only mode.
The following example creates a new file called testfile, assuming it does not already exist in the current working directory:
$vi testfile
As a result, you will see a screen similar to the following:
|
~
~
~
~
~
~
~
~
~
~
~
~
"testfile" [New File]    
You will notice a tilde (~) on each line following the cursor. A tilde represents an unused line. If a line does not begin with a tilde and appears to be blank, there is a space, tab, newline, or some other nonviewable character present.
So now you have a file open to start with. Before proceeding further, let us understand a few minor but important concepts explained below.

Operation Modes:

While working with the vi editor, you will come across the following two modes:
  1. Command mode: This mode enables you to perform administrative tasks such as saving files, executing commands, moving the cursor, cutting (yanking) and pasting lines or words, and finding and replacing. In this mode, whatever you type is interpreted as a command.
  2. Insert mode: This mode enables you to insert text into the file. Everything typed in this mode is interpreted as input and is ultimately put in the file.
vi always starts in command mode. To enter text, you must be in insert mode; to enter insert mode, simply type i. To get out of insert mode, press the Esc key, which puts you back into command mode.
Hint: If you are not sure which mode you are in, press the Esc key twice; you will then be in command mode. Open a file with the vi editor, start typing a few characters, and then return to command mode to see the difference.

Getting Out of vi:

The command to quit out of vi is :q. Once in command mode, type a colon and q, followed by return. If your file has been modified in any way, the editor will warn you of this and not let you quit. To ignore the warning, the command to quit vi without saving is :q!. This lets you exit vi without saving any of the changes.
The command to save the contents of the editor is :w. You can combine it with the quit command by typing :wq and return.
The easiest way to save your changes and exit vi is the ZZ command. When you are in command mode, type ZZ and it will do the equivalent of :wq.
You can specify a different file name to save to by giving the name after the :w. For example, if you wanted to save the file you were working on under another filename, filename2, you would type :w filename2 and return. Try it once.
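For example, a typical sequence to save your work under the new name filename2 and then leave the editor (assuming you start in command mode) would be:
   :w filename2        writes the current buffer to filename2
   :q                  quits the editor
If you had made further changes after the write, :wq would save and quit in one step, and :q! would quit and discard them.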

Moving within a File:

To move around within a file without affecting your text, you must be in command mode (press Esc twice). Here are some of the commands you can use to move around one line or one character at a time:
Command   Description
k         Moves the cursor up one line.
j         Moves the cursor down one line.
h         Moves the cursor to the left one character position.
l         Moves the cursor to the right one character position.
There are two important points to note:
  • vi is case-sensitive, so pay special attention to capitalization when using commands.
  • Most commands in vi can be prefaced by the number of times you want the action to occur. For example, 2j moves the cursor two lines down from its current location, as shown below.
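Assuming you are in command mode, each of the following counted commands simply repeats the basic movement the given number of times:
   2j        moves the cursor two lines down
   5k        moves the cursor five lines up
   3h        moves the cursor three characters to the left
   4l        moves the cursor four characters to the right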
There are many other ways to move within a file in vi. Remember that you must be in command mode (press Esc twice). Here are some more commands you can use to move around the file:
Command   Description
0 or |    Positions the cursor at the beginning of the line.
$         Positions the cursor at the end of the line.
w         Positions the cursor at the next word.
b         Positions the cursor at the previous word.
(         Positions the cursor at the beginning of the current sentence.
)         Positions the cursor at the beginning of the next sentence.
E         Moves to the end of a blank-delimited word.
{         Moves a paragraph back.
}         Moves a paragraph forward.
[[        Moves a section back.
]]        Moves a section forward.
n|        Moves to column n in the current line.
1G        Moves to the first line of the file.
G         Moves to the last line of the file.
nG        Moves to the nth line of the file.
:n        Moves to the nth line of the file.
fc        Moves forward to the character c.
Fc        Moves back to the character c.
H         Moves to the top of the screen.
nH        Moves to the nth line from the top of the screen.
M         Moves to the middle of the screen.
L         Moves to the bottom of the screen.
nL        Moves to the nth line from the bottom of the screen.
:x        A colon followed by a number positions the cursor on the line number represented by x.

Control Commands:

The following useful commands can be used along with the Control key:
Command   Description
CTRL+d    Moves forward 1/2 screen.
CTRL+u    Moves backward 1/2 screen.
CTRL+f    Moves forward one full screen.
CTRL+b    Moves backward one full screen.
CTRL+e    Moves the screen up one line.
CTRL+y    Moves the screen down one line.
CTRL+L    Redraws the screen.

Editing Files:

To edit a file, you need to be in insert mode. There are many ways to enter insert mode from command mode:
Command   Description
i         Inserts text before the current cursor location.
I         Inserts text at the beginning of the current line.
a         Inserts text after the current cursor location.
A         Inserts text at the end of the current line.
o         Creates a new line for text entry below the cursor location.
O         Creates a new line for text entry above the cursor location.
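For example, a minimal sequence to add a new line of text below the current line would be:
   o                    opens a new line below the cursor and switches to insert mode
   This is a new line.  typed as ordinary input while in insert mode
   Esc                  returns to command mode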

Deleting Characters:

Here is the list of important commands which can be used to delete characters and lines in an opened file:
Command   Description
x         Deletes the character under the cursor location.
X         Deletes the character before the cursor location.
dw        Deletes from the current cursor location to the next word.
d^        Deletes from the current cursor position to the beginning of the line.
d$        Deletes from the current cursor position to the end of the line.
D         Deletes from the cursor position to the end of the current line.
dd        Deletes the line the cursor is on.
As mentioned above, most commands in vi can be prefaced by the number of times you want the action to occur. For example, 2x deletes two characters starting at the cursor location and 2dd deletes two lines starting with the line the cursor is on.
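A few more counted delete commands, assuming the cursor is on the first line or word you want to remove:
   3dd       deletes three lines, starting with the current line
   2dw       deletes the next two words
   d$        deletes from the cursor to the end of the line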
I would highly recommend practicing all the above commands before proceeding further.

Change Commands:

You also have the capability to change characters, words, or lines in vi without deleting them. Here are the relevant commands:
Command   Description
cc        Removes the contents of the line, leaving you in insert mode.
cw        Changes the word the cursor is on, from the cursor to the end of the word.
r         Replaces the character under the cursor. vi returns to command mode after the replacement is entered.
R         Overwrites multiple characters beginning with the character currently under the cursor. You must press Esc to stop the overwriting.
s         Replaces the current character with the character you type. Afterward, you are left in insert mode.
S         Deletes the line the cursor is on and replaces it with new text. After the new text is entered, vi remains in insert mode.
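For example, to replace a single misspelled word, position the cursor on its first character and type:
   cw          deletes to the end of the word and switches to insert mode
   corrected   the replacement text you type
   Esc         returns to command mode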

Copy and Paste Commands:

You can copy (yank) lines or words from one place and then paste them at another place using the following commands:
Command   Description
yy        Copies the current line.
yw        Copies the current word from the character the cursor is on until the end of the word.
p         Puts the copied text after the cursor.
P         Puts the yanked text before the cursor.
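For example, to duplicate a block of three lines, place the cursor on the first of them and type:
   3yy       yanks (copies) three lines, starting with the current line
   j         moves the cursor to where the copy should go (any movement command works)
   p         puts the yanked lines after the current line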

Advanced Commands:

There are some advanced commands that simplify day-to-day editing and allow for more efficient use of vi:
Command      Description
J            Joins the current line with the next one. A count joins that many lines.
<<           Shifts the current line to the left by one shift width.
>>           Shifts the current line to the right by one shift width.
~            Switches the case of the character under the cursor.
^G           Press the CTRL and G keys at the same time to show the current filename and status.
U            Restores the current line to the state it was in before the cursor entered the line.
u            Undoes the last change to the file. Typing u again re-does the change.
:f           Displays the current position in the file as a percentage, the file name, and the total number of lines in the file.
:f filename  Renames the current file to filename.
:w filename  Writes to the file filename.
:e filename  Opens another file with the name filename.
:cd dirname  Changes the current working directory to dirname.
:e #         Toggles between two open files.
:n           When multiple files are open in vi, goes to the next file in the series.
:p           When multiple files are open in vi, goes to the previous file in the series.
:N           When multiple files are open in vi, goes to the previous file in the series.
:r file      Reads file and inserts it after the current line.
:nr file     Reads file and inserts it after line n.
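As an illustration of the multi-file commands, assuming two existing files file1 and file2:
   $ vi file1 file2    opens both files; editing starts in file1
   :n                  moves to the next file in the series (file2)
   :N                  moves back to the previous file (file1)
   :e #                toggles between the two files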

Word and Character Searching:

The vi editor has two kinds of searches: string and character. For a string search, the / and ? commands are used. When you start one of these commands, the character you typed appears on the bottom line, where you then type the particular string to look for.
These two commands differ only in the direction where the search takes place:
  • The / command searches forwards (downwards) in the file.
  • The ? command searches backwards (upwards) in the file.
The n and N commands repeat the previous search command in the same or the opposite direction, respectively. Some characters have special meanings when used in a search command and must be preceded by a backslash (\) to be included as part of the search expression.
Character   Description
^           Searches at the beginning of the line. (Use at the beginning of a search expression.)
.           Matches a single character.
*           Matches zero or more of the previous character.
$           Matches the end of the line. (Use at the end of the search expression.)
[           Starts a set of matching or non-matching expressions.
<           Put in an expression escaped with a backslash to find the beginning or end of a word.
>           See the '<' character description above.
The character search looks within one line for a character entered after the command. The f and F commands search for a character on the current line only: f searches forwards and F searches backwards, and the cursor moves to the position of the found character.
The t and T commands search for a character on the current line only, but for t, the cursor moves to the position before the character, and T searches the line backwards to the position after the character.
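For example, assuming you are in command mode (the strings and characters below are arbitrary):
   /count    searches forward for the next occurrence of the string count
   n         repeats the search in the same direction
   N         repeats the search in the opposite direction
   fx        moves the cursor to the next x on the current line
   tx        moves the cursor to just before the next x on the current line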

Set Commands:

You can change the look and feel of your vi screen using the following :set commands. To use them, you must be in command mode; type :set followed by any of the following options:
Command     Description
:set ic     Ignores case when searching.
:set ai     Sets autoindent.
:set noai   Unsets autoindent.
:set nu     Displays lines with line numbers on the left side.
:set sw     Sets the width of a software tabstop. For example, you would set a shift width of 4 with the command :set sw=4.
:set ws     If wrapscan is set and the word is not found at the bottom of the file, the search will wrap around and continue from the beginning.
:set wm     If this option has a value greater than zero, the editor will automatically "word wrap". For example, to set the wrap margin to two characters, you would type :set wm=2.
:set ro     Changes the file type to "read only".
:set term   Prints the terminal type.
:set bf     Discards control characters from input.
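For example, a common combination when reviewing source files might be:
   :set nu      shows line numbers
   :set ic      makes searches case-insensitive
   :set wm=2    wraps words two characters from the right margin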

Running Commands:

vi has the capability to run commands from within the editor. To run a command, you only need to go into command mode and type :! followed by the command.
For example, if you want to check whether a file exists before you try to save your file to that filename, you can type :! ls and you will see the output of ls on the screen.
When you press any key (or the command's escape sequence), you are returned to your vi session.

Replacing Text:

The substitution command (:s/) enables you to quickly replace words or groups of words within your files. Here is the simple syntax:
:s/search/replace/g
The g stands for globally. The result of this command is that all occurrences on the cursor's line are changed.
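For example, with the cursor on a line containing several occurrences of the word colour (an arbitrary example string), the following command changes them all on that line to color:
   :s/colour/color/g
Without the trailing g, only the first occurrence on the line would be changed.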

IMPORTANT:

Here are the key points to your success with vi:
  • You must be in command mode to use commands. (Press Esc twice at any time to ensure that you are in command mode.)
  • You must be careful to use the proper case (capitalization) for all commands.
  • You must be in insert mode to enter text.

Wednesday, April 17, 2013

Capturing Unmatched Records from a Join in Data Stage


The Join stage does not provide reject handling for unmatched records (such as in an Inner Join scenario). If unmatched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (e.g. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of whether the key values match.

During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the unmatched columns. Care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE.

 A Filter stage can be used to test for NULL values in unmatched columns.

 In some cases, it is simpler to use a Column Generator to add an ‘indicator’ column, with a constant value, to each of the outer links and test that column for the constant after you have performed the join. This is also handy with Lookups that have multiple reference links.
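As a rough sketch of the indicator-column approach (the stage layout and the column name rec_found are hypothetical, not a prescribed design):
   Copy stage:               makes the right-side columns nullable before the Join
   Column Generator (right): adds a column rec_found with the constant value 1
   Join stage:               Left Outer join on the key columns
   Filter stage:             Where clause rec_found = 1 routes matched rows to one link;
                             the remaining rows (rec_found is NULL) are the unmatched records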

Environment Variable


Data Stage Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise Edition jobs. These settings can be made at the project level, or may be set on an individual basis within the properties for each job.
 Environment Variable Settings For All Jobs

$APT_CONFIG_FILE = filepath
    Specifies the full pathname to the EE configuration file.
$APT_DUMP_SCORE = 1
    Outputs the EE score dump to the DataStage job log, providing detailed information about the actual job flow, including operators, processes, and datasets. Extremely useful for understanding how a job actually ran in the environment. (See section 10.1, Reading a Score Dump.)
$OSH_ECHO = 1
    Includes a copy of the generated osh in the job's DataStage log. Starting with v7, this option is enabled when the "Generated OSH visible for Parallel jobs in ALL projects" option is enabled in DataStage Administrator.
$APT_RECORD_COUNTS = 1
    Outputs record counts to the DataStage job log as each operator completes processing. The count is per operator per partition.
$APT_PM_SHOW_PIDS = 1
    Places entries in the DataStage job log showing the UNIX process ID (PID) for each process started by a job. Does not report PIDs of DataStage "phantom" processes started by Server shared containers.
$APT_BUFFER_MAXIMUM_TIMEOUT = 1
    Maximum buffer delay in seconds.
$APT_THIN_SCORE = 1 (DataStage 7.0 and earlier)
    Only needed for DataStage v7.0 and earlier. Setting this environment variable significantly reduces memory usage for very large (>100 operator) jobs.
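These settings are normally made at the project level in DataStage Administrator or in the job properties, but as a minimal sketch they could also be exported in the shell environment (for example in dsenv) before jobs run; the configuration file path below is hypothetical:
   export APT_CONFIG_FILE=/opt/datastage/configs/default.apt    # hypothetical path to the EE configuration file
   export APT_DUMP_SCORE=1                                      # write the score dump to the job log
   export APT_RECORD_COUNTS=1                                   # log per-operator, per-partition record counts
   export APT_PM_SHOW_PIDS=1                                    # log the UNIX PID of each process started by a job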

Additional Environment Variable Settings
Ascential recommends setting the following environment variables on an as-needed basis. These variables can be used to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific EE stages. 
NOTE: The environment variable settings in this section are only examples. Set values that are optimal to your environment.
 Sequential File Stage Environment Variables
$APT_EXPORT_FLUSH_COUNT = [nrows]
    Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.
$APT_IMPORT_BUFFER_SIZE / $APT_EXPORT_BUFFER_SIZE = [Kbytes]
    Defines the size of the I/O buffer for Sequential File reads (imports) and writes (exports), respectively. The default is 128 (128K), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.
$APT_CONSISTENT_BUFFERIO_SIZE = [bytes]
    In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.
$APT_DELIMITED_READ_SIZE = [bytes]
    Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. This setting should be set to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.
$APT_MAX_DELIMITED_READ_SIZE = [bytes]
    By default, Sequential File (import) will read ahead 500 bytes to get the next delimiter. If it is not found, the importer looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable controls the upper bound, which is 100,000 bytes by default. When more than 500 bytes of read-ahead is desired, use this variable instead of APT_DELIMITED_READ_SIZE.
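As an illustration only (the values are examples, not recommendations), a job reading and writing large sequential files on a heavily-loaded file server might use settings along these lines:
   export APT_IMPORT_BUFFER_SIZE=256    # 256K read buffer for Sequential File imports (default 128)
   export APT_EXPORT_BUFFER_SIZE=256    # 256K write buffer for Sequential File exports (default 128)
   export APT_EXPORT_FLUSH_COUNT=1      # flush every row; useful for realtime feeds, at some I/O cost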

 Oracle Environment Variables

$ORACLE_HOME = [path]
    Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by Oracle scripts.
$ORACLE_SID = [sid]
    Specifies the Oracle service name, corresponding to a TNSNAMES entry.
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num] and $APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]
    These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
$APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]
    Specifies Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).
$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1
    When set, a target Oracle stage with the Load method will limit the number of players to the number of datafiles in the table's tablespace.
$APT_ORA_WRITE_FILES = [filepath]
    Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with the Load method is written to files instead of invoking the Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.
$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
    Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.

 Job Monitoring Environment Variables

$APT_MONITOR_TIME = [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.
$APT_MONITOR_SIZE = [rows]
    Determines the minimum number of records the job monitor reports. The default of 5000 records is usually too small. To minimize the number of messages during large job runs, set this to a higher value (e.g. 1000000).
$APT_NO_JOBMON = 1
    Disables job monitoring completely. In rare instances, this may improve performance. In general, this should only be set on a per-job basis when attempting to resolve performance bottlenecks.
$APT_RECORD_COUNTS = 1
    Prints record counts in the job log as each operator completes processing. The count is per operator per partition.


Data Stage Sequential File Stages (Import and Export) Performance Tuning


Improving Sequential File Performance

If the source file is fixed-width or delimited, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.
 If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage. 
On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in Kbytes, with a default of 128 (128K). Increasing this may improve performance.
Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.

Partitioning Sequential File Reads

Care must be taken to choose the appropriate partitioning method for a Sequential File read:
Don’t read from Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering

By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with increased I/O.

 Reading from and Writing to Fixed-Length Files

Particular attention must be taken when processing fixed-length fields using the Sequential File stage:
If the incoming columns are variable-length data types (eg. Integer, Decimal, Varchar), the field width column property must be set to match the fixed-width of the input column. Double-click on the column number in the grid dialog to set this column property.
 
If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.
 
When writing fixed-length files from variable-length fields (eg. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed-width of the output column. Double-click on the column number in the grid dialog to set this column property.
 To display each field value, use the print_field import property. All import and export properties are listed in chapter 25, Import/Export Properties of the Orchestrate 7.0 Operators Reference.

 Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, these extra characters will be silently truncated.
Starting with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS directs DataStage to reject records with strings longer than their declared maximum column length.
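A minimal sketch, assuming DataStage v7.01 or later:
   export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1    # reject over-length string fields instead of silently truncating them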

Data Stage Transformer Usage Guidelines

Choosing Appropriate Stages

The parallel Transformer stage always generates “C” code which is then compiled to a parallel component. For this reason, it is important to minimize the number of transformers, and to use other stages (Copy, Filter, Switch, etc) when derivations are not needed. 
  • The Copy stage should be used instead of a Transformer for simple operations, including:
      • a job design placeholder between stages (unless the Force option is set to true, EE will optimize this out at runtime)
      • renaming columns
      • dropping columns
      • default type conversions

Note that rename, drop (if runtime column propagation is disabled), and default type conversion can also be performed by the output mapping tab of any stage.
NEVER use the "BASIC Transformer" stage in large-volume job flows. Instead, user-defined functions and routines can expand parallel Transformer capabilities.
Consider, if possible, implementing complex derivation expressions that follow regular patterns with a Lookup table instead of using a Transformer with nested derivations.
For example, the derivation expression:

If A=0,1,2,3 Then B="X"; If A=4,5,6,7 Then B="C"
Could be implemented with a lookup table containing values for column A and corresponding values of column B.
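For example, the lookup (reference) table could simply hold each value of A alongside the corresponding value of B:
   A    B
   0    X
   1    X
   2    X
   3    X
   4    C
   5    C
   6    C
   7    C
A Lookup stage keyed on column A would then return B directly, with no nested derivation logic in a Transformer.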
Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.
In v7 and later, the Filter and/or Switch stages can be used to separate rows into multiple output links based on SQL-like link constraint expressions. 
In v7 and later, the Modify stage can be used for non-default type conversions, null handling, and character string trimming.
 Buildops should be used instead of Transformers in the handful of scenarios where complex reusable logic is required, or where existing Transformer-based job flows do not meet performance requirements.

Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose “Convert to Reject”.
The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:
If ISNULL(link.col) Then… Else…
Note that if an incoming column is only used in a pass-through derivation, the Transformer will allow this row to be output. DataStage release 7 enhances this behavior by placing warnings in the log file when discards occur.

Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer.

For example, TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the incoming column must be type VarChar before it is evaluated in the Transformer.

Conditionally Aborting Jobs

The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows that match the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set the "Abort After Rows" count to the number of rows allowed before the job should be aborted.
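As a sketch (the link name, column name, and threshold below are hypothetical), the abort link might be configured like this:
   Output link:       abort_link, which receives the rows that match the abort rule
   Constraint:        in_link.amount < 0
   Abort After Rows:  50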
Since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables. It is important to set the Sequential File buffer flush or database commit parameters accordingly.