Thursday, April 18, 2013

Unix-Vi Editor


There are many ways to edit files in Unix, and for me one of the best is the screen-oriented text editor vi. This editor enables you to edit lines in context with the other lines in the file.
Nowadays you will often find an improved version of the vi editor called VIM, where VIM stands for Vi IMproved.
vi is generally considered the de facto standard among Unix editors because:
  • It is usually available on all flavors of Unix.
  • Its implementations are very similar across the board.
  • It requires very few resources.
  • It is more user-friendly than older editors such as ed or ex.
You can use the vi editor to edit an existing file or to create a new file from scratch. You can also use this editor simply to read a text file.

Starting the vi Editor:

You can start the vi editor in the following ways:
Command          Description
vi filename      Creates a new file if it does not already exist; otherwise opens the existing file.
vi -R filename   Opens an existing file in read-only mode.
view filename    Opens an existing file in read-only mode.
The following example creates a new file called testfile, assuming it does not already exist in the current working directory:
$vi testfile
As a result, you will see a screen similar to the following:
|
~
~
~
~
~
~
~
~
~
~
~
~
"testfile" [New File]    
You will notice a tilde (~) on each line following the cursor. A tilde represents an unused line. If a line does not begin with a tilde and appears to be blank, there is a space, tab, newline, or some other nonviewable character present.
So now you have a file open to start with. Before proceeding further, let us understand a few minor but important concepts explained below.

Operation Modes:

While working with the vi editor, you will come across the following two modes:
  1. Command mode: This mode enables you to perform administrative tasks such as saving files, executing commands, moving the cursor, cutting (yanking) and pasting lines or words, and finding and replacing. In this mode, whatever you type is interpreted as a command.
  2. Insert mode: This mode enables you to insert text into the file. Everything typed in this mode is interpreted as input and is ultimately put in the file.
vi always starts in command mode. To enter text, you must be in insert mode; to enter insert mode, simply type i. To get out of insert mode, press the Esc key, which puts you back into command mode.
Hint: If you are not sure which mode you are in, press the Esc key twice; you will then be in command mode. Open a file with the vi editor, start typing a few characters, and then return to command mode to see the difference.

Getting Out of vi:

The command to quit out of vi is :q. Once in command mode, type a colon and q, followed by return. If your file has been modified in any way, the editor will warn you of this and not let you quit. To ignore the warning, the command to quit vi without saving is :q!. This lets you exit vi without saving any of the changes.
The command to save the contents of the editor is :w. You can combine it with the quit command by typing :wq and return.
The easiest way to save your changes and exit vi is the ZZ command. When you are in command mode, type ZZ and it will do the equivalent of :wq.
You can specify a different file name to save to by giving the name after the :w. For example, if you wanted to save the file you were working on under another filename, filename2, you would type :w filename2 and return. Try it once.
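For example, a typical sequence to save your work under the new name filename2 and then leave the editor (assuming you start in command mode) would be:
   :w filename2        writes the current buffer to filename2
   :q                  quits the editor
If you had made further changes after the write, :wq would save and quit in one step, and :q! would quit and discard them.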

Moving within a File:

To move around within a file without affecting your text, you must be in command mode (press Esc twice). Here are some of the commands you can use to move around one line or one character at a time:
Command   Description
k         Moves the cursor up one line.
j         Moves the cursor down one line.
h         Moves the cursor to the left one character position.
l         Moves the cursor to the right one character position.
There are two important points to note:
  • vi is case-sensitive, so pay special attention to capitalization when using commands.
  • Most commands in vi can be prefaced by the number of times you want the action to occur. For example, 2j moves the cursor two lines down from its current location, as shown below.
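Assuming you are in command mode, each of the following counted commands simply repeats the basic movement the given number of times:
   2j        moves the cursor two lines down
   5k        moves the cursor five lines up
   3h        moves the cursor three characters to the left
   4l        moves the cursor four characters to the right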
There are many other ways to move within a file in vi. Remember that you must be in command mode (press Esc twice). Here are some more commands you can use to move around the file:
Command   Description
0 or |    Positions the cursor at the beginning of the line.
$         Positions the cursor at the end of the line.
w         Positions the cursor at the next word.
b         Positions the cursor at the previous word.
(         Positions the cursor at the beginning of the current sentence.
)         Positions the cursor at the beginning of the next sentence.
E         Moves to the end of a blank-delimited word.
{         Moves a paragraph back.
}         Moves a paragraph forward.
[[        Moves a section back.
]]        Moves a section forward.
n|        Moves to column n in the current line.
1G        Moves to the first line of the file.
G         Moves to the last line of the file.
nG        Moves to the nth line of the file.
:n        Moves to the nth line of the file.
fc        Moves forward to the character c.
Fc        Moves back to the character c.
H         Moves to the top of the screen.
nH        Moves to the nth line from the top of the screen.
M         Moves to the middle of the screen.
L         Moves to the bottom of the screen.
nL        Moves to the nth line from the bottom of the screen.
:x        A colon followed by a number positions the cursor on the line number represented by x.

Control Commands:

The following useful commands can be used along with the Control key:
Command   Description
CTRL+d    Moves forward 1/2 screen.
CTRL+u    Moves backward 1/2 screen.
CTRL+f    Moves forward one full screen.
CTRL+b    Moves backward one full screen.
CTRL+e    Moves the screen up one line.
CTRL+y    Moves the screen down one line.
CTRL+L    Redraws the screen.

Editing Files:

To edit a file, you need to be in insert mode. There are many ways to enter insert mode from command mode:
Command   Description
i         Inserts text before the current cursor location.
I         Inserts text at the beginning of the current line.
a         Inserts text after the current cursor location.
A         Inserts text at the end of the current line.
o         Creates a new line for text entry below the cursor location.
O         Creates a new line for text entry above the cursor location.
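For example, a minimal sequence to add a new line of text below the current line would be:
   o                    opens a new line below the cursor and switches to insert mode
   This is a new line.  typed as ordinary input while in insert mode
   Esc                  returns to command mode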

Deleting Characters:

Here is the list of important commands which can be used to delete characters and lines in an opened file:
Command   Description
x         Deletes the character under the cursor location.
X         Deletes the character before the cursor location.
dw        Deletes from the current cursor location to the next word.
d^        Deletes from the current cursor position to the beginning of the line.
d$        Deletes from the current cursor position to the end of the line.
D         Deletes from the cursor position to the end of the current line.
dd        Deletes the line the cursor is on.
As mentioned above, most commands in vi can be prefaced by the number of times you want the action to occur. For example, 2x deletes two characters starting at the cursor location and 2dd deletes two lines starting with the line the cursor is on.
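A few more counted delete commands, assuming the cursor is on the first line or word you want to remove:
   3dd       deletes three lines, starting with the current line
   2dw       deletes the next two words
   d$        deletes from the cursor to the end of the line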
I would highly recommend practicing all the above commands before proceeding further.

Change Commands:

You also have the capability to change characters, words, or lines in vi without deleting them. Here are the relevant commands:
Command   Description
cc        Removes the contents of the line, leaving you in insert mode.
cw        Changes the word the cursor is on, from the cursor to the end of the word.
r         Replaces the character under the cursor. vi returns to command mode after the replacement is entered.
R         Overwrites multiple characters beginning with the character currently under the cursor. You must press Esc to stop the overwriting.
s         Replaces the current character with the character you type. Afterward, you are left in insert mode.
S         Deletes the line the cursor is on and replaces it with new text. After the new text is entered, vi remains in insert mode.
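For example, to replace a single misspelled word, position the cursor on its first character and type:
   cw          deletes to the end of the word and switches to insert mode
   corrected   the replacement text you type
   Esc         returns to command mode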

Copy and Paste Commands:

You can copy (yank) lines or words from one place and then paste them at another place using the following commands:
Command   Description
yy        Copies the current line.
yw        Copies the current word from the character the cursor is on until the end of the word.
p         Puts the copied text after the cursor.
P         Puts the yanked text before the cursor.
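For example, to duplicate a block of three lines, place the cursor on the first of them and type:
   3yy       yanks (copies) three lines, starting with the current line
   j         moves the cursor to where the copy should go (any movement command works)
   p         puts the yanked lines after the current line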

Advanced Commands:

There are some advanced commands that simplify day-to-day editing and allow for more efficient use of vi:
Command      Description
J            Joins the current line with the next one. A count joins that many lines.
<<           Shifts the current line to the left by one shift width.
>>           Shifts the current line to the right by one shift width.
~            Switches the case of the character under the cursor.
^G           Press the CTRL and G keys at the same time to show the current filename and status.
U            Restores the current line to the state it was in before the cursor entered the line.
u            Undoes the last change to the file. Typing u again re-does the change.
:f           Displays the current position in the file as a percentage, the file name, and the total number of lines in the file.
:f filename  Renames the current file to filename.
:w filename  Writes to the file filename.
:e filename  Opens another file with the name filename.
:cd dirname  Changes the current working directory to dirname.
:e #         Toggles between two open files.
:n           When multiple files are open in vi, goes to the next file in the series.
:p           When multiple files are open in vi, goes to the previous file in the series.
:N           When multiple files are open in vi, goes to the previous file in the series.
:r file      Reads file and inserts it after the current line.
:nr file     Reads file and inserts it after line n.
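As an illustration of the multi-file commands, assuming two existing files file1 and file2:
   $ vi file1 file2    opens both files; editing starts in file1
   :n                  moves to the next file in the series (file2)
   :N                  moves back to the previous file (file1)
   :e #                toggles between the two files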

Word and Character Searching:

The vi editor has two kinds of searches: string and character. For a string search, the / and ? commands are used. When you start one of these commands, the character you typed appears on the bottom line, where you then type the particular string to look for.
These two commands differ only in the direction where the search takes place:
  • The / command searches forwards (downwards) in the file.
  • The ? command searches backwards (upwards) in the file.
The n and N commands repeat the previous search command in the same or the opposite direction, respectively. Some characters have special meanings when used in a search command and must be preceded by a backslash (\) to be included as part of the search expression.
Character   Description
^           Searches at the beginning of the line. (Use at the beginning of a search expression.)
.           Matches a single character.
*           Matches zero or more of the previous character.
$           Matches the end of the line. (Use at the end of the search expression.)
[           Starts a set of matching or non-matching expressions.
<           Put in an expression escaped with a backslash to find the beginning or end of a word.
>           See the '<' character description above.
The character search looks within one line for a character entered after the command. The f and F commands search for a character on the current line only: f searches forwards and F searches backwards, and the cursor moves to the position of the found character.
The t and T commands search for a character on the current line only, but for t, the cursor moves to the position before the character, and T searches the line backwards to the position after the character.
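For example, assuming you are in command mode (the strings and characters below are arbitrary):
   /count    searches forward for the next occurrence of the string count
   n         repeats the search in the same direction
   N         repeats the search in the opposite direction
   fx        moves the cursor to the next x on the current line
   tx        moves the cursor to just before the next x on the current line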

Set Commands:

You can change the look and feel of your vi screen using the following :set commands. To use them, you must be in command mode; type :set followed by any of the following options:
Command     Description
:set ic     Ignores case when searching.
:set ai     Sets autoindent.
:set noai   Unsets autoindent.
:set nu     Displays lines with line numbers on the left side.
:set sw     Sets the width of a software tabstop. For example, you would set a shift width of 4 with the command :set sw=4.
:set ws     If wrapscan is set and the word is not found at the bottom of the file, the search will wrap around and continue from the beginning.
:set wm     If this option has a value greater than zero, the editor will automatically "word wrap". For example, to set the wrap margin to two characters, you would type :set wm=2.
:set ro     Changes the file type to "read only".
:set term   Prints the terminal type.
:set bf     Discards control characters from input.
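For example, a common combination when reviewing source files might be:
   :set nu      shows line numbers
   :set ic      makes searches case-insensitive
   :set wm=2    wraps words two characters from the right margin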

Running Commands:

vi has the capability to run commands from within the editor. To run a command, you only need to go into command mode and type :! followed by the command.
For example, if you want to check whether a file exists before you try to save your file to that filename, you can type :! ls and you will see the output of ls on the screen.
When you press any key (or the command's escape sequence), you are returned to your vi session.

Replacing Text:

The substitution command (:s/) enables you to quickly replace words or groups of words within your files. Here is the simple syntax:
:s/search/replace/g
The g stands for globally. The result of this command is that all occurrences on the cursor's line are changed.
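For example, with the cursor on a line containing several occurrences of the word colour (an arbitrary example string), the following command changes them all on that line to color:
   :s/colour/color/g
Without the trailing g, only the first occurrence on the line would be changed.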

IMPORTANT:

Here are the key points to your success with vi:
  • You must be in command mode to use commands. (Press Esc twice at any time to ensure that you are in command mode.)
  • You must be careful to use the proper case (capitalization) for all commands.
  • You must be in insert mode to enter text.

Wednesday, April 17, 2013

Capturing Unmatched Records from a Join in Data Stage


The Join stage does not provide reject handling for unmatched records (such as in an Inner Join scenario). If unmatched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (e.g. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of whether the key values match.

During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the unmatched columns. Care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE.

 A Filter stage can be used to test for NULL values in unmatched columns.

 In some cases, it is simpler to use a Column Generator to add an ‘indicator’ column, with a constant value, to each of the outer links and test that column for the constant after you have performed the join. This is also handy with Lookups that have multiple reference links.
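As a rough sketch of the indicator-column approach (the stage layout and the column name rec_found are hypothetical, not a prescribed design):
   Copy stage:               makes the right-side columns nullable before the Join
   Column Generator (right): adds a column rec_found with the constant value 1
   Join stage:               Left Outer join on the key columns
   Filter stage:             Where clause rec_found = 1 routes matched rows to one link;
                             the remaining rows (rec_found is NULL) are the unmatched records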

Environment Variable


Data Stage Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise Edition jobs. These settings can be made at the project level, or may be set on an individual basis within the properties for each job.
 Environment Variable Settings For All Jobs

$APT_CONFIG_FILE = filepath
    Specifies the full pathname to the EE configuration file.
$APT_DUMP_SCORE = 1
    Outputs the EE score dump to the DataStage job log, providing detailed information about the actual job flow, including operators, processes, and datasets. Extremely useful for understanding how a job actually ran in the environment. (See section 10.1, Reading a Score Dump.)
$OSH_ECHO = 1
    Includes a copy of the generated osh in the job's DataStage log. Starting with v7, this option is enabled when the "Generated OSH visible for Parallel jobs in ALL projects" option is enabled in DataStage Administrator.
$APT_RECORD_COUNTS = 1
    Outputs record counts to the DataStage job log as each operator completes processing. The count is per operator per partition.
$APT_PM_SHOW_PIDS = 1
    Places entries in the DataStage job log showing the UNIX process ID (PID) for each process started by a job. Does not report PIDs of DataStage "phantom" processes started by Server shared containers.
$APT_BUFFER_MAXIMUM_TIMEOUT = 1
    Maximum buffer delay in seconds.
$APT_THIN_SCORE = 1 (DataStage 7.0 and earlier)
    Only needed for DataStage v7.0 and earlier. Setting this environment variable significantly reduces memory usage for very large (>100 operator) jobs.
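These settings are normally made at the project level in DataStage Administrator or in the job properties, but as a minimal sketch they could also be exported in the shell environment (for example in dsenv) before jobs run; the configuration file path below is hypothetical:
   export APT_CONFIG_FILE=/opt/datastage/configs/default.apt    # hypothetical path to the EE configuration file
   export APT_DUMP_SCORE=1                                      # write the score dump to the job log
   export APT_RECORD_COUNTS=1                                   # log per-operator, per-partition record counts
   export APT_PM_SHOW_PIDS=1                                    # log the UNIX PID of each process started by a job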

Additional Environment Variable Settings
Ascential recommends setting the following environment variables on an as-needed basis. These variables can be used to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific EE stages. 
NOTE: The environment variable settings in this section are only examples. Set values that are optimal to your environment.
 Sequential File Stage Environment Variables
$APT_EXPORT_FLUSH_COUNT = [nrows]
    Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.
$APT_IMPORT_BUFFER_SIZE / $APT_EXPORT_BUFFER_SIZE = [Kbytes]
    Defines the size of the I/O buffer for Sequential File reads (imports) and writes (exports), respectively. The default is 128 (128K), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.
$APT_CONSISTENT_BUFFERIO_SIZE = [bytes]
    In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.
$APT_DELIMITED_READ_SIZE = [bytes]
    Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. This setting should be set to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.
$APT_MAX_DELIMITED_READ_SIZE = [bytes]
    By default, Sequential File (import) will read ahead 500 bytes to get the next delimiter. If it is not found, the importer looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable controls the upper bound, which is 100,000 bytes by default. When more than 500 bytes of read-ahead is desired, use this variable instead of APT_DELIMITED_READ_SIZE.
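As an illustration only (the values are examples, not recommendations), a job reading and writing large sequential files on a heavily-loaded file server might use settings along these lines:
   export APT_IMPORT_BUFFER_SIZE=256    # 256K read buffer for Sequential File imports (default 128)
   export APT_EXPORT_BUFFER_SIZE=256    # 256K write buffer for Sequential File exports (default 128)
   export APT_EXPORT_FLUSH_COUNT=1      # flush every row; useful for realtime feeds, at some I/O cost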

 Oracle Environment Variables

$ORACLE_HOME = [path]
    Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by Oracle scripts.
$ORACLE_SID = [sid]
    Specifies the Oracle service name, corresponding to a TNSNAMES entry.
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num] and $APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]
    These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
$APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]
    Specifies Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).
$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1
    When set, a target Oracle stage with the Load method will limit the number of players to the number of datafiles in the table's tablespace.
$APT_ORA_WRITE_FILES = [filepath]
    Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with the Load method is written to files instead of invoking the Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.
$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
    Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.

 Job Monitoring Environment Variables

$APT_MONITOR_TIME = [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.
$APT_MONITOR_SIZE = [rows]
    Determines the minimum number of records the job monitor reports. The default of 5000 records is usually too small. To minimize the number of messages during large job runs, set this to a higher value (e.g. 1000000).
$APT_NO_JOBMON = 1
    Disables job monitoring completely. In rare instances, this may improve performance. In general, this should only be set on a per-job basis when attempting to resolve performance bottlenecks.
$APT_RECORD_COUNTS = 1
    Prints record counts in the job log as each operator completes processing. The count is per operator per partition.


Data Stage Sequential File Stages (Import and Export) Performance Tuning


Improving Sequential File Performance

If the source file is fixed-width or delimited, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.
 If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage. 
On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in Kbytes, with a default of 128 (128K). Increasing this may improve performance.
Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.

Partitioning Sequential File Reads

Care must be taken to choose the appropriate partitioning method for a Sequential File read:
Don’t read from Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering

By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with increased I/O.

 Reading from and Writing to Fixed-Length Files

Particular attention must be taken when processing fixed-length fields using the Sequential File stage:
If the incoming columns are variable-length data types (eg. Integer, Decimal, Varchar), the field width column property must be set to match the fixed-width of the input column. Double-click on the column number in the grid dialog to set this column property.
 
If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.
 
When writing fixed-length files from variable-length fields (eg. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed-width of the output column. Double-click on the column number in the grid dialog to set this column property.
 To display each field value, use the print_field import property. All import and export properties are listed in chapter 25, Import/Export Properties of the Orchestrate 7.0 Operators Reference.

 Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, these extra characters will be silently truncated.
Starting with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS directs DataStage to reject records with strings longer than their declared maximum column length.
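A minimal sketch, assuming DataStage v7.01 or later:
   export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1    # reject over-length string fields instead of silently truncating them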

Data Stage Transformer Usage Guidelines

Choosing Appropriate Stages

The parallel Transformer stage always generates “C” code which is then compiled to a parallel component. For this reason, it is important to minimize the number of transformers, and to use other stages (Copy, Filter, Switch, etc) when derivations are not needed. 
  • The Copy stage should be used instead of a Transformer for simple operations, including:
      • a job design placeholder between stages (unless the Force option is set to true, EE will optimize this out at runtime)
      • renaming columns
      • dropping columns
      • default type conversions

Note that rename, drop (if runtime column propagation is disabled), and default type conversion can also be performed by the output mapping tab of any stage.
NEVER use the "BASIC Transformer" stage in large-volume job flows. Instead, user-defined functions and routines can expand parallel Transformer capabilities.
Consider, if possible, implementing complex derivation expressions that follow regular patterns with a Lookup table instead of using a Transformer with nested derivations.
For example, the derivation expression:

If A=0,1,2,3 Then B="X"; If A=4,5,6,7 Then B="C"
Could be implemented with a lookup table containing values for column A and corresponding values of column B.
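For example, the lookup (reference) table could simply hold each value of A alongside the corresponding value of B:
   A    B
   0    X
   1    X
   2    X
   3    X
   4    C
   5    C
   6    C
   7    C
A Lookup stage keyed on column A would then return B directly, with no nested derivation logic in a Transformer.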
Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.
In v7 and later, the Filter and/or Switch stages can be used to separate rows into multiple output links based on SQL-like link constraint expressions. 
In v7 and later, the Modify stage can be used for non-default type conversions, null handling, and character string trimming.
 Buildops should be used instead of Transformers in the handful of scenarios where complex reusable logic is required, or where existing Transformer-based job flows do not meet performance requirements.

Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose “Convert to Reject”.
The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:
If ISNULL(link.col) Then… Else…
Note that if an incoming column is only used in a pass-through derivation, the Transformer will allow this row to be output. DataStage release 7 enhances this behavior by placing warnings in the log file when discards occur.

Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer.

For example, TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the incoming column must be type VarChar before it is evaluated in the Transformer.

Conditionally Aborting Jobs

The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows that match the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set the "Abort After Rows" count to the number of rows allowed before the job should be aborted.
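As a sketch (the link name, column name, and threshold below are hypothetical), the abort link might be configured like this:
   Output link:       abort_link, which receives the rows that match the abort rule
   Constraint:        in_link.amount < 0
   Abort After Rows:  50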
Since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables. It is important to set the Sequential File buffer flush or database commit parameters accordingly.