Tuesday, September 2, 2014

Sequential File Best Performance Settings/Tips


The Sequential File stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file).

When handling huge volumes of data, the Sequential File stage can itself become one of the major bottlenecks, because reading from and writing to it is slow. Certainly do not use sequential files for intermediate storage between jobs: it causes performance overhead, as the data must be converted before every write to and read from a file. Instead, Dataset stages should be used for intermediate storage between jobs.
Datasets are key to good performance in a set of linked jobs. They help achieve end-to-end parallelism by writing data in partitioned form and maintaining the sort order, so no repartitioning or import/export conversions are needed.
To read faster from the Sequential File stage, increase the number of readers per node (the default value is one). This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).

For fixed-width files, however, you can configure the stage to behave differently. You can specify that single files can be read by multiple nodes; this can improve performance on cluster systems. See "Read from multiple nodes" below.

You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node). See "Number of readers per node" below.

(These two options are mutually exclusive.)

The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each node writes to a single file, but a node can write more than one file.

Number of readers per node

This is an optional property that applies only to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. It specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders.



The resulting data set contains one partition per instance of the file read operator, as determined by numReaders. This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node and written to separate partitions. This method can result in better I/O performance on an SMP (Symmetric Multi Processing) system.
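The arithmetic behind those seek locations is simple enough to sketch. The following Python fragment is purely illustrative (the function name and the even-split rule are assumptions, not DataStage internals); it shows how a contiguous record range for each reader instance can be derived from the file size, the fixed record length, and numReaders:

    # Illustrative sketch only, not DataStage code: derive the contiguous record
    # range each reader instance would handle for a fixed-length-record file.
    import os

    def reader_ranges(path: str, record_length: int, num_readers: int):
        """Return (start_record, record_count) for each reader instance."""
        total_records = os.path.getsize(path) // record_length
        base, extra = divmod(total_records, num_readers)
        ranges, start = [], 0
        for i in range(num_readers):
            count = base + (1 if i < extra else 0)   # spread any remainder records
            ranges.append((start, count))            # seek location = start * record_length
            start += count
        return ranges

    # Example: a 1,000,000-byte file of 100-byte records read with 4 readers
    # yields 10,000 records split into 4 contiguous ranges of 2,500 records,
    # i.e. one output partition per reader instance.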





Read from multiple nodes:



This is an optional property that applies only to files containing fixed-length records; it is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system.
InfoSphere DataStage knows the number of nodes available and, using the fixed-length record size and the actual size of the file to be read, allocates to the reader on each node a separate region within the file to process. The regions will be of roughly equal size.
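As a rough illustration of that region allocation (an assumption about the arithmetic, not the actual DataStage implementation), the Python sketch below carves a fixed-length-record file into roughly equal byte regions, one per node, with every boundary falling on a record boundary:

    # Illustrative sketch only, not DataStage code: split a fixed-length-record
    # file into one record-aligned byte region per node ("Read from multiple nodes").
    import os

    def node_regions(path: str, record_length: int, num_nodes: int):
        """Return (start_byte, end_byte) for each node, aligned to record boundaries."""
        total_records = os.path.getsize(path) // record_length
        regions = []
        for node in range(num_nodes):
            first = (total_records * node) // num_nodes        # first record for this node
            last = (total_records * (node + 1)) // num_nodes   # one past its last record
            regions.append((first * record_length, last * record_length))
        return regions

    # Example: 10,000 records of 100 bytes across 4 nodes gives four 250,000-byte
    # regions; each node seeks to its own start offset and reads only its region,
    # so no record is read twice.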


The options “Read from multiple nodes” and “Number of readers per node” are mutually exclusive.

Other Sequential File Properties:

First Line is Column Names

Specifies that the first line of the file contains column names. This property is false by default.

Missing file mode

Specifies the action to take if one of your File properties names a file that does not exist. Choose Error to stop the job, OK to skip the file, or Depends, which behaves like Error unless the file name has a node name prefix of *:, in which case the file is skipped. The default is Depends.
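The Depends rule is easier to see as a small decision function. The Python sketch below is only a paraphrase of the rule described above (the function name and return values are illustrative, not part of DataStage):

    # Illustrative paraphrase of the "Missing file mode" rule, not DataStage code.
    def action_when_file_missing(filename: str, missing_file_mode: str = "Depends") -> str:
        """Decide what to do when a named input file does not exist."""
        if missing_file_mode == "OK":
            return "skip the file"
        if missing_file_mode == "Depends" and filename.startswith("*:"):
            return "skip the file"      # *: node name prefix means treat as OK
        return "stop the job"           # Error mode, or Depends without the prefix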

Keep file partitions

Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject mode

Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report progress

Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain the file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file. For best performance, set this property to No.

Filters

This is an optional property. You can use this to specify that the data is passed through a filter program after being read from the files. Specify the filter command, and any required arguments, in the Property Value box.

File name column

This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the path name of the file from which each record was read. You should also add this column manually to the column definitions to ensure that it is not dropped if you are not using runtime column propagation, or if it is turned off at some point.

Read first rows

Specify a number n so that the stage only reads the first n rows from the file.

Row number column

This is an optional property. It adds an extra column of type unsigned BigInt to the output of the stage, containing the row number. You must also add the column to the columns tab, unless runtime column propagation is enabled.

File

This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add window. Do this for each extra file you want to specify.

File pattern

Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing that file. The file can also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.

Read method:

This property specifies whether you are reading from a specific file or files or using a file pattern to select files (for example, *.txt).
