The Sequential File stage executes in parallel mode when reading multiple files, but executes sequentially when reading only one file. By default a complete file is read by a single node (although each node might read more than one file).
When handling large volumes of data, the Sequential File stage can itself become a major bottleneck, because reading from and writing to this stage is slow. Do not use sequential files for intermediate storage between jobs: this causes performance overhead, as the stage must convert data before writing to and after reading from a file. Use Data Set stages instead for intermediate storage between jobs.
Data Sets are key to good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order; no repartitioning or import/export conversions are needed.
To read faster from the Sequential File stage, the number of readers per node can be increased (the default is one). This means, for example, that a single file can be partitioned as it is read, even though the stage is constrained to running sequentially on the conductor node.
For fixed-width files, the stage can be configured in one of two ways. You can specify that single files are read by multiple nodes, which can improve performance on cluster systems (see "Read from multiple nodes" below). Alternatively, you can specify that a number of readers run on a single node, so that a single file is partitioned as it is read (see "Number of readers per node" below).
(These two options are mutually exclusive.)
The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each node writes to a single file, but a node can write more than one file.
Number of readers per node
This is an optional property that applies only to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. It specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders.
The resulting data set contains one partition per instance of the file read operator, as determined by numReaders; this provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node and written to separate partitions. This method can result in better I/O performance on an SMP (Symmetric Multi-Processing) system.
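The seek-location arithmetic can be illustrated with a short sketch. The following Python fragment is illustrative only (it is not DataStage code, and the function name reader_offsets is invented); it shows how contiguous, record-aligned ranges could be derived from the file size, the fixed record length, and numReaders:
```python
def reader_offsets(file_size_bytes, record_length_bytes, num_readers):
    """Split a fixed-length-record file into contiguous, record-aligned
    ranges, one per instance of the file read operator (illustrative)."""
    total_records = file_size_bytes // record_length_bytes
    base, extra = divmod(total_records, num_readers)
    ranges, start_record = [], 0
    for reader in range(num_readers):
        count = base + (1 if reader < extra else 0)  # spread any remainder
        ranges.append({
            "reader": reader,
            "seek_byte": start_record * record_length_bytes,  # seek location
            "records": count,
        })
        start_record += count
    return ranges

# Example: a 1,000,000-byte file of 100-byte records with numReaders = 4
for r in reader_offsets(1_000_000, 100, 4):
    print(r)  # each reader gets a contiguous 2,500-record range
```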
Read from multiple nodes
This is an optional property that applies only to files containing fixed-length records; it is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system.
InfoSphere DataStage knows the number of nodes available and, using the fixed-length record size and the actual size of the file to be read, allocates to the reader on each node a separate region within the file to process. The regions will be of roughly equal size.
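A similar record-aligned split applies here, except the regions are allocated across the available nodes rather than across reader instances on one node. The following Python sketch is again purely illustrative (node_regions is an invented name, not a DataStage internal); it also shows why the property is restricted to fixed-length records, since the region boundaries can only be computed when the file size is a whole number of records:
```python
def node_regions(file_size_bytes, record_length_bytes, num_nodes):
    """Illustrative split of one fixed-length-record file into roughly
    equal, record-aligned regions, one per available node."""
    if file_size_bytes % record_length_bytes != 0:
        raise ValueError("file size is not a whole number of fixed-length records")
    total_records = file_size_bytes // record_length_bytes
    base, extra = divmod(total_records, num_nodes)
    regions, start_record = [], 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        regions.append((node, start_record * record_length_bytes,
                        count * record_length_bytes))
        start_record += count
    return regions

# Example: a 4 GB file of 200-byte records read across a 4-node cluster
for node, offset, length in node_regions(4_000_000_000, 200, 4):
    print(f"node {node}: reads {length:,} bytes starting at byte {offset:,}")
```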
Other Sequential File stage properties:
First Line is Column Names
Specifies that the first line of the file contains column names. This property is false by default.
Missing file mode
Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.
Keep file partitions
Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.
Reject mode
Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file. For the best performance, set this property to No.
Filters
This is an optional property. You can use this to specify that the data is passed through a filter program after being read from the files. Specify the filter command, and any required arguments, in the Property Value box.
File name column
This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the pathname of the file the record is read from. You should also add this column manually to the Columns definitions to ensure that the column is not dropped if you are not using runtime column propagation, or if it is turned off at some point.
Read first rows
Specify a number n so that the stage only reads the first n rows from the file.
Row number column
This is an optional property. It adds an extra column of type unsigned BigInt to the output of the stage, containing the row number. You must also add the column to the Columns tab, unless runtime column propagation is enabled.
File
This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add window. Do this for each extra file you want to specify.
File pattern
Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing that file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.
Read method:
This property specifies whether you are reading from a specific file or files, or using a file pattern to select files (for example, *.txt).