Merge Stage:
The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).
The Merge stage combines a master dataset with one or more update datasets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.
A master record and an update record are merged only if both have the same key column values.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
Example:
Master dataset:
CUSTOMER_ID   CUSTOMER_NAME
1             UMA
2             POOJITHA

Update dataset1:
CUSTOMER_ID   CITY      ZIP_CODE   SEX
1             CYPRESS   90630      M
2             CYPRESS   90630      F

Output:
CUSTOMER_ID   CUSTOMER_NAME   CITY      ZIP_CODE   SEX
1             UMA             CYPRESS   90630      M
2             POOJITHA        CYPRESS   90630      F
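If you want to see the same logic outside DataStage, here is a small Python sketch that reproduces the example above. This is only an illustration of what the stage does with these rows; the dataset names and the way the rows are represented are assumptions for the sketch, not DataStage code.

# Minimal sketch of the Merge stage semantics on the example data above
# (plain Python, illustration only).
master = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"},
]
update1 = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"},
]

# Index the update rows by the key column, keeping the first row per key.
update_by_key = {}
for row in update1:
    update_by_key.setdefault(row["CUSTOMER_ID"], row)

# For each master row, pull in the extra columns from the matching update row.
output = []
for m in master:
    u = update_by_key.get(m["CUSTOMER_ID"], {})
    merged = dict(m)
    merged.update({k: v for k, v in u.items() if k != "CUSTOMER_ID"})
    output.append(merged)

for row in output:
    print(row)
# {'CUSTOMER_ID': 1, 'CUSTOMER_NAME': 'UMA', 'CITY': 'CYPRESS', 'ZIP_CODE': '90630', 'SEX': 'M'}
# {'CUSTOMER_ID': 2, 'CUSTOMER_NAME': 'POOJITHA', 'CITY': 'CYPRESS', 'ZIP_CODE': '90630', 'SEX': 'F'}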
Merge stage configuration steps:
Options:
Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates: Set to True to generate a warning when bad records from any update links are rejected.
Warn On Unmatched Masters: Set to True to generate a warning when there are unmatched rows from the master link.
Partitioning: Hash partition both the master input and the update input on the key column (CUSTOMER_ID), as illustrated by the sketch below:
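To see why hashing on the key matters, here is a rough Python sketch of key-based hash partitioning. It only illustrates the idea that equal keys always land in the same partition, so a master row and its update rows are processed by the same node; the two-partition setup is an assumption for the illustration, not how DataStage is implemented internally.

# Rough illustration of key-based hash partitioning (not DataStage internals).
num_partitions = 2  # assumed two-node configuration for the sketch

def partition_for(customer_id):
    # Rows with the same key always hash to the same partition,
    # so matching master and update rows end up on the same node.
    return hash(customer_id) % num_partitions

for customer_id in (1, 2, 1):
    print("CUSTOMER_ID", customer_id, "-> partition", partition_for(customer_id))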
Compile and run the job:
Scenario 2:
Remove a record from update dataset1 (updateds1) and check the output.
Check the DataStage warnings in the job log, since we selected Warn On Unmatched Masters = True:
stg_merge,0: Master record (0) has no updates.
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
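The behaviour behind those warnings can be sketched in Python as follows. This is only a rough simulation: the data chosen here (the update row for CUSTOMER_ID = 1 removed) and the printed messages merely mimic the idea of the log, they are not the real DataStage log format or internals.

# Rough simulation of scenario 2 (illustration only).
def merge(master, update, keep_unmatched_masters=True):
    update_by_key = {}
    for row in update:
        update_by_key.setdefault(row["CUSTOMER_ID"], row)  # first row per key wins

    output, rejects, matched_keys = [], [], set()
    for m in master:
        u = update_by_key.get(m["CUSTOMER_ID"])
        if u is None:
            print("warning: master record", m["CUSTOMER_ID"], "has no updates")
            if keep_unmatched_masters:            # Unmatched Masters Mode = Keep
                output.append(dict(m))
            continue
        matched_keys.add(m["CUSTOMER_ID"])
        merged = dict(m)
        merged.update({k: v for k, v in u.items() if k != "CUSTOMER_ID"})
        output.append(merged)

    # Update rows that never found a master go to the reject link for that update.
    for key, row in update_by_key.items():
        if key not in matched_keys:
            print("warning: update record", key, "is dropped; no masters are left")
            rejects.append(row)
    return output, rejects

master = [{"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
          {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"}]
updateds1 = [{"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"}]

out, rejected = merge(master, updateds1)
print(out)       # row 1 passes through without the update columns; row 2 is merged
print(rejected)  # empty here; unmatched update rows would land on the reject link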
Scenario 3: Drop the unmatched master records and capture the reject records from updateds1.
Scenario 4: Insert a duplicate record with the same customer id into the master dataset and check the results.
Looking at the output, it is clear that the Merge stage automatically dropped the duplicate record from the master dataset; a small sketch of that behaviour follows.
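The observation (only the first master row per key survives) can be mimicked with a few lines of Python. This is only an illustration of the behaviour seen in the job output, not a claim about DataStage internals.

# Duplicate master rows: keep the first row per key, drop the rest (illustration only).
master = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},       # duplicate key inserted in scenario 4
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"},
]

seen, deduped = set(), []
for row in master:
    if row["CUSTOMER_ID"] in seen:
        continue                                       # later duplicate is dropped
    seen.add(row["CUSTOMER_ID"])
    deduped.append(row)

print(deduped)   # only one row per CUSTOMER_ID remains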
Scenario 5: Added a new update dataset (updateds2), which contains the following data.
Update dataset2:
CUSTOMER_ID   CITIZENSHIP
1             INDIAN
2             AMERICAN
We still have a duplicate row in the master dataset. If you compile the job with the above design, you will get a compilation error like the one below.
If you look at the above figure, you can see 2 rows in the output because we have a matching row for customer_id = 2 in updateds2.
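With a second update link, the stage simply repeats the same key lookup for each update dataset and adds that dataset's extra columns to the output record. A hedged Python sketch of that idea, using the updateds2 contents from the table above (illustration only, not DataStage code):

# Sketch: merging one master with two update datasets.
master = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"},
]
updateds1 = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"},
]
updateds2 = [
    {"CUSTOMER_ID": 1, "CITIZENSHIP": "INDIAN"},
    {"CUSTOMER_ID": 2, "CITIZENSHIP": "AMERICAN"},
]

output = []
for m in master:
    merged = dict(m)
    for update in (updateds1, updateds2):        # one key lookup per update link
        for u in update:
            if u["CUSTOMER_ID"] == m["CUSTOMER_ID"]:
                merged.update({k: v for k, v in u.items() if k != "CUSTOMER_ID"})
                break                            # first matching row per key wins
    output.append(merged)

for row in output:
    print(row)   # each output row carries columns from both update datasets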
Scenario 6: Add a duplicate row for customer_id = 1 in the updateds1 dataset.
Now we have duplicate records in both the master dataset and updateds1. Run the job and check the results and the warnings in the job log.
No change in the results; the Merge stage automatically dropped the duplicate row.
Scenario 7: Modify the duplicate row for customer_id = 1 in the updateds1 dataset so that its zip code differs from the original row (90630 vs. 90620).
Run the job and check the output results.
I ran the same job multiple times and found that the Merge stage takes the first record coming in from updateds1 and drops the subsequent records with the same customer id.
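That "first record wins" behaviour for duplicate update rows is exactly what the dictionary-style lookup in the earlier sketches models: only the first row per key is kept. The concrete zip code values below are just for illustration of the observation above.

# Duplicate update rows for the same key: the first one encountered is used
# and the later ones are dropped (mirrors the observation above; illustration only).
updateds1 = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90620", "SEX": "M"},  # duplicate key
]

update_by_key = {}
for row in updateds1:
    update_by_key.setdefault(row["CUSTOMER_ID"], row)   # keeps the first row per key

print(update_by_key[1]["ZIP_CODE"])   # 90630 -- the first record's value is used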
This post covered most of the merge scenarios.
