Lookup Stage:
The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or MERGE stage should be used.
Lookup stages do not require data on the input link or reference links to be sorted. Be aware, though, that large in-memory lookup tables will degrade performance because of their paging requirements. Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns. The lookup key columns do not have to have the same names in the primary and the reference links.
The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.
You can also perform a range lookup, which compares the value of a source column to a range of values between two lookup table columns. If the source column value falls within the required range, a row is passed to the output link. Alternatively, you can compare the value of a lookup column to a range of values between two source columns. Range lookups must be based on column values, not constant values. Multiple ranges are supported.
There are some special partitioning considerations for Lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method.
Lookup stage Configuration:Equal lookup
You can specify what action need to perform if lookup fails.
Scenario1: Continue
Choose entire partition on the reference link
Scenario2:Fail
Job aborted with the following error:
stg_Lkp,0: Failed a key lookup for record 2 Key Values: CUSTOMER_ID: 3
Scenari03:Drop
Scenario4:Reject
If we select reject as lookup failure condition then we need to add reject link otherwise we get compilation error.
Range Lookup:
Business scenario:we have input data with customer id and customer name and transaction date.We have customer dimension table with customer address information.Customer can have multiple records with different start and active dates and we want to select the record where incoming transcation date falls between start and end date of the customer from dim table.
Ex Input Data:
CUSTOMER_ID | CUSTOMER_NAME | TRANSACTION_DT |
1 | UMA | 2011-03-01 |
1 | UMA | 2010-05-01 |
Ex Di Data:
CUSTOMER_ID | CITY | ZIP_CODE | START_DT | END_DT |
1 | BUENA PARK | 90620 | 2010-01-01 | 2010-12-31 |
1 | CYPRESS | 90630 | 2011-01-01 | 2011-04-30 |
Expected Output:
CUSTOMER_ID | CUSTOMER_NAME | TRANSACTION_DT | CITY | ZIP_CODE |
1 | UMA | 2011-03-01 | CYPRESS | 90630 |
1 | UMA | 2010-05-01 | BUENA PARK | 90620 |
Configure the lookup stage as shown below.Double click on Lnk_input.TRANSACTION_DATE column.(specifying condition on the input link)
You need to specify return multiple rows from the reference link otherwise you will get following warning in the job log.Even though we have two distinct rows base on customer_id,start_dt and end_dt columns but datastage is considering duplicate rows based on customer_id key only.
stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table
Compile and Run the job:
Scenario 2:Specify range on reference link:
This concludes lookup stage configuration for different scenarios.
Difference between normal and sparse lookup?
Normal look-up:all the reference table data is stored in the buffer for cross- check with the primary table data.
Sparse lookup:each record of the primary table is cross checked with the reference table datethe types of look-ups will araise only if the reference table is in database.so depending on the size of the reference table we will set the type of lookup to implement.
RANGE LOOKUP WITH EXAMPLE IN DATASTAGE
Range Look Up is used to check the range of the records from another table records.
For example If we have the employees list, getting salaries from $1500 to $ 3000.
If we like to check the range of the employees with respect to salaries.
We can do it by using Range Lookup.
For Example if we have the following sample data.
xyzcomp ( Table Name )
e_id,e_name,e_sal
100,james,2000
200,sammy,1600
300,williams,1900
400,robin,1700
500,ponting,2200
600,flower,1800
700,mary,2100
Entire partition has all data across the nodes So while matching(in lookup) the records all data should be present across all nodes.
For lookup sorting is not required.so when we are not using entire partition then reference data splits into all nodes. Then each primary record need check with all nodes for matched reference record.Then we face performance issue.If we use entire in lookup then one primary record needs to look into 1 node is enough.if match found then that record goes to target otherwise it move to reject,drop etc(based on requirement)no need check in another node.In this case if we are running job in 4 nodes then at a time 4 records should process.
Note:Please remember we go for lookup only we have small reference data.If we go for big data it is performance issue(I/O work will increase here) and also some times job will abort.
For example If we have the employees list, getting salaries from $1500 to $ 3000.
If we like to check the range of the employees with respect to salaries.
We can do it by using Range Lookup.
For Example if we have the following sample data.
xyzcomp ( Table Name )
e_id,e_name,e_sal
100,james,2000
200,sammy,1600
300,williams,1900
400,robin,1700
500,ponting,2200
600,flower,1800
700,mary,2100
lsal is nothing but low salary
hsal is nothing but High salary
Now Read and load the data in Sequential files
And Open Lookup file--- Select e_sal in the first table data
And Open Key expression and
Here Select e_sal >=lsal And
e_sal <=hsal
Click Ok
Than Drag and Drop the Required columns into the output and click Ok
Give File name to the Target File.
Then Compile and Run the Job . That's it you will get the required Output.
Why Entire partition is used in LOOKUP stage ?
Entire partition has all data across the nodes So while matching(in lookup) the records all data should be present across all nodes.
For lookup sorting is not required.so when we are not using entire partition then reference data splits into all nodes. Then each primary record need check with all nodes for matched reference record.Then we face performance issue.If we use entire in lookup then one primary record needs to look into 1 node is enough.if match found then that record goes to target otherwise it move to reject,drop etc(based on requirement)no need check in another node.In this case if we are running job in 4 nodes then at a time 4 records should process.
Note:Please remember we go for lookup only we have small reference data.If we go for big data it is performance issue(I/O work will increase here) and also some times job will abort.
Difference between normal and sparse lookup?
Normal look-up:all the reference table data is stored in the buffer for cross- check with the primary table data.
Sparse lookup:each record of the primary table is cross checked with the reference table datethe types of look-ups will araise only if the reference table is in database.so depending on the size of the reference table we will set the type of lookup to implement.
During lookup, what if we have duplicates in reference table/file?
The lookup stage in Datastage 8 is an enhanced version of what was present in earlier Datastage releases. This article is going to take a deep dive into the new lookup stage and the various options it offers. Even though the lookup stage can’t be used in cases where huge amounts of data are involved(since it requires data to be present in the memory for operations), it still warrants its own place in job designs. This is because the lookup stage offers a bit more than the other conventional lookup stages like join and merge.
Lets look at the example shown below.
Source
Emp ID EmpName Dept
1001 AABB IT
1002 BBCC IT
1003 BBDD BS
Reference
Emp ID Salary Dept Quarter
1001 2000 IT Q1
1001 3000 IT Q2
1001 4000 IT Q3
Now if you use the lookup stage the with Emp iD as the key then the output would be as below
EMp ID Salary Dept EmpName Quarter
1001 2000 IT AABB Q1
But if you have a closer look at the data we can see that the reference table actually has three records for that ID. However your lookup stage actually only retrieved the one record. Now if you need to retrieve all 3 records for that ID then you will have to
- Go to the constraints page of the lookup stage
- Go to tab ‘Multiple rows returned from link’
- Select the refernce link
EMp ID Salary Dept EmpName Quarter
1001 2000 IT AABB Q1
1001 3000 IT AABB Q2
1001 4000 IT AABB Q3
A point to be noted is that only one reference link in the lookup stage can return multiple rows. This can’t be done for more than one reference link and can only be done for in-memory lookups
There are a host of other options also available on the constraints page shown below
In addition to the lookup, the stage also gives us the option of checking if the data satisfies a particular condition like Salary > 2000,etc.. All such additional conditions that you want to check can be done in this area. How the job behaves during a lookup is determined by the ‘Condition Not Met’ or ‘Lookup Failure’ option. The four options available for this tab are Continue, Drop, Reject and Fail. Condition Not Met option will be applicable if you provide a condition check. If you do not provide such a check then the values in the ‘Condition Not Met’ option will not make a difference.
The ‘Continue’ option will allow the job to continue without failing and the retrieved refernce value will be populated as NULL. If the value is specified as ‘Drop’, then the records will be dropped from the data set if the lookup/condition has failed. If the option is specified as ‘Reject’, then all records that failed lookup will go to the reject link. You should remember to provide a reject link to the lookup stage if this option is set. Else your job will fail. If you specify the value as ‘Fail’, then the job will move to the aborted state whenever a lookup fails against the reference dataset.
The lookup stage gives us 3 different lookup options. The first is ‘Equality’ which is the normal look. The data is looked up for an exact match (Case sensitive). The second option is the Casesless match. It does exactly what the name indicates. The third and final option is the ‘Range’. This allows you to define a range lookup on the stream link or a reference link of a Lookup stage. On the stream link, the lookup compares the value of a source column to a range of values between two lookup columns. On the reference link, the lookup compares the value of a lookup column to a range of values between two source columns.
Datastage is an best etl tool and over her
ReplyDeleteLook Up Stage do the better performance than other stages to get the results.
Excellent tutorial on LOOKUP stage.Thank You
ReplyDeleteBest article on the Lookup Stage. Travel from botanical garden to noida sector 52 metro
ReplyDeletebest education platform is sarkari result show
ReplyDelete