Pages

Thursday, June 6, 2013

How to perform aggregation using a Transformer

 Input:Below is the sample data of three students, their marks in two subjects, the corresponding grades and the dates on which they were graded.

Output:Our requirement is to sum the marks obtained by each student in a subject and display it in the output
Step 1: Once we have read the data from the source we have to sort data on our key field. In our example the key field is the student name
Once the data is sorted we have to implement the looping function in transformer to calculate the aggregate value
Before we get into the details, we need to know a couple of functions
    • SaveInputRecord(): This function saves the entire record in cache and returns the number of records that are currently stored in cache
    • LastRowInGroup(input-column): When a input key column is passed to this function it will return 1 when the last row for that column is found and in all other cases it will return 0
      To give an example, lets say our input is
      StudentCode
      ABC1
      ABC2
      ABC3
      DEF2
      For the first two records the function will return 0 but for the last record ABC,3 it will return 1 indicating that it is the last record for the group where student name is “ABC”
    • GetSavedInputRecord(): This function returns the record that was stored in cache by the function SaveInputRecord()

Back to the task at hand, we need 7 stage variables to perform the aggregation operation successfully.
1. LoopNumber: Holds the value of number of records stored in cache for a student
2. LoopBreak: This is to identify the last record for a particular student
3. SumSub1: This variable will hold the final sum of marks for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum of marks until the final record is evaluated for a student (subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the value for the number of times the loop has to run
Below is the screenshot of the stage variables
We also need to define the Loop Variables so that the loop will execute for a student until his final record is identified
To explain the above use of variables -
When the first record comes to stage variables, it is saved in the cache using the function SaveInputRecord() in first stage variableLoopNumber
The second stage variable checks if it is the last record for this particular student, if it is it stores 1 else 0
The third SumSub1 is executed only if the record is the last record
The fourth IntermediateSumSum1 is executed when the input record is not the last record, thereby storing the intermediate sum of the subject for a student
Fifth and sixth are the same as 3 and 4 stage variables
Seven will have the first value as 1 and for the second record also if the same student is fetched it will change to 2 and so on
The loop variable will be executed until the final record for a student is identified and the GetSavedInputRecord() function will make sure the current record is processed before the next record is brought for processing.
What the above logic does is for each and every record it will send the sum of marks scored by each student to the output. But our requirement is to have only one record per student in the output.
So we simply add a remove duplicates stage and add the student name as a primary key
Run the job and the output will be according to our initial expectation
We have successfully implemented AGGREGATION using TRANSFORMER Stage

4 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Hope You everyone like this post and good information keep posting.Datastage is an ETL tool and its a part of IBM Information and solutions .In this we are having so many edtions,want to learn Datastage Online Training

    ReplyDelete
  3. your blog is very helpful and best than any other sites on datastage.Giving more detailed information will help me in resolving issues in my project.One thing have to notice you some screen
    shots are not visible plz check it

    ReplyDelete