Input:Below is the sample data of three students, their marks in two subjects, the corresponding grades and the dates on which they were graded.
Output:Our requirement is to sum the marks obtained by each student in a subject and display it in the output
Step 1: Once we have read the data from the source we have to sort data on our key field. In our example the key field is the student name
Once the data is sorted we have to implement the looping function in transformer to calculate the aggregate value
Before we get into the details, we need to know a couple of functions
Back to the task at hand, we need 7 stage variables to perform the aggregation operation successfully.
1. LoopNumber: Holds the value of number of records stored in cache for a student
2. LoopBreak: This is to identify the last record for a particular student
3. SumSub1: This variable will hold the final sum of marks for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum of marks until the final record is evaluated for a student (subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the value for the number of times the loop has to run
Below is the screenshot of the stage variables
We also need to define the Loop Variables so that the loop will execute for a student until his final record is identified
To explain the above use of variables -
When the first record comes to stage variables, it is saved in the cache using the function SaveInputRecord() in first stage variableLoopNumber
The second stage variable checks if it is the last record for this particular student, if it is it stores 1 else 0
The third SumSub1 is executed only if the record is the last record
The fourth IntermediateSumSum1 is executed when the input record is not the last record, thereby storing the intermediate sum of the subject for a student
Fifth and sixth are the same as 3 and 4 stage variables
Seven will have the first value as 1 and for the second record also if the same student is fetched it will change to 2 and so on
The loop variable will be executed until the final record for a student is identified and the GetSavedInputRecord() function will make sure the current record is processed before the next record is brought for processing.
What the above logic does is for each and every record it will send the sum of marks scored by each student to the output. But our requirement is to have only one record per student in the output.
So we simply add a remove duplicates stage and add the student name as a primary key
Run the job and the output will be according to our initial expectation
We have successfully implemented AGGREGATION using TRANSFORMER Stage