Monday, June 24, 2013

How to replace ^M character in VI editor/sed?

^M is DOS line break character which shows up in UNIX files when uploaded from a windows file system in ascii format.
To remove this, open your file in vi editor and type
:%s/(ctrl-v)(ctrl-m)//g
and press Enter key.
Important!! – press (Ctrl-v) (Ctrl-m) combination to enter ^M character, dont use “^” and M.
If anything goes wrong exit with q!.

Also,
Your substitution command may catch more ^M then necessary. Your file may contain valid ^M in the middle of a line of code for example. Use the following command instead to remove only those at the very end of lines:
:%s/(ctrl-v)(ctrl-m)*$//g


Using sed:

sed -e "s/^M//g" old_file_name > new_file_name

Thursday, June 6, 2013

How to convert a single row into multiple rows ?

Below is a screenshot of our input data

CityStateName1Name2Name3
xyFGHSamDeanWinchester
We are going to read the above data from a sequential file and transform it to look like this
CityStateName
xyFGHSam
xyFGHDean
xyFGHWinchester
So lets get to the job design
Step 1: Read the input data
Step 2: Logic for Looping in Transformer
In the adjacent image you can see a new box called Loop Condition. This where we are going to control the loop variables.
Below is the screenshot when we expand the Loop Condition box
The Loop While constraint is used to implement a functionality similar to “WHILE” statement in programming. So, similar to a while statement need to have a condition to identify how many times the loop is supposed to be executed.
To achieve this @ITERATION system variable was introduced. In our example we need to loop the data 3 times to get the column data onto subsequent rows.
So lets have @ITERATION <=3
Now create a new Loop variable with the name LoopName 
The derivation for this loop variable should be
If @ITERATION=1 Then DSLink2.Name1 Else If @ITERATION=2 Then DSLink2.Name2 Else DSLink2.Name3
Below is a screenshot illustrating the same

Now all we have to do is map this Loop variable LoopName to our output column Name

Lets map the output to a sequential file stage and see if the output is a desired.

After running the job, we did a view data on the output stage and here is the data as desired.

Making some tweaks to the above design we can implement things like
  • Adding new rows to existing rows
  • Splitting data in a single column to multiple rows and many more such stuff..

How to perform aggregation using a Transformer

 Input:Below is the sample data of three students, their marks in two subjects, the corresponding grades and the dates on which they were graded.

Output:Our requirement is to sum the marks obtained by each student in a subject and display it in the output
Step 1: Once we have read the data from the source we have to sort data on our key field. In our example the key field is the student name
Once the data is sorted we have to implement the looping function in transformer to calculate the aggregate value
Before we get into the details, we need to know a couple of functions
    • SaveInputRecord(): This function saves the entire record in cache and returns the number of records that are currently stored in cache
    • LastRowInGroup(input-column): When a input key column is passed to this function it will return 1 when the last row for that column is found and in all other cases it will return 0
      To give an example, lets say our input is
      StudentCode
      ABC1
      ABC2
      ABC3
      DEF2
      For the first two records the function will return 0 but for the last record ABC,3 it will return 1 indicating that it is the last record for the group where student name is “ABC”
    • GetSavedInputRecord(): This function returns the record that was stored in cache by the function SaveInputRecord()

Back to the task at hand, we need 7 stage variables to perform the aggregation operation successfully.
1. LoopNumber: Holds the value of number of records stored in cache for a student
2. LoopBreak: This is to identify the last record for a particular student
3. SumSub1: This variable will hold the final sum of marks for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum of marks until the final record is evaluated for a student (subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the value for the number of times the loop has to run
Below is the screenshot of the stage variables
We also need to define the Loop Variables so that the loop will execute for a student until his final record is identified
To explain the above use of variables -
When the first record comes to stage variables, it is saved in the cache using the function SaveInputRecord() in first stage variableLoopNumber
The second stage variable checks if it is the last record for this particular student, if it is it stores 1 else 0
The third SumSub1 is executed only if the record is the last record
The fourth IntermediateSumSum1 is executed when the input record is not the last record, thereby storing the intermediate sum of the subject for a student
Fifth and sixth are the same as 3 and 4 stage variables
Seven will have the first value as 1 and for the second record also if the same student is fetched it will change to 2 and so on
The loop variable will be executed until the final record for a student is identified and the GetSavedInputRecord() function will make sure the current record is processed before the next record is brought for processing.
What the above logic does is for each and every record it will send the sum of marks scored by each student to the output. But our requirement is to have only one record per student in the output.
So we simply add a remove duplicates stage and add the student name as a primary key
Run the job and the output will be according to our initial expectation
We have successfully implemented AGGREGATION using TRANSFORMER Stage