Use the remove duplicates stage and use link partitioning and sorting to make sure the correct row is retained Use the aggregator and select minimum values I was curious if there was any performance difference between the two options so i created this test job.
The job gerenates 1,000,000 rows and multiplies that by four in the transformer stage with a while loop construction. This creates four duplicate rows with minor differences. These rows are then dedupliced using the remove duplicates and aggegator stage. The result is written to /dev/null to minimize the impact of I/O.
The streams are not 100% equal. The aggregator stage takes the minimum value of both the non-key fields and the remove duplicates stage selects the row with the lowest sequence number. In this case the outcome is the same but it may be different in your job.
It also adds the loop variable to the output and a timestamp with the loop variable added.
The trick of getting the first row in from the source is to sort the input row with partitioning. The data is sorted and partitioned on business key and sorted on the seq. Note that the seq is only sorted and not used for partitioning.
I've ran this job multiple times and the outcome differs each time i ran it. Both streams are equally fast. Sometimes the remove duplicates stage is faster and sometimes the aggregator is faster. You can choose the one that suits your needs.