Performance difference between Lookup and Join

You can read a lot of articles on internet that mention that there is a difference in performance between a Lookup and a Join stage. That's why i wanted to determine what stage we should use in our case.

I created two functional identical flows. The first one uses a lookup stage and the second one uses a Join stage to 'lookup' data in a table. The lookup source is 'narrow' with only two fields (a key field and a NVarchar field). The main stream contains a key field and 10 NVarchar fields.

The lookup data contains only even numbers to make sure we test the reject functions of both streams. 50% of the main data won't find a value in the lookup stream. All these records are written to the reject stream.

All data is written to /dev/null to limit the influence of I/O.

This is what i've learnt from this job:

The Join stage is much (in this case 1.5 times) faster if the main stream contains a lot of records There's no real difference in performance if the main stream contains little (less than 10.000) records There are some technical/functional differences in usage of the Lookup and Join stage:

The names of the key fields in the Lookup can be different in the two links.
You can use a conditional lookup in the Lookup stage.
You can use range lookups (two ways) in the Lookup stage.

So if you suspect a lot of records in the main stream and you can live with the technical/functional differences between the two stages, use the joiner stage for lookups.

Below are some of the key elements from this job documented.

Select data to work with

Both input streams should have representative data to make sure the results can be used to predict the performance of 'real world' jobs.

I wanted to have the lookup stream half the records of the main stream but wanted to mage sure the 'gaps' are evenly distributed. Thats why i've set a condition on the transformer that limits the output to rows with an even number (Mod(NUMBER,2) = 0).

The lookup stream only gets one NVarchar field with the length of 15.

The main stream gets 10 NVarchar fields with a length of 50.

Lookup stage

The number of the main stream is connected to the number of the lookup stream. The condition properties are set to reject if a row is not found in the lookup stream.

Join properties

The key is set to number. This only works if the key fields have the same name in the main stream and the lookup stream. The join has a left join type.

I'm using a left outer join so i had to make sure that the right links are set to left and right.

Make sure that the output columns from the lookup link are nullable because the lookup can fail. In that case we want the result to be null.

Reject if lookup field is null

Reject if lookup field is null In the transformer i check if the lookup result is not null. The reject link is set to 'otherwise'. Make sure that the processing order of the links is correct or all rows will go to the reject link.