I'm new to SSIS, and trying to feel my way around but I've got some questions about performance.
I've used DTS for a number of years, and I read in Donald Farmers blog that SSIS was about 7 times faster than DTS (mileage may vary). Well, I may be doing something very wrong, because I'm getting about 7 times slower!
I have around 40 dimension tables, most of them have less than 100 rows. Only two have 60,000 rows. Then I have a single fact table with 2.7 million rows.
I pull my data from Excel for the dimension info, and Oracle for fact info. In DTS the dimension load takes around 45 seconds total, all 40 dimensions into SQL Server 2000. Then I load the fact table and use a function to do a lookup of the surrogate key for the fact table, although I do this step after the fact table has been loaded and then do an lookup/update for each foreign surrogate key.
I tested only 3 key assignments in DTS on the 2.7mil row table. The initial load of around 8 columns is around 2.5 minutes. Then the updates on the 3 key fields takes around 2 minutes each. So total, this is around 8.5 to 10 minutes.
Now in SSIS, I load all 40 dimension tables (into SQL2005), and the validation step alone takes around 2 minutes. Then the load is incredibly slow- around 15 minutes!
Then the fact table load, I use the lookup transformation for 3 surrogate key lookups. This takes a total of 18 minutes for the 2.7mil row table.
Altogether, it takes 10 minutes for DTS and around 33 minutes for SSIS. I ran each one 3 times, with consistent results. Our server is a Dual 3.6Ghz hyperthreaded Pentium IV Xeon with 3G of RAM.
I MUST be doing something wrong in SSIS. Please tell me I am!!
-Kory

SSIS Performance compared to DTS
bmcneill0
Have you looked at ash's blog about handling lookup misses
http://blogs.msdn.com/ashvinis/archive/2005/08/04/447859.aspx
By range lookup do you mean your flow has a say a value of 34 and the lookup has a min and max value so this will match the record with min of 20 and max of 40.
If so how are you doing this
Are you using the OLE DB Command
Mec343
The data flow will not remove unused components because its RunInOptimizedMode property is set to false.
When I look at the dataflow itself, the RunInOptimizedMode is set to True.
Does this mean anything
PeteL - MSFT
To do this, you need to load your dimension into an array or other collection and then perform your lookup that way. That will be much more performant than performing a SQL call for each row.
IanG
Thanks.
You are correct in your analysis of partial cache. If you use partial cache then, in fact, there is no advance caching. Partial cache performs singleton queries for each row, and values for which a match is found are cached for future use.
A good example of the advantage of partial cache would be for loading sales line items for a supermarket there may be millions of potential product keys. You do not want to cache every sku, but you do know that if an item is found, it is likely to appear in a later basket also, so it makes sense to cache the found value.
I guess there are a few ways to do bucketing - it rather depends on whether your buckets are static or dynamic.
If none of these are suitable, then those RDBMS thingies come in quite handy. We include quite a good one, free with every copy of SSIS. ;-)
hth
Donald
Gordon7502
http://www.sqljunkies.com/WebLog/simons/archive/2005/10/04/17007.aspx
DeanHuff
I agree that there are other approaches, but common sense tells me that the way I am doing it should be faster than it is... other ETL tools do this without any issue, so I would assume MS tools should be able to handle it.
Can I trace the queries being sent to the database to see what is happening How does the trace handle the cache
Kory
benjarras
I would suggest that for certain high volumn processes out of the box tasks/components will not be the best. SSIS is an amazing product partly form the ease by which it can be easily extended.
I've just wrote a multi threaded rss reader in 20 lines of code, works on the power of .net within an extensible architecture. Most people will be able to use the product and components out of the box others will need to write code. Not neccessarily complex code.
You must realise that this is a v1 product and compared with many MS V1 products this is leagues above the rest, largely because the framework is there. If you need to extend it is easy and this is what the time has been spent on, extensibility and performance.
mathieu.szablowski
By range lookup, I mean I have a loan amount in my source table, say $234,232.43. My dimension table has ranges like $0 to $20,000, etc. In the dimension table I have a Min and Max column for the range.
In the lookup, I use the Advance properties to re-write the default query to use two parameters and modify the where clause to say "WHERE [MIN] < AND [MAX] >= "
When I modify the default query, it only allows a partial cache to be used, which I set at 10M.
In order to modify the lookup query, I first need to join two fields, so I join two arbitrary fields (in this example, I join Loan Amount in the source to MIN in the lookup table). This is possibly a reason for the poor performance, but without the initial join, the lookup is invalid.
Any other ideas
soellnas
Can you post your package
udotan
SSIS does not support range lookups of this type in this version, but check out this solution from Runying Mao via Thomas Pagel's blog ...
http://sqljunkies.com/WebLog/tpagel/archive/2005/08/31/16585.aspx
Why is this not part of our lookup component
One of the design philosophies of SSIS is to provide a range of quite atomic "tookit" components that can be built into design patterns - rather than providing more complex components that implement entire patterns internally. Think of it as RISC v CISC. :-)
The lookup component uses hashes to make the lookup operation itself more efficient. But by their nature hashes are not suitable for finding values within a range - only for finding exact matches. For this reason I expect that a range lookup would actually be a different component.
Now you'll be wondering if we're going to do one in next version. :-) I can't say for sure - it's a good scenario that others have requested too. Perhaps - or perhaps we create more atomic components that enable the pattern.
hth
Donald Farmer, Group Program Manager, Microsoft SQL Server Business IntelligenceGregory_N
This would nee some code that turns a result set into an array or something that a rnage lookup can be performed against. The key being doing an in memory lookup on your ssis machine rather than a sql query against a remote machine.
Sorry havent got any code at then moment to answer your problem
jeff_akbm
Thanks Donald. The reason Runying Maos solution doesn't fit my problem is his query is returning more values and he has to filter for just one, but in my situation I match against an "upper" and "lower" boundary, so I can't do an equi-join.
From a dimensional modeling point of view, I imagine this happens in ETL quite a bit, where users want to bucket balances, interest rates, etc. into discrete buckets. The only way I know how to assign a surrogate key to these ranges is to compare the value against a min and max range. Did folks in Project REAL run across this type of key assignment problem
Now this leads me to find the most efficient, performant way of getting SSIS to do this lookup, obviously without using the Lookup Transformation.
Looking at the profiler trace while running my package, it doesn't appear to be caching the lookup dataset, even though I've chosen partial cache. Is this because if the value, say $23,339.98 is surrogate key 4 which is range description $20,000 - $40,000, it would cache the value (23,339.98) and key (4) and if ever it ran across that exact value again, it would retrieve (4) from the cache instead of running another query If that's the case, then the partial cache would do no good because in 2.7million records, the chances the discrete values would match are low.
I suppose another strategy is to change my surrogate keys to be smart, which I believe Ralph Kimball is against. In this case, the surrogate key could be expressed as a formula or expression rather than a lookup. However, this would only be practical for dimensions with consistent increments, say 5% increments, or $10,000 increments. Random increments would make smart key expressions too difficult, and knowledge of the dimension table would have to be inherent in the expression.
Is there another way I can write my lookup query to change the join type, using hints If I abandon the lookup transform altogether for this, what type of trasform should I be using An Execute SQL task that calls a custom stored procedure I can't see how this would be any more efficient than what the Lookup transform is doing.
Kory
nmanville
1) Made sure I was utilizing the cache. For exact-match lookups, I used Full cache, and the lookup tables were very small (< 30 rows). For range lookups, all that is available is partial cache, which I set at 10M, the lookup table for this was < 30 rows as well.
2) I optimized the lookup query for the range match to only include the columns necessary to do the lookup.
I ran two scenarios- one with just exact match lookups for two dimensions. This loaded 2.7 million rows in 4.75 minutes.
The second scenario, I included a range lookup as well as the other two exact lookup matches. This loaded the 2.7 million rows in 2.5 hours!
I'm at a loss on what to do next. I've done this same thing using Data Integrator with 20 key lookups, 8 of which are range lookups, and it loaded 2.7 million rows in under 20 minutes.
I'm currently using the September CTP.
Anyone from Microsoft who can help me on this one
Thanks
Kory
Michael Adamson