SSIS Performance compared to DTS

I'm new to SSIS, and trying to feel my way around but I've got some questions about performance.

I've used DTS for a number of years, and I read in Donald Farmers blog that SSIS was about 7 times faster than DTS (mileage may vary).  Well, I may be doing something very wrong, because I'm getting about 7 times slower!

I have around 40 dimension tables, most of them have less than 100 rows.  Only two have 60,000 rows.  Then I have a single fact table with 2.7 million rows.

I pull my data from Excel for the dimension info, and Oracle for fact info.  In DTS the dimension load takes around 45 seconds total, all 40 dimensions into SQL Server 2000.  Then I load the fact table and use a function to do a lookup of the surrogate key for the fact table, although I do this step after the fact table has been loaded and then do an lookup/update for each foreign surrogate key.

I tested only 3 key assignments in DTS on the 2.7mil row table.  The initial load of around 8 columns is around 2.5 minutes.  Then the updates on the 3 key fields takes around 2 minutes each.  So total, this is around 8.5 to 10 minutes.

Now in SSIS, I load all 40 dimension tables (into SQL2005), and the validation step alone takes around 2 minutes.  Then the load is incredibly slow- around 15 minutes!

Then the fact table load, I use the lookup transformation for 3 surrogate key lookups.  This takes a total of 18 minutes for the 2.7mil row table.

Altogether, it takes 10 minutes for DTS and around 33 minutes for SSIS.  I ran each one 3 times, with consistent results.  Our server is a Dual 3.6Ghz hyperthreaded Pentium IV Xeon with 3G of RAM.

I MUST be doing something wrong in SSIS.  Please tell me I am!!

-Kory


Answer this question

SSIS Performance compared to DTS

  • bmcneill0

    Have you looked at ash's blog about handling lookup misses
     http://blogs.msdn.com/ashvinis/archive/2005/08/04/447859.aspx


    By range lookup do you mean your flow has a say a value of 34 and the lookup has a min and max value so this will match the record with min of 20 and max of 40.
    If so how are you doing this

    Are you using the OLE DB Command



  • Mec343

    I've set logging to a Profiler file, just for the data flow that loads the fact table.  Something I found interesting is this entry:

    The data flow will not remove unused components because its RunInOptimizedMode property is set to false.

    When I look at the dataflow itself, the RunInOptimizedMode is set to True.

    Does this mean anything

  • PeteL - MSFT

    Ah I think I see the problem. You will be better off (I think) with a custom lookup using a script component.

    To do this, you need to load your dimension into an array or other collection and then perform your lookup that way. That will be much more performant than performing a SQL call for each row.



  • IanG

    Thanks.

    You are correct in your analysis of partial cache. If you use partial cache then, in fact, there is no advance caching. Partial cache performs singleton queries for each row, and values for which a match is found are cached for future use.

    A good example of the advantage of partial cache would be for loading sales line items for a supermarket there may be millions of potential product keys. You do not want to cache every sku, but you do know that if an item is found, it is likely to appear in a later basket also, so it makes sense to cache the found value.

    I guess there are a few ways to do bucketing - it rather depends on whether your buckets are static or dynamic.

    • One way would be to do the join in the source query for the data flow.
    • Another would be to use a script component which can call out to the SQL Server using ADO.Net.
    • For static, predefined, buckets you could use a derived column component which calculated the bucket key using expressions. The upper and lower ranges could be driven by variables populated from a SQL query.
    • Finally, my own favourite, which is similar to your won suggestion. Work out a formula that enables you to assign a single bucket key to each bucket. For example ABS(x/20000) will give each 20,000 increment a unique bucket key. You can easily calculate this bucket key in the data flow using a derived column. Then look up from the calculated bucket key in the flow to the bucket key in the reference table and return the surrogate key. Note - the bucket key is not the surrogate key. Please don't make your surrogate keys meaningful - it will bring you endless pain.

    If none of these are suitable, then those RDBMS thingies come in quite handy. We include quite a good one, free with every copy of SSIS. ;-)

    hth

    Donald



  • Gordon7502

  • DeanHuff

    I agree that there are other approaches, but common sense tells me that the way I am doing it should be faster than it is... other ETL tools do this without any issue, so I would assume MS tools should be able to handle it.

    Can I trace the queries being sent to the database to see what is happening   How does the trace handle the cache

    Kory


  • benjarras

    Running SQL Server Profiler will show the queries being sent to the DB.

    I would suggest that for certain high volumn processes out of the box tasks/components will not be the best. SSIS is an amazing product partly form the ease by which it can be easily extended.

    I've just wrote a multi threaded rss reader in 20 lines of code, works on the power of .net within an extensible architecture. Most people will be able to use the product and components out of the box others will need to write code. Not neccessarily complex code.

    You must realise that this is a v1 product and compared with many MS V1 products this is leagues above the rest, largely because the framework is there. If you need to extend it is easy and this is what the time has been spent on, extensibility and performance.



  • mathieu.szablowski

    I just read ash's blog- currently I am ignoring misses, and do not have follow-up step to insert a default value.

    By range lookup, I mean I have a loan amount in my source table, say $234,232.43.  My dimension table has ranges like $0 to $20,000, etc.  In the dimension table I have a Min and Max column for the range. 

    In the lookup, I use the Advance properties to re-write the default query to use two parameters and modify the where clause to say "WHERE [MIN] < AND [MAX] >= "

    When I modify the default query, it only allows a partial cache to be used, which I set at 10M.

    In order to modify the lookup query, I first need to join two fields, so I join two arbitrary fields (in this example, I join Loan Amount in the source to MIN in the lookup table).  This is possibly a reason for the poor performance, but without the initial join, the lookup is invalid.

    Any other ideas

  • soellnas

    Do you have the lookups configured to cache data. With all that memory it should be able to cache all the data. if not it will be doing an sp call for each row thats 2.7 million * x dimensions. Thats a lot of sql calls.

    Can you post your package

  • udotan

    SSIS does not support range lookups of this type in this version, but check out this solution from Runying Mao via Thomas Pagel's blog ...

    http://sqljunkies.com/WebLog/tpagel/archive/2005/08/31/16585.aspx 

    Why is this not part of our lookup component

    One of the design philosophies of SSIS is to provide a range of quite atomic "tookit" components that can be built into design patterns - rather than providing more complex components that implement entire patterns internally. Think of it as RISC v CISC. :-)

    The lookup component uses hashes to make the lookup operation itself more efficient. But by their nature hashes are not suitable for finding values within a range - only for finding exact matches. For this reason I expect that a range lookup would actually be a different component.

    Now you'll be wondering if we're going to do one in next version. :-) I can't say for sure - it's a good scenario that others have requested too. Perhaps - or perhaps we create more atomic components that enable the pattern.

    hth

    Donald Farmer, Group Program Manager,
    Microsoft SQL Server Business Intelligence

     

     

     



  • Gregory_N

    Not being an expert but I believe that as your lookup ranegs are small. Loading htme into an array and doing binary lookups would probably provide the performance you require.

    This would nee some code that turns a result set into an array or something that a rnage lookup can be performed against. The key being doing an in memory lookup on your ssis machine rather than a sql query against a remote machine.

    Sorry havent got any code at then moment to answer your problem

  • jeff_akbm

    Thanks Donald.  The reason Runying Maos solution doesn't fit my problem is his query is returning more values and he has to filter for just one, but in my situation I match against an "upper" and "lower" boundary, so I can't do an equi-join.

    From a dimensional modeling point of view, I imagine this happens in ETL quite a bit, where users want to bucket balances, interest rates, etc. into discrete buckets.  The only way I know how to assign a surrogate key to these ranges is to compare the value against a min and max range.  Did folks in Project REAL run across this type of key assignment problem

    Now this leads me to find the most efficient, performant way of getting SSIS to do this lookup, obviously without using the Lookup Transformation.

    Looking at the profiler trace while running my package, it doesn't appear to be caching the lookup dataset, even though I've chosen partial cache.  Is this because if the value, say $23,339.98 is surrogate key 4 which is range description $20,000 - $40,000, it would cache the value (23,339.98) and key (4) and if ever it ran across that exact value again, it would retrieve (4) from the cache instead of running another query   If that's the case, then the partial cache would do no good because in 2.7million records, the chances the discrete values would match are low.

    I suppose another strategy is to change my surrogate keys to be smart, which I believe Ralph Kimball is against.  In this case, the surrogate key could be expressed as a formula or expression rather than a lookup.  However, this would only be practical for dimensions with consistent increments, say 5% increments, or $10,000 increments.  Random increments would make smart key expressions too difficult, and knowledge of the dimension table would have to be inherent in the expression.

    Is there another way I can write my lookup query to change the join type, using hints   If I abandon the lookup transform altogether for this, what type of trasform should I be using An Execute SQL task that calls a custom stored procedure   I can't see how this would be any more efficient than what the Lookup transform is doing.

    Kory


  • nmanville

    Thanks for the feedback.  I've tried both of your suggestions:

    1) Made sure I was utilizing the cache.  For exact-match lookups, I used Full cache, and the lookup tables were very small (< 30 rows).  For range lookups, all that is available is partial cache, which I set at 10M, the lookup table for this was < 30 rows as well.

    2) I optimized the lookup query for the range match to only include the columns necessary to do the lookup.


    I ran two scenarios- one with just exact match lookups for two dimensions.  This loaded 2.7 million rows in 4.75 minutes.

    The second scenario, I included a range lookup as well as the other two exact lookup matches.  This loaded the 2.7 million rows in 2.5 hours!

    I'm at a loss on what to do next.  I've done this same thing using Data Integrator with 20 key lookups, 8 of which are range lookups, and it loaded 2.7 million rows in under 20 minutes.

    I'm currently using the September CTP.

    Anyone from Microsoft who can help me on this one

    Thanks

    Kory

  • Michael Adamson

    The only other thing I can think of is that you're getting all columns from the lookup tables instead of just what you need... check the second bullet on this: http://www.sqljunkies.com/WebLog/ashvinis/archive/2004/10/26/4802.aspx


  • SSIS Performance compared to DTS