Performance, handling 1.2 million rows takes 7-8 hours?

Hi!

I'm working on a project restructuring some data using SQLIS. The problem is one package that is handling about 1.2 million rows takes about 7-8 hours to complete.

To my question, is there any of the dataflow components that are know to be slow that you should avoid regarding performance issues

The following list is a brief specification over the component that are contained in the package:

-------------------------------------------------------------------------------------

3 OLEDB Data sources.
2 OLEDB Data sources.
3 Derivied column objects.

1 Union which unions all rows from the sources above.

2 look up components that look up rows for the all rows after the union component.

1 script component calulating a new surrugate key for all rows (sk=sk+1)

1 Conditional split splitting up all rows into 2 separate flows

For each of these separate flows there is a 2 look up components.

-------------------------------------------------------------------------------------

:)

I don't know if the above was much of help, but I hope it will give you a brief overview, there are no complicated script components involved.

The server running the package does nothing else but executing the package on a dual core Pentium 4 at 3.2 GHz and has 2GB of RAM.

-------------------------------------------------------------------------------------

When the package has started it seems to be IDLE for 7 hours, taking almot 0% of the processortime... :S

Anyone who has had a similar problem




Answer this question

Performance, handling 1.2 million rows takes 7-8 hours?

  • Ifgash

    I found that I could still use the OLE DB destination with fast load, max batch size of 3000 and max commit of 3000, then the second ole DB destination for a table level insert and error catching.

    I used a script component to transform the error codes into something more meaningful and then dumped the last ditch errors (after 2nd ole db adabter to a local variable object dataset).
    (thanks Jamie for your blog post a few years ago.)



    C
    ode Snippet

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    'Use the incoming error number as a parameter to GetErrorDescription
    Row.MeaningfulDescription = ComponentMetaData.GetErrorDescription(Row.DestTableErrorCode)
    End Sub






    I am just moving a data warehouse from staging to live, so there were no transformation and lookups to do. But it moved 1.4 million lines through in about 5 minutes. 900,000 through fast load, around 500,000 through the table level ole DB dest, and 50 ended up in my error DataSet in a package level variable so I could use the script component to figure out why there were errors.

    Somehting I learned on the Lookup tables: Make sure the lookup keys are indexed so the SQL query is faster. Using SQL trace you will see a call for each lookup until they are all catched in memory (use full catch to handle null values).

    I have to figure out the whole SQL transaction logs bit. I set it to simple so far, with 10% growth, and limited to 4096 (4 gigs).

    Go the package Default Buffer size at 104857600, which I think is 100mb (100 x 1024 x 1024 bytes/megabyte). Still in dev, but My machine bogs down after all the memory consumption.

  • Vb.net user

    I have no general solution but have now come optimized the model which now takes about 3 minutes to execute, with a larger dataset containing about 6M rows.

    -----------------------------------------------------------------------------------------------------

    1)

    Took all lookups from large tables and used "SQL JOINS" instead.

    2)

    Replaced "OLEDB-destination" in favor for "SQL-destination", this made a significant speed up. This is one point where I could see that a big difference, when I was using the "generic" OLEDB driver the model stored 10000 in chunks which took alot of time. After changing to SQL-destinations the rows seemed to be stored much faster in a more seamless fashion. Maybe seem lika a obvious thing, but it didn't know there were such a difference, I liked the OLEDB-Destinations because of the error handling capabilities.

    3)

    Changed my restriction to the db & transaction log - files, to: no restriction in size and file growth by 10% (previously 1MB). This may have speeded up the work a bit for SQL Server, cause it would not have to allocate new space all the time.

    4)

    The above changes made the model execute really fast when it was split up. After merging the separate models, running the "hole" model still "hang" and used almost no CPU. The solution in this case was to use more engine threads from default 5, I increased the number of threads to 10.

    -----------------------------------------------------------------------------------------------------

    Sorry for the specific solution, but I hope it may help someone that have been stuck in some similar scenario, to try out some of the above "tricks" to solve hers/his problem.

    Best luck!



  • iemad

    I'll echo what Jamie said - the Lookups are the only thing that could take time. Check:

    * If the Lookup CacheType = Full or Partial, are you out of memory and swapping to disk (doesn't seem likely)

    * If CacheType = Partial or None, you're doing a round trip to the db for every single record (None) or for every single record missiing the Lookup cace (Partial), this seems a likely reason for your issue.

    Good luck/Kristian



  • Jason Smith

    I would suggest that the idle time is spent populating the LOOKUP caches. Are you using full caching

    If those 7 hours are spent filling caches then you will see information in the Output window informing you of this.

    It seems you have ALOT of LOOKUP components here. My first advice would be to split the LOOKUPs into seperate data-flows - using raw files to pass data between them.

    I'm interested to know hwo you get on so reply here and let us know!

    -Jamie



  • JT673

    Depends on indexes on the destination as well. I indexed my destination table (added a few indexes on top of the existing 5 dimmension surrogate keys... Primary Key) and the transfer slowed down alot.


  • amphi

    Hello guys,

    Thanks for the replies I'm using full cache on all lookup components.

    Do I have to set the following options in the Advanced-tab:

    "Enable memory restrictions" and
    "Enable caching"

    to true, in order to get the cache activated



  • Global Lens

    yes but I've split the task now and only have 3... to debug alittle to see if it went faster, but it seems that I get the same result, the CPU goes idle...

    why I want to have all components in the same task is that I'm constructing a parent\child structure and then it was the most straight forward solution to have all components in the same task, coz I need the surrugate key that is generated for parent to populate all children rows...

    I'm merging alot of different tables into a more general model, and have to do alot of lookups to find all information... Maybe it's more efficient to do JOINS in the datasource instead od tables/lookups...



  • C# Lover

    No. They are additional tuning params. I think you can leave them for the time being.

    As I understand it you have something like 5 LOOKUPs in the same data-flow. With 1.2 million rows of data this is very likely to cause problems

    -Jamie



  • Ramesh Narayanan

    If you CAN do joins in the source components then that is where you should do them. Use the power of the DBMS before you have to employ SSIS.

    You still haven't found out which component is causing the problems. Break your data-flow into as many speerate parts as possible and link them with data-flows. Assuming this works OK start to put everything back into the data-flow - and stop when the problems re-occur. That will show you where the problems occur.

    -Jamie



  • Performance, handling 1.2 million rows takes 7-8 hours?