Hi guys,
I have been able to put together a simple SSIS package and familiarize with how it works. Now as I am trying to put together this ETL job I am stuck and am looking for some recos:
- I need the source system to be able to do a join on my Datawarehouse table and push thru only those rows which I am interested in. Can this be done using one of the Data flow task Presently we do it by using staging area, creating temp tables, doing joins, but if I attempt the same here, what is the advantage I would derive out of SSIS Any pointers will be helpful
- Does anyone know as to how I can tap into the SSIS meta data and maybe kind of provide an UI on top of it to make small changes (I know I can do that using variables, but having the option of tapping into meta data would be cool)
Thanks for bearing with me so far .....
Thanks,
spj11

ETL using SSIS
savage1965
spj,
Firstly, the LOOKUP component can cause problems if there are alot of records in the lookup cache. But the number of records in the pipeline (note the important difference) (i.e. 1641991 in your case) should be largely irrelevant. How many rows are you looking up against (i.e. how many rows in your lookup cache)
As always, the only way to know whether a LOOKUP or a MERGE JOIN will work better is to test and measure.
Now onto your real problem - the timeout in the DataReader source.
Why are you using a DataReader source Have you tried an OLE DB Source
Your query takes 12:21 just in QA. What is this query doing Perhaps your first avenue of investigation should be whether this can be speeded up.
A couple of starting points there.
-Jamie
Anonymice
You are right, the bottleneck here is the hardware, but thats what I am concerned right. A lot of optimization need to done before package deployment. You might argue this is the case with SQL queries too, but with SQL queries, we kind of know how to optimize it by going for proper indexing, increasing page size and all other database optimization
FYI: The number of rows are 1641991
Now you are recommending using raw file transfer and then joining them ....... Ooooohhhh that doesn't sound doing the right thing, but I might as well try that .....
Thanks again.
Alejandro Mezcua
I'm aware that there are problems with temp tables but using SSIS means it shouldn't be a problem. The power of SSIS is that you can do all this manipulation in memory so temp tables aren't needed.
That isn't a particularly useful error message is it What kind of source component/connection manager are you using It seems to be a connecivity issue. Can you run Profiler on the source and see if you can see the the conneciton attempt
-Jamie
P.S. If you are trying to find out which rows are new rows (as indicated in the linked to article) you should try method 2 first - use a LOOKUP.
m_3ryan
I have 4 tables/views that I am doing the joins on and based on ur reco, I am attempting merge join. But is there no component which can straightforward do 4 way join rather than I have OLE DB source than sort then join and then sort and then join again like here:
OLEDB1 OLEDB2 OLEDB3 OLEDB4
Sort Sort Sort Sort
JOIN 1 JOIN2
Sort Sort
JOIN
Hopefully I am making sense.
Thanks,
spj11
ashok_gupta121
where it talks about doing a merge join on src and destination (which effectively does the join). But the problem is my source server is timing out.
Error mesg:
Error: System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection) at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj) at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) at System.Data.SqlClient.SqlDataReader.ConsumeMetaData() at System.Data.SqlClient.SqlDataReader.get_MetaData() at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString) at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async) at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result) at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method) at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method) at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior) at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior) at Microsoft.SqlServer.Dts.Pipeline.DataReaderSourceAdapter.PreExecute() at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPreExecute(IDTSManagedComponentWrapper90 wrapper)
Is there a way around without using temp tables
Xinirnix
Ok, so this is where I stand:
Followed your guidance and converted those joins to "Merge Join" in the Data Flow and it seemed like fast initally but as it started getting about a million rows from those four views, my memory was clogged and it was taking an awful long time.
So I redesigned the package and converted those four sources to the original join but this time I used OLE DB Adapter (and this has the connection timeout) and it sure did kick off well again. But since the join yields again almost a million rows, my memory was clogged but this time I allowed it to run and it took an awful 5.5 hours.
These packages were run on a machine with W2k3, with 1 GB memory and 2 GB virtual memory enabled. Our prod servers would have bigger memories but my concern after this excercise is this could potentially be a bottleneck right. Are there ways I can improve this Should I resort to temp tables for all the cleansing
Thanks,
spj
Martin Vobr
I've just looked and I don't think there is a setting to change the timeout unfortunately.
So the actual query that you are executing is joining a load of tables/views (its irrelevant what) together right Why not just have a seperate datareader/ole db source for each view and do the join in the SSIS pipeline
-Jamie
michivo
I use the terms "pipeline" and "data-flow" interchangably. I probably shouldn't Sorry about that.
So when I say do the join in the pipeline, yes, I do mean to use a MERGE JOIN.
-Jamie
Chris Wr
It seems as though you are hitting the wall with the hardware you have at the moment.
If you want to double check then you could use the SSIS performance counters - documented in BOL as far as I am aware.
How many rows are we talking abou What's the width of the row
I assume everything is going on in one data-flow currently is that correct Here's a tip for you - break it into seperate data-flows and use raw files to transfer data between them.
-Jamie
weehyong
I will try that. When you say SSIS pipeline, did u mean use a merge join
Ron_S_MWA
I wanted to add something else to the scenario 1 I was talking about above:
Even if I wanted to do the join by creating temp tables with data from the DW, how do I add it to the Data Flow task. It doesn't see the temp table during design right. Did anyone else face similar situation How did you guys resolve it
Mark64
Yes, the lookup component will potentially go up to a million records or more too .....
As per your second question, the reason the query is taking 12:21 is bcoz of the joins on views. Its the only interface we have to be able to gather the data. There r plans to do away with that interface (thank God!!!) but presently I am stuck with that.
So what would u do in this scenario .... Still continue and explore lookup versus merge join .... Have you ever had a scenario where u had to work with temp tables
I will try with the OLE DB Source, but do you know if I can ask the connection to keep continuing until it gets the results. Where do I set this property, if there is one
Thanks,
spj
Lakshmana Kumar K
You are correct, that is exactly what you would have to do. Although its worth saying that if the output from the source components is already sorted then you don't need the SORT components.
I don't know whether this would be faster or slower than doing the joins in a single query but hopefully it would eliminate the perceived timeout problem.
-Jamie
Boonster
Thanks Jamie for the attention. You have helped a lot of people in here and I am hoping you can help me too.
It is a DataReader source component. I think its timing out becoz of the actual time it takes to execute the query on the source system. The source system provides us a certain set of views and we do a join on them to get the requisite data. Then on that subset data, we need to do a join on one of our DW table (lets call DWTable) to determine if we need inserts or updates (this is today being done using temp tables where we load the DWTable from our DW)
Now I have followed some of ur posts here and I got the impression that lookup is not such a good idea when you have lots of records to lookup. This would be the case in my scenario; the records for lookup will keep growing over time and it could well be a million or so ....
I have tried to run the query without the join on our DWTable (which is what the DataReader Source component above is doing) and it took 12:21 mins with a total of 1641991 records (this is from Query Analyzer).
Any help or pointers will be greatly appreciated
Thanks,
spj