Use Replication to Identify Updated Data?

I am new to SSIS (though have a decent amount of experience with SQL Server 2000), and am trying to design a data warehouse and eventually a comprehensive reporting solution.

Here is our setup:
- We have ~150 studios all operating software which we designed (ie. we have control over the system that will be the Data Warehouse source).
- All studios use merge replication back to Head Office (we maintain a duplicate of each of their databases).

From my limited understanding of Data Warehousing methodology, I believe that I should prepare a 'staging area' which contains all needed data from these databases. My question is how to do this in an efficient manner (without resorting to software-maintained timestamps in all the source tables - this would require significant modification to our software).

Since replication is already identifying which rows are new or updated, I'm wondering if there is some way to use this information to limit the information processed during the Data Warehouse updating process. I realize that I would still need to determine whether the row was updated or new, but it would cut down immensely on the number of rows processed.

I feel like I could be missing out on a simple way to do my data extraction from the source DBs. Does anyone have any advice

Thanks in advance for your help.


Answer this question

Use Replication to Identify Updated Data?

  • jaskey

    Brent,

    Thanks for the advice - definitely some good points I hadn't thought of... the replication scenario would be a nightware in itself.

    In talking with my colleague, we thought of maybe just putting some sort of calculated hash in each source and destination table, and then comparing them to determine if there was a change (ie. select * where source.hash <> destination.hash). Not sure if this would work, but I have a nagging suspicion that this is what the CheckSum component discussed in these forums is being used for. If so, I've definitely been a bit slow on the uptake.

    Thanks,

    Kristian


  • rick1234

    Kristian,

    I don't know much about replication. Is there anything in the replicated data to indicate when it entered the replication database If there is then there is a possibility of using that information to do what you wish. Otherwise I fear you may have to go down the nasty route of comparing all of the source with all of the destination to determine new, changed or unchanged rows.

    Not much help to you I know. Sorry.

    -Jamie



  • nka

    This is probably not a good idea for a couple of reasons. First, you have to worry about what to do when you reinitialize the subscriptions. In this case you have the option of overwriting all the data on the subscribers which you will have a hard time tracking as it doesn't go through the normal replication system tables. Second, the system tables that hold the replication commands are flushed quite frequently so you would have to have a plan to deal with the possibility of missed updates. Also, if replication ever went down for some reason then so would your datawarehouse. BOL has a fairly detailed explanation of how merge replication works.

    A better solution would probably be to write your own triggers to track changes in a table you specify.


  • CGuyArun

    Brent,

    You bring up excellent points. Our application is still very much in the enhancement (read: scope breaking) phase, so the frequent schema changes would probably make the hash value solution impractical.

    I will look further into the triggers - perhaps they could call a user defined function that would populate a 'db changes' table, passing along values indicating the table, key, change type, datetime, etc.

    Thanks again,

    Kristian





  • Satriani

    - Kristian - wrote:
    Jamie,


    My biggest issue in getting started here is just shock and disbelief that SQL Server doesn't have some sort of basic changed data capture capability- I guess I assumed it would be a core part of any data warehousing solution.


    Kristian

    SQL Server is a platform for data storage and manipulation. You have to implement a solution on top of that platform. CDC is firmly out of scope for SQL Server in my opinion and is definately not a core part of a DW solution.

    Mind you, they keep bolting new things on all the time so who knows - it could come in the future. Try suggesting it at the feedback center: http://lab.msdn.microsoft.com/productfeedback/default.aspx

    -Jamie



  • DanSmith2004

    Jamie,

    Thank you for your response. Thanks also for your blog - there's lots of helpful SSIS info there. I believe that replication sets up a series of triggers on the DB tables to track changes; now I just need to find out how to access that info....

    My biggest issue in getting started here is just shock and disbelief that SQL Server doesn't have some sort of basic changed data capture capability- I guess I assumed it would be a core part of any data warehousing solution.

    My manager is starting to get skeptical about the whole project. He figures running through all the data each night (which they consider the minimum update frequency) isn't feasible. However, it looks like that's the way it's done.

    If anyone has further information about gleaning information from a replication process, I'd be very grateful!

    Thanks again,

    Kristian

  • Ori&amp;#39;

    Depending on the hash algorithm there is a possibility that there will be duplicate hash values in your table. It is usually a good idea to compare primary keys as well as a hash value for this reason. The checksum component you mention is quite useful but in your case it might be better to use a computed column (with the hashbytes function) in your source so you don't have to read the entire table to generate hashes in SSIS. Also, once you create a hash across a number of columns, if you ever want to add another column to the datawarehouse it's quite painful.
  • Use Replication to Identify Updated Data?