SSIS: Inserting unknown members from fact table

Looking up surrogate keys in a dimension table and adding these to your data flow is easy when there is a match in your dimension table for every key in your fact table. However, I am puzzled by how to manage the data flow when no match can be found for a specific key in the fact table when doing the lookup AND I then want to insert this unknown key as an unknown/inferred member in the dimension table. The problem is further complicated by the fact that when I have inserted the unknown member in the dimension table and it has been assigned a surrogate key there, I want to add this surrogate key to my fact table - just as if there had been a match in the lookup in the first place.

I realize that I must likely will have to start by somehow managing the error output from the lookup component and continue from there. But how

Any help is greatly appreciated.


Answer this question

SSIS: Inserting unknown members from fact table

  • pelandry

    I'm looking for a solution to import data from a flat file into an normalized data modell. To explain it a little simpler think about to following:

    The Data Souce is a CSV-File with FirstName, LastName and Category. Sample data could be

    Dirk; Bauer; sailing
    Peter; Bauer; fishing
    Marc; Bauer; reading

    In my data modell I have defined the 2 tables "Person" and "Category":

    Table "Person"
    ----------------
    [PersonID] [int]
    IDENTITY(1,1) NOT NULL
    [CategoryID] [int] NOT NULL
    [FirstName] [nvarchar](50)
    [LastName] [nvarchar](50)

    Table "Category"
    ----------------
    [CategoryID] [int]
    IDENTITY(1,1) NOT NULL
    [CategoryName] [nvarchar](50)

    Now I like to read my first row from the source and lookup a value for the CategoryID "sailing". As my data tables are empty right now, the lookup is not able to read a value for "sailing". Now I like to insert a new row in the table "Category" for the value "sailing" and receive the new "CategoryID" to insert my values in the table "Person" INCLUDING the new "CategoryID".

    I think this is a normal way of reading data from a source and performing some lookups. In my "real world" scenario I have to lookup about 20 foreign keys before I'm able to insert the row read from the flat file source.

    I really can't belief that this is a "special" case and I also can't belief that there is no easy and simple way to solve this with SSIS. Ok, the solution from Thomas is working but it is a very complex solution for this small problem. So, any help would be appreciated...

    Thanks,
    Dirk


  • Lars-Inge Tonnessen

    This is the way I do it.

    1.  When I create my dimensions I have a Sql Task that adds an "unknown" entry to my dimension - usually with a surrogate key value of 1. I then load the rest of the records using a SCD Transform

    2.  I create a variable in the package I use to lookup surrogate keys - such as "UnknownDate" and assign it a value of 1 (or whatever the key value is - I use this so I can just change the variable if my unknown key changes in the dim)

    3.  I do a lookup for the business key & return the surrogate key.  If it is a success, I go to a Union All, if it fails, I point it to a Row Count.

    4.  I put a variable in the row count, something like rcNoQaDate - so I can log the "no match" records

    5.  I then map the row count to a Derived Column component and create the surrogate key field in the data flow and assign the user variable "UnknownDate" 

    6.  I map the derived column to my Union All, which matches up my surrogate key I created in the Derived Column to the one found in the lookup transform.

    Hope this makes sense.  In some of my packages I also log the business key for the rows that don't match so I can do research on them for debugging.  I just add a multicast after the derived column and go to an OLEDB Destination.

    -Evan Black

  • Raffi BASmajian

    Hello again,

    Actually, what I want to do is described by Kimball in option 3 in the mentioned paper. I want to insert the unknown fact record in the dimension table with "dummy" attributes. You see, I cannot be certain that I will ever get more information on the unknown business key in subsequent dimension loads. Therefore, simply exposing it in the dimension table with the dummy attributes is enough for the business user.

    I would still like to have an example of how to solve my problem in SSIS. After all, the way I understand inferred member support in the SCD, it actually assumes that unknown business keys from the fact table are being inserted into the dimension table and marked with a boolean value (inferred/not inferred). This way, the inferred members can be recognized in subsequent dimension loads and updated. For this to work, some method for inserting unknown business keys from the fact table and assigned a surrogate key to them in the data flow has to exist... !



  • rufusz1

    The way I see it, the SCD wizard is for loading dimensions only. My problem arises when I am loading fact tables and I want to assign surrogate keys for each of the business keys in the fact table. I assign the surrogate keys by doing lookup transformations. However, when I encounter an "unknown" business key in the fact table (i.e. there is no match in the dimension table), I want to:

    1. Insert the unknown business key in the dimension table (and thereby automatically assign a surrogate key to it because the surrogate key field in my dimension table is an identity column).
    2. Add the newly generated surrogate key to my data flow

    How would the SCD wizard help me with this

  • miloush

    Michael,

    have a look on my blog... http://sqljunkies.com/WebLog/tpagel/archive/2005/08/19/16495.aspx

    Perhaps this is the solution you're looking for...


  • Robort

    My approach (I'll write a blog about this when I find some time) is just a little bit different... I ALLWAYS give the "unknown" member the ID 0. And I define in every fact table a standard value for each foreing key of 0. So it's easy whenever you don't find any matching dimension record or simply "forget" to set it, you still have a "working" modell...

  • basav

    Thomas,

    I do agree with you, and you are right: Poor data integrity in the source system is the only reason (combined with early-arriving facts) why I need this approach. However, I have a case where I have to do a POC based on extracts from only transaction tables in the source system. The transaction tables share dimensions and as such I need to create "common" dimension tables instead of just building the dimensions in AS right on top of the transaction tables.

    Besides, I still assume that when Microsoft designed the support for inferred members in the SCD wizard, they must have had this scenario in mind. Smile

    ... and another thing: Simply placing all unknown business keys in one "unknown" bucket in the dimension table will make it difficult/impossible for the business users to correct the poor integrity in the source systems, since they would not know which products/customer/etc. they had to correct without a key to identify them.

  • Martin Koppmann

    Michael,

    I'm still wondering about why you need this stuff so much... I'm in BI business for some years and did quite a number of projects... This scenario wasn't new for me but it's still some kind of "special". The usual case is that you have your dimension and match it to your facts. It's not very common (at least for me) to build up dimensions from the facts... At my customers dimensions come from other systems (i.e. a ERP system). When dimension records are missing then this is indicating a problem on the source system's side, i.e. customers are deleted who had orders. Then it's no good idea to automatically "recreate" the customer but to push this to an "error" or "unknown" bucket to be inspected by a business administrator. These problems should be fixed on the source system side, not in the warehouse. But that's my way to handle it, your scenario might be different... So building up ETL processes like I designed in my blog for each key for each fact table, that would be much too much work, I aggree. However I can't offer you a simpler solution...


  • Moritz Pfennig

    Have you tried the Slowly Changing Dimension Transformation It has native support for SCD's including inferred members, so should be able to help. Very fast to use, compared to manually building Data Flows to do the same thing. You can always tweak the resulting Data Flow as well.

  • Mike Allen Mc Donogh

    Thomas... Thank you very much for your contribution. It really looks like you found a working solution, which - in itself - is great. However, I can only begin to imagine the amount of work involved in repeating your example for every key in the fact table - not to mention every key in EVERY fact table. Tongue Tied

    Anyway - we have a way to do it. Great!

  • Mark 5762

    Dirk,

    You can do the import in two steps (dataflows), then it's not so hard to do... The downside would be that you have to read the data twice...

    The first dataflow imports the categories. You have your source, do a aggregate on the category and save it in the category table.

    The second dataflow reads the persons, assigns the categoryID by a lookup on the table you filled in the fist step and saves it in the Person table.

    For me it's a quite uncommon scenario to have "denormalized" fact tables, or fact tables having dimension data, too. It's not easy to fill dimension tables and the fact table from one source in one dataflow when you want to assign surrogate keys in it, too...




  • Eli Schleifer

    Hmmm... Interesting. However - once again I must stress my need to insert the unknown business keys from the fact records as new/inferred members in the dimension table and THEN use the assigned surrogate key in the fact table. This cannot be done with the described approach, since all unknown business keys in the fact table will be mapped to a single "Unknown" member in the dimension table. More often than not (in my experience) this just does not cut it! The "unknown" business keys in the fact table often has a meaning for the business user, which is why it HAS to be exposed as a separate entity.

    Any other ideas Thanks! :-)

  • Maksim

    OK... There does not seem to be a solution to my problem - at least not an easy one! Marco Russo has actually been blogging a bit about something similar. He has some good points. His conclusion, however, is that doing what I want to do in SSIS is quite cumbersome. See his blog:

    http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx

    I really like SSIS, but since is a tool that very much is pointed towards BI development, I wonder why there has not been more focus on solving this particular problem!

    Are my needs that atypical Smile

  • bob12354

    Hi

    The approach suggested by Evan Black is the one I generally follow.

    I insert a "Dummy" record, with a surrogate key of 1, in all dimensions. Whenever a fact lookup fails, I associate the fact with this "dummy" member. If you have an append/modify fact, then the fact records linked to the dummy member(s) will hopefully get cured the next time you run the refresh. This is due to the fact that subsequent dimension refreshes will get those new members and the same fact lookup will extract an existing dimension member at that time.

    You might want to read the article by Ralph Kimball on early arriving facts. It explains three possible approaches in all.

    http://www.rkimball.com/html/designtipsPDF/KimballDT57EarlyArriving.pdf

    Regards

    Milinda_V

  • SSIS: Inserting unknown members from fact table