Looking up surrogate keys in a dimension table and adding these to your data flow is easy when there is a match in your dimension table for every key in your fact table. However, I am puzzled by how to manage the data flow when no match can be found for a specific key in the fact table when doing the lookup AND I then want to insert this unknown key as an unknown/inferred member in the dimension table. The problem is further complicated by the fact that when I have inserted the unknown member in the dimension table and it has been assigned a surrogate key there, I want to add this surrogate key to my fact table - just as if there had been a match in the lookup in the first place.
I realize that I must likely will have to start by somehow managing the error output from the lookup component and continue from there. But how
Any help is greatly appreciated.

SSIS: Inserting unknown members from fact table
pelandry
The Data Souce is a CSV-File with FirstName, LastName and Category. Sample data could be
Dirk; Bauer; sailing
Peter; Bauer; fishing
Marc; Bauer; reading
In my data modell I have defined the 2 tables "Person" and "Category":
Table "Person"
----------------
[PersonID] [int] IDENTITY(1,1) NOT NULL
[CategoryID] [int] NOT NULL
[FirstName] [nvarchar](50)
[LastName] [nvarchar](50)
Table "Category"
----------------
[CategoryID] [int] IDENTITY(1,1) NOT NULL
[CategoryName] [nvarchar](50)
Now I like to read my first row from the source and lookup a value for the CategoryID "sailing". As my data tables are empty right now, the lookup is not able to read a value for "sailing". Now I like to insert a new row in the table "Category" for the value "sailing" and receive the new "CategoryID" to insert my values in the table "Person" INCLUDING the new "CategoryID".
I think this is a normal way of reading data from a source and performing some lookups. In my "real world" scenario I have to lookup about 20 foreign keys before I'm able to insert the row read from the flat file source.
I really can't belief that this is a "special" case and I also can't belief that there is no easy and simple way to solve this with SSIS. Ok, the solution from Thomas is working but it is a very complex solution for this small problem. So, any help would be appreciated...
Thanks,
Dirk
Lars-Inge Tonnessen
1. When I create my dimensions I have a Sql Task that adds an "unknown" entry to my dimension - usually with a surrogate key value of 1. I then load the rest of the records using a SCD Transform
2. I create a variable in the package I use to lookup surrogate keys - such as "UnknownDate" and assign it a value of 1 (or whatever the key value is - I use this so I can just change the variable if my unknown key changes in the dim)
3. I do a lookup for the business key & return the surrogate key. If it is a success, I go to a Union All, if it fails, I point it to a Row Count.
4. I put a variable in the row count, something like rcNoQaDate - so I can log the "no match" records
5. I then map the row count to a Derived Column component and create the surrogate key field in the data flow and assign the user variable "UnknownDate"
6. I map the derived column to my Union All, which matches up my surrogate key I created in the Derived Column to the one found in the lookup transform.
Hope this makes sense. In some of my packages I also log the business key for the rows that don't match so I can do research on them for debugging. I just add a multicast after the derived column and go to an OLEDB Destination.
-Evan Black
Raffi BASmajian
Hello again,
Actually, what I want to do is described by Kimball in option 3 in the mentioned paper. I want to insert the unknown fact record in the dimension table with "dummy" attributes. You see, I cannot be certain that I will ever get more information on the unknown business key in subsequent dimension loads. Therefore, simply exposing it in the dimension table with the dummy attributes is enough for the business user.
I would still like to have an example of how to solve my problem in SSIS. After all, the way I understand inferred member support in the SCD, it actually assumes that unknown business keys from the fact table are being inserted into the dimension table and marked with a boolean value (inferred/not inferred). This way, the inferred members can be recognized in subsequent dimension loads and updated. For this to work, some method for inserting unknown business keys from the fact table and assigned a surrogate key to them in the data flow has to exist... !
rufusz1
1. Insert the unknown business key in the dimension table (and thereby automatically assign a surrogate key to it because the surrogate key field in my dimension table is an identity column).
2. Add the newly generated surrogate key to my data flow
How would the SCD wizard help me with this
miloush
have a look on my blog... http://sqljunkies.com/WebLog/tpagel/archive/2005/08/19/16495.aspx
Perhaps this is the solution you're looking for...
Robort
basav
I do agree with you, and you are right: Poor data integrity in the source system is the only reason (combined with early-arriving facts) why I need this approach. However, I have a case where I have to do a POC based on extracts from only transaction tables in the source system. The transaction tables share dimensions and as such I need to create "common" dimension tables instead of just building the dimensions in AS right on top of the transaction tables.
Besides, I still assume that when Microsoft designed the support for inferred members in the SCD wizard, they must have had this scenario in mind.
... and another thing: Simply placing all unknown business keys in one "unknown" bucket in the dimension table will make it difficult/impossible for the business users to correct the poor integrity in the source systems, since they would not know which products/customer/etc. they had to correct without a key to identify them.
Martin Koppmann
I'm still wondering about why you need this stuff so much... I'm in BI business for some years and did quite a number of projects... This scenario wasn't new for me but it's still some kind of "special". The usual case is that you have your dimension and match it to your facts. It's not very common (at least for me) to build up dimensions from the facts... At my customers dimensions come from other systems (i.e. a ERP system). When dimension records are missing then this is indicating a problem on the source system's side, i.e. customers are deleted who had orders. Then it's no good idea to automatically "recreate" the customer but to push this to an "error" or "unknown" bucket to be inspected by a business administrator. These problems should be fixed on the source system side, not in the warehouse. But that's my way to handle it, your scenario might be different... So building up ETL processes like I designed in my blog for each key for each fact table, that would be much too much work, I aggree. However I can't offer you a simpler solution...
Moritz Pfennig
Mike Allen Mc Donogh
Anyway - we have a way to do it. Great!
Mark 5762
You can do the import in two steps (dataflows), then it's not so hard to do... The downside would be that you have to read the data twice...
The first dataflow imports the categories. You have your source, do a aggregate on the category and save it in the category table.
The second dataflow reads the persons, assigns the categoryID by a lookup on the table you filled in the fist step and saves it in the Person table.
For me it's a quite uncommon scenario to have "denormalized" fact tables, or fact tables having dimension data, too. It's not easy to fill dimension tables and the fact table from one source in one dataflow when you want to assign surrogate keys in it, too...
Eli Schleifer
Any other ideas Thanks! :-)
Maksim
http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx
I really like SSIS, but since is a tool that very much is pointed towards BI development, I wonder why there has not been more focus on solving this particular problem!
Are my needs that atypical
bob12354
The approach suggested by Evan Black is the one I generally follow.
I insert a "Dummy" record, with a surrogate key of 1, in all dimensions. Whenever a fact lookup fails, I associate the fact with this "dummy" member. If you have an append/modify fact, then the fact records linked to the dummy member(s) will hopefully get cured the next time you run the refresh. This is due to the fact that subsequent dimension refreshes will get those new members and the same fact lookup will extract an existing dimension member at that time.
You might want to read the article by Ralph Kimball on early arriving facts. It explains three possible approaches in all.
http://www.rkimball.com/html/designtipsPDF/KimballDT57EarlyArriving.pdf
Regards
Milinda_V