GUID

Hello,

I'm working on a smart client app that has an offline sql express store and needs to work with several types of central databases (support for multiple products - ms sql, DB2 etc)

While trying to put together some offline functionality that needs the user to create records on the offline sql express data store, we've run into the need of being able to uniquely identify records so replicating the data in the offline store back into the primary database should not be a problem.

The data created offline spans many tables and involves several tables with relation ships - FK etc...Clearly not a simple case of store and forward.

We dont want to get into the mess of performing key replacement during a synch job with the server. Thats way too much trouble.

GUID seems like a good choice, but as always we have several stake holders having different opinions. And with databases other than MS SQL we will have to store them as strings.

To cut to the chase - can we not hash a GUID to get an integer while retaining atleast the same likelyhood of producing unique ids [no drop]

Thanks,

Aviansh



Answer this question

GUID

  • MobileDev

    Hello Uma Jayachandran,

    I think I understand what you are saying in terms of the ramifications GUIDs have on indexes. Assuming, we are not going the GUID route can you please point me to techniques that will help generate unique integer ids in a disconnected mode

    Here is some home work that I've done on this front - Our app has a unique user id, an integer that is only generated on the server and so at all times will be unique . Can we some how use this to generate other integer (disconnected mode) ids

    I dont want to set up a composite key involving the user id, this will mean quite a bit of reengineering of the existing bits of the app and would like to avoid this as far as possible.

    I also understad the id ranges suggestion that you have made - I cant use an explict set up kind of mechanism as the number of users are fairly large. In a worst case scenario (which I will need to factor in) we may have as many as 2000+ users. Setting up ranges for them may be too much trouble. Also managing size of each range.

    is there some other math algo that I can use

    Here is something I ran into and am wondering if I can do something like this - http://www.c-sharpcorner.com/UploadFile/sgersh01/GenerateintegerIDs12032005000238AM/GenerateintegerIDs.aspx ArticleID=40b9fc09-d5a0-4084-ba31-86a0de3997b8

    Another interesting post on the web suggests generating an MD5 checksum.

    What I'm hoping to uncover is - Can we not use GUIDs to some how generate unique integers. I sure hope that is some how possibe to do. Or atleast use the user id to generate unique integer ids.

    Thanks,

    Avinash


  • RON 7659

    I was under the impression that I want to cluster data that goes together. In my example, I want all the Smith's together and all the Jones together. Or, for that matter all the S% 's  or J% 's. That is, if thats the way I grab them. When their key is used in a child table, that would be clustered along with the constituent child lookups.

    For example, in specfied order the following are entered into the database:

    Presidents:
    [(4, Madison, James), (3, Jefferson, Thomas), (1, Washington, George), (6, Adams, John Quincy), (5, Monroe, James),  (2, Adams, John)]

    EventType:
    [(intraterm), (postterm), (preterm) 

    And then I want to track the following entries, entered in this order:

    • John Quincy Adams Served as Armistad Advocate post term
    • James Madison Wrote the Federalist Papers preterm
    • George Washington Crossed the Delaware River preterm
    • Thomas Jefferson initiated the Louisiana Purchase intraterm
    • John Quincy Adams Authored the Monroe Doctrine preterm
    • John Quincy Adams Served In House of Representatives post term
    • James Monroe Authored the Monroe Doctrine intraterm
    • John Adams signed the Alien and Sedition Act intraterm
    • James Madison witnessed the burning of the White House intraterm
    • Thomas Jefferson wrote the Declaration of Independence preterm

    Using default key clustering gives this physical layout ( "*" represents generated Identity key index, (field) represents inclusion in non-clustered index):
    Presidents
    Id, (LName)index1,   (FName)index1
    1    Washington    George
    2    Adams         John
    3    Jefferson     Thomas
    4    Madison       James
    5    Monroe        James
    6    Adams         John Quincy

    EventType
    Id*  (Description)index1 
    1    intraterm
    2    postterm
    3    preterm

    Event
    Id* (PresidentId)index1, (Description)index2,                        (EventTypeId)index1
    1   6                 Served as Armistad Advocate                2
    2   4                 Wrote the Federalist Papers                3
    3   1                 Crossed the Delaware River                 3
    4   3                 Initiated the Louisiana Purchase           1
    5   6                 Authored the Monroe Doctrine               3
    6   6                 Served In House of Representatives         2
    7   5                 Authored the Monroe Doctrine               1
    8   2                 Signed the Alien and Sedition Act          1
    9   5                 Witnessed the burning of the White House   1
    10  3                 Wrote the Declaration of Independence      3

    But, my users will select by President Name, or Event, Or a combination of both and want it presented in order of occurence. Shouldnt the best performance be one that takes this into account and wouldnt that make less I/O demands

    Clustering with this in mind would  yield the following layout ( [field] represents inclusion in clustered index):

    Presidents
    Id,  [LName]index1,  [FName]index1
    2    Adams         John
    6    Adams         John Quincy
    3    Jefferson     Thomas
    4    Madison       James
    5    Monroe        James
    1    Washington    George

    EventType
    Id*, (Description)index1 
    1    intraterm
    2    postterm
    3    preterm

    Event
    Id*   [PresidentId]index1, (Description)index2,                         [EventTypeId]index1
    3     1                  Crossed the Delaware River                 3
    8     2                  Signed the Alien and Sedition Act          1
    10    3                  Wrote the Declaration of Independence      3
    4     3                  Initiated the Louisiana Purchase           1
    2     4                  Wrote the Federalist Papers                3
    9     4                  Witnessed the burning of the White House   1
    7     5                  Authored the Monroe Doctrine               1
    5     6                  Authored the Monroe Doctrine               3
    1     6                  Served as Armistad Advocate                2
    6     6                  Served In House of Representatives         2
      


  • p.partovi

    It is not clear what the disconnected mode entails for your application. Here are some of the questions:
    1. Who owns the data Is there a central server that will maintain the data
    2. What happens to logical conflicts For example, if 2 users enter the same product and you generate two different GUIDs for example how will you resolve the conflict since they are one and the same. You do not want to end up in a situation where your data has lots of duplicates and you have to clean it periodically in the central copy
    3. Does the user own the data Maybe you can partition by user.
    4. Since the server generates the unique id, you can generate local ids it seems like and then resolve the conflicts when you insert/update the data on the server
    In any case, it seems like there are technologies that has built-in support for dealing with this sort of topology (merge replication and peer-to-peer replication in SQL Server 2005 for example). You can generate ids in your local disconnected store and since there is no multi-user conflict I don't see the issue here in using some integer sequence.
    I suggest that you take a look at some of the technologies I mentioned, look at some distributed solutions using partitioning and then decide.


  • mickers

    Please see my comment above about index key(s) significance. Dealing with performance issues later can become a very tough proposition. It is very easy to forget or not evaluate data growth for example in your application requirements. But then suddenly you find that you don't scale to certain data sets or for some customer implementations. You don't have to go to 100s of GBs to see performance problems with GUIDs. The ceiling is much lower depending on which configuration you are running. There are also other issues that I didn't talk about. For example, business intelligence is becoming common place these days and you cannot use GUIDs as measures in most OLAP implementations. And you should consider the performance limits upfront because certain data models will be too costly to make changes or rewrite or scale poorly.
    I don't mean to suggest that GUIDs doesn't belong in a data model. I just want to comment that you should consider it carefully and be wary of it's implications. Merge replication for example uses guid in it's tables and it is a good usage case. But it is ironic that if you only use sequential guids in SQL Server 2005 you can get considerable performance gains. So there are things you could not do with merge replication in SQL Server 2000 that is now possible because the engine now provides a method to generate sequential guids (albeit with other disadvantages).
    For this particular problem, since the user wants to support other databases easily use of GUID can be problematic. For example, DB2 doesn't provide native GUID support. You will have to create your own user-defined type, implement domain constraints/validations etc on the DB2 end of tables. Oracle also doesn't have built-in data type for GUID. You will have to use a separate function "SYS_GUID" to generate GUIDs (this is the only thing provided) but you can't use it directly in DML statements. You need additional layers of code using packages etc to encapsulate lot of the required functionality and use it directly in tables for examples.


  • SAMI_12

    Clustering on any key column(s) that are large has performance implications. Note that the clustering key is part of every non-clustered index you create on the table. So it doesn't matter if you are clustering on (LName, Fname) or (Guid). Although in this case it is probably worse to cluster on the name columns since that will be like average of 60 bytes in length or so and you will also waste additional 4 bytes for the uniquefier. If you are using such keys for data warehouse scenario then this becomes even more critical. The amount of IO that you can reduce by having smaller index keys is important for everything - SELECT performance, DML, index maintenance, load, defrag, storage etc. What you would do is to cluster on a candidate key generated on the database much more efficiently like identity (int or bigint or decimal) and use nonclustered unique constraints.
    There are certainly some convenience factors associated with using GUID but you should consider other alternatives also. I have seen implementations where GUID was used without considering performance upfront and changing the data model later was much harder to do but when done the performance improved 10 to 100 times or more. For this particular problem, using GUID doesn't help in logical data conflict resolutions. You are just going to push the problem to a different area. And you have to merge conflicts at some point anyway but why not use some technology that can do it efficiently and provides the necessary tools. Even without replication, you can partition your data or use different ranges of int/bigint to avoid conflicts.


  • bexter

    guid is a great choice.

    no a hash won't work. isnt a hash a string

    your hash wont be much smaller that stringed guid.

    I am sure that generating a guid is quicker than running a hashing algorithm.

    what arguments are you getting against a guid



  • JamesWalshe

    Using guid has lot of performance implications. It is one of the worst data type that you can use for a primary key especially when it is clustered or as a clustered index. It will bloat the rest of the indexes. It is hard to manipulate. It can cause fragmentation (SQL Server 2005 has mechanism to generate sequential guids) and increase maintainenance costs on the tables.
    I think you should consider a replication solution rather than manipulating your data model. You can create a central publisher and SQL Express subscribers for the data. The central publisher can push/replicate the data to the other data sources. This provides a scaleable architecture and SQL Server has support for replicating to Oracle/DB2.


  • scsaba

    Hi...

    I think GUIDS are a good thing depending on the situation where you want to use them. Most DBAs dont really like them, but i think the reason for this is its hard to remember a Guid or a pain to type one. A GUID itself is only a few byte larger then an integere and if its used in a table it should usually not cause any problems. The one thing that was already mentioned is the picking it as a clustered Index is usually a bad choiche, since there is really nothing in a guid that can be clustered...

    The main problem here is the requirement for "other DBs" which would mean that your software needs to be able to convert the guid into a datatype that can be understood by the other Systems. The GUID hold 16 Bytes of Information... Now the "easy" way to encode this would be a string of the hex representation, or you could just split the bytes into a few integers and force the DB to create multiple fields which holds this. Or create a small textfield with a fixed length of 24 Characters and store the () encoded String in it This will allow you to save a "few bytes (8)" from the usual hexadecimal text representation and will avoid 0x00 Bytes in the textstring.

    But the question is how many rows do you need to store in your Application Will it be more hassle to provide another aproach GUID are ideal if you have Data that needs to be inserted in a DB "later" since you already can get a valid key and dont need to change an index field later. I would go with GUIDS... Since logical partitioning of indexes like someone else suggested can be a greater pain... Especially once you handed out all possible ranges to clients of your system, and you then need to add another client and you have 0 keys for him "left"



  • anjumahesh

    GUIDs seem like a good choice.

    Unless performance is SERIOUSLY strained on any of the servers/platforms in question, I don't think there should be much debate about it. Yes, some people think that GUIDs are the fruit of the devil, but performance tuning before the fact is the root of all evil - as the saying goes.

    I'd be a bit leery about hashing the GUIDs, unless you were pretty confident of the hashing algorithm. Likewise, storing GUIDS as text takes up a pretty decent amound of space (see:
    http://blog.angrypets.com/2003/12/warning_guids_a.html for more info)
    ... but that too isn't soooo much space that I'd be worried about it's performance impact. The only real place you'll see significant GUID 'rot' is when you're using them in joins - and that really shouldn't decrease performance too much - it just serves as a wider index than an int (i.e. 16bytes instead of 4)...

    In other words, I'd go with the GUID if it were me - and then deal with any performance issues IF and when they arose (though I doubt you'll encounter anything that couldn't be offset by some decent tuning elsewhere - unless we're talking about tables measurable in hundreds of GBs...)

    --Mike

  • Ben Fidge

    Umachandar Jayachandran - MS wrote:
    Using guid has lot of performance implications. It is one of the worst data type that you can use for a primary key especially when it is clustered or as a clustered index. It will bloat the rest of the indexes. It is hard to manipulate. It can cause fragmentation (SQL Server 2005 has mechanism to generate sequential guids) and increase maintainenance costs on the tables.
    Typically, you would cluster on indexes that are used as your primary logical selections.
    for example, if I had a relation People(ID, FName, LName)
    I would make my ID a non clustered primary key and cluster a non unique index of (LName,FName)
    as my typical selection criteria would be by Last Name and a possible First Name


  • mdenkmaier

    How about having the PK GUID hashed into an int field, and these 2 fields form the clustered index. Joins between tables using the GUID and Hash would be dramatically improved as SQL would use the hash first and only revert to joining on GUID if there were identical hashes of different GUIDs.


  • IanMc

    Many thanks for your response. I've tried to put together my thoughts on some of your questions.

    I was under the impression that SQL express will not be able to perform merge replication with non MS Databases fx: DB2 on AS 400 [Ref: http://forums.microsoft.com/MSDN/ShowPost.aspx PostID=314274&SiteID=1]

    >>it is not clear what the disconnected mode entails for your >>application. Here are some of the questions:

    >>1. Who owns the data Is there a central server that will >>maintain the data

    Avinash: Yes. A central db server (ms sql 2000/20005/DB2 Win/DB2 AS400) exists and holds info from all online users and also all ocassionally connected users. (consolidated data)


    >>2. What happens to logical conflicts For example, if 2 users >>enter the same product and you generate two different >>GUIDs for example how will you resolve the conflict since they >>are one and the same. You do not want to end up in a >>situation where your data has lots of duplicates and you have >> clean it periodically in the central copy

    Avinash: we've adopted the RDA philosophy. Its is assumed that there will be no conflicts. This is acheived by partitioning the data (by geography, product categorization etc). So, by design there should not be conflicts b/w records that originate from two users. If however there are updates then the rule that we have agreed upon is "the last update wins". What if we have duplicate products uniqueness of a product from a database perspective comes from the unique primary key. However, we were thinking that during the synch operation - some additional business checks may be performed and the users may be notified via a "System Messages" feature that we have, that there there are two products with the same name and that the user needs to resolve the conflict.

    >>3. Does the user own the data Maybe you can partition by >>user.

    Avinash: Yes. We are partitioning the data. If the system is setup correctly no two users must be working on the same set of data .

    >>4. Since the server generates the unique id, you can >>generate local ids it seems like and then resolve the conflicts >>when you insert/update the data on the server

    Avinash: I was thinking that the tables/data base objects on the server and the client would be identical. i.e. we only maintain one version of the database {scripts etc} and use the same on both the server and the client (offline database) . We are currently using identity columns on the table and that works just fine in the central server and always connected users. In case of the offline users - if we leave the identity to be present surely there will be conflicts in the ids being generated online and offline. And like I mentioned I would like to avoid the situation of having one setof ids on the client and then replacing them during or post a synch operation. Since there are several tables with FKs this would be quite messy.

    This is exactly where I was hoping to be able to generate unique ids and therefore get away from any possible issues. i.e. both the server and client would use some kind of a integer id generation algo in a sp/dynamic sql etc. Essentialy exactly what a GUID would do for us - but an integer instead. And I was hoping that the user's id which is originally generated on the server and is always unique - could some how be used to generate the unique integer ids. A user is either an ocassionally connected user or an always connected user and never both together. The way I'm thinking about it - an ocassionally connected user alwyas works with his local data store and a synch task is expected to synch the data b/w his local store and the central store when a netwrk is available. Whereas an always connected user is always connected to the central store.

    Thoughts


    >>In any case, it seems like there are technologies that has >>built-in support for dealing with this sort of topology (merge >>replication and peer-to-peer replication in SQL Server 2005 >>for example). You can generate ids in your local >>disconnected store and since there is no multi-user conflict I >>don't see the issue here in using some integer sequence.

    >>I suggest that you take a look at some of the technologies I >>mentioned, look at some distributed solutions using >>partitioning and then decide.

    Avinash: I would prefer to fall back on a well tested, out of the box solution. But the limited research I have done points to not being able to use any just now.


  • GUID