Avoid duplicate data

Hi,

I want to insert data in a database. How is it possible to avoid inserting duplicate data If I want to insert a row which also is in my database, I get an Exception and my program stops running. How can I tell my programm to go on without stoppng even if there was an Exception
Or has anybody a better and faster solution !

Thank you!


Answer this question

Avoid duplicate data

  • antonio97b

    I didn't realize you had so many exceptions ... you are correct, it would be far better NOT to catch each exception, since the volume of exceptions (and unique key constraint violations) will slow down the INSERT operations considerably.

    How is your data stored before going into the database   Are you loading from a flat file or is the data in memory in some sort of array   Either way, here are two options:

    1.  Pre-sort (appropriate for all in memory collections and smaller files, e.g. < 50MB)
    - Load your data into an in-memory data structure (e.g. an Array)
    - Sort on the primary key columns using the .Sort method
    - Loop through each record
      - If current row's primary key = previous row's primary key skip
      - Otherwise, INSERT

    2.  SELECT before INSERT
    - SELECT primary key columns FROM table to INSERT (if you only select the primary key columns, this should be VERY fast since it's not even hitting the table, just the index)
    - IF SELECT returns 0 rows, then INSERT

    Hope this helps, as before, let me know if you need sample code (some of yours would help me provide direction as well),
    Josh Lindenmuth

  • JayKay

    I've tried your idea... but I don't know how to ignore the exceptions...
    If you have an example - please let me know :)

  • sandPR

    1.  For each table create a stored to do the insert/update. This will provide a gain in that the sql will be compiled. 

    given Table1(key int, data1 varchar(50), data2 datetime) then:

          Create proc Process1(@key int, @data1 varchar(50), @data2 DateTime, @DoUpdate bit) 
          as 
             if @DoUpdate = 1  
                Update Table1 set data1 = @data1, data2 = @data2 
                      where key=@key      
             else
                Insert into Table1(key, data1, data2) 
                   values(@key, @data1, @data2)

    2. Before you process a table, drop all non-integrity based indexes.
       Gain comes from not having to recalculate the index on data change.

    3. For each table create a SQLCommand, set the CommandText to the stored proc and set the commandtype to storedproc.  For each line of input data, set the parameters first setting @DoUpdate = true, executenonquery, if the return is 0, set parameter @DoUpdate = false and execute again.
          Gain comes from not having to recreate a SQLCommand for every line

    4. recreate you indexes that were dropped in step 2.

    Depending on your input data, I bet you see a 75% increase in performance, I.E. 5000 rows in 15 seconds

    let me know how it goes!

  • MARL

     Josh Lindenmuth wrote:

    2.  SELECT before INSERT
    - SELECT primary key columns FROM table to INSERT (if you only select the primary key columns, this should be VERY fast since it's not even hitting the table, just the index)
    - IF SELECT returns 0 rows, then INSERT


    Hi Josh,
    this (<- see quoted area) is what I did before ignoring the exceptions. I do a Select and if my result is 0, I insert. This takes a long time and so I searched for other methods.

    To give you a short overview, here are some informations:
    - First I read some (= 10 to 20) .txt files where my data is store.
    - Then I want to insert the data in a database.
    - Because the data in the textfiles CAN have duplicates, I need to know if I have inserted them before because of my primary key logic in the database. I have no autoincrement key (no int or something else), my key belongs to three columns (type date,time and varchar).
    - Every .txt file contains nearly 500 rows of data...so I have (10 to 20 * 500 rows) in my database.

    ... ok, I think this is all of information... I will try your idea with an ArrayList and sort it and will see how long it takes.

    If you have other good ideas, I would be happy about to know them. Smile
    Thanks for your help, Josh Big Smile

    ...and sorry for my English...Tongue Tied

  • jwdenny

    With only 10,000 rows of data, an array list may be your best method ... you can sort an array list with 100,000 rows in a couple seconds typically, so 10,000 rows should be almost instantaneous.

    I didn't realize your "slow" process was with SELECTs instead of catching exceptions.  Here's how you'd write a function that would ignore exceptions when trying to INSERT.  You'd call this function for each row in your database.

    public void InsertNoFail(sql as String)
    {
       try
       {   
          //INSERT logic
       }
       catch(Exception e)
       {
          //do nothing (essentially ignores the exception)
       }
    }

    The above could be made faster by using parameters instead of sending 1000's of INSERT statements to the database.  Also, you'd probably want to replace catch(Exception) with the specific error your database is throwing when there's a duplicate key.

    Hope this helps,
    Josh

  • Vaibhav_Patel

    Your best bet for ignoring those exceptions and continueing is to create a function that wraps your insert in a try...catch block.  You'll need to catch the specific duplicate key exception that is thrown, and just ignore the exception.  The function would be called for each row you'd be inserting.

    Let me know if you need an example,
    Josh Lindenmuth



  • Alex Jimenez

    One information:
    I was now able to insert data and ignore the exceptions, but it it so slow... when I insert 5000 rows it takes nearly one minute... how can I do this faster ! I think it takes many ressources by ignoring the mass of exceptions I get... is there a faster way to make many inserts !

    Thanks for replies!

  • Avoid duplicate data