I have encountered several scenarios where it has been necessary to clean up duplicate rows in a table. When there is a primary key or some sort of unique identifier this is an easy task. However, when they are truly duplicates and there is no way to distinguish between the rows it becomes a little trickier.
Because of this. I have often seen people take the approach of:
- Select distinct values from Table A into temporary table B
- Delete from Table A where row exists in table B
- Insert values from table B back into table A
While this works, I prefer fewer moving parts. Below are my preferred methods for cleaning up duplicates in both SQL Server and Oracle.
In SQL Server, my weapon of choice in this situation is the ROW_NUMBER window function. Let’s start by creating a table with duplicate rows.
CREATE TABLE employees
INSERT INTO employees
--see that we have multiple duplicates with no way (primary key) to distinguish one from another
SELECT * FROM employees
The ROW_NUMBER function will allow us to create that unique identifier we were looking for to distinguish between the duplicates. A simple select statement using the ROW_NUMBER function:
--This will assign a row number to each duplicate
SELECT ROW_NUMBER() OVER (PARTITION BY fname, lname, title ORDER BY fname) rownum
Looking at the results we can see that deleting everything with a rownum greater than 1 would leave us one unique row each. We will do that next using a derived table and this same window function.
--This will delete anything with a row number higher than 1 leaving only one copy of the row behind
DELETE dups FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY fname, lname, title ORDER BY fname) rownum
FROM employees) dups
WHERE dups.rownum > 1
--table now only contains unique rows
SELECT * FROM employees
And with that we have a duplicate free table.
In Oracle, we could simply use ROWID Pseudocolumn the to clean up the same data. If we were to select all columns from the employees table and the ROWID Pseudocolumn, we would see that we actually do have a unique identifier to work with.
SELECT employees.*, ROWID
We would then pick out the min ROWID for each unique row in the table grouping by all columns and delete everything else.
DELETE FROM employees
WHERE rowid NOT IN
GROUP BY fname, lname, title)
And once again we have a nice clean table.
I thought I’d start this blog off by sharing something I have encountered more than once in the past few months. Most people believe that if they run their select queries in read uncommitted/nolock that there is no way they can block other processes. This belief is mostly true, but it actually still possible to unintentionally block other processes when running a select with nolock.
So how can a read uncommitted/nolock select query block other processes? To answer that, I will first explain two types of locks in SQL Server. Schema Stability (Sch-S) and Schema Modification (Sch-M) locks are designed to maintain the integrity of our objects while they are being accessed by DML or DDL queries. When running any select query in read uncommitted/nolock, a Schema Stability lock is acquired to prevent any other process from altering the underlying object while the query is executing. These locks are compatible with any other lock except a Schema Modification lock and will never block a DML query. Schema Modification locks on the other hand are acquired any time DDL is run against a table and are incompatible with every other lock type. SQL Server doesn’t want anyone modifying the table while it is being selected from (thus the Sch-S lock), nor does it want anyone accessing the data while the table structure is being modified (where the Sch-M lock comes into play). As a Schema Modification lock cannot be acquired while a Schema Stability lock exists, the Schema Modification lock would have to wait for existing Schema Stability locks on that table to clear.
Say for example you need to kick off a report that will run overnight. You know there shouldn’t be users in the system, but you run it in the read uncommitted isolation level just to be safe. Let’s then say that you have a maintenance job that is supposed to rebuild indexes on the same night. If a long query/report is running at the time your index rebuild job kicks off, you can encounter a scenario where that long running query blocks other processes. When rebuilding an index (even online), a Schema Modification lock is required. Since a Sch-M lock is incompatible with any other lock it must wait for that Sch-S lock to be released. If, at the same time, another process is attempting to update/insert/delete to the table(s) in question, they will be unable to complete as they will now be queued up behind the Sch-M lock.
Let’s step through an example. First, let’s create a table big enough to cause trouble and add a nonclustered index on that table.
CREATE TABLE schemamodlock
INSERT INTO schemamodlock
SELECT TOP 3000000 CAST(NEWID() AS CHAR(37))
FROM sys.columns A
CROSS JOIN sys.columns B
CREATE NONCLUSTERED INDEX IX_1
ON schemamodlock (col1)
Why a GUID CAST as a CHAR you ask? I have no good answer, but since we are being destructive here I decided we should let them join the party. The next part is tricky as you have to time it just right. As I stated previously, even online rebuilds need a Sch-M lock. For online rebuilds, this is very short lived and happens at the end of the rebuild process. Because the Sch-M lock is our problem lock, we want to start our long running query against the table just as the rebuild is about to finish. To prepare, I first rebuilt the index online and noted the time (let’s say 15 seconds).
Now we’re ready to do some blocking. We will first kick off our online rebuild (SPID 66 for me).
ALTER INDEX IX_1 ON schemamodlock REBUILD WITH (ONLINE = ON)
Since my previous rebuild took 15 seconds, about 10 seconds into the rebuild we will start the next step. In another query tab (SPID 70 for me) we will start our “reporting” query that will require a Sch-S lock.
FROM schemamodlock (NOLOCK)
Now let’s say our application needs to update a row in the schemamodlock table. The application (SPID 68) issues an update statement.
SET col1 = 'A'
WHERE col1 = 'B'
If you timed it up right, the update query is now blocked and sitting behind the Sch-S and Sch-M locks we created. If we run sp_who2 we can see that.
And that is how a NOLOCK/Read Uncommitted select can ruin your day. Of course, this can be described as a “perfect storm” situation and most likely the rebuilds wouldn’t be blocked excessively long. Regardless, always be sure to keep your maintenance plans in mind when doing reporting, even when using nolock.