<Updated 2013-05-11>
Recently Derek, a fellow Premier Field Engineer (PFE), had a customer ask how to remove duplicates from a SharePoint 2010 list. I have written PowerShell scripts that remove list items, but never ones that check for duplicates. My main concern was performance, since looping through a list searching for duplicates can be an expensive operation.
Problem
As it turns out, the customer had a list with more than 85,000 items, many of which were duplicates. The original solution the customer attempted took 5 hours to run and threw an out-of-memory exception most times it was run. My assumption that the list queries and looping were resource intensive was correct.
Solution
I first wrote my own looping structure to iterate over the items and remove duplicates while being as efficient with memory and CPU usage as possible. After about 10 minutes of scripting I wasn't making good progress, so I decided to change my approach.
A quick Bing search pulled up the following article on checking for duplicate items in a SharePoint list. The key piece of advice from that article was not to loop through the items one at a time, but instead to convert the list into a DataTable using the SPListItemCollection.GetDataTable() method and then group the items on the column being compared for duplicates; in this case the Title column was used for grouping. With a little trickery I was then able to find the IDs of the duplicate items, which I could use to delete each duplicate individually.
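The grouping trick can be illustrated without SharePoint at all. The sketch below uses plain objects as stand-ins for the DataTable rows: group on Title, and every row after the first in each group is a duplicate whose ID can be collected for deletion.

```powershell
# Stand-in rows; in the real script these come from $list.Items.GetDataTable()
$rows = @(
    [pscustomobject]@{ ID = 1; Title = "a" },
    [pscustomobject]@{ ID = 2; Title = "b" },
    [pscustomobject]@{ ID = 3; Title = "a" },
    [pscustomobject]@{ ID = 4; Title = "a" }
)

# Group on the comparison column; any group with more than one row holds duplicates.
# Select-Object -Skip 1 keeps the first row in each group and flags the rest.
$duplicateIds = $rows | Group-Object Title |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object -Skip 1 } |
    Select-Object -ExpandProperty ID

# $duplicateIds now holds 3 and 4; in the real script each of these IDs
# is passed to $list.GetItemById($id).Delete()
```

In the full script below the same pipeline runs against the DataTable produced from the list, so the grouping work happens in memory rather than as repeated list queries.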
Two added bonuses are included in the script below. The first is that the script reports the progress of the deletion process using Write-Progress. The second is that the commented portion at the bottom will generate list items with random single-letter titles so you can test the duplicate deletion.
<Update 2013-05-11>
Thank you to reader Santosh for pointing out an error in the script I published. I had mistakenly published a version that included an invalid method reference. On line 10, in the foreach block (%), the call to $list.DeleteByItemID() will fail because that method doesn't exist on that object. I've updated the downloadable script and the sample below to the call I originally used, $list.GetItemById($_.ID).Delete(). Santosh also got it to work by calling $list.Items.DeleteItemById(). I prefer not to call the Items member on an SPList object because of the performance impact, which you can read about in my previous post.
</Update 2013-05-11>
001  Add-PSSnapin microsoft.sharepoint.powershell
002  $web = Get-SPWeb -Identity "<URL of Site>"
003  $list = $web.Lists["DuplicatesList"]
004  $AllDuplicates = $list.Items.GetDataTable() | Group-Object title | where {$_.count -gt 1}
005  $count = 1
006  $max = $AllDuplicates.Count
007  foreach($duplicate in $AllDuplicates)
008  {
009      $duplicate.group | Select-Object -Skip 1 |
010          % {$list.GetItemById($_.ID).Delete()}
011      Write-Progress -PercentComplete ($count / $max * 100) -Activity "$count duplicates removed" -Status "In Progress"
012      $count++
013  }
014  <# # generate 100,000 items in a list using a random single character as the title
015  foreach($x in 1..100000) { $item = $list.AddItem(); $item["Title"] = [char](97 + (Get-Random -Maximum 25)); $item.Update() } #>
Be sure to modify the site URL and list name to work in your environment. In my example the list is named DuplicatesList and must already exist.
Conclusion
With my current solution I was able to remove 75,000 duplicates from a list with a single column (Title) in roughly 45 minutes. Derek made some tweaks and was able to remove 45,000 duplicates in about 3.5 hours from a list with dozens of columns, so it appears that a list with additional columns takes longer to process. If I had additional time I would investigate the SPList.GetDataTable() method, which allows passing in a query to retrieve only specific columns (i.e. the column used to determine duplicates). I believe this might speed up the process, but I don't know for sure.
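If you want to experiment with that column-restriction idea, one way (a sketch I have not tested at scale) is to build an SPQuery whose ViewFields limits the retrieved columns before converting the results to a DataTable. This sketch assumes the same $list variable as the main script:

```powershell
# Hypothetical sketch: retrieve only the ID and Title columns before
# converting to a DataTable, to reduce the data pulled per item.
$query = New-Object Microsoft.SharePoint.SPQuery
$query.ViewFields = "<FieldRef Name='ID' /><FieldRef Name='Title' />"
$query.ViewFieldsOnly = $true

# GetItems($query) honors ViewFields, so the resulting DataTable carries
# only the columns needed to find duplicates.
$AllDuplicates = $list.GetItems($query).GetDataTable() |
    Group-Object Title | Where-Object { $_.Count -gt 1 }
```

The rest of the deletion loop would be unchanged; only the amount of data materialized per item should shrink.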
As it stands I was very happy with the solution I came up with, even if it did take a long time to run. The important part was that it didn't throw out-of-memory exceptions while running and performed better than the customer's original solution. Thanks to Derek for collaborating on this item with me. It really is great to work with other PFEs who are passionate about helping customers solve real problems in their environments. I hope this script and/or post helps someone else in their environment.
-Frog Out
Links
Check for duplicate items in SharePoint lists with PowerShell
SPListItemCollection.GetDataTable() method
http://msdn.microsoft.com/en-us/library/microsoft.sharepoint.splistitemcollection.getdatatable.aspx
SPList.GetDataTable() method
http://msdn.microsoft.com/en-us/library/microsoft.sharepoint.splist.getdatatable(v=office.14).aspx