There’s been a series of discussions on the FUNDSVCS and PRSPCT-L lists about dealing with duplicates in a database, and preventing new dupes from being created. Melissa Graves from The Village for Families and Children suggested the following routine steps:
A) Search for a matching record before creating any new record. Some databases will pop up a flag if you try to save a new record which matches to another existing record (based on whatever criteria your org considers a match)
b) Run periodic (monthly?) duplicate checks of your entire system. Most databases have a dupe check process (again, based on the criteria your org considers a match).
As Melissa mentioned, some databases include tools to identify duplicates. The better ones use a variety of criteria and allow an organization to set the sensitivity of each. They might look at the first X characters of the last name plus the first X characters of the first name plus the suffix plus the first X characters of the street name plus the first X of the house number. They’re not perfect (think of how many duplicate mass mailings you get from big charities). But they will at least identify the easy, exact matches for you. Most of these utilities will also combine the dupes for you at the push of a button. Some will let you choose which data to combine, while others combine everything (which can results in duplicate addresses, etc.).
You could also pay a mail house or service bureau to run a duplicate check or merge/purge on your list and send you the results.
There are also lots of articles and tools online for deduping mailing lists in Excel. Run a search for
duplicate excel
Kevin MacDonell from St. Francis Xavier University suggested a Microsoft tutorial called Locate duplicates by using conditional formatting
Christine Schwing from Hanover College recommend using a pivot table. She adds ID number to Row Fields, and column data, summarizes the data by Count rather than sum, and then sorts smallest to largest on the Total.
Christine also recommend using Vlookup if you have two separate lists and you want to find the people who show up on both lists. To do this, move both lists into the same workbook, but separate tabs. The formula is =vlookup(A2, array_name, 2, false) where A2 is the field on the current sheet you’re looking for, array_name is the named range from the sheet you’re looking in, 2 (or any other number) represents the number of column in the array you want to return, and false ensures you’re bring back an exact match rather than "close enough for government work." She also uses vlookup when she has one list that has most of her information, but one piece of information on another sheet that she wants to bring in to one comprehensive sheet.
Another contributor, who asked to remain anonymous, had the following recommendation:
My favorite way to de-duplicate a list is to use a COUNTIF formula. You just add a column into your spreadsheet and give it a title (I use "Dup?"). Then you enter a COUNTIF formula in the next cell down. Use =IF((COUNTIF($B$2:$B2,B2)>1),2,"") if you only want copies identified and =IF((COUNTIF(B:B,B2)>1),2,"") if you want all instances of the duplicate identified (so you can decide which ones to delete). Replace all the "B"s with whatever column you are looking for duplicates in. Then just copy the formula down the row. This will put a "2" in your "Dup?" column for any duplicate record. Then you can just go to Data:Sort, sort by Dup? column, and all the duplicates will be together. You can delete them all (if you identified only the copies) or just decide which ones to delete. Another reason I like it is because you can replace the column ("B") with any other column if you decide you want to search a different column.
Here’s an example:
Dup? | ID# | Last Nam | First Name | Address | |
2468 | Smith | Dave | 1234 My Way | ||
1234 | Jones | Jenny | 44 E J St | ||
5555 | Garcia | Marcy | 789 2nd Ave | ||
9876 | Davis | John | 321 John Rd | ||
7777 | Wong | Frank | 1111 First St | ||
1357 | Lewis | Lee | 44 E J St | ||
2 | 5555 | Garcia | Marcy | 789 2nd Ave | |
2 | 7777 | Wong | Frank | 1111 First St |
Right now, the formula has identified only copies of the same ID#. I can sort by the Dup? to get them all together and then just delete them. If I want to double check by address, I can use the formula that identifies all duplicates and replace column B with column E:
With any of these approaches, you will need to review the results carefully. You will get false positives and false negatives.
If you want to be thorough, your process will also include a laborious review of lists sorted by last name, zip code, state, area code + prefix, area code, email, etc.
Leave a Reply