Merge Records Help

The options in this section help you merge records after merging two files.

Find and Merge Duplicates

Merging many records by hand can be a time-consuming task, but it is very difficult for a computer to take over and automatically merge records for you. Unlike humans, computers are not good at recognizing patterns in vague data and therefore cannot reliably know which records should be merged. A computer can still provide a lot of assistance, however. This section describes a script called "Find and Merge Duplicates" that makes it much easier to merge many records after merging files. This script (which is written in Python and therefore requires MacOS 10.5 or newer) searches your file for potential matches and lets you decide whether or not to merge them.

Because you provide input for each merge, this process is not designed for merging two identical files with a few minor differences (see the on-line tutorial for ways to solve that problem). It works well, however, if you have a large file and then merge in a new file that might have a few hundred (or even more) duplicate records. This script should help you accurately clean up a file after such a merge in a reasonable time frame.

Running the "Find and Merge Duplicates" Script

To search for and optionally merge all duplicate records in a file, you run the "Find and Merge Duplicates" script and proceed as follows:

  1. First pick the type of record to search. Each time you run the script, you look for merges in one type of record. It is best to proceed in the recommended order or, if you return to merging after a break, to continue where you left off. To complete merging of a single file, you will need to run this script several times, once for each class of records you need to merge.
  2. Next pick the starting record. Normally you should select the first record, but if you have already merged some, you can select the record where you left off before.
  3. The script will then search through all records of the selected type. The time will depend on the type of records. It is slowest for individuals (because more data is checked) and notes (if the file has lots of notes). Whenever it finds a match, a window will appear with a merge quality score and with descriptions of the content of each record. Based on the score and the descriptions, you can make one of the following choices:
    1. Merge - merge the two records. If the names differ, you will be asked which name you prefer for the merged record. All merged records will be listed in the final report.
    2. Don't Merge - do not merge these records, which is safest to pick if you cannot decide from the details if they should be merged. The record pairs that are not merged will be listed in the final report.
    3. Cancel - exit the current merging session and display a report of the merging process to the current point.
  4. Final Report - when merging is done for one class of records, you will get a report that lists all records that were merged and all pairs of records that were not merged.
    • For merged records, you should open each record to see if the data needs more clean up. Merging will eliminate duplicate data, but some small mismatches (such as spelling of birth place) might result in extra data you do not need. You can edit to keep only the preferred data.
    • For pairs that were not merged, you can open them up now and see if you can determine whether or not they should be merged. If they should be merged, you can use the manual "Merge Two (type) Records..." command in the "Tree" menu for that single pair.
  5. The above process may not find all duplicates. It may be necessary to find the remaining duplicates by hand.

Merge Quality Score

Each pair of records for a potential merge is assigned a score from 0 to 100. The score is based on both the amount of matching data and the accuracy of the matches. For example, two individuals with the same name, birth and death date, father and mother will have a higher score than two individuals who have only the same name and birth date. The score will also reflect accuracy. Two individuals with the exact same name or birth date will have a higher score than two individuals with just similar names or with slightly different birth dates. Date comparisons further reflect the precision of the dates. For example, two exact dates (e.g., 4 JUL 1776) will match better than two dates that specify only a year (e.g., 1776).
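The effect of date precision on a match can be illustrated with a minimal Python sketch. This is not the script's actual scoring code: the function names, the 0 to 1 scale, the individual weights, and the two-year tolerance are all assumptions chosen only to show the idea that fuller dates earn a better match than year-only dates.

```python
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def parse_date(text):
    """Parse 'D MON YYYY', 'MON YYYY', or 'YYYY' into (year, month, day).

    Missing parts are returned as None. Only these three simple
    forms are handled in this sketch.
    """
    parts = text.upper().split()
    year = int(parts[-1])
    month = MONTHS[parts[-2]] if len(parts) >= 2 else None
    day = int(parts[0]) if len(parts) == 3 else None
    return year, month, day

def date_score(a, b, year_slack=2):
    """Return a hypothetical match score between 0.0 and 1.0."""
    ya, ma, da = parse_date(a)
    yb, mb, db = parse_date(b)
    if abs(ya - yb) > year_slack:
        return 0.0            # too far apart to count as a match
    if ya != yb:
        return 0.25           # slightly different years: weak match
    score = 0.5               # same year
    if ma is not None and ma == mb:
        score += 0.25         # month known in both and equal
        if da is not None and da == db:
            score += 0.25     # day known in both and equal
    return score
```

With these assumed weights, date_score("4 JUL 1776", "4 JUL 1776") is higher than date_score("1776", "1776"), and both are higher than date_score("1776", "1778").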

For individuals, the testing first looks at names, birth and death dates and places, and parents (their names and birth and death dates and places). Name comparisons break each name into parts (first, middle, surname) and consider exact spelling or Soundex code matches. Place comparisons look at all levels of the place hierarchy to see how many match. If there are any conflicts in these first checks, the pair is rejected for merging. If there are no conflicts, the script looks further at one spouse. Since individuals may have several marriages, a mismatch in spouses is not considered a conflict, but matches in spouse or marriage date and place will increase the score. Testing with real data shows that actual matches have an average score above 50. Scores above 50 have a high chance of being records that should be merged. Scores below 50 mean either that the two records should not be merged or that they simply do not have enough entered data to receive a high score. If the reason is the latter, the pair could still be records that should be merged.
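The name and place checks described above can be sketched in Python as follows. The soundex function implements the standard American Soundex code. The helper names, the point values (2 for exact, 1 for Soundex), and the assumption that places are comma-separated with the broadest level last are illustrative only and are not taken from the script itself.

```python
# Map consonants to their Soundex digits.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    """Return the 4-character American Soundex code, or '' for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":      # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]

def name_part_score(a, b):
    """Hypothetical scoring: 2 for exact match, 1 for Soundex match, else 0."""
    if a.lower() == b.lower():
        return 2
    if soundex(a) and soundex(a) == soundex(b):
        return 1
    return 0

def place_levels_matched(a, b):
    """Count matching hierarchy levels, starting from the broadest (last) level."""
    la = [p.strip().lower() for p in a.split(",")]
    lb = [p.strip().lower() for p in b.split(",")]
    count = 0
    for x, y in zip(reversed(la), reversed(lb)):
        if x != y:
            break
        count += 1
    return count
```

For example, "Smith" and "Smyth" share the Soundex code S530, so name_part_score gives them a Soundex match rather than an exact match, and two places that agree only on state and country match at two levels.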

For families, the two records must have an identical husband and wife. As a result, all individuals must be merged before attempting to merge families. If the husband and wife match, testing next looks at marriage date and place and correlations between children. Conflicts in marriage information will reject the match. Otherwise the score will increase with the accuracy of the match and the amount of overlap in linked children.

All other types of records do less sophisticated checking that is similar to GEDitCOM II's requirements to merge two records (although it goes a little further when comparing text by allowing near matches rather than requiring exact matches). The scores will be less accurate, but you should be able to determine if they can merge from the descriptions listed in the merge-options window.
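A near match on text, as opposed to an exact match, can be sketched with Python's standard difflib module. How the script actually compares text is not described here, so the similarity measure, the near_match name, and the 0.85 threshold are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def near_match(a, b, threshold=0.85):
    """Return True if two text fields are similar enough to propose a merge.

    The threshold is a hypothetical value; a lower value proposes
    more (and less reliable) candidate pairs.
    """
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold
```

Under this sketch, two source titles that differ only in punctuation or capitalization would be offered as a potential merge, while unrelated titles would not.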

Merging Script Speed

The merging script is fairly fast for most record types, but can slow down for merging individuals or notes, depending on the size of the file. For individuals, it goes through each record looking for matches. The name that is being checked at any time is printed to the scripting panel. Individuals are checked in groups by first letter of their surnames, which helps to speed up the script. The script caches data to maximize speed, but you will notice a pause whenever the script reaches a new surname letter (it pauses while caching data for the next surname letter). Once it starts again, it will speed up as it goes through that letter. If desired, you can cancel the merging process at any time and start again later in the same surname group. For example, if you stopped merging with "John Smith," you can start again later with the first individual whose surname begins with "S."
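The benefit of grouping by surname letter can be seen in a small Python sketch. It is only an illustration under assumed data: people are represented as hypothetical (given name, surname) tuples rather than real records, and the candidate_pairs function and its start_letter parameter are invented names.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(people, start_letter=""):
    """Yield pairs of people whose surnames share a first letter.

    people is a list of (given, surname) tuples. Groups whose letter
    sorts before start_letter are skipped, which mimics resuming a
    cancelled session at a later surname letter.
    """
    groups = defaultdict(list)
    for person in people:
        surname = person[1]
        groups[surname[:1].upper()].append(person)
    for letter in sorted(groups):
        if letter < start_letter.upper():
            continue
        # Only people within the same letter group are compared.
        for a, b in combinations(groups[letter], 2):
            yield a, b
```

With four people of whom three have surnames starting with "S", this yields three candidate pairs instead of the six that comparing everyone with everyone would produce. The saving grows quickly with file size.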

For notes, it compares each note to every other note. Usually there are fewer notes than individuals, but if there are many notes, this merging task can get slow. It will, however, speed up as it goes.

Recommended Merging Order

When you are merging many pairs of records in a single file manually or by using the "Find and Merge Duplicates" script, it is best to process them in a specific order. You want to merge records commonly cited by other records before merging the main records. The recommended order is:

  1. Repository Records
  2. Source Records
  3. Notes Records
  4. Multimedia Records
  5. Research Log Records
  6. Submitter Records
  7. Individual Records
  8. Family Records
  9. Place Records

Merge Notes Records

This option goes through all note records (starting at one you select when the script starts) and merges all pairs of note records that have exactly the same text (case-insensitive comparison with leading and trailing spaces ignored). This script is much faster than merging notes using the "Find and Merge Duplicates" option, but this one only merges identical notes while the other will look for similar notes to merge as well.
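The identical-text test described above can be sketched in Python. This is a minimal illustration rather than the script's own code: notes are represented as hypothetical (ID, text) tuples, and the find_identical_notes name is invented. Note that strip() removes all leading and trailing whitespace, not only spaces.

```python
def find_identical_notes(notes):
    """Group note IDs whose text is identical after normalization.

    notes is a list of (note_id, text) tuples. Text is compared
    without regard to case and with leading and trailing whitespace
    removed. Only groups with two or more notes are returned.
    """
    groups = {}
    for note_id, text in notes:
        key = text.strip().lower()
        groups.setdefault(key, []).append(note_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Because each note is looked up by its normalized text in a single pass, an approach like this does not need to compare every note with every other note, which is one reason an exact-match pass can be much faster than a search for similar notes.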

Merging identical notes is useful if you want to take advantage of the feature of GEDitCOM II that lets multiple records link to the same notes. When using this feature, you will need to be aware of whether you are editing a common note or a specific note on one record. In the "Default Format," the record(s) linked to the note are listed below the notes editing area. If only one record is listed, the note is a specific note; if more than one appears, the note is a common note. When editing a common note, you should verify that any changes you make apply to all records linked to that note. For example, a common note might be "A veteran of the Civil War." This note could apply to many people. If you later change that note to "He was a sergeant in the Civil War," it no longer applies to all the same people. This type of change should be done in a new note rather than by changing a common note.