Monthly Archives: April 2012

A Python GEDCOM Parser

Excited by my discovery of Mayflower ancestry (or perhaps by the apparent confirmation that my genealogy records weren’t totally made up), I decided to contact other individuals on 23andme who were predicted to share DNA fragments with me and seek out other cases of family overlap.

The task quickly became daunting! 23andme doesn’t really offer any tools for this; the method of choice appears to be listing all the surnames in one’s ancestry. The GEDCOM-format genealogy file my father has compiled is huge: I currently have 384 ancestors in the document (and 163 surnames). Other genealogy buffs have similarly deep information, and I slowly realized that manually searching for overlaps between our lists was not at all practical.

My first “quick and dirty” attempt was to grep the file for last name matches. Little did I realize there are actually 1,547 individuals in my file! People who are not my direct ancestors (cousins and their spouses and children) are listed as well. On one hand this was really cool, since more data is better… but on the other hand it meant a lot more thought was required.
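To show why a plain grep over-counts, here is a minimal sketch that tallies every individual and surname in a GEDCOM file, direct ancestor or not. It relies on two GEDCOM conventions: individual records start with a level-0 line ending in INDI, and the surname in a NAME line is wrapped in slashes. The function name surnames_in_gedcom is made up for this illustration and is not part of any library.

```python
import re

# In GEDCOM NAME values the surname is delimited by slashes: John /Stoughton/
SURNAME_RE = re.compile(r"/([^/]*)/")


def surnames_in_gedcom(lines):
    """Return (number of individuals, set of surnames) for an iterable of GEDCOM lines.

    Hypothetical helper for illustration only; it looks at every INDI record,
    so cousins and in-laws are counted along with direct ancestors.
    """
    individuals = 0
    surnames = set()
    for line in lines:
        # Each GEDCOM line is "<level> [<@xref@>] <tag> [<value>]".
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue
        if parts[0] == "0" and parts[2] == "INDI":
            individuals += 1
        elif parts[0] == "1" and parts[1] == "NAME":
            match = SURNAME_RE.search(parts[2])
            if match and match.group(1).strip():
                surnames.add(match.group(1).strip())
    return individuals, surnames
```

Passing an open file handle works, since iterating over a file yields its lines. The result covers everyone in the file, which is exactly the problem: telling ancestors apart from cousins needs the family links, not just the names.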

To cut a long story short, I ended up finding an old GPL-licensed Python GEDCOM parser (“GEDCOM Parser”). I extensively improved it (in my humble opinion) and have uploaded the code to GitHub as “python-gedcom”. The end result is a module I can use to pull out direct ancestors, search on last name matches, and return the path between me and a given ancestor.
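To give a flavor of the two tree operations, here is a minimal sketch that works on a plain dictionary mapping each person to a list of their known parents. This is not the python-gedcom API; the dictionary layout and the names direct_ancestors and path_to_ancestor are assumptions made purely for illustration.

```python
from collections import deque


def direct_ancestors(parents, person):
    """Return the set of all ancestors of `person`.

    `parents` is a hypothetical dict: person id -> list of parent ids.
    """
    seen = set()
    queue = deque(parents.get(person, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # the same ancestor can be reached through several lines
        seen.add(current)
        queue.extend(parents.get(current, ()))
    return seen


def path_to_ancestor(parents, person, ancestor):
    """Return a shortest list of ids from `person` up to `ancestor`, or None."""
    queue = deque([[person]])
    visited = {person}
    while queue:
        path = queue.popleft()
        if path[-1] == ancestor:
            return path
        for parent in parents.get(path[-1], ()):
            if parent not in visited:
                visited.add(parent)
                queue.append(path + [parent])
    return None
```

A breadth-first search is used so that, when an ancestor appears in the tree more than once, the path returned is one with the fewest generations.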

Applying this to a new 23andme match (who also had 160 surnames!), I found 27 potential surname matches among my ancestors, all in the New England area. (This might simply reflect that my New England ancestors are the most extensively documented part of my tree.) I sent my distant relative this list of names, along with dates and locations of birth and death (where available).
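The surname comparison itself boils down to a set intersection. Here is a minimal sketch, assuming both lists are plain strings and that ignoring case and surrounding whitespace is enough; spelling variants of the same surname would need extra handling. The function names are made up for illustration.

```python
def normalize(surname):
    """Fold case and trim whitespace so 'STRONG ' and 'Strong' compare equal."""
    return surname.strip().casefold()


def shared_surnames(mine, theirs):
    """Return the surnames from `mine` that also appear in `theirs`, sorted."""
    theirs_normalized = {normalize(name) for name in theirs}
    matches = {name.strip() for name in mine if normalize(name) in theirs_normalized}
    return sorted(matches, key=str.casefold)
```

A match here is only a candidate: two unrelated families can share a surname, which is why the dates and places still have to be checked by hand.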

From that, he found one definite overlap. Here’s the path from me to that ancestor, nine generations distant:

Gen 0  Madeleine Emily Price
Gen 1 . Paul Arms Price
Gen 2 .. Doris Madeline Arms
Gen 3 ... Howard William Arms
Gen 4 .... Jane Aitken
Gen 5 ..... Eliza Wales
Gen 6 ...... John Wales
Gen 7 ....... Lucy Strong
Gen 8 ........ Martha Stoughton
Gen 9 ......... John Stoughton
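A listing in this style is easy to produce from an ordered list of names. Here is a minimal sketch, where format_path is a made-up name for illustration rather than a function from the module:

```python
def format_path(names):
    """Format names (youngest first) as 'Gen N <N dots> Name' lines."""
    return [f"Gen {i} {'.' * i} {name}" for i, name in enumerate(names)]


for line in format_path(["Madeleine Emily Price", "Paul Arms Price"]):
    print(line)
```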

To be fair, I think it’s possible (even probable) that given our shared New England ancestry we have other points of overlap that we didn’t discover. I’m really pleased, though, at how tractable this task became once a program was applied: programming is a useful skill to have!