Pages may contain affiliate links.
See my disclosure page for more details.

Auto-clustering, Genetic Networks, and Kissing Cousins

Have you heard of auto-clustering for your autosomal DNA results? It's something you should try, but is it going to help you or be another tool that drives you crazy? Turns out, it might depend on who your ancestors were.

If you've read many posts on this blog, you know my ancestors are all southern. They aren't just southern, they're all from North Georgia. This is a nightmare when it comes to using DNA (although those of you with some of the most extreme endogamous populations, you have my sympathy, I know it could be worse).

Should you try auto-clustering for your DNA matches? | The Occasional Genealogist


Not All Genetic Clustering is the Same

I've been working on a technique for possibly two years or more to get faster results using DNA and today I want to talk about it.

I've actually spent almost a year trying to prepare this technique to share with the genealogy community but there have been some bumps.

Some were due to my schedule (getting sick, taking on extra work, vacation, etc.), but some came up because the technique is easy to explain and the analysis is not. To me, it's obvious, but I've talked to some other (very smart) people about it and it's not obvious to them.

I did my version of clustering long enough ago I've actually been able to verify many of the clusters are correct or helpful (that doesn't mean all the matches in a cluster belong to an ancestor, it means the cluster works for what clusters are supposed to do). So, I know my analysis is good, I just haven't figured out how to explain it so others can apply it.

I've decided to start towards sharing some of my work because my technique is not entirely unique. In genetic genealogy, multiple people will develop their own techniques for dealing with an issue but they'll be similar. Visual phasing is an example of this.

In this case, the technique I developed for myself is similar to Dana Leeds's "Leeds Method." The big difference is Dana designed her technique for the tests she worked on, which were not southern (or endogamous), and she included helping adoptees.

I designed my method specifically for my southern family and in particular to do more with my great-aunt's test. Her grandparents were second cousins. Although that's the only "close" cousin marriage in my known direct line, those families inter-married a lot so many of her matches had kissing cousins, too.

NOTE: I'm using the term "kissing cousins" because there's some debate over whether a population like southern should be called "endogamous" or if the term "pedigree collapse" should be used. And there's also the issue of whether the person who should read this post would even know those terms!

I personally don't like "pedigree collapse." Southerners do have pedigree collapse, as it is defined, but an endogamous population does, too. However, the average southerner's pedigree isn't actually collapsed the way the pedigree of someone from an undoubtedly endogamous population would be. I've always thought of southerners as having a "pedigree knot" but that's not a term!

Problems Using DNA for Southern Families

So why do you care about having kissing cousins?

It's a mess.

I think it's appropriate to use that popular phrase, it's a hot mess.

A hot, southern, sometimes inbred, mess.

It's a mess when doing traditional research and DNA has the same issues. With kissing cousins, you didn't just mix up two men of the same name, you mixed up two relatives of the same name. That means their DNA is mixed up, too.

Unlike unconditionally endogamous populations, a population like southern may appear to be "normal" when working with autosomal DNA.

I'm so used to dealing with it I didn't realize it stumps lots of people that are otherwise successful using autosomal DNA results. I knew it wasn't easy to work with, but when it's all you've got, you press on.

I've talked to a number of people who haven't realized they were going to have any issues with a southern branch until they started trying to achieve the same results they did with their other branches.

Once I figured out my technique for myself, I knew other people were having the same struggles and could use the same kind of help.

I've decided to go ahead and start sharing now because auto-clustering using the Leeds Method is all the rage. I know what's happening, programmers can easily create an automated option to perform the Leeds Method on test results.

I performed the Leeds Method manually on several of my southern tests, and on my only non-southern test (my father-in-law's test).

I'll make this short for those of you with known "normal" (non-endogamous) populations in your DNA. The Leeds Method works great and that's really all you need as far as a "clustering" step (keep in mind, clustering, manually, automatically, or any other way, is just one step in using DNA results).

If you are southern or focusing on a southern branch, you'll likely need to take things farther.

The Leeds Method vs. the 4 Buckets Technique


The automated options give varying degrees of success for southerners because using the Leeds Method without any adaptation doesn't overcome the issues southerners face using DNA for genealogy.

In essence, you will not get well-defined clusters using just the Leeds Method. When I learned about Dana's method, I learned how to do it and tried it (this was after I had already been using my method for months and had actually stopped clustering and was in the "working" phase that comes next).

What she did and what I did starts out the same, except she has some nicely defined rules which I think you should start with (see her post about starting the Leeds Method, here). You should really read through how to do the Leeds Method before jumping into auto-clustering so you understand what's happening (it's not that complex).

There were a few differences between my technique and the Leeds Method from the start and I think this highlights the difference between color-clustering for southerners vs. non-endogamous populations ("normal" DNA results).

Dana didn't include a lot of extra information in her initial spreadsheet design (and for cases where I think you only need the Leeds Method, I agree it isn't absolutely necessary and makes it easy to "see" the results).

I set out from the start to include shared cMs and the estimated relationships as given by the testing company (I used this specifically for AncestryDNA where I didn't have segment data). I also included the closest identified relationship and whether the person's estimated relationship was closer or farther than this. Plus I included notes, such as when people were identified as both a 2C1R and a 3C (second cousin once removed and third cousin), and there was a lot of that in my family.

All of this was to help sort out if the known shared relationship was likely the genetic shared relationship or not (plus I like statistical data when working with a lot of material).

The layouts of our spreadsheets were different, and I've found I can start with her layout and then morph it into my layout if the extra analysis is needed. The big difference: I used a more complex color system based on known and suspected relationships.

Why?

That's what highlighted if multiple matches belonged in two clusters (So far I haven't needed to apply more than two colors to a match, thankfully! This is of course because the color comes from the most recent ancestor in a lineage, even if another color is assigned to an ancestor of that person. Otherwise, most people would need multiple colors).

Looking at my spreadsheet wasn't as simple as looking at one for the Leeds Method but the purpose was for more complex relationships so that should be expected.

When I started, what I did was list the names of shared matches (this was why our layouts were different). I colored the primary match, not the shared matches, IF I knew we had a shared relationship. Then I'd copy the name of the primary match to each cell where that name appeared in the shared match matrix.

Later I also developed a way to color code matches whose shared ancestor was estimated from the bucket they were in. This strengthens the groups without giving the impression they are more certain than they are.

This was easier than it sounds because I was developing the technique as I went. It also worked.

Before you run out and try to duplicate this with such skimpy details, I now use a pivot table for AncestryDNA matches. This essentially performs the Leeds Method, but I include the shared match names as the name for EVERY cluster (not just the four or eight biggest ones), and that's why it needs a pivot table, which does this in moments. Note that when I say "every cluster," you do limit by shared cMs or estimated relationship; there's no reason to do this on too-distant cousins.
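If it helps to see the pivot-table idea outside of Excel, here is a minimal sketch in Python. It is not my actual spreadsheet: the names, the cM values, the 40 cM cutoff, and the helper functions `build_buckets` and `print_matrix` are all made up for illustration. It only shows the general shape, where every qualifying match becomes the name of a bucket and the bucket holds that match's shared matches.

```python
from collections import defaultdict

# Hypothetical data: shared cM with the test taker, and (match, shared match)
# pairs as you might collect them from a shared-match list.
shared_cm = {"Ann": 210, "Bob": 150, "Cal": 95, "Dee": 60, "Eve": 18}
shared_pairs = [
    ("Ann", "Bob"), ("Ann", "Cal"),
    ("Bob", "Cal"), ("Bob", "Dee"),
    ("Cal", "Dee"), ("Dee", "Eve"),
]

MIN_CM = 40  # assumed cutoff for illustration; choose your own


def build_buckets(pairs, cm, min_cm):
    """Return {bucket name: set of members}, one bucket per qualifying match."""
    keep = {name for name, value in cm.items() if value >= min_cm}
    buckets = defaultdict(set)
    for a, b in pairs:
        if a in keep and b in keep:
            # Shared matching is mutual, so record it in both directions.
            buckets[a].add(b)
            buckets[b].add(a)
    return dict(buckets)


def print_matrix(buckets):
    """Print a match-by-bucket grid, like a pivot table with x marks."""
    names = sorted(buckets)
    print("match".ljust(8) + "".join(n.ljust(6) for n in names))
    for row in names:
        cells = ("x" if row in buckets[col] else "." for col in names)
        print(row.ljust(8) + "".join(c.ljust(6) for c in cells))


buckets = build_buckets(shared_pairs, shared_cm, MIN_CM)
print_matrix(buckets)
```

In this toy data, Bob and Cal each show up in three buckets. That kind of overlap is exactly what you want to be able to see when you're dealing with inter-married families.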

If I then apply my initial color coding scheme, this becomes my technique which I call the 4 Buckets Technique (you're dumping matches in buckets that might be for an ancestor, or not, it's very casual and you need to remember that, even when people call them clusters!).

This looks absolutely nothing like the visual "auto-cluster" results you get because it works by using the filter/sort/hide features in Excel to see the buckets you create. The reason being, you will have some people in multiple clusters and you need to see that.

You want to keep adjusting the colors in your buckets. Although it's great to identify the shared genetic match, if you're related in more than one way, finding new traditional results is just as helpful even if you don't share DNA from that ancestor.

Auto-clustering DNA Results for Southern Families


I know the beta version of auto-clustering at Genetic Affairs won't help much because its programming can't deal with matches who appear in more than one cluster. That's sort of the definition of southern genealogy.

I say the beta version because right now, the programmer has stated this is an issue (not a glitch, it's programmed that way). This could be changed later but realize, you need help with your southern branches because a match does appear in several clusters. Any tool that tosses out the matches in more than one cluster won't help.

An automated tool can only cluster matches that appear in multiple clusters by either making an arbitrary decision or including the match in each cluster it appears in. This means an automated option will always either leave you without information (if matches are tossed out) or can badly lead you astray.
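To make that trade-off concrete, here is a small Python sketch. The clusters and names are invented, and the functions `multi_cluster_matches` and `drop_overlaps` are hypothetical helpers, not how any particular tool is programmed. It simply finds the matches that sit in more than one cluster and shows what is left if a tool tosses them out.

```python
from collections import Counter

# Hypothetical clusters; the names are invented for illustration.
clusters = {
    "Cluster 1": {"Ann", "Bob", "Cal"},
    "Cluster 2": {"Cal", "Dee", "Eve"},
    "Cluster 3": {"Eve", "Fay"},
}


def multi_cluster_matches(clusters):
    """Return the matches that appear in more than one cluster."""
    counts = Counter(name for members in clusters.values() for name in members)
    return {name for name, n in counts.items() if n > 1}


def drop_overlaps(clusters):
    """Mimic a tool that tosses out any match found in several clusters."""
    overlap = multi_cluster_matches(clusters)
    return {label: members - overlap for label, members in clusters.items()}


overlap = multi_cluster_matches(clusters)
print("In more than one cluster:", sorted(overlap))
print("After dropping overlaps:")
for label, members in drop_overlaps(clusters).items():
    print(" ", label, sorted(members))
```

Notice how thin the clusters get once the overlapping matches are gone. With real southern results, those overlapping matches may be the only ones with trees.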

I prefer the latter because clustering is always a gamble. It is a technique to help jumpstart work or provide clues or suggest what to do next.

It is not an answer in itself.

You need to realize this when using clustering and genetic networks of any type. You can do more analysis or layer on more techniques to refine your results and work towards a proven solution.

With southern ancestry, the matches that appear in multiple clusters are often the most helpful. You may discover unknown inter-marriages that reveal why you're having a problem working with some matches.

Because it is so common to have matches that belong to multiple clusters, you don't want to throw any matches out because those may be the only matches with trees (or the only ones who respond or the only ones at GEDmatch, etc.).

You don't want to throw any matches out, you want to figure out what to do with them (this can include ignoring them but you want to make that choice, not let a computer program decide it).

If you try an auto-clustering tool on a heavily southern test and the results don't seem helpful, this could be the reason. If you know a tool tosses out matches who appear in more than one cluster, you should expect the results to be limited. They can still be helpful, though.

So far, all the auto-cluster tools I've tried have been accurate (as much as you can tell when you can't verify how some matches are related). That means you can use the clusters you do get as a starting place.

The Perils of Genetic Networks


If you've been stymied researching your southern ancestors (and this applies to any group with lots of cousin marriages, I've seen a similar pattern in a Welsh family of a client), genetic networks will help. However, you need to be extremely careful using automated tools, this includes both "auto-clustering" tools and anything like AncestryDNA's "DNA Circles."

DNA Circles


The DNA Circles at Ancestry are particularly problematic because they are created using members' public trees. For many people, this will mean they have fewer or smaller circles because a DNA match might have the wrong (or even just a variation of) information in their tree and the algorithm won't put them in the circle because the trees don't match.

If you're southern, you can end up in the wrong Circle because a tree is wrong (and all of this also has to do with trees being copied flagrantly so they "spread") but you still have a DNA match because southerners are often related to each other genetically in multiple ways.

This is exactly how incorrect trees could get passed around in traditional research. In southern research (and any population with the same kind of insular-ness), even doing "good" research can make it appear you have the right family. Names are often the same and in a small community, even FAN clubs (friends, associates, and neighbors) can be very similar.

It takes doing great research to uncover the truth when the surface details seem to match.

With DNA Circles, you've got the potentially wrong trees (which might even look good when you look closely but don't dig deeply). Then you add the DNA which you think matches. Seems like a lead!

It might be, or it might be that you're related to someone with a wrong tree in three or four ways (or even more).

At AncestryDNA, you have no idea if you actually share the same DNA segment or if you're just a shared match. The problem: you might match person 1 and 2 but the DNA for match 1 came from ancestor A and for match 2 it came from ancestor B---but 1 and 2 share ancestor C.

You'd still all be shared matches even though it's not through one ancestor!
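Here is that scenario as a tiny Python sketch. The pedigrees are made up, and it simplifies things by treating "descends from the same ancestor" as if it guaranteed a DNA match, which real DNA does not. The point is only to show that every pair can be connected without any one ancestor being shared by all three.

```python
from itertools import combinations

# Hypothetical pedigrees: the ancestors each person descends from.
ancestors = {
    "you": {"A", "B"},
    "match 1": {"A", "C"},
    "match 2": {"B", "C"},
}

# Every pair shares an ancestor, so all three could show as shared matches.
for p, q in combinations(ancestors, 2):
    print(p, "and", q, "share", sorted(ancestors[p] & ancestors[q]))

# But no single ancestor is common to all three people.
common_to_all = set.intersection(*ancestors.values())
print("Shared by all three:", sorted(common_to_all))
```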

All you need to turn that into a DNA Circle is for all of you to have trees showing one shared ancestor, regardless of whether the trees are correct (and you do need more than three people, but if each of you had multiple relatives tested, it's not unlikely to happen).

What's my point? Genetic networks are a FANTASTIC tool for southern genealogists. As long as it doesn't lead you horribly astray.

Learn How "Auto" Works

Understand how any automatically created cluster or genetic network is created.

If it uses an algorithm to match public trees, this adds an additional chance for error which is magnified in populations where people are likely to have multiple genetic connections (it doesn't matter if the connections are small, i.e. small segments, only if the tool recognizes them as a genetic match).

If genetic networks/clusters are so great but potentially so troublesome, what's the solution?

What Comes After an Automatic Genetic Network?

The solution is doing your own analysis. With the DNA Circles, this involves redoing everything as there aren't any built-in analysis tools. However, you can use the auto-clustering tools as a double check (that's fast and either free or cheap) and then do your own analysis.

If we're talking about an auto-clustering tool, you just do your own analysis.

If you're not dealing with kissing cousins I highly recommend you try the tool at Genetic Affairs or the new Collins's Leeds Method 3D tool (for AncestryDNA, only, right now) in the DNA Gedcom Client.

The Genetic Affairs auto-cluster tool is free to try (you get 200 credits for free, it's 25 credits per auto-cluster and it might spend a few sending you your match list when you start setting it up so you might not get 8 auto-cluster reports for free).
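The arithmetic behind that is simple enough to sketch in Python. The 200 and 25 come from the numbers above; the setup cost is an assumption, since I don't know exactly how many credits the match list uses, so the 10 below is purely hypothetical.

```python
FREE_CREDITS = 200    # free credits, as described above
CREDITS_PER_RUN = 25  # cost of one auto-cluster report, as described above


def free_runs(setup_credits):
    """Whole auto-cluster runs left after any credits spent during setup."""
    return max(0, (FREE_CREDITS - setup_credits) // CREDITS_PER_RUN)


print(free_runs(0))   # nothing spent on setup
print(free_runs(10))  # hypothetical setup cost of 10 credits
```

Even a small setup cost drops you below eight full runs, because a partial run doesn't count.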

DNA Gedcom Client is a paid feature of DNA Gedcom and you can get a month's access for $5. I like this better so I can play with the settings as much as I want.

However, it is still in Beta and a touch glitchy (sometimes it runs for me, sometimes it doesn't---it probably has to do with how many matches the test has which is a problem for most southerners---it was lightning fast on my father-in-law's results which has about a 10th the matches of my great-aunt's).

If you are dealing with kissing cousins, I still recommend giving these a try BUT, you probably want to think about the cost before you start.

If you have lots and lots of matches, which is common for southerners, your results can be overwhelming or server-crashing without adjustments to the default settings. This is why I like DNA Gedcom Client: I can try out different settings and I don't worry about stopping a run that seems to have caused the system to hang up (it's not going to cost me any more).

If you only have one southern branch (or one branch of whatever population is causing you problems), this isn't such a big deal. It's in the cases where you need clustering the most that you have problems.

What Does Auto-clustering Look Like for Southern Families?

For example, take my great-aunt, whose grandparents were second cousins. That side of the family is from one county. The other side is from the adjoining county.

In a case like hers, clustering helps me get past the lack of help from geography. It helps me tell the most likely branch the shared ancestor comes from when a match is related to us three or more ways, or doesn't have a tree, or won't respond. I did all of that by clustering manually (using my own method which starts the same and then diverges to deal with the inter-marriages).

When I use auto-cluster "off the shelf" I either cause a hang-up in the system (how this happens differs by which tool) OR I get an insane number of clusters. I actually can't really read the results displayed visually. That's why there are settings to adjust. However, I've caused lots of issues playing with the settings, too.

Below are two images so you can see what I get when trying to work with my great-aunt's results. I've included brief notes in the caption for each.

[Image: My great-aunt's results from Genetic Affairs, without adjusting settings: 18 clusters]

[Image: Great-aunt's results using the cM cutoffs I use when I manually do my technique: 65 clusters]

NOTE: I don't find clusters with single digit numbers of members very helpful although I work with them if it's all I've got. This might be very different for others but when you easily find two ways to be related to many matches, I don't like trying to define a cluster with so few members, although sometimes it's a very clear cluster.

Since I keep getting over a dozen clusters, I'd rather be able to easily combine most of the smaller clusters---so I stick with my technique where I start with large clusters and subdivide as I can. If you've found settings to essentially do this, please share in the comments.

I actually cannot recommend any settings at this time, I've had too many hang-ups and I can get actual results from my own method so I don't NEED to use auto-clustering. I do it out of curiosity and to see if it's faster. I can actually get my own clusters in about half the time in Excel for my AncestryDNA results but most genealogists are not that comfortable with Excel and could not replicate that (so I 100% see the need for the auto-cluster tools).

Auto-clustering: An Extreme Contrast

As a comparison and a note, when I ran the tools on my father-in-law's results, I only got 2 clusters using Genetic Affairs's tool. I got more with the DNA Gedcom Client tool.

[Image: Father-in-law's clusters at Genetic Affairs with default settings]

[Image: Father-in-law's clusters with DNA Gedcom Client beta tool, default settings]

Initially, when I manually did the Leeds Method on his results, I got multiple clusters but it boiled down to three usable clusters. These are the two maternal grandparents (very well researched by others) and his paternal side (both Italian).

Unfortunately, I lost those results because I forgot to save as an Excel file and left it as a CSV and that means dozens of cells were combined into one---garbage. So beware with your spreadsheets!

After I automated my process in Excel, I ran his test again to compare the auto-cluster tools to my "auto-bucket tool." Realize, at this point, I had used the first run to recognize how he was related to a number of matches (i.e. based on the clusters, I knew which ancestor I was looking for in a match's tree and found them by doing my own research---this was in multiple cases---TAKE AWAY: you need clusters).

The 4 Buckets Technique: Not as Pretty as Auto-clustering but Powerful

The second go round I had those notes (I added them to AncestryDNA knowing they would download in the DNA Gedcom Client so they weren't lost in my CSV mess).

That means I could skip the 1st cousin clusters that led me astray the first time. I didn't have to spend time on the newly identified matches, again. So, I could focus on the Italian branches. And I finally identified one of them as belonging to his grandmother (super exciting, I've had so many traditional research leads and this was actually one of them but the DNA has never helped with his Italians before).

[Image: 4 Buckets Technique. It isn't a picture, it's a spreadsheet]

Still, none connected to his grandfather but with these Italian immigrants (almost all of them arrived between 1900 and 1920 in the associated families), there are only three, four, or five people in several clusters.

I want to say a little more about the above image than will fit in a caption. This is the same spreadsheet, sorted by three different columns I've identified as potential shared Italian ancestors. This is an example where you have to work with small clusters because that's all you're going to get!

The light copper arrow points to a bucket that is likely from the paternal grandmother. I could not find a verified connection (it might be back in Italy and there are no microfilmed records from that town).

As a note, the coloring is actually a pattern instead of solid since I can't confirm any relationship (in this case, the surname is the same and it was a family I know came from the same village in Italy and settled in the same town in Connecticut. I had already researched that family as potential relatives but couldn't verify a relationship but that doesn't mean there isn't one, I just didn't look very hard).

The navy arrows are likely matches to the paternal grandfather since there is no overlap to the light copper bucket but it could also be navy is paternal grandmother's mother or father and the light copper is the other.

The two columns the navy arrows point to are most likely a bucket with two sub-buckets. The top matches are all in a bucket together, then the bottom of the first navy column is a sub-bucket, and then the bottom of the second navy column is likely another sub-bucket.

This is the other reason I think this is the paternal grandfather rather than another generation back on the paternal grandmother's line. The closest shared matches are not too distantly related and the sub-buckets would be people that share an ancestor another generation (or more) back.

These are obviously working hypotheses and the reason I like having all my details in my spreadsheet (those columns are hidden and if this wasn't an illustration in a blog post, I would have hidden all those columns from the maternal side---the purple and mint---while I focused on the paternal side).

I'm thrilled when I can tell who should be related to my father-in-law's Italians (instead of his English or Irish or Polish ancestors---because almost every match has several of those listed in their ethnicity---very different from my southerners but just as useless!).

When Auto-clustering is Enough

My point of this story is---the auto-cluster tool worked perfectly (as did the manual Leeds Method) for the half of his family that was distinct (not endogamous) and had sufficient shared matches (these were populations where lots of people tested and where lots of people had trees---the clusters will still be strong without the trees, you just may not know who the clusters likely belong to).

The clusters with Genetic Affairs had fewer members than my manual Leeds Method but this might be overcome by adjusting the settings. Just realize you may want to find those other potential cluster members at some point. The tool from the DNA Gedcom Client gave me more clusters than I wanted to start with, once again, adjusting settings might take care of that.

The automated tools struggled when there was a lack of matches. But you'll struggle, too, so don't blame the machine. DNA needs matches to be useful.

Recognizing a profile for your ancestry can help make the tools more useful and help you avoid frustration.

If you have a problem with very inter-married branches, auto-clustering can give you an initial boost. You may already be past this point if you've used your DNA results in any way.

Beyond that, you will need to look closer and possibly alter the results format.

I can't use the visual charts to do the next level of analysis I do but I can't tell if that's just because of how my brain works or if the information actually needs to be rearranged.

(Note both have spreadsheets. The spreadsheet in the DNA Gedcom Client saves in your directory for the client and it actually has two sheets in it. One is the same as the HTML chart you'll see when you run the tool but the cMs are lumped into column A meaning you can't easily sort and the link to trees isn't included. You still have the HTML file saved there you can use, though. The other is "data" and is essentially a list of who is in which cluster and does have separate columns but only includes those who are in a cluster, even if your chart includes everyone. By the way, love the automatic naming of the files. This might be a first.)

These auto-clustering tools are too new and I'm super busy right now (it is Christmas and I have small children and elves running loose around my house). I like my technique and know how to use it and really want to explain it so others can benefit from the extra level of analysis.

That means I'm not figuring out all the ins and outs of the auto-clustering tools right now.

If you have suggestions, feel free to leave them in the comments!

Go, Cluster, Now

If you have taken an autosomal DNA test, you need to learn more about genetic networks and clustering. You might as well start with auto-clustering.

NOTE: There's no reason to manually do clustering, as in typing everything in. This is how I started and how I taught the technique in my initial course but technology changes fast and that's just a lot of cutting and pasting you don't need to do, now.

You might want to manually cluster but start with an automatically generated list of matches---I use the DNA Gedcom Client to do this for AncestryDNA, you can download the match list from FTDNA and MyHeritage---sorry I haven't had time to experiment with 23andMe or other companies.

All you need to start is your match list and then you can access your shared matches online for manual clustering.

However you dive into genetic networks, I have three guidelines.
  • Understand what you're getting if it's automated.
  • Consider the specifics of the test taker's known background when reviewing results.
  • Don't expect a miracle. This is just one step in the process needed to PROVE relationships with DNA. It is a very powerful way, like adding a motor to your rowboat, but you still have to steer. Otherwise you'll go around in circles.



Are you feeling a bit overwhelmed by all your DNA matches? Don't even know where to start? If you're not even ready to think about auto-clustering, yet, I have a simple step you can do today to get started towards success using DNA. You can get my new "Road to DNA Success Starter Pack" here.



4 comments

  1. Greetings! Thank you for sharing parts of your method but I was wondering if you had actually worked it up in a step by step tutorial? I am a moderate excel user who has tested most of the auto cluster tools, but as my maternal grandfather was Amish and my paternal grandmother was VERY southern, I encountered many of the same pitfalls and knew I needed a modified analysis tool. I'd love to give your method a try, but I am having difficulty following how you set your spreadsheet up. Any suggestions or direction would be wonderful! Thank you again for giving us endogamous peoples hope! =)

    1. Hi Sherri,
      I don't currently have the instructions written up succinctly. I begged a number of southern friends to lend me their DNA results because the last time I tried sharing this, I discovered the analysis was as clear as mud. Explaining the Excel steps isn't too hard IF you are comfortable with Excel. Email me and I'll share the instructions I've written up for myself. If you can figure out what to do, go for it. I'd also love to get your feedback on any issues you have. I am working on creating multiple "products" to share this varying from free to paid options (depending on how much help you want or need). I suspect there are lots of people who need this help. Some can handle it with limited instructions, some need lots of help, and some would rather just pay someone to deal with the Excel/tech parts for them. At some point the Excel only part should be able to be turned into auto-clustering but it'll still require explanation of how to deal with the interrelated branches. It's obvious to me but I started at the other end of this problem, knowing what I was looking at and creating a solution rather than having a solution and figuring out what I was looking at! I look forward to hearing from you.

  2. Thanks for this post. I have similar issues with Southern DNA and Ancestry. I also face a similar issue analyzing my wife's Newfoundland ancestry. I would be very interested in trying this in Excel. I am fairly familiar with Excel, and pivots, sorting, filter, etc. I don't know if you are willing or able to send them to the email attached to this comment? I was looking for your email but couldn't find it...probably not looking in the right place. I have started using DNAGedcom for cluster analysis, and am thinking what you are discussing in this post could be most helpful.
    Steve

  3. Hi Steve,
    I can't email you. I got your Google+ page (and yikes, that'll disappear soon!) but the email option was disabled. You can email me at jennifer at the occasional genealogist dotcom.
    I've also found Canadian ancestry is similar. I knew it was similar when I've researched some of my unusual southern surnames and found them in Canada (I guess there might have been three brothers, one went north, one went south, and one wrote outlandish family histories) but working with DNA proved it's genetically similar. At least it doesn't require developing yet another technique!
    Email me and I'll try and provide the Excel instructions.

