Auto-clustering, Genetic Networks, and Kissing Cousins

Have you heard of auto-clustering for your autosomal DNA results? It's something you should try, but is it going to help you or be another tool that drives you crazy? Turns out, it might depend on who your ancestors were.

If you've read many posts on this blog, you know my ancestors are all southern. They aren't just southern, they're all from North Georgia. This is a nightmare when it comes to using DNA (although those of you with ancestors from the most extreme endogamous populations have my sympathy; I know it could be worse).

Should you try auto-clustering for your DNA matches? | The Occasional Genealogist

Not All Genetic Clustering is the Same

I've been working on a technique for possibly two years or more to get faster results using DNA and today I want to talk about it.

I've decided to start sharing some of my work because my technique is not entirely unique. In genetic genealogy, multiple people will develop their own techniques for dealing with an issue, but the techniques will be similar. Visual phasing is an example of this.

In this case, the technique I developed for myself is similar to Dana Leeds's "Leeds Method." The big difference is that Dana designed her technique for the tests she worked on, which were not southern (or endogamous), and she included helping adoptees.

I designed my method specifically for my southern family and in particular to do more with my great-aunt's test. Her grandparents were second cousins. Although that's the only "close" cousin marriage in my known direct line, those families intermarried a lot, so many of her matches had kissing cousins, too. My method can be used for adoptees but was designed for a test taker with a known tree. This distinction makes a difference in my method vs. all the auto-clustering tools.

NOTE: I'm using the term "kissing cousins" because there's some debate about whether a population like southerners should be called "endogamous" or whether the term "pedigree collapse" should be used. And there's also the issue of whether the person who should read this post would even know those terms!

I personally don't like "pedigree collapse." Southerners do have pedigree collapse, as it is defined, but an endogamous population does, too. However, the average southerner's pedigree isn't actually collapsed the way the pedigree of someone from an undoubtedly endogamous population would be. I've always thought of southerners as having a "pedigree knot" but that's not a term!

Problems Using DNA for Southern Families

So why do you care about having kissing cousins?

It's a mess.

I think it's appropriate to use that popular phrase: it's a hot mess.

A hot, southern, sometimes inbred, mess.

It's a mess when doing traditional research and DNA has the same issues. With kissing cousins, you didn't just mix up two men of the same name, you mixed up two relatives of the same name. That means their DNA is mixed up, too.

Unlike unconditionally endogamous populations, a population like southern may appear to be "normal" when working with autosomal DNA.

I'm so used to dealing with it I didn't realize it stumps lots of people who are otherwise successful using autosomal DNA results. I knew it wasn't easy to work with, but when it's all you've got, you press on.

I've talked to a number of people who haven't realized they were going to have any issues with a southern branch until they started trying to achieve the same results they did with their other branches.

Once I figured out my technique for myself, I knew other people were having the same struggles and could use the same kind of help.

I've decided to go ahead and start sharing now because auto-clustering using the Leeds Method is all the rage. I know what's happening: programmers can easily create an automated option to perform the Leeds Method on test results.

I performed the Leeds Method manually on several of my southern tests, and on my only non-southern test (my father-in-law's test).

I'll make this short for those of you with known "normal" (non-endogamous) populations in your DNA. The Leeds Method works great and that's really all you need as far as a "clustering" step (keep in mind, clustering, manually, automatically, or any other way, is just one step in using DNA results).

If you are southern or focusing on a southern branch, you'll likely need to take things farther.

The Leeds Method vs. the 4 Buckets Technique

The automated options give varying degrees of success for southerners because using the Leeds Method without any adaptation doesn't overcome the issues southerners face using DNA for genealogy.

In essence, you will not get well-defined clusters using just the Leeds Method. When I learned about Dana's method, I learned how to do it and tried it (this was after I had already been using my method for months and had actually stopped clustering and was in the "working" phase that comes next so I knew if some of the clusters were accurate).

What she did and what I did start out the same, except she has some nicely defined rules which I think you should start with (see her post about starting the Leeds Method, here). You should really read through how to do the Leeds Method before jumping into auto-clustering so you understand what's happening (it's not that complex).

There were a few differences between my technique and the Leeds Method from the start and I think this highlights the difference between color-clustering for southerners vs. non-endogamous populations ("normal" DNA results).

The layout of our spreadsheets was different, but the big difference was that I used a more complex color system based on known and suspected relationships.

Why?

That's what highlighted if multiple matches belonged in two clusters (So far I haven't needed to apply more than two colors to a match, thankfully! This is of course because the color comes from the most recent ancestor in a lineage, even if another color is assigned to an ancestor of that person. Otherwise, most people would need multiple colors).

Looking at my spreadsheet wasn't as simple as looking at one for the Leeds Method but the purpose was for more complex relationships so that should be expected.

When I started, what I did was list the names of shared matches (this was why our layouts were different). I colored the primary match, not the shared matches, IF I knew we had a shared relationship. Then I'd copy the name of the primary match to each cell where that name appeared in the shared match matrix.

This is a significant difference. It means the clusters (or "buckets") are based on known relationships instead of just shared matches.

Later I also developed a way to color code matches whose shared ancestor was estimated from the bucket they were in. This strengthens the groups without giving the impression they are more certain than they are.

This was easier than it sounds because I was developing the technique as I went. It also worked.

Before you run out and try to duplicate this with such skimpy details, I now use a pivot table for AncestryDNA matches. This essentially performs the Leeds Method but it goes farther.

By applying my color-coding scheme, this becomes my technique which I call the 4 Buckets Technique (you're dumping matches in buckets that might be for an ancestor, or not, it's very casual and you need to remember that, even when people call them clusters!).

This looks absolutely nothing like the visual "auto-cluster" results you get, but it does show if a match might belong in several buckets, is unclear, or doesn't have any helpful shared matches (i.e., none of the shared matches have a color, so you have no information to start with, yet!).
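For readers who think better in code than in spreadsheets, here is a minimal Python sketch of that kind of tally. It is an illustration only, not the actual pivot table or color scheme described above. The match names, the bucket names, and the functions `bucket_table` and `describe` are all hypothetical, and it assumes you already have each match's shared-match list.

```python
from collections import Counter

# Hypothetical data: buckets (colors) assigned from KNOWN relationships only.
known_bucket = {
    "Ann": "paternal grandfather",
    "Bob": "paternal grandmother",
    "Cal": "paternal grandfather",
}

# Hypothetical shared-match lists: match -> people listed as shared matches.
shared_matches = {
    "Dee": ["Ann", "Cal"],
    "Eli": ["Ann", "Bob"],
    "Fay": ["Gus"],
    "Gus": ["Fay"],
}


def bucket_table(shared_matches, known_bucket):
    """For each match, count how many shared matches fall in each known bucket."""
    table = {}
    for match, shared in shared_matches.items():
        table[match] = Counter(
            known_bucket[name] for name in shared if name in known_bucket
        )
    return table


def describe(counts):
    """Turn one row of the table into a plain-language status."""
    if not counts:
        return "no colored shared matches yet"
    if len(counts) == 1:
        return "one bucket: " + next(iter(counts))
    return "multiple buckets: " + ", ".join(sorted(counts))


table = bucket_table(shared_matches, known_bucket)
for match in sorted(table):
    print(match, "->", describe(table[match]))
```

In this made-up data, Dee lands in one bucket, Eli shows up in two, and Fay and Gus only match each other, so there is nothing to go on for them yet. Those are the three situations the table is meant to expose.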

UPDATE
After several more months of work with auto-cluster tools and the 4 Buckets Technique for clients, I've found the BIG difference that is necessary for southern DNA, regardless of whether you have kissing cousins in your direct lines.

You will have some people in multiple clusters and you need to see that.

Auto-clustering and the Leeds Method assign colors based on the first shared match. When you have nice distinct branches there's no problem with this.

When your maternal matches show up as shared matches to your paternal ancestors, it's a problem.

This happens a lot with southerners. I've actually found it happens very little in my family, even with the kissing cousins. I have lots of overlap within my paternal branches, though, so it's still a major problem. I've seen it much worse in other southerners' results and they didn't have a recent cousin marriage.

You may see this and be confused or worse yet, you may not see it and wonder why you're stumped or go down a wrong path chasing false clues!

Seeing matches appear in multiple (unexplainable) buckets tips you off there is an issue and you need to be careful.
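To see why a single color can hide the problem, here is a small Python sketch. It uses a deliberately simplified reading of the "first shared match" rule as described above. It is not how any particular auto-clustering tool works, and it is not the full Leeds Method. The names, the buckets, and the functions `first_bucket` and `all_buckets` are hypothetical.

```python
# Hypothetical data: the bucket (color) of each known match.
known_bucket = {"Ann": "paternal", "Bea": "maternal"}

# Hypothetical shared-match lists, in the order they appear on the match page.
shared_matches = {
    "Cy": ["Ann", "Bea"],
    "Di": ["Bea"],
}


def first_bucket(shared, known_bucket):
    """Single label: the bucket of the first shared match that has one."""
    for name in shared:
        if name in known_bucket:
            return known_bucket[name]
    return None


def all_buckets(shared, known_bucket):
    """Every bucket seen among the shared matches."""
    return {known_bucket[name] for name in shared if name in known_bucket}


for match, shared in shared_matches.items():
    single = first_bucket(shared, known_bucket)
    every = all_buckets(shared, known_bucket)
    flag = "CHECK: more than one bucket" if len(every) > 1 else "ok"
    print(match, "| single label:", single, "| all buckets:", sorted(every), "|", flag)
```

With only the single label, Cy looks like an ordinary paternal match. Keeping every bucket shows Cy also has a maternal shared match, which is the kind of overlap you need to see before trusting the cluster.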

The 4 Buckets Technique also assigns colors based on known relationships, first, if possible. This helps you get results faster once the Technique is done.

Lastly, I've stuck with the FOUR Buckets Technique. I do sub-divide buckets and find other buckets but the large number of clusters I get from auto-clustering makes me want to throw things! You should not be working on your entire family tree at once. It's too much. Dividing into four buckets or clusters and then focusing on dividing one of those leads to results faster than working over all your matches over and over again.

So, will auto-clustering help you build your family tree with DNA? It can.

For some people it's a really fast way to get clues instead of staring at a list of matches.

For other people, the number of clusters will just be a new type of confusion. It's also possible the clusters could mislead you.

If you have lots of intermarriage between the branches of your tree, you might have issues. That doesn't mean you necessarily have cousins in YOUR tree that married. The bigger problem is when a sibling or cousin of one of your ancestors married a sibling or cousin of an ancestor from a different branch and their descendant is one of your matches. That has tangled the branches.

In a population like southerners, and many other populations, this happened many, many times. You may share DNA with such a match from one of those lines and not the other or from both.

You may know about one of those shared relationships with the match or not (keep in mind, the "tangle" might involve different generations).

On top of this, it is very likely (in such populations) that you get false shared matches. That means you share ancestor A with the match, ancestor B with the shared match, and the match and shared match share ancestor C, who you aren't even related to.

It's because of ancestor C they appear as shared matches in your results.

If both are placed in the same cluster, you will have a problem if you try to find the ancestor the three of you share (because you don't share one).
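The A, B, C situation can be written out as a tiny Python check. This is a sketch only: it assumes you somehow know the shared ancestors for every pair, which in real research you usually don't (that's the whole problem). The data and the function `common_to_all` are hypothetical.

```python
from itertools import combinations

# Hypothetical known shared ancestors for each pair of people.
# frozenset keys make the order of the pair irrelevant.
shared_ancestors = {
    frozenset({"you", "match"}): {"A"},
    frozenset({"you", "shared match"}): {"B"},
    frozenset({"match", "shared match"}): {"C"},
}


def common_to_all(people, shared_ancestors):
    """Return the ancestors shared by every pair in the group (may be empty)."""
    pair_sets = [
        shared_ancestors.get(frozenset(pair), set())
        for pair in combinations(people, 2)
    ]
    return set.intersection(*pair_sets)


trio = ["you", "match", "shared match"]
common = common_to_all(trio, shared_ancestors)
if common:
    print("Ancestor(s) shared by all three:", sorted(common))
else:
    print("No ancestor shared by all three: possible false shared match")
```

Every pair in this trio is genuinely related, yet no single ancestor connects all three, so a cluster containing both matches points at an ancestor who doesn't exist.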

This is less likely to happen to people with nice distinct branches and if it does, it tends to happen with more distant matches.

This is why a tool like auto-clustering would be helpful for southerners but also why it can make things worse. Sloppy clusters will amplify the confusion caused by false shared matches.

The concept behind auto-clustering or "genetic networks" is essential for sorting out your DNA matches. For people with lots of tangled branches in their and their matches' trees, I recommend starting your clusters or networks with known relationships. This can highlight false leads but it'll also make progress faster as your clusters become tighter.

I personally prefer to start with a few large clusters and then divide them but most auto-clustering tools won't let you do this (but play with the settings and use tools like GEDmatch's MKA option which can allow you to cluster a sub-set of matches you already recognize as a super-cluster).

If you have taken an autosomal DNA test, you need to learn more about genetic networks and clustering. You might as well start with auto-clustering.

NOTE: There's no reason to manually do clustering, as in typing everything in. This is how I started and how I taught the technique in my initial course but technology changes fast and that's just a lot of cutting and pasting you don't need to do, now.

You might want to manually cluster but start with an automatically-generated list of matches. I use the DNA Gedcom Client to do this.

All you need to start is your match list and then you can access your shared matches online for manual clustering.

However you dive into genetic networks, I have three guidelines.

Understand what you're getting if it's automated.
Consider the specifics of the test taker's known background when reviewing results.
Don't expect a miracle. This is just one step in the process needed to PROVE relationships with DNA. It is a very powerful way, like adding a motor to your rowboat, but you still have to steer. Otherwise you'll go around in circles.

Are you feeling a bit overwhelmed by all your DNA matches? Don't even know where to start? If you're not even ready to think about auto-clustering, yet, I have a simple step you can do today to get started towards success using DNA. Get the "Road to DNA Success" free course here.