thefuzzysasquatch: A DNA primer for dummies like me!

Over the years I have looked closely at the peopling of the Americas on this blog. In particular, I have focused on examining pre-Clovis archaeological sites to see how the evidence stacks up. For instance, I have looked at the following sites:

Monte Verde in Chile (see here)

Arroyo del Vizcaíno, Uruguay (see here)

Meadowcroft Rockshelter, Pennsylvania USA (see here)

Buttermilk Creek, Texas USA (see here)

Nugget Gulch, Yukon Canada (see here)

Bluefish Caves, Canada (see here)

Santa Elina Rock Shelter, Mato Grosso, Brazil (see here)

Cerutti Mastodon site, California USA (see here)

One thing I have not done, however, is look at the genetic evidence for the peopling of the Americas in any detail.

To do so, I needed to understand the science underlying the academic DNA papers more thoroughly. I have therefore had to go back to school!

To analyse what a particular genetics paper means in the wider context of the peopling of the Americas, I needed to know what data palaeogeneticists collect and what it means at a basic level. I have now completed a basic study of the topic. As I went along, I took notes for my own reference. I thought these might be of interest to others looking to read this type of academic paper. I have therefore put this post up in the hope that it may be of some help to others in a similar situation to myself.

The basics

Our bodies are made of cells. These are small building blocks that join together to make organs and organ systems, or float freely in our circulatory system, as blood cells do.

The function of each cell is determined by a set of instructions contained in the nucleus of each cell. This set of instructions is written in a chemical molecule called DNA.

DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other organisms. Nearly every cell in a person’s body has the same DNA. Most DNA is located in the cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be found in the mitochondria (where it is called mitochondrial DNA or mtDNA). Mitochondria are structures within cells that convert the energy from food into a form that cells can use.

The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). Human DNA consists of about 3 billion bases, and more than 99 percent of those bases are the same in all people. The order, or sequence, of these bases determines the information available for building and maintaining an organism, similar to the way in which letters of the alphabet appear in a certain order to form words and sentences.

DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each base is also attached to a sugar molecule and a phosphate molecule. Together, a base, sugar, and phosphate are called a nucleotide. Nucleotides are arranged in two long strands that form a spiral called a double helix. The structure of the double helix is somewhat like a ladder, with the base pairs forming the ladder’s rungs and the sugar and phosphate molecules forming the vertical sidepieces of the ladder.

Stretch of DNA showing base pairs.
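
The pairing rule described above (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration for my own understanding: the function name complement_strand and the example sequence are made up, not taken from any real genome or library.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand: str) -> str:
    """Return, base by base, the partner of each base in `strand`."""
    return "".join(PAIRS[base] for base in strand.upper())

example = "ATGCCGTA"  # made-up sequence, for illustration only
print(example)                     # one side of the ladder
print(complement_strand(example))  # the bases on the opposite side
```

Applying the function twice gives back the original strand, which is a quick way to see that each rung of the ladder is fully determined by one of its two bases.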

In humans, the DNA is packed into 23 pairs of homologous molecules called chromosomes, for a total of 46 chromosomes. Each of the homologous chromosomes in a pair is inherited from a different parent. So we get half of our genetic material, our DNA, from each parent.

One set of human chromosomes (picture credit: US Library of Medicine)

How do humans differ from one another?

A gene is commonly defined as a DNA sequence on a particular chromosome that has a function. In practice the term covers a class of similar DNA sequences all involved in the same particular molecular function.

Alleles can be defined as “alternative forms” of a gene that can occur at the same locus, or place, in the genome. Many alleles are caused by single nucleotide polymorphisms, or by other changes such as deletions, transversions, or insertions.

A single-nucleotide polymorphism, abbreviated to SNP, is a variation in a single nucleotide that occurs at a specific position in the genome.

For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be alleles for this position.
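
To make the example above concrete, here is a minimal Python sketch that scans a handful of aligned sequences and reports the positions where individuals differ. The sequences and the function name variable_sites are invented for illustration, and real SNP calling from sequencing data involves far more than this (alignment, read quality, error filtering).

```python
def variable_sites(sequences):
    """Return {position: set of bases seen} for each position that varies.

    Assumes the sequences are already aligned and of equal length.
    Positions are counted from 0.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must all have the same length")
    sites = {}
    for pos in range(length):
        bases = {seq[pos] for seq in sequences}
        if len(bases) > 1:  # more than one base seen here: a candidate SNP
            sites[pos] = bases
    return sites

# Made-up data: most individuals carry C at position 4, one carries A.
people = ["GATTCCA", "GATTCCA", "GATTACA", "GATTCCA"]
print(variable_sites(people))
```

In this toy data set only position 4 varies, and the two bases found there, C and A, are the two alleles for that position.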

By any definition a gene must involve more than one nucleotide base pair. Single nucleotide polymorphisms (SNPs) thus do not occur at loci, but rather in and around loci.

SNP markers do, however, have alleles at set locations, that is, at “sites” within the region of a locus.

If in a population only one allele occurs at a site or locus, we say that it is monomorphic, or monoallelic, in that population. If two alleles occur, as is common for SNPs, we use the term diallelic (also known as biallelic). If many alleles occur, the polymorphism is called polyallelic or multiallelic. When there are just two alleles at a locus, the one with the smaller population frequency is called the minor allele. In genetics, the term allele “frequency” (which is, strictly speaking, a count) is used to mean relative frequency, i.e. the proportion of all such alleles at that locus among the members of a population; thus the term minor allele frequency is often used for diallelic markers.
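
The idea of relative frequency and minor allele frequency can be sketched as follows. The sample of 20 alleles is made up, and the function names are my own; this is a note-to-self illustration rather than how any particular genetics package does it.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Relative frequency of each allele: its count divided by the total."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def minor_allele_frequency(alleles):
    """Frequency of the rarer allele at a diallelic (biallelic) site."""
    freqs = allele_frequencies(alleles)
    if len(freqs) != 2:
        raise ValueError("minor allele frequency is defined here for two alleles only")
    return min(freqs.values())

# Made-up sample: 17 copies of C and 3 copies of A at one site.
sample = ["C"] * 17 + ["A"] * 3
print(allele_frequencies(sample))
print(minor_allele_frequency(sample))
```

Here A is the minor allele, since 3 of the 20 alleles sampled are A.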

A polymorphic locus was originally defined as a locus at which the least common allele occurs with a “frequency” of at least 1%, but a more appropriate definition would be a locus at which the most common allele occurs with a “frequency” of at most 99%. Different alleles arise at a locus as a result of mutation, or sudden change in the genetic material. Mutation is a relatively rare event, caused for example by an error in replication or the action of a mutagen. Thus all alleles are by origin mutant alleles, and a genetic polymorphism was conceived of as a locus at which the least common allele has a frequency too large to be maintained in the population solely by recurrent mutation. However, what is important at a locus is the degree of polymorphism, and a locus at which there are 1,000 equifrequent alleles would be considered much more polymorphic than a locus at which there are two alleles with frequencies 0.01 and 0.99. Many authors now use the term mutation for any rare allele, and the term polymorphism for any common allele.

A haplotype is the multilocus analogue of an allele at a single locus. It consists of one allele from each of multiple loci that are transmitted together from a parent to an offspring. So haplotypes are made up of multiple alleles (one from each locus). It is usual nowadays to restrict the word haplotype to the case where all the loci involved are on the same chromosome pair, so that all the alleles involved are on the same chromosome.

If the alleles at one locus are not distributed in the population independently of the alleles at another locus, the two loci exhibit allelic association. If this association is a result of a mixture of subpopulations (such as ethnicities or religious groups) within each of which there is random mating, the association is often denoted as “spurious”. In such a case there is true association, but the cause is not of primary genetic interest. If the association is not due to this kind of population structure, it is either due to linkage disequilibrium (LD) or gametic phase disequilibrium (GPD); in the former case the loci are linked, i.e. they co-segregate in families, in the latter case they are not linked, i.e. they segregate independently in families.
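
One common way to put a number on allelic association is the coefficient D, the observed frequency of a two-allele combination minus the frequency expected if the two loci were independent. The sketch below assumes we already know which alleles travel together (the haplotypes), uses invented data, and says nothing about whether an association is due to LD, GPD or population mixture.

```python
def association_d(haplotypes, allele_1, allele_2):
    """D = P(allele_1 and allele_2 together) - P(allele_1) * P(allele_2).

    `haplotypes` is a list of (allele at locus 1, allele at locus 2) tuples.
    D is 0 when the alleles at the two loci are distributed independently.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n
    p12 = sum(h == (allele_1, allele_2) for h in haplotypes) / n
    return p12 - p1 * p2

# Made-up populations with alleles A/a at locus 1 and B/b at locus 2.
independent = [("A", "B"), ("A", "b"), ("a", "B"), ("a", "b")]
associated = [("A", "B"), ("A", "B"), ("a", "b"), ("a", "b")]

print(association_d(independent, "A", "B"))  # no association
print(association_d(associated, "A", "B"))   # A always travels with B
```

In the second population A is only ever found with B, so D is positive; in the first every combination is equally common and D is zero.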

Identity

The concept of allelic identity is an important one. Alleles are identical by descent (IBD) if they are copies of the same ancestral allele, and must be differentiated from alleles that are physically identical but not (at least within the previous dozen or so generations) ancestrally identical. Such alleles, when not IBD, are identical in state (IIS) or, more commonly nowadays, identical by state (IBS). These alleles are ancestrally, but not physically, different.

Mitochondrial DNA

Although most DNA is packaged in chromosomes within the nucleus, mitochondria also have a small amount of their own DNA. This genetic material is known as mitochondrial DNA or mtDNA. Mitochondrial DNA is inherited, essentially unchanged, directly from the female parent.

Mitochondria are cellular organelles within eukaryotic cells that convert chemical energy from food into a form that cells can use, adenosine triphosphate (ATP). These organelles are found in most cell types, including bone.

Because most cells contain a large number of mitochondria, and hence many copies of mitochondrial DNA, a relatively large amount is available for study once extracted from samples of living or deceased individuals.

Diagrammatic representation of the position of mtDNA in cells from Wikipedia commons (2019)

Since human mtDNA evolves faster than nuclear genetic markers, it has become a mainstay of phylogenetics and evolutionary biology. The fact that mitochondrial DNA is maternally inherited enables genealogical researchers to trace maternal lineage far back in time.

By looking at the SNPs in mtDNA, haplogroups and haplotypes can be determined. The order and number of the SNP changes allow geneticists to construct a phylogenetic tree showing the relatedness in both time and space of the various haplogroups and haplotypes.

It has therefore permitted an examination of the relatedness of populations, and so has become important in anthropology and biogeography.
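
The raw ingredient of such a tree is a count of how many SNP positions differ between each pair of haplotypes. The Python sketch below shows only that counting step, on three invented haplotypes labelled H1 to H3; real phylogenetic methods are far more sophisticated, and the names here are hypothetical.

```python
from itertools import combinations

def snp_differences(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Made-up haplotypes, already aligned.
haplotypes = {
    "H1": "ACGTAC",
    "H2": "ACGTAT",
    "H3": "TCGAAT",
}

for (name_1, seq_1), (name_2, seq_2) in combinations(haplotypes.items(), 2):
    print(name_1, name_2, snp_differences(seq_1, seq_2))
```

With these sequences H1 and H2 differ at one position while H1 and H3 differ at three, so a tree built from the counts would place H1 and H2 on neighbouring branches, with H3 further away.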

y-DNA

Only males have a Y-chromosome, thus making their 23rd chromosome pair XY, whereas women have two X chromosomes in their 23rd pair. The Y-chromosome is almost 60 million base pairs long and there is only one per cell. A man's patrilineal ancestry, or male-line ancestry, can be traced using the DNA on his Y chromosome (Y-DNA), because the Y-chromosome is transmitted father to son nearly unchanged.

Single nucleotide polymorphisms (SNPs)

As was the case in mtDNA, single-nucleotide polymorphisms (SNPs) also occur in y-DNA. These single changes to a nucleotide in a DNA sequence will, when taken together, confirm haplogroup and haplotype.

Typical commercial y-DNA SNP tests examine about 20,000 to 35,000 SNPs, while academic researchers use far more; e.g. Fu et al. (2016) used between 200,000 and ca. 800,000 SNPs for the delineation of haplotypes and relationships between ancient individuals.

Again, as in mtDNA, haplogroups and haplotypes can be determined, and the order and number of the SNP changes allow geneticists to construct a phylogenetic tree showing the relatedness in both time and space of y-DNA haplogroups and haplotypes. Different branches of this tree are different haplogroups. Most haplogroups can be further subdivided multiple times into sub-clades and finally haplotypes.

Once more, this type of DNA also permits an examination of the relatedness of populations, and so has become important in anthropology and biogeography.

For example, commercial DNA analysis has brought up some interesting results as noted by Bettinger (2016): “All human men descend in the paternal line from a single man dubbed Y-chromosomal Adam, who lived probably between 200,000 and 400,000 years ago. ..Most significant of these new discoveries was in 2013 when the haplogroup A00 was discovered, which required theories about Y-chromosomal Adam to be significantly revised.”

If we compare the ease with which y-DNA and mtDNA can be collected, some important facts emerge. For y-DNA, there is only one copy per cell, in the nucleus. mtDNA, if you recall, resides in the mitochondria of cells. On average, there are 2000 mitochondria per cell. It is therefore relatively easy to find undamaged mtDNA, even in ancient samples.

Conversely, DNA analysis carried out to examine the y-DNA looks at the diagnostic regions of the 60 million base pairs of the y-chromosome to determine haplogroup and haplotype. This depends on extracting enough DNA from these regions for analysis. For highly degraded remains, it is unlikely that enough of the right y-DNA survives. Thus ancient remains have proved much more difficult to study from the y-DNA phylogenetic point of view.

Autosomal DNA

Inside the nucleus of every cell, each of us has 23 pairs of chromosomes. One pair is your sex chromosomes, determining your biological sex. The other 22 pairs are your autosomal chromosomes. These contain the DNA that codes for proteins, which are needed for growth and for the replacement of old, worn-out cells. For an organism to grow and function properly, cells must constantly divide to produce new cells to replace these old, worn-out cells. During cell division, it is essential that DNA remains intact and evenly distributed among cells. Chromosomes are a key part of the process that ensures DNA is accurately copied and distributed in the vast majority of cell divisions. Still, mistakes do occur on rare occasions.

These mistakes or mutations are what cause single nucleotide polymorphisms (SNPs), already discussed in the sections above on mtDNA and y-DNA.

Autosomal DNA can be used to find unknown relatives through commercial DNA testing, or to link modern or fossil individuals to ancient populations.

But hang on a minute: as we each receive 50% of our DNA from each parent, about 25% from each of our 4 grandparents and approximately 12.5% from each of our 8 great-grandparents, surely this serial dilution affects how far back you can trace ancestry, doesn’t it? Well, you can test this idea. You have about 3 billion base pairs in one set of chromosomes, or about 6 billion counting both copies, so by generation 33 simple halving leaves you with, on average, less than one base pair of DNA from any particular ancestor. By generation 45, that drops to about 0.00017 base pairs.

If you think about it, an ancestor who lived 20,000 years ago is roughly 800 generations removed from yourself (if each generation is counted as 25 years). Therefore, through this process of halving, the fraction you receive from a particular ancestor will have gone down to about 1.5 x 10^-241, an exceptionally small number!
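
The halving argument is easy to check for yourself. The short Python sketch below assumes simple halving every generation, which ignores the fact that DNA is really passed on in chunks, and takes a total of about 6 billion base pairs (both copies of roughly 3 billion), which is my own working assumption. The function names are made up.

```python
from fractions import Fraction

# Assumption: two copies of roughly 3 billion base pairs each.
DIPLOID_BASE_PAIRS = 6_000_000_000

def expected_fraction(generations: int) -> Fraction:
    """Share of your DNA expected from one ancestor, halving each generation."""
    return Fraction(1, 2 ** generations)

def expected_base_pairs(generations: int) -> float:
    """The same share expressed as an average number of base pairs."""
    return float(DIPLOID_BASE_PAIRS * expected_fraction(generations))

# Parents, grandparents, great-grandparents, then much further back.
for g in (1, 2, 3, 33, 45):
    print(g, float(expected_fraction(g)), expected_base_pairs(g))

# 20,000 years at 25 years per generation.
generations_20k = 20_000 // 25
print(generations_20k, f"{float(expected_fraction(generations_20k)):.1e}")
```

Exact fractions are used so that nothing is lost to rounding before the final conversion to a decimal for printing.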

Then surely tracing our autosomal DNA to a particular ancestor way back in time is impossible, isn’t it?

Well, yes and no! There is the process of genetic bottlenecking to consider. When a population, for whatever reason, is reduced to a small size and then isolated, after a few generations of interbreeding all members of that population share an extremely large proportion of the same autosomal DNA. Imagine now that this population is saved from the brink of extinction and grows again, perhaps due to improved environmental conditions. That autosomal DNA now becomes fixed within the population.

Project that population forward in time. The population still has the same autosomal DNA, or significant stretches of it, although some new mutations may have occurred, especially over thousands of years.

The situation remains unchanged until this isolated population is contacted by another and admixture of genes occurs. Well-known examples are Native Americans in the pre-Columbian contact period, or isolated Siberian tribes up until the 19th century.

Then along comes autosomal DNA testing. Now we can check how many SNPs we share with many populations from around the world, even extinct ones, whose DNA has been recovered from skeletal remains.

Basically, the number of alleles (in particular SNPs), or contiguous stretches of DNA measured in centimorgans, that we share with a population can tell us which populations we are related to. Even ancient ones.

I must stress, however, that this is a simplistic explanation of how stretches of intact autosomal DNA can survive many generations. It is also not the only mechanism that can transmit longer than expected stretches of DNA.

Once again, this technique is now used to map the origins of many haplogroups through their SNPs to specific ancient populations. Therefore, many of these groups now have simple acronyms to show general geographic areas or indicate lifeways. A partial list is included below:

ANE - Ancient North Eurasian

ASE - Ancient/Ancestral South Eurasian

ASI - Ancient/Ancestral South Indian

Austronesian – meaning populations speaking a family of languages spoken in an area extending from Madagascar in the west to the Pacific islands in the east.

Basal Eurasian - a hypothetical lineage, which probably existed among ancient Near East individuals, who were recent migrants out of Africa

EAS – East Asian

CHG - Caucasus Hunter Gatherers

EHG - Eastern Hunter-Gatherer

ENF - Early Neolithic Farmer, a Neolithic group from the Near East

Khoisan - Southern Africa

Melanesian - a subregion of Oceania/Australasia extending from the western end of the Pacific Ocean eastward to Fiji.

SEA - South East Asian

SSA - Sub-Saharan African

AP - Ancient Palaeosiberian or just Palaeosiberian

WHG - Western Hunter-Gatherer

Then there is the autosomal DNA from ancient individuals, whose SNP sets may show up in later populations and thus help map ancient migrations. Again, a partial list:

Motala 12 ca. 6,000BP (Sweden)

LaB: La Braña ca. 7,000BP (Spain)

Los: Loschbour ca. 8,000BP (Luxembourg)

Anzick1: ca. 12,600BP (Montana USA)

AG3: Afontova Gora ca. 17,000BP (Siberia)

MA1: the Mal'ta boy ca. 24,000BP (Siberia)

Salkhit: ca. 34,500BP (Mongolia)

GoyetQ116-1: ca. 35,000BP (Belgium)

Kostenki 14 ca. 37,000BP (southwest Russia)

Oase 1 ca. 39,000BP (Romania)

Tianyuan ca. 40,000BP (Beijing, China)

Ust'-Ishim ca. 45,000BP (Siberia)

Diagram from Yang and Fu (2018) showing the distribution of some ancient samples and groups over time.

The DNA of other human species has also been sequenced. Neanderthals, Denisovans and the Sima de los Huesos hominins have all had some sequences, or even full genomes, recovered from their remains. Amazingly, SNPs from some of these ancient hominins have also been found in modern populations!

Now that I feel somewhat more equipped to read genetics papers and comment on how this evidence has been used to indicate the timing and route(s) of the peopling of the Americas, I will attempt something soon. Watch this space.

References

Bettinger BT, Wayne DP (2016). Genetic Genealogy in Practice. Arlington, VA: National Genealogical Society.

Fu, Q., Posth, C., Hajdinjak, M., Petr, M., Mallick, S., Fernandes, D., Furtwängler, A., Haak, W., Meyer, M., Mittnik, A. and Nickel, B., 2016. The genetic history of ice age Europe. Nature, 534(7606), p.200.

Wikipedia commons (2019) at: https://en.wikipedia.org/wiki/Mitochondrial_DNA (accessed 05.03.19)

Yang, M.A. and Fu, Q., 2018. Insights into modern human prehistory using ancient genomes. Trends in Genetics, 34(3), pp.184-196.

Bibliography:

International Society of Genetic Genealogy at: https://isogg.org/wiki/Ancient_DNA

Genealogical DNA test at: https://en.wikipedia.org/wiki/Genealogical_DNA_test#Y_chromosome_(Y-DNA)_testing

US National Library of Medicine at: https://ghr.nlm.nih.gov/primer/basics/howmanychromosomes

Are SNPs and alleles the same thing? From Stack Exchange at: https://biology.stackexchange.com/questions/57442/are-snps-and-alleles-the-same-thing

Autosomal DNA test from UCL at: https://www.ucl.ac.uk/mace-lab/debunking/understanding-accordion/autosomal-test

Autosomal DNA, Ancient Ancestors, Ethnicity and the Dandelion, by Roberta Estes at: https://dna-explained.com/2013/08/05/autosomal-dna-ancient-ancestors-ethnicity-and-the-dandelion/