History, Genetics and Statistics – Delhi Science Forum

The study of the past has never been an easy exercise. We earlier had two sources of data – textual and archaeological. All this had to be fitted into patterns for a coherent and consistent view of the past – this was the core enterprise of history. Of course, this still leaves open the question of how to look at these patterns – the framework of what constitutes history. Is it an account of various rulers and dates, or an account of the people, how they lived and what they did? Is it an inquiry into what caused changes in society?

All this brought to the fore that history is not simply a value-free exercise of exploring our past; it also reflects how you view society today and, importantly, what kind of society you want to build.

In this contested territory of history, new tools have now made their entry. Increasingly, genetics is being harnessed to analyse human populations; ancient and not so ancient migration patterns are sought to be teased out. This brings in the second set of tools required to analyse human populations and their genetic data – statistical tools. Historians who have spent a lifetime understanding history through archaeological, textual and historical-linguistic evidence are often led to discard the tools altogether, since the tools sometimes present a picture at variance with the archaeological and other evidence. Some of the most distinguished historians in India, and I am sure elsewhere, have expressed their discomfort with people with little knowledge of history postulating various theories based on obscure and incomprehensible genetic and statistical tools.

It is important here to understand what the tools can do and what they cannot do. It is also important for those who do not understand these tools to recognise that the tools by themselves cannot conclusively provide a narrative of the past. Any such set of tools can give multiple possibilities. Deciding which one represents the past needs additional corroborative evidence, and it is only with such evidence that we can come to some tentative conclusions.

As we are aware, we all carry within us a genetic code. There are four chemical compounds called bases – Adenine, Cytosine, Guanine, and Thymine – generally referred to by the letters A, C, G and T. Each of these bases can line up with another in pairs (A can only pair with T and C only with G) to form a string of base pairs or gene sequences, constituting the genetic code. There are over 3 billion such letters in the human genetic code. Though humans are 99.9% identical to each other, in a genetic code of 3 billion letters even a tenth of a percent of difference translates into three million changes in ‘spellings’. It is these differences in the genetic code that we look at in mapping the human population.
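The pairing rule and the arithmetic above can be checked in a few lines of Python. This is only an illustrative sketch: the helper name `complement` and the constants are chosen here for the example, and the figures are the round numbers quoted in the text, not measured values.

```python
# Pairing rule from the text: A pairs only with T, and C only with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for every base in the strand (hypothetical helper)."""
    return "".join(PAIRS[base] for base in strand)


# Round figures quoted in the text (assumptions for illustration).
GENOME_LETTERS = 3_000_000_000   # about 3 billion letters
DIFFERENT_PER_THOUSAND = 1       # 99.9% identical means 1 letter in 1,000 differs

# Integer arithmetic keeps the result exact.
expected_differences = GENOME_LETTERS * DIFFERENT_PER_THOUSAND // 1000

print(complement("GATTACA"))     # CTAATGT
print(expected_differences)      # 3000000
```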

If we look at the differences between the genetic sequences of individuals or of groups of people, we can see what the differences are. If we also know the rate of change in the DNA sequences per generation and the number of years per generation, we can then work back to when they had common ancestors.
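This dating argument can be written as a small calculation. The sketch below assumes the simplest possible "molecular clock": changes accumulate at a constant rate on both lines of descent, so two sequences that split t generations ago differ at about 2 × rate × sites × t positions. The function name and all the input numbers are hypothetical, chosen only to make the arithmetic visible; they are not taken from any study.

```python
def years_to_common_ancestor(differences: int,
                             sites_compared: int,
                             rate_per_site_per_generation: float,
                             years_per_generation: float) -> float:
    """Estimate the time since two sequences shared an ancestor (strict-clock assumption)."""
    if (sites_compared <= 0 or rate_per_site_per_generation <= 0
            or years_per_generation <= 0):
        raise ValueError("sites, rate and generation length must be positive")
    # Both lineages accumulate changes independently, hence the factor of 2.
    differences_per_generation = 2 * rate_per_site_per_generation * sites_compared
    generations = differences / differences_per_generation
    return generations * years_per_generation


# Illustrative inputs only: 50 differences over one million compared sites.
estimate = years_to_common_ancestor(50, 1_000_000, 2.5e-8, 25)
# Same data, but with an assumed rate of change half as fast.
estimate_slower_clock = years_to_common_ancestor(50, 1_000_000, 1.25e-8, 25)

print(round(estimate))               # about 25,000 years
print(round(estimate_slower_clock))  # about 50,000 years
```

Halving the assumed rate doubles the estimated date, so an estimate of this kind is only as good as the rate and generation length fed into it.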

The DNA variations that we inherit are of three types. One is through the DNA sequences that are inherited as two copies, one from each parent. These are called autosomal DNA sequences. A second DNA inheritance is through the DNA sequences in the Y chromosome, which are inherited from father to son and represent a record of purely paternal inheritance. The third type of DNA inheritance that can be traced is in the DNA sequences of the mitochondria, which carry their own independent DNA sequences and are inherited only from the mother. Therefore, with such genetic studies we can also analyse the differences between the paternal and the maternal population.

The population studies have addressed the following questions:

1) Did agriculture spread through cultural transmission, with the hunter-gatherers taking up agriculture from the agriculturists, or through demographic expansion of the agriculturists?

2) What are the migrations that took place in the past?

These have been addressed not only for India, but also for other regions. Cavalli-Sforza and his colleagues have done pioneering studies on these questions, as have many others. The broad picture is quite clear: human populations came out of Africa, and agriculture started in a crescent from Anatolia to the Tigris-Euphrates basin, the Fertile Crescent. Nevertheless, a number of questions remain unanswered. The issues have been further confounded by a recent study indicating that the rate of change we have assumed per generation may have been faster than the actual rate, and that a revision of the dates derived by such genetic methods is therefore in order. Again, this indicates the need to use such studies in conjunction with other evidence and not in isolation.

One such question is when the human population came out of Africa. It is now clear that while the modern human population came out of Africa about 80,000-100,000 years ago, it did mix genetically with Neanderthals (who themselves had originated from Africa 500,000 years back), even though this genetic flow was small.

One of the major questions has been how agriculture spread. Was it demographic expansion (demic expansion), with agriculturists extending agriculture and expanding their numbers, or did the hunter-gatherers take up agriculture as they came in contact with the agriculturists?

I am not going to suggest here that the debate is settled in favour of demic diffusion models, even though there is increasing evidence to support them. I will focus instead on what the tools are and why such tools may not be able to distinguish between a scenario in which people migrated and one in which only the genes migrated.

One of the methods used in such studies is to identify the positions in the genetic code that differ between people – what are called single nucleotide polymorphisms (SNPs). If we plot these SNPs on a geographical map and look at variations across populations and space, we will see that the variations are larger in one direction than in others. The technique for finding such axes of variation is called Principal Component Analysis. The direction of the largest variation can then be thought of as a migration path – a set of people migrating along this direction. On a map of Eurasia, this axis would then be a possible pathway for the migration of Neolithic farmers.
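A minimal sketch of the idea, using simulated rather than real data, is given below. It assumes 40 people spread evenly along a line and 200 SNPs whose frequencies rise steadily along that line; every number and function name here is hypothetical. The first principal component is found by power iteration in plain Python, a simple stand-in for the specialised software used in actual studies.

```python
import random


def simulate_genotypes(n_people, n_snps, rng):
    """Simulate people spread along a line, with SNP frequencies rising along it."""
    positions = [i / (n_people - 1) for i in range(n_people)]
    # Each SNP gets its own (frequency at position 0, rise by position 1).
    snps = [(rng.uniform(0.1, 0.4), rng.uniform(0.2, 0.5)) for _ in range(n_snps)]
    genotypes = []
    for x in positions:
        row = []
        for start, rise in snps:
            freq = start + rise * x
            # Two inherited copies: count how many carry the variant (0, 1 or 2).
            row.append(int(rng.random() < freq) + int(rng.random() < freq))
        genotypes.append(row)
    return positions, genotypes


def first_component_scores(matrix, rng, iterations=200):
    """Each person's score on the first principal component (up to scale and sign)."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in matrix]
    # Person-by-person Gram matrix; its leading eigenvector is proportional
    # to the first-component scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):  # power iteration
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
positions, genotypes = simulate_genotypes(40, 200, rng)
scores = first_component_scores(genotypes, rng)
# The sign of a principal component is arbitrary, so compare absolute values.
pc1_vs_position = abs(correlation(positions, scores))
print(round(pc1_vs_position, 2))
```

In this simulation the first component tracks each person's position along the line closely. The calculation recovers the gradient that was built into the data; it does not say how that gradient arose.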

Cavalli-Sforza and his colleagues have postulated that this is what happened and that the genetic variation along the major axis of such variations is a record of this demic diffusion. The problem here is that even if we assume only a small amount of gene flow between local populations, how different would the population genetic map be from that produced by people migrating? In other words, is it possible that a certain amount of local gene flow combined with cultural transmission of agriculture would still produce population genetic maps very similar to those we would get from a migration of agriculturists?

It is here that the statistical tools must be used with a great degree of caution. There are two sets of tools that are used. One is to take the current set of data that we have and then try to find a statistical model that would best fit the data. The other is to simulate different sets of ancient populations, provide some kind of mixing, and then work out which of these combinations and mixings approximate what we see on the ground. With computers, obviously much of this is done by algorithms and modelling tools, and the researcher may not have a good feel for what is happening in these number-crunching exercises.
Those familiar with such tools know that we can get a good model out of our data, but such models are not unique. Varying certain parameters, using a different algorithm and so on may get us a different model. Such models may therefore be artefacts of our calculation methods and not real. In such a scenario, it is imperative that supporting evidence be used in order to come to any definitive conclusions.
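A toy simulation makes this non-uniqueness concrete. The sketch below is not the method of any study mentioned here. It sets up two deliberately simple, hypothetical models of how a genetic marker could spread from a source population across ten locations: one in which people move and one in which they stay put and only exchange genes with their neighbours. It then tunes the second model to match the first. All parameter values are assumptions picked for illustration.

```python
def demic_cline(n_locations, local_share):
    """People move: farmers advance one location at a time, each time absorbing
    a fixed share of local hunter-gatherers who lack the farmers' marker."""
    return [(1 - local_share) ** k for k in range(n_locations)]


def neighbour_flow_step(freq, exchange):
    """People stay put: one generation in which every location swaps a share of
    its gene pool with its immediate neighbours. Location 0 is held fixed
    (assumed to be a large, stable source population)."""
    new = freq[:]
    last = len(freq) - 1
    for k in range(1, len(freq)):
        left = freq[k - 1]
        right = freq[k + 1] if k < last else left  # the end has only one neighbour
        new[k] = (1 - exchange) * freq[k] + exchange * (left + right) / 2
    return new


def rms_difference(a, b):
    """Root-mean-square gap between two frequency profiles."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


# Hypothetical parameter values, for illustration only.
N_LOCATIONS = 10
LOCAL_SHARE = 0.2   # share of locals absorbed at each step of the migration
EXCHANGE = 0.1      # share of the gene pool swapped with neighbours per generation

target = demic_cline(N_LOCATIONS, LOCAL_SHARE)

# "Fit" the rival model by choosing the number of generations that matches best.
freq = [1.0] + [0.0] * (N_LOCATIONS - 1)
best_generations, best_rms, best_cline = 0, rms_difference(target, freq), freq
for generation in range(1, 1001):
    freq = neighbour_flow_step(freq, EXCHANGE)
    rms = rms_difference(target, freq)
    if rms < best_rms:
        best_generations, best_rms, best_cline = generation, rms, freq

print(best_generations, round(best_rms, 3))
print([round(f, 2) for f in target])
print([round(f, 2) for f in best_cline])
```

In both models the marker frequency falls steadily with distance from the source, so the shape of the gradient alone does not tell us which process produced it.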

However, one definite conclusion can be arrived at when we look at the genetic data. Farming spread from the Fertile Crescent to Europe and South Asia and then further to East Asia and Southeast Asia. The genes associated with such Neolithic farmers show a major expansion, clearly reflecting the higher density of population that such agriculture could support. It has also been shown that the maternal population tends to have a higher proportion of Mesolithic genetic composition than the paternal population.
When we come to South Asia, we again find that there is a significant difference between North India and South India, particularly amongst the men. Here again, the evidence is consistent with the Ancient North Indian (ANI) founders having entered South Asia from the West Asia/Iran side around 10,000-12,000 years ago. As this is also the period in which agriculture enters India (Mehrgarh near the Bolan pass being one such centre), it would be consistent with an expansion of the ANI population along with agriculture spreading in North India, and therefore the spread of ANI genes. Again, we will leave out how much of it is demic expansion and how much of it is cultural transmission combined with local gene flows.

Such statistical models are not restricted to human population studies; they are also being used in historical linguistics. In linguistics, changes in vocabulary can be used in a similar way to genetic drift to map out possible dates when language families split. A recent study (Mapping the Origins and Expansion of the Indo-European Language Family, Remco Bouckaert and others, Science, 24 August 2012) has done such an analysis of languages to try to map out when language groups split and whether we can work out a map of language migration from such an exercise. This exercise shows the probable origin of the Indo-European group of languages to be Anatolia, with the Indo-Iranian group breaking off from the larger Indo-European family around 4,000-6,000 years back.
If we take all this evidence together, along with the larger archaeological evidence, it is clear that the Indo-European language family in the form of Vedic Sanskrit did not enter India with agriculture. The major expansion of agriculture that took place, including Mohenjo-daro and Harappa, does coincide with an ANI population, but it predates the split of Avestan-Vedic Sanskrit from the main Indo-European family.

Romila Thapar had postulated that the speakers of Vedic Sanskrit were not large in number and that the spread of the language was due to elite domination. A set of such speakers came, had the use of iron, used horses and chariots, and were able to establish themselves at the top of the existing hierarchy in India. The spread of language is not the same as the spread of genes.
Again, there is genetic evidence that caste groups in India have had low gene flows across caste boundaries for a long time (Reconstructing Indian Population History, David Reich and others, Nature, September 2009); this would indicate that castes have existed in India even at the time of the Harappan civilisation. Therefore, the picture that we have had from the largely archaeological evidence is not challenged by the new genetic and historical linguistic evidence, even if the tools used appear completely alien to historians.

Finally, those who are living in the past and would like to postulate that there has been no invasion of Vedic Sanskrit speakers from outside have to contend with the even more daunting task of proving that Indo-European speakers originated in India, and of finding genetic and historical linguistic evidence for that, apart from archaeological evidence. There is no such evidence, and nit-picking on the detailed picture being built by archaeology, historical linguistics, written texts, genetics and statistical tools will not get them their Aryan homeland in India.