BREADR (Biological Relatedness from Ancient DNA in R) is a Bayesian method for identifying the degrees of relatedness between pairs of individuals, especially in cases of extremely low-coverage sequence data (as low as 0.04X). BREADR leverages the so-called pairwise mismatch rate (PMR), a measure of how much the genotypes of individuals do not match, to do this. BREADR assumes an underlying binomial model for the proportion of mismatched genotype calls, and calculates posterior probabilities for degrees of relatedness up to the second-degree (third-degree or higher are defined as “Unrelated”). BREADR works with the Eigenstrat data format as an initial input, allowing researchers to control for factors such as contamination before BREADR is used.
We have implemented BREADR in the R statistical programming language, and include methods for assigning degrees of relatedness (up to the second-degree), testing for whether separate groups of individuals can be merged into one analysis, for visualising both pairwise and overall relatedness, for sensitivity analyses for changes in the prior distributions, and for testing if degrees of relatedness up to the tenth-degree are consistent with the observed PMR. This basic example shows you how to analyse a real (anonymised) ancient DNA data set.
For a comprehensive description of the underlying model used by BREADR, please see the manuscript. However, we give a brief overview of the concept here. Consider two individuals \(i\) and \(j\), with \(N_{ij}\) independent sites in the genome for which genotype calls have been made. We count the number of genotypes which do not match, and denote this \(X_{ij}\). We assume that \(X_{ij}\) will follow a binomial distribution \(X_{ij} \sim B(N_{ij},p_{ij})\), that \(p_{ij}\) is defined by how closely related the two individuals are.
It is known that if the expected PMR of two unrelated individuals is \(\bar{p}\), then the expected PMR for individuals how are related in the \(k^{\text{th}}\) degree will be
\[ p_{k} = \bar{p}\left( 1 - \dfrac{1}{2^{k+1}} \right). \]
Since the probability mass function for a binomial distribution is well known, we can calculate the probability of observing the number of mismatched genotypes for \(k=0\) (same/twins), \(k=1\) (first-degree), \(k=2\) (second-degree), and then a Poisson-weighted sum of the remaining values of \(k\) to capture the third-degree or less related probability.
When we have the probability of observing the number of mismatches we see, given a value of \(k\), denoted \(L(X_{ij} | K=k,N_{ij})\), and assuming a (user-defined) prior probability of relatedness, we then calculate a posterior probability of the degree of relatedness being \(k\), given the observed PMR.
An analysis would normally start by defining paths to all three Eigenstrat files, i.e.,
ind_path <- "path/to/eigenstrat/indfile"
snp_path <- "path/to/eigenstrat/snpfile"
geno_path <- "path/to/eigenstrat/genofile"and we would then preprocess this data using the processEigenstrat() function,
Since this step is the most computationally expensive, we include an option to automatically save the output as a TSV upon completion of the preprocessing, i.e, to avoid repeating the preprocessing
counts_example <- processEigenstrat(
indfile = ind_path,
snpfile = snp_path,
genofile = geno_path,
outfile = "path_to_save_tsv"
)Also included are options to change the minimum distance between overlapping sites (filter_length) from the default \(1\times 10^5,\) an option to only include individuals (pop_pattern) from certain populations (as defined in the ind file) and an option to remove C->T and G->A SNPs to avoid the effects of deamination.
A new feature is the ability to process only pairs within the same populations (as per the populations defined in column three of the Eigenstrat ind file). This can be useful when populations should not be considered together as, for example, they may never have been able to be closely related due to vast distances in time or geography, and/or might have a different baseline PMR. When running the processEigenstrat() function, by setting the parameter byPop=TRUE, only individuals in the same population will have a PMR calculated. In the example data, individuals Ind1-Ind4 are in “Pop1” and individuals Ind5-Ind6 are in “Pop2”. This means that while each genotype of each individual will be read into BREADR, only the four individuals in Pop1 are compared to each other (\(4\times 3/2 = 6\) pairs) and only the two individuals in Pop2 are compared to each other (\(2\times 1/2 =1\) pairs), totaling 7 PMR calculations. This is less than the \(6\times 5/2 = 15\) pairs that would be calculated by default, more than halving the number of calculations. This saving in computational effort will be greater for large sample sizes. We also include a related parameter minMerge which defines the smallest possible “group” (default equals 2). This is to control the smallest group sizes as estimating a baseline PMR can be difficult for small groups (such as in Pop2 in the example data), and any groups with a sample size smaller than this parameter are merged into one group called “Merged” (hence you should avoid this as a population name in your Eigenstrat ind file when using this parameter).
Note that since the Eigenstrat files are too large to include, we provide a pre-processed data set called counts_example which is the product of running processEigenstrat() on real data (although the raw Eigenstrat files can be found on the Github page).
We now load the counts_example tibble.
#> pair nsnps mismatch pmr
#> 1 Ind1 - Ind2 1518 310 0.2042161
#> 2 Ind1 - Ind3 9435 2093 0.2218336
#> 3 Ind1 - Ind4 8283 1854 0.2238319
#> 4 Ind1 - Ind5 1336 314 0.2350299
#> 5 Ind1 - Ind6 2242 481 0.2145406
#> 6 Ind2 - Ind3 9119 1988 0.2180064
#> 7 Ind2 - Ind4 7984 1699 0.2128006
#> 8 Ind2 - Ind5 1179 270 0.2290076
#> 9 Ind2 - Ind6 1965 423 0.2152672
#> 10 Ind3 - Ind4 20952 2253 0.1075315
#> 11 Ind3 - Ind5 7994 1703 0.2130348
#> 12 Ind3 - Ind6 10994 2398 0.2181190
#> 13 Ind4 - Ind5 6924 1451 0.2095609
#> 14 Ind4 - Ind6 9745 2141 0.2197024
#> 15 Ind5 - Ind6 1739 383 0.2202415
We now also include a diagnostic plot to consider whether populations can be merged. We may wish to merge populations (we use population instead of group to match the Eigenstrat nomenclature for the third column of the IND file) if we have a small number of individuals, and wish to increase the pairwise sample size.
The plotDOUGH() function takes as input your processEigenstrat() output (the raw counts) and produces side-by-side box plots/violin plots, comparing the PMR values per population, with associated non-parametric tests for whether the median PMR (which we use as a baseline level of relatedness by default in BREADR) differs between populations. When employing this function the user can set parameters for the font size on the x-axis (fntsize, default equals 7), the minimum number of overlapping SNPs for a pair to be included in the tests and plot (nsnps_cutoff, default is 500) and removeBetween. removeBetween forces the test to ignore the between-population PMR values, which may be useful.
In cases where there are more than two populations, BREADR first performs a Kruskal-Wallis test to see if any population medians significantly differ (p-value in top-left), and then performs pairwise posthoc Dunn’s tests to identify which population medians are significantly different, with p-value adjustments performed using the Holm-Bonferroni method. Populations with significantly differently medians are then identified in the plot as bars above the population (with p-values). If there are only two population (and removeBetween is set to TRUE) then a Mann-Whitney U-test is performed and no bars will be shown even if the groups are significantly different, so the p-value in the title must be used to assess significance. If two populations are not significantly different, then they can, in theory, be merged.
We run plotDOUGH() on our example in the following way:
indfile1 <- fs::path_package(package = "BREADR", "extdata/example.ind.txt")
plotDOUGH(
counts_example,
indfile = indfile1,
nsnps_cutoff = 500
)In this example (Figure 1), the Kruskal-Wallis test returns a p-value of 0.57 (>0.05), indicating that there is no evidence that any population median is different, and hence our two populations can be merged. However, these non-parametric tests can lack statistical power for very small sample sizes, as might be the case here. However, as an example of when there is a difference, we modified the example data such that Pop1 and Pop2 cannot be merged. (Note that the “Between” groups PMR values are not significantly different from Pop1 or Pop2, but this is only because they are the average of the two groups, and in some sense should be ignored here).
By default, BREADR sets the prior distribution for the four possible relatedness assignments as equally likely, or a uniform prior of \(P(K=k) = \frac{1}{4}\). This is likely not reasonable in a random sample from a true population. However, an uninformative prior makes the least assumptions, and on a practical level, the PMR associated with same/twin or first-degree can often be overwhelmingly obvious for a sufficient number of overlapping sites. Further, some studies will contain random samples from populations, and hence twins will be very rare. Others samples may be from collective burials (for example), in which case elements from the same individuals may be extremely likely. Hence, we leave it to the user to decide on a sensible prior distribution, but concede that this can be difficult.
To address this we include a function called priorSensitivity() to allow researchers to easily explore how sensitive the results of callRelatedness() are to the choice of the prior distribution. The function forms a grid of different prior probabilities to test, and then for each degree of relatedness, at each prior value, calculates the proportion of times each degree of relatedness would have been the “assigned” relationship (when all other priors are varied). Consider the example for Ind1 and Ind2 (row one) in the example data (Figure 6).
BREADR::priorSensitivity(
in_BREADR = relatedness_example,
row = "Ind1 - Ind2",
degrees = c(3, 4),
grid_space = 0.05,
maxPrior = 0.5
)
#> Starting sensitivity analysis (633 separate analyses)....
#> " | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |=== | 4% | |=== | 5% | |==== | 5% | |==== | 6% | |===== | 6% | |===== | 7% | |===== | 8% | |====== | 8% | |====== | 9% | |======= | 9% | |======= | 10% | |======= | 11% | |======== | 11% | |======== | 12% | |========= | 12% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 16% | |============ | 17% | |============ | 18% | |============= | 18% | |============= | 19% | |============== | 19% | |============== | 20% | |============== | 21% | |=============== | 21% | |=============== | 22% | |================ | 22% | |================ | 23% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 27% | |=================== | 28% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |===================== | 31% | |====================== | 31% | |====================== | 32% | |======================= | 32% | |======================= | 33% | |======================= | 34% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |============================= | 42% | |============================== | 42% | |============================== | 43% | |============================== | 44% | |=============================== | 44% | |=============================== | 45% | |================================ | 45% | |================================ | 46% | |================================= | 47% | |================================= | 48% | |================================== | 48% | |================================== | 49% | |=================================== | 49% | |=================================== | 50% | |=================================== | 51% | |==================================== | 51% | |==================================== | 52% | |===================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 55% | |======================================= | 56% | |======================================== | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |========================================== | 61% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 66% | |=============================================== | 67% | |=============================================== | 68% | |================================================ | 68% | |================================================ | 69% | |================================================= | 69% | |================================================= | 70% | |================================================= | 71% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 79% | |======================================================== | 80% | |======================================================== | 81% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 82% | |========================================================== | 83% | |========================================================== | 84% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================= | 94% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 99% | |======================================================================| 100%
#> Collating results....
#> Done.This output can be difficult to interpret at first, and so we explain it here in detail. From Figure 5 we can see two panels: 2nd-Degree and Unrelated, both with a y-axis from 0% to 100%, and an x-axis that goes from 0 to the (default) maximum tested prior probability of 0.5. For the prior probability value along the x-axis, we see a bar. The way to interpret the first bar (at \(x=0.05\)) is: when the prior probability of Unrelated is fixed at 0.05, and across all other values that the other three prior values can be (so that all four prior probabilities sum to one), we see that the relationship between Ind1 and Ind2 is returned as “Unrelated” approximately 4% of time, “2nd-Degree” approximately 96% of the time, and never as “Same/Twins” or “1st-Degree”, as we never see the colours associated with them. We can see that the next bar (\(x=0.1\)) sees “2nd-degree” assigned most often, but less often than before. And this trend continues. A sensible interpretation of this is that as the prior probability of “Unrelated” increases, we see that the relationship between Ind1 and Ind2 is more likely to be called “Unrelated”, which makes sense. We also see that we would need a prior probability of unrelated of at least 0.25 to call an unrelated relationship.
Since Same/Twins and 1st-degree is never called, the top panel for 2nd-Degree effectively shows the inverse of the bottom panel. Unfortunately this sensitivity analysis shows us that the outcome of BREADR is strongly affected by the prior distribution here, which makes sense since it appears as if the relationship is potentially third-degree.
To give an example of when a degree of relatedness is not affected by the choice of prior probabilities, we produce a sensitivity analysis for Ind3 and Ind4 (Figure 7), who are unambiguously Same/Twins with posterior probability one (to machine precision). We also decrease the grid spacing (grid_space=0.025) to have a finer-scale analysis (at the cost of computational time). Here we can see that no matter how we set the prior distribution for Same/Twins, even as low as 0.025, we still always assign Same/Twins. Hence, this assignment is free of any affect from the prior distribution.
BREADR::priorSensitivity(
in_BREADR = relatedness_example,
row = "Ind3 - Ind4",
degrees = c(1),
grid_space = 0.025,
maxPrior = 0.5
)
#> Starting sensitivity analysis (5263 separate analyses)....
#> " | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |== | 4% | |=== | 4% | |=== | 5% | |==== | 5% | |==== | 6% | |===== | 6% | |===== | 7% | |===== | 8% | |====== | 8% | |====== | 9% | |======= | 9% | |======= | 10% | |======= | 11% | |======== | 11% | |======== | 12% | |========= | 12% | |========= | 13% | |========= | 14% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 16% | |============ | 17% | |============ | 18% | |============= | 18% | |============= | 19% | |============== | 19% | |============== | 20% | |============== | 21% | |=============== | 21% | |=============== | 22% | |================ | 22% | |================ | 23% | |================ | 24% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 26% | |=================== | 27% | |=================== | 28% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |===================== | 31% | |====================== | 31% | |====================== | 32% | |======================= | 32% | |======================= | 33% | |======================= | 34% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |============================= | 42% | |============================== | 42% | |============================== | 43% | |============================== | 44% | |=============================== | 44% | |=============================== | 45% | |================================ | 45% | |================================ | 46% | |================================= | 46% | |================================= | 47% | |================================= | 48% | |================================== | 48% | |================================== | 49% | |=================================== | 49% | |=================================== | 50% | |=================================== | 51% | |==================================== | 51% | |==================================== | 52% | |===================================== | 52% | |===================================== | 53% | |===================================== | 54% | |====================================== | 54% | |====================================== | 55% | |======================================= | 55% | |======================================= | 56% | |======================================== | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |========================================== | 61% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================ | 64% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 66% | |=============================================== | 67% | |=============================================== | 68% | |================================================ | 68% | |================================================ | 69% | |================================================= | 69% | |================================================= | 70% | |================================================= | 71% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |=================================================== | 74% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 79% | |======================================================== | 80% | |======================================================== | 81% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 82% | |========================================================== | 83% | |========================================================== | 84% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================= | 94% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 99% | |======================================================================| 100%
#> Collating results....
#> Done.The priorSensitivity() function can also be adjusted with input parameters. First, we can focus on any combination of the degrees of relatedness, and a vector containing any combination of 1, 2, 3 or 4 will specify Same/Twin, 1st-Degree, 2nd-Degree or Unrelated to be explored, respectively. For example, to look at just the affect on calling Same/Twins, set degree=1, or for 2nd-Degree and Unrelated, degree=c(3,4). Second, the maximum value for the prior probability (the upper bound of the x-axis) can be changed using the maxPrior parameter (default is 0.5). Finally, grid_space defines the space between tested prior probability values (the width of the bars on the x-axis). The smaller this number, the more bars are calculated, but this comes at the expense of computational time (default is 0.05).
The authors wish to thank Beatriz Amorim for help testing the functionality of BREADR, and Prof. Nigel Bean and Dr. Vincent Braunack-Mayer for enlightening and instructive conversations.