It is nowadays increasingly feasible to conduct genetic analyses in polyploid populations thanks to developments in genotyping technologies as well as tools designed to deal with these genotypes. `polyqtlR`

is one such tool that provides a number of functions to perform quantitative trait locus (QTL) mapping in F1 populations of outcrossing, heterozygous polyploid species.

`polyqtlR`

assumes that a number of prior steps have already been performed - in particular, that an integrated linkage map for the mapping population has been generated. The package has been developed to utilise the output of the mapping software `polymapR`

, although any alternative mapping software that produces phased and integrated linkage maps could also be used. However, the input files may have to be altered in such cases. Currently, bi-allelic SNP markers are the expected raw genotypic data, which is also the data format used by `polymapR`

in map construction. However, some functions can also accept probabilistic genotypic data, for example in the estimation of IBD probabilities. Full functionality of probabilistic genotypes in the package has yet to be implemented but is planned for future releases. The assignment of marker dosages in polyploid species can be achieved using a number of R packages such as `fitPoly`

[1]. More background on the steps of dosage-calling and polyploid linkage mapping can be found in a review of the topic by Bourke *et al.* (2018) [2].

The QTL analysis functions primarily rely on identity-by-descent (IBD) probabilities, which are the probabilities of inheritance of particular combinations of parental haplotypes. Single-marker QTL detection methods are also available. The package was originally developed with the IBD output of TetraOrigin [3] in mind. However, the TetraOrigin package is currently only applicable to tetraploid datasets, and has been implemented in the proprietary software *Mathematica*. Therefore, alternative options to estimate IBD probabilities within R for multiple ploidy levels are offered in `polyqtlR`

.

Genotyping errors are a regular feature of modern marker datasets. Although linkage mapping software such as `polymapR`

has been shown to be relatively robust against genotyping errors [4], if present in sufficiently large proportions (~10 % errors or more) issues may arise with map accuracy and QTL detection. Therefore, a number of functions have been included in `polyqtlR`

to check map accuracy using the estimated IBD probabilities, and to re-impute marker genotypes if required. These imputed genotypes may then be used in an iterative process to re-estimate the linkage map and re-run a QTL analysis.

This tutorial goes through each of the steps in a typical QTL analysis using the example datasets provided with the package, outlining the different options that are available. However, it is certainly recommended to consult the documentation that accompanies each of the functions by running the `?`

command before the function name (*e.g.* `?QTLscan`

).

`polyqtlR`

`polyqtlR`

can be installed using the following call from within R:

`install.packages("polyqtlR")`

There are a number of package dependencies which should be installed, *e.g.* `Rcpp`

, `foreach`

, or `doParallel`

. Usually these will be automatically installed as well but if not, you can always install them yourself one by one, *e.g.*

```
install.packages("Rcpp")
install.packages("foreach")
install.packages("doParallel")
```

before re-trying the installation of `polyqtlR`

. Eventually when all dependencies are on your computer, you should be able to successfully run the following command (*i.e.* without any error message):

`library(polyqtlR)`

It is also a good time to load the datasets that you will need to perform a typical analysis, namely a phased maplist, a set of SNP marker genotypes (discrete dosage scores) and a set of trait data (phenotypes):

`data("phased_maplist.4x", "SNP_dosages.4x", "Phenotypes_4x")`

In the example that follows we are using a simulated tetraploid dataset with 2 chromosomes for simplicity.

Currently two options to estimate IBD probabilities in an F1 population exist within `polyqtlR`

. The first uses a heuristic approach originally developed for tetraploids but later applied to hexaploid populations [5, 6] and implemented here in the `polyqtlR::estimate_IBD`

function. This can be very computationally efficient at higher ploidy levels (and has been generalised to *all* ploidy levels), but the algorithm ignores partially-informative marker information. In datasets with large numbers of haplotype-specific markers this is not such an issue, but in datasets with fewer such markers the accuracy of the resulting IBD probabilities is compromised. The method behind TetraOrigin [3] for offspring IBD probability estimation given a known parental marker phasing has also been replicated in the `polyqtlR::estimate_IBD`

function (this is the default method used). Note that unlike the original TetraOrigin software, parental marker phasing is not re-estimated. However, this is usually a direct output of the linkage mapping procedure (*e.g.* using `polymapR`

).

As the IBD data are the most central objects in this package it is worth spending a moment describing them. In `polyqtlR`

, IBD probabilities are organised in nested list form. The first level of structure are the linkage groups. In our example dataset, the list should have 5 elements corresponding to the 5 linkage groups, each of which is itself a list containing the following elements:

**IBDtype**The type of IBD probability, either “genotypeIBD” or “haplotypeIBD”**IBDarray**A 3d array of IBD probabilities, with dimensions “locus”,“genotype class”, “individual”**map**The integrated linkage map (marker names and cM positions, in order)**parental_phase**The phasing of the markers in the parental map**marginal.likelihoods**A list of marginal likelihoods of different valencies if method “hmm” was used, otherwise`NULL`

**valency**The predicted valency that maximised the marginal likelihood, per offspring. For method “heur”,`NULL`

**offspring**The offspring names**biv_dec**Whether bivalents only (`TRUE`

) or also multivalents were allowed (`FALSE`

) in the procedure**gap**Here`NULL`

, but later this can hold a numeric value (e.g. 1 cM) if IBDs are ‘splined’ or interpolated.**genocodes**Ordered list of genotype codes used to represent the different genotype classes**pairing**Log likelihoods of each of the different pairing scenarios considered**ploidy**The ploidy of parent 1**ploidy2**The ploidy of parent 2**method**The method used, either “hmm” (Hidden Markov Model), “heur” (heuristic) or “hmm_TO” (Hidden Markov Model TetraOrigin)**error**The offspring error prior (eps) used in the calculation.

All functions within `polyqtlR`

for estimating or importing IBD probabilities will automatically return them in this form.

The `estimate_IBD`

function of `polyqtlR`

with the default setting `method = "hmm"`

estimates offspring IBD probabilities using a Hidden Markov Model (HMM) developed by Zheng et al [3] in the TetraOrigin package but generalised in `polyqtlR`

to multiple ploidy levels. Currently, diploid (2x), triploid (3x = 4x \(\times\) 2x), tetraploid (4x) and hexaploid (6x) populations are handled. `genotypes`

can either be discrete (*i.e.* integer marker dosage scores), or probabilistic (*i.e.* the probabilities of each of the ploidy + 1 possible scores). For triploids, tetraploids and hexaploids, the possibility of incorporating double reduction [7] (*i.e.* allowing for the phenomenon of multivalent pairing) is available, although this can have serious performance implications for hexaploids (and in hexaploids, the option of allowing multivalent pairing in *both* parents for the same chromosome is by default disabled as it requires very high RAM requirements (> 32 Gb). Use argument `full_multivalent_hexa = TRUE`

to allow multivalents simultaneously at the same position from both hexaploid parents).

Some of the code used to estimate these IBD probabilities has been programmed in C++ to improve performance, and relies on both the `Rcpp`

and `RcppArmadillo`

packages to provide the interface between R, C++ and the Armadillo C++ library.

We run the function, in this case allowing for multivalents (`bivalent_decoding = FALSE`

), as follows:

```
IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
genotypes = SNP_dosages.4x,
ploidy = 4,
bivalent_decoding = FALSE,
ncores = 4)
```

Note that the default setting of `bivalent_decoding`

is `TRUE`

, as the function evaluates faster if only bivalents are considered (the difference in run-time becomes more pronounced for hexaploids). However, allowing multivalents in the model gives a more complete and correct picture.

Here we chose to enable parallel processing to speed up the calculations. Note that jobs are split across linkage groups, so with 5 linkage groups it would have made more sense to use 5 cores if they were available. Use `parallel::detectCores()`

to display how many CPU cores are available, and use 1 or 2 less than this at the very most. In this dataset there are only 2 linkage groups, so no more than 2 cores are needed.

`nc <- parallel::detectCores() - 1`

If so desired, the original TetraOrigin package can be used to estimate IBD probabilities (again using a Hidden Markov Model) rather than using the `estimate_IBD`

function. The results should be identical for tetraploids, while `polyqtlR`

also offers the flexibility of handling multiple ploidy levels (2x, 3x, 4x and 6x).

TetraOrigin produces by default a large .txt output file after running the function “inferTetraOrigin”. However, it is convenient to produce a summary of this output using the “saveAsSummaryITO” function, which produces a .csv file summarising the results. Information on how to use TetraOrigin is provided with the package itself, downloadable from GitHub.

The function `import_IBD`

imports the .csv output of TetraOrigin::saveAsSummaryITO. It takes a number of arguments, such as `folder`

, which is the folder containing the output of TetraOrigin. For example, if we wanted to import IBDs from the folder “TetraOrigin” for a species with 5 linkage groups, we would run the following:

```
IBD_4x <- import_IBD(folder = "TetraOrigin",
filename.vec = paste0("TetraOrigin_Output_bivs_LinkageGroup",1:5,"_Summary"),
bivalent_decoding = TRUE)
```

Here it is assumed that the TetraOrigin summary filenames are “TetraOrigin_Output_bivs_LinkageGroup1_Summary.csv” *etc.* If all goes well, the following messages will be printed per linkage group:

```
Importing map data under description inferTetraOrigin-Summary,Genetic map of biallelic markers
Importing parental phasing under description inferTetraOrigin-Summary,MAP of parental haplotypes
Importing IBD data under description inferTetraOrigin-Summary,Conditonal genotype probability
```

In cases where the results are needed quickly, or where there are very large numbers of markers, or for ploidy levels above 6, it is convenient to use the heuristic approach to IBD estimation. We do this using the `estimate_IBD`

function as follows:

```
IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
genotypes = SNP_dosages.4x,
method = "heur",
ploidy = 4)
```

Note that the attribute `IBDtype`

is now `haplotypeIBD`

, referring to the fact that these IBDs are the probabilities of inheriting each parental haplotype at a locus, as opposed to the probabilities of inheriting particular combinations of parental alleles at a locus (`genotypeIBD`

). Although similar, certain downstream functions such as `exploreQTL`

can only work with the former (although you will still be able to visualise QTL allele effects with the `visualiseQTL`

function). The accuracy of these IBDs is generally *lower* than if `method = "hmm"`

had been used (and therefore the power of subsequent QTL analyses will be somewhat reduced).

Particularly at higher ploidy levels (6x), it becomes computationally expensive to estimate IBDs using the `hmm`

method. Our experience has shown that for hexaploids with a mapping population size of 400 or more individuals running on 4 cores, about 200 - 300 markers per chromosome can be accommodated on an “above-average” desktop computer (16 GB RAM). Running on a single core will reduce the committed memory, but the evaluation time will therefore be much longer. 250 markers per chromosome should be already enough to estimate IBDs with high accuracy - including extra markers over this amount will add incrementally less information (following the law of diminishing returns).

The function `thinmap`

can be used to make a selection of markers that tries to maximise their distribution across the genome and across parental homologues. The argument `bin_size`

is used to specify the size of the bins in which markers are selected - increasing this results in fewer markers being used. The ideal `bin_size`

is 1 cM, although at higher ploidies and with higher marker densities, wider bins may be needed. The function is called as follows:

```
thinned_maplist.4x <- thinmap(maplist = phased_maplist.4x,
dosage_matrix = SNP_dosages.4x)
```

```
##
## 87 markers from a possible 93 on LG1 were included.
##
## 89 markers from a possible 93 on LG2 were included.
```

The object `thinned_maplist.4x`

can then be used in place of `phased_maplist.4x`

in a call to `estimate_IBD`

.

Regardless of how they were generated, IBD probabilities are estimated at all marker positions provided. When we perform an initial scan for QTL, it is often more efficient to look for QTL at a grid of evenly-spaced positions (such as at every 1 cM). This is because the genotype data has been replaced with multi-point IBD estimates at the marker positions. Information at *e.g.* ten different markers within a 0.5 cM window is virtually identical and therefore just one representative for this locus should approximate the full information, while reducing the number of statistical tests. This becomes particularly relevant when we generate significance thresholds using permutation tests later, which are computationally quite demanding.

It is therefore recommended to interpolate the probabilities using the `spline_IBD`

function as follows:

```
IBD_4x.spl <- spline_IBD(IBD_list = IBD_4x,
gap = 1) #grid at 1 cM spacing
```

Before performing a QTL analysis, it is a good idea to visualise the IBD probabilities as these can give indications about the quality of the genotypes, the linkage maps and the parental marker phasing. The function `visualiseHaplo`

was developed originally to examine the inherited haplotypes of offspring with extreme (usually favourable) trait scores to see whether their inherited alleles are consistent with a predicted QTL model (we return to this later). But as a first look at the data quality, the function is also useful.

Haplotypes of linkage group 1 of the first nine F1 offspring can be visualised as follows:

```
visualiseHaplo(IBD_list = IBD_4x,
display_by = "name",
linkage_group = 1,
select_offspring = colnames(SNP_dosages.4x)[3:11], #cols 1+2 are parents
multiplot = c(3,3)) #plot layout in 3x3 grid
```

```
## $recombinants
## NULL
##
## $allele_fish
## NULL
```

Here the quality of the predicted haplotypes appears to be high - dark colours signifying high confidence (probabilities close to 1) and small numbers of recombinations predicted. Regions of dark red signify double reduction, which were permitted during the estimation of IBDs using `estimate_IBD`

. In poorer-quality datasets, considerably more “noise” may be present, with less certainty about the correct configuration per offspring.

The function returns a list of 2 elements (`recombinants`

and `allele_fish`

) which do not concern us here but which we will return to later. Note that we selected the offspring to display using the `select_offspring`

argument - using the option `select_offspring = all`

will return the whole population. We have also opted to `display_by = "name"`

, as opposed to `display_by = "phenotype"`

, in which case further arguments are required. For convenience the plots were combined into a 3 x 3 grid using the `multiplot`

argument. For more options, including how to overlay recombination events, see `?visualiseHaplo`

.

An example of IBD probabilities that show much higher levels of uncertainty might look something like the following:

```
## $recombinants
## NULL
##
## $allele_fish
## NULL
```

Here, there are unrealistically-high numbers of recombinations predicted, suggesting that the algorithm had difficulty correctly assigning genotype classes. In such cases it may be worthwhile to re-examine the steps prior to IBD estimation, in particular genotype calling. We will return to this issue again later.

Another approach to assessing the quality of the IBD probabilities, as well as understanding the “QTL detection landscape” to some degree is to estimate the **G**enotypic **I**nformation **C**oefficient (GIC). The GIC is calculated per homologue across the whole population, and ranges from 0 (no information) to 1 (full information). To maximise QTL detection power and precision, we would like the GIC to be uniform and as high as possible [8]. Note that the GIC is only defined for bivalent pairing, and therefore estimates of GIC from a multivalent-aware HMM are based on offspring predicted to have come from a bivalent-based meiosis for that linkage group (this tends to be most of the population anyway).

We calculate the GIC as follows:

`GIC_4x <- estimate_GIC(IBD_list = IBD_4x)`

We can also visualise the output of this function as follows:

`visualiseGIC(GIC_list = GIC_4x)`