# Pseudomarker - Tutorial

Genevieve Monsees and Saunak Sen

## Introduction

This is a step-by-step tutorial for performing QTL data analysis using Pseudomarker tools written in Matlab. As an example we will use the data on salt-induced hypertension in mice collected by Dr. Bev Paigen at the Jackson Laboratory. The backcross experiment between the C57BL/6J strain and A/J strain had 250 mice whose blood pressure was measured.

## First steps

Once your data is clean, in the proper format, and stored under a common directory (see Pseudomarker data format for details), you are ready to begin your analysis. You will also have to make sure you know which directory the Pseudomarker programs were saved in.

Invoke Matlab by double-clicking the icon (in Windows on on the Mac) or by typing into a command shell (if you are in UNIX).

The first step in setting up your analysis script is to inform your
Matlab command window which directories you will be accessing by adding
a path to those directories. In our case, we need to add a path to the
directory in which the Pseudomarker programs are stored (for example
c:\pseudomarker). We can do this using the `addpath`

command:

addpath c:\pseudomarker;

You need to set the path only once each time you invoke Matlab. (Note: Placing a `%`

at the beginning of a line will comment the line out.)

In order to access our data files we will change directory to where the data is located by the following command

cd c:\qtl\data\hypertensionassuming that the data is in the directory

`c:\qtl\data\hypertension`

in the pseudomarker format.
To find out what version of Pseudomarker code you are using, type into the command window:

pseudomarker## Reading data

### Reading the data into the command window

Now that Matlab knows where to look for the programs, we want to read the data into the workspace. To do this we use the `readdata`

command. The function `readdata`

will output two variables, which we are calling y and mdata. While y is a matrix containing the information in the file `pheno.dat`

,
mdata is a 1x20 array of structures containing the information on the
marker data organized by chromosome. See the section on the data
structures for more information. The final code using the above
examples would read:

Placing a semicolon at the end of the command line prevents Matlab from displaying the output in the command window. Once we have created y and qtldata, we can identify our phenotype and covariate (if we have one).

### Identify the Phenotype and Covariate

Now that all the measurements made on each animal are in the matrix y, it is time to separate out the phenotype we want to study and covariates. In the hypertension data, there are no covariates. To identify our phenotype, we state what we would like it to be called (bp) and then specify that it is all rows (noted as : ) within a column of y (for example column 1).

bp = y(:,1);If a covariate is used, it is identified in the same manner. First we state what we would like it to be called, say covar and then specify that it is all rows ( : ) within a column of y (for example column 2), then the syntax would look like:

% do not use the following step for hypertension data% covar = y(:,2);

If a covariate is not being used and this is the case with the hypertension data, we inform Matlab by creating an empty covariate matrix covar.

covar=[];Matlab is now fully aware of all of our original data.

## Basic plots

As they say: Garbage in, garbage out. For the data analysis to be meaningful we have to make sure that data integrity has not been compromised, whether there are any artifacts in the data, and if any transformations are needed. This is best done with the help of some pictures. The most basic of them are scatterplots and histograms. In Matlab this can be accomplished by giving the command

plotmatrix( y )We can also use a *butterfly* plot of the blood pressure phenotype by giving the command

Another diagnostic plot is to plot the raw genotypes. We prefer to plot the genotype after sorting the mice according to the phenotype. This way we can see patterns. figure; plotgeno( mdata, 'genolabel', {'B6','Het','Missing'} )

Now we can take our first steps towards analysis.

## Imputation step

The goal of the imputation step is to create multiple sets of fake
marker data that are compatible with the observed marker genotype data.
These will be stored in an array of structures similar to the observed
marker data structure. For exploratory scanning, we use about 16 sets
of fake marker genotypes that are separated from each other by about
10cM. The function called `impute`

does this for us.
Assuming we want to store the imputed genotypes in the array of
structures, fake here is what you have to enter:

Type

`help impute`

for more options and help about the `impute`

function.
If you wish to impute just the first four chromosomes and the seventh, you would type instead

fake = impute( mdata, 0.1, 16, [ 1:4 7 ] );This means, "Impute 16 sets of fake marker data compatible with observed genotypes in mdata on chromosomes 1 through 4 and 7. The spacing between the pseudomarkers should be 0.1 Morgans."

## Exploratory scans

On to the analysis!

Now that we have created a fake, we can begin to calculate some of the basic quantities. All three functions which we are about to introduce will require bp, covar, mdata, fake, and obsdata as input, and will output structure arrays. For details about the structures see the section on data structures.

### Mainscan

The function `mainscan`

will run a one dimensional genome scan and store the result in a structure array. We will name this array of structures lod1.

### Plotting a Mainscan

To visualize the output of `mainscan`

we can use the command `plotmainscan`

. Optionally we can indicate the chromosomes we wish to plot, and the type of plot we desire.

The chromosomes we wish to plot must be enclosed in square brackets ([]) and separated by colons (for example [1:3:4]). If we wish to plot all chromosomes, this item can be omitted from the command line. The default is to plot all chromosomes.

There are three options for the type of plot: `'lod'`

, `'lik'`

, `'F'`

and `'prop'`

. The default is `'lod'`

which plots the LOD score. The option `'prop'`

will plot the proportion of variance explained, `'loglik'`

uses the natural base for the logarithm, and `'F'`

uses F statistics. If we wish to plot the LOD score, this item can be
omitted from the command line. If we choose to plot the LOD score for
chromosomes 1, 3, and 4 our code would look like:

However, if we chose to plot the proportion of variance for all chromosomes, we just have to enter:

plotmainscan(lod1,'prop');### Running a Pairscan

The function `pairscan`

will run a pairwise
(two-dimensional) genome scan for both additive and interaction
effects. The syntax is almost identical to that of `mainscan`

.
If we call the resulting structure array lod2 we will enter:

### Plotting a Pairscan

The function `plotpairscan`

produces a plot of the pairwise effects. It has similar options as its one-dimensional counterpart `plotmainscan`

. By default, the command

will plot the results of a two-dimensional genome scan using a LOD score (`'lod'`

)
scale. This plot is more complicated than the result of the
one-dimensional scan. The plot has two halves. The lower right-hand
triangle shows the LOD score for a two-locus model *including interactions*,
while the upper left-hand triangle shows the LOD score for the
interaction term alone. For visual parity, the upper tiriangle is
inflated by a factor of three in a backcross experiment. For intercross
experiments the inflation factor is two.

We would look for peaks in the lower right half plot of the `pairscan`

results to identify major pairs of loci. Blips in the upper left half of the plot indicate evidence for interactions.

If we chose to plot the proportion of variance explained for chromosomes 1, 3, and 4, we can use the subscripting mechanism of Matlab to write:

plotpairscan(lod2([1 3 4],[1 3 4]),'prop');Note that the chromosomes we want have to be listed twice because we have to subscript the appropriate rows and columns of the pairwise scan data structure.

The procedure for adding a title and axis labels is the same as in `mainscan`

, although they have different meaning in this case.

### Model selection

Now that we have a rough idea of the results from the plots of `mainscan`

and `pairscan`

, we have to find out which loci are worth a closer look. Statistically, this is a hard problem. Here is how we do it.

### Permutation tests

First we run permutation tests on the results of `mainscan`

and `pairscan`

to find out which loci (or pairs of loci) stand out. The following
commands will run permutation tests for the one-dimensional and
two-dimensional scans:

maxlod2 = permutest2(bp,[],[],fake,100);

In the above commands 1000 permutation tests were run for the `mainscan`

and 100 were used for the `pairscan`

.
Because the permutation tests may take a while to run, we recommend
using a small of permutations to find out how long the program will
take for a larger number of permutations. We generally use 1000
permutations for `mainscan`

and 100 permutations for `pairscan`

. The `permutest2`

takes a long time to run, so be careful what number of permutations you give it to do.

### Setting the cutoff

After the permutations have finished, we have to extract the thresholds to use for subsequent use.

% gives 90% 95% and 99% cutoffs for mainscanthresh1 = thresh(maxlod1)

% gives 90% 95% and 99% cutoffs for pairscan

thresh2 = thresh(maxlod2)

For the one-dimensional scans, the thresholds are the upper percentiles of the maxima of the LOD score obtained by permuting the phenotype column. For two-dimensional scans, the permutation tests give two percentiles. The first column gives the percentiles of the maximum LOD score comparing a two-QTL model with interaction term to an null model. The second column gives the percentiles of the maximum of the LOD score testing for the interaction term for every two-QTL model fitted.

These cutoff values are in the LOD scale. To convert them to the `propvar`

scale we have to use a formula or use the function `lod2propvar`

. For example to convert the one-dimensional cutoff to the proportion of variance scale, use

n = length( ~isnan( bp ) );

lod2propvar( cut1, n );

### Using a flagging routine

After seeting the cutoffs, we can use the flagging routines that will identify the important loci. Suppose we choose the 95% cutoffs.

% get 95% mainscan cutoffcut1 = thresh1(2);

% get 95% pairscan cutoff for full model vs null model

cut2 = thresh2(2,1);

The function,

`reportscan`

flags loci that are important based on one-dimensional and
two-dimensional scans. To flag single loci based on one-dimensional
scans use
reportscan(lod1,cut1);To flag pairs of loci based on the results of one- and two-dimensional scans use reportscan(lod1, lod2, [ cut2 0.001 0.005 ] )

This function looks at all pairs of loci. A locus pair is deemend interesting if the LOD score at the locus pair comparing the full model to the null model exceeds the two-dimensional thresholds based on permutation tests as set by

`cut2`

.
For all interesting locus pairs, we make secondary tests for the
interaction term and additive effects at levels 0.001 and 0.005
respectively.
The `reportscan`

function will give a set of pairs of
loci that are deemed worthwhile. We can then see what the allelic
effects of those loci are and then fit a multiple-QTL model with all
the loci in them.

### Plotting allelic effects

First, we select the pseudomarkers closest to the loci flagged. selectfake = subsetgeno( fake, 'chrid', [ 1 4 6 15 ], ...'mpos', [ 70 30 70 20 ] );

Next we plot the allelic effects of selected loci using the function

`alleleplot`

.
figure;subplot(2,2,1)

alleleplot( bp, selectfake(1).igeno, 'Chr1QTL', {'B6','Het'} )

subplot(2,2,2)

alleleplot( bp, selectfake(2).igeno, 'Chr4QTL', {'B6','Het'} )

subplot(2,2,3)

alleleplot( bp, selectfake(4).igeno, 'Chr15QTL', {'B6','Het'} )

subplot(2,2,4)

alleleplot( bp, selectfake(3).igeno, selectfake(4).igeno, ...

{'Chr6QTL','Chr15QTL'}, {'B6','Het'} )

Note that all the QTL on chromosomes 1, 4 and 15 have strong main effects. The interaction between loci on chromosomes 6 and 15 can be seen since the locus on chromosome 15 has an effect only when the locus on chromosome 6 is heterozygous.

### Type III analysis

We can also do a "Type III" analysis using the general "scan" function. First we will fit the largest model with main effects on chromosomes 1, 4, 6 and 15, and a 6x15 interactions. Then we will drop each of these and see how much the LOD score drops as a result. These are accomplished using the general`scan`

function.
% Fit full model:[lod0,bf0] = scan( bp, [], [], selectfake, [ 1 4 6 15; 1 1 1 1 ], ...

struct( 'twoint', [ 3 4 ] ) );

% Drop QTL on chr1:

[lod1,bf1] = scan( bp, [], [], selectfake, [ 4 6 15; 1 1 1 ], ...

struct( 'twoint', [ 2 3 ] ) );

% Drop QTL on chr4

[lod4,bf4] = scan( bp, [], [], selectfake, [ 1 6 15; 1 1 1 ], ...

struct( 'twoint', [ 2 3 ] ) );

% Drop 6x15 interaction

[lod6x15,bf6x15] = scan( bp, [], [], selectfake, [ 1 4 6 15; 1 1 1 1 ] );

% Drop drop QTLs on chr6 and chr15

[lod6n15,bf6n15] = scan( bp, [], [], selectfake, [ 1 4; 1 1 ] );

Here are the LOD scores for the comparisons: % Compare models to the largest model:

comparison_lods = [ lod0-lod1 lod0-lod4 lod0-lod6x15 lod0-lod6n15 ]

Now we calculate the nominal p-values corresponding to these comparisons: % Calculate nominal p-values associated with those terms

comparison_pvalues = [ 1-chi2cdf(2*log(10)*(lod0-lod1),1)...

1-chi2cdf(2*log(10)*(lod0-lod4),1)...

1-chi2cdf(2*log(10)*(lod0-lod6x15),1)...

1-chi2cdf(2*log(10)*(lod0-lod6n15),1) ]

We can see that all terms in the model do contribute and we keep them. So our selected model has QTL on chromosomes 1, 4, 6, and 15. The last two loci interact.

## Fine mapping

Once we have narrowed down the list of chromosomes that harbor QTLs, how many of them on each chromosome, and whether they interact with each other, it is time to estimate the genetic model and the location of the QTLs. The first step in this process would be to run scanning programs at higher resolutions for the chromosomes of interest. For example, we could place the pseudomarkers spaced 2cM apart and use 128 fake datasets instead of 16 which is used for a 10cM pseudomarker map.

bigfake = impute( mdata, 0.02 );biglod1 = mainscan( bp, [], [], bigfake );

biglod2 = pairscan( bp, [], [], bigfake );

The imputation step and the scanning steps will take much longer than those for the exploratory scans. Either increase the resolution and the number of imputations gradually, or be patient!

### Single QTL on a chromosome

When the evidence indicates that there is only one QTL on a
chromosome, we can make a plot of the posterior distribution of the QTL
location. This is done by using the `plotmainlocalize`

function.

### Two QTL on a chromosome

If there is reason to believe that there are two QTL on a single
chromosome, then we can make a plot of the joint posterior distribution
of two QTL on the chromosome using the function `plotpairlocalize`

. We can use

This will draw the contours of confidence regions for the two QTL on chromosome 1 at levels 50%, 95% and 99% levels. We find these settings useful; feel free to experiment with other choices of levels for the confidence regions.

## Estimating effects

### Single QTL on a chromosome

To estimate the effects of the QTL on each chromosome and their standard errors, one can use the `estimate`

family of functions. These give the approximate the posterior means and
variances of the estimated effects. For example if there is one QTL on
chromosome 4, then to estimate its effect we would use:

obsdata = find( ~isnan(bp) );

pheno = bp(obsdata);

igeno = fake(4).igeno(obsdata,:,:);

[mm,vv] = oneestimate( pheno, [], igeno, 'bc' );

In this case mm and vv are the posterior means and variances of the estimated effects of the QTL on chromosome 4.

### Two QTL on a chromosome

If there are more than one QTL on a chromosome, then we will have to use different `estimate`

functions. For esample, if there are two additive QTL on chromosome 4 we will use

igeno = fake(4).igeno(obsdata,:,:);

[mm,vv] = twoestimate( pheno, [], igeno, 'bc', 'a+b' );

If they were interacting, we would use [mm,vv] = twoestimate( pheno, [], igeno, 'bc', 'a*b' );

The last argument is the model argument which is either

`'a+b'`

or `'a*b'`

depending on whether we are estimating additive effects or the full model including interactions.
### Two QTL on different chromosomes

If there are more than two QTL on different chromosomes, then we use `twoestimate2`

. For example, if there are two additive QTL on chromosome 1 and 4 we will use

igeno1 = fake(1).igeno(obsdata,:,:);

igeno4 = fake(4).igeno(obsdata,:,:);

[mm,vv] = twoestimate2( pheno, [], igeno1, igeno4, 'bc', 'a+b' );

If they were interacting, we would use [mm,vv] = twoestimate2( pheno, [], igeno1, igeno4, 'bc', 'a*b' );

### More than two QTL

For this we have to use the functions `threeestimate`

, `threeestimate2`

, and `threeestimate3`

functions. The syntax for these functions is similar to the `twoestimate`

functions. Please see the `help`

for these functions for details. We are trying to develop general functions to deal with all these cases together.

## Data structures

In this section we will describe the data structures underlying the Pseudomarker programs.

## readdata

This function outputs two arguments. The first one is the phenotype matrix and the second one is the marker data structure. This is an array, each element of the array corresponding to a chromosome. The fields are the following:

`chrid:`

chromosome number`geno:`

matrix of genotypes; rows correspond to inidviduals and columns correspond to markers`mnames:`

list of marker names`mpos:`

marker positions in*Morgans*`chromlen:`

length of the chromosome under consideration; at this point we*round up*from the righmost marker to the nearest 20cM

### impute

The output of this function is also a structure array with each element of the structure corresponding to a chromosome. The fields are:`chrid:`

chromosome numner`igeno:`

matrix of imputed genotypes; rows correspond to individuals, columns correspond to pseudomarkers and pages correspond to the imputation numner`mpos:`

pseudomarker positions in*Morgans*`cross:`

cross type;`'bc'`

(backcross) or`'f2'`

(F2 intercross)`chromlen:`

chromosome length

### mainscan

Each element of the output structure array corresponds to a chromosome. Here are the fields:`n:`

number of individuals`cross:`

cross type`chrid:`

chromosome number`bf:`

un-normalized bayes factor`lod:`

vector of LOD scores; this is the log of the posterior density at the corresponding pseudomarker locations`mpos:`

pseudomarker positions`chromlen:`

chromosome length in Morgans

### pairscan

This is probably the most complicated structure of them all. This may change a bit in future versions of the code. Be warned...The program outputs a *two-dimentional* array of structures. Each corresponds to a pair of chromosomes which may or may not be the same.

`n:`

number of individuals`cross:`

cross type`chrid1:`

row chromosome number`chrid2:`

col chromosome number`bfadd:`

Bayes factor for additive model (if row=col)`bfint:`

Bayes factor for full model (if row=col)`bf:`

Bayes factor for full model (if row>col) else for additive model`baselod:`

not implemented now; intended for covariates`lod:`

log of the posterior density for QTL locations at pseudomarker pair locations`chromlen1:`

row chromosome length`chromlen2:`

col chromosome length`mpos1:`

row pseudomarker positions`mpos2:`

col pseudomarker positions

You made it! You have now done some basic QTL Analysis and know some
of the ins and outs of Matlab coding. Do not forget to use the `help`

command if you ever have any questions about a function. If you find errors in this document please drop us a note!