# Least squares inference in phylogeny

**Least squares inference in phylogeny** generates a
phylogenetic tree based on an
observed matrix of pairwise genetic distances and
optionally a weight
matrix. The goal is to find a tree which satisfies the distance constraints as
best as possible.

## Ordinary and weighted least squares

The discrepancy between the observed pairwise distances $ D_{ij} $ and the distances $ T_{ij} $ over a phylogenetic tree (i.e. the sum of the branch lengths in the path from leaf $ i $ to leaf $ j $) is measured by

- $ S = \sum_{ij} w_{ij} (D_{ij}-T_{ij})^2 $

where the weights $ w_{ij} $ depend on the least squares method used.
Least squares
distance tree construction aims to find the tree (topology and branch lengths)
with minimal S. This is a non-trivial problem. It involves searching the
discrete space of unrooted binary tree topologies whose size is exponential in
the number of leaves. For n leaves there are
1 • 3 • 5 • ... • (2n-3)
different topologies. Enumerating them is not feasible already for a small
number of leaves. Heuristic search methods are used to find a reasonably
good topology. The evaluation of S for a given topology (which includes the
computation of the branch lengths) is a linear least squares problem.
There are several ways to weight the squared errors
$ (D_{ij}-T_{ij})^2 $,
depending on the knowledge and assumptions about the variances of the observed
distances. When nothing is known about the errors, or if they are assumed to be
independently distributed and equal for all observed distances, then all the
weights $ w_{ij} $ are set to one. This leads to an ordinary least
squares estimate.
In the weighted least squares case the errors are assumed to be independent
(or their correlations are not known). Given independent errors, a particular
weight should ideally be set to the inverse of the variance of the corresponding distance
estimate. Sometimes the variances may not be known, but they
can be modeled as a function of the distance estimates. In the Fitch and
Margoliash method
^{[1]}
for instance it is assumed that the variances are proportional to the squared
distances.

## Generalized least squares

The ordinary and weighted least squares methods described above assume independent distance estimates. If the distances are derived from genomic data their estimates covary, because evolutionary events on internal branches (of the true tree) can push several distances up or down at the same time. The resulting covariances can be taken into account using the method of generalized least squares, i.e. minimizing the following quantity

- $ \sum_{ij, kl} w_{ij,kl} (D_{ij}-T_{ij}) (D_{kl}-T_{kl}) $

where $ w_{ij,kl} $ are the entries of the inverse of the covariance matrix of the distance estimates.

## External links

- PHYLIP, a freely distributed phylogenetic analysis package containing an implementation of the weighted least squares method
- PAUP, a similar package available for purchase
- Darwin, a programming environment with a library of functions for statistics, numerics, sequence and phylogenetic analysis

## References

- ↑ Fitch WM, Margoliash E. (1967). Construction of phylogenetic trees.
*Science*155: 279-84.

Topics in phylogenetics | |
---|---|

Relevant fields | phylogenetics | computational phylogenetics | molecular phylogeny | cladistics |

Basic concepts | synapomorphy | phylogenetic tree | phylogenetic network | long branch attraction |

Phylogeny inference methods | maximum parsimony | maximum likelihood | neighbour joining | UPGMA | Bayesian inference | Least Squares |

Current topics | PhyloCode | DNA barcoding |

List of evolutionary biology topics |