# Chapter 5: How Do We Compare Biological Sequences?

### (Coursera Week 1)

You may be surprised, but sixty years ago, the word "programming" had nothing to do with computer programming, but instead referred to finding optimal programs, in the sense of finding a military schedule for logistics. In the 1950s, the inventor of dynamic programming, Richard Bellman, joined the RAND Corporation, which at the time was working on various contracts for the US Defense Department. At that time, the Secretary of Defense, Charles Wilson, was under pressure to reduce the military budget, including the research budget. Here is how Bellman described his interactions with Wilson and the birth of the term “dynamic programming”:

"[Wilson's] face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical… Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

Yes, and below is some pseudocode. Instead of making recursive calls, we simply embed the algorithm's iterations within a while loop.

**IterativeOutputLCS**(*Backtrack*, *v*, *w*)
* LCS* ← an empty string
* i* ← length of string
* **j* ← length of string *w
*** while** *i* > 0 and *j* > 0
**if** *Backtrack*(*i*, *j*) = "↓"
*i* ← *i*-1
**else** **if** Backtrack(i,j) = "**→**"
*j* ← *j*-1
**else** **if** Backtrack(i,j) = "**↘**"
*LCS* ← concatenate *v*_{i} with *LCS*
* i* ← *i*-1
*j* ← *j*-1*
*** return** *LCS*

**OutputLCS**?

All three types of backtracking edges (horizontal, vertical, and diagonal) are equally important; **OutputLCS** simply starts the analysis with vertical edges because it has to start somewhere! This will indeed bias the solutions if there are *multiple* longest common subsequences, in which case you may want to vary backtracking over multiple runs.

All topological orderings result in the same running time of your alignment algorithm, so you can choose any one you like.

### (Coursera Week 2)

Indel penalties are often implemented independent of the amino acid types in the inserted or deleted fragment, despite evidence that specific residue types are preferred in gap regions. For example, all indel penalties in the PAM matrix presented in the book are equal to 8, independently of which amino acid is inserted or deleted.

If we knew in advance where the conserved region between two strings begins and ends, then we could simply change the source and sink to be the starting and ending nodes of the conserved interval. However, the key point is that we do not know this information in advance. Fortunately, since local alignment adds free taxi rides from the source (0,0) to *every* node (and from every node to the sink (*n*, *m*)) of the alignment graph, it explores *all* possible starting and ending positions of conserved regions.

Unless biologists have a good reason to align entire strings (e.g., alignment of genes between very close species such as human and chimpanzee), they use local alignment.

As mentioned in the text, an optimal local alignment fits within a sub-rectangle in the alignment graph. We denote the upper right and bottom left nodes of this sub-rectangle as (*i*, *j*) and (*i*’, *j*’), respectively. Unless (*i*, *j*) = (0, 0) or (*i*’, *j*’) = (*n*, *m*), the local alignment corresponds to taking a free taxi ride (i.e., zero-weight edge) from (0, 0) to (*i*, *j*), taking one edge at a time to travel from (*i*, *j*) to (*i*’, *j*’), and then taking another free taxi ride from (*i*’, *j*’) to (*n*, *m*).

We already know that (*i*’, *j*’) can be computed as a node such that the score *s*_{i’, j’} is maximized over all nodes of the entire alignment graph. The question, then, is how to backtrack from this node and to find (*i*, *j*).

Recall the three backtracking choices from the **OutputLCS** pseudocode corresponding to the backtracking references ("↓", "**→**", and "**↘**"). To these, we will simply add one additional option, "FREE", which is used when we have used a free taxi ride from the source and will allow us to jump back to (0, 0) when backtracking. Thus, to find (*i*, *j*), one has to backtrack from (*i*’, *j*’) until we either reach a coordinate (*i*, *j*) with *Backtrack*_{i, j} = "FREE" or until we reach the source (0,0).

For simplicity, the following **LocalAlignment** pseudocode assumes that *s _{i, j}* = -∞ if

*i*< 0 or

*j*< 0.

**LocalAlignment**(*v*, *w*, *Score*)
**for** *i* ← 0 to |*v*|
**for** __j__ ← 0 to |*w*|
*s*_{i,j} ← max{0, *s*_{i-1,j} + *Score*(*v*_{i}, -), *s*_{i,j-1 }+ *Score*(-, *w*_{j}), *s*_{i-1,j-1} + *Score*(*v*_{i}, *w*_{j})}
**if ***s*_{i,j} = 0
*Backtrack*_{i,j} ← "FREE"
**else ****if ***s*_{i,j} = *s*_{i-1,j} + *Score*(*v*_{i}, -)**
***Backtrack*_{i,j} ← "↓"
**else if ***s*_{i,j} = *s*_{i,j}-1 + *Score*(-, *w*_{j})
*Backtrack*_{i,j} ← "→"
**else** *Backtrack*_{i,j} ← "↘"
**return** *Backtrack*

After computing backtracking references, we can compute the source node of the local alignment by invoking **LocalAlignmentSource**(*Backtrack, i', j'*), where (*i', j'*) is the sink of the local alignment computed as a node with maximum score among all nodes in the alignment graph. The **LocalAlignmentSource** pseudocode is shown below.

**LocalAlignmentSource**(*Backtrack*, *i*, *j*)
**while** *i* > 0 or *j* > 0
**if** *Backtrack*(*i*, *j*) ="FREE"
**return** (*i*, *j*)
**else if** Backtrack(*i*, *j*) = "↓"
*i* ← *i *- 1
**else if** *Backtrack*(*i*, *j*) = "→"
*j* ← *j *- 1
**else if** Backtrack(*i*, *j*) = "↘"
*i* ← *i* - 1
*j* ← *j* - 1
**return** (0,0)

Please take a look at the following figure for a big hint.

### (Coursera Week 3)

The problem with this idea is that when computing the score at a node (*i*, *j*), we need to take into account the score of closing a gap from every node in the same row and same column. This is exactly what long indel edges do, and what we are trying to avoid in order to reduce the runtime of the algorithm by decreasing the number of edges in the graph.

For example, the figure below illustrates the two-level Manhattan for computing alignment with affine gap penalties (compare with the 3-level Manhattan shown in the text). At first glance, it looks like it solves the Alignment with Affine Gap Penalties Problem.

However, it you consider the following alignment with two gaps,

ATT--AA A--GGAA

you will see that the path through the two-level graph for this alignment (shown below) miscalculates the alignment score; whereas the score of the alignment above is 3-2*(σ+ε), the score of the alignment in the figure below is 3- σ-3*ε.

*i*,

*j*)

_{middle}to the nodes (

*i*,

*j*)

_{lower}and (

*i*,

*j*)

_{upper}(instead of connecting them with (

*i*+1,

*j*)

_{lower}and (

*i*,

*j*+1)

_{upper}?

Connecting (*i*, *j*)_{middle} to the nodes (*i*, *j*)_{lower} and (*i*, *j*)_{upper} (and assigning the edges a score of σ - ε) indeed seems like a more elegant approach to building a three-level Manhattan. However, in this case we would create directed cycles, since there would exist edges from (*i*, *j*)_{lower} and (*i*, *j*)_{upper} back to (*i*, *j*)_{middle} .

Please consult the following figure, which shows a topological order for a small three-level graph in which we proceed row-by-row in each level. (The topological order corresponds to visiting the nodes in increasing order, from 1 to 36.) Because there are some edges from (*i*, *j*)_{lower} to (*i*, *j*)_{middle} and from (*i*, *j*)_{upper} to (*i*, *j*)_{middle} for each *i* and *j*, we need to make sure that we visit (*i*, *j*)_{lower} and (*i*, *j*)_{upper} before (*i*, *j*)_{middle}.

Alignment with affine gap penalties is merely a search for a longest path in a DAG. Thus, backtracking in this graph is no different that backtracking in an arbitrary DAG, starting at the sink and proceeding toward the source according to a topological order. This reduces the problem to identifying a topological order, of which there are several.

It may be tricky to determine how to initialize the *lower*, *upper*, and *middle* matrices in the Alignment with Affine Gap Penalties Problem. For this reason, it may be easier to construct a DAG for this problem, approaching it as a general Longest Path in a DAG problem. Since we are interested in paths that start at the node (0,0) in the middle graph, we initialize this node with score 0. However, it is not the only node lacking incoming edges in the three-lelel graph for the Alignment with Affine Gap Penalties Problem. We therefore will initialize all other nodes with indegree equal to 0 with score -∞.

If the array *Length*(*i*) has more than one element of maximum value, then you can choose any of them, breaking ties arbitrarily.

*edge*make up less than half of the total area of the alignment graph. This implies that when we divide and conquer based on the middle node, we obtain runtime greater than 2

*nm*, whereas when we divide and conquer based on the middle edge, we obtain runtime less than 2

*nm*.

*two*middle columns to implement the divide-and-conquer linear-space algorithm for alignment with affine gap penalties; try to figure out which ones!

*n*against a profile of length

*k*, i.e., against a 4 x

*m*matrix?

Every alignment of a string *v* against a profile *Profile* = (*P*_{X, j}) (for *X* ∈ {`A`, `C`, `G`, `T`}, and 1 ≤ j ≤ m) can be represented as a path from (0,0) to (*n*, *m*) in the alignment graph. We can define scores of vertical and horizontal edges in this graph as before, i.e., by assigning penalties sigma to indels. There are various ways to assign scores to diagonal edges, e.g, a diagonal edge into a node (*i*, *j*) can be assigned a score *P*_{vi, j}. After the scores of all edges are specified, an optimal alignment between a string and a profile is computed as a longest path in the alignment graph.