Bioinformatics Algorithms: An Active Learning Approach

Chapter 5: How Do We Compare Biological Sequences?

(Coursera Week 1)

Why is it called “dynamic programming”?

You may be surprised, but sixty years ago, the word "programming" had nothing to do with computer programming, but instead referred to finding optimal programs, in the sense of finding a military schedule for logistics. In the 1950s, the inventor of dynamic programming, Richard Bellman, joined the RAND Corporation, which at the time was working on various contracts for the US Defense Department. At that time, the Secretary of Defense, Charles Wilson, was under pressure to reduce the military budget, including the research budget. Here is how Bellman described his interactions with Wilson and the birth of the term “dynamic programming”:

"[Wilson's] face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical… Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

Is it possible to develop an iterative rather than recursive algorithm for OutputLCS?

Yes, and below is some pseudocode.  Instead of making recursive calls, we simply embed the algorithm's iterations within a while loop.

IterativeOutputLCS(Backtrack, v, w)
    LCS ← an empty string
    i ← length of string
    j ← length of string w
    while i > 0 and j > 0
        if Backtrack(i, j) = ""
            i ← i-1
        else if Backtrack(i,j) = ""
        else if Backtrack(i,j) = ""
              LCS ← concatenate vi with LCS
            i ← i-1
            j  ← j-1
    return LCS

Why do we favor vertical backtracking edges (over horizontal and diagonal) in OutputLCS?

All three types of backtracking edges (horizontal, vertical, and diagonal) are equally important; OutputLCS simply starts the analysis with vertical edges because it has to start somewhere! This will indeed bias the solutions if there are multiple longest common subsequences, in which case you may want to vary backtracking over multiple runs.

If there are many topological orders, which one should I use for computing alignments?

All topological orderings result in the same running time of your alignment algorithm, so you can choose any one you like.

(Coursera Week 2)

Can you explain how to select penalties for insertions and deletions?

Indel penalties are often implemented independent of the amino acid types in the inserted or deleted fragment, despite evidence that specific residue types are preferred in gap regions. For example, all indel penalties in the PAM matrix presented in the book are equal to 8, independently of which amino acid is inserted or deleted.

How do we know where the conserved region in a local alignment starts and ends?

If we knew in advance where the conserved region between two strings begins and ends, then we could simply change the source and sink to be the starting and ending nodes of the conserved interval. However, the key point is that we do not know this information in advance. Fortunately, since local alignment adds free taxi rides from the source (0,0) to every node (and from every node to the sink (n, m)) of the alignment graph, it explores all possible starting and ending positions of conserved regions.

How do biologists decide when to use global alignment and when to use local alignment?

Unless biologists have a good reason to align entire strings (e.g., alignment of genes between very close species such as human and chimpanzee), they use local alignment.

How do we backtrack when computing a local alignment?

As mentioned in the text, an optimal local alignment fits within a sub-rectangle in the alignment graph. We denote the upper right and bottom left nodes of this sub-rectangle as (i, j) and (i’, j’), respectively.  Unless (i, j) = (0, 0) or (i’, j’) = (n, m), the local alignment corresponds to taking a free taxi ride (i.e., zero-weight edge) from (0, 0) to (i, j), taking one edge at a time to travel from  (ij) to (i’, j’), and then taking another free taxi ride from (i’, j’) to (n, m).
We already know that (i’, j’) can be computed as a node such that the score si’, j is maximized over all nodes of the entire alignment graph. The question, then, is how to backtrack from this node and to find  (ij).

Recall the three backtracking choices from the OutputLCS pseudocode corresponding to the backtracking references ("",  "", and ""). To these, we will simply add one additional option, "FREE", which is used when we have used a free taxi ride from the source and will allow us to jump back to (0, 0) when backtracking.  Thus, to find (i, j), one has to backtrack from (i’, j’) until we either reach a coordinate (i, j) with Backtracki, j = "FREE" or until we reach the source (0,0).

For simplicity, the following LocalAlignment pseudocode assumes that si, j  = -∞ if i < 0 or j < 0.

LocalAlignment(v, w, Score)
   for i ← 0 to |v|
      for j ← 0 to |w|
         si,j ← max{0, si-1,j + Score(vi, -), si,j-1 + Score(-, wj), si-1,j-1 + Score(vi, wj)}
         if si,j = 0
            Backtracki,j ← "FREE"
         else if si,j = si-1,j + Score(vi, -)
            Backtracki,j ← ""
         else if si,j = si,j-1 + Score(-, wj)
            Backtracki,j ← ""
         else Backtracki,j ← ""
   return Backtrack

After computing backtracking references, we can compute the source node of the local alignment by invoking LocalAlignmentSource(Backtrack, i', j'), where (i', j') is the sink of the local alignment computed as a node with maximum score among all nodes in the alignment graph. The LocalAlignmentSource pseudocode is shown below.

LocalAlignmentSource(Backtrack, i, j)
   while i > 0 or j > 0
      if Backtrack(i, j) ="FREE"
         return (i, j)
      else if Backtrack(i, j) = ""
         i  ← - 1
      else if Backtrack(i, j) = ""
         j  ← - 1
      else if Backtrack(i, j) = ""
         i  ← i - 1
         j  ← j - 1
   return (0,0)

Can you help me understand how to set up different alignment problems as an instance of the Longest Path in a DAG Problem?

Please take a look at the following figure for a big hint.

(Coursera Week 3)

When building a Manhattan graph in three levels to perform affine sequence alignment, can’t we avoid building the three-level graph by using a two-level graph and keeping track of gap penalties each time we open a gap?

The problem with this idea is that when computing the score at a node (i, j), we need to take into account the score of closing a gap from every node in the same row and same column. This is exactly what long indel edges do, and what we are trying to avoid in order to reduce the runtime of the algorithm by decreasing the number of edges in the graph.

For example, the figure below illustrates the two-level Manhattan for computing alignment with affine gap penalties (compare with the 3-level Manhattan shown in the text). At first glance, it looks like it solves the Alignment with Affine Gap Penalties Problem.

However, it you consider the following alignment with two gaps,


you will see that the path through the two-level graph for this alignment (shown below) miscalculates the alignment score; whereas the score of the alignment above is 3-2*(σ+ε), the score of the alignment in the figure below is 3- σ-3*ε.

When building a Manhattan graph in three levels to perform affine sequence alignment, would it be more natural to connect (i, j)middle to the nodes (i, j)lower and (i, j)upper (instead of connecting them with (i+1, j)lower and (i, j+1)upper?

Connecting (i, j)middle to the nodes (i, j)lower and (i, j)upper (and assigning the edges a score of σ - ε) indeed seems like a more elegant approach to building a three-level Manhattan. However, in this case we would create directed cycles, since there would exist edges from (i, j)lower and (i, j)upper back to (i, j)middle .

What topological order should we use to move in the three-level Manhattan graph for affine sequence alignment?

Please consult the following figure, which shows a topological order for a small three-level graph in which we proceed row-by-row in each level. (The topological order corresponds to visiting the nodes in increasing order, from 1 to 36.) Because there are some edges from (i, j)lower to (i, j)middle and from (i, j)upper to (i, j)middle for each i and j, we need to make sure that we visit (i, j)lower and (i, j)upper before (i, j)middle.

How can we backtrack in the three-level Manhattan for alignment with affine gap penalties?

Alignment with affine gap penalties is merely a search for a longest path in a DAG. Thus, backtracking in this graph is no different that backtracking in an arbitrary DAG, starting at the sink and proceeding toward the source according to a topological order. This reduces the problem to identifying a topological order, of which there are several.

How should we perform initialization for the Alignment with Affine Gap Penalties Problem?

It may be tricky to determine how to initialize the lower, upper, and middle matrices in the Alignment with Affine Gap Penalties Problem. For this reason, it may be easier to construct a DAG for this problem, approaching it as a general Longest Path in a DAG problem. Since we are interested in paths that start at the node (0,0) in the middle graph, we initialize this node with score 0. However, it is not the only node lacking incoming edges in the three-lelel graph for the Alignment with Affine Gap Penalties Problem. We therefore will initialize all other nodes with indegree equal to 0 with score -∞.

In space-efficient sequence alignment, what should we do if there are multiple middle nodes?

If the array Length(i) has more than one element of maximum value, then you can choose any of them, breaking ties arbitrarily.

Why did we switch from finding a middle node to finding a middle edge in space-efficient alignment?
The shaded parts of the alignment graph to the northwest and southeast of the middle node make up more than half of the total area of the alignment graph. In contrast, the shaded parts of the alignment graph to the northwest and southeast of the middle edge make up less than half of the total area of the alignment graph. This implies that when we divide and conquer based on the middle node, we obtain runtime greater than 2nm, whereas when we divide and conquer based on the middle edge, we obtain runtime less than 2nm.

Is it possible to develop a linear-space algorithm for alignment with affine gap penalties?
It is possible to extend our idea from pariwise alignment (in order to find a longest common subsequence) to more complex scoring models, e.g., for alignment with affine gap penalties. The only question is where to place the "middle column". It turns out that we need to consider two middle columns to implement the divide-and-conquer linear-space algorithm for alignment with affine gap penalties; try to figure out which ones!

How would we align a nucleotide string of length n against a profile of length k, i.e., against a 4 x m matrix?

Every alignment of a string v against a profile Profile = (PX, j) (for X ∈ {A, C, G, T}, and 1 ≤ j ≤ m) can be represented as a path from (0,0) to (n, m) in the alignment graph. We can define scores of vertical and horizontal edges in this graph as before, i.e., by assigning penalties sigma to indels. There are various ways to assign scores to diagonal edges, e.g, a diagonal edge into a node (i, j) can be assigned a score Pvi, j. After the scores of all edges are specified, an optimal alignment between a string and a profile is computed as a longest path in the alignment graph.