Bioinformatics Algorithms: An Active Learning Approach

Chapter 5: How Do We Compare Biological Sequences?

(Coursera Week 1)

Why is it called “dynamic programming”?

You may be surprised, but sixty years ago, the word "programming" had nothing to do with computer programming, but instead referred to finding optimal programs, in the sense of finding a military schedule for logistics. In the 1950s, the inventor of dynamic programming, Richard Bellman, joined the RAND Corporation, which at the time was working on various contracts for the US Defense Department. At that time, the Secretary of Defense, Charles Wilson, was under pressure to reduce the military budget, including the research budget. Here is how Bellman described his interactions with Wilson and the birth of the term “dynamic programming”:

"[Wilson's] face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical… Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

Is it possible to develop an iterative rather than recursive algorithm for OutputLCS?

Yes, and below is some pseudocode.  Instead of making recursive calls, we simply embed the algorithm's iterations within a while loop.

IterativeOutputLCS(Backtrack, v, w)
    LCS ← an empty string
    i ← length of string
    j ← length of string w
    while i > 0 and j > 0
        if Backtrack(i, j) = ""
            i ← i-1
        else if Backtrack(i,j) = ""
        else if Backtrack(i,j) = ""
              LCS ← concatenate vi with LCS
            i ← i-1
            j  ← j-1
    return LCS

(Coursera Week 2)

How do we know where the conserved region in a local alignment starts and ends?

If we knew in advance where the conserved region between two strings begins and ends, then we could simply change the source and sink to be the starting and ending nodes of the conserved interval. However, the key point is that we do not know this information in advance. Fortunately, since local alignment adds free taxi rides from the source (0,0) to every node (and from every node to the sink (n, m)) of the alignment graph, it explores all possible starting and ending positions of conserved regions.

How do biologists decide when to use global alignment and when to use local alignment?

Unless biologists have a good reason to align entire strings (e.g., alignment of genes between very close species such as human and chimpanzee), they use local alignment.

How do we backtrack when computing a local alignment?

As mentioned in the text, an optimal local alignment fits within a sub-rectangle in the alignment graph. We denote the upper right and bottom left nodes of this sub-rectangle as (i, j) and (i’, j’), respectively.  Unless (i, j) = (0, 0) or (i’, j’) = (n, m), the local alignment corresponds to taking a free taxi ride (i.e., zero-weight edge) from (0, 0) to (i, j), taking one edge at a time to travel from  (ij) to (i’, j’), and then taking another free taxi ride from (i’, j’) to (n, m).
We already know that (i’, j’) can be computed as a node such that the score si’, j is maximized over all nodes of the entire alignment graph. The question, then, is how to backtrack from this node and to find  (ij).

Recall the three backtracking choices from the OutputLCS pseudocode corresponding to the backtracking references ("",  "", and ""). To these, we will simply add one additional option, "FREE", which is used when we have used a free taxi ride from the source and will allow us to jump back to (0, 0) when backtracking.  Thus, to find (i, j), one has to backtrack from (i’, j’) until we either reach a coordinate (i, j) with Backtracki, j = "FREE" or until we reach the source (0,0).

For simplicity, the following LocalAlignment pseudocode assumes that si, j  = -∞ if i < 0 or j < 0.

LocalAlignment(v, w, Score)
   for i ← 0 to |v|
      for j ← 0 to |w|
         si,j ← max{0, si-1,j + Score(vi, -), si,j-1 + Score(-, wj), si-1,j-1 + Score(vi, wj)}
         if si,j = 0
            Backtracki,j ← "FREE"
         else if si,j = si-1,j + Score(vi, -)
            Backtracki,j ← ""
         else if si,j = si,j-1 + Score(-, wj)
            Backtracki,j ← ""
         else Backtracki,j ← ""
   return Backtrack

After computing backtracking references, we can compute the source node of the local alignment by invoking LocalAlignmentSource(Backtrack, i', j'), where (i', j') is the sink of the local alignment computed as a node with maximum score among all nodes in the alignment graph. The LocalAlignmentSource pseudocode is shown below.

LocalAlignmentSource(Backtrack, i, j)
   while i > 0 or j > 0
      if Backtrack(i, j) ="FREE"
         return (i, j)
      else if Backtrack(i, j) = ""
         i  ← - 1
      else if Backtrack(i, j) = ""
         j  ← - 1
      else if Backtrack(i, j) = ""
         i  ← i - 1
         j  ← j - 1
   return (0,0)

Can you help me understand how to set up different alignment problems as an instance of the Longest Path in a DAG Problem?

Please take a look at the following figure for a big hint.

(Coursera Week 3)

When building a Manhattan graph in three levels to perform affine sequence alignment, can’t we avoid building the three-level graph by using a two-level graph and keeping track of gap penalties each time we open a gap?

The problem with this idea is that when computing the score at a node (i, j), we need to take into account the score of closing a gap from every node in the same row and same column. This is exactly what long indel edges do, and what we are trying to avoid in order to reduce the runtime of the algorithm by decreasing the number of edges in the graph.

When building a Manhattan graph in three levels to perform affine sequence alignment, would it be more natural to connect (i, j)middle to the nodes (i, j)lower and (i, j)upper (instead of connecting them with (i+1, j)lower and (i, j+1)upper?

Connecting (i, j)middle to the nodes (i, j)lower and (i, j)upper (and assigning the edges a score of σ - ε) indeed seems like a more elegant approach to building a three-level Manhattan. However, in this case we would create directed cycles, since there would exist edges from (i, j)lower and (i, j)upper back to (i, j)middle .

What topological order should we use to move in the three-level Manhattan graph for affine sequence alignment?

Please consult the following figure, which shows a topological order for a small three-level graph in which we proceed row-by-row in each level. (The topological order corresponds to visiting the nodes in increasing order, from 1 to 36.) Because there are some edges from (i, j)lower to (i, j)middle and from (i, j)upper to (i, j)middle for each i and j, we need to make sure that we visit (i, j)lower and (i, j)upper before (i, j)middle.

How should we perform initialization for the Alignment with Affine Gap Penalties Problem?

It may be tricky to determine how to initialize the lower, upper, and middle matrices in the Alignment with Affine Gap Penalties Problem. For this reason, it may be easier to construct a DAG for this problem, approaching it as a general Longest Path in a DAG problem. Since we are interested in paths that start at the node (0,0) in the middle graph, we initialize this node with score 0. However, it is not the only node lacking incoming edges in the three-lelel graph for the Alignment with Affine Gap Penalties Problem. We therefore will initialize all other nodes with indegree equal to 0 with score -∞.

In space-efficient sequence alignment, what should we do if there are multiple middle nodes?

If the array Length(i) has more than one element of maximum value, then you can choose any of them, breaking ties arbitrarily.

Why did we switch from finding a middle node to finding a middle edge in space-efficient alignment?
The shaded parts of the alignment graph to the northwest and southeast of the middle node make up more than half of the total area of the alignment graph. In contrast, the shaded parts of the alignment graph to the northwest and southeast of the middle edge make up less than half of the total area of the alignment graph. This implies that when we divide and conquer based on the middle node, we obtain runtime greater than 2nm, whereas when we divide and conquer based on the middle edge, we obtain runtime less than 2nm.

How would we align a nucleotide string of length n against a profile of length k, i.e., against a 4 x m matrix?

Every alignment of a string v against a profile Profile = (PX, j) (for X ∈ {A, C, G, T}, and 1 ≤ j ≤ m) can be represented as a path from (0,0) to (n, m) in the alignment graph. We can define scores of vertical and horizontal edges in this graph as before, i.e., by assigning penalties sigma to indels. There are various ways to assign scores to diagonal edges, e.g, a diagonal edge into a node (i, j) can be assigned a score Pvi, j. After the scores of all edges are specified, an optimal alignment between a string and a profile is computed as a longest path in the alignment graph.