Chapter 6: Are There Fragile Regions in the Human Genome?
A permutation is a specific ordering of the positive integers from 1 to n, where each element is used exactly once. For example, there are six permutations of length 3:
In this book, we often use the term "permutation" as shorthand for a signed permutation, in which each element has a sign, or orientation (represented as a "+" or "-"). You can verify that there are 48 signed permutations of length 3.
Given a permutation P and a reversal ρ, we denote the genome resulting from applying ρ to P as P*ρ. A reversal ρ is called P-valid if the reversal distance of P*ρ is smaller than the 2-break distance of P. The following recurrence relation computes NumberOfScenarios(P), the number of different reversal scenarios that transform a genome P into the identity permutation using the minimum number of reversals:
The pair (+4 +3) forms a breakpoint because, in contrast to (-4 -3), it cannot be transformed into (+3 +4), a desirable pair when sorting by reversals, by a single reversal. For example, applying a reversal to
transforms this permutation into
but applying a reversal to
transforms it into the identity permutation
To better understand why (+4 +3) is a breakpoint, try sorting the permutation (+6 +5 +4 +3 +2 +1) – you will see that it requires many reversals!
Yes! For example, a transposition moves a segment from one location in the genome to another. For example, one transposition applied to the blue region of the chromosome (+1 +2 +3 +4 +5 +6 +7) yields (+1 +5 +6 +2 +3 +4 +7). However, transpositions are more rare than reversals and other rearrangements discussed in the chapter.
The following explanation is a modification of one given by one of our excellent community TAs, Giampaolo Eusebi.
Keep in mind that:
- Every even number 2x is a black node head, and every even number 2x−1 is a black node tail;
- every colored edge is composed by two numbers representing black heads or tails.
That being said, the order should not be very important. Take, for example, the following edge list:
If you start with (2,4):
- (2,4) ends with a 4 (even), and the only edge that starts with 4−1=3 is (3,6);
- (3,6) ends with a 6 (even), and the only edge that starts with 6−1=5 is (5,1);
- (5,1) ends with a 1 (odd), and the only edge that starts with 1+1=2 is (2,4), which brings us back where we started, thus forming a cycle.
The only remaining edges are (7,9), (10,12),(11,8). If you start with (7,9):
- (7,9) ends with a 9 (odd), and the only edge that starts with 9+1=10 is (10,12);
- (10,12) ends with a 12 (even), and the only edge that starts with 12−1=11 is (11,8);
- (11,8) ends with a 8 (even), and the only edge that starts with 8−1=7 is (7,9), which brings us back where we started, thus forming a cycle.
The edge list is now empty. The key point is that we will have obtained the same two cycles regardless of which edges we chose as starting points (feel free to try it for yourself).
In addition to the dots representing conserved genes between two species, genomic dot-plots contain many spurious dots. As discussed in the main text, even randomly generated strings have shared k-mers that result in "spurious" dots in their genomic dot-plots. Moreover, these spurious k-mers may aggregate into spurious diagonals that must be removed for follow-up analysis of synteny blocks. Since these spurious diagonals are usually short, we filter out short diagonals when constructing synteny blocks.
Our algorithm for constructing synteny blocks, which is based on shared k-mers, does account for mutations. For example, even though the two "genes" ACTGAGTTC and ACTGGGTTC differ from each other by a mutation (A -> G), the genomic dot-plot with k = 3 will reveal that they form a single synteny block; construct this dot-plot and see for yourself!
As specified in the main text:
We color the point (x, y) red if the two genomes share a k-mer at respective positions x and y. We color (x, y) blue if the k-mer starting at position x in the first genome is the reverse complement of the k-mer starting at position y in the second genome.
This definition does not specify what to do with reverse palindromes. A reverse palindrome is a DNA string that is its own reverse complement, such as ACGCGT. If a reverse complement starts at respective positions x and y in two genomes, then (x, y) should technically be colored both red and blue! To address this issue, we will use only red to color points corresponding to reverse palindromes.
Yes. For example, the strings AGCAGGTTATCTACCTGT and AGCAGGAGATAAACCTGT can be transformed into sequences of their 3-mers along with these 3-mers' respective starting positions:
(AGC,0) (GCA,1) (CAG,2) (AGG,3) (GGT,4) (GTT,5) (TTA,6) (TAT,7) (ATC,8) (TCT,9) (CTA,10) (TAC,11), (ACC,12) (CCT,13) (CTG,14) (TGT,15)
(AGC,0) (GCA,1) (CAG,2) (AGG,3) (GGA,4) (GAG,5) (AGA,6) (GAT,7) (ATA,8) (TAA,9) (AAA,10) (AAC,11) (ACC,12) (CCT,13) (CTG,14) (TGT,15)
We now take the reverse complement of each 3-mer of the second string AGCAGGACATAAACCTGT and color them blue:
(GCT,0) (TGC,1) (CTG,2) (CCT,3) (TCC,4) (CTC,5) (TCT,6) (ATC,7) (TAT,8) (TTA,9) (TTT,10) (GTT,11) (GGT,12) (AGG,13) (CAG,14) (ACA,15)
Afterwards, we lexicographically merge and sort all three arrays into a single array:
(AAA,10) (AAC,11) (ACA,15) (ACC,12) (ACC,12) (AGA,6) (AGC,0) (AGC,0) (AGG,3) (AGG,3) (AGG,13) (GTT,11) (ATA,8) (ATC,8) (ATC,7) (CAG,2) (CAG,2) (CAG,14) (CCT,13) (CCT,13) (CCT,3) (CTA,10) (CTC,5) (CTG,14) (CTG,2) (CTG,14) (GAG,5) (GAT,7) (GCA,1) (GCA,1) (GCT,0) (GGA,4) (GGT,4) (GGT,12) (TAA,9) (TAC,11) (TAT,7) (TAT,8) (TCC,4) (TCT,9) (TCT,6) (TGC,1) (TGT,15) (TGT,15) (TTA,6) (TTA,9) (TTT,10)
We will retain only 3-mers that appear as a black 3-mer and a colored 3-mer:
(ACC,12) (ACC,12) (AGC,0) (AGC,0) (AGG,3) (AGG,3) (AGG,13) (ATC,8) (ATC,7) (CAG,2) (CAG,2) (CAG,14) (CCT,13) (CCT,13) (CCT,3) (CTG,14) (CTG,2) (CTG,14) (GCA,1) (GCA,1) (GGT,4) (GGT,12) (GTT,5) (GTT,11) (TAT,7) (TAT,8) (TCT,9) (TCT,6) (TGT,15) (TGT,15) (TTA,6) (TTA,9)
If a 3-mer appears in this list as black and red with positions x and y, then we add a red point (x, y) to the genomic dot-plot (note: in this case x = y). If a 3-mer appears in this list as black and blue with respective positions x and y, then we add a blue point (x, y) to the genomic dot-plot. See the figure below.
(ACC,12) (ACC,12) (12,12)
(AGC,0) (AGC,0) (0,0)
(AGG,3) (AGG,3) (AGG,13) (3,3), (3,13)
(ATC,8) (ATC,7) (8,7)
(CAG,2) (CAG,2) (CAG,14) (2,2), (2,14)
(CCT,13) (CCT,13) (CCT,3) (13,13), (13,3)
(CTG,14) (CTG,2) (CTG,14) (14,2), (14,14)
(GCA,1) (GCA,1) (1,1)
(GGT,4) (GGT,12) (4,12)
(GTT,5) (GTT,11) (5,11)
(TAT,7) (TAT,8) (7,8)
(TCT,9) (TCT,6) (9,6)
(TGT,15) (TGT,15) (15,15)
(TTA,6) (TTA,9) (6,9)
Some species may have synteny blocks present in multiple copies. For example, a chromosome may be represented as follows:
Unfortunately, however, there is no known polynomial algorithm to analyze rearrangements including duplications in multichromosomal genomes.
2-breaks include reversals, but not every 2-break is a reversal. For example, one 2-break on the linear chromosome (+a +b +c +d +e) may yield a fission operation, resulting in the linear chromosome (+a +b +e) and the circular chromosome (+c +d).