Bioinformatics Algorithms: An Active Learning Approach

Chapter 8: How Did Yeast Become a Wine-Maker?

(Coursera Week 1)

Why are we interested in analyzing genes whose expression significantly decreases during the course of an experiment?

We are interested in all genes that may be implicated in the diauxic shift. If the expression of a gene decreases during the diauxic shift, then chances are it is still relevant to the diauxic shift. For example, the gene may have to be repressed in order to enable the diauxic shift.

Can I modify FarthestFirstTraversal to solve the k-Means Clustering Problem?

Yes. Simply substitute d(DataPoint, Centers) by Distortion(DataPoint, Centers) in the psesudocode for FarthestFirstTraversal.

How do biologists visualize gene expression matrices?

To make sense of a gene expression matrix, biologists often rearrange its genes so that similar expression vectors correspond to consecutive rows in the gene expression matrix. Then, they often assign colors to each element of the gene expression matrix according to their expression levels, resulting in a heatmap (shown below).

Figure: The heatmap of a gene expression matrix. The rows and columns of the gene expression matrix are ordered according to the order of leaves constructed using hierarchical clustering.

How many partitions of a set of points into k clusters are there?

Let {n, k} denote the number of partitions of n points into k clusters. We will establish a recurrence relation to compute {n, k}. To do so, select an arbitrary point x, and note that there are two possibilities. First, x could belong to a cluster all by itself. In this case, the remaining points form k - 1 clusters, and we know that there are {n-1, k-1} ways to partition them. Second, x could belong to a cluster with other points. In this case, there are {n-1, k} ways to partition the other points into k clusters, and then there are k different choices for the cluster to which x belongs.

Combining these two cases results in the following recurrence relation for all n > 1 and all k satisfying 1 < k < n:

{n, k} = {n-1, k-1} + k · {n-1, k}

As for a base case, note that {n, 1} = 1 for all n 1 because there is only one way to partition n points into a single cluster. By the same reasoning, {n, n} = 1 for all n 1.

Exercise Break: Find a formula for {n, 2} in terms of n.

The numbers {n, k} are called Stirling numbers of the second kind.

What is the dot product?

Given two vectors a = (a1, …, an) and b = (b1, …, bn), their dot product is defined as the sum of pairwise products of their elements:

a·b = Σni=1 ai·bi = a1 · b1 + a2 · b2 + ... + an · bn

Note that the two occurrences of the "·" symbol in this equation mean different things. On the left side, it represents the dot product of vectors, which we have just defined. On the right side, it indicates a normal product of two numbers ai and bi.

(Coursera Week 2)

We saw that the Lloyd algorithm does not necessarily converge to an optimal solution to the k-Means Clustering Problem. Does the soft k-means clustering algorithm converge to an optimal solution?

No, not necessarily.