Bioinformatics Algorithms: An Active Learning Approach

Chapter 8: How Did Yeast Become a Wine-Maker?

(Coursera Week 1)

Why are we interested in analyzing genes whose expression significantly decreases during the course of an experiment?

We are interested in all genes that may be implicated in the diauxic shift. If the expression of a gene decreases during the diauxic shift, then chances are it is still relevant to the diauxic shift. For example, the gene may have to be repressed in order to enable the diauxic shift.

Can I modify FarthestFirstTraversal to solve the k-Means Clustering Problem?

Yes. Simply substitute d(DataPoint, Centers) by Distortion(DataPoint, Centers) in the psesudocode for FarthestFirstTraversal.

If outliers present so many challenges for clustering, why don’t we simply remove outliers before running clustering algorithms?

To remove outliers, we must first define what we mean by an outlier. It turns out that outlier detection is related to clustering, and so it is not easy to define what an outlier is without clustering! In fact, a common approach to defining outliers is to apply the Lloyd algorithm and list as outliers the top m points that are the farthest away from their nearest cluster centers. However, the Lloyd algorithm itself is sensitive to outliers, and such outliers may have an impact on the clusters that this algorithm constructs. :)

What is the running time of the Lloyd algorithm?

The running time of the Lloyd algorithm is proportional to the number of iterations that it requires. Furthermore, each iteration requires O(n*k) distance computations between all data points and all centers, where n is the number of data points and k is the number of centers.

Isn’t k-means++Initializer rather slow?

k-means++Initializer requires O(n*k) distance computations, where n is the number of data points and k is the number of centers. It therefore results in roughly the same running time as a single iteration of the Lloyd algorithm.

How do biologists visualize gene expression matrices?

To make sense of a gene expression matrix, biologists often rearrange its genes so that similar expression vectors correspond to consecutive rows in the gene expression matrix. Then, they often assign colors to each element of the gene expression matrix according to their expression levels, resulting in a heatmap (shown below).

Figure: The heatmap of a gene expression matrix. The rows and columns of the gene expression matrix are ordered according to the order of leaves constructed using hierarchical clustering.

How many partitions of a set of points into k clusters are there?

Let {n, k} denote the number of partitions of n points into k clusters. We will establish a recurrence relation to compute {n, k}. To do so, select an arbitrary point x, and note that there are two possibilities. First, x could belong to a cluster all by itself. In this case, the remaining points form k - 1 clusters, and we know that there are {n-1, k-1} ways to partition them. Second, x could belong to a cluster with other points. In this case, there are {n-1, k} ways to partition the other points into k clusters, and then there are k different choices for the cluster to which x belongs.

Combining these two cases results in the following recurrence relation for all n > 1 and all k satisfying 1 < k < n:

{n, k} = {n-1, k-1} + k · {n-1, k}

As for a base case, note that {n, 1} = 1 for all n 1 because there is only one way to partition n points into a single cluster. By the same reasoning, {n, n} = 1 for all n 1.

Exercise Break: Find a formula for {n, 2} in terms of n.

The numbers {n, k} are called Stirling numbers of the second kind.

What is the dot product?

Given two vectors a = (a1, …, an) and b = (b1, …, bn), their dot product is defined as the sum of pairwise products of their elements:

a·b = Σni=1 ai·bi = a1 · b1 + a2 · b2 + ... + an · bn

Note that the two occurrences of the "·" symbol in this equation mean different things. On the left side, it represents the dot product of vectors, which we have just defined. On the right side, it indicates a normal product of two numbers ai and bi.

(Coursera Week 2)

We saw that the Lloyd algorithm does not necessarily converge to an optimal solution to the k-Means Clustering Problem. Does the soft k-means clustering algorithm converge to an optimal solution?

No, not necessarily.

Can we use k-means++Initializer for soft k-means clustering?

Of course, feel free to implement it! And it will likely improve your results.

How do we determine an appropriate stiffness parameter?

As with many parametrized algorithms, you should experiment with various values (and check which ones makes sense) as there is no golden rule for determining the stiffness parameter.

What is the stopping rule for the EM algorithm?

In theory, the EM algorithm continues until Parameters cease to change. In practice, it is often stopped when the difference between respective values of Parameters in the previous iteration and the current iteration of the algorithm becomes small. Alternatively, the EM algorithm can be run for a fixed number of iterations.

In contrast to hierarchical clustering, the Lloyd algorithm is run for a fixed number of clusters k, and it is not clear how to select k in advance. Why would we ever select the Lloyd algorithm over hierarchical clustering?

Running time is one issue. The Lloyd algorithm requires O(n*k) comparisons, where n is the number of data points and k is the number of centers. In contrast, the hierarchical clustering algorithm requires O(n2) comparisons just to compute the distance matrix.

Also, since the Lloyd algorithm starts with randomly chosen cluster centers, it may yield different clustering results on different runs of the algorithm. Although this lacks consistency, it may have advantages since you can select the best result out of many initializations. With hierarchical clustering, you will obtain the same final tree each time you run the algorithm (up to breaking ties).