# Chapter 8: How Did Yeast Become a Wine-Maker?

### (Coursera Week 1)

*decreases*during the course of an experiment?

We are interested in all genes that may be implicated in the diauxic shift. If the expression of a gene decreases during the diauxic shift, then chances are it is still relevant to the diauxic shift. For example, the gene may have to be repressed in order to enable the diauxic shift.

**FarthestFirstTraversal**to solve the

*k*-Means Clustering Problem?

Yes. Simply substitute *d*(*DataPoint*, *Centers*) by *Distortion*(*DataPoint*, *Centers*) in the psesudocode for **FarthestFirstTraversal**.

To remove outliers, we must first define what we mean by an outlier. It turns out that outlier detection is related to clustering, and so it is not easy to define what an outlier is without clustering! In fact, a common approach to defining outliers is to apply the Lloyd algorithm and list as outliers the top *m* points that are the farthest away from their nearest cluster centers. However, the Lloyd algorithm itself is sensitive to outliers, and such outliers may have an impact on the clusters that this algorithm constructs. :)

The running time of the Lloyd algorithm is proportional to the number of iterations that it requires. Furthermore, each iteration requires O(*n***k*) distance computations between all data points and all centers, where *n* is the number of data points and *k* is the number of centers.

**k-means++Initializer** requires O(*n***k*) distance computations, where *n* is the number of data points and *k* is the number of centers. It therefore results in roughly the same running time as a single iteration of the Lloyd algorithm.

To make sense of a gene expression matrix, biologists often rearrange its genes so that similar expression vectors correspond to consecutive rows in the gene expression matrix. Then, they often assign colors to each element of the gene expression matrix according to their expression levels, resulting in a **heatmap** (shown below).

**Figure:** The heatmap of a gene expression matrix. The rows and columns of the gene expression matrix are ordered according to the order of leaves constructed using hierarchical clustering.

*k*clusters are there?

Let {*n*, *k*} denote the number of partitions of *n* points into *k* clusters. We will establish a recurrence relation to compute {*n*, *k*}. To do so, select an arbitrary point *x*, and note that there are two possibilities. First, *x* could belong to a cluster all by itself. In this case, the remaining points form *k* - 1 clusters, and we know that there are {*n*-1, *k*-1} ways to partition them. Second, *x* could belong to a cluster with other points. In this case, there are {*n*-1, *k*} ways to partition the other points into *k* clusters, and then there are *k* different choices for the cluster to which *x* belongs.

Combining these two cases results in the following recurrence relation for all *n* > 1 and all *k* satisfying 1 < *k* < *n*:

*n*,

*k*} = {

*n*-1,

*k*-1} +

*k*· {

*n*-1,

*k*}

As for a base case, note that {*n*, 1} = 1 for all *n *≥ 1 because there is only one way to partition *n* points into a single cluster. By the same reasoning, {*n, n*} = 1 for all *n *≥ 1.

**Exercise Break:** Find a formula for {*n, *2} in terms of *n*.

The numbers {*n*, *k*} are called **Stirling numbers of the second kind**.

Given two vectors *a* = (*a*_{1}, …, *a*_{n}) and *b* = (*b*_{1}, …, *b*_{n}), their **dot product** is defined as the sum of pairwise products of their elements:

*a*·

*b*= Σ

^{n}

_{i=1}

*a*

_{i}·

*b*

_{i}=

*a*

_{1}·

*b*

_{1}+

*a*

_{2}·

*b*

_{2}+ ... +

*a*

_{n}·

*b*

_{n}

Note that the two occurrences of the "·" symbol in this equation mean different things. On the left side, it represents the dot product of vectors, which we have just defined. On the right side, it indicates a normal product of two numbers *a*_{i} and *b*_{i}.

### (Coursera Week 2)

*k*-Means Clustering Problem. Does the

*soft*

*k*-means clustering algorithm converge to an optimal solution?

No, not necessarily.

*k*-means clustering?

Of course, feel free to implement it! And it will likely improve your results.

As with many parametrized algorithms, you should experiment with various values (and check which ones makes sense) as there is no golden rule for determining the stiffness parameter.

In theory, the EM algorithm continues until *Parameters* cease to change. In practice, it is often stopped when the difference between respective values of *Parameters* in the previous iteration and the current iteration of the algorithm becomes small. Alternatively, the EM algorithm can be run for a fixed number of iterations.

*k*, and it is not clear how to select

*k*in advance. Why would we ever select the Lloyd algorithm over hierarchical clustering?

Running time is one issue. The Lloyd algorithm requires O(*n***k*) comparisons, where *n* is the number of data points and *k* is the number of centers. In contrast, the hierarchical clustering algorithm requires O(*n*^{2}) comparisons just to compute the distance matrix.

Also, since the Lloyd algorithm starts with randomly chosen cluster centers, it may yield different clustering results on different runs of the algorithm. Although this lacks consistency, it may have advantages since you can select the best result out of many initializations. With hierarchical clustering, you will obtain the same final tree each time you run the algorithm (up to breaking ties).