
Assignment 4 : Part II – Expectation Maximization and K-Means

Submission deadline:

Final version due 10 April, by 11:59pm (extensions are possible as long as a draft is submitted by then).

You may do this in groups of 2 or 3.

NOTE THE “ALTERNATE” OPTION THAT WAS DESCRIBED IN CLASS TODAY -- A BRIEF DESCRIPTION OF IT IS INCLUDED BELOW (SEE “1. ALTERNATIVE VERSION”).

The goal is to familiarize yourself with EM and K-Means, and to help you review them for the exam.

[We are including time estimates with each question, as before; note that the total “fast” time estimate is around 2 hours for a single person working alone quickly, as the assignment involves only a minimal amount of code. If you are taking more than 3 hours in this case, I would strongly encourage you to be practical and use the Google Groups discussion as a resource -- everybody will benefit from this!]

1. [15 marks; ~1.5 hrs] Expectation Maximization: In this question, you will apply EM to cluster the Iris data (see sklearn.datasets.load_iris). [SEE BELOW FOR AN ALTERNATIVE VERSION FOR Q1.] The Iris dataset is a well-known example, with three types of Irises which can be classified using 4-dimensional measurements (sepal length, sepal width, petal length, petal width). Note that the sklearn page has links to visualizations of K-Means clustering examples on this dataset, and you can also see scatterplots of the basic Iris dataset here. A minimal loading snippet is given below for reference.
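
For reference, one minimal way to load the data (the variable names are just illustrative, not prescribed by the assignment):

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    data = iris.data      # shape (150, 4): sepal length, sepal width, petal length, petal width
    labels = iris.target  # shape (150,): the three species, useful only for sanity-checking clusters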

a. [10 marks; ~0.75 hr] 1-dimensional EM.

If you look at the scatterplots linked above, you can see that you can at least partially cluster the Iris dataset using even just one dimension. You can still use the EM algorithm to do this, selecting just one of the measurements and using two or more Gaussians to cluster according to that measurement. (Note that the three classes can potentially be partitioned into two groups, as is visible in some of the scatterplots.)

Implement a 1-D version of the EM algorithm (not using a pre-written sklearn library EM function), and use the scatterplots to help you choose reasonable initializations.

Can you choose a dimension that lets you cluster the data into two clear clusters? Which two flowers get grouped together in that case?

Can you choose a dimension and initial values that let you cluster the data into three clusters? Which dimension did you use, and which two flowers are still a bit hard to differentiate this way? Is this consistent with the scatterplots?

Make sure to clearly indicate the dimensions you used and the resulting means and variances of your clusters after running the EM algorithm. (A starting-point sketch is given after this part.)
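
One possible starting point is the following minimal 1-D EM sketch with two Gaussians (the chosen column, initial values, and fixed iteration count are illustrative assumptions, not part of the assignment):

    import numpy as np
    from sklearn.datasets import load_iris

    x = load_iris().data[:, 2]      # petal length; an assumed choice, any single column works
    K = 2
    mu = np.array([2.0, 5.0])       # illustrative initial means, read off the scatterplots
    var = np.array([1.0, 1.0])      # initial variances
    w = np.full(K, 1.0 / K)         # mixing weights

    def gauss(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    for _ in range(100):
        # E-step: responsibility of each Gaussian for each point, shape (150, K)
        r = w * gauss(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities
        Nk = r.sum(axis=0)
        w = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    print("means:", mu, "variances:", var, "weights:", w)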

b. [5 marks; ~0.75 hr] Multi-dimensional EM.

Extend your work from the previous part to an EM algorithm that clusters according to the full 4-dimensional measurements. Let k = 3 and plot the Sepal data points with a colour coding based on the obtained clusters. More specifically, you can plot the data points in colour, where the RGB colour values correspond to the probability estimates of a data point belonging to each class. (A plotting sketch is given after this part.)

Hint: You can use numpy.linalg.pinv to find the inverse of a matrix. Also, numpy.copy can be used to temporarily save a vector.
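
If you go with full covariance matrices, the density and the colour-coded plot might take roughly this shape (a sketch assuming data is the (150, 4) Iris array and r is the (150, 3) responsibility matrix produced by your EM loop):

    import numpy as np
    import matplotlib.pyplot as plt

    def mvn_pdf(X, mu, cov):
        # Multivariate normal density; pinv tolerates a near-singular covariance
        d = X.shape[1]
        diff = X - mu
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(cov), diff)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

    # Responsibilities lie in [0, 1] and sum to 1 per row, so they work directly as RGB colours
    plt.scatter(data[:, 0], data[:, 1], c=r)   # sepal length vs. sepal width
    plt.xlabel('sepal length (cm)')
    plt.ylabel('sepal width (cm)')
    plt.show()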

Optional/Hints: If you wish, you can compare your results with sklearn by using the same seed to initialize the variables. In order to match the results from sklearn, the following tips may be useful:

● Begin by setting the random seed using np.random.seed, for example: np.random.seed(10).

● Begin with the E-step by computing the soft class assignments randomly as:

    assignments = np.random.rand(data.shape[0], num_classes)
    assignments = assignments / np.sum(assignments, axis=1, keepdims=True)

● Now perform the M-step, and repeat the E and M steps alternately until convergence.

● You may use the following code to generate the sklearn results:

    from sklearn.mixture import GaussianMixture

    gmm = GaussianMixture(
        n_components=num_classes,
        random_state=10,
        init_params="random",
        covariance_type="diag",
        reg_covar=0,
        verbose=2,
        verbose_interval=5)
    gmm.fit(data)
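
The fitted parameters can then be placed next to your own estimates (means_ and covariances_ are standard GaussianMixture attributes):

    print(gmm.means_)        # one row of 4 means per component
    print(gmm.covariances_)  # with covariance_type="diag", one row of 4 variances per component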

1. ALTERNATIVE VERSION

As discussed in class, only a high-level description is provided here, for basic reference.

Instead of implementing EM, you may choose to carry out the EM algorithm “by hand” for 2 iterations. In this case, you must demonstrate it for two cases on the same (Iris) dataset:

Case 1: 2 clusters, 1-dimensional version (choose a dimension as discussed in class)

Case 2: 3 clusters, 4-dimensional version (i.e. the entire dataset), assuming spherical Gaussians

Present your results very clearly.

Again, as discussed in class, “by hand” means that you can write separate bits of code to compute the means and variances, the responsibilities, etc.; a small helper along those lines is sketched below.

A total of 12/15 marks (i.e. in the A range) can be obtained by just doing this.
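
For instance, the responsibility computation for the spherical-Gaussian case might look like this (a minimal sketch; the means mu of shape (3, 4), per-cluster variances var of shape (3,), and weights w are whatever values your current hand iteration holds):

    import numpy as np

    def spherical_responsibilities(X, mu, var, w):
        # E-step for spherical Gaussians: cluster k has covariance var[k] * I
        d = X.shape[1]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        dens = np.exp(-0.5 * sq / var) / (2 * np.pi * var) ** (d / 2)
        r = w * dens
        return r / r.sum(axis=1, keepdims=True)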

2. [10 marks; ~0.5 hrs] K-Means is derived from EM. However, instead of calculating both means and variances, it calculates only the means. What is (are) the assumption(s) that K-Means makes? Using a suitable synthetic dataset, demonstrate how K-Means fails when any one of these assumptions is violated. Use appropriate visualizations to explain this better. (You can use the K-means function available in the sklearn library; a sketch of one such demonstration follows the hint below.)

Hint: examples of such datasets have been discussed in class on several occasions.
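
As one possible illustration (the anisotropic dataset is just one of the failure cases discussed in class), linearly stretching spherical blobs violates K-Means' implicit assumption of isotropic, equal-variance clusters:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Spherical blobs, then a linear stretch: the clusters become elongated,
    # so the nearest-centroid boundaries cut across the true groups
    X, _ = make_blobs(n_samples=450, centers=3, random_state=10)
    X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

    pred = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=pred)
    plt.title('K-Means on anisotropic clusters')
    plt.show()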

