0%

机器学习-04

  • 无监督学习
    • 聚类
    • 异常检测
  • 推荐系统
  • 强化学习

学习聚类算法 clustering

聚类

Clustering Algorithm

应用

  • 为相似新闻分类
  • DNA分析
  • 天文数据分析

K-means

算法步骤

  • 随机选取k个位置作为质心(centroid)初始位置
  • 将每个点与每个质心距离进行比较,将其分配给最近的质心,最终形成k个集群
  • 将每个集群的质心位置移动到其所有点的平均值上
  • 重复上述(2),(3)步骤直到所有质心不再移动,达到收敛

特殊情况:

如果一个质心没有分配到任何点,可以:

  • 删除这个质心,变为k - 1个集群
  • 重新随机分配质心位置,下轮继续(对集群数量比较严格)

损失函数

收敛:损失函数不再下降

初始化

  • 随机选择K个训练集样例作为质心初始点位
  • 随机选择K个初始点位

注意:单次运行K-means可能会导致局部最优解,所以要使用不同的初始化多次运行找损失函数的全局最小值

选择聚类数量

Elbow method

选择一个尽量较大但之后更大J下降缓慢的点作为集群数量

缺点:应用场景局限性与人为性

更常用:根据业务需求

栗子:衣服尺码

比较K = 3和K = 5的损失函数

实现代码

实现K-means:

找最近质心的函数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def find_closest_centroids(X, centroids):
"""
Computes the centroid memberships for every example

Args:
X (ndarray): (m, n) Input values
centroids (ndarray): k centroids

Returns:
idx (array_like): (m,) closest centroids

"""

# Set K
K = centroids.shape[0]

# You need to return the following variables correctly
idx = np.zeros(X.shape[0], dtype=int)

### START CODE HERE ###

for i in range(X.shape[0]):
distance = []
for j in range(centroids.shape[0]):
norm_ij = np.linalg.norm(X[i] - centroids[j]) # 默认返回二阶矩阵范数
distance.append(norm_ij)
idx[i] = np.argmin(distance)

### END CODE HERE ###

return idx

调整质心位置的函数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def compute_centroids(X, idx, K):
"""
Returns the new centroids by computing the means of the
data points assigned to each centroid.

Args:
X (ndarray): (m, n) Data points
idx (ndarray): (m,) Array containing index of closest centroid for each
example in X. Concretely, idx[i] contains the index of
the centroid closest to example i
K (int): number of centroids

Returns:
centroids (ndarray): (K, n) New centroids computed
"""

# Useful variables
m, n = X.shape

# You need to return the following variables correctly
centroids = np.zeros((K, n))

### START CODE HERE ###

for k in range(K):
points = X[idx == k]
centroids[k] = np.mean(points, axis = 0)

### END CODE HERE ##

return centroids

K-means运行的核心函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def run_kMeans(X, initial_centroids, max_iters=10, plot_progress=False):
"""
Runs the K-Means algorithm on data matrix X, where each row of X
is a single example
"""

# Initialize values
m, n = X.shape
K = initial_centroids.shape[0]
centroids = initial_centroids
idx = np.zeros(m)

# Run K-Means
for i in range(max_iters):
#Output progress
print("K-Means iteration %d/%d" % (i, max_iters-1))

# For each example in X, assign it to the closest centroid
idx = find_closest_centroids(X, centroids)

# Given the memberships, compute new centroids
centroids = compute_centroids(X, idx, K)
plt.show()
return centroids, idx

随机初始化K(以随机样例点为起点):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def kMeans_init_centroids(X, K):
"""
This function initializes K centroids that are to be
used in K-Means on the dataset X

Args:
X (ndarray): Data points
K (int): number of centroids/clusters

Returns:
centroids (ndarray): Initialized centroids
"""

# Randomly reorder the indices of examples
randidx = np.random.permutation(X.shape[0])

# Take the first K examples as centroids
centroids = X[randidx[:K]]

return centroids

调用:

1
2
3
4
5
6
7
8
9
10
11
# Load an example dataset
X = load_data()

# Set initial centroids
K = 3
initial_centroids = kMeans_init_centroids(X, K)

# Number of iterations
max_iters = 10

centroids, idx = run_kMeans(X, initial_centroids, max_iters, plot_progress=True)

实际栗子

以图像压缩为案例

将RGB图像压缩到只有16种颜色

任务:针对128*128的彩色图像,以每个像素作为数据集,使用K-means算法找16种最优颜色进行相近颜色替代,从而做到图像压缩且画质依然优秀。

读取图像:

1
2
3
4
5
# Load an image of a bird
original_img = plt.imread('bird_small.png')

# Visualizing the image
plt.imshow(original_img)

了解参数:

1
2
print("Shape of original_img is:", original_img.shape)
# Shape of original_img is: (128, 128, 3)

数据调整(将128 * 128 * 3变为16384 * 3):

1
2
3
4
# Divide by 255 so that all values are in the range 0 - 1
original_img = original_img / 255

X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 3))

调用K-means算法:

1
2
3
4
5
6
7
8
9
# You should try different values of K and max_iters here
K = 16
max_iters = 10

# Using the function you have implemented above.
initial_centroids = kMeans_init_centroids(X_img, K)

# Run K-Means - this takes a couple of minutes
centroids, idx = run_kMeans(X_img, initial_centroids, max_iters)

还原图像:

1
2
3
4
5
# Represent image in terms of indices
X_recovered = centroids[idx, :] # 将每个像素赋值为所属集群的代表RGB

# Reshape recovered image into proper dimensions
X_recovered = np.reshape(X_recovered, original_img.shape)

画图:

1
2
3
4
5
6
7
8
9
10
11
12
13
# Display original image
fig, ax = plt.subplots(1,2, figsize=(8,8))
plt.axis('off')

ax[0].imshow(original_img*255)
ax[0].set_title('Original')
ax[0].set_axis_off()


# Display compressed image
ax[1].imshow(X_recovered*255)
ax[1].set_title('Compressed with %d colours'%K)
ax[1].set_axis_off()