Basics of Cluster Analysis

We run a cluster analysis on the iris dataset.

The iris dataset contains 150 samples: 50 for each of three species. To keep the problem simple, we take three samples from each species.

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.target[0], iris.target[1], iris.target[2])
print(iris.target[50], iris.target[51], iris.target[52])
print(iris.target[100], iris.target[101], iris.target[102])

The output confirms that the selected samples are three each of species 0, species 1, and species 2.

0 0 0
1 1 1
2 2 2

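Retrieve the corresponding data and standardize it.
Note that values standardized with scikit-learn differ from those computed by R's scale function, because the two use different delta degrees of freedom (ddof): StandardScaler divides by the population standard deviation (ddof=0), whereas scale divides by the sample standard deviation (ddof=1).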

from sklearn.preprocessing import StandardScaler

X = [iris.data[0], iris.data[1], iris.data[2],
     iris.data[50], iris.data[51], iris.data[52],
     iris.data[100], iris.data[101], iris.data[102]]

sc = StandardScaler()
X_sc = sc.fit_transform(X)
X_sc

This yields the following standardized data.

array([[-1.04454175, 1.73925271, -1.35065657, -1.30321735],
[-1.27106888, -0.63245553, -1.35065657, -1.30321735],
[-1.497596 , 0.31622777, -1.40444378, -1.30321735],
[ 1.10746595, 0.31622777, 0.42432131, 0.14778753],
[ 0.42788457, 0.31622777, 0.31674689, 0.26870461],
[ 0.99420239, -0.15811388, 0.53189573, 0.26870461],
[ 0.31462101, 0.79056942, 1.12355502, 1.47787534],
[-0.25169681, -2.05548048, 0.63947014, 0.7523729 ],
[ 1.22072952, -0.63245553, 1.06976781, 0.99420705]])
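
The ddof difference mentioned above can be checked with NumPy alone. The sketch below standardizes the sepal lengths of the nine selected samples (the first feature column) twice: once with ddof=0, which is what StandardScaler does, and once with ddof=1, which corresponds to R's scale. The variable names are illustrative and not part of the original code.

```python
import numpy as np

# Sepal lengths (first feature) of the nine selected iris samples
x = np.array([5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1])

# Population standard deviation (ddof=0), as used by StandardScaler
z_ddof0 = (x - x.mean()) / x.std(ddof=0)

# Sample standard deviation (ddof=1), as used by R's scale
z_ddof1 = (x - x.mean()) / x.std(ddof=1)

print(z_ddof0[:3])
print(z_ddof1[:3])

# The two results differ only by the constant factor sqrt((n - 1) / n)
n = len(x)
print(np.allclose(z_ddof1, z_ddof0 * np.sqrt((n - 1) / n)))
```

The ddof=0 values reproduce the first column of the array shown above, and the ddof=1 values are smaller in magnitude by a factor of sqrt(8/9).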

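Computing the distance between samples

We compute the Euclidean distance between the first sample $x_{1}$ and the second sample $x_2$.
The Euclidean distance is the square root of the sum of the squared differences between corresponding components.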
$$
d = \sqrt{(x_{1,1}-x_{2,1})^2 + (x_{1,2}-x_{2,2})^2 + (x_{1,3}-x_{2,3})^2 + (x_{1,4}-x_{2,4})^2}
$$

# Compute the Euclidean distance between X_sc[0] and X_sc[1]
import numpy as np
np.sqrt((X_sc[0][0] - X_sc[1][0]) ** 2 +
        (X_sc[0][1] - X_sc[1][1]) ** 2 +
        (X_sc[0][2] - X_sc[1][2]) ** 2 +
        (X_sc[0][3] - X_sc[1][3]) ** 2)

2.3825017395837125

With scikit-learn, the distances between all pairs of samples can be computed as follows.

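# Compute the pairwise Euclidean distances
# (older scikit-learn releases expose DistanceMetric under sklearn.neighbors)
from sklearn.metrics import DistanceMetric
dist = DistanceMetric.get_metric('euclidean')
dist.pairwise(X_sc)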

array([[0. , 2.38250174, 1.49437319, 3.45139085, 3.0731437 ,
3.71098632, 4.07474207, 4.81815926, 4.67900277],
[2.38250174, 0. , 0.9768355 , 3.43706118, 3.00626276,
3.37215001, 4.2890106 , 3.35412806, 4.16481359],
[1.49437319, 0.9768355 , 0. , 3.49802011, 3.02339402,
3.55730355, 4.19933149, 3.9471889 , 4.43724864],
[3.45139085, 3.43706118, 3.49802011, 0. , 0.69858718,
0.51383054, 1.76399106, 2.80787035, 1.43033416],
[3.0731437 , 3.00626276, 3.02339402, 0.69858718, 0. ,
0.76941854, 1.53325205, 2.53474183, 1.61925829],
[3.71098632, 3.37215001, 3.55730355, 0.51383054, 0.76941854,
0. , 1.78156825, 2.32331059, 1.04497594],
[4.07474207, 4.2890106 , 4.19933149, 1.76399106, 1.53325205,
1.78156825, 0. , 3.0300838 , 1.75580771],
[4.81815926, 3.35412806, 3.9471889 , 2.80787035, 2.53474183,
2.32331059, 3.0300838 , 0. , 2.10634259],
[4.67900277, 4.16481359, 4.43724864, 1.43033416, 1.61925829,
1.04497594, 1.75580771, 2.10634259, 0. ]])

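Other ways to measure distance include the absolute distance (also called city-block or Manhattan distance), the Minkowski distance, and the Mahalanobis distance.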
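
As a rough illustration of these alternatives, the following NumPy sketch implements the three distances directly from their textbook definitions. The function names, the example vectors, and the covariance matrix are all made up for this example; they are not taken from the iris data.

```python
import numpy as np

def manhattan(x, y):
    # Sum of absolute component-wise differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # Generalizes Manhattan (p=1) and Euclidean (p=2) distance
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    # Euclidean distance after rescaling by the inverse covariance matrix
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
cov = np.diag([9.0, 16.0])  # hypothetical covariance matrix

d_manhattan = manhattan(x, y)
d_minkowski_2 = minkowski(x, y, 2)
d_mahalanobis = mahalanobis(x, y, cov)
print(d_manhattan, d_minkowski_2, d_mahalanobis)
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which is one way to sanity-check an implementation.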

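Hierarchical (agglomerative) clustering

We perform hierarchical clustering with scikit-learn.
The affinity parameter (renamed metric in scikit-learn 1.2 and later) specifies how the distance between samples is measured, with Euclidean distance as the default. The linkage parameter specifies which points the distance between two clusters is based on.
The following linkage methods exist.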

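Single linkage, or nearest-neighbor method (single)
Complete linkage, or farthest-neighbor method (complete)
Average linkage (average)
Centroid method (not offered by AgglomerativeClustering)
Ward's method (ward) * With Ward's method, Euclidean distance is the only distance that can be used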
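
To make the difference between these criteria concrete, the sketch below computes each inter-cluster distance for two tiny hand-made clusters. This is a plain NumPy illustration of the definitions, not scikit-learn's implementation; the clusters A and B and the helper names are hypothetical. For Ward's method it shows the increase in the within-cluster sum of squares caused by the merge, which is the quantity the method minimizes (libraries may report a rescaled version of it).

```python
import numpy as np

def pairwise(a, b):
    # Euclidean distance between every point of a and every point of b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def within_ss(cluster):
    # Sum of squared distances from each point to the cluster centroid
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

# Two made-up clusters of two points each
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0]])

D = pairwise(A, B)
single = D.min()       # closest pair
complete = D.max()     # farthest pair
average = D.mean()     # mean over all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
ward_increase = within_ss(np.vstack([A, B])) - within_ss(A) - within_ss(B)

print(single, complete, average, centroid, ward_increase)
```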

Here we use Ward's method.

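# Hierarchical (agglomerative) clustering
from sklearn.cluster import AgglomerativeClustering

# The distance argument is omitted: Ward linkage only works with Euclidean
# distance, which is the default. The argument is named affinity in older
# scikit-learn releases and metric in newer ones.
ac = AgglomerativeClustering(
    linkage='ward',
    distance_threshold=0,  # build the full tree so that distances_ is computed
    n_clusters=None)
ac.fit(X_sc)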

A dendrogram is a way to visualize the distances between the clusters found by hierarchical clustering. It can be drawn by defining the following function.
Reference: https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html

def plot_dendrogram(model, **kwargs):
    # Count the samples under each node of the tree
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

from scipy.cluster.hierarchy import dendrogram

labels = range(1, len(X)+1)
plot_dendrogram(ac, labels=labels)

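As the dendrogram shows, samples 1 to 3 and samples 4 to 6 are grouped into clean clusters, but samples 7 to 9 exhibit a chaining effect, being attached to an existing cluster one at a time, and are not separated well.

Next, we list the distance at each level of the hierarchy.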

# Print each merge step. A negative number is an original sample (1-based);
# a positive number is the cluster formed at that earlier step.
n_samples = len(ac.labels_)
for i, (left, right) in enumerate(ac.children_):
    a = -(left + 1) if left < n_samples else left - n_samples + 1
    b = -(right + 1) if right < n_samples else right - n_samples + 1
    print('%2i %3i %3i %0.3f' % (i + 1, a, b, ac.distances_[i]))

1 -4 -6 0.514
2 -5 1 0.795
3 -2 -3 0.977
4 -9 2 1.630
5 -7 4 1.994
6 -1 3 2.226
7 -8 5 3.139
8 6 7 6.979

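This shows that the first cluster, formed from samples 4 and 6, has a merge distance of 0.514, and that the second cluster, formed from sample 5 and the first cluster, has a merge distance of 0.795.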
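
The first two merge distances can be reproduced by hand. The sketch below assumes the reported distance follows the commonly used Ward formula, sqrt(2 * n_a * n_b / (n_a + n_b)) times the distance between the two cluster centroids; that formula is an assumption about how the values were produced, checked here only against the two numbers above. The vectors are rows 4 to 6 of the standardized data, and ward_distance is a hypothetical helper.

```python
import numpy as np

# Standardized samples 4, 5 and 6 (rows 4 to 6 of X_sc as printed earlier)
x4 = np.array([1.10746595, 0.31622777, 0.42432131, 0.14778753])
x5 = np.array([0.42788457, 0.31622777, 0.31674689, 0.26870461])
x6 = np.array([0.99420239, -0.15811388, 0.53189573, 0.26870461])

def ward_distance(a, b):
    # a and b are clusters given as arrays of shape (n_points, n_features)
    a = np.atleast_2d(a)
    b = np.atleast_2d(b)
    n_a, n_b = len(a), len(b)
    centroid_gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    return np.sqrt(2.0 * n_a * n_b / (n_a + n_b)) * centroid_gap

step1 = ward_distance(x4, x6)                   # samples 4 and 6
step2 = ward_distance(x5, np.vstack([x4, x6]))  # sample 5 joins that cluster

print('%0.3f' % step1)  # 0.514
print('%0.3f' % step2)  # 0.795
```

For two single samples the scaling factor is 1, so the first merge distance is simply the Euclidean distance between samples 4 and 6.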

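Non-hierarchical clustering

Using the same nine samples as in the hierarchical clustering, we now split the data into three clusters with K-Means, a non-hierarchical clustering method.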

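from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs between scikit-learn releases
kms = KMeans(n_clusters=3, n_init=10, random_state=0)
kms.fit(X_sc)

# Print the cluster label assigned to each of the nine samples
for label in kms.labels_:
    print(label)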

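The first species forms a clean cluster, but the second and third species are not separated well.

Next, we cluster all 150 samples. We first run a principal component analysis, keep the first two principal components, and then split the projected data into three clusters with K-Means.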
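
For readers unfamiliar with how K-Means arrives at its labels, here is a minimal NumPy sketch of the basic iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. It is a simplified illustration, not scikit-learn's implementation; the function name, the toy points, and the explicitly supplied initial centroids are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, init_centroids, n_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Toy data: two obvious groups, with a deliberately poor initialization
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels, toy_centroids = kmeans_sketch(toy_points, init_centroids=toy_points[:2])
print(toy_labels)
print(toy_centroids)
```

Because the result depends on the starting centroids, library implementations typically run several initializations and keep the best one, which is what the n_init and random_state arguments of KMeans control.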

# Run PCA and use the first two principal components
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use all of the data
X = iris.data

# Standardize
sc = StandardScaler()
X_sc = sc.fit_transform(X)

# Fit PCA and project onto the first two principal components
pca = PCA(n_components=2)
pca.fit(X_sc)
X_projection = pca.transform(X_sc)

# Cluster with K-Means
# (n_init is set explicitly because its default differs between scikit-learn releases)
kms = KMeans(n_clusters=3, n_init=10, random_state=0)
kms.fit(X_projection)
kms.labels_

The result is shown below. Once again, the second and third species are not separated well.

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,
0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,
0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)
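
The projection step used before K-Means can be sketched with NumPy's singular value decomposition. This is only an illustration of the idea behind PCA, run on randomly generated data rather than iris; pca_project and X_demo are hypothetical names, and the signs of the components may differ from what a library returns.

```python
import numpy as np

def pca_project(X, n_components):
    # Center the data, then take the leading right singular vectors
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each sample along the principal components
    return X_centered @ components.T, components

# Made-up data: 50 samples, 4 features with different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

Z, components = pca_project(X_demo, 2)
print(Z.shape)
print(Z.var(axis=0))  # variance along the first component is the largest
```

The projected columns are uncorrelated and ordered by decreasing variance, which is why keeping only the first two retains as much of the spread in the data as any two-dimensional linear projection can.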

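Finally, we draw a scatter plot of the projected data, with marker and color indicating the assigned cluster.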

%matplotlib inline
from matplotlib import pyplot as plt

ax = plt.subplot()

markers = ['x', 'v', 'o']
colors = ['red', 'green', 'blue']

for i in range(len(X_projection)):
    ax.scatter(X_projection[i][0], X_projection[i][1],
               marker=markers[kms.labels_[i]], color=colors[kms.labels_[i]])
plt.show()