判別分析の基礎

irisデータセットについて判別分析を行う。

irisデータセットは、下図のようなものである。ここでは、花びらの長さ（sepal length）と、がくの長さ（petal length）の2つの値を使う。

%matplotlib inline
import pandas as pd
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt

# irisデータセットを取得し、targetを含めたDataFrameを作成
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df = pd.concat([df, pd.Series(iris.target, name='target')], axis=1)

# 花びらとがくの長さだけに絞る
df2 = df[[iris.feature_names[0], iris.feature_names[2], 'target']]

# 品種ごとに分ける
a = df2[df2['target'] == 0]
b = df2[df2['target'] == 1]
c = df2[df2['target'] == 2]

x1a = a[iris.feature_names[0]].values
x2a = a[iris.feature_names[2]].values
x1b = b[iris.feature_names[0]].values
x2b = b[iris.feature_names[2]].values
x1c = c[iris.feature_names[0]].values
x2c = c[iris.feature_names[2]].values

# 可視化
ax = plt.subplot()
ax.scatter(x1a, x2a, marker='o', color='red', label=iris.target_names[0])
ax.scatter(x1b, x2b, marker='x', color='blue', label=iris.target_names[1])
ax.scatter(x1c, x2c, marker='v', color='green', label=iris.target_names[2])
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[2])
ax.legend()
plt.show()

平均と分散

import numpy as np

str_format = '[%s]\nがく平均:%f がく分散:%f 花びら平均:%f 花びら分散:%f'
print(str_format % (iris.target_names[0], np.mean(x1a), np.std(x1a), np.mean(x2a), np.std(x2a)))
print(str_format % (iris.target_names[1], np.mean(x1b), np.std(x1b), np.mean(x2b), np.std(x2b)))
print(str_format % (iris.target_names[2], np.mean(x1c), np.std(x1c), np.mean(x2c), np.std(x2c)))

[setosa]
がく平均:5.006000 がく分散:0.348947 花びら平均:1.462000 花びら分散:0.171919
[versicolor]
がく平均:5.936000 がく分散:0.510983 花びら平均:4.260000 花びら分散:0.465188
[virginica]
がく平均:6.588000 がく分散:0.629489 花びら平均:5.552000 花びら分散:0.546348

がくの長さは、品種ごとの平均が近似しており、分散が大きいため、判別に用いるのは難しい。
一方、花びらの長さの分布は平均がずれており（特にsetosaとversicolor）、分散が小さいため、判別に適している。

花びらの長さについて、setosaとversicolorの平均値からの距離を求め、判別を試みる。
距離$d$は、次式で求める。$\bar{x}$は$x$の平均値、$s^2$は標準偏差である。
$$
d = \frac{(x-\bar{x})^2}{s^2}
$$
距離が短い方に判別される。

for i in range(0, 100):
  d1 = ((iris.data[i][2] - np.mean(x2a)) ** 2) / (np.std(x2a)) ** 2
  d2 = ((iris.data[i][2] - np.mean(x2b)) ** 2) / (np.std(x2b)) ** 2
  print('正答:%i 判別結果:%i (%3.4f, %3.4f)' % (iris.target[i], not d1<d2, d1, d2))

正答:0 判別結果:0 (0.1301, 37.7985)
正答:0 判別結果:0 (0.1301, 37.7985)
（中略）
正答:1 判別結果:1 (354.7383, 0.8946)
正答:1 判別結果:1 (312.2697, 0.2662)
（後略）

線形判別分析

versicolorとvirginicaの2品種について、線形判別分析（Linear Discriminant Analysis）を行う。

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

x = pd.concat([b.drop('target', axis=1), c.drop('target', axis=1)], axis=0).values
y = pd.concat([b['target'], c['target']], axis=0).values

lda = LinearDiscriminantAnalysis()
lda.fit(x, y)
print('係数', lda.scalings_)

# グループの平均と係数の線形結合の平均が定数値
cv = np.mean(np.dot(lda.means_, lda.scalings_))
print('定数値', cv)

係数 [[-1.63793744]
[ 3.15236795]]
定数値 5.208752927587185

次の線形判別式が得られた。
$$
z = -1.64x_1 + 3.15x_2 – 5.21
$$

$z$の正負により判別できる。

for i in range(len(x)):
  z = x[i][0] * lda.scalings_[0] + x[i][1] * lda.scalings_[1] - cv
  transform = lda.transform(x[i].reshape(1,-1))
  print('正答:%s 判別結果:%i LDA:%f (%f)' % (y[i], 1 if z<0 else 2, z, transform))

正答:1 判別結果:1 LDA:-1.858186 (-1.858186)
正答:1 判別結果:1 LDA:-1.505897 (-1.505897)
（中略）
正答:1 判別結果:2 LDA:0.258782 (0.258782)
（中略）
正答:2 判別結果:2 LDA:3.386449 (3.386449)
正答:2 判別結果:2 LDA:1.368286 (1.368286)
（後略）

決定境界を描画する。

import numpy as np

x1_min = min(x[:, 0]) - 1
x1_max = max(x[:, 0]) + 1
x2_min = min(x[:, 1]) - 1
x2_max = max(x[:, 1]) + 1

ax = plt.subplot()

# 要素数100を指定して等差数列を作成
x1 = np.linspace(x1_min, x1_max, 100)

# 決定境界の描画
ax.plot(x1, -(x1 * lda.scalings_[0][0] - cv) / lda.scalings_[1][0])

ax.scatter(x1b, x2b, marker='x', color='blue', label=iris.target_names[1])
ax.scatter(x1c, x2c, marker='v', color='green', label=iris.target_names[2])
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[2])
ax.legend()
ax.set_xlim(x1_min, x1_max)
ax.set_ylim(x2_min, x2_max)
plt.show()

決定境界をmatplotlibのpcolormesh()で描くこともできる。

# x1とx2の格子点を生成する
x1 = np.linspace(x1_min, x1_max, 100)
x2 = np.linspace(x2_min, x2_max, 100)
xx1, xx2 = np.meshgrid(x1, x2)
xx = np.c_[xx1.ravel(), xx2.ravel()]

# 生成した格子点を説明変数として予測
y = lda.predict(xx)

ax = plt.subplot()

# 決定境界を描く
ax.pcolormesh(xx1, xx2, y.reshape(xx1.shape))

ax.scatter(x1b, x2b, marker='x', color='blue', label=iris.target_names[1])
ax.scatter(x1c, x2c, marker='v', color='green', label=iris.target_names[2])
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[2])
ax.legend()
ax.set_xlim(x1_min, x1_max)
ax.set_ylim(x2_min, x2_max)
plt.show()

ロジスティック回帰

versicolorとvirginicaの2品種について、ロジスティック回帰（Logistic Regression）を行う。

from sklearn.linear_model import LogisticRegression

x = pd.concat([b.drop('target', axis=1), c.drop('target', axis=1)], axis=0).values
y = pd.concat([b['target'], c['target']], axis=0).values

lr = LogisticRegression()
lr.fit(x, y)
print('係数', lr.coef_[0])
print('定数値', lr.intercept_[0])

係数 [-0.48774364 3.75728302]
定数値 -15.290837735364322

下記の回帰式が得られた。
$$
z = -0.49x_1 + 3.76x_2 + (-15.29)
$$
この$z$をシグモイド関数を使用して0～1の範囲に変換し、二値分類では0.5を閾値にして分類する。

def sigmoid(x):
  # np.eはネイピア数
  return 1 / (1 + np.e ** -x)

for i in range(len(x)):
  z = x[i][0] * lr.coef_[0][0] + x[i][1] * lr.coef_[0][1] + lr.intercept_[0]
  pred = lr.predict(x[i].reshape(1,-1))
  print('正答:%s 判別結果:%i LR:%f (%i)' % (y[i], 1 if sigmoid(z)<0.5 else 2, sigmoid(z), pred))

正答:1 判別結果:1 LR:0.260030 (1)
正答:1 判別結果:1 LR:0.181737 (1)
（中略）
正答:1 判別結果:2 LR:0.511764 (2)
（中略）
正答:2 判別結果:2 LR:0.984933 (2)
正答:2 判別結果:2 LR:0.739311 (2)
（後略）

決定境界を描画する。

ax = plt.subplot()

# 決定境界の描画
ax.plot(x1, -(x1 * lr.coef_[0][0] + lr.intercept_[0]) / lr.coef_[0][1])

ax.scatter(x1b, x2b, marker='x', color='blue', label=iris.target_names[1])
ax.scatter(x1c, x2c, marker='v', color='green', label=iris.target_names[2])
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[2])
ax.legend()
ax.set_xlim(x1_min, x1_max)
ax.set_ylim(x2_min, x2_max)
plt.show()