## GMM (Gaussian Mixture Model): Theory

### Mixture density models

A complex probability density function can be constructed as a linear combination of simpler density functions:
$$
p(x \mid \theta) = \sum_{k=1}^{M} a_k\, p_k(x \mid \theta_k), \qquad a_k > 0, \qquad \sum_{k=1}^{M} a_k = 1
$$
A GMM is a special case of the mixture density model in which every component is a Gaussian (normal) density:
$$
p(x) = \sum_{k=1}^{M} a_k\, N(x;\mu_k,\Sigma_k)
$$
The following points are essential for understanding how a GMM model makes predictions (key points, pay attention!):
The $p(x)$ above is a likelihood; written out in full it is $p(x \mid \omega_i)$, where $\omega_i$ denotes the $i$-th class. In other words, training a GMM classifier means learning a separate GMM for each class. For example, in a 10-class classification problem:

- First learn 10 different GMMs, i.e. one likelihood function per class.
- Plug a test sample into each likelihood function to obtain 10 class-conditional density values.
- Multiply each by its class prior $p(\omega_i)$ to obtain the 10 posterior probabilities $p(\omega_i \mid x)$ for that sample.
- Assign the sample to the class with the largest posterior.

All of this prediction rests on Bayes' rule:

$$
P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}
$$
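As a minimal, self-contained sketch of this decision rule (the helper names `gaussian_pdf` and `classify` and the argument layout are illustrative assumptions, not the report's code, which appears in the experiments section below):

```python
import numpy as np


def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x; mu, cov) for a single sample x."""
    cov = np.asarray(cov, dtype=float)
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm


def classify(x, class_params, priors):
    """class_params holds one (alphas, mus, covs) tuple per class, i.e. one learned GMM per class."""
    scores = []
    for (alphas, mus, covs), prior in zip(class_params, priors):
        likelihood = sum(a * gaussian_pdf(x, m, c)       # p(x | omega_i) from the class GMM
                         for a, m, c in zip(alphas, mus, covs))
        scores.append(likelihood * prior)                # times the prior P(omega_i)
    return int(np.argmax(scores))                        # p(x) is common to all classes and drops out
```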
### Maximum-likelihood (parameter) estimation for a GMM

The parameters a GMM has to estimate are:
$$
\theta = (\alpha_1,\dots,\alpha_M,\ \mu_1,\dots,\mu_M,\ \Sigma_1,\dots,\Sigma_M)
$$
The log-likelihood function is:
$$
l(\theta) = \sum_{i=1}^{n} \ln\!\left[\sum_{k=1}^{M} a_k\, N(x_i;\mu_k,\Sigma_k)\right]
$$
Setting its derivatives to zero yields a complicated system of transcendental equations that is hard to solve directly, so the usual way to estimate GMM parameters is the EM algorithm.

### Training set and parameters to learn
- Training set: $D_X = \{X_1,\dots,X_n\}$
- Parameters to learn: $\theta = \{\alpha_k,\mu_k,\Sigma_k\}_{k=1,\dots,M}$

### The problem
We only observe the samples $x_i$; we do not know which Gaussian component generated each one. Let $y_i = k$ mean that $x_i$ was generated by the $k$-th component, and collect these labels into the set $D_Y = \{y_1,\dots,y_n\}$.
The complete data set is then $D = D_X \cup D_Y$, where $D_Y$ is the missing (latent) part.

### EM algorithm

Key point: for $D_Y = \{y_1,\dots,y_n\}$, rather than arbitrarily hard-assigning each training sample to one Gaussian, it is better to compute the probability that each sample was generated by each Gaussian. The code later in this post is built on this idea: `pdfs` is an $m \times k$ matrix whose rows hold, for each of the $m$ samples, the probabilities of having been generated by each of the Gaussians.

**E-step (Expectation):**

$$
P(y_i=k \mid x_i) = \frac{P(y_i=k)\,P(x_i \mid y_i=k)}{P(x_i)} = \frac{a_k\, N(x_i;\mu_k,\Sigma_k)}{\sum\limits_{j=1}^{M} a_j\, N(x_i;\mu_j,\Sigma_j)} \tag{1}
$$
**M-step (Maximization):** every sample contributes to the estimate of every Gaussian's parameters, weighted by the probability that it was generated by that Gaussian:

$$
\hat{a}_k = \frac{1}{n}\sum_{i=1}^{n} P(y_i=k \mid x_i) \tag{2}
$$

$$
\hat{\mu}_k = \frac{\sum\limits_{i=1}^{n} P(y_i=k \mid x_i)\, x_i}{\sum\limits_{i=1}^{n} P(y_i=k \mid x_i)} \tag{3}
$$

$$
\hat{\Sigma}_k = \frac{\sum\limits_{i=1}^{n} P(y_i=k \mid x_i)\,(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^t}{\sum\limits_{i=1}^{n} P(y_i=k \mid x_i)} \tag{4}
$$
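To connect Eqs. (1)-(4) to code, here is a compact vectorized sketch of a single EM update. It is illustrative only (the report's own object-oriented implementation appears later); the name `em_step` and the array layout are assumptions.

```python
import numpy as np


def em_step(X, alpha, mu, cov):
    """One EM update for a GMM: X is (n, d), alpha is (M,), mu is (M, d), cov is (M, d, d)."""
    n, d = X.shape
    M = len(alpha)

    # E-step, Eq. (1): responsibilities W[i, k] = P(y_i = k | x_i)
    W = np.zeros((n, M))
    for k in range(M):
        diff = X - mu[k]
        inv = np.linalg.inv(cov[k])
        norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov[k]))
        maha = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared Mahalanobis distances
        W[:, k] = alpha[k] * np.exp(-0.5 * maha) / norm
    W /= W.sum(axis=1, keepdims=True)

    # M-step, Eqs. (2)-(4)
    Nk = W.sum(axis=0)                                        # effective sample count per component
    alpha_new = Nk / n                                        # Eq. (2)
    mu_new = (W.T @ X) / Nk[:, None]                          # Eq. (3)
    cov_new = np.zeros_like(cov)
    for k in range(M):
        diff = X - mu_new[k]
        cov_new[k] = (W[:, k, None] * diff).T @ diff / Nk[k]  # Eq. (4)
    return alpha_new, mu_new, cov_new
```

Iterating this update until the log-likelihood stops improving is exactly the algorithm summarized next.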
### Algorithm

**Input:** training set $D_X = \{X_1,\dots,X_n\}$, number of Gaussians $M$
**Output:** GMM model parameters $\theta$
1. Randomly initialize $\theta$;
2. **repeat**
   - E-step: use Eq. (1) to estimate the probability that each sample was generated by each Gaussian;
   - M-step: use Eqs. (2)-(4) to re-estimate the model parameters $\theta$;
3. **until** the required convergence precision is reached.

Analogy with the K-means algorithm:
the E-step corresponds to assigning each point to its nearest centroid given the current centroid positions, and the M-step corresponds to moving each centroid according to the points assigned to it.

## Experiments and Code

### Experiment description

1. Implement the GMM algorithm in Python. The algorithm must be coded independently; calling ready-made functions from existing libraries or toolboxes is not allowed.
2. Test the program's correctness on simulated data: two classes of 2-D data with 1000 training samples each, where each class is sampled from a Gaussian mixture with 2 components. Use each class's training samples to learn the GMM parameters modelling that class's conditional density and compare them with the true values. Then, assuming equal class priors, classify the test samples with the learned parameters and report the classification accuracy.
3. MNIST test: MNIST-Train-Samples.csv contains 30000 training samples of handwritten digits with 17-dimensional features, and MNIST-Train-Labels.csv contains their labels. Learn GMM parameters for each digit class from that class's samples. Use the classifier formed by the 10 class GMMs to classify the 10000 samples in MNIST-Test-Samples.csv, compare the results with the labels in MNIST-Test-Labels.csv, and compute the recognition accuracy. Try different values of the GMM hyperparameter (the number of Gaussians) and measure the accuracy of each resulting classifier.

### Program code

#### Imports and utility functions

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import pandas as pd
import random
import warnings
import math
import copy

warnings.filterwarnings("ignore")


def plot_cov_ellipse(cov, pos, nstd=2, ax=None):
    """Draw an ellipse covering nstd standard deviations of a 2-D Gaussian."""
    def eigsorted(cov):
        vals, vecs = np.linalg.eigh(cov)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:, order]

    if ax is None:
        ax = plt.gca()
    vals, vecs = eigsorted(cov)
    theta = np.degrees(np.arctan2(*vecs[:, 0][::-1]))
    width, height = 2 * nstd * np.sqrt(vals)
    ellip = Ellipse(xy=pos, width=width, height=height,
                    facecolor='#c8e0e4', angle=theta)
    ax.add_artist(ellip)
    return ellip


def plot(data, mu, covariance, class_label):
    """Scatter the data and overlay one covariance ellipse per mixture component."""
    plt.scatter(data[:, 0], data[:, 1], c=class_label)
    n_components = len(mu)
    for j in range(n_components):
        plot_cov_ellipse(covariance[j], mu[j])
    plt.show()


def Gaussian(x, mu, covariance):
    """Evaluate the multivariate normal density N(x_i; mu, covariance) for every row of x."""
    ret = []
    for i in x:
        X, mu, covariance = np.array(i), np.array(mu), np.array(covariance)
        ret.append(
            np.exp(-0.5 * np.dot(np.dot((X - mu).T, np.linalg.inv(covariance)), X - mu))
            / math.sqrt(np.linalg.det(2.0 * math.pi * covariance)))
    return np.array(ret)
```
#### Object-oriented GMM implementation

```python
class GMM:
    def __init__(self, X, y, n_components, iteration=40, eps=1e-9):
        self.X = X
        self.y = y
        self.m, self.n = X.shape
        self.n_components = n_components
        self.iteration = iteration
        self.eps = eps
        # Mixture weights, means and covariances to be estimated
        self.alpha = np.ones(n_components) * 1 / n_components
        self.mu = np.random.random((self.n_components, self.n))
        self.covariance = np.zeros((self.n_components, self.n, self.n))
        # Use n_components randomly chosen samples as the initial means
        row_rand_array = np.arange(X.shape[0])
        np.random.shuffle(row_rand_array)
        self.mu = X[row_rand_array[0:n_components]]
        # Hard-assign every sample to its nearest initial mean (squared Euclidean distance)
        distance = np.tile(np.sum(X * X, axis=1).reshape((self.m, 1)), (1, self.n_components)) \
            + np.tile(np.sum(self.mu * self.mu, axis=1).T, (self.m, 1)) \
            - 2 * np.dot(X, self.mu.T)
        orginial_labels = np.argmin(distance, axis=1)
        # Initialize the weights and covariances from this hard partition
        for i in range(self.n_components):
            temp = X[orginial_labels == i, :]
            self.alpha[i] = temp.shape[0] / self.m
            self.covariance[i, :, :] = np.cov(temp.T)
        # pdfs[i, k] holds alpha_k * N(x_i; mu_k, Sigma_k)
        self.pdfs = np.zeros((self.m, self.n_components))

    def Train(self):
        nowIter = 0
        preLogLikelihood = self.LogLikelihood()
        while nowIter < self.iteration:
            if nowIter % 20 == 0:
                print("Iter:", nowIter)
            self.Expectation()
            self.Maximize()
            nowIter += 1
            logLikelihood = self.LogLikelihood()
            # Stop early once the log-likelihood no longer improves
            if abs(logLikelihood - preLogLikelihood) < self.eps:
                break
            preLogLikelihood = logLikelihood

    def LogLikelihood(self):
        for j in range(self.n_components):
            self.pdfs[:, j] = self.alpha[j] * Gaussian(self.X, self.mu[j], self.covariance[j])
        return np.mean(np.log(np.sum(self.pdfs, axis=1)))

    def Expectation(self):
        # E-step: responsibilities W[i, k] = P(y_i = k | x_i), Eq. (1)
        for k in range(self.n_components):
            self.pdfs[:, k] = self.alpha[k] * Gaussian(self.X, self.mu[k], self.covariance[k])
        self.W = self.pdfs / np.sum(self.pdfs, axis=1).reshape(-1, 1)

    def Maximize(self):
        # M-step: re-estimate alpha, mu and Sigma from the responsibilities, Eqs. (2)-(4)
        self.alpha = np.sum(self.W, axis=0) / np.sum(self.W)
        for k in range(self.n_components):
            self.mu[k] = np.average(self.X, axis=0, weights=self.W[:, k])
            cov = 0
            for i in range(self.m):
                temp = (self.X[i, :] - self.mu[k, :]).reshape(-1, 1)
                cov += self.W[i, k] * np.dot(temp, temp.T)
            self.covariance[k, :, :] = cov / np.sum(self.W[:, k])
```
#### Emu data test

##### Parameter estimation
```python
train_X = pd.read_csv(r'Emu-Train-Samples.csv', header=None)
train_Y = pd.read_csv(r'Emu-Train-Labels.csv', header=None)
train = copy.deepcopy(train_X)
train["label"] = train_Y
train = train.to_numpy()

# Split the training data by class label (third column)
train_1_X = np.array([[i[0], i[1]] for i in train if i[2] == 0])
train_1_Y = np.array([i[2] for i in train if i[2] == 0])
train_2_X = np.array([[i[0], i[1]] for i in train if i[2] == 1])
train_2_Y = np.array([i[2] for i in train if i[2] == 1])

plt.scatter(train_1_X[:, 0], train_1_X[:, 1], c='blue')
plt.scatter(train_2_X[:, 0], train_2_X[:, 1], c='green')

# Fit one 2-component GMM per class
myGMM1 = GMM(train_1_X, train_1_Y, 2)
myGMM1.Train()
myGMM2 = GMM(train_2_X, train_2_Y, 2)
myGMM2.Train()

print("Class_1:")
print("mu:")
print(myGMM1.mu)
print("alpha:")
print(myGMM1.alpha)
print("covariance:")
print(myGMM1.covariance)
print("Class_2:")
print("mu:")
print(myGMM2.mu)
print("alpha:")
print(myGMM2.alpha)
print("covariance:")
print(myGMM2.covariance)
```
##### Prediction
```python
test_X = pd.read_csv(r'Emu-Test-Samples.csv', header=None).to_numpy()
test_Y = pd.read_csv(r'Emu-Test-Labels.csv', header=None).to_numpy().reshape(-1)


def Likelihood(X, alpha, mu, covariance, n_components):
    # Class-conditional density p(x | omega): weighted sum of the component densities
    pdfs = np.zeros((len(X), n_components))
    for k in range(n_components):
        pdfs[:, k] = alpha[k] * Gaussian(X, mu[k], covariance[k])
    return np.sum(pdfs, axis=1)


pdfs_1 = Likelihood(test_X, myGMM1.alpha, myGMM1.mu, myGMM1.covariance, len(myGMM1.alpha))
pdfs_2 = Likelihood(test_X, myGMM2.alpha, myGMM2.mu, myGMM2.covariance, len(myGMM2.alpha))

# Equal class priors, so posteriors can be compared via the likelihoods directly
predict = []
for i in range(len(test_X)):
    if pdfs_1[i] > pdfs_2[i]:
        predict.append(0)
    else:
        predict.append(1)
accuracy = sum(predict == test_Y) / len(predict)
print(accuracy)
```
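A practical caveat: with higher-dimensional features (such as the 17-dimensional MNIST features used next), products of small densities can underflow to zero. A hedged sketch of a log-domain variant of `Likelihood` using the log-sum-exp trick (not part of the original report; `log_gaussian` and `log_likelihood_score` are illustrative names):

```python
def log_gaussian(X, mu, cov):
    """log N(x; mu, cov) evaluated row-wise, staying in the log domain."""
    X = np.asarray(X, dtype=float)
    diff = X - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    maha = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return -0.5 * (maha + logdet)


def log_likelihood_score(X, alpha, mu, covariance, n_components):
    """log p(x | omega) via log-sum-exp over the mixture components, avoiding underflow."""
    log_terms = np.column_stack(
        [np.log(alpha[k]) + log_gaussian(X, mu[k], covariance[k])
         for k in range(n_components)])
    m = log_terms.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_terms - m).sum(axis=1, keepdims=True))).ravel()
```

Comparing these log-scores across classes gives the same decisions as comparing the raw likelihoods whenever the latter do not underflow.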
#### MNIST data set test

```python
train_mnist_X = pd.read_csv(r'MNIST-Train-Samples.csv', header=None)
train_mnist_Y = pd.read_csv(r'MNIST-Train-Labels.csv', header=None)
test_mnist_X = pd.read_csv(r'MNIST-Test-Samples.csv', header=None)
test_mnist_Y = pd.read_csv(r'MNIST-Test-Labels.csv', header=None)

train_mnist = train_mnist_X
train_mnist["label"] = train_mnist_Y
train_mnist = train_mnist.to_numpy()
test_mnist_X = test_mnist_X.to_numpy()
test_mnist_Y = test_mnist_Y.to_numpy().reshape(-1)


def GmmOfMnist(number):
    # Train one GMM with `number` components per digit class, then classify the
    # test set by the largest class-conditional likelihood (equal priors assumed)
    alphas = []
    mus = []
    covariances = []
    pdfs = []
    for index in range(0, 10):
        print("GMM", number, " ", "class", index)
        nowTrain_X = np.array([i[:-1] for i in train_mnist if i[-1] == index])
        nowTrain_Y = [index] * len(nowTrain_X)
        nowGMM = GMM(nowTrain_X, nowTrain_Y, number)
        nowGMM.Train()
        alphas.append(nowGMM.alpha)
        mus.append(nowGMM.mu)
        covariances.append(nowGMM.covariance)
    for i in range(0, 10):
        pdfs.append(Likelihood(test_mnist_X, alphas[i], mus[i], covariances[i], len(alphas[i])))
    result = np.argmax(pdfs, axis=0)
    accuracy_num = sum(result == test_mnist_Y)
    accuracy = accuracy_num / len(result)
    return accuracy_num, accuracy


for i in range(2, 6):
    accuracy_num, accuracy = GmmOfMnist(i)
    print("GMM number:", i)
    print("Correctly classified:", accuracy_num)
    print("Accuracy:", accuracy)
```
### Problems encountered and solutions

Initialization of the Gaussian means: a few of the early training runs converged to poor local optima, with several Gaussians overlapping and covering the same training samples, or one Gaussian completely covering the region of another. This was caused by poor initialization. At first the means were initialized with fixed points; this was later changed to randomly chosen training samples, and an even better approach would be to pick the initial points with K-means clustering, as sketched below.
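A minimal sketch of that suggested K-means-style initialization (not part of the report's code; the helper name `kmeans_init` is illustrative): run a few hard-assignment iterations and use the resulting centroids as the initial means before starting EM.

```python
import numpy as np


def kmeans_init(X, n_components, n_iter=20, seed=0):
    """Simple K-means used only to pick initial GMM means and a hard partition."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_components, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples
        for k in range(n_components):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```

Under this sketch, `centers` would replace the randomly chosen rows assigned to `self.mu` in `GMM.__init__`, and `labels` would play the role of `orginial_labels` when initializing `alpha` and the covariances.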