I am trying to compare the performance of sklearn.neighbors.KernelDensity versus scipy.stats.gaussian_kde for a two-dimensional array.
From this article I see that the bandwidth (bw) is treated differently in each function. The article gives a recipe for setting the correct bw in scipy so that it is equivalent to the one used in sklearn. Basically, it divides the bw by the sample standard deviation. The result is this:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
where x is the sample array I use to obtain the KDE. This works fine in 1D, but I can't make it work in 2D.
Here is the MWE of what I have:
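As a sanity check (my own sketch, not part of the original recipe), the 1D equivalence of the two estimators under this bandwidth conversion can be verified directly. In 1D, scipy's kernel covariance becomes `var(x, ddof=1) * (bw / std)**2 == bw**2`, which is exactly sklearn's isotropic kernel:

```python
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(0.2, 0.2, size=1000)

bw = 0.15
# scipy's bw_method is a multiplier on the sample std (ddof=1), so
# dividing by the std turns it into sklearn's absolute bandwidth.
kde_scipy = stats.gaussian_kde(x, bw_method=bw / x.std(ddof=1))
kde_sklearn = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x[:, None])

point = 0.5
v_scipy = kde_scipy(point)[0]
v_sklearn = np.exp(kde_sklearn.score_samples([[point]]))[0]
```

In 1D the two values agree to floating-point precision.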
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
# Generate random data.
n = 1000
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1,y1))
print('iso1 = ', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(values.T)
# Get KDE value for the point.
iso2 = kernel_sk.score_samples([[x1, y1]])
print('iso2 = ', np.exp(iso2[0]))
(iso2 is shown exponentiated since sklearn returns the log values.)
The results I get for iso1 and iso2 are different, and I can't figure out how to adjust the bandwidth (in either function) to make them equal (as they should be).
Add
In the sklearn chat I was advised (by ep) to scale the values in (x, y) before obtaining the kernel with scipy, so that the results are comparable with those from sklearn.
This is what I did:
# Scale values.
stds = np.asarray(values).std(axis=1)
x_val_sca = np.asarray(values[0]) / stds[0]
y_val_sca = np.asarray(values[1]) / stds[1]
values = [x_val_sca, y_val_sca]
kernel = stats.gaussian_kde(values, bw_method=bw_value)
That is: I scaled both dimensions and then obtained the kernel with scipy, while leaving the line that obtains the kernel in sklearn untouched.
This gives better results, but there are still differences in the kernels obtained:
The red dot is the (x1, y1) point from the code. As can be seen, there are still differences in the shapes of the density estimates, albeit small ones. Perhaps this is the best that can be achieved?
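For completeness, here is a self-contained version of the scaling step above (a sketch; `bw_value` is left undefined in the snippet, so I assume the original 0.15, which seems reasonable since each scaled dimension has unit standard deviation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
values = np.vstack([rng.normal(0.2, 0.2, size=1000),
                    rng.normal(0.2, 0.2, size=1000)])

# Scale each dimension by its own standard deviation.
stds = values.std(axis=1)
values_sca = values / stds[:, None]

# Assumption: after scaling each dimension has unit std, so the sklearn
# bandwidth can be used directly as the scipy factor.
bw_value = 0.15
kernel = stats.gaussian_kde(values_sca, bw_method=bw_value)

# The KDE now lives in scaled coordinates: to evaluate it at an original
# point (x1, y1), pass (x1 / stds[0], y1 / stds[1]) and divide the result
# by stds.prod() (the Jacobian of the change of variables).
density_at_point = kernel((0.5 / stds[0], 0.5 / stds[1]))[0] / stds.prod()
```

Note the Jacobian correction in the last line: without it the scipy density is expressed in scaled coordinates and is not directly comparable to sklearn's value in the original space.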
Solution:
I tried this again some years later and got it to work without needing to rescale the data. The bandwidth value does need some scaling though:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
The evaluations of the two KDEs at the same point are not exactly equal. For example, here is an evaluation at the (x1, y1) point:
iso1 = 0.00984751705005 # Scipy
iso2 = 0.00989788224787 # Sklearn
but I guess it is close enough.
Here is the MWE for the 2D case along with its output which, as far as I can see, looks almost exactly the same:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
# Generate random data.
n = 1000
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
ext_range = [xmin, xmax, ymin, ymax]
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# Bandwidth value.
bw = 0.15
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
# **Bandwidth needs to be scaled to match Sklearn results**
kernel = stats.gaussian_kde(
values, bw_method=bw/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print('iso1 = ', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(values.T)
# Get KDE value for the point. Use exponential since sklearn returns the
# log values.
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]]))
print('iso2 = ', iso2[0])
# Plot
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2)
# Scipy
plt.subplot(gs[0])
plt.title("Scipy", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos = kernel(positions)
kde = np.reshape(k_pos.T, x.shape)
plt.imshow(np.rot90(kde), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6)
# Sklearn
plt.subplot(gs[1])
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos2 = np.exp(kernel_sk.score_samples(positions.T))
kde2 = np.reshape(k_pos2.T, x.shape)
plt.imshow(np.rot90(kde2), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6)
fig.tight_layout()
plt.savefig('KDEs', dpi=300, bbox_inches='tight')
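Why is the match close but not exact? A plausible explanation (my own note, not something from the original exchange): scipy's gaussian_kde multiplies the full sample covariance matrix of the data by the squared bandwidth factor, while sklearn's KernelDensity uses an isotropic kernel with covariance bw² times the identity. Dividing by the flattened sample std only makes the two coincide up to sampling noise, and only when the sample covariance is approximately std² times the identity, which holds here because both dimensions are independent draws with the same spread:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
values = np.vstack([rng.normal(-3., 3., size=n),
                    rng.normal(-3., 3., size=n)])

bw = 0.15
factor = bw / values.std(ddof=1)  # std of the flattened 2*n samples
kernel = stats.gaussian_kde(values, bw_method=factor)

# scipy's kernel covariance is (sample covariance) * factor**2; sklearn's
# isotropic kernel corresponds to bw**2 * identity. The two only agree up
# to sampling noise in the covariance estimate, hence the small residual
# difference between iso1 and iso2.
print(kernel.covariance)
print(bw**2 * np.eye(2))
```

With correlated data, or with very different spreads per dimension, the two kernels would diverge much more, and rescaling each dimension (as in the addendum above) would become necessary.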