How do I check the correlation between matching columns of two datasets?

Suppose we have these datasets:

import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

How can I create a correlation matrix in which the y-axis represents "a" and the x-axis represents "b"?

The goal is to see the correlation between the matching columns of the two datasets.

Solution:

This does what you want:

from scipy.stats import pearsonr

# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns = a.columns, index = a.columns)

# Since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a. A more robust way would be to iterate through
# the intersection of the two sets of columns, in case your actual
# DataFrames' columns don't match up.
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # (Pearson r, p-value) for those two Series
    correl = correl_signif[0]  # grab the actual Pearson r value from the result above
    c.loc[col, col] = correl  # locate the diagonal cell for that column and assign the coefficient
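
The "more robust way" mentioned in the comment above can be sketched with pandas alone. This is a minimal sketch, not part of the original answer: it assumes plain Pearson correlation is wanted, and the names `common` and `matched` are illustrative. It uses `Index.intersection` to keep only the shared columns and `DataFrame.corrwith` to correlate each column of `a` with the same-named column of `b`.

```python
import pandas as pd

a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

# Keep only the column labels present in both frames
common = a.columns.intersection(b.columns)

# Pearson correlation of each column in a with the same-named column in b;
# the result is a Series indexed by column label
matched = a[common].corrwith(b[common])
print(matched)
```

Unlike the loop, this returns one value per shared column rather than a mostly empty square frame, and it does not report p-values.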

Edit: well, it did what you wanted until the question was modified. This is easy to change, though:

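c = pd.DataFrame(columns = a.columns, index = a.columns)

# Fill every cell: the column labels of c refer to columns of a and the
# row labels refer to columns of b (use c.T to put a on the rows instead)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl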

c now looks like this:

Out[16]: 
           A          B         C         D          E
A   0.713185  -0.592371 -0.970444  0.487752 -0.0740101
B  0.0306753 -0.0705457  0.488012   0.34686  -0.339427
C  -0.266264 -0.0198347  0.661107  -0.50872   0.683504
D   0.580956  -0.552312 -0.320539  0.384165  -0.624039
E  0.0165272   0.140005 -0.582389   0.12936   0.286023
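
As a possible alternative to the nested loop, the same cross-correlation values can be obtained without SciPy by concatenating the two frames and slicing the combined correlation matrix. This is a sketch under assumptions, not the original answer's method: the top-level labels "a" and "b" and the names `combined`, `full` and `cross` are chosen here for illustration. The result has the columns of `a` on the rows and the columns of `b` on the columns, which is the orientation the question asks for.

```python
import pandas as pd

a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

# Put the frames side by side; keys adds a top column level ("a" or "b")
combined = pd.concat([a, b], axis=1, keys=["a", "b"])

# Pearson correlation of all ten columns against each other
full = combined.corr()

# Keep the block whose rows come from a and whose columns come from b
cross = full.loc["a"]["b"]
print(cross)
```

Here `cross` is a float-valued DataFrame, so it can be passed directly to plotting or rounding functions.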