Generating reports in Python with the pandas_profiling library

2021-12-28 01:27:42

Installing pandas_profiling

Install from the command line
pip install pandas_profiling
pip install pandas_profiling==2.10.1   # pin a specific version

Install from the Tsinghua mirror
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas_profiling

Uninstall pandas_profiling
pip uninstall pandas_profiling
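
After installing or uninstalling, it can be useful to confirm from Python which version, if any, is actually present. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical and not part of pandas_profiling.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "pandas-profiling" is the distribution name used on PyPI.
print(installed_version("pandas-profiling"))
```

A result of None means pip has no record of the package in the current environment, which is the expected state right after pip uninstall.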

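Handling errors when installing pandas_profiling

Error:
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Fix: remove the old PyYAML and then install again. Because pip itself cannot uninstall a distutils-installed package, a commonly used workaround is to let pip install over it with pip install --ignore-installed PyYAML, and then rerun the pandas_profiling install.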

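Installing from a mirror (scrapy is used here only as an example package)
pip install -i https://pypi.douban.com/simple scrapy

Commonly used Python package mirrors
Douban (relatively stable and fast)
https://pypi.douban.com/simple

Tsinghua University
https://pypi.tuna.tsinghua.edu.cn/simple

University of Science and Technology of China
https://mirrors.ustc.edu.cn/pypi/web/simple

Alibaba Cloud
https://mirrors.aliyun.com/pypi/simple/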
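
The install commands above all follow the same pattern: pip install, an optional -i mirror URL, and a package name with an optional ==version pin. Below is a minimal sketch that assembles such a command as an argument list. The MIRRORS dictionary keys and the pip_install_args function are hypothetical names chosen for this example; the URLs are the ones listed above.

```python
import sys

# Short names are assumptions for this sketch; the URLs come from the list above.
MIRRORS = {
    "douban": "https://pypi.douban.com/simple",
    "tsinghua": "https://pypi.tuna.tsinghua.edu.cn/simple",
    "ustc": "https://mirrors.ustc.edu.cn/pypi/web/simple",
    "aliyun": "https://mirrors.aliyun.com/pypi/simple/",
}


def pip_install_args(package, version=None, mirror=None):
    """Build the argument list for a pip install, optionally pinned and mirrored."""
    requirement = f"{package}=={version}" if version else package
    # Running pip as "python -m pip" targets the current interpreter's environment.
    args = [sys.executable, "-m", "pip", "install"]
    if mirror is not None:
        args += ["-i", MIRRORS[mirror]]
    args.append(requirement)
    return args


print(" ".join(pip_install_args("pandas_profiling", "2.10.1", "tsinghua")))
```

The returned list can be passed to subprocess.run(args, check=True) to actually perform the install; the sketch only prints the command so that nothing is installed by accident.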

The Python code is as follows:

import os

import pandas as pd
import pandas_profiling

input_dir = r"../test_data"
output_dir = "../test_data"
hospital = "XX"

for path, dir_list, file_list in os.walk(input_dir):
    for file_name in file_list:
        # Restrict the run to one file when profiling a single table.
        if file_name == "XX.csv":
            file_path = os.path.join(path, file_name)
            df = pd.read_csv(file_path)
            # Derive the table name from the file name without its extension.
            # (Matching r'\w+' against "XX.csv" also returns "csv", which
            # would produce a second, spurious csv.html report.)
            table_name = os.path.splitext(file_name)[0].lower()
            # minimal=True gives a lighter report; omit it for the full, more detailed one.
            profile = pandas_profiling.ProfileReport(
                df, title=f"{hospital}{table_name} table data quality report", minimal=True
            )
            profile.to_file(output_file=os.path.join(output_dir, table_name + ".html"))
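
The file-discovery part of the script can be separated from report generation, which makes it easy to check without pandas_profiling installed. The following is a minimal sketch under that assumption; find_csv_tables and its only parameter are hypothetical names, and the table name is taken to be the lower-cased file name without its extension.

```python
import os


def find_csv_tables(root_dir, only=None):
    """Yield (table_name, file_path) for each CSV file under root_dir.

    If `only` is given, just the file with exactly that name is returned,
    which mirrors profiling a single table.
    """
    for path, _dir_list, file_list in os.walk(root_dir):
        for file_name in sorted(file_list):
            stem, ext = os.path.splitext(file_name)
            if ext.lower() != ".csv":
                continue
            if only is not None and file_name != only:
                continue
            yield stem.lower(), os.path.join(path, file_name)


# os.walk yields nothing for a missing directory, so this loop is safe to run.
for table_name, file_path in find_csv_tables("../test_data", only="XX.csv"):
    print(table_name, file_path)
```

Each yielded pair gives one report name and one input path, so the body of the loop is where pd.read_csv and ProfileReport would be called.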

The following is taken from the official Pandas Profiling (v2.11) documentation:

Pandas Profiling

Generates profile reports from a pandas DataFrame.

The pandas df.describe() function is great but a little basic for serious exploratory data analysis.
pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

Type inference: detect the types of columns in a dataframe.
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
Missing values: matrix, count, heatmap and dendrogram of missing values
Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
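
To make a few of the listed statistics concrete, here is a minimal sketch that computes some of them for a single numeric column with plain pandas calls. It only illustrates what the terms mean; it is not how pandas_profiling builds its report, and the sample values and the summary dictionary are made up for this example.

```python
import pandas as pd

# Made-up sample column with one missing value and one outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, None], name="amount")

# quantile() skips missing values by default.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

summary = {
    "missing": int(values.isna().sum()),   # Essentials: missing values
    "distinct": int(values.nunique()),     # Essentials: unique values
    "minimum": values.min(),               # Quantile statistics
    "q1": q1,
    "median": median,
    "q3": q3,
    "maximum": values.max(),
    "iqr": q3 - q1,                        # interquartile range
    "mean": values.mean(),                 # Descriptive statistics
    "std": values.std(),
    "skewness": values.skew(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The outlier pulls the mean far above the median and gives a positive skewness, which is the kind of pattern the report's per-column statistics are meant to surface.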

Announcements

Version v2.10.0rc1 released

v2.10.0rc1 includes a major overhaul of the type system, now fully reliant on visions.
See the changelog to learn what has changed.

Spark backend in progress

We can happily announce that we’re nearing v1 for the Spark backend for generating profile reports.
Stay tuned.

Support `pandas-profiling`

The development of pandas-profiling relies completely on contributions.
If you find value in the package, we welcome you to support the project through GitHub Sponsors!
It’s extra exciting that GitHub matches your contribution for the first year.

January 5, 2021
