Pandas Study Notes 02

  1. Basics: https://www.cnblogs.com/HusterX/p/14673631.html
  2. Cleaning: https://www.cnblogs.com/HusterX/p/14673920.html

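Data Cleaning

Targets: empty cells (NaN), data in the wrong format, wrong data, and duplicate rows.
Tip: reload the CSV file before each test run, so that every step starts from the original data.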
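
One reason for the tip above: plain assignment in pandas does not copy a DataFrame, so an in-place operation on the new name also changes the original. A minimal sketch, assuming a small made-up DataFrame (the values are illustrative and not taken from the CSV):

```python
import pandas as pd

# Hypothetical sample data with one empty cell (not the real CSV).
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 300.0]})

alias = df                # same object as df, no data is copied
independent = df.copy()   # separate object with its own data

alias.fillna(0, inplace = True)  # also changes df, because alias is df

print(alias is df)                            # True
print(df["Calories"].isna().sum())            # 0: the original was modified
print(independent["Calories"].isna().sum())   # 1: the copy still has the empty cell
```

Calling copy() once per experiment is a cheaper alternative to reloading the file, as long as the loaded DataFrame itself is never modified.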

import pandas as pd
# Data cleaning means fixing bad data in your data set.
# Bad data could be: ["Empty cells", "Data in wrong format", "Wrong data", "Duplicates"].

# load data
mydata = pd.read_csv("https://www.w3schools.com/python/pandas/dirtydata.csv")
print("\n 00 --- Primary data.")
print(mydata.to_string())

# Empty cells can give you a wrong result when you analyze data.
# One way to deal with empty cells is to remove the rows that contain them. This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
# dropna() returns a new DataFrame with no empty cells.
# By default, the method does not change the original. To change the original, use inplace = True:
# mydata.dropna(inplace = True)
# Note: with inplace = True, the method returns None instead of a new DataFrame, and it removes all rows containing NULL values from the original DataFrame.
# Empty cells are in rows [18, 22].
new_data = mydata.dropna()
print("\n01 --- Remove empty cells.")
print(new_data.to_string())
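
The difference between the two calling styles of dropna() described above can be checked directly. A short sketch, assuming a made-up three-row DataFrame rather than the real CSV:

```python
import pandas as pd

# Hypothetical sample data with one empty cell (not the real CSV).
df = pd.DataFrame({"Pulse": [110, 117, 103], "Calories": [409.1, None, 340.0]})

cleaned = df.dropna()                 # new DataFrame; df keeps all 3 rows
print(len(df), len(cleaned))          # 3 2

result = df.dropna(inplace = True)    # df itself is changed; nothing is returned
print(result is None, len(df))        # True 2
```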


# Another way of dealing with empty cells is to insert a new value instead. This way you do not have to delete entire rows just because of some empty cells. The fillna() method replaces empty cells with a value.
# Replace NULL values with the number 130.
new_data = mydata.copy()  # work on a copy so that mydata stays unchanged
new_data.fillna(130, inplace = True)
print("\n02 --- Replace empty cells with number 130.")
print(new_data.to_string())

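# Replace only in a SPECIFIED column.
# Replace NULL values in the "Calories" column with the number 140.
new_data = mydata.copy()  # work on a copy so that mydata stays unchanged
# Assign the result back: fillna(inplace = True) on a selected column is a chained assignment and may not update new_data in recent pandas versions.
new_data["Calories"] = new_data["Calories"].fillna(140)
print("\n03 --- Replace empty cells in a specified column with number 140")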
print(new_data.to_string())

# Replace using mean, median, or mode.
# Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column.
# Mean = the average value (the sum of all values divided by the number of values).
# Median = the value in the middle, after you have sorted all values in ascending order.
# Mode = the value that appears most frequently.
new_data = mydata.copy()  # work on a copy so that mydata stays unchanged
x = new_data["Calories"].mean()
new_data["Calories"] = new_data["Calories"].fillna(x)
print("\n04 --- Replace empty cells in a specified column with mean()")
print(new_data.to_string())
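
The step above only fills with the mean; the median and the mode can be used the same way. A sketch with made-up Calories values (an assumption for illustration, not the real CSV). Note that mode() returns a Series, because several values can tie for most frequent, so the first entry is taken with [0]:

```python
import pandas as pd

# Hypothetical Calories values with two empty cells.
calories = pd.Series([200.0, 250.0, 200.0, None, 400.0, 450.0, None], name="Calories")

median_value = calories.median()   # middle of the sorted non-empty values
mode_value = calories.mode()[0]    # mode() returns a Series; take the first mode

filled_median = calories.fillna(median_value)
filled_mode = calories.fillna(mode_value)

print(median_value, mode_value)    # 250.0 200.0
```

The median is less sensitive to extreme values than the mean, which may matter when the column still contains wrong data.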

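# Data in the wrong format (rows [22, 26])
# To fix it, you have two options: convert the cells into the correct format, or remove the rows.
# 1. Convert into a correct format
new_data = mydata.copy()  # work on a copy so that mydata stays unchanged
# Note: on pandas 2.0 or later, mixed date formats in one column may raise an error here; passing format='mixed' is one way to handle them.
new_data['Date'] = pd.to_datetime(new_data['Date'])
print("\n05 --- Convert into a Correct Format")
print(new_data.to_string())
# 2. Remove rows: after the conversion, an empty date becomes NaT, which is handled as a NULL value, so the row can be removed with the dropna() method.
# new_data.dropna(subset=['Date'], inplace = True)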
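
The commented-out dropna(subset=['Date']) line above can be tried on a small example. A sketch, assuming a made-up DataFrame in which one date is missing (the real CSV also contains a differently formatted date, which is left out here):

```python
import pandas as pd

# Hypothetical sample data: the second row has no date.
df = pd.DataFrame({"Date": ["2020/12/01", None, "2020/12/03"], "Pulse": [110, 100, 103]})

df["Date"] = pd.to_datetime(df["Date"])   # the missing date becomes NaT
print(df["Date"].isna().tolist())         # [False, True, False]

cleaned = df.dropna(subset=["Date"])      # drop only rows whose Date is NaT
print(cleaned.index.tolist())             # [0, 2]
```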

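# Wrong data (row [7], Duration)
new_data = mydata.copy()  # work on a copy so that mydata stays unchanged
# 1. Replace the value
new_data.loc[7, 'Duration'] = 45
# 2. Loop through all values and replace them according to your own rules (here: cap Duration at 120).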
for x in new_data.index:
    if new_data.loc[x, "Duration"] > 120:
        new_data.loc[x, "Duration"] = 120
        
# 3. Remove the rows instead (an alternative to option 2; after option 2 has run, no Duration exceeds 120 any more)
for x in new_data.index:
    if new_data.loc[x, "Duration"] > 120:
        new_data.drop(x, inplace = True)
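
The two loops above can also be written without explicit iteration. A sketch, assuming the same rule (a Duration above 120 is wrong) and a small made-up DataFrame instead of the CSV:

```python
import pandas as pd

# Hypothetical data: one Duration value is clearly wrong (450).
df = pd.DataFrame({"Duration": [60, 450, 45, 120], "Pulse": [110, 104, 109, 100]})

# Option 2 without a loop: cap every value above 120 at 120.
capped = df.copy()
capped["Duration"] = capped["Duration"].clip(upper=120)

# Option 3 without a loop: drop the rows that break the rule.
# Negating the "> 120" test keeps rows with an empty Duration, like the loop does.
removed = df[~(df["Duration"] > 120)]

print(capped["Duration"].tolist())   # [60, 120, 45, 120]
print(removed.index.tolist())        # [0, 2, 3]
```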
        
# Discovering duplicates (rows [11, 12])
new_data = mydata.copy()  # work on a copy so that mydata stays unchanged
# To discover duplicates, use the duplicated() method, which returns a Boolean value for each row:
# True for every row that is a duplicate of an earlier row, otherwise False.
print(new_data.duplicated())
# Removing duplicates
# To remove duplicates, use the drop_duplicates() method.
new_data.drop_duplicates(inplace = True)
# inplace = True makes sure that the method does NOT return a new DataFrame; instead it removes all duplicates from the DataFrame it is called on.
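
By default duplicated() marks only the later copy of a repeated row, which is why just one of the two rows is reported as True. A sketch on made-up data (not the real CSV) showing the default, the keep=False variant that marks every copy, and drop_duplicates() without inplace:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical.
df = pd.DataFrame({"Duration": [60, 60, 45], "Pulse": [100, 100, 90]})

print(df.duplicated().tolist())             # [False, True, False]
print(df.duplicated(keep=False).tolist())   # [True, True, False]

deduped = df.drop_duplicates()              # returns a new DataFrame; df is unchanged
print(len(df), len(deduped))                # 3 2
```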

Reference: https://www.w3schools.com/python/pandas/pandas_cleaning.asp
Online editor: https://www.programiz.com/python-programming/online-compiler/
