Data Analysis Course Notes
pandas
Why learn pandas
Common data types
Creating a Series
Series slicing and indexing
Series index and values
Reading external data
Data source: https://www.kaggle.com/new-york-city/nyc-dog-names/data
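# coding=utf-8
import pandas as pd

# Read the CSV file with pandas
df = pd.read_csv("./dogNames2.csv")
# Keep names used more than 800 but fewer than 1000 times.
# Both conditions must hold, so combine them with & (| would match every row).
print(df[(800 < df["Count_AnimalName"]) & (df["Count_AnimalName"] < 1000)])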
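The difference between `&` and `|` in a boolean filter is easy to check on a small frame. The sketch below assumes a made-up five-row table in place of dogNames2.csv; the names and counts are illustrative values, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample standing in for dogNames2.csv (values are made up)
df = pd.DataFrame({
    "Row_Labels": ["NAME_A", "NAME_B", "NAME_C", "NAME_D", "NAME_E"],
    "Count_AnimalName": [1195, 1153, 856, 823, 710],
})

count = df["Count_AnimalName"]

# & keeps rows where both conditions are true: 800 < count < 1000
both = df[(count > 800) & (count < 1000)]

# | keeps rows where at least one condition is true, which here is every row
either = df[(count > 800) | (count < 1000)]

print(both)
print(len(either), len(df))
```

Each condition needs its own parentheses, because `&` and `|` bind more tightly than the comparison operators.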
# coding=utf-8
from pymongo import MongoClient
import pandas as pd

# Connect to the local MongoDB instance and select the douban.tv1 collection
client = MongoClient()
collection = client["douban"]["tv1"]
data = collection.find()

# Flatten each document into a plain dict so that every key becomes a column
data_list = []
for i in data:
    temp = {}
    temp["info"] = i["info"]
    temp["rating_count"] = i["rating"]["count"]
    temp["rating_value"] = i["rating"]["value"]
    temp["title"] = i["title"]
    temp["country"] = i["tv_category"]
    temp["directors"] = i["directors"]
    temp["actors"] = i["actors"]
    data_list.append(temp)
# t1 = data[0]
# t1 = pd.Series(t1)
# print(t1)
df = pd.DataFrame(data_list)
# print(df)
# Show the first rows
print(df.head(1))
# print("*"*100)
# print(df.tail(2))
# Show an overview of df
# print(df.info())
# print(df.describe())
# Split the "info" string of every row on "/" and collect the parts as lists
print(df["info"].str.split("/").tolist())
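The same flattening and `str.split` steps can be tried without a MongoDB server. This is a minimal sketch that assumes two hypothetical in-memory records shaped like the documents above; the titles, `info` strings, and ratings are invented for illustration.

```python
import pandas as pd

# Hypothetical records shaped like the MongoDB documents (values are made up)
records = [
    {"title": "Show A", "info": "2017/China/Drama",
     "rating": {"count": 1200, "value": 8.1}},
    {"title": "Show B", "info": "2016/USA/Comedy/Family",
     "rating": {"count": 300, "value": 7.4}},
]

# Pull the nested rating fields up to the top level, one dict per row
rows = [
    {
        "title": r["title"],
        "info": r["info"],
        "rating_count": r["rating"]["count"],
        "rating_value": r["rating"]["value"],
    }
    for r in records
]
df = pd.DataFrame(rows)

# .str applies the string method to every element of the column
info_parts = df["info"].str.split("/").tolist()
print(info_parts)
```

The rows may split into lists of different lengths, which is why the result is converted to a plain Python list rather than kept as a rectangular array.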
DataFrame
# coding=utf-8
import pandas as pd
df = pd.read_csv("./dogNames2.csv")
# print(df.head())
# print(df.info())
# Sorting a DataFrame
df = df.sort_values(by="Count_AnimalName", ascending=False)
# print(df.head(5))
# Notes on selecting rows or columns in pandas
# - a slice inside the square brackets selects rows
# - a string inside the square brackets selects a column by its label
print(df[:20])
print(df["Row_Labels"])
print(type(df["Row_Labels"]))
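The row-versus-column rule can be verified on a tiny frame. The sketch below reuses the column names from the script above, with made-up values in place of the CSV.

```python
import pandas as pd

# Hypothetical data reusing the column names from dogNames2.csv
df = pd.DataFrame({
    "Row_Labels": ["A", "B", "C", "D"],
    "Count_AnimalName": [5, 20, 1, 12],
})

df_sorted = df.sort_values(by="Count_AnimalName", ascending=False)

top2 = df_sorted[:2]              # slice: first two rows, still a DataFrame
names = df_sorted["Row_Labels"]   # string: one column, returned as a Series

print(top2)
print(type(top2), type(names))
```

`sort_values` returns a new DataFrame and keeps the original index labels, so the sorted rows still carry the labels 1, 3, 0, 2.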
Indexing data
loc
iloc
Boolean indexing
String methods
Handling missing data
#### Concatenating numpy arrays
- np.hstack((t1,t2)) # the arrays are passed together as one tuple
- np.vstack((t1,t2))
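A short sketch of both calls, assuming two made-up 2x3 arrays named `t1` and `t2` as in the bullets above:

```python
import numpy as np

t1 = np.arange(6).reshape(2, 3)       # [[0 1 2], [3 4 5]]
t2 = np.arange(6, 12).reshape(2, 3)   # [[6 7 8], [9 10 11]]

# hstack joins side by side: the row count must match
h = np.hstack((t1, t2))

# vstack joins top to bottom: the column count must match
v = np.vstack((t1, t2))

print(h.shape, v.shape)
```

Both functions take a single sequence of arrays, which is why the arrays are wrapped in a tuple.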
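#### Creating a Series, indexing and slicing
- pd.Series([])
- pd.Series({}) # the dict keys become the Series index
- s1["a"]
- `s1[["a","c"]]`
- `s1[1]` # by position; on a Series with string labels, recent pandas versions expect `s1.iloc[1]`
- `s2[[1,5,3]]`
- s2[4:10]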
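The bullets above can be run end to end as follows. This sketch assumes `s1` has string labels and `s2` has the default integer index; both are made-up examples. It uses `.iloc` for positional access, since plain integer lookup on a string-labelled Series is deprecated in recent pandas releases.

```python
import pandas as pd

# From a dict: the keys become the index
s1 = pd.Series({"a": 10, "b": 20, "c": 30})
# From a sequence: the index defaults to 0, 1, 2, ...
s2 = pd.Series(range(100, 112))

one_label = s1["a"]             # single label -> scalar
some_labels = s1[["a", "c"]]    # list of labels -> Series
by_position = s1.iloc[1]        # explicit positional access
picked = s2[[1, 5, 3]]          # list of index labels, in the order given
window = s2[4:10]               # slice: positions 4 to 9, end excluded

print(one_label, by_position)
print(some_labels)
print(picked)
print(window)
```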
#### Creating a DataFrame, indexing and slicing
- `pd.DataFrame([[],[],[]])` # accepts a 2-D array
- pd.DataFrame({"a":[1,23],"c":[2,3]})
- pd.DataFrame([{},{},{}])
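A minimal sketch of the three constructors, followed by label-based (`loc`) and position-based (`iloc`) selection. The column names, row labels, and values are made up for illustration.

```python
import pandas as pd

# 2-D list: each inner list is a row; column names are supplied separately
df_a = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "c"])

# Dict of lists: each key is a column
df_b = pd.DataFrame({"a": [1, 23], "c": [2, 3]})

# List of dicts: each dict is a row; keys missing from a row become NaN
df_c = pd.DataFrame([{"a": 1, "c": 2}, {"a": 23}])

# Give df_b string row labels so loc and iloc are easy to tell apart
df_b.index = ["x", "y"]
by_label = df_b.loc["y", "a"]      # row label "y", column label "a"
by_position = df_b.iloc[0, 1]      # first row, second column
column_c = df_b.loc[:, "c"]        # all rows of column "c"

print(df_a.shape, by_label, by_position)
print(df_c)
```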
#### Handling missing data in a DataFrame
- 0
  - not every 0 needs to be handled
  - df[df==0] = np.nan
- nan
  - pd.isnull, pd.notnull
  - df.dropna(axis=0,how="any",inplace=True) # how is either "any" or "all"
  - df.fillna(df.mean())
  - df["A"] = df["A"].fillna(df["A"].mean())
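The steps above can be sketched on a small frame. The data is made up, and treating 0 as missing is an assumption that only makes sense when 0 really means "no value" in the dataset at hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 0.0, 3.0], "B": [4.0, np.nan, np.nan]})

# Treat zeros as missing values (assumption: 0 means "no data" here)
df[df == 0] = np.nan

# Boolean mask of missing cells
mask = pd.isnull(df)

# how="any" drops a row with at least one NaN; how="all" only if every cell is NaN
dropped_any = df.dropna(axis=0, how="any")
dropped_all = df.dropna(axis=0, how="all")

# Fill each column's NaN with that column's mean (NaN is skipped by mean())
filled = df.fillna(df.mean())

print(mask)
print(len(dropped_any), len(dropped_all))
print(filled)
```

Without `inplace=True`, `dropna` and `fillna` return new objects, so the original `df` stays available for comparison.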