The 17 short code fragments below implement the three tasks of Task 1. The output of the code is collected later in this post.
Part 1: Code
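'''
import pandas as pd

# 1. Read the training set from an absolute path (the r prefix keeps the backslashes literal)
df = pd.read_csv(r'C:\Users\DELL\Desktop\动手学习数据分析\hands-on-data-analysis-master\第一单元项目集合\train.csv')
# print(df.head(3))  # print the first three rows
'''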
- Reading in chunks
'''
# 2. Set the chunksize parameter to control how many rows each iteration reads
chunker = pd.read_csv(r'C:\Users\DELL\Desktop\动手学习数据分析\hands-on-data-analysis-master\第一单元项目集合\train.csv', chunksize=5)
for piece in chunker:
    # print(type(piece))  # <class 'pandas.core.frame.DataFrame'>
    print(piece)  # print the data chunk by chunk
'''
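To try out chunksize without the Titanic file, here is a minimal sketch that reads a small in-memory CSV in chunks. The sample data and the names `csv_text`, `chunk_sizes` and `total` are made up for illustration and are not from the course material.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for train.csv: 12 rows, two columns
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(12))

# With chunksize set, read_csv returns an iterator of DataFrames instead of one DataFrame
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=5)

chunk_sizes = []
total = 0
for piece in chunker:
    chunk_sizes.append(len(piece))   # number of rows in this chunk
    total += piece["value"].sum()    # aggregate without holding the whole file at once

print(chunk_sizes)
print(total)
```

The last chunk is simply shorter when the row count is not a multiple of chunksize.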
- Renaming the columns
'''
# 3. Switch to Chinese column names, and print without ellipses
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
df = pd.read_csv(
    r'C:\Users\DELL\Desktop\动手学习数据分析\hands-on-data-analysis-master\第一单元项目集合\train.csv',
    names=['乘客ID', '是否幸存', '舱位等级', '姓名', '性别', '年龄', '兄弟姐妹', '父母子女', '船票', '票价', '客舱', '登船港口'],
    index_col='乘客ID',
    header=0,
)
print(df.head())
'''
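How names, header=0 and index_col work together is easier to see on a tiny file. The sketch below uses a made-up three-column CSV held in memory; `csv_text` and `df_demo` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical three-column file with an English header row
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# header=0 says the first line is the old header, so it is replaced rather than read as data;
# names supplies the new labels; index_col picks one of the new labels as the row index
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    names=["乘客ID", "是否幸存", "舱位等级"],
    index_col="乘客ID",
    header=0,
)
print(df_demo)
```

Without header=0, the English header line would be read in as an ordinary data row.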
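- Basic information, the first row, the last row, and whether values are missing
'''
# 4. Basic information about the data
df.info()
# First row
print(df.head(1))
# Last row
print(df.tail(1))
# Whether each value in the first five rows is missing
print(df.isnull().head())
'''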
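isnull() returns a table of True/False values, so a common follow-up is to count them per column. A minimal sketch on made-up data (`df_demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"年龄": [22.0, np.nan, 26.0], "客舱": [np.nan, "C85", np.nan]})

mask = df_demo.isnull()           # same shape as df_demo, True where a value is missing
missing_per_column = mask.sum()   # True counts as 1, so this is the number of NaN per column

print(missing_per_column)
```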
- Saving to a file
'''
# 5. Save to a file
df.to_csv('train_chinese.csv')
'''
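One detail worth checking is what happens to the index on a save and reload. The sketch below writes to an in-memory buffer instead of a real file; `df_demo`, `buffer` and `reloaded` are hypothetical names.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"票价": [7.25, 71.2833]}, index=pd.Index([1, 2], name="乘客ID"))

buffer = io.StringIO()      # in-memory stand-in for a file on disk
df_demo.to_csv(buffer)      # by default the index is written as the first column
buffer.seek(0)

reloaded = pd.read_csv(buffer)   # no index_col given, so 乘客ID comes back as a normal column
print(list(reloaded.columns))
```

This would explain why 乘客ID shows up as an ordinary column when train_chinese.csv is read back without index_col.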
- The two data structures in pandas
'''
# 6. Series and DataFrame
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
print(example_1)
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
print(example_2)
'''
- Inspecting the columns
'''
# 7. Inspect the columns with df.columns; the result can also be assigned to a variable such as e
df = pd.read_csv(r'train_chinese.csv')
print(df.head(3))
e = df.columns
print(e)
'''
- Reading a specific column
'''
# 8. Read a specific column
df = pd.read_csv(r'train_chinese.csv')
a = df['舱位等级'].head(3)
print(a)
'''
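The two fragments above return different kinds of objects depending on how the column is requested. A minimal sketch on made-up data (`df_demo` is hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({"舱位等级": [3, 1, 3], "性别": ["male", "female", "female"]})

one_column = df_demo["舱位等级"]      # a single label gives a Series
as_frame = df_demo[["舱位等级"]]      # a list of labels gives a DataFrame with one column
labels = list(df_demo.columns)       # df.columns is an Index; list() turns it into plain strings

print(type(one_column).__name__, type(as_frame).__name__, labels)
```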
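- Deleting the part that differs from the test set
'''
# 9. Delete the column in which the test set differs from the training set
# (test_1 is a DataFrame loaded beforehand; the loading code is not part of these notes)
del test_1['a']
'''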
- Hiding unimportant columns
'''
# 10. Hide columns so the remaining ones are easier to read
p = test_1.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1).head(3)
print(p)
'''
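The difference between deleting and hiding comes down to whether the original DataFrame changes. Here is a small sketch on a made-up frame; `test_demo` and its contents are hypothetical, not the real test_1.

```python
import pandas as pd

test_demo = pd.DataFrame({"a": [1, 2], "Name": ["x", "y"], "Age": [20, 30]})

# drop returns a new DataFrame; test_demo itself keeps all of its columns
view = test_demo.drop(["Name", "Age"], axis=1)
cols_after_drop = list(test_demo.columns)

# del removes the column from test_demo in place
del test_demo["a"]
cols_after_del = list(test_demo.columns)

print(list(view.columns), cols_after_drop, cols_after_del)
```

So drop without reassignment only "hides" columns for that one expression, while del is permanent for the object in memory.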
- Filtering
'''
# 11. Row 100, In [67]
a = df.loc[[100], ["舱位等级", "性别"]]
midage = df[(df['年龄'] > 10) & (df['年龄'] < 50)]
midage = midage.reset_index(drop=True)
p = midage.loc[[100], ['舱位等级', '性别']]
rr = midage
print(rr)
print(p)
print(a)
'''
- The loc method
'''
# 12. The loc method, In [69]
tt = midage.loc[[100, 105, 108], ['舱位等级', '姓名', '性别']]
print(tt)
'''
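Why reset_index matters before using loc on the filtered data can be seen on a five-row example. The data and the name `df_demo` below are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "年龄": [5, 30, 60, 20, 45],
    "性别": ["male", "female", "male", "female", "male"],
})

# Combine conditions with & and wrap each one in parentheses
midage = df_demo[(df_demo["年龄"] > 10) & (df_demo["年龄"] < 50)]
labels_before = list(midage.index)       # the original row labels survive the filter

midage = midage.reset_index(drop=True)
labels_after = list(midage.index)        # labels are renumbered from 0

row = midage.loc[[2], ["年龄", "性别"]]   # label 2 now means the third remaining row
print(labels_before, labels_after)
print(row)
```

loc selects by label, not by position, so the same number can point at different rows before and after the reset.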
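- Various kinds of sorting
# 13. Build sample data and sort it
import numpy as np
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'],
                     columns=['d', 'a', 'b', 'c'])
print(frame)
print()
# Sort row labels in ascending order (sort_index returns a new object, so assign the result)
a = frame.sort_index()
print(a)
print()
# Sort column labels in ascending order (axis is not a boolean: 0 means rows, 1 means columns)
b = frame.sort_index(axis=1)
print(b)
# Sort column labels in descending order
c = frame.sort_index(axis=1, ascending=False)
print(c)
# Sort rows by the values in columns a and c
d = frame.sort_values(by=['a', 'c'])
print(d)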
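- A slightly special sort
# 14. Sort the Titanic data by fare and age together (descending)
# text is assumed here to be the Chinese-column data saved in fragment 5
text = pd.read_csv('train_chinese.csv')
a = text.sort_values(by=['票价', '年龄'], ascending=False).head(20)
print(a)
b = text.sort_values(by=['年龄', '票价'], ascending=False).head(20)
print(b)
# Fare first then age, and age first then fare, are two different cases.
# For much more, see the original book 《利用Python进行数据分析》 (Python for Data Analysis), Chapter 5, arithmetic and data alignment.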
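The effect of key order is easier to see on four rows than on 891. A minimal sketch with made-up fares and ages (`text_demo` is hypothetical):

```python
import pandas as pd

text_demo = pd.DataFrame({"票价": [10.0, 10.0, 50.0, 50.0], "年龄": [20, 40, 30, 60]})

# The first key decides the order; the second key only breaks ties within equal first-key values
fare_first = text_demo.sort_values(by=["票价", "年龄"], ascending=False)
age_first = text_demo.sort_values(by=["年龄", "票价"], ascending=False)

# ascending also accepts one flag per key, here fare descending but age ascending
mixed = text_demo.sort_values(by=["票价", "年龄"], ascending=[False, True])

print(list(fare_first.index), list(age_first.index), list(mixed.index))
```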
- Adding DataFrames
# 15. Compute the result of adding two DataFrames
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        index=['one', 'two', 'three'],
                        columns=['a', 'b', 'c'])
print(frame1_a)
# Same frame again, with the keyword arguments in a different order
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
print(frame1_a)
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])
print(frame1_b)
p = frame1_a + frame1_b
print(p)
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['o', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])  # sneaky change here: column 'a' became 'o'
print(frame1_b)
p = frame1_a + frame1_b
print(p)
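The many NaN values come from label alignment: a cell gets a number only when both frames have that row and that column. As a sketch of one alternative (not something the original notes use), the add method with fill_value treats a value missing on one side as 0. The names `frame_a`, `frame_b`, `plain` and `filled` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

frame_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                       index=["one", "two", "three"], columns=["a", "b", "c"])
frame_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                       index=["first", "one", "two", "second"], columns=["a", "e", "c"])

plain = frame_a + frame_b                     # NaN wherever a label exists in only one frame
filled = frame_a.add(frame_b, fill_value=0)   # a value missing on one side is treated as 0

print(plain)
print(filled)
```

A cell that is missing from both frames, such as row three and column e, stays NaN even with fill_value.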
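- An interesting point (thinking logically about the data)
# 16. Compute the maximum
# Worth thinking through: siblings/spouses plus parents/children is the number of relatives
# each passenger had aboard, so the maximum is the largest family group in the data
print(max(text['兄弟姐妹'] + text['父母子女']))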
- The describe() function! It feels really cool!
# 17. Use describe() to see basic summary statistics
frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'],
                      columns=['one', 'two'])
print(frame2)
print(frame2.describe())
print()
pp = text['票价'].describe()
print(pp)
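To check what describe() reports, the same statistics can be computed by hand on the small frame. This is only a sketch; `summary`, `count_one`, `mean_one` and `median_two` are names introduced here.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                      index=["a", "b", "c", "d"], columns=["one", "two"])
summary = frame2.describe()

# The same numbers computed directly; NaN entries are skipped in every statistic
count_one = frame2["one"].notna().sum()      # count ignores missing values
mean_one = frame2["one"].mean()              # mean over the 3 non-missing values
median_two = frame2["two"].quantile(0.5)     # the 50% row is the median

print(summary.loc["count", "one"], count_one)
print(summary.loc["mean", "one"], mean_one)
print(summary.loc["50%", "two"], median_two)
```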
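Part 2: Output
The output console had been cleared, so I could not recover everything. What follows is part of the output. The environment I use is Spyder (Python 3.7).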
first   NaN  NaN   NaN  NaN
one     3.0  NaN   7.0  NaN
second  NaN  NaN   NaN  NaN
three   NaN  NaN   NaN  NaN
two     9.0  NaN  13.0  NaN

          o     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0

          a    b     c    e    o
first   NaN  NaN   NaN  NaN  NaN
one     NaN  NaN   7.0  NaN  NaN
second  NaN  NaN   NaN  NaN  NaN
three   NaN  NaN   NaN  NaN  NaN
two     NaN  NaN  13.0  NaN  NaN

    one   two
a  1.40   NaN
b  7.10  -4.5
c   NaN   NaN
d  0.75  -1.3

            one        two
count  3.000000   2.000000
mean   3.083333  -2.900000
std    3.493685   2.262742
min    0.750000  -4.500000
25%    1.075000  -3.700000
50%    1.400000  -2.900000
75%    4.250000  -2.100000
max    7.100000  -1.300000

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64

     乘客ID  是否幸存  舱位等级  姓名                                     性别    年龄  兄弟姐妹  父母子女  船票         票价      客舱         登船港口
630  631     1         1         Barkworth, Mr. Algernon Henry Wilson       male    80.0  0         0         27042        30.0000   A23          S
851  852     0         3         Svensson, Mr. Johan                        male    74.0  0         0         347060       7.7750    NaN          S
493  494     0         1         Artagaveytia, Mr. Ramon                    male    71.0  0         0         PC 17609     49.5042   NaN          C
 96  97      0         1         Goldschmidt, Mr. George B                  male    71.0  0         0         PC 17754     34.6542   A5           C
116  117     0         3         Connors, Mr. Patrick                       male    70.5  0         0         370369       7.7500    NaN          Q
745  746     0         1         Crosby, Capt. Edward Gifford               male    70.0  1         1         WE/P 5735    71.0000   B22          S
672  673     0         2         Mitchell, Mr. Henry Michael                male    70.0  0         0         C.A. 24580   10.5000   NaN          S
 33  34      0         2         Wheadon, Mr. Edward H                      male    66.0  0         0         C.A. 24579   10.5000   NaN          S
 54  55      0         1         Ostby, Mr. Engelhart Cornelius             male    65.0  0         1         113509       61.9792   B30          C
456  457     0         1         Millet, Mr. Francis Davis                  male    65.0  0         0         13509        26.5500   E38          S
280  281     0         3         Duane, Mr. Frank                           male    65.0  0         0         336439       7.7500    NaN          Q
438  439     0         1         Fortune, Mr. Mark                          male    64.0  1         4         19950        263.0000  C23 C25 C27  S
545  546     0         1         Nicholson, Mr. Arthur Ernest               male    64.0  0         0         693          26.0000   NaN          S
275  276     1         1         Andrews, Miss. Kornelia Theodosia          female  63.0  1         0         13502        77.9583   D7           S
483  484     1         3         Turkula, Mrs. (Hedwig)                     female  63.0  0         0         4134         9.5875    NaN          S
829  830     1         1         Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0  0         0         113572       80.0000   B28          NaN
252  253     0         1         Stead, Mr. William Thomas                  male    62.0  0         0         113514       26.5500   C87          S
555  556     0         1         Wright, Mr. George                         male    62.0  0         0         113807       26.5500   NaN          S
570  571     1         2         Harris, Mr. George                         male    62.0  0         0         S.W./PP 752  10.5000   NaN          S
170  171     0         1         Van der hoef, Mr. Wyckoff                  male    61.0  0         0         111240       33.5000   B19          S
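Part 3: Questions and Confusions
1. Why does the path need an r in front of it, and what is the difference between a relative path and an absolute path?
2. Why do I have to write print(df.head(3)), that is, why does nothing show up unless it goes through print?
3. A takeaway from reading the PDF book that resonates with the data competition I have been working on lately: "In projects and at work, when you run into a problem you have not seen before, look things up, use Google, and understand the business logic and the inputs and outputs."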
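On question 1, a short experiment may help. The sketch below compares a normal string with a raw string and uses the standard-library ntpath module to apply Windows path rules on any platform. The path `C:\new\table.csv` is invented for the demonstration.

```python
import ntpath  # Windows path rules, importable on any platform

plain = 'C:\new\table.csv'   # \n and \t are interpreted as a newline and a tab
raw = r'C:\new\table.csv'    # raw string: every backslash stays a literal backslash
print(len(plain), len(raw))

# Without the r prefix, \U starts a Unicode escape, so a path under C:\Users does not even parse
try:
    eval(r"'C:\Users\DELL'")
    users_path_ok = True
except SyntaxError:
    users_path_ok = False
print(users_path_ok)

# An absolute path starts from the drive root; a relative one is resolved
# against the current working directory
print(ntpath.isabs(r'C:\Users\DELL\Desktop\train.csv'))
print(ntpath.isabs('train_chinese.csv'))
```

As for question 2, a script run as a file does not echo bare expressions the way an interactive console or notebook cell does, which is why the value has to go through print there.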