Python数据分析pandas入门练习题(八)

Python数据分析基础

Preparation

需要数据集可以自行网上寻找(都是公开的数据集)或私聊博主,传到csdn,你们下载要会员,就不传了。下面数据集链接下载不一定能成功。

Exercise 1- US - Baby Names

Introduction:

We are going to use a subset of US Baby Names from Kaggle.
In the file it will be names from 2004 until 2014

Step 1. Import the necessary libraries

代码如下:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

代码如下:

baby_names = pd.read_csv("US_Baby_Names_right.csv")
baby_names.info()

输出结果如下:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB

Step 4. See the first 10 entries

代码如下:

baby_names.head(10)

输出结果如下:

Unnamed: 0 Id Name Year Gender State Count
0 11349 11350 Emma 2004 F AK 62
1 11350 11351 Madison 2004 F AK 48
2 11351 11352 Hannah 2004 F AK 46
3 11352 11353 Grace 2004 F AK 44
4 11353 11354 Emily 2004 F AK 41
5 11354 11355 Abigail 2004 F AK 37
6 11355 11356 Olivia 2004 F AK 33
7 11356 11357 Isabella 2004 F AK 30
8 11357 11358 Alyssa 2004 F AK 29
9 11358 11359 Sophia 2004 F AK 28

Step 5. Delete the column ‘Unnamed: 0’ and ‘Id’

代码如下:

del baby_names['Id']
# OR del baby_names['Unnamed: 0']
baby_names = baby_names.loc[:, ~baby_names.columns.str.contains('^Unnamed')]
baby_names.head()

输出结果如下:

Name Year Gender State Count
0 Emma 2004 F AK 62
1 Madison 2004 F AK 48
2 Hannah 2004 F AK 46
3 Grace 2004 F AK 44
4 Emily 2004 F AK 41

Step 6. Is there more male or female names in the dataset?

代码如下:

# baby_names['Gender'].value_counts()
baby_names.groupby('Gender').Count.sum()

输出结果如下:

Gender
F    16380293
M    19041199
Name: Count, dtype: int64

Step 7. Group the dataset by name and assign to names

代码如下:

del baby_names["Year"]
names = baby_names.groupby("Name").sum()
names.head()
print(names.shape)
names.sort_values("Count", ascending = 0).head()
# names= baby_names.groupby('Name')
# names.head(1)

输出结果如下:

(17632, 1)
Count
Name
Jacob 242874
Emma 214852
Michael 214405
Ethan 209277
Isabella 204798

Step 8. How many different names exist in the dataset?

代码如下:

len(names)

输出结果如下:

17632

Step 9. What is the name with most occurrences?

代码如下:

# names['Count'].sum().argmax()
names.Count.idxmax()  # idxmax()获取pandas中series最大值对应的索引

输出结果如下:

'Jacob'

Step 10. How many different names have the least occurrences?

代码如下:

len(names[names.Count == names.Count.min()])

输出结果如下:

2578

Step 11. What is the median name occurrence?

代码如下:

names[names.Count == names.Count.median()]

输出结果如下:

Count
Name
Aishani 49
Alara 49
Alysse 49
Ameir 49
Anely 49
Antonina 49
Aveline 49
Aziah 49
Baily 49
Caleah 49
Carlota 49
Cristine 49
Dahlila 49
Darvin 49
Deante 49
Deserae 49
Devean 49
Elizah 49
Emmaly 49
Emmanuela 49
Envy 49
Esli 49
Fay 49
Gurshaan 49
Hareem 49
Iven 49
Jaice 49
Jaiyana 49
Jamiracle 49
Jelissa 49
... ...
Kyndle 49
Kynsley 49
Leylanie 49
Maisha 49
Malillany 49
Mariann 49
Marquell 49
Maurilio 49
Mckynzie 49
Mehdi 49
Nabeel 49
Nalleli 49
Nassir 49
Nazier 49
Nishant 49
Rebecka 49
Reghan 49
Ridwan 49
Riot 49
Rubin 49
Ryatt 49
Sameera 49
Sanjuanita 49
Shalyn 49
Skylie 49
Sriram 49
Trinton 49
Vita 49
Yoni 49
Zuleima 49

66 rows × 1 columns

Step 12. What is the standard deviation of names?

代码如下:

names.Count.std()

输出结果如下:

11006.069467891111

Step 13. Get a summary with the mean, min, max, std and quartiles.

代码如下:

names.describe()

输出结果如下:

Count
count 17632.000000
mean 2008.932169
std 11006.069468
min 5.000000
25% 11.000000
50% 49.000000
75% 337.000000
max 242874.000000

Exercise 2- Wind Statistics

Introduction:

The data have been modified to contain some missing values, identified by NaN.
Using pandas should make this exercise
easier, in particular for the bonus question.

You should be able to perform all of these operations without using
a for loop or other looping construct.

  1. The data in ‘wind.data’ has the following format:
"""
Yr Mo Dy   RPT   VAL   ROS   KIL   SHA   BIR   DUB   CLA   MUL   CLO   BEL   MAL
61  1  1 15.04 14.96 13.17  9.29   NaN  9.87 13.67 10.25 10.83 12.58 18.50 15.04
61  1  2 14.71   NaN 10.83  6.50 12.62  7.67 11.50 10.04  9.79  9.67 17.54 13.83
61  1  3 18.50 16.88 12.33 10.13 11.17  6.17 11.25   NaN  8.50  7.67 12.75 12.71
"""
'\nYr Mo Dy   RPT   VAL   ROS   KIL   SHA   BIR   DUB   CLA   MUL   CLO   BEL   MAL\n61  1  1 15.04 14.96 13.17  9.29   NaN  9.87 13.67 10.25 10.83 12.58 18.50 15.04\n61  1  2 14.71   NaN 10.83  6.50 12.62  7.67 11.50 10.04  9.79  9.67 17.54 13.83\n61  1  3 18.50 16.88 12.33 10.13 11.17  6.17 11.25   NaN  8.50  7.67 12.75 12.71\n'

The first three columns are year, month and day. The
remaining 12 columns are average windspeeds in knots at 12
locations in Ireland on that day.

More information about the dataset go here.

Step 1. Import the necessary libraries

代码如下:

import pandas as pd
import datetime

Step 2. Import the dataset from this address

Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.

代码如下:

data = pd.read_table('wind.data', sep='\s+', parse_dates = [[0, 1, 2]])
data.head()

输出结果如下:

Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.

代码如下:

def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
data.head()

输出结果如下:

Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Step 5. Set the right dates as the index. Pay attention at the data type, it should be datetime64[ns].

代码如下:

data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])  # 转换为datetime64
data = data.set_index('Yr_Mo_Dy')
data.head()

输出结果如下:

RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Step 6. Compute how many values are missing for each location over the entire record.

They should be ignored in all calculations below.

代码如下:

data.isnull().sum()

输出结果如下:

RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

Step 7. Compute how many non-missing values there are in total.

代码如下:

data.shape[0] - data.isnull().sum()
#OR data.notnull.sum()

输出结果如下:

RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times.

A single number for the entire dataset.

代码如下:

data.fillna(0).values.flatten().mean()  # a.flatten()就是把data降到一维,默认是按行的方向降

输出结果如下:

10.223864592840483

Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days

A different set of numbers for each location.

代码如下:

# loc_stats = data.loc[:, 'RPT':'MAL'].describe(percentiles=[])
# loc_stats
data.describe(percentiles=[])

输出结果如下:

RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
count 6568.000000 6571.000000 6572.000000 6569.000000 6572.000000 6574.000000 6571.000000 6572.000000 6571.000000 6573.000000 6574.000000 6570.000000
mean 12.362987 10.644314 11.660526 6.306468 10.455834 7.092254 9.797343 8.495053 8.493590 8.707332 13.121007 15.599079
std 5.618413 5.267356 5.008450 3.605811 4.936125 3.968683 4.977555 4.499449 4.166872 4.503954 5.835037 6.699794
min 0.670000 0.210000 1.500000 0.000000 0.130000 0.000000 0.000000 0.000000 0.000000 0.040000 0.130000 0.670000
50% 11.710000 10.170000 10.920000 5.750000 9.960000 6.830000 9.210000 8.080000 8.170000 8.290000 12.500000 15.000000
max 35.800000 33.370000 33.840000 28.460000 37.540000 26.160000 30.370000 31.080000 25.880000 28.210000 42.380000 42.540000

Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.

A different set of numbers for each day.

代码如下:

day_stats = pd.DataFrame()

day_stats['min'] = data.min(axis = 1)
day_stats['max'] = data.max(axis = 1)
day_stats['mean'] = data.mean(axis = 1)
day_stats['std'] = data.std(axis = 1)

day_stats.head()

输出结果如下:

min max mean std
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356

Step 11. Find the average windspeed in January for each location.

Treat January 1961 and January 1962 both as January.

代码如下:

data.loc[data.index.month == 1].mean()

输出结果如下:

RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64

Step 12. Downsample the record to a yearly frequency for each location.

代码如下:

# pd.Period()创建时期数据
# pd.Period()参数:一个时间戳 + freq 参数 → freq 用于指明该 period 的长度,时间戳则说明该 period 在时间轴上的位置
# DatetimeIndex对象的数据转换为PeriodIndex
data.groupby(data.index.to_period('A')).mean() 

输出结果如下:

RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961 12.299583 10.351796 11.362369 6.958227 10.881763 7.729726 9.733923 8.858788 8.647652 9.835577 13.502795 13.680773
1962 12.246923 10.110438 11.732712 6.960440 10.657918 7.393068 11.020712 8.793753 8.316822 9.676247 12.930685 14.323956
1963 12.813452 10.836986 12.541151 7.330055 11.724110 8.434712 11.075699 10.336548 8.903589 10.224438 13.638877 14.999014
1964 12.363661 10.920164 12.104372 6.787787 11.454481 7.570874 10.259153 9.467350 7.789016 10.207951 13.740546 14.910301
1965 12.451370 11.075534 11.848767 6.858466 11.024795 7.478110 10.618712 8.879918 7.907425 9.918082 12.964247 15.591644
1966 13.461973 11.557205 12.020630 7.345726 11.805041 7.793671 10.579808 8.835096 8.514438 9.768959 14.265836 16.307260
1967 12.737151 10.990986 11.739397 7.143425 11.630740 7.368164 10.652027 9.325616 8.645014 9.547425 14.774548 17.135945
1968 11.835628 10.468197 11.409754 6.477678 10.760765 6.067322 8.859180 8.255519 7.224945 7.832978 12.808634 15.017486
1969 11.166356 9.723699 10.902000 5.767973 9.873918 6.189973 8.564493 7.711397 7.924521 7.754384 12.621233 15.762904
1970 12.600329 10.726932 11.730247 6.217178 10.567370 7.609452 9.609890 8.334630 9.297616 8.289808 13.183644 16.456027
1971 11.273123 9.095178 11.088329 5.241507 9.440329 6.097151 8.385890 6.757315 7.915370 7.229753 12.208932 15.025233
1972 12.463962 10.561311 12.058333 5.929699 9.430410 6.358825 9.704508 7.680792 8.357295 7.515273 12.727377 15.028716
1973 11.828466 10.680493 10.680493 5.547863 9.640877 6.548740 8.482110 7.614274 8.245534 7.812411 12.169699 15.441096
1974 13.643096 11.811781 12.336356 6.427041 11.110986 6.809781 10.084603 9.896986 9.331753 8.736356 13.252959 16.947671
1975 12.008575 10.293836 11.564712 5.269096 9.190082 5.668521 8.562603 7.843836 8.797945 7.382822 12.631671 15.307863
1976 11.737842 10.203115 10.761230 5.109426 8.846339 6.311038 9.149126 7.146202 8.883716 7.883087 12.332377 15.471448
1977 13.099616 11.144493 12.627836 6.073945 10.003836 8.586438 11.523205 8.378384 9.098192 8.821616 13.459068 16.590849
1978 12.504356 11.044274 11.380000 6.082356 10.167233 7.650658 9.489342 8.800466 9.089753 8.301699 12.967397 16.771370

Step 13. Downsample the record to a monthly frequency for each location.

代码如下:

data.groupby(data.index.to_period('M')).mean().head()

输出结果如下:

RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01 14.841333 11.988333 13.431613 7.736774 11.072759 8.588065 11.184839 9.245333 9.085806 10.107419 13.880968 14.703226
1961-02 16.269286 14.975357 14.441481 9.230741 13.852143 10.937500 11.890714 11.846071 11.821429 12.714286 18.583214 15.411786
1961-03 10.890000 11.296452 10.752903 7.284000 10.509355 8.866774 9.644194 9.829677 10.294138 11.251935 16.410968 15.720000
1961-04 10.722667 9.427667 9.998000 5.830667 8.435000 6.495000 6.925333 7.094667 7.342333 7.237000 11.147333 10.278333
1961-05 9.860968 8.850000 10.818065 5.905333 9.490323 6.574839 7.604000 8.177097 8.039355 8.499355 11.900323 12.011613

Step 14. Downsample the record to a weekly frequency for each location.

代码如下:

data.groupby(data.index.to_period('W')).mean().head()

输出结果如下:

RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1960-12-26/1961-01-01 15.040000 14.960000 13.170000 9.290000 NaN 9.870000 13.670000 10.250000 10.830000 12.580000 18.500000 15.040000
1961-01-02/1961-01-08 13.541429 11.486667 10.487143 6.417143 9.474286 6.435714 11.061429 6.616667 8.434286 8.497143 12.481429 13.238571
1961-01-09/1961-01-15 12.468571 8.967143 11.958571 4.630000 7.351429 5.072857 7.535714 6.820000 5.712857 7.571429 11.125714 11.024286
1961-01-16/1961-01-22 13.204286 9.862857 12.982857 6.328571 8.966667 7.417143 9.257143 7.875714 7.145714 8.124286 9.821429 11.434286
1961-01-23/1961-01-29 19.880000 16.141429 18.225714 12.720000 17.432857 14.828571 15.528571 15.160000 14.480000 15.640000 20.930000 22.530000

Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.

代码如下:

# data.groupby(data.index.to_period('1961-01-02', 'W')).describe(percentiles=[]).head()
weekly = data.resample('W').agg(['min', 'max', 'mean', 'std'])  # resample()重新设置频率采样,再sh
weekly.loc[weekly.index[1:53], "RPT":"MAL"].head(10)

输出结果如下:

RPT VAL ROS ... CLO BEL MAL
min max mean std min max mean std min max ... mean std min max mean std min max mean std
Yr_Mo_Dy
1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062
1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355
1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239
1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721
1961-02-05 10.58 24.21 16.827143 5.251408 9.46 24.21 15.460000 5.187395 9.04 19.70 ... 9.460000 2.839501 9.17 19.33 14.012857 4.210858 7.17 19.25 11.935714 4.336104
1961-02-12 16.00 24.54 19.684286 3.587677 11.54 21.42 16.417143 3.608373 13.67 21.34 ... 14.440000 1.746749 15.21 26.38 21.832857 4.063753 17.04 21.84 19.155714 1.828705
1961-02-19 6.04 22.50 15.130000 5.064609 11.63 20.17 15.091429 3.575012 6.13 19.41 ... 13.542857 2.531361 14.09 29.63 21.167143 5.910938 10.96 22.58 16.584286 4.685377
1961-02-26 7.79 25.80 15.221429 7.020716 7.08 21.50 13.625714 5.147348 6.08 22.42 ... 12.730000 4.920064 9.59 23.21 16.304286 5.091162 6.67 23.87 14.322857 6.182283
1961-03-05 10.96 13.33 12.101429 0.997721 8.83 17.00 12.951429 2.851955 8.17 13.67 ... 12.370000 1.593685 11.58 23.45 17.842857 4.332331 8.83 17.54 13.951667 3.021387
1961-03-12 4.88 14.79 9.376667 3.732263 8.08 16.96 11.578571 3.230167 7.54 16.38 ... 10.458571 3.655113 10.21 22.71 16.701429 4.358759 5.54 22.54 14.420000 5.769890

10 rows × 48 columns

Conclusion

今天的pandas题更新,继续刷题,加油!

上一篇:ORACLE


下一篇:学习模糊系统