Chapter 3: Creating and Persisting DataFrames
id=list(range(1,5))
fruit=list('abcd')
import pandas as pd
# by default, pandas will create a RangeIndex for our DataFrame
data=pd.DataFrame({'id':id,'fruit':fruit})
data
Out[7]:
id fruit
0 1 a
1 2 b
2 3 c
3 4 d
# we can specify another index for the DataFrame if we desire
data=pd.DataFrame({'id':id,'fruit':fruit},index=id)
data
Out[11]:
id fruit
1 1 a
2 2 b
3 3 c
4 4 d
# we can create a DataFrame from a list of dictionaries, and use the columns parameter to specify the column order
data2=pd.DataFrame([{"id":1,'fruit':'a'},{'id':2,'fruit':'b'},{'id':3,'fruit':'c'},{'id':4,'fruit':'d'}],columns=["id",'fruit'],index=id)
data2
Out[18]:
id fruit
1 1 a
2 2 b
3 3 c
4 4 d
data.loc[2] # get the row whose index label is 2
Out[19]:
id 2
fruit b
Name: 2, dtype: object
Writing CSV
- the read_csv function has an index_col parameter that you can use to specify which column to use as the index when reading the file back.
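A minimal round-trip sketch, assuming a placeholder file name fruit.csv: .to_csv writes the index as the first column, and index_col=0 restores it on the way back.
import pandas as pd
data = pd.DataFrame({'id': list(range(1, 5)), 'fruit': list('abcd')})
data.to_csv('fruit.csv')                           # the RangeIndex is written out as the first column
roundtrip = pd.read_csv('fruit.csv', index_col=0)  # read that first column back in as the index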
Using Excel files
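A rough sketch of reading and writing Excel files (the file and sheet names are placeholders, and an engine such as openpyxl must be installed):
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3, 4], 'fruit': list('abcd')})
data.to_excel('fruit.xlsx', sheet_name='fruit', index=False)   # write without the index column
back = pd.read_excel('fruit.xlsx', sheet_name='fruit')         # read the sheet back into a DataFrame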
Working with Zip files
- two ways to convert date columns when reading the data (see the sketch below)
- note: the ZIP file in the screenshot above contains only a single CSV file.
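A minimal sketch of the two date-conversion approaches, with a placeholder file path and a hypothetical 'Date' column:
import pandas as pd
# way 1: let read_csv parse the column while reading
df = pd.read_csv('data/some_file.csv', parse_dates=['Date'])
# way 2: read the column as strings, then convert it afterwards
df = pd.read_csv('data/some_file.csv')
df['Date'] = pd.to_datetime(df['Date'])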
import zipfile
with zipfile.ZipFile("data/kaggle-survey-2018.zip") as z:
    print('\n'.join(z.namelist()))                             # list the files inside the archive
    kag = pd.read_csv(z.open("multipleChoiceResponses.csv"))
    kag_question = kag.iloc[0]                                 # the first row holds the question text
    survey = kag.iloc[1:]                                      # the remaining rows are the responses
ZIP files that contain only a single file can be read directly with the read_csv function; otherwise you will need another mechanism to read the data.
Sadly, unlike read_csv, the zipfile module does not work with URLs.
So, if your ZIP file lives at a URL, you will need to download it first.
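A sketch of both cases, with placeholder file names and URL: a single-file ZIP can be passed straight to read_csv, while a ZIP at a URL is downloaded first (here with urllib) and then opened with zipfile.
import urllib.request
import zipfile
import pandas as pd
# a ZIP that contains a single CSV can be read directly
df = pd.read_csv('data/single_file.zip')
# a ZIP at a URL has to be downloaded before zipfile can open it
url = 'https://example.com/archive.zip'                        # placeholder URL
local_path, _ = urllib.request.urlretrieve(url, 'archive.zip')
with zipfile.ZipFile(local_path) as z:
    df = pd.read_csv(z.open(z.namelist()[0]))                  # open the first member file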
Chapter 4: Beginning Data Analysis
# college is a DataFrame
college.info() # the .info method lists each column's data type, the number of non-missing values, and memory usage
# a column with mixed types is reported as object
college.describe(include=[np.number]).T # summary statistics for the numeric columns, transposed for more readable output
college.describe(include=[object]).T # summary statistics for the object (string) columns
# for non-numeric columns this returns counts, number of unique values, the most frequent value, and its frequency
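Put together as one runnable sketch (the CSV path is an assumption):
import numpy as np
import pandas as pd
college = pd.read_csv('data/college.csv')     # path assumed
college.info()                                # data types, non-missing counts, memory usage
college.describe(include=[np.number]).T       # summary statistics for the numeric columns
college.describe(include=[object]).T          # count, unique, top, freq for the object columns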
Data dictionaries
A data dictionary is a table of metadata with a note for each column of data, explaining the meaning of the column names.
abbreviation (n.): a shortened form of a word or phrase
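As a small sketch, a data dictionary can itself be kept as a DataFrame; the column names below are from the college dataset, while the descriptions are paraphrased assumptions:
import pandas as pd
data_dictionary = pd.DataFrame({
    'column_name': ['INSTNM', 'STABBR', 'RELAFFIL', 'SATMTMID'],
    'description': ['institution name',
                    'state abbreviation',
                    'religious affiliation (0/1)',
                    'median SAT math score'],
})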
Reducing memory by changing data types
When you load data from a type-less format such as CSV, pandas has to infer the data type of each column.
Changing one of the object columns in the college dataset to the special pandas categorical data type can drastically reduce its memory usage.
low cardinality = a small number of unique values
college=pd.read_csv('college.csv')
different_cols=["RELAFFIL","SATMTMID","CURROPER",'INSTNM','STABBR']
col2=college.loc[:,different_cols]
col2.head()
Out[11]:
RELAFFIL SATMTMID CURROPER INSTNM STABBR
0 0 420.0 1 Alabama A & M University AL
1 0 565.0 1 University of Alabama at Birmingham AL
2 1 NaN 1 Amridge University AL
3 0 590.0 1 University of Alabama in Huntsville AL
4 0 430.0 1 Alabama State University AL
col2.dtypes # inspect the data type of each column
Out[12]:
RELAFFIL int64
SATMTMID float64
CURROPER int64
INSTNM object
STABBR object
dtype: object
col2.memory_usage()
Out[13]:
Index 128
RELAFFIL 60280
SATMTMID 60280
CURROPER 60280
INSTNM 60280
STABBR 60280
dtype: int64
# find the memory usage of each column with .memory_usage method
# for pandas to report the exact amount of memory used by an object data type column, the deep parameter must be set to True in the .memory_usage method
original_mem=col2.memory_usage(deep=True)
original_mem
Out[16]:
Index 128
RELAFFIL 60280
SATMTMID 60280
CURROPER 60280
INSTNM 660240
STABBR 444565
dtype: int64
col2["RELAFFIL"].head()
Out[18]:
0 0
1 0
2 1
3 0
4 0
Name: RELAFFIL, dtype: int64
col2[["RELAFFIL"]].value_counts()
Out[21]:
RELAFFIL
0 6096
1 1439
dtype: int64
col2["RELAFFIL"].value_counts()
Out[22]:
0 6096
1 1439
Name: RELAFFIL, dtype: int64
# the RELAFFIL column contains only 0 and 1, so convert it from int64 to an 8-bit integer with the .astype method
col2["RELAFFIL"]=col2["RELAFFIL"].astype(np.int8)
# use .dtypes to confirm the data type change
col2.dtypes
Out[27]:
RELAFFIL int8
SATMTMID float64
CURROPER int64
INSTNM object
STABBR object
dtype: object
# find the memory usage of each column again
col2.memory_usage(deep=True)
Out[29]:
Index 128
RELAFFIL 7535
SATMTMID 60280
CURROPER 60280
INSTNM 660240
STABBR 444565
dtype: int64
# object columns are targets for the largest memory savings
# pandas has an additional categorical data type that is not available in NumPy
# when converting to category, pandas internally creates a mapping from integers to each unique string value
# thus, each string only needs to be kept in memory a single time
# change object columns to categorical if they have reasonably low cardinality
# first check the number of unique values for the object columns
col2.select_dtypes(include=["object"]).nunique()
Out[32]:
INSTNM 7535
STABBR 59
dtype: int64
# so convert the 'STABBR' column to the 'category' data type
col2['STABBR']=col2['STABBR'].astype("category")
# confirm that the data type of the 'STABBR' column has changed
col2.dtypes
Out[36]:
RELAFFIL int8
SATMTMID float64
CURROPER int64
INSTNM object
STABBR category
dtype: object
new_mem=col2.memory_usage(deep=True)
new_mem
Out[39]:
Index 128
RELAFFIL 7535
SATMTMID 60280
CURROPER 60280
INSTNM 660699
STABBR 13120
dtype: int64
# compare the original memory usage with our updated memory usage
new_mem/original_mem
Out[41]:
Index 1.000000
RELAFFIL 0.125000
SATMTMID 1.000000
CURROPER 1.000000
INSTNM 1.000695
STABBR 0.029512
dtype: float64
# notice that the 'RELAFFIL' and 'STABBR' columns shrank
# pandas defaults the index to a RangeIndex
# the RangeIndex is very similar to the built-in range function:
# it produces values on demand and stores only the minimal information needed to create the index
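A quick sketch of that point: a RangeIndex stores only start, stop, and step, so its reported memory stays tiny, while a materialized integer index stores every label (roughly 8 bytes per row):
import pandas as pd
pd.RangeIndex(stop=7535).memory_usage()                 # ~128 bytes, regardless of length
pd.Index(range(7535), dtype='int64').memory_usage()     # ~60280 bytes, 8 bytes per row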
college.loc[:,"CURROPER"].value_counts()
Out[44]:
1 6957
0 578
Name: CURROPER, dtype: int64
college.loc[:,"CURROPER"]=10000000
college.loc[0,"INSTNM"]=college.loc[0,"INSTNM"]+'a'
college[["CURROPER","INSTNM"]].memory_usage(deep=True)
Out[52]:
Index 128
CURROPER 60280
INSTNM 660700
dtype: int64
# memory usage for 'CURROPER' remained the same, since a 64-bit integer has more than enough space for the larger number
# memory usage for 'INSTNM' increased by 105 bytes after adding just a single letter to one value
# ...strings use up to 4 bytes per character, and pandas has some overhead (100 bytes) when making the first modification to a character value
# afterward, increases of 5 bytes per character are sustained
# there is no integer representation for missing values with the int64 type
# see the memory difference between the minimal RangeIndex and Int64Index, which stores every row index in memory
college.index=pd.Int64Index(college.index)
college.index.memory_usage()
Out[60]: 60280
Selecting the smallest of the largest
movie=pd.read_csv("movie.csv")
movie2=movie[["movie_title","imdb_score","budget"]]
movie.head()
Out[64]:
color director_name ... aspect_ratio movie_facebook_likes
0 Color James Cameron ... 1.78 33000
1 Color Gore Verbinski ... 2.35 0
2 Color Sam Mendes ... 2.35 85000
3 Color Christopher Nolan ... 2.35 164000
4 NaN Doug Walker ... NaN 0
[5 rows x 28 columns]
movie2.head()
Out[65]:
movie_title imdb_score budget
0 Avatar 7.9 237000000.0
1 Pirates of the Caribbean: At World's End 7.1 300000000.0
2 Spectre 6.8 245000000.0
3 The Dark Knight Rises 8.5 250000000.0
4 Star Wars: Episode VII - The Force Awakens 7.1 NaN
# use the .nlargest method to select the top 100 movies by imdb_score
movie2.nlargest(100,'imdb_score').head()
Out[67]:
movie_title imdb_score budget
2725 Towering Inferno 9.5 NaN
1920 The Shawshank Redemption 9.3 25000000.0
3402 The Godfather 9.2 6000000.0
2779 Dekalog 9.1 NaN
4312 Kickboxer: Vengeance 9.1 17000000.0
# chain the .nsmallest method to return the five lowest-budget films among those with a top-100 score
movie2.nlargest(100,"imdb_score").nsmallest(5,"budget")
Out[69]:
movie_title imdb_score budget
4804 Butterfly Girl 8.7 180000.0
4801 Children of Heaven 8.5 180000.0
4706 12 Angry Men 8.9 350000.0
4550 A Separation 8.4 500000.0
4636 The Other Dream Team 8.4 500000.0
Selecting the largest of each group by sorting
movie[["movie_title","title_year","imdb_score"]]
Out[70]:
movie_title title_year imdb_score
0 Avatar 2009.0 7.9
1 Pirates of the Caribbean: At World's End 2007.0 7.1
2 Spectre 2015.0 6.8
3 The Dark Knight Rises 2012.0 8.5
4 Star Wars: Episode VII - The Force Awakens NaN 7.1
... ... ...
4911 Signed Sealed Delivered 2013.0 7.7
4912 The Following NaN 7.5
4913 A Plague So Pleasant 2013.0 6.3
4914 Shanghai Calling 2012.0 6.3
4915 My Date with Drew 2004.0 6.6
[4916 rows x 3 columns]
# use the .sort_values method to sort the DataFrame by 'title_year'
# the default behavior sorts from smallest to largest;
# use ascending=False to invert this behavior
movie[["movie_title","title_year","imdb_score"]].sort_values("title_year",ascending=True)
Out[74]:
movie_title title_year imdb_score
4695 Intolerance: Love's Struggle Throughout the Ages 1916.0 8.0
4833 Over the Hill to the Poorhouse 1920.0 4.8
4767 The Big Parade 1925.0 8.3
2694 Metropolis 1927.0 8.3
4697 The Broadway Melody 1929.0 6.3
... ... ...
4683 Heroes NaN 7.7
4688 Home Movies NaN 8.2
4704 Revolution NaN 6.7
4752 Happy Valley NaN 8.5
4912 The Following NaN 7.5
[4916 rows x 3 columns]
To sort by multiple columns at once, pass a list of column names
movie[["movie_title","title_year","imdb_score"]].sort_values(["title_year","imdb_score"],ascending=True)
Out[76]:
movie_title title_year imdb_score
4695 Intolerance: Love's Struggle Throughout the Ages 1916.0 8.0
4833 Over the Hill to the Poorhouse 1920.0 4.8
4767 The Big Parade 1925.0 8.3
2694 Metropolis 1927.0 8.3
4697 The Broadway Melody 1929.0 6.3
... ... ...
453 Daredevil NaN 8.8
1825 It's Always Sunny in Philadelphia NaN 8.8
3415 Fargo NaN 9.0
2779 Dekalog NaN 9.1
2725 Towering Inferno NaN 9.5
[4916 rows x 3 columns]
movie[["movie_title","title_year","imdb_score"]].sort_values(["title_year","imdb_score"],ascending=False)
Out[77]:
movie_title title_year imdb_score
4312 Kickboxer: Vengeance 2016.0 9.1
4277 A Beginner's Guide to Snuff 2016.0 8.7
3798 Airlift 2016.0 8.5
27 Captain America: Civil War 2016.0 8.2
98 Godzilla Resurgence 2016.0 8.2
... ... ...
1391 Rush Hour NaN 5.8
4031 Creature NaN 5.0
2165 Meet the Browns NaN 3.5
3246 The Bold and the Beautiful NaN 3.5
2119 The Bachelor NaN 2.9
[4916 rows x 3 columns]
# use the .drop_duplicates method to keep only the first row of every year
movie[["movie_title","title_year","imdb_score"]].sort_values(["title_year","imdb_score"],ascending=False).drop_duplicates(subset="title_year")
# by default, .drop_duplicates keeps the very first appearance of a value, but this behavior may be modified by passing keep='last' to select the last row of each group, or keep=False to drop all duplicates entirely (see the sketch after the output below)
Out[79]:
movie_title title_year imdb_score
4312 Kickboxer: Vengeance 2016.0 9.1
3745 Running Forever 2015.0 8.6
4369 Queen of the Mountains 2014.0 8.7
3935 Batman: The Dark Knight Returns, Part 2 2013.0 8.4
3 The Dark Knight Rises 2012.0 8.5
... ... ...
2694 Metropolis 1927.0 8.3
4767 The Big Parade 1925.0 8.3
4833 Over the Hill to the Poorhouse 1920.0 4.8
4695 Intolerance: Love's Struggle Throughout the Ages 1916.0 8.0
2725 Towering Inferno NaN 9.5
[92 rows x 3 columns]
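A sketch of the alternative keep settings mentioned above, using the same columns and sort order:
# keep the last row of each year instead of the first
(movie[["movie_title", "title_year", "imdb_score"]]
    .sort_values(["title_year", "imdb_score"], ascending=False)
    .drop_duplicates(subset="title_year", keep="last"))
# keep=False drops every year that appears more than once
(movie[["movie_title", "title_year", "imdb_score"]]
    .sort_values(["title_year", "imdb_score"], ascending=False)
    .drop_duplicates(subset="title_year", keep=False))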
Replicating nlargest with sort_values
# the first expression
movie[["movie_title","imdb_score","budget"]].nlargest(100,"imdb_score").nsmallest(5,"budget")
# the second expression
(movie[["movie_title","imdb_score","budget"]]
    .sort_values("imdb_score",ascending=False).head(100)
    .sort_values("budget",ascending=True).head(5))
# the two results can differ when there are ties at the 100th position, since .nlargest and .sort_values handle ties differently