【pandas cookbook学习笔记】Chap3&4

Chapter3: Creating and Persisting DataFrames

[102/627]

id=list(range(1,5))
fruit=list('abcd')
import pandas as pd
# by default, pandas will create a RangeIndex for our DataFrame 
data=pd.DataFrame({'id':id,'fruit':fruit})
data
Out[7]: 
   id fruit
0   1     a
1   2     b
2   3     c
3   4     d
# we can specify another index for the DataFrame if we desire
data=pd.DataFrame({'id':id,'fruit':fruit},index=id)
data
Out[11]: 
   id fruit
1   1     a
2   2     b
3   3     c
4   4     d
# we can create a DataFrame from a list of dictionaries, and use the columns parameter to specify the column order
data2=pd.DataFrame([{"id":1,'fruit':'a'},{'id':2,'fruit':'b'},{'id':3,f'fruit':'c'},{'id':4,'fruit':'d'}],columns=["id",'fruit'],index=id)
data2
Out[18]: 
   id fruit
1   1     a
2   2     b
3   3     c
4   4     d
data.loc[2] # get the record or row with the 2th index
Out[19]: 
id       2
fruit    b
Name: 2, dtype: object

writing CSV

  • the read_csv function has an index_col parameter that you can use to specify the location of the index.
    【pandas cookbook学习笔记】Chap3&4
    【pandas cookbook学习笔记】Chap3&4
    【pandas cookbook学习笔记】Chap3&4

Using Excel files

[116/627]

Working with Zip files

  • 读到时期列的两种转换方式
    【pandas cookbook学习笔记】Chap3&4
    【pandas cookbook学习笔记】Chap3&4
    注:上图的zip文件中,只有一个csv文件。
import zipfile
with zipfile.ZipFile("data/kaggle-survey-2018.zip") as z:
	print('\n'.join(z.namelist()))
	kag=pd.read_csv(z.open("mutipleChoiceResponses.csv"))
	kag_question=kag.iloc[0]
	survey=kag.iloc[1:]

Zip files with only a single file can be read directly with the read_csv function, otherwise you will need to resort to another mechanism to read the data.
Sadly, the zipfile module will not work with URLs unlike the read_csv functiion.
So, if you ZIP file is in a URL, you will need to download it first

Chapter4 beginning Data Analysis

【pandas cookbook学习笔记】Chap3&4

# college is a DataFrame
college.info() # list the data type of each column, the number of non-missing values, and memory usage with .info method
# for a column that has mixed types, reported as object
college.describe(include=[np.number]).T # get summary statistics for the numerical columns and transpose the DataFrame for more readable output
college.describe(include=[np.object]).T # get summary statistic for the object(string) column
# return counts, frequencies for a column with non-numeric data type

Data dictionaries

a data dictionary is a table of metadata and notes on each column of data, in order to explain the meaning of the column names.
abbreviation n.缩写
【pandas cookbook学习笔记】Chap3&4

Reducing memory by changing data types

when you load data from type-less format such as CSV, pandas has to infer the type;
change the data type of one of the object columns from the college dataset to special pandas categorical data type to drastically reduce its memory usage.
[143/627]
low cardinality(number of unique values), 低基数

college=pd.read_csv('college.csv')
different_cols=["RELAFFIL","SATMTMID","CURROPER",'INSTNM','STABBR']
col2=college.loc[:,different_cols]
col2.head()
Out[11]: 
   RELAFFIL  SATMTMID  CURROPER                               INSTNM STABBR
0         0     420.0         1             Alabama A & M University     AL
1         0     565.0         1  University of Alabama at Birmingham     AL
2         1       NaN         1                   Amridge University     AL
3         0     590.0         1  University of Alabama in Huntsville     AL
4         0     430.0         1             Alabama State University     AL
col2.dtypes # inspect the data type of each column
Out[12]: 
RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object
col2.memory_usage()
Out[13]: 
Index         128
RELAFFIL    60280
SATMTMID    60280
CURROPER    60280
INSTNM      60280
STABBR      60280
dtype: int64

# find the memory usage of each column with .memory_usage method
#for pandas to extract the exact amount of memory of an object data type column, the deep parameter must be set to True in the .memory_usage method 
original_mem=col2.memory_usage(deep=True)
original_mem
Out[16]: 
Index          128
RELAFFIL     60280
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

col2["RELAFFIL"].head()
Out[18]: 
0    0
1    0
2    1
3    0
4    0
Name: RELAFFIL, dtype: int64
col2[["RELAFFIL"]].value_counts()
Out[21]: 
RELAFFIL
0           6096
1           1439
dtype: int64
col2["RELAFFIL"].value_counts()
Out[22]: 
0    6096
1    1439
Name: RELAFFIL, dtype: int64
# RELAFFIL column contains only 0 and 1 , so convert 'int64' to 8-bit integer with the .astype method
col2["RELAFFIL"]=col2["RELAFFIL"].astype(np.int8)
# use .dtypes to confirm the data type change 
col2.dtypes
Out[27]: 
RELAFFIL       int8
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object
# find the memory usage of each column again 
col2.memory_usage(deep=True)
Out[29]: 
Index          128
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

# object columns are targets for the largest memory savings 
# pandas has an additional categorical data type that is not available in Numpy
# when converting to category, pandas internally creates a mapping from integers to each unique string value
# Thus, each string only needs to be kept a single time in the memory 
# change object data to categorical if they have a reasonably low cardinality
# first check the number of unique values for the object columns
col2.select_dtypes(include=["object"]).nunique()
Out[32]: 
INSTNM    7535
STABBR      59
dtype: int64
# so convert 'STABBR' to 'category' data
col2['STABBR']=col2['STABBR'].astype("category")
# confirm the data type of 'STABBR' column modified
col2.dtypes
Out[36]: 
RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object
new_mem=col2.memory_usage(deep=True)
new_mem
Out[39]: 
Index          128
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660699
STABBR       13120
dtype: int64
# compare the original memory usage with our updated memory uesage
new_mem/original_mem
Out[41]: 
Index       1.000000
RELAFFIL    0.125000
SATMTMID    1.000000
CURROPER    1.000000
INSTNM      1.000695
STABBR      0.029512
dtype: float64
# notice the 'RELAFFIL' and 'STABBR' columns, their size shrunk
# pandas defaults the index to a RangeInedex.
# The RangeIndex is very similar to the built-in range function.
# it produces values on demand and only stores the minimum amount of information needed to create an index.

college.loc[:,"CURROPER"].value_counts()
Out[44]: 
1    6957
0     578
Name: CURROPER, dtype: int64
college.loc[:,"CURROPER"]=10000000
college.loc[0,"INSTNM"]=college.loc[0,"INSTNM"]+'a'
college[["CURROPER","INSTNM"]].memory_usage(deep=True)
Out[52]: 
Index          128
CURROPER     60280
INSTNM      660700
dtype: int64
# memory usage for 'CURROPER' remained the same since a 64-bit integer is ,more than enough space for the larger number 
# memory usage for 'INSTNM' increased by 105 btyes by just adding a single letter to one value
# ...it uses up to 4 bytes per character .pandas has some ovverhead(100bytes) when making the first modification to a character value
# aftrward, increments of 5 bytes per character are sustained.
# there is no integer representation for missing values for the int64type 
# see the memory difference between the minimal RangeIndex and Int64Index, which stores every row index in memory
college.index=pd.Int64Index(college.index)
college.index.memory_usage()
Out[60]: 60280
 

Selecting the smallest of the largest

[149/627]

movie=pd.read_csv("movie.csv")
movie2=movie[["movie_title","imdb_score","budget"]]
movie.head()
Out[64]: 
   color      director_name  ...  aspect_ratio  movie_facebook_likes
0  Color      James Cameron  ...          1.78                 33000
1  Color     Gore Verbinski  ...          2.35                     0
2  Color         Sam Mendes  ...          2.35                 85000
3  Color  Christopher Nolan  ...          2.35                164000
4    NaN        Doug Walker  ...           NaN                     0
[5 rows x 28 columns]
movie2.head()
Out[65]: 
                                  movie_title  imdb_score       budget
0                                      Avatar         7.9  237000000.0
1    Pirates of the Caribbean: At World's End         7.1  300000000.0
2                                     Spectre         6.8  245000000.0
3                       The Dark Knight Rises         8.5  250000000.0
4  Star Wars: Episode VII - The Force Awakens         7.1          NaN
# use the .nlargest method to select the top 100 movies by imdb_score
movie2.nlargest(100,'imdb_score').head()
Out[67]: 
                   movie_title  imdb_score      budget
2725          Towering Inferno         9.5         NaN
1920  The Shawshank Redemption         9.3  25000000.0
3402             The Godfather         9.2   6000000.0
2779                   Dekalog         9.1         NaN
4312      Kickboxer: Vengeance         9.1  17000000.0
# chain the .nasmallest method to return the five lowest budget films among those with a top 100 score
movie2.nlargest(100,"imdb_score").nsmallest(5,"budget")
Out[69]: 
               movie_title  imdb_score    budget
4804        Butterfly Girl         8.7  180000.0
4801    Children of Heaven         8.5  180000.0
4706          12 Angry Men         8.9  350000.0
4550          A Separation         8.4  500000.0
4636  The Other Dream Team         8.4  500000.0

Selecting the largest of each group by sorting

movie[["movie_title","title_year","imdb_score"]]
Out[70]: 
                                     movie_title  title_year  imdb_score
0                                         Avatar      2009.0         7.9
1       Pirates of the Caribbean: At World's End      2007.0         7.1
2                                        Spectre      2015.0         6.8
3                          The Dark Knight Rises      2012.0         8.5
4     Star Wars: Episode VII - The Force Awakens         NaN         7.1
                                          ...         ...         ...
4911                     Signed Sealed Delivered      2013.0         7.7
4912                               The Following         NaN         7.5
4913                        A Plague So Pleasant      2013.0         6.3
4914                            Shanghai Calling      2012.0         6.3
4915                           My Date with Drew      2004.0         6.6
[4916 rows x 3 columns]
# use .sort_values method to sort the DataFrame by 'title_year'
# the default behavior sorts from the smallest to largest, 
#use ascending=True to invert this behavior
movie[["movie_title","title_year","imdb_score"]].sort_values("title_year",ascending=True)
Out[74]: 
                                           movie_title  title_year  imdb_score
4695  Intolerance: Love's Struggle Throughout the Ages      1916.0         8.0
4833                    Over the Hill to the Poorhouse      1920.0         4.8
4767                                    The Big Parade      1925.0         8.3
2694                                        Metropolis      1927.0         8.3
4697                               The Broadway Melody      1929.0         6.3
                                                ...         ...         ...
4683                                            Heroes         NaN         7.7
4688                                       Home Movies         NaN         8.2
4704                           
        Revolution         NaN         6.7

4752 Happy Valley NaN 8.5
4912 The Following NaN 7.5
[4916 rows x 3 columns]

sort multiple columns at once, use a list

movie[["movie_title","title_year","imdb_score"]].sort_values(["title_year","imdb_score"],ascending=True)
Out[76]: 
                                           movie_title  title_year  imdb_score
4695  Intolerance: Love's Struggle Throughout the Ages      1916.0         8.0
4833                    Over the Hill to the Poorhouse      1920.0         4.8
4767                                    The Big Parade      1925.0         8.3
2694                                        Metropolis      1927.0         8.3
4697                               The Broadway Melody      1929.0         6.3
                                                ...         ...         ...
453                                          Daredevil         NaN         8.8
1825                 It's Always Sunny in Philadelphia         NaN         8.8
3415                                             Fargo         NaN         9.0
2779                                           Dekalog         NaN         9.1
2725                                  Towering Inferno         NaN         9.5
[4916 rows x 3 columns]
movie[["movie_title","title_year","imdb_score"]].sort_values(["title_year","imdb_score"],ascending=False)
Out[77]: 
                      movie_title  title_year  imdb_score
4312         Kickboxer: Vengeance      2016.0         9.1
4277  A Beginner's Guide to Snuff      2016.0         8.7
3798                      Airlift      2016.0         8.5
27     Captain America: Civil War      2016.0         8.2
98            Godzilla Resurgence      2016.0         8.2
                           ...         ...         ...
1391                    Rush Hour         NaN         5.8
4031                     Creature         NaN         5.0
2165              Meet the Browns         NaN         3.5
3246   The Bold and the Beautiful         NaN         3.5
2119                 The Bachelor         NaN         2.9
[4916 rows x 3 columns]
#us .drop_dupilicates method to keep only the first row of every year
movie[["movie_title","title_year","imdb_score"]].sort_values(["title_year","imdb_score"],ascending=False).drop_duplicates(subset="title_year")
# by default, .drop_duplicate keeps the very first appearance of a value, but this behavior may be modified by pass keep='last' to select the last row of each group or keep=False to drop all duplicates entirely
Out[79]: 
                                           movie_title  title_year  imdb_score
4312                              Kickboxer: Vengeance      2016.0         9.1
3745                                   Running Forever      2015.0         8.6
4369                            Queen of the Mountains      2014.0         8.7
3935           Batman: The Dark Knight Returns, Part 2      2013.0         8.4
3                                The Dark Knight Rises      2012.0         8.5
                                                ...         ...         ...
2694                                        Metropolis      1927.0         8.3
4767                                    The Big Parade      1925.0         8.3
4833                    Over the Hill to the Poorhouse      1920.0         4.8
4695  Intolerance: Love's Struggle Throughout the Ages      1916.0         8.0
2725                                  Towering Inferno         NaN         9.5
[92 rows x 3 columns]

Replicating nlargest with sort_values

# the first expression 
movie[["movie_title","imdb_score","budget"]].nlargest(100,"imdb_score").nsmallest(5,"budget")

# the second expression 
movie[["movie_title","imdb_score","budget"]].sort_values("imdb_score",ascending=False).head(100)
.sort_values("budget",ascending=True).head(5)
上一篇:第119天:Python 爬取豆瓣电影 top 250


下一篇:20个原生jsdemo:2 movie-seat-booking