Sorting Non-numeric Data in R & Python

1. Introduction

In some situations, we will want to sort non-numeric data. This process may help us understand how different level is changing, or help us compare data from one level to another. In other words, we may want to arrange data not only from aspect of numeric data(like sum of revenue), but also from some text data(like tiers, cabins, classes, an so on).

However, sorting non-numeric data by deafult will be the order of alphabet from a to z. This is not very useful because the order of our data is normally different. Algorithmically speaking, this is actually not a problem because we can map our tier to numeric data and teat them as numeric data as well. But in business world this may not be the perpect soution(at least not a daily solution) because we want to focus on business question itself, rather than extra coding.

 

In this article, we will use classic titanic.csv data set as example.

In R:

library(tidyverse)

titanic <- read_csv("titanic.csv")
titanic

# A tibble: 891 x 15
   survived pclass sex      age sibsp parch  fare embarked class  who   adult_male deck  embark_town alive alone
      <dbl>  <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <chr>    <chr>  <chr> <lgl>      <chr> <chr>       <chr> <lgl>
 1        0      3 male      22     1     0  7.25 S        Third  man   TRUE       NA    Southampton no    FALSE
 2        1      1 female    38     1     0 71.3  C        First  woman FALSE      C     Cherbourg   yes   FALSE
 3        1      3 female    26     0     0  7.92 S        Third  woman FALSE      NA    Southampton yes   TRUE 
 4        1      1 female    35     1     0 53.1  S        First  woman FALSE      C     Southampton yes   FALSE
 5        0      3 male      35     0     0  8.05 S        Third  man   TRUE       NA    Southampton no    TRUE 
 6        0      3 male      NA     0     0  8.46 Q        Third  man   TRUE       NA    Queenstown  no    TRUE 
 7        0      1 male      54     0     0 51.9  S        First  man   TRUE       E     Southampton no    TRUE 
 8        0      3 male       2     3     1 21.1  S        Third  child FALSE      NA    Southampton no    FALSE
 9        1      3 female    27     0     2 11.1  S        Third  woman FALSE      NA    Southampton yes   FALSE
10        1      2 female    14     1     0 30.1  C        Second child FALSE      NA    Cherbourg   yes   FALSE
# ... with 881 more rows

In Python:

import pandas as pd

titanic = pd.read_csv("titanic.csv")
titanic.head()

Sorting Non-numeric Data in R & Python  

 

2. In R

In R, there is a special jargon to describe text data in levels: factor. Not surpringly, function factor() is the key. I don't know why as.factor() and as_factor() functions are exsit but not working as people may expect. This is just another example of showing how chaos base::R was.

If someone is thinking manual process, think twice because this could be extramingly tedious when the level of tiers is large. For example, there are 10 to 15 cabins from a flight and we need to sort them from high to low. What's more, if the tires are the second level of grouping, this will be even more impossible to do it manually. For example, there are 10 different routes of flights and each route has 10 to 15 different cabins and we need to sort the cabins from high to low.

This idea is showing below with titanic.csv. What to do if we want to sort class from "Third" to "First"?

titanic %>%
  group_by(embark_town, class) %>%
  summarise(survived_rate=mean(survived!=0))

# A tibble: 10 x 3
# Groups:   embark_town [4]
   embark_town class  survived_rate
   <chr>       <chr>          <dbl>
 1 Cherbourg   First          0.694
 2 Cherbourg   Second         0.529
 3 Cherbourg   Third          0.379
 4 Queenstown  First          0.5  
 5 Queenstown  Second         0.667
 6 Queenstown  Third          0.375
 7 Southampton First          0.583
 8 Southampton Second         0.463
 9 Southampton Third          0.190
10 NA          First          1  

This is how factor() function comes to help. It has an important argument called levels=. One more note: in argument levels=, left hand is low and right hand it high.

factor() function changes data into type factor, and it is backed with numeric data underhood so it can be sorted as we need.

titanic %>%
  group_by(embark_town, class) %>%
  summarise(survived_rate=mean(survived!=0)) %>%
  mutate(class=factor(class, levels=c("Third", "Second", "First"))) %>%
  arrange(embark_town, class)

# A tibble: 10 x 3
# Groups:   embark_town [4]
   embark_town class  survived_rate
   <chr>       <fct>          <dbl>
 1 Cherbourg   Third          0.379
 2 Cherbourg   Second         0.529
 3 Cherbourg   First          0.694
 4 Queenstown  Third          0.375
 5 Queenstown  Second         0.667
 6 Queenstown  First          0.5  
 7 Southampton Third          0.190
 8 Southampton Second         0.463
 9 Southampton First          0.583
10 NA          First          1

Furthermore, if we want a pivot table with column names is arranged by our desire, we can use the same idea:

If you don't familiar with pivot_wider(), this is a tidyverse function for reshaping dataframe. We often use it to build "excel style pivot table" in R.

titanic %>%
  group_by(embark_town, class) %>%
  summarise(survived_rate=mean(survived!=0)) %>%
  mutate(class=factor(class, levels=c("Third", "Second", "First"))) %>%
  arrange(embark_town, class) %>%  
  pivot_wider(id_cols=embark_town, names_from=class, values_from=survived_rate)

# A tibble: 4 x 4
# Groups:   embark_town [4]
  embark_town  Third Second First
  <chr>        <dbl>  <dbl> <dbl>
1 Cherbourg    0.379  0.529 0.694
2 Queenstown   0.375  0.667 0.5  
3 Southampton  0.190  0.463 0.583
4 NA          NA     NA     1   

  

3. In Python  

In Python pandas, text data in levels are called: category. We can use "pd.api.types.CategoricalDtype()" to set order of text data, and then we use as.type() function to change them into category type so we can sort.

order =  pd.api.types.CategoricalDtype(
    categories=["Third", "Second", "First"], ordered=True)
titanic["class"] = titanic["class"].astype(order)

(titanic.groupby(["embark_town", "class"])
    .agg({"survived": "mean"})
    .reset_index()
    .sort_values(by=["embark_town", "class"]))

Sorting Non-numeric Data in R & Python

This process is not perfect because it has an extra step outside the pipline, let alone the function name pd.api.types.CategoricalDtype() is long and hard to remember.  

Interchangeably, we can use pd.Categorical() as type function to do the same job. The process is more accetable because it is the same with how we  treat time series data with pd.to_datetime().

titanic["class"] = pd.Categorical(titanic["class"],
    categories=["Third", "Second", "First"], ordered=True)

(titanic.groupby(["embark_town", "class"])
    .agg({"survived": "mean"})
    .reset_index()
    .sort_values(by=["embark_town", "class"]))

Sorting Non-numeric Data in R & Python 

 

上一篇:获取最新citysql 城市sql 国家统计局最新市区分布 thinkphp php 抓取


下一篇:poj2431Expedition