1. Introduction
In some situations, we will want to sort non-numeric data. This process may help us understand how different level is changing, or help us compare data from one level to another. In other words, we may want to arrange data not only from aspect of numeric data(like sum of revenue), but also from some text data(like tiers, cabins, classes, an so on).
However, sorting non-numeric data by deafult will be the order of alphabet from a to z. This is not very useful because the order of our data is normally different. Algorithmically speaking, this is actually not a problem because we can map our tier to numeric data and teat them as numeric data as well. But in business world this may not be the perpect soution(at least not a daily solution) because we want to focus on business question itself, rather than extra coding.
In this article, we will use classic titanic.csv data set as example.
In R:
library(tidyverse) titanic <- read_csv("titanic.csv") titanic # A tibble: 891 x 15 survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> 1 0 3 male 22 1 0 7.25 S Third man TRUE NA Southampton no FALSE 2 1 1 female 38 1 0 71.3 C First woman FALSE C Cherbourg yes FALSE 3 1 3 female 26 0 0 7.92 S Third woman FALSE NA Southampton yes TRUE 4 1 1 female 35 1 0 53.1 S First woman FALSE C Southampton yes FALSE 5 0 3 male 35 0 0 8.05 S Third man TRUE NA Southampton no TRUE 6 0 3 male NA 0 0 8.46 Q Third man TRUE NA Queenstown no TRUE 7 0 1 male 54 0 0 51.9 S First man TRUE E Southampton no TRUE 8 0 3 male 2 3 1 21.1 S Third child FALSE NA Southampton no FALSE 9 1 3 female 27 0 2 11.1 S Third woman FALSE NA Southampton yes FALSE 10 1 2 female 14 1 0 30.1 C Second child FALSE NA Cherbourg yes FALSE # ... with 881 more rows
In Python:
import pandas as pd titanic = pd.read_csv("titanic.csv") titanic.head()
2. In R
In R, there is a special jargon to describe text data in levels: factor. Not surpringly, function factor() is the key. I don't know why as.factor() and as_factor() functions are exsit but not working as people may expect. This is just another example of showing how chaos base::R was.
If someone is thinking manual process, think twice because this could be extramingly tedious when the level of tiers is large. For example, there are 10 to 15 cabins from a flight and we need to sort them from high to low. What's more, if the tires are the second level of grouping, this will be even more impossible to do it manually. For example, there are 10 different routes of flights and each route has 10 to 15 different cabins and we need to sort the cabins from high to low.
This idea is showing below with titanic.csv. What to do if we want to sort class from "Third" to "First"?
titanic %>% group_by(embark_town, class) %>% summarise(survived_rate=mean(survived!=0)) # A tibble: 10 x 3 # Groups: embark_town [4] embark_town class survived_rate <chr> <chr> <dbl> 1 Cherbourg First 0.694 2 Cherbourg Second 0.529 3 Cherbourg Third 0.379 4 Queenstown First 0.5 5 Queenstown Second 0.667 6 Queenstown Third 0.375 7 Southampton First 0.583 8 Southampton Second 0.463 9 Southampton Third 0.190 10 NA First 1
This is how factor() function comes to help. It has an important argument called levels=. One more note: in argument levels=, left hand is low and right hand it high.
factor() function changes data into type factor, and it is backed with numeric data underhood so it can be sorted as we need.
titanic %>% group_by(embark_town, class) %>% summarise(survived_rate=mean(survived!=0)) %>% mutate(class=factor(class, levels=c("Third", "Second", "First"))) %>% arrange(embark_town, class) # A tibble: 10 x 3 # Groups: embark_town [4] embark_town class survived_rate <chr> <fct> <dbl> 1 Cherbourg Third 0.379 2 Cherbourg Second 0.529 3 Cherbourg First 0.694 4 Queenstown Third 0.375 5 Queenstown Second 0.667 6 Queenstown First 0.5 7 Southampton Third 0.190 8 Southampton Second 0.463 9 Southampton First 0.583 10 NA First 1
Furthermore, if we want a pivot table with column names is arranged by our desire, we can use the same idea:
If you don't familiar with pivot_wider(), this is a tidyverse function for reshaping dataframe. We often use it to build "excel style pivot table" in R.
titanic %>% group_by(embark_town, class) %>% summarise(survived_rate=mean(survived!=0)) %>% mutate(class=factor(class, levels=c("Third", "Second", "First"))) %>% arrange(embark_town, class) %>% pivot_wider(id_cols=embark_town, names_from=class, values_from=survived_rate) # A tibble: 4 x 4 # Groups: embark_town [4] embark_town Third Second First <chr> <dbl> <dbl> <dbl> 1 Cherbourg 0.379 0.529 0.694 2 Queenstown 0.375 0.667 0.5 3 Southampton 0.190 0.463 0.583 4 NA NA NA 1
3. In Python
In Python pandas, text data in levels are called: category. We can use "pd.api.types.CategoricalDtype()" to set order of text data, and then we use as.type() function to change them into category type so we can sort.
order = pd.api.types.CategoricalDtype( categories=["Third", "Second", "First"], ordered=True) titanic["class"] = titanic["class"].astype(order) (titanic.groupby(["embark_town", "class"]) .agg({"survived": "mean"}) .reset_index() .sort_values(by=["embark_town", "class"]))
This process is not perfect because it has an extra step outside the pipline, let alone the function name pd.api.types.CategoricalDtype() is long and hard to remember.
Interchangeably, we can use pd.Categorical() as type function to do the same job. The process is more accetable because it is the same with how we treat time series data with pd.to_datetime().
titanic["class"] = pd.Categorical(titanic["class"], categories=["Third", "Second", "First"], ordered=True) (titanic.groupby(["embark_town", "class"]) .agg({"survived": "mean"}) .reset_index() .sort_values(by=["embark_town", "class"]))