R Beginner Study Notes 10: Data Reshaping with the Tidyverse

2024-04-09 22:01:29

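Links to earlier notes

Study Notes 1: R Basics.
Study Notes 2: Advanced Data Structures.
Study Notes 3: Reading Data into R.
Study Notes 4: Statistical Graphics.
Study Notes 5: Writing R Functions and Simple Control and Loop Statements.
Study Notes 6: Group Operations.
Study Notes 7: Efficient Group Operations with dplyr.
Study Notes 8: Iterating over Data.
Study Notes 9: Data Wrangling.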

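Study Notes 10: Data Reshaping

Earlier notes already covered several ways to reshape data, but the dplyr and tidyr packages are designed to work with pipes, perform better, and are easier to use.

10.1 Combining rows and columns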

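dplyr provides bind_rows and bind_cols for combining rows and columns. Compared with the rbind and cbind functions covered earlier, they are more restrictive: they are designed for data frames (and tibbles) rather than matrices, but they handle data frames more gracefully.

Basic usage is the same as for rbind and cbind. One difference is that bind_rows matches columns by name and fills any column missing from an input with NA.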
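To make the behavior concrete, here is a minimal sketch on made-up data. The data frames first, second and extra and their columns are invented for illustration and do not come from the original notes.

```r
library(dplyr)

# Two small, made-up data frames whose columns only partly overlap
first  <- data.frame(id = 1:2, score = c(90, 85))
second <- data.frame(id = 3:4, grade = c("A", "B"), stringsAsFactors = FALSE)

# bind_rows() matches columns by name and fills the gaps with NA
stacked <- bind_rows(first, second)
stacked

# bind_cols() places data frames side by side; the row counts must match
extra   <- data.frame(passed = c(TRUE, FALSE))
widened <- bind_cols(first, extra)
widened
```

With base rbind, stacking first and second would fail because their column names differ, which is one practical reason to prefer bind_rows for data frames.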

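10.2 Joins with dplyr

dplyr's family of join functions: left_join, right_join, inner_join, full_join, semi_join and anti_join.

Example: join the color column of the diamonds dataset to a table of detailed color descriptions.

First, read the data with read_csv: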

> library(readr)
> colorsURL <- 'http://www.jaredlander.com/data/DiamondColors.csv'
> diamondColors <- read_csv(colorsURL)

-- Column specification ---------------------------------
cols(
  Color = col_character(),
  Description = col_character(),
  Details = col_character()
)

> diamondColors
# A tibble: 10 x 3
   Color Description          Details                    
   <chr> <chr>                <chr>                      
 1 D     Absolutely Colorless No color                   
 2 E     Colorless            Minute traces of color     
 3 F     Colorless            Minute traces of color     
 4 G     Near Colorless       Color is dificult to detect
 5 H     Near Colorless       Color is dificult to detect
 6 I     Near Colorless       Slightly detectable color  
 7 J     Near Colorless       Slightly detectable color  
 8 K     Faint Color          Noticeable color           
 9 L     Faint Color          Noticeable color           
10 M     Faint Color          Noticeable color

Then look at the color column of the diamonds dataset:

> data(diamonds, package = 'ggplot2')
> unique(diamonds$color)
[1] E I J H F G D
Levels: D < E < F < G < H < I < J

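Next, perform a left join. Because the key columns have different names in the two data frames, the mapping has to be given with the by argument. diamonds$color is an ordered factor while Color is a character column, which is why color appears as <chr> in the result.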

> library(dplyr)
> left_join(diamonds, diamondColors, by=c('color' = 'Color'))
# A tibble: 53,940 x 12
   carat cut   color clarity depth table price     x
   <dbl> <ord> <chr> <ord>   <dbl> <dbl> <int> <dbl>
 1 0.23  Ideal E     SI2      61.5    55   326  3.95
 2 0.21  Prem~ E     SI1      59.8    61   326  3.89
 3 0.23  Good  E     VS1      56.9    65   327  4.05
 4 0.290 Prem~ I     VS2      62.4    58   334  4.2 
 5 0.31  Good  J     SI2      63.3    58   335  4.34
 6 0.24  Very~ J     VVS2     62.8    57   336  3.94
 7 0.24  Very~ I     VVS1     62.3    57   336  3.95
 8 0.26  Very~ H     SI1      61.9    55   337  4.07
 9 0.22  Fair  E     VS2      65.1    61   337  3.87
10 0.23  Very~ H     VS1      59.4    61   338  4   
# ... with 53,930 more rows, and 4 more variables:
#   y <dbl>, z <dbl>, Description <chr>, Details <chr>

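This is a left join: every row of the left table (diamonds) is kept, and only the matching rows of the right table (diamondColors) are brought in.

A right join (right_join) keeps every row of the right table and only the matching rows of the left table.

inner_join returns only the rows that match in both tables (a row is returned only if its key exists in both).

full_join returns all rows from both tables, matched or not.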
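The four joins above differ only in which unmatched rows they keep. The following is a minimal sketch on made-up data; the orders and labels data frames, their columns, and the keys vector are invented for illustration.

```r
library(dplyr)

# Made-up data: the key is called 'code' on the left and 'Code' on the right
orders <- data.frame(item = c("a", "b", "c"), code = c("X", "Y", "Z"),
                     stringsAsFactors = FALSE)
labels <- data.frame(Code = c("X", "Y", "W"),
                     label = c("ex", "why", "double-u"),
                     stringsAsFactors = FALSE)

# Named vector: left-hand column name = right-hand column name
keys <- c("code" = "Code")

left  <- left_join(orders, labels, by = keys)   # keeps X, Y, Z; Z has NA label
right <- right_join(orders, labels, by = keys)  # keeps X, Y, W; W has NA item
inner <- inner_join(orders, labels, by = keys)  # keeps X and Y only
full  <- full_join(orders, labels, by = keys)   # keeps X, Y, Z and W

# Row counts of each result
sapply(list(left = left, right = right, inner = inner, full = full), nrow)
```

In every result the key column takes the left-hand name, code.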

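semi_join does not actually combine the two tables. It returns the rows of the left table that have a match in the right table, each at most once, and keeps only the left table's columns.

Example: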

> semi_join(diamondColors, diamonds, by=c('Color' = 'color'))
# A tibble: 7 x 3
  Color Description          Details                    
  <chr> <chr>                <chr>                      
1 D     Absolutely Colorless No color                   
2 E     Colorless            Minute traces of color     
3 F     Colorless            Minute traces of color     
4 G     Near Colorless       Color is dificult to detect
5 H     Near Colorless       Color is dificult to detect
6 I     Near Colorless       Slightly detectable color  
7 J     Near Colorless       Slightly detectable color

anti_join is the opposite of semi_join: it returns only the rows of the left table that have no match in the right table.

Example:

> anti_join(diamondColors, diamonds, by=c('Color' = 'color'))
# A tibble: 3 x 3
  Color Description Details         
  <chr> <chr>       <chr>           
1 K     Faint Color Noticeable color
2 L     Faint Color Noticeable color
3 M     Faint Color Noticeable color
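Because the keys in diamondColors are unique, the examples above do not show what happens with repeated keys. The sketch below uses made-up data (left_tbl and right_tbl are invented for illustration) in which the key X occurs twice on each side.

```r
library(dplyr)

# Made-up data: key "X" appears twice on the left and twice on the right
left_tbl  <- data.frame(key = c("X", "X", "Y", "Z"), n = 1:4,
                        stringsAsFactors = FALSE)
right_tbl <- data.frame(key = c("X", "X", "Y"), note = c("p", "q", "r"),
                        stringsAsFactors = FALSE)

# semi_join keeps every left row whose key exists on the right,
# without duplicating rows and without adding right-hand columns
matched <- semi_join(left_tbl, right_tbl, by = "key")
matched

# anti_join keeps the left rows whose key is absent on the right
unmatched <- anti_join(left_tbl, right_tbl, by = "key")
unmatched
```

Both results contain only the columns of left_tbl, so these two functions act as filters on the left table rather than as joins in the usual sense.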

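10.3 Converting between rows and columns

This section uses the tidyr package, whose functions work with the pipe.

Example: data from an experiment on emotional reaction and regulation.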

> library(readr)
> emotion <- read_tsv('http://www.jaredlander.com/data/reaction.txt')

-- Column specification ---------------------------------
cols(
  ID = col_double(),
  Test = col_double(),
  Age = col_double(),
  Gender = col_character(),
  BMI = col_double(),
  React = col_double(),
  Regulate = col_double()
)

> emotion
# A tibble: 99 x 7
      ID  Test   Age Gender   BMI React Regulate
   <dbl> <dbl> <dbl> <chr>  <dbl> <dbl>    <dbl>
 1     1     1  9.69 F       14.7  4.17     3.15
 2     1     2 12.3  F       14.6  3.89     2.55
 3     2     1 15.7  F       19.5  4.39     4.41
 4     2     2 17.6  F       20.0  2.03     2.2 
 5     3     1  9.52 F       20.9  3.38     2.65
 6     3     2 11.8  F       24.0  4        3.63
 7     4     1 16.3  M       25.1  3.15     3.59
 8     4     2 18.8  M       28.0  3.02     3.54
 9     5     1 15.8  M       28.4  3.08     2.64
10     5     2 18.2  M       19.6  3.17     2.29
# ... with 89 more rows

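Use the gather function to convert the data to long form.

The Age, BMI, React and Regulate columns are stacked into a single column named Measurement, and a second new column, Type, records which original column each value came from.

The first argument to gather is a tibble or data frame. The key argument names the new column that holds the original column names (the keys). The value argument names the new column that holds the actual values from the gathered columns. The remaining arguments are the names of the columns to gather.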

> library(tidyr)
> emotion %>%
+     gather(key=Type, value=Measurement, Age, BMI, React, Regulate)
# A tibble: 396 x 5
      ID  Test Gender Type  Measurement
   <dbl> <dbl> <chr>  <chr>       <dbl>
 1     1     1 F      Age          9.69
 2     1     2 F      Age         12.3 
 3     2     1 F      Age         15.7 
 4     2     2 F      Age         17.6 
 5     3     1 F      Age          9.52
 6     3     2 F      Age         11.8 
 7     4     1 M      Age         16.3 
 8     4     2 M      Age         18.8 
 9     5     1 M      Age         15.8 
10     5     2 M      Age         18.2 
# ... with 386 more rows

Sort by ID:

> emotionLong <- emotion %>%
+     gather(key=Type, value=Measurement, Age, BMI, React, Regulate) %>%
+     arrange(ID)
> emotionLong
# A tibble: 396 x 5
      ID  Test Gender Type     Measurement
   <dbl> <dbl> <chr>  <chr>          <dbl>
 1     1     1 F      Age             9.69
 2     1     2 F      Age            12.3 
 3     1     1 F      BMI            14.7 
 4     1     2 F      BMI            14.6 
 5     1     1 F      React           4.17
 6     1     2 F      React           3.89
 7     1     1 F      Regulate        3.15
 8     1     2 F      Regulate        2.55
 9     2     1 F      Age            15.7 
10     2     2 F      Age            17.6 
# ... with 386 more rows
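In tidyr 1.0.0 and later, gather is superseded by pivot_longer. The sketch below shows the equivalent call on made-up data; the wide data frame is a tiny invented stand-in for emotion, not the real dataset.

```r
library(tidyr)

# Made-up wide data shaped like a tiny version of 'emotion'
wide <- data.frame(ID = c(1, 1, 2), Test = c(1, 2, 1),
                   React = c(4.2, 3.9, 4.4), Regulate = c(3.1, 2.5, 4.4))

# names_to plays the role of gather's key, values_to the role of value
long <- pivot_longer(wide, cols = c(React, Regulate),
                     names_to = "Type", values_to = "Measurement")
long
```

Unlike gather, which stacks the columns one after another, pivot_longer keeps the values from each original row together in the output.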

The inverse of gather is spread, which turns rows back into columns.

> emotionLong %>% 
+     spread(key=Type, value=Measurement)
# A tibble: 99 x 7
      ID  Test Gender   Age   BMI React Regulate
   <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>    <dbl>
 1     1     1 F       9.69  14.7  4.17     3.15
 2     1     2 F      12.3   14.6  3.89     2.55
 3     2     1 F      15.7   19.5  4.39     4.41
 4     2     2 F      17.6   20.0  2.03     2.2 
 5     3     1 F       9.52  20.9  3.38     2.65
 6     3     2 F      11.8   24.0  4        3.63
 7     4     1 M      16.3   25.1  3.15     3.59
 8     4     2 M      18.8   28.0  3.02     3.54
 9     5     1 M      15.8   28.4  3.08     2.64
10     5     2 M      18.2   19.6  3.17     2.29
# ... with 89 more rows
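The newer counterpart of spread is pivot_wider, available in tidyr 1.0.0 and later. A minimal sketch on made-up data follows; the long data frame is invented for illustration.

```r
library(tidyr)

# Made-up long data: one row per (ID, Test, Type) combination
long <- data.frame(ID = c(1, 1, 2, 2), Test = c(1, 1, 1, 1),
                   Type = c("React", "Regulate", "React", "Regulate"),
                   Measurement = c(4.2, 3.1, 4.4, 4.4),
                   stringsAsFactors = FALSE)

# names_from plays the role of spread's key, values_from the role of value
wide <- pivot_wider(long, names_from = Type, values_from = Measurement)
wide
```

Each distinct value of Type becomes its own column, and the remaining columns (ID and Test) identify the rows.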
