R Beginner's Study Notes 11: String Manipulation

2024-03-01 18:28:10


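Note links

Study Notes 1: R Basics.
Study Notes 2: Advanced Data Structures.
Study Notes 3: Reading Data into R.
Study Notes 4: Statistical Graphics.
Study Notes 5: Writing R Functions and Simple Control and Loop Statements.
Study Notes 6: Group Manipulation.
Study Notes 7: Faster Group Manipulation with dplyr.
Study Notes 8: Iterating over Data.
Study Notes 9: Data Wrangling.
Study Notes 10: Data Reshaping with the Tidyverse.

Study Notes 11: String Manipulation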

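Purpose: working with text, for example creating, extracting, or manipulating it.

Functionality: building strings and taking them apart.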

11.1 paste

The paste function joins strings together.

Example:

> paste("Hello", "Jared", "and others")
[1] "Hello Jared and others"

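Note: the strings are separated by spaces because paste's sep argument, which defaults to a single space, decides what is placed between the items.

Example: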

> paste("Hello", "Jared", "and others", sep="/")
[1] "Hello/Jared/and others"

paste is also vectorized: given several vectors, it joins their corresponding elements.

Example:

> paste(c("Hello", "Hey", "Howdy"), c("Jared", "Bob", "David"))
[1] "Hello Jared" "Hey Bob"     "Howdy David"

Vectors of equal length are paired element by element; when the lengths differ, the shorter vectors are recycled.

Example:

> paste("Hello", c("Jared", "Bob", "David"), c("Goodbye", "Seeya"))
[1] "Hello Jared Goodbye" "Hello Bob Seeya"    
[3] "Hello David Goodbye"
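
The recycling rule can be checked on a smaller case. This is a minimal sketch; the vectors `greetings` and `people` are made up for illustration, and `paste0` is base R's shorthand for `paste(..., sep = "")`.

```r
# Hypothetical vectors, chosen only to illustrate recycling.
greetings <- c("Hello", "Hey")
people <- c("Jared", "Bob", "David", "Ann")

# The shorter vector (length 2) is recycled to match the longer one (length 4).
paired <- paste(greetings, people)
print(paired)

# paste0() joins without any separator.
ids <- paste0("id_", 1:3)
print(ids)
```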

With the collapse argument, paste joins the elements of a character vector into a single string, separated by any delimiter you choose.

Example:

> vectorOfText <- c("Hello", "Everyone", "out there", ".")
> vectorOfText
[1] "Hello"     "Everyone"  "out there" "."        
> paste(vectorOfText, collapse = " ")
[1] "Hello Everyone out there ."
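
sep and collapse can be combined in one call. The sketch below uses made-up vectors `first` and `second` to show the order in which the two arguments act.

```r
# Hypothetical input vectors.
first <- c("a", "b", "c")
second <- c("1", "2", "3")

# sep joins the paired elements ("a-1", "b-2", "c-3");
# collapse then joins those results into one string.
combined <- paste(first, second, sep = "-", collapse = "; ")
print(combined)
print(length(combined))  # a single string, so length 1
```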

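11.2 Writing formatted data into strings (sprintf)

paste is handy for joining short pieces of text.

For inserting variables into a longer sentence, sprintf is more convenient.
Example: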

> person <- "Jared"
> partySize <- "eight"
> waitTime <- 25
> sprintf("Hello %s, your party of %s will be seated in %s minutes", person, partySize, waitTime)
[1] "Hello Jared, your party of eight will be seated in 25 minutes"

sprintf is also vectorized; shorter arguments are recycled to the length of the longest one.

Example:

> sprintf("Hello %s, your party of %s will be seated in %s minutes", c("Jared", "Bob"), c("eight", 16, "four", 10), waitTime)
[1] "Hello Jared, your party of eight will be seated in 25 minutes"
[2] "Hello Bob, your party of 16 will be seated in 25 minutes"     
[3] "Hello Jared, your party of four will be seated in 25 minutes" 
[4] "Hello Bob, your party of 10 will be seated in 25 minutes"
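
The examples above only use %s. sprintf also accepts C-style numeric specifiers, sketched below with made-up values (`guest`, `seats`, and `bill` are not from the original notes).

```r
# Hypothetical values for illustration.
guest <- "Jared"
seats <- 8L
bill <- 123.456

# %s inserts a string, %d an integer, %.2f a number rounded to two decimals.
msg <- sprintf("%s booked %d seats; total: %.2f", guest, seats, bill)
print(msg)

# %05d pads an integer with leading zeros to a width of five.
padded <- sprintf("%05d", 42L)
print(padded)
```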

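11.3 Extracting text

Many functions can extract text; here we use the stringr package.

str_split function: splits strings

Example:

First, use the XML package to download a table of U.S. presidents from the URL below.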

> library(XML)
> theURL <- "http://www.loc.gov/rr/print/list/057_chron.html"
> presidents1 <- readHTMLTable(theURL, which=3, as.data.frame = TRUE, skip.rows = 1, header=TRUE, stringsAsFactors=FALSE)
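Error: failed to load HTTP resource

This fails.

Opening the original page in a browser shows that it now sits behind a verification step, so I copied the table into Excel and read it with the readxl package (the table has been uploaded as a resource: presidents., free to download).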

> library(readxl)
> presidents <- read_excel('E:/B/R/presidents.xlsx')
> head(presidents)
# A tibble: 6 x 4
  YEAR   PRESIDENT   `FIRST LADY`        `VICE PRESIDENT`
  <chr>  <chr>       <chr>               <chr>           
1 1789-~ George Was~ Martha Washington   John Adams      
2 1797-~ John Adams  Abigail Adams       Thomas Jefferson
3 1801-~ Thomas Jef~ [Martha Wayles Ske~ Aaron Burr      
4 1805-~ Thomas Jef~ see above           George Clinton  
5 1809-~ James Madi~ Dolley Madison      George Clinton  
6 1812-~ James Madi~ Dolley Madison      office vacant

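Here we create two new columns: one for the start of the term and one for the end.

That is, we split the YEAR column on the hyphen (-).

We use str_split, which splits strings on a given pattern and returns a list with one element per element of the input vector.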

> library(stringr)
> yearList <- str_split(string = presidents$YEAR, pattern = "-")
> head(yearList)
[[1]]
[1] "1789" "1797"

[[2]]
[1] "1797" "1801"

[[3]]
[1] "1801" "1805"

[[4]]
[1] "1805" "1809"

[[5]]
[1] "1809" "1812"

[[6]]
[1] "1812" "1813"

Then combine the list row by row with rbind (applied through Reduce) and wrap the result in a data frame.

> yearMatrix <- data.frame(Reduce(rbind, yearList))
> head(yearMatrix)
       X1   X2
init 1789 1797
X    1797 1801
X.1  1801 1805
X.2  1805 1809
X.3  1809 1812
X.4  1812 1813
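
As an alternative to Reduce(rbind, ...), str_split can return a character matrix directly through its simplify argument. The sketch below assumes stringr is installed and uses a short made-up vector `years` in place of presidents$YEAR.

```r
library(stringr)

# Hypothetical stand-in for presidents$YEAR.
years <- c("1789-1797", "1797-1801", "1801-1805")

# simplify = TRUE returns a character matrix with one row per input string.
yearParts <- str_split(years, pattern = "-", simplify = TRUE)

# Build the two columns and convert them to numeric in one step.
yearDF <- data.frame(
  Start = as.numeric(yearParts[, 1]),
  Stop  = as.numeric(yearParts[, 2])
)
print(yearDF)
```

This route avoids the odd row names (init, X, X.1, ...) that Reduce(rbind, ...) leaves behind.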

Name the columns:

> names(yearMatrix) <- c("Start", "Stop")

Use cbind to bind the two data frames together:

> presidents <- cbind(presidents, yearMatrix)

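Convert the factor columns to numeric.

Note: the factor must first be converted to character, because as.numeric applied to a factor returns the integer codes that R assigns to each unique level, not the values themselves (see Study Notes 1, R Basics, section 1.4.2). Since R 4.0.0, data.frame() no longer converts strings to factors by default, so on a current version these columns may already be character; the as.character step is then harmless.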

> presidents$Start <- as.numeric(as.character(presidents$Start))
> presidents$Stop <- as.numeric(as.character(presidents$Stop))
> head(presidents)
          YEAR         PRESIDENT
init 1789-1797 George Washington
X    1797-1801        John Adams
X.1  1801-1805  Thomas Jefferson
X.2  1805-1809  Thomas Jefferson
X.3  1809-1812     James Madison
X.4  1812-1813     James Madison
                                                                                                     FIRST LADY
init                                                                                          Martha Washington
X                                                                                                 Abigail Adams
X.1  [Martha Wayles Skelton Jefferson，died before Jefferson assumed office;no image of her in P&P collections]
X.2                                                                                                   see above
X.3                                                                                              Dolley Madison
X.4                                                                                              Dolley Madison
       VICE PRESIDENT Start Stop
init       John Adams  1789 1797
X    Thomas Jefferson  1797 1801
X.1        Aaron Burr  1801 1805
X.2    George Clinton  1805 1809
X.3    George Clinton  1809 1812
X.4     office vacant  1812 1813
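
The factor pitfall mentioned in the note on conversion is easy to reproduce in base R. The sketch below uses a made-up factor `startFactor` rather than the presidents data.

```r
# Hypothetical factor holding years as labels.
startFactor <- factor(c("1789", "1797", "1801", "1797"))

# as.numeric() on a factor returns the internal level codes, not the years.
codes <- as.numeric(startFactor)
print(codes)

# Converting to character first recovers the actual values.
yearsOut <- as.numeric(as.character(startFactor))
print(yearsOut)
```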

str_sub function: selects specified characters from text

Example:

Select the first three letters of each president's name

> str_sub(string = presidents$PRESIDENT, start = 1, end = 3)
 [1] "Geo" "Joh" "Tho" "Tho" "Jam" "Jam" "Jam" "Jam"
 [9] "Jam" "Joh" "And" "And" "Mar" "Wil" "Joh" "Jam"
[17] "Zac" "Mil" "Fra" "Fra" "Jam" "Abr" "Abr" "And"
[25] "Uly" "Uly" "Uly" "Rut" "Jam" "Che" "Gro" "Gro"
[33] "Ben" "Gro" "Wil" "Wil" "Wil" "The" "The" "Wil"
[41] "Wil" "Woo" "War" "Cal" "Cal" "Her" "Fra" "Fra"
[49] "Fra" "Har" "Har" "Dwi" "Joh" "Lyn" "Lyn" "Ric"
[57] "Ric" "Ger" "Jim" "Ron" "Geo" "Bil" "Geo" "Bar"
[65] "Don" "Jos"

Find the presidents whose term began in a year ending in 1

> presidents[str_sub(string = presidents$Start, start = 4, end = 4) == 1, c("YEAR", "PRESIDENT", "Start", "Stop")]
          YEAR              PRESIDENT Start Stop
X.1  1801-1805       Thomas Jefferson  1801 1805
X.12      1841 William Henry Harrison  1841 1841
X.13 1841-1845             John Tyler  1841 1845
X.20 1861-1865        Abraham Lincoln  1861 1865
X.27      1881      James A. Garfield  1881 1881
X.28 1881-1885      Chester A. Arthur  1881 1885
X.35      1901       William McKinley  1901 1901
X.36 1901-1905     Theodore Roosevelt  1901 1905
X.41 1921-1923      Warren G. Harding  1921 1923
X.46 1941-1945  Franklin D. Roosevelt  1941 1945
X.51 1961-1963        John F. Kennedy  1961 1963
X.58 1981-1989          Ronald Reagan  1981 1989
X.61 2001-2009         George W. Bush  2001 2009
X.64     2021-        Joseph R. Biden  2021   NA
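
The query above hard-codes position 4, which works only because every year has four digits. str_sub also accepts negative positions that count from the end of the string, so the last character can be taken regardless of length. A minimal sketch, assuming stringr is installed and using a made-up vector `startYears`:

```r
library(stringr)

# Hypothetical start years, converted to character explicitly.
startYears <- as.character(c(1789, 1801, 1841, 2021))

# start = -1 selects from the last character to the end of the string.
lastDigit <- str_sub(startYears, start = -1)
print(lastDigit)

# Logical vector usable for row selection, as in the query above.
endsInOne <- lastDigit == "1"
print(endsInOne)
```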

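11.4 Regular expressions

Filtering text calls for general, flexible patterns, and regular expressions provide exactly that.

Example:

Use str_detect to find every president whose name contains "John"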

> johnPos <- str_detect(string = presidents$PRESIDENT, pattern = "John")
> presidents[johnPos, c("YEAR", "PRESIDENT", "Start", "Stop")]
          YEAR         PRESIDENT Start Stop
X    1797-1801        John Adams  1797 1801
X.8  1825-1829 John Quincy Adams  1825 1829
X.13 1841-1845        John Tyler  1841 1845
X.22 1865-1869    Andrew Johnson  1865 1869
X.51 1961-1963   John F. Kennedy  1961 1963
X.52 1963-1965 Lyndon B. Johnson  1963 1965
X.53 1965-1969 Lyndon B. Johnson  1965 1969

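To ignore case, wrap the pattern in regex() with ignore_case = TRUE (older versions of stringr used ignore.case(), which has since been deprecated)

> str_detect(presidents$PRESIDENT, regex("John", ignore_case = TRUE))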
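
The effect of case-insensitive matching is easier to see on a vector that mixes cases. This sketch assumes stringr is installed; the vector `someNames` is made up for illustration.

```r
library(stringr)

# Hypothetical names with mixed capitalization.
someNames <- c("John Adams", "Andrew Johnson", "JOHN DOE", "James Madison")

# Default matching is case-sensitive, so "JOHN DOE" is missed.
exact <- str_detect(someNames, "John")
print(exact)

# regex(..., ignore_case = TRUE) matches regardless of case.
anyCase <- str_detect(someNames, regex("john", ignore_case = TRUE))
print(anyCase)
```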

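Next, a table of U.S. wars illustrates how regular expressions help in data processing.

Example:

First, load the data. Loading an RData file from a URL takes three steps: create a connection with url, read from it with load, then close it with close.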

> con <- url("http://www.jaredlander.com/data/warTimes.rdata")
> load(con)
> close(con)
> head(warTimes)
[1] "September 1, 1774 ACAEA September 3, 1783"
[2] "September 1, 1774 ACAEA March 17, 1776"   
[3] "1775ACAEA1783"                            
[4] "June 1775 ACAEA October 1776"             
[5] "July 1776 ACAEA March 1777"               
[6] "June 14, 1777 ACAEA October 17, 1777"

We want to create a new column holding the start date of each war, so the time strings need to be split.

> warTimes[str_detect(string = warTimes, pattern = "-")]
[1] "6 June 1944 ACAEA mid-July 1944"
[2] "25 August-17 December 1944"

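Selecting the entries that contain "-" shows that the hyphen sometimes acts as a separator and sometimes as part of a word, so the split needs care.

When splitting, we need to look for either "ACAEA" or "-", so we use the regular expression "(ACAEA)|-". To avoid splitting mid-July, we set the argument n to 2, so that at most two pieces are returned per element.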

> theTimes <- str_split(string = warTimes, pattern = "(ACAEA)|-", n=2)
> head(theTimes)
[[1]]
[1] "September 1, 1774 " " September 3, 1783"

[[2]]
[1] "September 1, 1774 " " March 17, 1776"   

[[3]]
[1] "1775" "1783"

[[4]]
[1] "June 1775 "    " October 1776"

[[5]]
[1] "July 1776 "  " March 1777"

[[6]]
[1] "June 14, 1777 "    " October 17, 1777"

We can then check whether mid-July was split

> which(str_detect(string = warTimes, pattern = "-"))
[1] 147 150
> theTimes[[147]]
[1] "6 June 1944 "   " mid-July 1944"
> theTimes[[150]]
[1] "25 August"        "17 December 1944"

mid-July is still intact.
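
The role of n can be seen by splitting the same string with and without it. A minimal sketch, assuming stringr is installed; the string `period` is invented for illustration and is not in warTimes.

```r
library(stringr)

# Hypothetical entry with one separator hyphen and one word-internal hyphen.
period <- "25 August-mid-December 1944"

# Without n, every hyphen splits the string.
allPieces <- str_split(period, pattern = "-")[[1]]
print(allPieces)

# With n = 2, splitting stops after the first match,
# so the remainder stays in one piece.
twoPieces <- str_split(period, pattern = "-", n = 2)[[1]]
print(twoPieces)
```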

We only need the start date, so we write a function that extracts the first element of each vector in the list

> theStart <- sapply(theTimes, FUN = function(x) x[1])
> head(theStart)
[1] "September 1, 1774 " "September 1, 1774 "
[3] "1775"               "June 1775 "        
[5] "July 1776 "         "June 14, 1777 "

Some elements have spaces next to the separator; str_trim removes leading and trailing whitespace

> theStart <- str_trim(theStart)
> head(theStart)
[1] "September 1, 1774" "September 1, 1774"
[3] "1775"              "June 1775"        
[5] "July 1776"         "June 14, 1777"
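
str_trim only touches the ends of a string. The sketch below, assuming a stringr version that includes str_squish, contrasts it with the side argument and with str_squish, which also collapses repeated internal spaces. The string `messy` is made up.

```r
library(stringr)

# Hypothetical string with leading, trailing, and repeated internal spaces.
messy <- "  June   1775 "

# Trim both ends (the default, side = "both").
both <- str_trim(messy)
print(both)

# Trim only the left end.
leftOnly <- str_trim(messy, side = "left")
print(leftOnly)

# str_squish trims both ends and reduces internal runs of spaces to one.
squished <- str_squish(messy)
print(squished)
```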

Next, str_detect can find the elements that contain "January", and indexing with the result returns the full entries

> theStart[str_detect(string = theStart, pattern = "January")]
[1] "January"          "January 21"      
[3] "January 1942"     "January"         
[5] "January 22, 1944" "22 January 1944" 
[7] "January 4, 1989"  "15 January 2002" 
[9] "January 14, 2010"

To extract the year, we search for four digits in a row. In a regular expression, "[0-9]" matches any single digit.

Here we search with str_extract

> head(str_extract(string = theStart, "[0-9][0-9][0-9][0-9]"), 20)
 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777"
 [8] "1775" "1776" "1778" "1775" "1779" NA     "1785"
[15] "1798" "1801" NA     "1812" "1812" "1813"

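Typing this out is tedious. "\d" is shorthand for a digit, and braces {x} match exactly x consecutive occurrences of the preceding item. (Most languages write the digit shorthand as "\d"; inside an R string the backslash itself must be escaped, giving "\\d".)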

> head(str_extract(string = theStart, "\\d{4}"), 20)
 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777"
 [8] "1775" "1776" "1778" "1775" "1779" NA     "1785"
[15] "1798" "1801" NA     "1812" "1812" "1813"

Regular expressions can anchor a match with "^" and "$", which mark the start and the end of a line respectively

> head(str_extract(string = theStart, "^\\d{4}"), 20)
 [1] NA     NA     "1775" NA     NA     NA     "1777"
 [8] "1775" "1776" "1778" "1775" "1779" NA     "1785"
[15] "1798" "1801" NA     NA     "1812" "1813"
> head(str_extract(string = theStart, "\\d{4}$"), 20)
 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777"
 [8] "1775" "1776" "1778" "1775" "1779" NA     "1785"
[15] "1798" "1801" NA     "1812" "1812" "1813"
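
The difference between the unanchored and anchored patterns shows up clearly on a few hand-picked strings. This is a sketch assuming stringr is installed; the vector `dates` is made up and is not part of warTimes.

```r
library(stringr)

# Hypothetical date strings.
dates <- c("1775", "June 1775", "1812 to 1815", "January")

# First run of four digits anywhere in the string.
anywhere <- str_extract(dates, "\\d{4}")
print(anywhere)

# Four digits only at the start of the string.
atStart <- str_extract(dates, "^\\d{4}")
print(atStart)

# Four digits only at the end of the string.
atEnd <- str_extract(dates, "\\d{4}$")
print(atEnd)

# Both anchors together: the whole string must be exactly four digits.
onlyYear <- str_detect(dates, "^\\d{4}$")
print(onlyYear)
```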

Regular expressions can also replace text selectively. For example:

Use str_replace to replace the first digit with x

> head(str_replace(string = theStart, pattern = "\\d", replacement = "x"), 30)
 [1] "September x, 1774" "September x, 1774"
 [3] "x775"              "June x775"        
 [5] "July x776"         "June x4, 1777"    
 [7] "x777"              "x775"             
 [9] "x776"              "x778"             
[11] "x775"              "x779"             
[13] "January"           "x785"             
[15] "x798"              "x801"             
[17] "August"            "June x8, 1812"    
[19] "x812"              "x813"             
[21] "x812"              "x812"             
[23] "x813"              "x813"             
[25] "x813"              "x814"             
[27] "x813"              "x814"             
[29] "x813"              "x815"

Use str_replace_all to replace every digit with x, so a run of n digits becomes n x's

> head(str_replace_all(string = theStart, pattern = "\\d", replacement = "x"), 30)
 [1] "September x, xxxx" "September x, xxxx"
 [3] "xxxx"              "June xxxx"        
 [5] "July xxxx"         "June xx, xxxx"    
 [7] "xxxx"              "xxxx"             
 [9] "xxxx"              "xxxx"             
[11] "xxxx"              "xxxx"             
[13] "January"           "xxxx"             
[15] "xxxx"              "xxxx"             
[17] "August"            "June xx, xxxx"    
[19] "xxxx"              "xxxx"             
[21] "xxxx"              "xxxx"             
[23] "xxxx"              "xxxx"             
[25] "xxxx"              "xxxx"             
[27] "xxxx"              "xxxx"             
[29] "xxxx"              "xxxx"

Use str_replace_all with "\\d{1,4}" to replace each run of one to four digits with a single x

> head(str_replace_all(string = theStart, pattern = "\\d{1,4}", replacement = "x"), 30)
 [1] "September x, x" "September x, x" "x"             
 [4] "June x"         "July x"         "June x, x"     
 [7] "x"              "x"              "x"             
[10] "x"              "x"              "x"             
[13] "January"        "x"              "x"             
[16] "x"              "August"         "June x, x"     
[19] "x"              "x"              "x"             
[22] "x"              "x"              "x"             
[25] "x"              "x"              "x"             
[28] "x"              "x"              "x"
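
The three replacement variants can be compared side by side on two short strings. A minimal sketch, assuming stringr is installed; `sample2` is a made-up vector, and "\\d+" (one or more digits) is used here in place of "\\d{1,4}" as a pattern that handles runs of any length.

```r
library(stringr)

# Hypothetical inputs.
sample2 <- c("June 14, 1777", "1812")

# Only the first digit in each string is replaced.
firstOnly <- str_replace(sample2, "\\d", "x")
print(firstOnly)

# Every digit is replaced individually.
eachDigit <- str_replace_all(sample2, "\\d", "x")
print(eachDigit)

# Each run of digits is replaced by a single x.
eachRun <- str_replace_all(sample2, "\\d+", "x")
print(eachRun)
```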

Regular expressions can also extract the text between HTML tags

Example:

First create a vector containing HTML markup

> commands <- c("<a href=index.html>The link is here</a>",
+               "<b>This is a bold text</b>")

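The pattern here combines "<.+?>" for the tags with "(.+?)" for the text between them. "." matches any character, "+" matches one or more times, and "?" makes the match non-greedy, so it stops as early as possible.

Because we do not know in advance what text sits between the tags, and that text is what we want to keep, we wrap it in parentheses to form a group and use a backreference, written "\\1" in an R string, to reinsert it as the replacement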

> str_replace(string = commands, pattern = "<.+?>(.+?)<.+>", replacement = "\\1")
[1] "The link is here"    "This is a bold text"
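
The difference between greedy and non-greedy matching matters as soon as a string holds more than one tag pair. The sketch below assumes stringr is installed; the string `html` is made up for illustration.

```r
library(stringr)

# Hypothetical string with two tag pairs.
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" runs from the first "<" to the last ">".
greedy <- str_extract(html, "<.+>")
print(greedy)

# Non-greedy: ".+?" stops at the first ">" it can.
lazy <- str_extract(html, "<.+?>")
print(lazy)

# Strip every tag pair, keeping the captured inner text via \\1.
stripped <- str_replace_all(html, "<.+?>(.+?)</.+?>", "\\1")
print(stripped)
```

For real HTML documents a parser is usually the more robust choice; regular expressions work for short, regular fragments like these.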

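Summary

This note covered basic string operations: creating, extracting, and manipulating text. In my view they are most useful for pulling data out of text or web pages. Since raw data usually arrives with a lot of character content that needs cleaning, string operations can save a great deal of time.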
