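R Beginner's Study Notes 11: String Manipulation
Note links
Study Notes 1: R Basics.
Study Notes 2: Advanced Data Structures.
Study Notes 3: Reading Data into R.
Study Notes 4: Statistical Graphics.
Study Notes 5: Writing R Functions and Simple Control and Loop Statements.
Study Notes 6: Group Operations.
Study Notes 7: Efficient Group Operations with dplyr.
Study Notes 8: Iterating over Data.
Study Notes 9: Data Wrangling.
Study Notes 10: Data Reshaping with the Tidyverse.
Study Notes 11: String Manipulation
Purpose: working with text, such as creating, extracting, or manipulating it.
Function: building strings up or taking them apart.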
11.1 paste
The paste function joins strings together.
Example:
> paste("Hello", "Jared", "and others")
[1] "Hello Jared and others"
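Note: the pieces are separated by spaces because paste's sep argument, which defaults to a single space, decides what is placed between them.
Example: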
> paste("Hello", "Jared", "and others", sep="/")
[1] "Hello/Jared/and others"
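As a minimal sketch of how sep behaves, the snippet below compares the default separator, a custom one, and paste0, which is base R's shorthand for paste with sep = "". The variable names are chosen here for illustration only.

```r
# sep decides what goes between the pieces; its default is one space.
greeting_default <- paste("Hello", "Jared")
greeting_dash <- paste("Hello", "Jared", sep = "-")

# paste0() behaves like paste(..., sep = ""), so nothing is inserted.
greeting_none <- paste0("Hello", "Jared")

print(c(greeting_default, greeting_dash, greeting_none))
```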
paste is also vectorized, so it can join the strings of several vectors element by element.
Example:
> paste(c("Hello", "Hey", "Howdy"), c("Jared", "Bob", "David"))
[1] "Hello Jared" "Hey Bob" "Howdy David"
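When the vectors have the same length, their elements are paired one to one; when the lengths differ, the shorter vectors are recycled.
Example: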
> paste("Hello", c("Jared", "Bob", "David"), c("Goodbye", "Seeya"))
[1] "Hello Jared Goodbye" "Hello Bob Seeya"
[3] "Hello David Goodbye"
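With the collapse argument, paste can also merge all the elements of a text vector into a single string, separated by any delimiter you choose.
Example: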
> vectorOfText <- c("Hello", "Everyone", "out there", ".")
> vectorOfText
[1] "Hello" "Everyone" "out there" "."
> paste(vectorOfText, collapse = " ")
[1] "Hello Everyone out there ."
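The difference between sep and collapse is easy to mix up, so here is a small sketch using both at once. sep is applied between the arguments within each element, and collapse is then applied between the resulting elements. The vectors first and second are illustrative names, not part of the original notes.

```r
first <- c("Hello", "Hey", "Howdy")
second <- c("Jared", "Bob", "David")

# sep alone: still a vector of length 3, one string per pair.
pairs <- paste(first, second, sep = " ")

# sep plus collapse: the 3 strings are merged into a single string.
one_string <- paste(first, second, sep = " ", collapse = ", ")

print(pairs)
print(one_string)
```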
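11.2 Writing Formatted Data to a String (sprintf)
paste is convenient for combining short pieces of text.
When special variables have to be inserted into a longer sentence, sprintf is the better tool.
Example: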
> person <- "Jared"
> partySize <- "eight"
> waitTime <- 25
> sprintf("Hello %s, your party of %s will be seated in %s minutes", person, partySize, waitTime)
[1] "Hello Jared, your party of eight will be seated in 25 minutes"
sprintf is also vectorized; shorter arguments are recycled to the length of the longest one.
Example:
> sprintf("Hello %s, your party of %s will be seated in %s minutes", c("Jared", "Bob"), c("eight", 16, "four", 10), waitTime)
[1] "Hello Jared, your party of eight will be seated in 25 minutes"
[2] "Hello Bob, your party of 16 will be seated in 25 minutes"
[3] "Hello Jared, your party of four will be seated in 25 minutes"
[4] "Hello Bob, your party of 10 will be seated in 25 minutes"
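The examples above use only %s, which formats any value as a string. As a hedged sketch, sprintf also accepts the usual C-style specifiers such as %d for integers and %.1f for one decimal place; the values below are made up for illustration.

```r
person <- "Jared"
partySize <- 8L   # %d expects an integer-valued argument
waitTime <- 25    # minutes

# %s = string, %d = integer, %.1f = number with one decimal place
msg <- sprintf("Hello %s, your party of %d will wait %.1f hours",
               person, partySize, waitTime / 60)

# %03d pads an integer with leading zeros to width 3
padded <- sprintf("%03d", partySize)

print(msg)
print(padded)
```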
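11.3 Extracting Text
Many functions can extract text; here we use the stringr package.
- str_split: splits a string
Example:
First, use the XML package to download a table of U.S. presidents from the URL below.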
> library(XML)
> theURL <- "http://www.loc.gov/rr/print/list/057_chron.html"
> presidents1 <- readHTMLTable(theURL, which=3, as.data.frame = TRUE, skip.rows = 1, header=TRUE, stringsAsFactors=FALSE)
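Error: failed to load HTTP resource
The download fails.
Visiting the original page shows that it now has a verification step, so I copied the table directly into Excel and read it with the readxl package (the table has been uploaded as a resource: presidents., free to download).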
> library(readxl)
> presidents <- read_excel('E:/B/R/presidents.xlsx')
> head(presidents)
# A tibble: 6 x 4
YEAR PRESIDENT `FIRST LADY` `VICE PRESIDENT`
<chr> <chr> <chr> <chr>
1 1789-~ George Was~ Martha Washington John Adams
2 1797-~ John Adams Abigail Adams Thomas Jefferson
3 1801-~ Thomas Jef~ [Martha Wayles Ske~ Aaron Burr
4 1805-~ Thomas Jef~ see above George Clinton
5 1809-~ James Madi~ Dolley Madison George Clinton
6 1812-~ James Madi~ Dolley Madison office vacant
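Here we create two new columns, one for the year the term started and one for the year it ended.
That is, we split the YEAR column on the hyphen (-).
str_split splits strings on a given pattern and returns a list with one element for each element of the input vector.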
> library(stringr)
> yearList <- str_split(string = presidents$YEAR, pattern = "-")
> head(yearList)
[[1]]
[1] "1789" "1797"
[[2]]
[1] "1797" "1801"
[[3]]
[1] "1801" "1805"
[[4]]
[1] "1805" "1809"
[[5]]
[1] "1809" "1812"
[[6]]
[1] "1812" "1813"
Then combine the list into a single data frame with rbind.
> yearMatrix <- data.frame(Reduce(rbind, yearList))
> head(yearMatrix)
X1 X2
init 1789 1797
X 1797 1801
X.1 1801 1805
X.2 1805 1809
X.3 1809 1812
X.4 1812 1813
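As an alternative sketch to Reduce(rbind, ...), str_split can return a character matrix directly when called with simplify = TRUE. This is not the approach used in these notes, and the years vector below is a made-up sample. One assumed difference worth checking on real data: a single-year entry such as "1841" would get an empty second column here instead of a repeated year.

```r
library(stringr)

# Hypothetical sample in the same "start-stop" shape as the YEAR column.
years <- c("1789-1797", "1797-1801", "1801-1805")

# simplify = TRUE returns a character matrix, one row per input string.
yearMat <- str_split(years, pattern = "-", simplify = TRUE)

# Build the data frame with named, numeric columns in one step.
yearDF <- data.frame(Start = as.numeric(yearMat[, 1]),
                     Stop = as.numeric(yearMat[, 2]))
print(yearDF)
```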
Give the columns names
> names(yearMatrix) <- c("Start", "Stop")
Use cbind to combine the two data frames
> presidents <- cbind(presidents, yearMatrix)
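Convert the factors to numeric data
Note: a factor must first be converted to character, because as.numeric applied to a factor returns the integer code R assigns to each unique level rather than the value that is printed. (See the R Basics notes, section 1.4.2.) Since R 4.0.0, data.frame() no longer converts strings to factors by default, so these columns may already be character; in that case as.character() is simply a harmless extra step.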
> presidents$Start <- as.numeric(as.character(presidents$Start))
> presidents$Stop <- as.numeric(as.character(presidents$Stop))
> head(presidents)
YEAR PRESIDENT
init 1789-1797 George Washington
X 1797-1801 John Adams
X.1 1801-1805 Thomas Jefferson
X.2 1805-1809 Thomas Jefferson
X.3 1809-1812 James Madison
X.4 1812-1813 James Madison
FIRST LADY
init Martha Washington
X Abigail Adams
X.1 [Martha Wayles Skelton Jefferson,died before Jefferson assumed office;no image of her in P&P collections]
X.2 see above
X.3 Dolley Madison
X.4 Dolley Madison
VICE PRESIDENT Start Stop
init John Adams 1789 1797
X Thomas Jefferson 1797 1801
X.1 Aaron Burr 1801 1805
X.2 George Clinton 1805 1809
X.3 George Clinton 1809 1812
X.4 office vacant 1812 1813
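The factor caveat mentioned in the conversion step can be seen in a small sketch. The factor f is a made-up example; its levels are sorted alphabetically, so the integer codes do not match the printed years.

```r
# Levels are sorted: "1789" "1797" "1801", so the codes are 2, 1, 3.
f <- factor(c("1797", "1789", "1801"))

# Wrong for our purpose: returns the internal level codes.
codes <- as.numeric(f)

# Right: go through character first to recover the printed values.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```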
- str_sub: selects the characters at given positions in the text
Example:
Select the first three letters of each president's name
> str_sub(string = presidents$PRESIDENT, start = 1, end = 3)
[1] "Geo" "Joh" "Tho" "Tho" "Jam" "Jam" "Jam" "Jam"
[9] "Jam" "Joh" "And" "And" "Mar" "Wil" "Joh" "Jam"
[17] "Zac" "Mil" "Fra" "Fra" "Jam" "Abr" "Abr" "And"
[25] "Uly" "Uly" "Uly" "Rut" "Jam" "Che" "Gro" "Gro"
[33] "Ben" "Gro" "Wil" "Wil" "Wil" "The" "The" "Wil"
[41] "Wil" "Woo" "War" "Cal" "Cal" "Her" "Fra" "Fra"
[49] "Fra" "Har" "Har" "Dwi" "Joh" "Lyn" "Lyn" "Ric"
[57] "Ric" "Ger" "Jim" "Ron" "Geo" "Bil" "Geo" "Bar"
[65] "Don" "Jos"
Find the presidents whose term began in a year ending in 1
> presidents[str_sub(string = presidents$Start, start = 4, end = 4) == 1, c("YEAR", "PRESIDENT", "Start", "Stop")]
YEAR PRESIDENT Start Stop
X.1 1801-1805 Thomas Jefferson 1801 1805
X.12 1841 William Henry Harrison 1841 1841
X.13 1841-1845 John Tyler 1841 1845
X.20 1861-1865 Abraham Lincoln 1861 1865
X.27 1881 James A. Garfield 1881 1881
X.28 1881-1885 Chester A. Arthur 1881 1885
X.35 1901 William McKinley 1901 1901
X.36 1901-1905 Theodore Roosevelt 1901 1905
X.41 1921-1923 Warren G. Harding 1921 1923
X.46 1941-1945 Franklin D. Roosevelt 1941 1945
X.51 1961-1963 John F. Kennedy 1961 1963
X.58 1981-1989 Ronald Reagan 1981 1989
X.61 2001-2009 George W. Bush 2001 2009
X.64 2021- Joseph R. Biden 2021 NA
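The filter above relies on every year having exactly four digits. As a hedged variation, str_sub also accepts negative positions, which count from the end of the string, so the last character can be taken without knowing the length. The startYears vector is a made-up sample.

```r
library(stringr)

# Hypothetical start years; converted to character before slicing.
startYears <- c(1789, 1801, 1841, 2021)

# start = -1, end = -1 selects the last character of each string.
lastDigit <- str_sub(string = as.character(startYears), start = -1, end = -1)

# Keep only the years that end in 1.
endsInOne <- startYears[lastDigit == "1"]
print(endsInOne)
```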
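11.4 Regular Expressions
Filtering text calls for general, flexible patterns, which is exactly what regular expressions provide.
Example:
Use str_detect to find every president whose name contains "John"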
> johnPos <- str_detect(string = presidents$PRESIDENT, pattern = "John")
> presidents[johnPos, c("YEAR", "PRESIDENT", "Start", "Stop")]
YEAR PRESIDENT Start Stop
X 1797-1801 John Adams 1797 1801
X.8 1825-1829 John Quincy Adams 1825 1829
X.13 1841-1845 John Tyler 1841 1845
X.22 1865-1869 Andrew Johnson 1865 1869
X.51 1961-1963 John F. Kennedy 1961 1963
X.52 1963-1965 Lyndon B. Johnson 1963 1965
X.53 1965-1969 Lyndon B. Johnson 1965 1969
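To ignore case, wrap the pattern in regex() with ignore_case = TRUE (the older ignore.case() helper has been deprecated since stringr 1.0.0).
> str_detect(presidents$PRESIDENT, regex("John", ignore_case = TRUE))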
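A minimal sketch of case-sensitive versus case-insensitive matching, using the regex() modifier that current stringr versions provide. The people vector is made up for illustration.

```r
library(stringr)

people <- c("John Adams", "Andrew Johnson", "JOHN TYLER", "James Madison")

# Default matching is case sensitive, so "JOHN TYLER" is missed.
exact <- str_detect(people, "John")

# regex(..., ignore_case = TRUE) matches regardless of letter case.
anyCase <- str_detect(people, regex("john", ignore_case = TRUE))

print(exact)
print(anyCase)
```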
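The following example, based on a table of American wars, shows in more detail how regular expressions help with data processing.
Example:
First, load the data. (To load an RData file from a URL, create a connection with url, read it with load, and then close the connection with close.)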
> con <- url("http://www.jaredlander.com/data/warTimes.rdata")
> load(con)
> close(con)
> head(warTimes)
[1] "September 1, 1774 ACAEA September 3, 1783"
[2] "September 1, 1774 ACAEA March 17, 1776"
[3] "1775ACAEA1783"
[4] "June 1775 ACAEA October 1776"
[5] "July 1776 ACAEA March 1777"
[6] "June 14, 1777 ACAEA October 17, 1777"
We want to create a new column holding the date each war started, so the time strings need to be split.
> warTimes[str_detect(string = warTimes, pattern = "-")]
[1] "6 June 1944 ACAEA mid-July 1944"
[2] "25 August-17 December 1944"
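Selecting the entries that contain "-" shows that the hyphen is sometimes used as a separator between dates and sometimes as part of a word, so we have to split with care.
When splitting, we need to look for either "ACAEA" or "-", which the regular expression "(ACAEA)|-" expresses. To avoid splitting mid-July, we set the argument n to 2, so that at most two pieces are returned for each string.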
> theTimes <- str_split(string = warTimes, pattern = "(ACAEA)|-", n=2)
> head(theTimes)
[[1]]
[1] "September 1, 1774 " " September 3, 1783"
[[2]]
[1] "September 1, 1774 " " March 17, 1776"
[[3]]
[1] "1775" "1783"
[[4]]
[1] "June 1775 " " October 1776"
[[5]]
[1] "July 1776 " " March 1777"
[[6]]
[1] "June 14, 1777 " " October 17, 1777"
We can then check whether mid-July was split
> which(str_detect(string = warTimes, pattern = "-"))
[1] 147 150
> theTimes[[147]]
[1] "6 June 1944 " " mid-July 1944"
> theTimes[[150]]
[1] "25 August" "17 December 1944"
As the output shows, mid-July is still in one piece.
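To make the effect of n concrete, the sketch below splits the same kind of string with and without the limit. The string x is copied from the data shown above; the variable names are illustrative.

```r
library(stringr)

x <- "6 June 1944 ACAEA mid-July 1944"

# Without n, every match of the pattern is a split point,
# so the hyphen inside "mid-July" also splits the string.
allPieces <- str_split(x, pattern = "(ACAEA)|-")[[1]]

# With n = 2, splitting stops after the first match.
twoPieces <- str_split(x, pattern = "(ACAEA)|-", n = 2)[[1]]

print(allPieces)
print(twoPieces)
```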
We only need the start date of each war, so we write a function that extracts the first element of each vector in the list
> theStart <- sapply(theTimes, FUN = function(x) x[1])
> head(theStart)
[1] "September 1, 1774 " "September 1, 1774 "
[3] "1775" "June 1775 "
[5] "July 1776 " "June 14, 1777 "
Some entries have spaces left over around the separator; str_trim removes leading and trailing whitespace
> theStart <- str_trim(theStart)
> head(theStart)
[1] "September 1, 1774" "September 1, 1774"
[3] "1775" "June 1775"
[5] "July 1776" "June 14, 1777"
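By default str_trim strips both ends. As a small sketch, its side argument restricts trimming to one end; the padded string below is made up.

```r
library(stringr)

padded <- "  June 1775  "

both <- str_trim(padded)                    # strips both ends (default)
leftOnly <- str_trim(padded, side = "left")   # keeps trailing spaces
rightOnly <- str_trim(padded, side = "right") # keeps leading spaces

print(c(both, leftOnly, rightOnly))
```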
Next, str_detect can be used to find the elements containing "January" and return the whole entries
> theStart[str_detect(string = theStart, pattern = "January")]
[1] "January" "January 21"
[3] "January 1942" "January"
[5] "January 22, 1944" "22 January 1944"
[7] "January 4, 1989" "15 January 2002"
[9] "January 14, 2010"
To extract the year, we search for four digits in a row. In a regular expression, "[0-9]" matches any single digit.
Here we search with str_extract
> head(str_extract(string = theStart, "[0-9][0-9][0-9][0-9]"), 20)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777"
[8] "1775" "1776" "1778" "1775" "1779" NA "1785"
[15] "1798" "1801" NA "1812" "1812" "1813"
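Typing the class four times is tedious. "\d" is shorthand for a digit, and braces {x} match the preceding item exactly x times, so "\d{4}" finds four consecutive digits. (Most languages write the digit shorthand as "\d"; in an R string the backslash itself must be escaped, giving "\\d".)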
> head(str_extract(string = theStart, "\\d{4}"), 20)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777"
[8] "1775" "1776" "1778" "1775" "1779" NA "1785"
[15] "1798" "1801" NA "1812" "1812" "1813"
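As a quick check that the two spellings are interchangeable, the sketch below runs both on a made-up vector. It also shows that str_extract returns NA when an element has no match.

```r
library(stringr)

dates <- c("September 1, 1774", "1775", "January")

# Explicit character class repeated four times via {4}.
viaClass <- str_extract(dates, "[0-9]{4}")

# Same search using the digit shorthand, escaped for an R string.
viaShorthand <- str_extract(dates, "\\d{4}")

print(viaShorthand)
```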
Regular expressions can also anchor a search: "^" matches the start of a line and "$" matches the end
> head(str_extract(string = theStart, "^\\d{4}"), 20)
[1] NA NA "1775" NA NA NA "1777"
[8] "1775" "1776" "1778" "1775" "1779" NA "1785"
[15] "1798" "1801" NA NA "1812" "1813"
> head(str_extract(string = theStart, "\\d{4}$"), 20)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777"
[8] "1775" "1776" "1778" "1775" "1779" NA "1785"
[15] "1798" "1801" NA "1812" "1812" "1813"
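Combining both anchors restricts a match to strings that consist of the pattern and nothing else. The sketch below, on a made-up vector, contrasts the start anchor, the end anchor, and both together.

```r
library(stringr)

dates <- c("September 1, 1774", "1775", "1775 onward")

# Four digits at the very start of the string.
atStart <- str_extract(dates, "^\\d{4}")

# Four digits at the very end of the string.
atEnd <- str_extract(dates, "\\d{4}$")

# Both anchors: the whole string must be exactly four digits.
yearOnly <- str_detect(dates, "^\\d{4}$")

print(atStart)
print(atEnd)
print(yearOnly)
```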
Regular expressions can also replace text selectively, for example:
Use str_replace to replace the first digit in each string with x
> head(str_replace(string = theStart, pattern = "\\d", replacement = "x"), 30)
[1] "September x, 1774" "September x, 1774"
[3] "x775" "June x775"
[5] "July x776" "June x4, 1777"
[7] "x777" "x775"
[9] "x776" "x778"
[11] "x775" "x779"
[13] "January" "x785"
[15] "x798" "x801"
[17] "August" "June x8, 1812"
[19] "x812" "x813"
[21] "x812" "x812"
[23] "x813" "x813"
[25] "x813" "x814"
[27] "x813" "x814"
[29] "x813" "x815"
Use str_replace_all to replace every digit with x, so each number becomes as many x's as it has digits
> head(str_replace_all(string = theStart, pattern = "\\d", replacement = "x"), 30)
[1] "September x, xxxx" "September x, xxxx"
[3] "xxxx" "June xxxx"
[5] "July xxxx" "June xx, xxxx"
[7] "xxxx" "xxxx"
[9] "xxxx" "xxxx"
[11] "xxxx" "xxxx"
[13] "January" "xxxx"
[15] "xxxx" "xxxx"
[17] "August" "June xx, xxxx"
[19] "xxxx" "xxxx"
[21] "xxxx" "xxxx"
[23] "xxxx" "xxxx"
[25] "xxxx" "xxxx"
[27] "xxxx" "xxxx"
[29] "xxxx" "xxxx"
Use str_replace_all to replace each run of one to four digits with a single x
> head(str_replace_all(string = theStart, pattern = "\\d{1,4}", replacement = "x"), 30)
[1] "September x, x" "September x, x" "x"
[4] "June x" "July x" "June x, x"
[7] "x" "x" "x"
[10] "x" "x" "x"
[13] "January" "x" "x"
[16] "x" "August" "June x, x"
[19] "x" "x" "x"
[22] "x" "x" "x"
[25] "x" "x" "x"
[28] "x" "x" "x"
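The three replacements above can be compared side by side on a single string. This is a sketch on one value taken from the data; "\\d+" (one or more digits) is used here as an assumed alternative to "\\d{1,4}" that also covers longer numbers.

```r
library(stringr)

x <- "June 14, 1777"

# Only the first digit is replaced.
firstOnly <- str_replace(x, pattern = "\\d", replacement = "x")

# Every digit is replaced individually.
eachDigit <- str_replace_all(x, pattern = "\\d", replacement = "x")

# Each run of digits is replaced as a whole.
eachRun <- str_replace_all(x, pattern = "\\d+", replacement = "x")

print(c(firstOnly, eachDigit, eachRun))
```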
Regular expressions can also extract the text between HTML tags
Example:
First create a vector containing some HTML commands
> commands <- c("<a href=index.html>The link is here</a>",
+ "<b>This is a bold text</b>")
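The pattern consists of an opening tag "<.+?>", the text "(.+?)", and a closing tag "<.+>". Here "." matches any character, "+" matches it one or more times, and "?" makes the match non-greedy.
Because we do not know what text lies between the tags, and that text is what we want to keep, we group it in parentheses and use a back reference, written "\\1" in an R string, to reinsert it as the replacement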
> str_replace(string = commands, pattern = "<.+?>(.+?)<.+>", replacement = "\\1")
[1] "The link is here" "This is a bold text"
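The role of the non-greedy "?" is easiest to see by extracting instead of replacing. In this sketch, on one of the strings above, the greedy pattern runs from the first "<" to the last ">", while the non-greedy one stops at the first ">".

```r
library(stringr)

html <- "<b>This is a bold text</b>"

# Greedy: .+ grabs as much as possible, so the whole string matches.
greedy <- str_extract(html, "<.+>")

# Non-greedy: .+? grabs as little as possible, so only the first tag matches.
lazy <- str_extract(html, "<.+?>")

# The back reference \\1 keeps only the text captured by the parentheses.
inner <- str_replace(html, pattern = "<.+?>(.+?)<.+>", replacement = "\\1")

print(c(greedy, lazy, inner))
```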
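Summary
This note covered some basic string operations: creating, extracting, and manipulating text. In my view they are most useful for pulling data out of text or off web pages. Raw data usually arrives with a lot of character content that needs cleaning, so string manipulation can save a great deal of time.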