代码3-1 餐饮销量额数据缺失值及异常值检测代码
1 # 设置工作空间 2 # 把“数据及程序”文件夹拷贝到F盘下,再用setwd设置工作空间 3 setwd("D:/R_Project/book1_R/chapter3/示例程序") 4 # 读入数据 5 saledata <- read.csv(file = "./data/catering_sale.csv", header = TRUE) 6 7 # 缺失值检测 并打印结果,由于R把TRUE和FALSE分别当作1、0,可以用sum()和mean()函数来分别获取缺失样本数、缺失比例 8 sum(complete.cases(saledata)) 9 sum(!complete.cases(saledata)) 10 mean(!complete.cases(saledata)) 11 saledata[!complete.cases(saledata), ] 12 13 # 异常值检测箱线图 14 # boxwex :a scale factor to be applied to all boxes.也就是调整箱子的比例 15 sp <- boxplot(saledata$"销量", boxwex = 0.7) 16 #给箱线图添加标题 17 title("销量异常值检测箱线图") 18 19 xi <- 1.1 20 21 #计算销量的标准差 22 sd.s <- sd(saledata[complete.cases(saledata), ]$"销量") 23 24 #计算销量的平均值 25 mn.s <- mean(saledata[complete.cases(saledata), ]$"销量") 26 points(xi, mn.s, col = "red", pch = 18) 27 arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink", angle = 75, length = .1) 28 text(rep(c(1.05, 1.05, 0.95, 0.95), length = length(sp$out)), 29 labels = sp$out[order(sp$out)], sp$out[order(sp$out)] + 30 rep(c(150, -150, 150, -150), length = length(sp$out)), col = "red")
Notes:
(1)
(2)
* R语言用complete.cases 和 na.omit去除有空值的行:http://blog.sina.com.cn/s/blog_59990a450101qnvy.html
* complete.cases()函数:Return a logical vector indicating which cases are complete, i.e., have no missing values.
* 也就是说它返回的是一个TRUE/FALSE的逻辑向量
(3)
* points()函数:用于标记某个点,设定参数pch即标记该点要用什么形状的来标记,pch=20是实心圆形状;设定参数cex表示这个形状的大小设定为多少,一般cex=2就够了;参数col设定点的颜色
* pch
* plotting ‘character’, i.e., symbol to use. This can either be a single character or an integer code for one of a set of graphics symbols.The full set of S symbols is available with pch = 0:18, see the examples below. (NB: R uses circles instead of the octagons used in S.) Value pch = "." (equivalently pch = 46) is handled specially. It is a rectangle of side 0.01 inch (scaled by cex). In addition, if cex = 1 (the default), each side is at least one pixel (1/72 inch on the pdf, postscript and xfig devices). For other text symbols, cex = 1 corresponds to the default fontsize of the device, often specified by an argument pointsize. For pch in 0:25 the default size is about 75% of the character height (see par("cin")).
* cex
* character (or symbol) expansion: a numerical vector. This works as a multiple of par("cex").
(4)
* arrows()函数:用于在图像画箭头的函数
* 定义:arrows(x0, y0, x1 = x0, y1 = y0, length = 0.25, angle = 30,code = 2, col = par("fg"), lty = par("lty"),lwd = par("lwd"), ...)
* Argument:
* x0, y0
* coordinates of points from which to draw.
* x1, y1
* coordinates of points to which to draw. At least one must the supplied
* length
* length of the edges of the arrow head (in inches).
* angle
* angle from the shaft of the arrow to the edge of the arrow head.(从箭头的轴到箭头的边缘的角度。)
* code
* integer code, determining kind of arrows to be drawn.
* col, lty, lwd
* graphical parameters, possible vectors. NA values in col cause the arrow to be omitted.
箱线图分析: