Data Preparation
Data download: NASA Kennedy Space Center web server logs
Let's take a first look at the data: we load the log file into a DataFrame and show the first 20 rows (the default).
import sys
import os
log_file_path = 'apache.access.log.PROJECT'
base_df = sqlContext.read.text(log_file_path)
base_df.show(truncate=False)
The DataFrame output looks like this:
+-------------------------------------------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------------------------------------------+
|in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839 |
|uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0 |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0 |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0 |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0 |
|ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:09 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713 |
|uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0 |
|slppp6.intermind.net - - [01/Aug/1995:00:00:10 -0400] "GET /history/skylab/skylab.html HTTP/1.0" 200 1687 |
|piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853 |
|slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET /history/skylab/skylab-small.gif HTTP/1.0" 200 9202 |
|slppp6.intermind.net - - [01/Aug/1995:00:00:12 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635 |
|ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:12 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173 |
|slppp6.intermind.net - - [01/Aug/1995:00:00:13 -0400] "GET /history/apollo/images/apollo-logo.gif HTTP/1.0" 200 3047 |
|uplherc.upl.com - - [01/Aug/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0 |
|133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866 |
|d0ucr6.fnal.gov - - [01/Aug/1995:00:00:19 -0400] "GET /history/apollo/apollo-16/apollo-16.html HTTP/1.0" 200 2743 |
|ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:19 -0400] "GET /shuttle/resources/orbiters/discovery.html HTTP/1.0" 200 6849|
|d0ucr6.fnal.gov - - [01/Aug/1995:00:00:20 -0400] "GET /history/apollo/apollo-16/apollo-16-patch-small.gif HTTP/1.0" 200 14897 |
+-------------------------------------------------------------------------------------------------------------------------------+
Here is an explanation of each field (a small parsing sketch follows the table):
Field | Description |
---|---|
remotehost | Remote host name (or its IP address if the DNS name is unavailable) |
rfc931 | The remote user's login name; our analysis does not use this field |
authuser | The user name authenticated by the HTTP server |
[date] | Date and time of the client request |
"request" | The request line coming from the browser or client |
status | The HTTP status code the server returned to the client |
bytes | The number of bytes transferred to the client (Content-Length) |
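To make this field layout concrete, here is a minimal pure-Python sketch (an illustration added for this write-up, not part of the original notebook) that splits one sample line with the standard re module. The pattern and variable names below are assumptions for demonstration only; the Spark-based parsing we actually use follows in the next section.
import re
# One sample line from the log above
sample = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
# Hypothetical Common Log Format pattern: host, rfc931, authuser, [date], "request", status, bytes
clf_pattern = r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
m = re.match(clf_pattern, sample)
if m:
    remotehost, rfc931, authuser, date, request, status, size = m.groups()
    print remotehost, status, size   # in24.inetnebr.com 200 1839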
Data Parsing
First we parse the data, which calls for regular expressions (see XKCD 1313):
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --Jamie Zawinski
We will also use Spark's regexp_extract function. It matches a column against a regular expression containing one or more capture groups and extracts the text matched by one of those groups.
from pyspark.sql.functions import split, regexp_extract
split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(truncate=False)
The parsed data looks like this:
+----------------------------+--------------------------+---------------------------------------------------+------+------------+
|host |timestamp |path |status|content_size|
+----------------------------+--------------------------+---------------------------------------------------+------+------------+
|in24.inetnebr.com |01/Aug/1995:00:00:01 -0400|/shuttle/missions/sts-68/news/sts-68-mcc-05.txt |200 |1839 |
|uplherc.upl.com |01/Aug/1995:00:00:07 -0400|/ |304 |0 |
|uplherc.upl.com |01/Aug/1995:00:00:08 -0400|/images/ksclogo-medium.gif |304 |0 |
|uplherc.upl.com |01/Aug/1995:00:00:08 -0400|/images/MOSAIC-logosmall.gif |304 |0 |
|uplherc.upl.com |01/Aug/1995:00:00:08 -0400|/images/USA-logosmall.gif |304 |0 |
|ix-esc-ca2-07.ix.netcom.com |01/Aug/1995:00:00:09 -0400|/images/launch-logo.gif |200 |1713 |
|uplherc.upl.com |01/Aug/1995:00:00:10 -0400|/images/WORLD-logosmall.gif |304 |0 |
|slppp6.intermind.net |01/Aug/1995:00:00:10 -0400|/history/skylab/skylab.html |200 |1687 |
|piweba4y.prodigy.com |01/Aug/1995:00:00:10 -0400|/images/launchmedium.gif |200 |11853 |
|slppp6.intermind.net |01/Aug/1995:00:00:11 -0400|/history/skylab/skylab-small.gif |200 |9202 |
|slppp6.intermind.net |01/Aug/1995:00:00:12 -0400|/images/ksclogosmall.gif |200 |3635 |
|ix-esc-ca2-07.ix.netcom.com |01/Aug/1995:00:00:12 -0400|/history/apollo/images/apollo-logo1.gif |200 |1173 |
|slppp6.intermind.net |01/Aug/1995:00:00:13 -0400|/history/apollo/images/apollo-logo.gif |200 |3047 |
|uplherc.upl.com |01/Aug/1995:00:00:14 -0400|/images/NASA-logosmall.gif |304 |0 |
|133.43.96.45 |01/Aug/1995:00:00:16 -0400|/shuttle/missions/sts-69/mission-sts-69.html |200 |10566 |
|kgtyk4.kj.yamagata-u.ac.jp |01/Aug/1995:00:00:17 -0400|/ |200 |7280 |
|kgtyk4.kj.yamagata-u.ac.jp |01/Aug/1995:00:00:18 -0400|/images/ksclogo-medium.gif |200 |5866 |
|d0ucr6.fnal.gov |01/Aug/1995:00:00:19 -0400|/history/apollo/apollo-16/apollo-16.html |200 |2743 |
|ix-esc-ca2-07.ix.netcom.com |01/Aug/1995:00:00:19 -0400|/shuttle/resources/orbiters/discovery.html |200 |6849 |
|d0ucr6.fnal.gov |01/Aug/1995:00:00:20 -0400|/history/apollo/apollo-16/apollo-16-patch-small.gif|200 |14897 |
+----------------------------+--------------------------+---------------------------------------------------+------+------------+
Data Cleaning
First, let's check how many blank lines the raw log contains.
base_df.filter(base_df['value'].isNull()).count()
The result is 0. Great, no blank lines:
0
Next, let's count how many rows contain at least one null value.
bad_rows_df = split_df.filter(split_df['host'].isNull() |
                              split_df['timestamp'].isNull() |
                              split_df['path'].isNull() |
                              split_df['status'].isNull() |
                              split_df['content_size'].isNull())
bad_rows_df.count()
The output:
8756
Next, let's count how many null values each column contains:
from pyspark.sql.functions import col, sum

def count_null(col_name):
    return sum(col(col_name).isNull().cast('integer')).alias(col_name)

exprs = []
for col_name in split_df.columns:
    exprs.append(count_null(col_name))

split_df.agg(*exprs).show()  # *exprs unpacks the list into positional arguments
The counts are:
+----+---------+----+------+------------+
|host|timestamp|path|status|content_size|
+----+---------+----+------+------------+
| 0| 0| 0| 0| 8756|
+----+---------+----+------+------------+
It looks like the last column has a lot of nulls. These come from rows whose final field is not a byte count: regexp_extract returns an empty string when its pattern does not match, and casting that to an integer yields null. The code below counts how many lines do not end with one or more digits.
bad_content_size_df = base_df.filter(~ base_df['value'].rlike(r'\d+$'))
bad_content_size_df.count()
As you can see, this matches the earlier count exactly.
8756
Let's look at some of these rows. Because the lines may end with one or more spaces, we append a '*' to each value so that trailing whitespace is visible.
from pyspark.sql.functions import lit, concat
bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(truncate=False)
The output:
+----------------------------------------------------------------------------------------------------------------------------+
|concat(value,*) |
+----------------------------------------------------------------------------------------------------------------------------+
|gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -* |
|js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -*|
|tia1.eskimo.com - - [01/Aug/1995:00:28:41 -0400] "GET /pub/winvn/release.txt HTTP/1.0" 404 -* |
|itws.info.eng.niigata-u.ac.jp - - [01/Aug/1995:00:38:01 -0400] "GET /ksc.html/facts/about_ksc.html HTTP/1.0" 403 -* |
|grimnet23.idirect.com - - [01/Aug/1995:00:50:12 -0400] "GET /www/software/winvn/winvn.html HTTP/1.0" 404 -* |
|miriworld.its.unimelb.edu.au - - [01/Aug/1995:01:04:54 -0400] "GET /history/history.htm HTTP/1.0" 404 -* |
|ras38.srv.net - - [01/Aug/1995:01:05:14 -0400] "GET /elv/DELTA/uncons.htm HTTP/1.0" 404 -* |
|cs1-06.leh.ptd.net - - [01/Aug/1995:01:17:38 -0400] "GET /sts-71/launch/" 404 -* |
|www-b2.proxy.aol.com - - [01/Aug/1995:01:22:07 -0400] "GET /shuttle/countdown HTTP/1.0" 302 -* |
|maui56.maui.net - - [01/Aug/1995:01:31:56 -0400] "GET /shuttle HTTP/1.0" 302 -* |
|dialip-24.athenet.net - - [01/Aug/1995:01:33:02 -0400] "GET /history/apollo/apollo-13.html HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:35:50 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:23 -0400] "GET /history/apollo/a-001/movies/ HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:30 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:38 -0400] "GET /history/apollo/a-001/movies/ HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:42 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:44 -0400] "GET /history/apollo/a-001/images/ HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:47 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:37:04 -0400] "GET /history/apollo/a-004/a-004-patch-small.gif HTTP/1.0" 404 -* |
|h96-158.ccnet.com - - [01/Aug/1995:01:37:05 -0400] "GET /history/apollo/a-004/movies/ HTTP/1.0" 404 -* |
+----------------------------------------------------------------------------------------------------------------------------+
These are requests for which no content was returned to the client, so the server logged a dash ('-') instead of a byte count. Next we replace these missing values with 0.
cleaned_df = split_df.na.fill({'content_size': 0})  # na returns a DataFrameNaFunctions object, which provides many methods for handling null columns
exprs = []
for col_name in cleaned_df.columns:
    exprs.append(count_null(col_name))
cleaned_df.agg(*exprs).show()
As you can see, there are no longer any rows with null values in any column.
+----+---------+----+------+------------+
|host|timestamp|path|status|content_size|
+----+---------+----+------+------------+
| 0| 0| 0| 0| 0|
+----+---------+----+------+------------+
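As the comment above notes, DataFrameNaFunctions offers more than fill. A minimal sketch of another option (illustrative only, assuming the same split_df as above; it is not used in the rest of this analysis):
# Alternative: drop rows whose content_size is null instead of filling them
dropped_df = split_df.na.drop(subset=['content_size'])
print dropped_df.count()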
Data Transformation
The timestamp column in the DataFrame is not yet a real timestamp; it is in the Apache server time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]. Next we convert it to a standard timestamp.
from pyspark.sql.functions import udf

month_map = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
    'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}

def parse_clf_time(s):
    """Convert an Apache time string like '01/Aug/1995:00:00:01 -0400'
    into 'yyyy-MM-dd hh:mm:ss' (the timezone offset is ignored)."""
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
        int(s[7:11]),
        month_map[s[3:6]],
        int(s[0:2]),
        int(s[12:14]),
        int(s[15:17]),
        int(s[18:20])
    )

u_parse_time = udf(parse_clf_time)

logs_df = (cleaned_df
           .select('*', u_parse_time(split_df['timestamp'])
                        .cast('timestamp')   # cast to a timestamp type
                        .alias('time'))
           .drop('timestamp'))               # drop the original string column
total_log_entries = logs_df.count()
Let's look at the converted data:
+----------------------------+---------------------------------------------------+------+------------+---------------------+
|host |path |status|content_size|time |
+----------------------------+---------------------------------------------------+------+------------+---------------------+
|in24.inetnebr.com |/shuttle/missions/sts-68/news/sts-68-mcc-05.txt |200 |1839 |1995-08-01 00:00:01.0|
|uplherc.upl.com |/ |304 |0 |1995-08-01 00:00:07.0|
|uplherc.upl.com |/images/ksclogo-medium.gif |304 |0 |1995-08-01 00:00:08.0|
|uplherc.upl.com |/images/MOSAIC-logosmall.gif |304 |0 |1995-08-01 00:00:08.0|
|uplherc.upl.com |/images/USA-logosmall.gif |304 |0 |1995-08-01 00:00:08.0|
|ix-esc-ca2-07.ix.netcom.com |/images/launch-logo.gif |200 |1713 |1995-08-01 00:00:09.0|
|uplherc.upl.com |/images/WORLD-logosmall.gif |304 |0 |1995-08-01 00:00:10.0|
|slppp6.intermind.net |/history/skylab/skylab.html |200 |1687 |1995-08-01 00:00:10.0|
|piweba4y.prodigy.com |/images/launchmedium.gif |200 |11853 |1995-08-01 00:00:10.0|
|slppp6.intermind.net |/history/skylab/skylab-small.gif |200 |9202 |1995-08-01 00:00:11.0|
|slppp6.intermind.net |/images/ksclogosmall.gif |200 |3635 |1995-08-01 00:00:12.0|
|ix-esc-ca2-07.ix.netcom.com |/history/apollo/images/apollo-logo1.gif |200 |1173 |1995-08-01 00:00:12.0|
|slppp6.intermind.net |/history/apollo/images/apollo-logo.gif |200 |3047 |1995-08-01 00:00:13.0|
|uplherc.upl.com |/images/NASA-logosmall.gif |304 |0 |1995-08-01 00:00:14.0|
|133.43.96.45 |/shuttle/missions/sts-69/mission-sts-69.html |200 |10566 |1995-08-01 00:00:16.0|
|kgtyk4.kj.yamagata-u.ac.jp |/ |200 |7280 |1995-08-01 00:00:17.0|
|kgtyk4.kj.yamagata-u.ac.jp |/images/ksclogo-medium.gif |200 |5866 |1995-08-01 00:00:18.0|
|d0ucr6.fnal.gov |/history/apollo/apollo-16/apollo-16.html |200 |2743 |1995-08-01 00:00:19.0|
|ix-esc-ca2-07.ix.netcom.com |/shuttle/resources/orbiters/discovery.html |200 |6849 |1995-08-01 00:00:19.0|
|d0ucr6.fnal.gov |/history/apollo/apollo-16/apollo-16-patch-small.gif|200 |14897 |1995-08-01 00:00:20.0|
+----------------------------+---------------------------------------------------+------+------------+---------------------+
Note that the time column is now of timestamp type.
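We can verify this by printing the schema (a quick check, assuming the same logs_df as above; the exact nullability flags may vary):
logs_df.printSchema()
# Expected to be roughly:
# root
#  |-- host: string (nullable = true)
#  |-- path: string (nullable = true)
#  |-- status: integer (nullable = true)
#  |-- content_size: integer (nullable = true)
#  |-- time: timestamp (nullable = true)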
Statistics
Data Overview
First, let's look at summary statistics for the content_size column:
content_size_summary_df = logs_df.describe(['content_size'])
content_size_summary_df.show()
The output:
+-------+------------------+
|summary| content_size|
+-------+------------------+
| count| 1043177|
| mean|17531.555702435926|
| stddev| 68561.99906264187|
| min| 0|
| max| 3421948|
+-------+------------------+
We can also compute these statistics another way:
from pyspark.sql import functions as sqlFunctions
content_size_stats = (logs_df
                      .agg(sqlFunctions.min(logs_df['content_size']),
                           sqlFunctions.avg(logs_df['content_size']),
                           sqlFunctions.max(logs_df['content_size']))
                      .first())
print 'Using SQL functions:'
print 'Content Size Avg: {1:,.2f}; Min: {0:.2f}; Max: {2:,.0f}'.format(*content_size_stats)
The output:
Using SQL functions:
Content Size Avg: 17,531.56; Min: 0.00; Max: 3,421,948
HTTP Status Code Statistics
Now let's analyze the HTTP response statuses. We group the data by status and count, sort by status, and finally cache the result, since we will use this DataFrame several times later.
status_to_count_df = (logs_df
                      .groupBy('status')
                      .count()
                      .sort('status')
                      .cache())
status_to_count_length = status_to_count_df.count()
print 'Found %d response codes' % status_to_count_length
status_to_count_df.show()
The final output is shown below:
+------+------+
|status| count|
+------+------+
| 200|940847|
| 302| 16244|
| 304| 79824|
| 403| 58|
| 404| 6185|
| 500| 2|
| 501| 17|
+------+------+
Of course, we can also visualize the data.
Because status=200 accounts for such a large share, we take the logarithm of the count column to make the comparison easier to read.
log_status_to_count_df = status_to_count_df.withColumn('log(count)', sqlFunctions.log(status_to_count_df['count']))
display(log_status_to_count_df)
The result is shown in the chart below.
Next, let's draw the chart with Python's matplotlib:
# https://pypi.python.org/pypi/spark_notebook_helpers/1.0.1
# np is just an alias for numpy.
# cm and plt are aliases for matplotlib.cm (for "color map") and matplotlib.pyplot, respectively.
# prepareSubplot is a helper.
from spark_notebook_helpers import prepareSubplot, np, plt, cm
data = log_status_to_count_df.drop('count').collect()  # collect the data on the driver node (avoid collect() for large datasets)
x, y = zip(*data)                                       # the two series to plot
index = np.arange(len(x))                               # x-axis indices
bar_width = 0.7                                         # bar width
colorMap = 'Set2'                                       # color scheme
cmap = cm.get_cmap(colorMap)
fig, ax = prepareSubplot(np.arange(0, 6, 1), np.arange(0, 14, 2))  # axis ranges
plt.bar(index, y, width=bar_width, color=cmap(0))       # draw the bar chart
plt.xticks(index + bar_width/2.0, x)                    # add tick labels
display(fig)                                            # render
The rendered chart looks like this:
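If the spark_notebook_helpers module is not available, the same chart can be drawn with plain matplotlib. A minimal sketch (the figure size and color are assumptions, not from the original notebook; it reuses index, y, bar_width, and x from above):
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.bar(index, y, width=bar_width, color='steelblue')  # same data as the Databricks version
ax.set_xticks(index + bar_width / 2.0)
ax.set_xticklabels(x)
ax.set_xlabel('HTTP status')
ax.set_ylabel('log(count)')
fig.savefig('status_counts.png')  # or display(fig) on Databricks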
Client Access Frequency
Now let's find the hosts that accessed the server more than 10 times. The code is as follows:
host_sum_df = (logs_df
               .groupBy('host')
               .count())

host_more_than_10_df = (host_sum_df
                        .filter(host_sum_df['count'] > 10)
                        .select(host_sum_df['host']))

print 'Any 20 hosts that have accessed more than 10 times:\n'
host_more_than_10_df.show(truncate=False)
The output:
+-----------------------------+
|host |
+-----------------------------+
|gcl-s2.aero.kyushu-u.ac.jp |
|dd09-015.compuserve.com |
|sun8.hrz.th-darmstadt.de |
|128.159.144.47 |
|160.151.233.33 |
|128.159.132.13 |
|s025n217.ummed.edu |
|204.126.175.80 |
|n1043367.ksc.nasa.gov |
|128.159.140.124 |
|hermes.rz.uni-duesseldorf.de |
|csa.bu.edu |
|139.169.136.137 |
|knet.kntl.co.kr |
|pcmnbib03.uio.no |
|ppp2_100.bekkoame.or.jp |
|hp3.lsw.uni-heidelberg.de |
|163.205.105.9 |
|netcom9.netcom.com |
|dd24-025.compuserve.com |
+-----------------------------+
URI Hit Counts
Next, let's count the hits per server resource: first group by path, then count. The code:
paths_df = (logs_df
            .groupBy('path')
            .count()
            .sort('count', ascending=False))

paths_counts = (paths_df
                .select('path', 'count')
                .map(lambda r: (r[0], r[1]))
                .collect())

paths, counts = zip(*paths_counts)

colorMap = 'Accent'                                    # color scheme
cmap = cm.get_cmap(colorMap)
index = np.arange(1000)                                # x-axis indices
fig, ax = prepareSubplot(np.arange(0, 1000, 100), np.arange(0, 70000, 10000))  # axis ranges
plt.xlabel('Paths')
plt.ylabel('Number of Hits')
plt.plot(index, counts[:1000], color=cmap(0), linewidth=3)  # the x axis can be read as an index over the URIs
plt.axhline(linewidth=2, color='#999999')              # horizontal reference line along the x axis
display(fig)
The chart looks like this:
We can also inspect the data directly with
paths_df.show(truncate=False)
which prints:
+-----------------------------------------------+-----+
|path |count|
+-----------------------------------------------+-----+
|/images/NASA-logosmall.gif |59666|
|/images/KSC-logosmall.gif |50420|
|/images/MOSAIC-logosmall.gif |43831|
|/images/USA-logosmall.gif |43604|
|/images/WORLD-logosmall.gif |43217|
|/images/ksclogo-medium.gif |41267|
|/ksc.html |28536|
|/history/apollo/images/apollo-logo1.gif |26766|
|/images/launch-logo.gif |24742|
|/ |20173|
|/images/ksclogosmall.gif |18955|
|/shuttle/missions/sts-69/mission-sts-69.html |17383|
|/shuttle/missions/sts-69/sts-69-patch-small.gif|16155|
|/shuttle/missions/missions.html |15819|
|/shuttle/countdown/ |15740|
|/shuttle/missions/sts-69/count69.gif |15214|
|/images/launchmedium.gif |13930|
|/icons/menu.xbm |8704 |
|/icons/blank.xbm |8653 |
|/icons/image.xbm |7504 |
+-----------------------------------------------+-----+
Top Ten URIs with a Non-200 HTTP Status
First select all records with status <> 200, then group by path, count, sort, and display the result. The code:
from pyspark.sql.functions import desc
not200DF = logs_df.filter('status<>200')
logs_sum_df = (not200DF
               .groupBy('path')
               .count()
               .sort('count', ascending=False))
print 'Top Ten failed URLs:'
logs_sum_df.show(10, False)
The resulting DataFrame:
Top Ten failed URLs:
+---------------------------------------+-----+
|path |count|
+---------------------------------------+-----+
|/images/NASA-logosmall.gif |8761 |
|/images/KSC-logosmall.gif |7236 |
|/images/MOSAIC-logosmall.gif |5197 |
|/images/USA-logosmall.gif |5157 |
|/images/WORLD-logosmall.gif |5020 |
|/images/ksclogo-medium.gif |4728 |
|/history/apollo/images/apollo-logo1.gif|2907 |
|/images/launch-logo.gif |2811 |
|/ |2199 |
|/images/ksclogosmall.gif |1622 |
+---------------------------------------+-----+
Counting Distinct Hosts
Here we use dropDuplicates, which is typically more efficient than a groupBy for this purpose.
unique_host_count = logs_df.dropDuplicates(['host']).count()
print 'Unique hosts: {0}'.format(unique_host_count)
The number of distinct hosts is:
Unique hosts: 54507
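For comparison, the same number can be obtained with distinct or with a groupBy; a minimal sketch (both variants should return the same 54507 on this dataset):
# Equivalent ways to count distinct hosts
print logs_df.select('host').distinct().count()
print logs_df.groupBy('host').count().count()   # number of groups = number of distinct hosts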
Unique Hosts per Day
Now let's count the number of daily visitors (deduplicated by host). First select the host and time columns (converting the timestamp to a day of the month), then drop rows with the same host on the same day, and finally group by day and count.
from pyspark.sql.functions import dayofmonth
day_to_host_pair_df = logs_df.select(logs_df.host, dayofmonth(logs_df.time).alias('day'))
day_group_hosts_df = day_to_host_pair_df.distinct()
daily_hosts_df = day_group_hosts_df.groupBy('day').count()
daily_hosts_df.cache()
print 'Unique hosts per day:'
daily_hosts_df.show(30, False)
The result is shown below:
Unique hosts per day:
+---+-----+
|day|count|
+---+-----+
|1 |2582 |
|3 |3222 |
|4 |4190 |
|5 |2502 |
|6 |2537 |
|7 |4106 |
|8 |4406 |
|9 |4317 |
|10 |4523 |
|11 |4346 |
|12 |2864 |
|13 |2650 |
|14 |4454 |
|15 |4214 |
|16 |4340 |
|17 |4385 |
|18 |4168 |
|19 |2550 |
|20 |2560 |
|21 |4134 |
|22 |4456 |
+---+-----+
Of course, we can visualize this data too:
days_with_hosts = []
hosts = []
for day, count in daily_hosts_df.collect():
    days_with_hosts.append(day)
    hosts.append(count)

fig, ax = prepareSubplot(np.arange(0, 30, 5), np.arange(0, 5000, 1000))  # axis ranges
colorMap = 'Dark2'                                      # color scheme
cmap = cm.get_cmap(colorMap)
plt.plot(days_with_hosts, hosts, color=cmap(0), linewidth=3)  # plot the data
plt.axis([0, max(days_with_hosts), 0, max(hosts)+500])        # axis limits
plt.xlabel('Day')
plt.ylabel('Hosts')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)
The chart is shown below.
We can also use the built-in Databricks display function:
display(daily_hosts_df)
The result:
Average Daily Requests per Host
Now we want to know, for each day, the average number of requests made per host. First we compute the total number of requests per day and store the result in total_req_per_day_df. Since we already computed the number of unique hosts per day, we just join the two DataFrames on day and divide the total requests by the host count, which gives the average requests per host per day.
total_req_per_day_df = (logs_df
                        .select(logs_df.host, dayofmonth(logs_df.time).alias('day'))
                        .groupBy('day')
                        .count())

daily_hosts_df.registerTempTable("daily_hosts_df")
total_req_per_day_df.registerTempTable("total_req_per_day_df")

sql = 'SELECT t1.day AS day, t1.count / t2.count AS avg_reqs_per_host_per_day \
       FROM total_req_per_day_df t1, daily_hosts_df t2 \
       WHERE t1.day = t2.day'
avg_daily_req_per_host_df = sqlContext.sql(sql).cache()
print 'Average number of daily requests per Hosts is:\n'
avg_daily_req_per_host_df.show()
The result:
+---+-------------------------+
|day|avg_reqs_per_host_per_day|
+---+-------------------------+
| 1| 13.166537567776917|
| 3| 12.845127250155183|
| 4| 14.213365155131266|
| 5| 12.745003996802557|
| 6| 12.777296018919984|
| 7| 13.968582562104238|
| 8| 13.650022696323196|
| 9| 14.00440120454019|
| 10| 13.540791510059695|
| 11| 14.091578462954441|
| 12| 13.292597765363128|
| 13| 13.766037735849057|
| 14| 13.442523574315222|
| 15| 13.964167062173706|
| 16| 13.053225806451612|
| 17| 13.450399087799315|
| 18| 13.494241842610364|
| 19| 12.585098039215687|
| 20| 12.876171875|
| 21| 13.434687953555878|
+---+-------------------------+
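The same result can also be expressed with the DataFrame join API instead of SQL temp tables. A minimal sketch (an alternative formulation, using the DataFrames defined above):
# DataFrame-API equivalent of the SQL join above
avg_df = (total_req_per_day_df.alias('t1')
          .join(daily_hosts_df.alias('t2'), 'day')
          .selectExpr('day', 't1.count / t2.count AS avg_reqs_per_host_per_day'))
avg_df.show()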
Now let's visualize the data:
days_with_avg = []
avgs = []
for day, avg in avg_daily_req_per_host_df.collect():
    days_with_avg.append(day)
    avgs.append(avg)

fig, ax = prepareSubplot(np.arange(0, 20, 5), np.arange(0, 16, 2))  # axis ranges
colorMap = 'Set3'                                       # color scheme
cmap = cm.get_cmap(colorMap)
plt.plot(days_with_avg, avgs, color=cmap(0), linewidth=3)  # plot the data
plt.axis([0, max(days_with_avg), 0, max(avgs)+2])          # axis limits
plt.xlabel('Day')
plt.ylabel('Average')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)
The chart is shown below.
Again, we can also use the Databricks display function:
display(avg_daily_req_per_host_df)
The result:
Analyzing 404 Responses
First, let's see how many HTTP responses in the log are 404s. We use filter to select them.
not_found_df = logs_df.filter('status=404')
not_found_df.cache()
print 'Found {0} 404 URLs'.format(not_found_df.count())
The output:
Found 6185 404 URLs
Let's see which URIs returned HTTP 404, deduplicated by path.
not_found_paths_df = not_found_df.select('path')
unique_not_found_paths_df = not_found_paths_df.distinct()
print '404 URLS:\n'
unique_not_found_paths_df.show(n=40, truncate=False)
The output:
+--------------------------------------------------------------------+
|path |
+--------------------------------------------------------------------+
|/history/apollo/apollo-13/apollo-13.html. |
|/www/shuttle/countdown/liftoff.html |
|/shuttle/missions/sts-25/images/images.html |
|/shutttle/missions/sts-70/ |
|/history/discovery |
|/html/STS66.htm |
|/history/apollo/apollo13 |
|/statistics/1995/Jul/Jul95_request.gif |
|/news/sci.space.shuttle/archive/sci-space-shuttle-15-mar-1995-00.txt|
|/software.winvn.winvn.html |
|/images/KSCogosmall.gif |
|/astronaut-candidates/pub/FTPfiles/vomit-comet.postings |
|/shuttle/resources/orbiters/discovery.gif |
|:// |
|/manned |
|/128.159.104.89/tv/tv.html |
|/images/crawlerway.gif |
|/history/skylab/skylab-3.gif |
|/elv/SCOUT/elvpage.htm |
|/software/winvn/winvnhtml |
|/kscarea-small.gif |
|/shuttle/mission/sts-69/countdown.html |
|/history/apollo/a-001/movies/ |
|/statistics/Mar.wwwstats.html |
|/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif |
|/statistics/1995/Jul/Jul95_byte.gif |
|/ads/images/wsji_ad.gif |
|/kcs.html |
|/shuttle/technology/sts-newsref/sts-jsc.html' |
|/statistics/1995/Aug/Aug95_request.gif |
|/.nasa.gov/shuttle/ |
|/.../liftoff.html |
|/facilities/mil.htm/ |
|/ksk.html |
|/\\yahoo.com |
|/shuttle/missions/.html |
|/shuttle/missions/51-L/mission-51-l.html |
|/history/apollo/a-004/movies/ |
|/history/apollo/sa-1/docs/ |
|/magazine/p08aug/08pet11.gif |
+--------------------------------------------------------------------+
Next, let's find the 20 URIs that returned HTTP 404 most often. This is simple: group by path, count, and sort by count in descending order.
top_20_not_found_df = (not_found_paths_df
                       .groupBy('path')
                       .count()
                       .sort(desc('count')))
print 'Top Twenty 404 URLs:\n'
top_20_not_found_df.show(n=20, truncate=False)
The output is shown below:
Top Twenty 404 URLs:
+-----------------------------------------------------------------+-----+
|path |count|
+-----------------------------------------------------------------+-----+
|/pub/winvn/readme.txt |633 |
|/pub/winvn/release.txt |494 |
|/shuttle/missions/STS-69/mission-STS-69.html |430 |
|/images/nasa-logo.gif |319 |
|/elv/DELTA/uncons.htm |178 |
|/shuttle/missions/sts-68/ksc-upclose.gif |154 |
|/history/apollo/sa-1/sa-1-patch-small.gif |146 |
|/images/crawlerway-logo.gif |120 |
|/://spacelink.msfc.nasa.gov |117 |
|/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif|100 |
|/history/apollo/a-001/a-001-patch-small.gif |97 |
|/images/Nasa-logo.gif |85 |
| |76 |
|/shuttle/resources/orbiters/atlantis.gif |63 |
|/history/apollo/images/little-joe.jpg |62 |
|/images/lf-logo.gif |59 |
|/shuttle/resources/orbiters/discovery.gif |56 |
|/shuttle/resources/orbiters/challenger.gif |54 |
|/robots.txt |53 |
|/history/apollo/pad-abort-test-2/pad-abort-test-2-patch-small.gif|38 |
+-----------------------------------------------------------------+-----+
Now let's find the 25 hosts that received the most HTTP 404 responses. This is not hard either; the code is below:
hosts_404_count_df = (not_found_df
                      .select('host')
                      .groupBy('host')
                      .count()
                      .sort(desc('count')))
print 'Top 25 hosts that generated errors:\n'
hosts_404_count_df.show(n=25, truncate=False)
Let's look at the output:
+-----------------------------+-----+
|host                         |count|
+-----------------------------+-----+
|piweba3y.prodigy.com |39 |
|maz3.maz.net |39 |
|gate.barr.com |38 |
|nexus.mlckew.edu.au |37 |
|ts8-1.westwood.ts.ucla.edu |37 |
|m38-370-9.mit.edu |37 |
|204.62.245.32 |33 |
|163.206.104.34 |27 |
|spica.sci.isas.ac.jp |27 |
|www-d4.proxy.aol.com |26 |
|203.13.168.24 |25 |
|203.13.168.17 |25 |
|www-c4.proxy.aol.com |25 |
|internet-gw.watson.ibm.com |24 |
|scooter.pa-x.dec.com |23 |
|piweba5y.prodigy.com |23 |
|crl5.crl.com |23 |
|onramp2-9.onr.com |22 |
|slip145-189.ut.nl.ibm.net |22 |
|198.40.25.102.sap2.artic.edu |21 |
|msp1-16.nas.mr.net |20 |
|gn2.getnet.com |20 |
|tigger.nashscene.com |19 |
|dial055.mbnet.mb.ca |19 |
|isou24.vilspa.esa.es |19 |
+-----------------------------+-----+
We can also count the number of HTTP 404s per day:
errors_by_date_sorted_df = (not_found_df
                            .select(dayofmonth('time').alias('day'))
                            .groupBy('day')
                            .count()
                            .cache())
print '404 Errors by day:\n'
errors_by_date_sorted_df.show()
The output:
404 Errors by day:
+---+-----+
|day|count|
+---+-----+
| 1| 243|
| 3| 303|
| 4| 346|
| 5| 234|
| 6| 372|
| 7| 532|
| 8| 381|
| 9| 279|
| 10| 314|
| 11| 263|
| 12| 195|
| 13| 216|
| 14| 287|
| 15| 326|
| 16| 258|
| 17| 269|
| 18| 255|
| 19| 207|
| 20| 312|
| 21| 305|
+---+-----+
Now let's visualize this data:
days_with_errors_404 = []
errors_404_by_day = []
for day, count in errors_by_date_sorted_df.collect():
    days_with_errors_404.append(day)
    errors_404_by_day.append(count)
print days_with_errors_404
print errors_404_by_day
fig, ax = prepareSubplot(np.arange(0, 20, 5), np.arange(0, 600, 100))
colorMap = 'rainbow'  # color scheme
cmap = cm.get_cmap(colorMap)
plt.plot(days_with_errors_404, errors_404_by_day, color=cmap(0), linewidth=3)
plt.axis([0, max(days_with_errors_404), 0, max(errors_404_by_day)])
plt.xlabel('Day')
plt.ylabel('404 Errors')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)
The chart:
We can also view it another way:
display(errors_by_date_sorted_df)
The result:
Now let's find the five days with the most HTTP 404s:
top_err_date_df = errors_by_date_sorted_df.sort(desc('count'))
print 'Top Five Dates for 404 Requests:\n'
top_err_date_df.show(5)
We can see that days 7, 8, 6, 4, and 15 had the most HTTP 404s:
+---+-----+
|day|count|
+---+-----+
| 7| 532|
| 8| 381|
| 6| 372|
| 4| 346|
| 15| 326|
+---+-----+
Since we can count HTTP 404s by day, we can just as easily count them by hour. The code is below:
from pyspark.sql.functions import hour
hour_records_sorted_df = (not_found_df
                          .select(hour('time').alias('hour'))
                          .groupBy('hour')
                          .count()
                          .cache())
print 'Top hours for 404 requests:\n'
hour_records_sorted_df.show(24)
The output:
Top hours for 404 requests:
+----+-----+
|hour|count|
+----+-----+
| 0| 175|
| 1| 171|
| 2| 422|
| 3| 272|
| 4| 102|
| 5| 95|
| 6| 93|
| 7| 122|
| 8| 199|
| 9| 185|
| 10| 329|
| 11| 263|
| 12| 438|
| 13| 397|
| 14| 318|
| 15| 347|
| 16| 373|
| 17| 330|
| 18| 268|
| 19| 269|
| 20| 270|
| 21| 241|
| 22| 234|
| 23| 272|
+----+-----+
Now we visualize this data as well, following the same pattern as before:
hours_with_not_found = []
not_found_counts_per_hour = []
for hour, count in hour_records_sorted_df.collect():
    hours_with_not_found.append(hour)
    not_found_counts_per_hour.append(count)
fig, ax = prepareSubplot(np.arange(0, 25, 5), np.arange(0, 500, 50))
colorMap = 'seismic'
cmap = cm.get_cmap(colorMap)
plt.plot(hours_with_not_found, not_found_counts_per_hour, color=cmap(0), linewidth=3)
plt.axis([0, max(hours_with_not_found), 0, max(not_found_counts_per_hour)])
plt.xlabel('Hour')
plt.ylabel('404 Errors')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)
The line chart:
We can also display it another way:
display(hour_records_sorted_df)
The bar chart: