[Spark Case Study] Web Log Analysis

Data Preparation

Data download: NASA Kennedy Space Center web server logs

Let's take a look at the data first: load the log into a DataFrame and show the first 20 rows (the default).

import sys
import os
log_file_path = 'apache.access.log.PROJECT'
base_df = sqlContext.read.text(log_file_path)
base_df.show(truncate=False)

The DataFrame output looks like this:

+-------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------+
|in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839     |
|uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0                                                        |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0                               |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0                             |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0                                |
|ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:09 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713                   |
|uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0                              |
|slppp6.intermind.net - - [01/Aug/1995:00:00:10 -0400] "GET /history/skylab/skylab.html HTTP/1.0" 200 1687                      |
|piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853                        |
|slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET /history/skylab/skylab-small.gif HTTP/1.0" 200 9202                 |
|slppp6.intermind.net - - [01/Aug/1995:00:00:12 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635                         |
|ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:12 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173   |
|slppp6.intermind.net - - [01/Aug/1995:00:00:13 -0400] "GET /history/apollo/images/apollo-logo.gif HTTP/1.0" 200 3047           |
|uplherc.upl.com - - [01/Aug/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0                               |
|133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566            |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280                                          |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866                 |
|d0ucr6.fnal.gov - - [01/Aug/1995:00:00:19 -0400] "GET /history/apollo/apollo-16/apollo-16.html HTTP/1.0" 200 2743              |
|ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:19 -0400] "GET /shuttle/resources/orbiters/discovery.html HTTP/1.0" 200 6849|
|d0ucr6.fnal.gov - - [01/Aug/1995:00:00:20 -0400] "GET /history/apollo/apollo-16/apollo-16-patch-small.gif HTTP/1.0" 200 14897  |
+-------------------------------------------------------------------------------------------------------------------------------+

The fields are explained below:

Field        Description
remotehost   Remote hostname (or the IP address if DNS is unavailable)
rfc931       The remote logname of the user; not used in this analysis
authuser     The username as authenticated by the HTTP server
[date]       The date and time of the request from the client
"request"    The request line exactly as it came from the browser or client
status       The HTTP status code the server returned to the client
bytes        The number of bytes (Content-Length) transferred to the client

Reference: The Common Logfile Format

Data Parsing

First we parse the data. This is where regular expressions come in (XKCD: 1313):

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems --Jamie Zawinski

We will also use Spark's regexp_extract function, which matches a column against a regular expression containing one or more capture groups and extracts the text matched by one of those groups.
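To see how it behaves, here is a minimal, self-contained illustration; the demo DataFrame and the pattern below are made up for this example and are not part of the log analysis:

from pyspark.sql.functions import regexp_extract
demo_df = sqlContext.createDataFrame([('GET /index.html HTTP/1.0',)], ['value'])
demo_df.select(regexp_extract('value', r'^(\w+)\s+(\S+)', 2).alias('uri')).show()
# +-----------+
# |        uri|
# +-----------+
# |/index.html|
# +-----------+

Now let's apply the same idea to the real log: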

from pyspark.sql.functions import split, regexp_extract
split_df = base_df.select(regexp_extract('value', r'^([^\s]+)\s', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(truncate=False)

The parsed data looks like this:

+----------------------------+--------------------------+---------------------------------------------------+------+------------+
|host                        |timestamp                 |path                                               |status|content_size|
+----------------------------+--------------------------+---------------------------------------------------+------+------------+
|in24.inetnebr.com           |01/Aug/1995:00:00:01 -0400|/shuttle/missions/sts-68/news/sts-68-mcc-05.txt    |200   |1839        |
|uplherc.upl.com             |01/Aug/1995:00:00:07 -0400|/                                                  |304   |0           |
|uplherc.upl.com             |01/Aug/1995:00:00:08 -0400|/images/ksclogo-medium.gif                         |304   |0           |
|uplherc.upl.com             |01/Aug/1995:00:00:08 -0400|/images/MOSAIC-logosmall.gif                       |304   |0           |
|uplherc.upl.com             |01/Aug/1995:00:00:08 -0400|/images/USA-logosmall.gif                          |304   |0           |
|ix-esc-ca2-07.ix.netcom.com |01/Aug/1995:00:00:09 -0400|/images/launch-logo.gif                            |200   |1713        |
|uplherc.upl.com             |01/Aug/1995:00:00:10 -0400|/images/WORLD-logosmall.gif                        |304   |0           |
|slppp6.intermind.net        |01/Aug/1995:00:00:10 -0400|/history/skylab/skylab.html                        |200   |1687        |
|piweba4y.prodigy.com        |01/Aug/1995:00:00:10 -0400|/images/launchmedium.gif                           |200   |11853       |
|slppp6.intermind.net        |01/Aug/1995:00:00:11 -0400|/history/skylab/skylab-small.gif                   |200   |9202        |
|slppp6.intermind.net        |01/Aug/1995:00:00:12 -0400|/images/ksclogosmall.gif                           |200   |3635        |
|ix-esc-ca2-07.ix.netcom.com |01/Aug/1995:00:00:12 -0400|/history/apollo/images/apollo-logo1.gif            |200   |1173        |
|slppp6.intermind.net        |01/Aug/1995:00:00:13 -0400|/history/apollo/images/apollo-logo.gif             |200   |3047        |
|uplherc.upl.com             |01/Aug/1995:00:00:14 -0400|/images/NASA-logosmall.gif                         |304   |0           |
|133.43.96.45                |01/Aug/1995:00:00:16 -0400|/shuttle/missions/sts-69/mission-sts-69.html       |200   |10566       |
|kgtyk4.kj.yamagata-u.ac.jp  |01/Aug/1995:00:00:17 -0400|/                                                  |200   |7280        |
|kgtyk4.kj.yamagata-u.ac.jp  |01/Aug/1995:00:00:18 -0400|/images/ksclogo-medium.gif                         |200   |5866        |
|d0ucr6.fnal.gov             |01/Aug/1995:00:00:19 -0400|/history/apollo/apollo-16/apollo-16.html           |200   |2743        |
|ix-esc-ca2-07.ix.netcom.com |01/Aug/1995:00:00:19 -0400|/shuttle/resources/orbiters/discovery.html         |200   |6849        |
|d0ucr6.fnal.gov             |01/Aug/1995:00:00:20 -0400|/history/apollo/apollo-16/apollo-16-patch-small.gif|200   |14897       |
+----------------------------+--------------------------+---------------------------------------------------+------+------------+

Reference: RegexOne

Data Cleaning

First, let's see how many blank lines the raw log contains.

base_df.filter(base_df['value'].isNull()).count()

The output is 0. Great, no blank lines:

0
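Note that sqlContext.read.text reads blank lines as empty strings rather than nulls, so a stricter check (a quick sketch, not part of the original lab) also looks for empty or whitespace-only lines:

from pyspark.sql.functions import trim
base_df.filter(trim(base_df['value']) == '').count()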

Next, let's count how many rows contain at least one null value.

bad_rows_df = split_df.filter(split_df['host'].isNull() |
                              split_df['timestamp'].isNull() |
                              split_df['path'].isNull() |
                              split_df['status'].isNull() |
                              split_df['content_size'].isNull())
bad_rows_df.count()

The output is:

8756

Next, let's count how many null values each column contains:

from pyspark.sql.functions import col, sum

def count_null(col_name):
  return sum(col(col_name).isNull().cast('integer')).alias(col_name)

exprs = []
for col_name in split_df.columns:
  exprs.append(count_null(col_name))
split_df.agg(*exprs).show() # *exprs unpacks the list into positional arguments

The result is:

+----+---------+----+------+------------+
|host|timestamp|path|status|content_size|
+----+---------+----+------+------------+
|   0|        0|   0|     0|        8756|
+----+---------+----+------+------------+

It looks like the last column has quite a few nulls. Let's count how many log lines do not end in one or more digits.

bad_content_size_df = base_df.filter(~ base_df['value'].rlike(r'\d+$'))
bad_content_size_df.count()

As you can see, this matches the earlier count exactly:

8756

Let's take a look at this subset. Because the bad field might be one or more spaces, we append a '*' to each line so that any trailing whitespace is visible.

from pyspark.sql.functions import lit, concat
bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(truncate=False)

The output is:

+----------------------------------------------------------------------------------------------------------------------------+
|concat(value,*)                                                                                                             |
+----------------------------------------------------------------------------------------------------------------------------+
|gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -*                            |
|js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -*|
|tia1.eskimo.com - - [01/Aug/1995:00:28:41 -0400] "GET /pub/winvn/release.txt HTTP/1.0" 404 -*                               |
|itws.info.eng.niigata-u.ac.jp - - [01/Aug/1995:00:38:01 -0400] "GET /ksc.html/facts/about_ksc.html HTTP/1.0" 403 -*         |
|grimnet23.idirect.com - - [01/Aug/1995:00:50:12 -0400] "GET /www/software/winvn/winvn.html HTTP/1.0" 404 -*                 |
|miriworld.its.unimelb.edu.au - - [01/Aug/1995:01:04:54 -0400] "GET /history/history.htm HTTP/1.0" 404 -*                    |
|ras38.srv.net - - [01/Aug/1995:01:05:14 -0400] "GET /elv/DELTA/uncons.htm HTTP/1.0" 404 -*                                  |
|cs1-06.leh.ptd.net - - [01/Aug/1995:01:17:38 -0400] "GET /sts-71/launch/" 404 -*                                            |
|www-b2.proxy.aol.com - - [01/Aug/1995:01:22:07 -0400] "GET /shuttle/countdown HTTP/1.0" 302 -*                              |
|maui56.maui.net - - [01/Aug/1995:01:31:56 -0400] "GET /shuttle HTTP/1.0" 302 -*                                             |
|dialip-24.athenet.net - - [01/Aug/1995:01:33:02 -0400] "GET /history/apollo/apollo-13.html HTTP/1.0" 404 -*                 |
|h96-158.ccnet.com - - [01/Aug/1995:01:35:50 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -*        |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:23 -0400] "GET /history/apollo/a-001/movies/ HTTP/1.0" 404 -*                      |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:30 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -*        |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:38 -0400] "GET /history/apollo/a-001/movies/ HTTP/1.0" 404 -*                      |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:42 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -*        |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:44 -0400] "GET /history/apollo/a-001/images/ HTTP/1.0" 404 -*                      |
|h96-158.ccnet.com - - [01/Aug/1995:01:36:47 -0400] "GET /history/apollo/a-001/a-001-patch-small.gif HTTP/1.0" 404 -*        |
|h96-158.ccnet.com - - [01/Aug/1995:01:37:04 -0400] "GET /history/apollo/a-004/a-004-patch-small.gif HTTP/1.0" 404 -*        |
|h96-158.ccnet.com - - [01/Aug/1995:01:37:05 -0400] "GET /history/apollo/a-004/movies/ HTTP/1.0" 404 -*                      |
+----------------------------------------------------------------------------------------------------------------------------+

These lines end with a dash ('-'): the server sent no content back to the client, so it wrote a dash instead of a byte count. Next we replace these null values with 0.

cleaned_df = split_df.na.fill({'content_size': 0}) # na returns a DataFrameNaFunctions object, which provides many methods for handling null columns
exprs = []
for col_name in cleaned_df.columns:
  exprs.append(count_null(col_name))
cleaned_df.agg(*exprs).show()

As you can see, no column contains null values any more.

+----+---------+----+------+------------+
|host|timestamp|path|status|content_size|
+----+---------+----+------+------------+
|   0|        0|   0|     0|           0|
+----+---------+----+------+------------+

Data Transformation

The timestamp column in the DataFrame is not yet a real timestamp; it is still in the Apache log time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]. Next we convert it to a standard timestamp.

from pyspark.sql.functions import udf

month_map = {
  'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
  'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}

def parse_clf_time(s):
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
      int(s[7:11]),
      month_map[s[3:6]],
      int(s[0:2]),
      int(s[12:14]),
      int(s[15:17]),
      int(s[18:20])
    )

u_parse_time = udf(parse_clf_time)
logs_df = (cleaned_df
           .select('*', u_parse_time(cleaned_df['timestamp'])
           .cast('timestamp')  # cast to timestamp
           .alias('time'))
           .drop('timestamp')) # drop the original column
total_log_entries = logs_df.count()

Let's look at the transformed data:

+----------------------------+---------------------------------------------------+------+------------+---------------------+
|host                        |path                                               |status|content_size|time                 |
+----------------------------+---------------------------------------------------+------+------------+---------------------+
|in24.inetnebr.com           |/shuttle/missions/sts-68/news/sts-68-mcc-05.txt    |200   |1839        |1995-08-01 00:00:01.0|
|uplherc.upl.com             |/                                                  |304   |0           |1995-08-01 00:00:07.0|
|uplherc.upl.com             |/images/ksclogo-medium.gif                         |304   |0           |1995-08-01 00:00:08.0|
|uplherc.upl.com             |/images/MOSAIC-logosmall.gif                       |304   |0           |1995-08-01 00:00:08.0|
|uplherc.upl.com             |/images/USA-logosmall.gif                          |304   |0           |1995-08-01 00:00:08.0|
|ix-esc-ca2-07.ix.netcom.com |/images/launch-logo.gif                            |200   |1713        |1995-08-01 00:00:09.0|
|uplherc.upl.com             |/images/WORLD-logosmall.gif                        |304   |0           |1995-08-01 00:00:10.0|
|slppp6.intermind.net        |/history/skylab/skylab.html                        |200   |1687        |1995-08-01 00:00:10.0|
|piweba4y.prodigy.com        |/images/launchmedium.gif                           |200   |11853       |1995-08-01 00:00:10.0|
|slppp6.intermind.net        |/history/skylab/skylab-small.gif                   |200   |9202        |1995-08-01 00:00:11.0|
|slppp6.intermind.net        |/images/ksclogosmall.gif                           |200   |3635        |1995-08-01 00:00:12.0|
|ix-esc-ca2-07.ix.netcom.com |/history/apollo/images/apollo-logo1.gif            |200   |1173        |1995-08-01 00:00:12.0|
|slppp6.intermind.net        |/history/apollo/images/apollo-logo.gif             |200   |3047        |1995-08-01 00:00:13.0|
|uplherc.upl.com             |/images/NASA-logosmall.gif                         |304   |0           |1995-08-01 00:00:14.0|
|133.43.96.45                |/shuttle/missions/sts-69/mission-sts-69.html       |200   |10566       |1995-08-01 00:00:16.0|
|kgtyk4.kj.yamagata-u.ac.jp  |/                                                  |200   |7280        |1995-08-01 00:00:17.0|
|kgtyk4.kj.yamagata-u.ac.jp  |/images/ksclogo-medium.gif                         |200   |5866        |1995-08-01 00:00:18.0|
|d0ucr6.fnal.gov             |/history/apollo/apollo-16/apollo-16.html           |200   |2743        |1995-08-01 00:00:19.0|
|ix-esc-ca2-07.ix.netcom.com |/shuttle/resources/orbiters/discovery.html         |200   |6849        |1995-08-01 00:00:19.0|
|d0ucr6.fnal.gov             |/history/apollo/apollo-16/apollo-16-patch-small.gif|200   |14897       |1995-08-01 00:00:20.0|
+----------------------------+---------------------------------------------------+------+------------+---------------------+

Note that the time column is now of type timestamp.
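As an aside, on Spark 2.2 and later the same conversion can be done without a Python UDF by using to_timestamp with a format string. This is only a sketch of an alternative, not what the original notebook does; note that to_timestamp interprets the -0400 offset, so the resulting values may be shifted relative to the UDF above depending on the session time zone:

from pyspark.sql.functions import to_timestamp
alt_logs_df = (cleaned_df
               .withColumn('time', to_timestamp('timestamp', 'dd/MMM/yyyy:HH:mm:ss Z'))
               .drop('timestamp'))
alt_logs_df.printSchema()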

Data Statistics

Data Overview

Let's start with summary statistics for the content_size column:

content_size_summary_df = logs_df.describe(['content_size'])
content_size_summary_df.show()

The output is:

+-------+------------------+
|summary|      content_size|
+-------+------------------+
|  count|           1043177|
|   mean|17531.555702435926|
| stddev| 68561.99906264187|
|    min|                 0|
|    max|           3421948|
+-------+------------------+

We can also compute these statistics in another way:

from pyspark.sql import functions as sqlFunctions
content_size_stats =  (logs_df
                       .agg(sqlFunctions.min(logs_df['content_size']),
                            sqlFunctions.avg(logs_df['content_size']),
                            sqlFunctions.max(logs_df['content_size']))
                       .first())

print 'Using SQL functions:'
print 'Content Size Avg: {1:,.2f}; Min: {0:.2f}; Max: {2:,.0f}'.format(*content_size_stats)

The output is:

Using SQL functions:
Content Size Avg: 17,531.56; Min: 0.00; Max: 3,421,948
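If you are on Spark 2.0 or later, you can also estimate percentiles with approxQuantile (a sketch, not part of the original output):

quantiles = logs_df.approxQuantile('content_size', [0.5, 0.95, 0.99], 0.01)
print 'Approximate 50th/95th/99th percentile content_size: {0}'.format(quantiles)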

HTTP Response Status Analysis

Now let's analyze the HTTP response status codes: group the data by status, count, sort by status, and cache the result, since we will reuse this DataFrame several times.

status_to_count_df =(logs_df
                     .groupBy('status')
                     .count()
                     .sort('status')
                     .cache())

status_to_count_length = status_to_count_df.count()
print 'Found %d response codes' % status_to_count_length
status_to_count_df.show()

The output is:

+------+------+
|status| count|
+------+------+
|   200|940847|
|   302| 16244|
|   304| 79824|
|   403|    58|
|   404|  6185|
|   500|     2|
|   501|    17|
+------+------+
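Before plotting, it can also be useful to see each status code as a share of all requests. A small sketch using the total_log_entries count computed earlier:

from pyspark.sql.functions import col
status_to_count_df.withColumn('percent', 100.0 * col('count') / total_log_entries).show()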

We can of course visualize this data:
(figure: bar chart of request counts by HTTP status code)
Because status 200 accounts for the vast majority of requests, we take the logarithm of the count column to make the comparison clearer.

log_status_to_count_df = status_to_count_df.withColumn('log(count)', sqlFunctions.log(status_to_count_df['count']))
display(log_status_to_count_df)

The result is shown in the figure below:
(figure: bar chart of log(count) by HTTP status code)
Next, let's draw the chart ourselves with Python's matplotlib.

# https://pypi.python.org/pypi/spark_notebook_helpers/1.0.1
# np is just an alias for numpy.
# cm and plt are aliases for matplotlib.cm (for "color map") and matplotlib.pyplot, respectively.
# prepareSubplot is a helper.
from spark_notebook_helpers import prepareSubplot, np, plt, cm

data = log_status_to_count_df.drop('count').collect() # bring the data to the driver node (avoid collect() on large datasets)
x, y = zip(*data) # the two series to plot
index = np.arange(len(x)) # x-axis indices
bar_width = 0.7 # bar width
colorMap = 'Set2' # color map name
cmap = cm.get_cmap(colorMap)

fig, ax = prepareSubplot(np.arange(0, 6, 1), np.arange(0, 14, 2)) # axis ranges
plt.bar(index, y, width=bar_width, color=cmap(0)) # draw the bar chart
plt.xticks(index + bar_width/2.0, x) # add the tick labels
display(fig) # render the figure

The resulting chart:
(figure: matplotlib bar chart of log(count) by status code)

Client Access Frequency

Now let's find the hosts that have accessed the server more than 10 times. The code is as follows:

host_sum_df =(logs_df
              .groupBy('host')
              .count())

host_more_than_10_df = (host_sum_df
                        .filter(host_sum_df['count'] > 10)
                        .select(host_sum_df['host']))

print 'Any 20 hosts that have accessed more than 10 times:\n'
host_more_than_10_df.show(truncate=False)

The output is:

+-----------------------------+
|host                         |
+-----------------------------+
|gcl-s2.aero.kyushu-u.ac.jp   |
|dd09-015.compuserve.com      |
|sun8.hrz.th-darmstadt.de     |
|128.159.144.47               |
|160.151.233.33               |
|128.159.132.13               |
|s025n217.ummed.edu           |
|204.126.175.80               |
|n1043367.ksc.nasa.gov        |
|128.159.140.124              |
|hermes.rz.uni-duesseldorf.de |
|csa.bu.edu                   |
|139.169.136.137              |
|knet.kntl.co.kr              |
|pcmnbib03.uio.no             |
|ppp2_100.bekkoame.or.jp      |
|hp3.lsw.uni-heidelberg.de    |
|163.205.105.9                |
|netcom9.netcom.com           |
|dd24-025.compuserve.com      |
+-----------------------------+
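If you want the busiest hosts rather than an arbitrary 20, you can sort by the count column instead (a sketch, not part of the original output):

host_sum_df.sort('count', ascending=False).show(20, truncate=False)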

URI Hit Counts

Next we count how many hits each server resource received: group by path, then count. The code is as follows:

paths_df = (logs_df
            .groupBy('path')
            .count()
            .sort('count', ascending=False))

paths_counts = (paths_df
                .select('path', 'count')
                .map(lambda r: (r[0], r[1])) # on Spark 2.x use paths_df.rdd.map(...) instead
                .collect())

paths, counts = zip(*paths_counts)

colorMap = 'Accent' # color map name
cmap = cm.get_cmap(colorMap)
index = np.arange(1000) # x-axis indices

fig, ax = prepareSubplot(np.arange(0, 1000, 100), np.arange(0, 70000, 10000)) # axis ranges
plt.xlabel('Paths')
plt.ylabel('Number of Hits')
plt.plot(index, counts[:1000], color=cmap(0), linewidth=3) # the x-axis can be read as the index of each URI
plt.axhline(linewidth=2, color='#999999') # draw a horizontal reference line
display(fig)

The chart looks like this:
(figure: line chart of hit counts for the top 1,000 paths)
We can also inspect the data directly:

paths_df.show(truncate=False)

The output is:

+-----------------------------------------------+-----+
|path                                           |count|
+-----------------------------------------------+-----+
|/images/NASA-logosmall.gif                     |59666|
|/images/KSC-logosmall.gif                      |50420|
|/images/MOSAIC-logosmall.gif                   |43831|
|/images/USA-logosmall.gif                      |43604|
|/images/WORLD-logosmall.gif                    |43217|
|/images/ksclogo-medium.gif                     |41267|
|/ksc.html                                      |28536|
|/history/apollo/images/apollo-logo1.gif        |26766|
|/images/launch-logo.gif                        |24742|
|/                                              |20173|
|/images/ksclogosmall.gif                       |18955|
|/shuttle/missions/sts-69/mission-sts-69.html   |17383|
|/shuttle/missions/sts-69/sts-69-patch-small.gif|16155|
|/shuttle/missions/missions.html                |15819|
|/shuttle/countdown/                            |15740|
|/shuttle/missions/sts-69/count69.gif           |15214|
|/images/launchmedium.gif                       |13930|
|/icons/menu.xbm                                |8704 |
|/icons/blank.xbm                               |8653 |
|/icons/image.xbm                               |7504 |
+-----------------------------------------------+-----+

Top Ten URIs with Non-200 Status Codes

First filter for all records where status <> 200, then group by path, count, sort descending, and show the result. The code is as follows:

from pyspark.sql.functions import desc
not200DF = logs_df.filter('status<>200')
logs_sum_df = (not200DF
               .groupBy('path')
               .count()
               .sort('count', ascending=False))

print 'Top Ten failed URLs:'
logs_sum_df.show(10, False)

The resulting DataFrame:

Top Ten failed URLs:
+---------------------------------------+-----+
|path                                   |count|
+---------------------------------------+-----+
|/images/NASA-logosmall.gif             |8761 |
|/images/KSC-logosmall.gif              |7236 |
|/images/MOSAIC-logosmall.gif           |5197 |
|/images/USA-logosmall.gif              |5157 |
|/images/WORLD-logosmall.gif            |5020 |
|/images/ksclogo-medium.gif             |4728 |
|/history/apollo/images/apollo-logo1.gif|2907 |
|/images/launch-logo.gif                |2811 |
|/                                      |2199 |
|/images/ksclogosmall.gif               |1622 |
+---------------------------------------+-----+
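Since we already have the total hit count per path in paths_df, we can also estimate a failure rate per path by joining the two DataFrames. This is a sketch that goes beyond the original lab; failure_rate_df is a made-up name:

from pyspark.sql.functions import col
failure_rate_df = (logs_sum_df.withColumnRenamed('count', 'failed')
                   .join(paths_df.withColumnRenamed('count', 'total'), 'path')
                   .select('path', 'failed', 'total',
                           (100.0 * col('failed') / col('total')).alias('pct_failed'))
                   .sort('failed', ascending=False))
failure_rate_df.show(10, False)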

Counting Unique Hosts

Here we use dropDuplicates, which is more efficient than groupBy for this purpose.

unique_host_count = logs_df.dropDuplicates(['host']).count()
print 'Unique hosts: {0}'.format(unique_host_count)

The number of unique hosts is:

Unique hosts: 54507
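An equivalent one-liner uses countDistinct from pyspark.sql.functions (a sketch):

from pyspark.sql.functions import countDistinct
logs_df.agg(countDistinct('host').alias('unique_hosts')).show()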

Unique Hosts per Day

Now let's count the number of unique visitors (hosts) per day. First select the host and time columns (converting the timestamp to the day of the month), then drop duplicate (day, host) pairs, and finally group by day and count.

from pyspark.sql.functions import dayofmonth

day_to_host_pair_df = logs_df.select(logs_df.host, dayofmonth(logs_df.time).alias('day'))
day_group_hosts_df = day_to_host_pair_df.distinct()
daily_hosts_df = day_group_hosts_df.groupBy('day').count().sort('day')
daily_hosts_df.cache()

print 'Unique hosts per day:'
daily_hosts_df.show(30, False)

The result is:

Unique hosts per day:
+---+-----+
|day|count|
+---+-----+
|1  |2582 |
|3  |3222 |
|4  |4190 |
|5  |2502 |
|6  |2537 |
|7  |4106 |
|8  |4406 |
|9  |4317 |
|10 |4523 |
|11 |4346 |
|12 |2864 |
|13 |2650 |
|14 |4454 |
|15 |4214 |
|16 |4340 |
|17 |4385 |
|18 |4168 |
|19 |2550 |
|20 |2560 |
|21 |4134 |
|22 |4456 |
+---+-----+

Of course we can visualize this data:

days_with_hosts = []
hosts = []
for day, count in daily_hosts_df.collect():
  days_with_hosts.append(day)
  hosts.append(count)

fig, ax = prepareSubplot(np.arange(0, 30, 5), np.arange(0, 5000, 1000)) # axis ranges
colorMap = 'Dark2' # color map name
cmap = cm.get_cmap(colorMap)
plt.plot(days_with_hosts, hosts, color=cmap(0), linewidth=3) # plot the data
plt.axis([0, max(days_with_hosts), 0, max(hosts)+500]) # set the axis limits
plt.xlabel('Day')
plt.ylabel('Hosts')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig) 

The chart looks like this:
(figure: line chart of unique hosts per day)
We can also use Databricks' built-in display function:

display(daily_hosts_df)

The result:
(figure: Databricks chart of daily_hosts_df)

Average Daily Requests per Host

Now we want the average number of requests each host made per day. First we compute the total number of requests per day and store it in total_req_per_day_df. Since we already counted the unique hosts per day, we simply join the two DataFrames on day and divide the total request count by the host count to get the average daily requests per host.

total_req_per_day_df = (
  logs_df.select(logs_df.host, dayofmonth(logs_df.time).alias('day'))
         .groupBy('day')
         .count()
)

daily_hosts_df.registerTempTable("daily_hosts_df")
total_req_per_day_df.registerTempTable("total_req_per_day_df")
sql = 'SELECT t1.day AS day, t1.count / t2.count AS avg_reqs_per_host_per_day \
       FROM total_req_per_day_df t1, daily_hosts_df t2                        \
       WHERE t1.day=t2.day'
avg_daily_req_per_host_df = sqlContext.sql(sql).cache()

print 'Average number of daily requests per Hosts is:\n'
avg_daily_req_per_host_df.show()

The result is:

+---+-------------------------+
|day|avg_reqs_per_host_per_day|
+---+-------------------------+
|  1|       13.166537567776917|
|  3|       12.845127250155183|
|  4|       14.213365155131266|
|  5|       12.745003996802557|
|  6|       12.777296018919984|
|  7|       13.968582562104238|
|  8|       13.650022696323196|
|  9|        14.00440120454019|
| 10|       13.540791510059695|
| 11|       14.091578462954441|
| 12|       13.292597765363128|
| 13|       13.766037735849057|
| 14|       13.442523574315222|
| 15|       13.964167062173706|
| 16|       13.053225806451612|
| 17|       13.450399087799315|
| 18|       13.494241842610364|
| 19|       12.585098039215687|
| 20|             12.876171875|
| 21|       13.434687953555878|
+---+-------------------------+
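The same join can also be expressed with the DataFrame API instead of SQL. A sketch (avg_daily_df is a made-up name):

from pyspark.sql.functions import col
avg_daily_df = (total_req_per_day_df.withColumnRenamed('count', 'total')
                .join(daily_hosts_df.withColumnRenamed('count', 'hosts'), 'day')
                .select('day', (col('total') / col('hosts')).alias('avg_reqs_per_host_per_day')))
avg_daily_df.show()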

Now let's visualize the data:

days_with_avg = []
avgs = []

for day, avg in avg_daily_req_per_host_df.collect():
  days_with_avg.append(day)
  avgs.append(avg)

fig, ax = prepareSubplot(np.arange(0, 20, 5), np.arange(0, 16, 2)) # axis ranges
colorMap = 'Set3' # color map name
cmap = cm.get_cmap(colorMap)
plt.plot(days_with_avg, avgs, color=cmap(0), linewidth=3) # plot the data
plt.axis([0, max(days_with_avg), 0, max(avgs)+2]) # set the axis limits
plt.xlabel('Day')
plt.ylabel('Average')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)

The chart looks like this:
(figure: line chart of average daily requests per host)
Again, we can use Databricks' display function to visualize the data:

display(avg_daily_req_per_host_df)

As shown below:
(figure: Databricks chart of avg_daily_req_per_host_df)

Analyzing 404 Responses

Let's first see how many responses in the log have a 404 status. We use the filter function to select them.

not_found_df = logs_df.filter('status=404')
not_found_df.cache()
print 'Found {0} 404 URLs'.format(not_found_df.count())

The output is:

Found 6185 404 URLs

Let's see which URIs returned HTTP 404. We deduplicate by path here.

not_found_paths_df = not_found_df.select('path')
unique_not_found_paths_df = not_found_paths_df.distinct()

print '404 URLS:\n'
unique_not_found_paths_df.show(n=40, truncate=False)

The output is:

+--------------------------------------------------------------------+
|path                                                                |
+--------------------------------------------------------------------+
|/history/apollo/apollo-13/apollo-13.html.                           |
|/www/shuttle/countdown/liftoff.html                                 |
|/shuttle/missions/sts-25/images/images.html                         |
|/shutttle/missions/sts-70/                                          |
|/history/discovery                                                  |
|/html/STS66.htm                                                     |
|/history/apollo/apollo13                                            |
|/statistics/1995/Jul/Jul95_request.gif                              |
|/news/sci.space.shuttle/archive/sci-space-shuttle-15-mar-1995-00.txt|
|/software.winvn.winvn.html                                          |
|/images/KSCogosmall.gif                                             |
|/astronaut-candidates/pub/FTPfiles/vomit-comet.postings             |
|/shuttle/resources/orbiters/discovery.gif                           |
|://                                                                 |
|/manned                                                             |
|/128.159.104.89/tv/tv.html                                          |
|/images/crawlerway.gif                                              |
|/history/skylab/skylab-3.gif                                        |
|/elv/SCOUT/elvpage.htm                                              |
|/software/winvn/winvnhtml                                           |
|/kscarea-small.gif                                                  |
|/shuttle/mission/sts-69/countdown.html                              |
|/history/apollo/a-001/movies/                                       |
|/statistics/Mar.wwwstats.html                                       |
|/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif   |
|/statistics/1995/Jul/Jul95_byte.gif                                 |
|/ads/images/wsji_ad.gif                                             |
|/kcs.html                                                           |
|/shuttle/technology/sts-newsref/sts-jsc.html'                       |
|/statistics/1995/Aug/Aug95_request.gif                              |
|/.nasa.gov/shuttle/                                                 |
|/.../liftoff.html                                                   |
|/facilities/mil.htm/                                                |
|/ksk.html                                                           |
|/\\yahoo.com                                                        |
|/shuttle/missions/.html                                             |
|/shuttle/missions/51-L/mission-51-l.html                            |
|/history/apollo/a-004/movies/                                       |
|/history/apollo/sa-1/docs/                                          |
|/magazine/p08aug/08pet11.gif                                        |
+--------------------------------------------------------------------+
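As a quick sanity check (not shown in the original output), we can count how many distinct paths produced a 404:

print 'Distinct 404 paths: {0}'.format(unique_not_found_paths_df.count())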

Next, let's find the 20 URIs that returned HTTP 404 most often. This is simple: group by path, count, then sort by count in descending order.

top_20_not_found_df = (not_found_paths_df
                       .groupBy('path')
                       .count()
                       .sort(desc('count')))

print 'Top Twenty 404 URLs:\n'
top_20_not_found_df.show(n=20, truncate=False)

The output is:

Top Twenty 404 URLs:

+-----------------------------------------------------------------+-----+
|path                                                             |count|
+-----------------------------------------------------------------+-----+
|/pub/winvn/readme.txt                                            |633  |
|/pub/winvn/release.txt                                           |494  |
|/shuttle/missions/STS-69/mission-STS-69.html                     |430  |
|/images/nasa-logo.gif                                            |319  |
|/elv/DELTA/uncons.htm                                            |178  |
|/shuttle/missions/sts-68/ksc-upclose.gif                         |154  |
|/history/apollo/sa-1/sa-1-patch-small.gif                        |146  |
|/images/crawlerway-logo.gif                                      |120  |
|/://spacelink.msfc.nasa.gov                                      |117  |
|/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif|100  |
|/history/apollo/a-001/a-001-patch-small.gif                      |97   |
|/images/Nasa-logo.gif                                            |85   |
|                                                                 |76   |
|/shuttle/resources/orbiters/atlantis.gif                         |63   |
|/history/apollo/images/little-joe.jpg                            |62   |
|/images/lf-logo.gif                                              |59   |
|/shuttle/resources/orbiters/discovery.gif                        |56   |
|/shuttle/resources/orbiters/challenger.gif                       |54   |
|/robots.txt                                                      |53   |
|/history/apollo/pad-abort-test-2/pad-abort-test-2-patch-small.gif|38   |
+-----------------------------------------------------------------+-----+

Now let's find the top 25 hosts that received the most HTTP 404 responses. This is also straightforward; the code is as follows:

hosts_404_count_df = (not_found_df
                      .select('host')
                      .groupBy('host')
                      .count()
                      .sort(desc('count')))

print 'Top 25 hosts that generated errors:\n'
hosts_404_count_df.show(n=25, truncate=False)

Let's look at the output:

+-----------------------------+-----+
|host                         |count|
+-----------------------------+-----+
|piweba3y.prodigy.com         |39   |
|maz3.maz.net                 |39   |
|gate.barr.com                |38   |
|nexus.mlckew.edu.au          |37   |
|ts8-1.westwood.ts.ucla.edu   |37   |
|m38-370-9.mit.edu            |37   |
|204.62.245.32                |33   |
|163.206.104.34               |27   |
|spica.sci.isas.ac.jp         |27   |
|www-d4.proxy.aol.com         |26   |
|203.13.168.24                |25   |
|203.13.168.17                |25   |
|www-c4.proxy.aol.com         |25   |
|internet-gw.watson.ibm.com   |24   |
|scooter.pa-x.dec.com         |23   |
|piweba5y.prodigy.com         |23   |
|crl5.crl.com                 |23   |
|onramp2-9.onr.com            |22   |
|slip145-189.ut.nl.ibm.net    |22   |
|198.40.25.102.sap2.artic.edu |21   |
|msp1-16.nas.mr.net           |20   |
|gn2.getnet.com               |20   |
|tigger.nashscene.com         |19   |
|dial055.mbnet.mb.ca          |19   |
|isou24.vilspa.esa.es         |19   |
+-----------------------------+-----+

We can also count how many HTTP 404s occurred each day:

errors_by_date_sorted_df = (not_found_df
                            .select(dayofmonth('time').alias('day'))
                            .groupBy('day')
                            .count()
                            .sort('day')
                            .cache())

print '404 Errors by day:\n'
errors_by_date_sorted_df.show()

Let's look at the output:

404 Errors by day:

+---+-----+
|day|count|
+---+-----+
|  1|  243|
|  3|  303|
|  4|  346|
|  5|  234|
|  6|  372|
|  7|  532|
|  8|  381|
|  9|  279|
| 10|  314|
| 11|  263|
| 12|  195|
| 13|  216|
| 14|  287|
| 15|  326|
| 16|  258|
| 17|  269|
| 18|  255|
| 19|  207|
| 20|  312|
| 21|  305|
+---+-----+

Now let's visualize this data:

days_with_errors_404 = []
errors_404_by_day = []
for day, count in errors_by_date_sorted_df.collect():
  days_with_errors_404.append(day)
  errors_404_by_day.append(count)

print days_with_errors_404
print errors_404_by_day

fig, ax = prepareSubplot(np.arange(0, 20, 5), np.arange(0, 600, 100))
colorMap = 'rainbow' # color map name
cmap = cm.get_cmap(colorMap)
plt.plot(days_with_errors_404, errors_404_by_day, color=cmap(0), linewidth=3)
plt.axis([0, max(days_with_errors_404), 0, max(errors_404_by_day)])
plt.xlabel('Day')
plt.ylabel('404 Errors')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)

The chart looks like this:
(figure: line chart of 404 errors per day)
We can also view it another way:

display(errors_by_date_sorted_df)

The result:
(figure: Databricks chart of errors_by_date_sorted_df)
Now let's find the five days with the most HTTP 404s:

top_err_date_df = errors_by_date_sorted_df.sort(desc('count'))

print 'Top Five Dates for 404 Requests:\n'
top_err_date_df.show(5)

As you can see, days 7, 8, 6, 4, and 15 had the most HTTP 404s:

+---+-----+
|day|count|
+---+-----+
|  7|  532|
|  8|  381|
|  6|  372|
|  4|  346|
| 15|  326|
+---+-----+

Since we can count HTTP 404s per day, we can just as easily count them per hour. The code is as follows:

from pyspark.sql.functions import hour
hour_records_sorted_df = (not_found_df
                          .select(hour('time').alias('hour'))
                          .groupBy('hour')
                          .count()
                          .sort('hour')
                          .cache())
print 'Top hours for 404 requests:\n'
hour_records_sorted_df.show(24)

The output is:

Top hours for 404 requests:

+----+-----+
|hour|count|
+----+-----+
|   0|  175|
|   1|  171|
|   2|  422|
|   3|  272|
|   4|  102|
|   5|   95|
|   6|   93|
|   7|  122|
|   8|  199|
|   9|  185|
|  10|  329|
|  11|  263|
|  12|  438|
|  13|  397|
|  14|  318|
|  15|  347|
|  16|  373|
|  17|  330|
|  18|  268|
|  19|  269|
|  20|  270|
|  21|  241|
|  22|  234|
|  23|  272|
+----+-----+

Now let's visualize this data; it follows the same pattern as before:

hours_with_not_found = []
not_found_counts_per_hour = []

for hour, count in hour_records_sorted_df.collect():
  hours_with_not_found.append(hour)
  not_found_counts_per_hour.append(count)

fig, ax = prepareSubplot(np.arange(0, 25, 5), np.arange(0, 500, 50))
colorMap = 'seismic'
cmap = cm.get_cmap(colorMap)
plt.plot(hours_with_not_found, not_found_counts_per_hour, color=cmap(0), linewidth=3)
plt.axis([0, max(hours_with_not_found), 0, max(not_found_counts_per_hour)])
plt.xlabel('Hour')
plt.ylabel('404 Errors')
plt.axhline(linewidth=3, color='#999999')
plt.axvline(linewidth=2, color='#999999')
display(fig)

The line chart looks like this:
(figure: line chart of 404 errors by hour of day)
We can also display it differently:

display(hour_records_sorted_df)

The bar chart looks like this:
(figure: Databricks bar chart of hour_records_sorted_df)

Summary

Reference: the official Spark Python API documentation.
