Extracting Website Article Lists in Python with the OPTICS Clustering Algorithm
Preface
News sites usually present an overview list of article titles, while the top and side areas of the page carry other links: ads, navigation, recommended articles, hot-article lists, and so on. This makes it troublesome for a crawler to pick out the main article list. Methods for extracting article bodies are plentiful: the common approach scores the text tags of the HTML DOM tree and combines that with text-density analysis; a web search turns up tools such as readability, newspaper, and html2article. For extracting the article list, however, I have not found a suitable off-the-shelf tool, so this article describes a method for a crawler to extract a page's article list.
I. Solution Overview
1. Requirements
Use a crawler to extract the list of articles on a page, e.g. https://news.163.com/domestic/, as shown in the figure below:
Sample crawler output after extraction:
With the article titles and links in hand, the crawler can go on to fetch each article's body.
2. Solution
The simplest approach is to fetch the page HTML and use XPath to collect the text and href attribute of every a tag, which yields the titles and links. This usually works for clean pages that contain little besides the article list, such as the website of a certain * department:
The page in Figure 3 is quite clean: the large central block is the article list. Even so, the top of the page carries column navigation and links into other systems, and the bottom has pagination and links to related sites. A page that is 100% article list essentially does not exist, so a naive extraction always picks up some noise.
Web pages are generally laid out in several regions: recommended lists, hot lists, navigation, ads, the article list, and so on, each in its own block. On a news page the article list is the main region, and visually it is also the largest and densest one.
Based on this observation, the article list can be extracted with a clustering algorithm combined with text-density analysis.
3. Algorithms
3.1 Clustering Algorithm
This solution uses the density-based OPTICS (Ordering Points To Identify the Clustering Structure) algorithm to identify clusters of hyperlinks on the page, i.e. its regions, as the basis for extracting the article list.
Most clustering algorithms are sensitive to the input parameters ε (the maximum neighborhood radius) and MinPts (the minimum number of points required in a core object's neighborhood); small changes in either can produce very different clusters. OPTICS weakens this sensitivity. It does not output clusters directly; instead it outputs a cluster ordering, a linear list that encodes the density-based clustering structure of the data, from which concrete clusters can then be derived. The algorithm itself is covered in data-mining textbooks and many online resources, so it is not repeated here; readers unfamiliar with OPTICS are advised to study it before reading on.
The key part of the solution is data modeling: constructing a similarity model for the data. Density-based clustering needs a distance between data points, where a smaller distance means higher similarity. For article-list extraction, the objects being clustered are the a tags of the HTML DOM.
So how should the distance between two tags in the DOM be defined? The DOM is a tree, as shown below:
dist_{i,j} = (P_max - M) / P_max
where P_max = max(P_i, P_j), and P_i is the depth of tag i in the DOM tree, i.e. the number of its ancestors.
The value of M depends on how many ancestors the two tags share, or how many of their ancestors are of the same kind, computed as follows:
- Initialize M = 0.
- Compare the two elements' ancestors bottom-up: first the parents, then the grandparents, and so on. If the ancestors are the same element, add 1 to M; if they are different elements but share the same tag name and class attribute, add 0.9 (an empirical value).
For the DOM tree in Figure 4, the distances between the links' a tags are:
Link 1 and link 2: M = 0, distance 1;
Link 2 and link 3: M = 4.9, distance (5 - 4.9) / 5 = 0.02;
Link 3 and link 4: M = 4.8, distance (5 - 4.8) / 5 = 0.04.
This normalizes the distance between any two tags into the interval [0, 1]. All a tags in the DOM tree are placed in a linear list and the distance between every pair is computed, producing an N × N distance matrix whose main diagonal is all zeros (a tag's distance to itself is 0).
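To make the metric concrete, here is a minimal, self-contained sketch of the same computation on a toy DOM (the markup and the resulting numbers are illustrative; they are not the article's Figure 4):

from lxml import etree

# Toy page: two links inside a news list, one link inside a nav block
html = etree.HTML(
    '<div><ul class="news">'
    '<li><a href="/a1">headline one</a></li>'
    '<li><a href="/a2">headline two</a></li>'
    '</ul><div id="nav"><a href="/nav">nav</a></div></div>'
)
links = html.xpath('//a')

def ancestors(node):
    chain = []
    node = node.getparent()
    while node is not None:
        chain.append(node)
        node = node.getparent()
    return chain

def dom_dist(n1, n2):
    a1, a2 = ancestors(n1), ancestors(n2)
    p = max(len(a1), len(a2))
    m = 0.0
    for g1, g2 in zip(a1, a2):  # bottom-up: parents first, then grandparents
        if g1 is g2:
            m += 1.0            # same ancestor element
        elif g1.tag == g2.tag and g1.get('class') == g2.get('class'):
            m += 0.9            # same tag name and class attribute (empirical)
    return (p - m) / p

print(dom_dist(links[0], links[1]))  # sibling list items: (5 - 4.9) / 5 = 0.02
print(dom_dist(links[0], links[2]))  # list item vs. nav link: 1.0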
OPTICS then outputs the cluster ordering of the a tags, from which clusters are generated by the following steps (a condensed code sketch follows the list):
1. Take the next element from the ordered list. If its reachability distance is not greater than the given radius ε, the point belongs to the current cluster; otherwise go to step 2.
2. If the element's core distance is greater than ε, the point is noise and is skipped; otherwise it starts a new cluster. Go back to step 1.
3. When the ordered list is exhausted, the algorithm ends.
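A minimal sketch of these three steps (the full version is the clustering() method in Section II):

def extract_clusters(ordered_indices, reach_dists, core_dists, eps):
    clusters, current = [], []
    for p in ordered_indices:
        if reach_dists[p] <= eps:    # step 1: reachable, join current cluster
            current.append(p)
        elif core_dists[p] <= eps:   # step 2: core point, start a new cluster
            if current:
                clusters.append(current)
            current = [p]
        # step 2: otherwise p is noise and is skipped
    if current:                      # step 3: flush the last cluster
        clusters.append(current)
    return clusters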
3.2 Text-Density Analysis
Clustering the a tags is not enough to tell which cluster is the article list. Inspection of a large number of news sites shows that the article list is generally the cluster with the highest text density. The text density T of a cluster is defined as
T = (Σ_{i=1}^{N} W_i) / N
W_i: the number of characters of text in the i-th tag of the cluster
N: the number of a tags in the cluster
That is, T is the average text length of the tags in the cluster. Compute T for every cluster and sort the clusters by it; the densest cluster is taken as the article list.
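As a minimal sketch (get_title stands in for any function that returns a node's link text; the full version is get_article_list_cluster() in Section II):

def pick_article_list(clusters, get_title):
    # Rank clusters by average link-text length and return the densest one
    def density(nodes):
        return sum(len(get_title(n)) for n in nodes) / len(nodes)
    return max(clusters, key=density)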
3.3 Noise Reduction
- The value of ε strongly affects clustering quality: the larger it is, the less pure the clusters and the more noise they absorb; too small, and a natural cluster may be split into fragments. Testing against a large number of sites, a value of roughly 0.13 (the code below uses 0.135) yields reasonably pure clusters.
- News headlines have a certain minimum length, generally at least 5 characters, so links whose text is shorter than 5 characters are filtered out before clustering, as sketched below.
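A minimal sketch of this pre-filter on toy markup (illustrative, not from the article):

from lxml import etree

MIN_TITLE_LEN = 5
doc = etree.HTML('<div><a href="/a">more</a><a href="/b">a real headline here</a></div>')
# Keep only links whose visible text has at least MIN_TITLE_LEN characters
links = [n for n in doc.xpath('//a')
         if len(''.join(n.itertext()).strip()) >= MIN_TITLE_LEN]
print([''.join(n.itertext()) for n in links])  # only the long headline survives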
II. The Code
1. Overview
The code is implemented with Python + Selenium + Chrome + chromedriver. Besides numpy and selenium, it also imports lxml and scrapy (scrapy is used only for its Selector), so all four libraries need to be installed:
python -m pip install numpy
python -m pip install selenium
python -m pip install lxml
python -m pip install scrapy
Install Chrome on Linux:
yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
Download chromedriver (the version must match the installed Chrome): https://npm.taobao.org/mirrors/chromedriver/2.40/chromedriver_linux64.zip
After unpacking, put the chromedriver binary in the same directory as the code.
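A quick smoke test for the environment (a sketch using the same paths and URL as the code below; adjust them to your setup):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# Launch headless Chrome through the chromedriver in the current directory
driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)
driver.get('https://news.163.com/domestic/')
print(driver.title)
driver.quit()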
2. Entity Class
- Article.py
class Article(object):
    def __init__(self):
        self.title = ''            # article title (the link text)
        self.text = ''             # article body, filled by a later crawl
        self.url = ''              # absolute URL of the article
        self.published_date = ''   # publication date, reserved for later use
3. Spider Class
- ArticleListSpider.py
from Article import Article
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from lxml import etree
import numpy as np
from scrapy.selector import Selector
import re
import traceback
class ArticleListSpider:
_undefined_dist = 2
_epsilon = 0.135
_min_pts = 3
_min_title_len = 5
    def __init__(self, url):
        self.start_url = url
        self.domain = ArticleListSpider.get_domain(url)
        # Default to http:// so that self.protocol is always defined
        self.protocol = 'http://'
        if self.start_url.startswith('https://'):
            self.protocol = 'https://'
def download_articles(self):
url = self.start_url
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
service = Service('./chromedriver')
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)
# driver.implicitly_wait(5)
html = driver.page_source
        driver.quit()  # quit() also shuts down the chromedriver process
# print(html)
articles = self.extract_article_list(html)
return articles
def make_article_url(self, node):
start_url = self.start_url
href = node.get("href")
if not href or href.startswith("javascript:") or href.strip() == '#':
href = None
        # Some sites put the target URL in an onclick handler instead of href,
        # e.g. onclick="jump('/news/1.html')"; extract the call's first argument
        onclick = node.get("onclick")
if onclick:
try:
m = re.search(r'\((.+)\)', onclick)
if m:
href = m.group(1).strip()
href = re.sub(r'[\'"]', '', href)
except:
traceback.print_exc()
if not href:
raise AttributeError("Unable to make article URL from node: %s, from start URL: %s"
% (etree.tostring(node, encoding='utf-8').decode('utf-8'), start_url))
article_url = href
        # Strip a leading "./"; the dot must be escaped, or the pattern eats any first character
        article_url = re.sub(r'^\./', '', article_url)
m = re.search("^https?://", article_url)
if not m:
if article_url.startswith('/'):
article_url = self.protocol + self.domain + article_url
else:
if start_url.endswith('html'):
start_url = re.sub(r'/[a-z_]+\.[s]?html', '', start_url)
start_url = start_url + '/'
article_url = start_url + article_url
return article_url
def extract_article_list(self, html):
doc = etree.HTML(html)
link_nodes = doc.xpath('//a')
# Filter empty href and short title
tmp_nodes = []
for node in link_nodes:
title = ArticleListSpider.get_title(node)
            if title and len(title) >= ArticleListSpider._min_title_len and (node.get("href") or node.get("onclick")):
tmp_nodes.append(node)
link_nodes = tmp_nodes
link_nodes_num = len(link_nodes)
ancestors = [None for _ in range(link_nodes_num)]
# Construct distance matrix
dist_matrix = np.zeros((link_nodes_num, link_nodes_num))
for i in range(link_nodes_num):
node_i = link_nodes[i]
ancestors_i = ancestors[i]
if not ancestors_i:
ancestors_i = ArticleListSpider.get_ancestors(node_i)
ancestors[i] = ancestors_i
for j in range(i+1, link_nodes_num):
node_j = link_nodes[j]
ancestors_j = ancestors[j]
if not ancestors_j:
ancestors_j = ArticleListSpider.get_ancestors(node_j)
ancestors[j] = ancestors_j
dist_matrix[i][j] = ArticleListSpider.dist(node_i, ancestors_i, node_j, ancestors_j)
dist_matrix[j][i] = dist_matrix[i][j]
core_dists = [ArticleListSpider._undefined_dist for _ in range(link_nodes_num)]
reach_dists = [ArticleListSpider._undefined_dist for _ in range(link_nodes_num)]
ordered_indices = ArticleListSpider.optics_ordering(link_nodes, core_dists, reach_dists, dist_matrix)
node_clusters = ArticleListSpider.clustering(link_nodes, core_dists, reach_dists, ordered_indices)
articles = []
if len(node_clusters) > 0:
article_list_nodes = ArticleListSpider.get_article_list_cluster(node_clusters)
for node in article_list_nodes:
article = Article()
article.title = ArticleListSpider.get_title(node)
article.url = self.make_article_url(node)
articles.append(article)
return articles
@staticmethod
def get_title(node):
        # node.text only returns the text before the first child element, while
        # XPath retrieves all text. E.g. in <a href="./index.html"><i>Some</i>thing</a>,
        # the "thing" inside the <a> tag cannot be retrieved via node.text
title = Selector(text=etree.tostring(node, encoding='utf-8').decode('utf-8')).xpath('//a//text()').extract()
title = [t.strip() for t in title]
return ''.join(title)
@staticmethod
def get_domain(start_url):
url = start_url.replace("http://", '')
url = url.replace("https://", '')
components = url.split('/')
domain = components[0]
return domain
@staticmethod
def get_article_list_cluster(node_clusters):
# Article links usually have the highest words density
words_density = [0.0 for _ in range(len(node_clusters))]
for i in range(len(node_clusters)):
nodes = node_clusters[i]
words = 0
for node in nodes:
title = ArticleListSpider.get_title(node)
words = words + len(title)
words_density[i] = words / len(nodes)
        i = np.argmax(words_density)
return node_clusters[i]
@staticmethod
def clustering(link_nodes, core_dists, reach_dists, ordered_indices):
clusters = []
cluster = []
for i in range(len(ordered_indices)):
p = ordered_indices[i]
node = link_nodes[p]
            # print(ArticleListSpider.get_title(node) + "\t" + str(reach_dists[p]))
if reach_dists[p] <= ArticleListSpider._epsilon:
cluster.append(node)
elif core_dists[p] <= ArticleListSpider._epsilon:
if len(cluster) > 0:
clusters.append(cluster)
# new cluster
cluster = [node]
else:
# p is noise
pass
# Append the last cluster
if len(cluster) > 0:
clusters.append(cluster)
return clusters
@staticmethod
def optics_ordering(link_nodes, core_dists, reach_dists, dist_matrix):
link_nodes_num = len(link_nodes)
neighbors = [[] for _ in range(link_nodes_num)]
core_indices = [False for _ in range(link_nodes_num)]
for i in range(link_nodes_num):
dist = dist_matrix[i]
# Gathering all distances from current node to others,
# those of which are less than or equal to epsilon
core_dist = dist[dist <= ArticleListSpider._epsilon]
# Gathering indices of all nodes whose distance from current node is less than or equal to epsilon.
# They are neighbors of current node
neighbors_i = np.argwhere(dist <= ArticleListSpider._epsilon).flatten()
# Remove index of current node
neighbors_i = neighbors_i[neighbors_i != i]
if len(core_dist) > ArticleListSpider._min_pts:
core_indices[i] = True
# Sorting core_dist, the min_pts'th position is exactly the core dist of current node
core_dists[i] = np.sort(core_dist)[ArticleListSpider._min_pts]
neighbors[i] = neighbors_i
        # Update reachability distances and order the nodes by them, per the OPTICS algorithm
visited = [False for _ in range(link_nodes_num)]
order_seeds = []
p_queue = [i for i in range(link_nodes_num)]
np.random.shuffle(p_queue)
p = ArticleListSpider.next(p_queue, order_seeds, visited)
ordered_indices = []
while p > -1:
ordered_indices.append(p)
visited[p] = True
if core_indices[p]:
neighbors_p = neighbors[p]
for q in neighbors_p:
if visited[q]:
continue
else:
ArticleListSpider.update(p, q, order_seeds, reach_dists, core_dists, dist_matrix)
p = ArticleListSpider.next(p_queue, order_seeds, visited)
return ordered_indices
@staticmethod
def update(p, q, order_seeds, reach_dists, core_dists, dist_matrix):
# p is core object
if q not in order_seeds:
order_seeds.append(q)
q_reach_dist = reach_dists[q]
new_reach_dist = max(core_dists[p], dist_matrix[p][q])
if q_reach_dist == ArticleListSpider._undefined_dist:
# If reachable distance from p to q was not set yet, then set it.
reach_dists[q] = new_reach_dist
else:
# Otherwise, update q's reachable distance if the new one is less than the old one
reach_dists[q] = min(q_reach_dist, new_reach_dist)
        # Keep the seed list ordered by ascending reachability distance
        order_seeds.sort(key=lambda x: reach_dists[x])
@staticmethod
def next(p_queue, order_seeds, visited):
while True:
if len(order_seeds) > 0:
p = order_seeds[0]
del order_seeds[0]
elif len(p_queue) > 0:
p = p_queue.pop()
else:
p = -1
if p == -1 or not visited[p]:
break
return p
@staticmethod
def get_ancestors(node):
ancestors = []
node = node.getparent()
while node is not None:
ancestors.append(node)
node = node.getparent()
return ancestors
@staticmethod
def dist(node1, ancestors1, node2, ancestors2):
if node1 is node2:
return 0
p = max(len(ancestors1), len(ancestors2))
m = 0
for i in range(p):
if i < len(ancestors1) and i < len(ancestors2):
p1 = ancestors1[i]
p2 = ancestors2[i]
if p1 is p2:
# If their ancestor nodes are the same, plus one
m += 1
elif p1.tag == p2.tag and p1.get("class") == p2.get("class"):
                # Otherwise, if the tag names are equal and the css classes are the same, add 0.9
m += 0.9
return (p - m) / p
4. Usage
article_list_url = "https://news.163.com/domestic/"
spider = ArticleListSpider(article_list_url)
article_list = spider.download_articles()
for a in article_list:
print(a.title + "\t" + a.url)
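With titles and URLs in hand, each article body can be fetched next, for example with the newspaper library mentioned in the preface (a sketch assuming the newspaper3k package: python -m pip install newspaper3k):

from newspaper import Article as NewsArticle

for a in article_list:
    page = NewsArticle(a.url, language='zh')
    page.download()                        # fetch the raw HTML
    page.parse()                           # extract body text, date, etc.
    a.text = page.text
    a.published_date = str(page.publish_date)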
III. Known Issues
1. Density-based clustering is generally sensitive to the choice of ε. OPTICS reduces this sensitivity, but page designs vary enormously in structure and depth: the deeper the hierarchy, the smaller the distances between related elements, and vice versa. No single ε suits every site; the value of roughly 0.13 used here is an empirical figure from crawling over a hundred sites.
2. The text-density measure is rather crude: it is simply the average link-text length within a cluster. On pages with a hot list or recommendation list next to the main list, the algorithm may extract the side list instead. As shown in Figure 5, it sometimes picks up the "Today's Recommendations" list rather than the main news list.
IV. Example Deployment
http://47.119.182.42/news/index.html
V. Conclusion
Combining the density-based OPTICS clustering algorithm with text-density analysis extracts the article list of a news site. From there, the article bodies can be crawled as well, providing data for text mining, information retrieval, and machine learning.