Python实现OPTICS聚类算法提取网站文章列表

2021-07-10 03:45:26

Python实现OPTICS聚类算法提取网站文章列表

前言
一、方案介绍
二、代码介绍
三、存在的问题
四、应用案例
五、总结

前言

新闻网站一般都会有文章标题列表，也就是概览，另外顶部和左右两边可能还会有其它链接，比如广告、导航、推荐新闻列表、热门新闻列表等，如果通过爬虫获取主要的文章列表，会有一定的困扰。对于介绍获取网站文章正文的方法比较多，常见的算法是对HTML的DOM树进行文本标签打分，结合文本密度分析，提取文章正文。百度出来的实现工具主要有readability、newspaper、html2article等。然而，对于文章列表的获取，本人还没搜到合适的工具，因此本文主要介绍了通过爬虫获取网页文章列表的方法。

一、方案介绍

1. 需求

通过爬虫获取网页上的文章列表，比如网址https://news.163.com/domestic/，如下图所示：

图1 新闻网站文章列表
提取后爬虫输出效果示例：

图2 网站文章列表提取结果

获取到文章标题和文章链接后就可以进一步爬取文章正文了。

2. 解决方案

最简单的实现方式就是爬虫获取网页的HTML后，通过xpath获取所有a标签的文本和href属性值，即可得到文章标题和链接。通常对于只有文章列表的、干净的网页是没有问题的，比如*部门网站：

图3 *部门网站文章列表

图3所示的网页比较干净，中间一大块都是文章列表。但是即便如此，页面顶部还存在栏目导航和其它系统入口链接，底部还有页面跳转和其他相关站点的导航链接。因此，几乎不存在100%纯粹的文章列表的页面，由此获取的文章列表也会存在噪点。

一般网页设计都会分为多个区域，推荐列表、热门列表、导航、广告和文章列表等，每个板块单独一个区域。如果是新闻网页，文章列表是主要的区域。另外，文章列表作为主要的区域，从视觉上看也是面积最大和最密集的区域。

据此，可通过聚类算法和文本密度分析，提取文章列表。

3. 算法介绍

3.1 聚类算法

本方案利用基于密度的OPTICS（通过点排序识别聚类结构）聚类算法，识别网页的超链接簇即区域，从而为提取文章列表做准备。

大多数的聚类算法对输入参数ε（邻域的最大半径）和MinPts（核心对象的邻域中要求的最少点数）都比较敏感。输入参数的细微变化，可能得出差别很大的簇。OPTICS算法弱化输入参数的敏感性，本身并不直接输出数据集的聚类，而是输出簇排序，该排序是线性表，并且代表了数据的基于密度的聚类结构，据此结构可以导出具体的数据簇。关于具体算法，数据挖掘教科书和网上都有详述，本文不再赘述。建议读者先从其他资源学习OPTICS算法，然后再阅读本文。

算法的关键部分是数据建模，构造数据的相似度模型。基于密度的聚类算法需要刻画数据点之间的距离，距离越近相似度越高，距离越远相似度越低。对于文章列表的分析，是对HTML DOM结构中a标签的聚类分析。

如何刻画HTML DOM结构中两两标签的距离？DOM是个树状结构，如下图所示：

图4 DOM树定义DOM树中标签（即元素）的深度为P，也可以理解为标签的祖先的个数。任意两个标签N _i, N _j的距离 dist _i,j 定义如下：

dist_i,j = (P_max - M) / P_max
其中， P_max = max(P_i, P_j)。
M值取决于两个标签具有相同祖先的个数或者祖先是同类的个数，算法如下：

初始 M=0。
自底向上对比两个元素的祖先，先对比“父亲”，再对比“爷爷”，依次类推。如果祖先是同一个元素，M增加1。如果祖先不是同一个元素，但是其标签名且class属性相同M增加0.9（经验值）。
如图4所示的DOM树，各链接对应的a标签之间距离如下：
链接1和链接2的 M=0，距离是 1 ；
链接2和链接3的 M=4.9，距离是 (5-4.9) / 5 = 0.02 ；
链接3和链接4的 M=4.8，距离是 (5-4.8) / 5 = 0.04 。

这样就把任意两个标签的距离规范化到 [0,1] 区间。将DOM树中所有的a标签放入一个线性列表，计算任意两个标签之间的距离，由此构造 N × N的距离矩阵，主对角线为全0(同一个标签的距离是0)。

然后通过OPTICS算法输出a标签的簇排序，再按如下算法生成a标签聚类结果：

从排序线性表中按顺序取出元素，如果该该元素的可达距离不大于给定半径ε，则该点属于当前类别，否则至步骤2。
如果该元素的核心距离大于给定半径ε，则该点为噪声，可以忽略，否则该点属于新的聚类，跳至步骤1。
排序线性表遍历结束，算法结束。

（注：建议读者先从其他资源学习OPTICS算法，然后再阅读本文。）

3.2 文本密度分析

获得a标签聚类后，还不能断定哪一个簇是文章列表。通过对大量新闻网站的查验，文章列表一般是文本密度最大的簇。簇的文本密度 T 定义如下：
T = Σ^N_iW_i / N
W_i : 簇内第i个标签的文本字数
N: 簇内a标签个数

其实也就是簇内标签文本平均字数。计算每个聚类簇的文本密度，将簇按文本密度排序，密度最大的簇即为文章列表。

3.3 降噪处理

ε的取值对聚类的效果有较大的影响，取值越大聚类越不纯粹，夹杂噪声越多。取值太小可能会把原本是同类的簇，分成更小的簇。在本方案中，通过对大量网站的测试，取值0.13，可以提取到比较纯粹的聚类。
新闻网站的标题都有一定的长度，一般至少5个字及以上。所以在聚类前，先把文本小于5个字的链接过滤掉。

二、代码介绍

1. 总体介绍

代码使用python + selenium + chrome + chromedriver实现，还需要安装numpy和selenium库。
python -m pip install numpy
python -m pip install selenium

Linux上安装Chrome:
yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

下载chromedriver: https://npm.taobao.org/mirrors/chromedriver/2.40/chromedriver_linux64.zip
解压后将chromedriver文件放在代码所在的目录下。

2. 实体类

Article.py

class Article(object):
    def __init__(self):
        self.title = ''
        self.text = ''
        self.url = ''
        self.published_date = ''

3. 爬虫类

ArticleListSpider.py

from Article import Article
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from lxml import etree
import numpy as np
from functools import cmp_to_key
from scrapy.selector import Selector
import time
import re
import traceback


class ArticleListSpider:

    _undefined_dist = 2
    _epsilon = 0.135
    _min_pts = 3
    _min_title_len = 5

    def __init__(self, url):
        self.start_url = url
        self.domain = ArticleListSpider.get_domain(url)
        if self.start_url.startswith('http://'):
            self.protocol = 'http://'
        if self.start_url.startswith('https://'):
            self.protocol = 'https://'

    def download_articles(self):
        url = self.start_url
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        service = Service('./chromedriver')
        driver = webdriver.Chrome(service=service, options=options)
        driver.get(url)
        # driver.implicitly_wait(5)
        html = driver.page_source
        driver.close()
        # print(html)
        articles = self.extract_article_list(html)
        return articles

    def make_article_url(self, node):
        start_url = self.start_url
        href = node.get("href")
        if not href or href.startswith("javascript:") or href.strip() == '#':
            href = None
            onclick = node.get("onclick")
            if onclick:
                try:
                    m = re.search(r'\((.+)\)', onclick)
                    if m:
                        href = m.group(1).strip()
                        href = re.sub(r'[\'"]', '', href)
                except:
                    traceback.print_exc()
        if not href:
            raise AttributeError("Unable to make article URL from node: %s, from start URL: %s"
                                 % (etree.tostring(node, encoding='utf-8').decode('utf-8'), start_url))
        article_url = href
        article_url = re.sub('^./', '', article_url)
        m = re.search("^https?://", article_url)
        if not m:
            if article_url.startswith('/'):
                article_url = self.protocol + self.domain + article_url
            else:
                if start_url.endswith('html'):
                    start_url = re.sub(r'/[a-z_]+\.[s]?html', '', start_url)
                    start_url = start_url + '/'
                article_url = start_url + article_url
        return article_url

    def extract_article_list(self, html):
        doc = etree.HTML(html)
        link_nodes = doc.xpath('//a')
        # Filter empty href and short title
        tmp_nodes = []
        for node in link_nodes:
            title = ArticleListSpider.get_title(node)
            if title and len(title) >= ArticleListSpider._min_title_len and node.get("href"):
                tmp_nodes.append(node)
        link_nodes = tmp_nodes
        link_nodes_num = len(link_nodes)
        ancestors = [None for _ in range(link_nodes_num)]
        # Construct distance matrix
        dist_matrix = np.zeros((link_nodes_num, link_nodes_num))
        for i in range(link_nodes_num):
            node_i = link_nodes[i]
            ancestors_i = ancestors[i]
            if not ancestors_i:
                ancestors_i = ArticleListSpider.get_ancestors(node_i)
                ancestors[i] = ancestors_i
            for j in range(i+1, link_nodes_num):
                node_j = link_nodes[j]
                ancestors_j = ancestors[j]
                if not ancestors_j:
                    ancestors_j = ArticleListSpider.get_ancestors(node_j)
                    ancestors[j] = ancestors_j
                dist_matrix[i][j] = ArticleListSpider.dist(node_i, ancestors_i, node_j, ancestors_j)
                dist_matrix[j][i] = dist_matrix[i][j]
        core_dists = [ArticleListSpider._undefined_dist for _ in range(link_nodes_num)]
        reach_dists = [ArticleListSpider._undefined_dist for _ in range(link_nodes_num)]
        ordered_indices = ArticleListSpider.optics_ordering(link_nodes, core_dists, reach_dists, dist_matrix)
        node_clusters = ArticleListSpider.clustering(link_nodes, core_dists, reach_dists, ordered_indices)
        articles = []
        if len(node_clusters) > 0:
            article_list_nodes = ArticleListSpider.get_article_list_cluster(node_clusters)
            for node in article_list_nodes:
                article = Article()
                article.title = ArticleListSpider.get_title(node)
                article.url = self.make_article_url(node)
                articles.append(article)
        return articles

    @staticmethod
    def get_title(node):
        # node.text is not available if there is sub element in <a> tag,
        # but xpath can retrieve the text. For example, <a href="./index.html"><i>Some</i>thing</a>,
        # the "thing" inside <a> tag cannot be retrieved by node.text
        title = Selector(text=etree.tostring(node, encoding='utf-8').decode('utf-8')).xpath('//a//text()').extract()
        title = [t.strip() for t in title]
        return ''.join(title)

    @staticmethod
    def get_domain(start_url):
        url = start_url.replace("http://", '')
        url = url.replace("https://", '')
        components = url.split('/')
        domain = components[0]
        return domain

    @staticmethod
    def get_article_list_cluster(node_clusters):
        # Article links usually have the highest words density
        words_density = [0.0 for _ in range(len(node_clusters))]
        for i in range(len(node_clusters)):
            nodes = node_clusters[i]
            words = 0
            for node in nodes:
                title = ArticleListSpider.get_title(node)
                words = words + len(title)
            words_density[i] = words / len(nodes)
        i = np.argsort(words_density)[len(words_density) - 1]
        return node_clusters[i]

    @staticmethod
    def clustering(link_nodes, core_dists, reach_dists, ordered_indices):
        clusters = []
        cluster = []
        for i in range(len(ordered_indices)):
            p = ordered_indices[i]
            node = link_nodes[p]
            # (ArticleListSpider.get_title(node) + "\t" + str(reach_dists[p]))
            if reach_dists[p] <= ArticleListSpider._epsilon:
                cluster.append(node)
            elif core_dists[p] <= ArticleListSpider._epsilon:
                if len(cluster) > 0:
                    clusters.append(cluster)
                # new cluster
                cluster = [node]
            else:
                # p is noise
                pass
        # Append the last cluster
        if len(cluster) > 0:
            clusters.append(cluster)
        return clusters

    @staticmethod
    def optics_ordering(link_nodes, core_dists, reach_dists, dist_matrix):
        link_nodes_num = len(link_nodes)
        neighbors = [[] for _ in range(link_nodes_num)]
        core_indices = [False for _ in range(link_nodes_num)]
        for i in range(link_nodes_num):
            dist = dist_matrix[i]
            # Gathering all distances from current node to others, 
            # those of which are less than or equal to epsilon
            core_dist = dist[dist <= ArticleListSpider._epsilon]
            # Gathering indices of all nodes whose distance from current node is less than or equal to epsilon.
            # They are neighbors of current node
            neighbors_i = np.argwhere(dist <= ArticleListSpider._epsilon).flatten()
            # Remove index of current node
            neighbors_i = neighbors_i[neighbors_i != i]
            if len(core_dist) > ArticleListSpider._min_pts:
                core_indices[i] = True
                # Sorting core_dist, the min_pts'th position is exactly the core dist of current node
                core_dists[i] = np.sort(core_dist)[ArticleListSpider._min_pts]
                neighbors[i] = neighbors_i
        # Updating reachable distance and sorting all nodes by reachable distance in term of OPTICS algorithm
        visited = [False for _ in range(link_nodes_num)]
        order_seeds = []
        p_queue = [i for i in range(link_nodes_num)]
        np.random.shuffle(p_queue)
        p = ArticleListSpider.next(p_queue, order_seeds, visited)
        ordered_indices = []
        while p > -1:
            ordered_indices.append(p)
            visited[p] = True
            if core_indices[p]:
                neighbors_p = neighbors[p]
                for q in neighbors_p:
                    if visited[q]:
                        continue
                    else:
                        ArticleListSpider.update(p, q, order_seeds, reach_dists, core_dists, dist_matrix)
            p = ArticleListSpider.next(p_queue, order_seeds, visited)
        return ordered_indices

    @staticmethod
    def update(p, q, order_seeds, reach_dists, core_dists, dist_matrix):
        # p is core object
        if q not in order_seeds:
            order_seeds.append(q)
        q_reach_dist = reach_dists[q]
        new_reach_dist = max(core_dists[p], dist_matrix[p][q])
        if q_reach_dist == ArticleListSpider._undefined_dist:
            # If reachable distance from p to q was not set yet, then set it.
            reach_dists[q] = new_reach_dist
        else:
            # Otherwise, update q's reachable distance if the new one is less than the old one
            reach_dists[q] = min(q_reach_dist, new_reach_dist)
        order_seeds.sort(key=cmp_to_key(lambda x, y: reach_dists[x] - reach_dists[y]))

    @staticmethod
    def next(p_queue, order_seeds, visited):
        while True:
            if len(order_seeds) > 0:
                p = order_seeds[0]
                del order_seeds[0]
            elif len(p_queue) > 0:
                p = p_queue.pop()
            else:
                p = -1
            if p == -1 or not visited[p]:
                break
        return p

    @staticmethod
    def get_ancestors(node):
        ancestors = []
        node = node.getparent()
        while node is not None:
            ancestors.append(node)
            node = node.getparent()
        return ancestors

    @staticmethod
    def dist(node1, ancestors1, node2, ancestors2):
        if node1 is node2:
            return 0
        p = max(len(ancestors1), len(ancestors2))
        m = 0
        for i in range(p):
            if i < len(ancestors1) and i < len(ancestors2):
                p1 = ancestors1[i]
                p2 = ancestors2[i]
                if p1 is p2:
                    # If their ancestor nodes are the same, plus one
                    m += 1
                elif p1.tag == p2.tag and p1.get("class") == p2.get("class"):
                    # Otherwise if the tag names are equal and have save css classes, plus 0.9
                    m += 0.9
        return (p - m) / p

4. 用法

article_list_url = "https://news.163.com/domestic/"
spider = ArticleListSpider(article_list_url)
article_list = spider.download_articles()
for a in article_list:
    print(a.title + "\t" + a.url)

三、存在的问题

１.　基于密度的聚类算法，聚类结果通常对ε的取值比较敏感。尽管OPTICS算法降低了对此参数的敏感度，但网页设计千姿百态，层次深度不一。层次越深，同类元素间的距离越小，反之越大。所以ε的选取，并非所有网站皆适用。本方案ε取值0.13，是爬取上百个网站的经验值。
２.　文本密度的计算方法有些简单粗暴，等同于簇内的链接文本的平均长度。对一些边上有热门列表或者推荐列表的网页，提取出来的可能是旁边的列表。如图５所示，算法有时提取到了今日推荐，而非主要的新闻列表。

图５具有推荐列表的网页

四、应用案例

http://47.119.182.42/news/index.html

五、总结

通过基于密度的OPTICS聚类算法以及文本密度分析，可获取新闻网站的新闻列表，由此可以进一步爬取新闻正文，为文本挖掘、信息检索、机器学习提供数据支撑。

码农公寓