Is it possible to strip all tags from HTML EXCEPT anchors/links using Python / BeautifulSoup?

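I have a large chunk of HTML and I want to strip all the tags, leaving plain text, with only the <a href="url">some text</a> links left in place.

Is this possible/easy in BeautifulSoup?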

Solution:

Try this:

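from bs4 import BeautifulSoup, NavigableString, Tag

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup(doc, 'html.parser')

# Walk every node in document order (.descendants replaces the old recursiveChildGenerator()).
for node in soup.descendants:
    if isinstance(node, Tag) and node.name == 'a':
        # Keep anchors intact, markup included. Note that ('a') is just the
        # string 'a', not a tuple, so compare with == instead of "in".
        print(node)
    elif isinstance(node, NavigableString) and node.find_parent('a') is None:
        # Print text outside anchors; text inside an anchor was printed with it.
        print(node)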
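
The loop above prints the pieces one by one. If you want a single string with everything stripped except the links, one possible approach is to unwrap every non-anchor tag in place. This is only a minimal sketch, assuming BeautifulSoup 4 (the bs4 package); the function name strip_tags_except_links and the sample markup are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def strip_tags_except_links(html):
    """Return html with every tag removed except <a>, which is kept as-is.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag and returns a list, so the tree
    # can be modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name != "a":
            # unwrap() removes the tag itself but leaves its children in place.
            tag.unwrap()
    return str(soup)


doc = '<p id="firstpara">This is <i>paragraph</i> <a href="url">one</a>.</p>'
print(strip_tags_except_links(doc))
# This is paragraph <a href="url">one</a>.
```

Because unwrap() keeps a tag's contents, the text inside <title>, <script> and <style> survives as well. If that is not wanted, those tags can be removed with decompose() before the loop runs.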