Removing text from an HTML file while keeping the JavaScript and structure, using Python

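There are plenty of ways to extract the text from an HTML file, but I want to do the opposite: remove the text while leaving the structure and the JavaScript code intact.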

For example, remove all … while keeping …

Is there a shortcut for this? Any help is greatly appreciated.
Cheers

Solution:

I would go with BeautifulSoup:

from bs4 import BeautifulSoup
from bs4.element import NavigableString
from copy import copy

def strip_content(in_tag):
    tag = copy(in_tag) # remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # strip content from all children
    children = [strip_content(child) for child in tag.children if not isinstance(child, NavigableString)]
    # remove everything from the tag
    tag.clear()
    for child in children:
        # Add back stripped children
        tag.append(child)
    return tag

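def test(filename):
    # Open in binary mode so BeautifulSoup can detect the encoding itself,
    # close the file when done, and name the parser explicitly
    with open(filename, "rb") as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())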

if __name__ == "__main__":
    test("myfile.html")
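
If you don't need to keep the original tree, a shorter in-place variant might also do the job. This is only a sketch, not part of the answer above: the helper name `strip_text_in_place`, the `keep` parameter, and the sample markup are made up for illustration. It walks every string node with `find_all(string=True)` and extracts it unless its parent is a tag you want to leave alone (`script` by default).

```python
from bs4 import BeautifulSoup


def strip_text_in_place(soup, keep=("script",)):
    """Remove every string node whose parent tag is not listed in `keep`.

    `keep` is a hypothetical parameter for this sketch; add "style" to it
    if inline CSS should survive as well.
    """
    # find_all(string=True) returns a list, so extracting nodes
    # while looping over it does not disturb the iteration.
    for node in soup.find_all(string=True):
        if node.parent is not None and node.parent.name in keep:
            continue  # leave script bodies untouched
        node.extract()  # detach the text node from the tree
    return soup


# Sample markup, invented for this example
html = (
    '<html><head><script>var x = 1;</script></head>'
    '<body><p id="a">Hello <b>world</b></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
result = str(strip_text_in_place(soup))
print(result)
```

Like the recursive version, this also drops comments and the doctype, because BeautifulSoup represents them as string nodes too. Tags and their attributes are left as they were.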