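There are plenty of ways to extract the text from an HTML file, but I want to do the opposite: remove the text while leaving the structure and the JavaScript code intact.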
For example, remove all ... while keeping ...
Is there a shortcut for this? Any help is greatly appreciated.
Cheers
Solution:
I would go with BeautifulSoup:
from bs4 import BeautifulSoup
from bs4.element import NavigableString
from copy import copy


def strip_content(in_tag):
    tag = copy(in_tag)  # remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # strip content from all children (text nodes are skipped here)
    children = [strip_content(child) for child in tag.children
                if not isinstance(child, NavigableString)]
    # remove everything from the tag
    tag.clear()
    for child in children:
        # Add back stripped children
        tag.append(child)
    return tag


def test(filename):
    # Close the file when done and name the parser explicitly
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())


if __name__ == "__main__":
    test("myfile.html")
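
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is a different technique from the answer above (a streaming parser that re-emits tags rather than editing a tree), and the names TextStripper and strip_text are made up for illustration. It assumes reasonably well-formed HTML, keeps start tags exactly as written, keeps script bodies, and drops all other text and comments:

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit the markup while dropping text outside <script> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []         # pieces of the output document
        self.in_script = False  # True while inside a <script> element

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source
        self.parts.append(self.get_starttag_text())
        if tag == "script":
            self.in_script = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep JavaScript, drop every other piece of text
        if self.in_script:
            self.parts.append(data)

    def handle_decl(self, decl):
        # Preserve declarations such as <!DOCTYPE html>
        self.parts.append(f"<!{decl}>")


def strip_text(html):
    parser = TextStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With this sketch, strip_text('<p>Hello <b>world</b></p>') should come back as '<p><b></b></p>'. Unlike the BeautifulSoup version it does not repair broken markup, so unclosed tags in the input stay unclosed in the output.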