A Simple Look at HTMLParser in Python

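Pick a web page, for example https://www.python.org/events/python-events/, open its source in a browser and copy it into a local file (s.html here). Then try parsing the HTML to print the time, name and location of each conference listed on the Python website.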
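If copying the source by hand gets tedious, the page can also be saved from a script. The following is only a rough sketch using the standard library's `urllib.request`; the function name `save_page` and the 30-second timeout are my own assumptions, and some sites may reject requests that do not send a browser-like User-Agent.

```python
from pathlib import Path
from urllib.request import urlopen


def save_page(url, path="s.html"):
    """Download url, save the decoded HTML to path, return the character count."""
    with urlopen(url, timeout=30) as resp:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    Path(path).write_text(html, encoding="utf-8")
    return len(html)


# Example call (needs network access, so it is left commented out):
# save_page("https://www.python.org/events/python-events/")
```

The file is always written as UTF-8, so it should be opened with `encoding='utf-8'` when it is read back for parsing.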

    from html.parser import HTMLParser


    class MyHTMLParser(HTMLParser):
        # Flags that record which element the parser is currently inside.
        in_title = False
        in_loca = False
        in_time = False

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) tuples.
            if ('class', 'event-title') in attrs:
                self.in_title = True
            elif ('class', 'event-location') in attrs:
                self.in_loca = True
            elif tag == 'time':
                self.in_time = True
                self.times = []  # start collecting the text pieces of this <time>

        def handle_data(self, data):
            if self.in_title:
                print('-' * 50)
                print('Title:' + data.strip())
            if self.in_loca:
                print('Location:' + data.strip())
            if self.in_time:
                self.times.append(data)

        def handle_endtag(self, tag):
            if tag == 'h3':
                self.in_title = False
            if tag == 'span':
                self.in_loca = False
            if tag == 'time':
                self.in_time = False
                # Join every piece collected inside <time>, including the nested <span>.
                print('Time:' + '-'.join(self.times))


    parser = MyHTMLParser()
    with open('s.html', encoding='utf-8') as html:
        parser.feed(html.read())

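The parts to focus on are the two `time` branches: the one in handle_starttag and the one in handle_endtag. When HTMLParser parses the text in a page, it does not hand an element's text to handle_data in one piece. It passes the text one string at a time, and a nested tag (or, depending on the settings, an entity reference) ends the current string. Take these three fragments from the page: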

  <h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>

  <span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>

  <time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. <span class="say-no-more"> 2016</span></time>

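In Python 3.4 and earlier, or with convert_charrefs=False, an entity reference such as &ndash; is not passed to handle_data. It goes to handle_entityref, which does nothing unless overridden, so the entity is effectively skipped and the text before and after it arrives as two separate strings. With convert_charrefs=True (the default since Python 3.5), the entity is converted to the corresponding character and stays in the text. In both cases the nested span starts a new string. That is why the two `time` branches work together: handle_starttag resets self.times, handle_data collects every string seen inside the time element, and handle_endtag joins and prints them, which is how the year 2016 inside the span gets picked up.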
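To see the chunking directly, here is a small sketch that records every string passed to handle_data for the `<time>` fragment above, once with convert_charrefs=False and once with convert_charrefs=True. The class names `ChunkRecorder` and `EntityAwareRecorder` and the helper `chunks_for` are made up for this illustration. The second class shows one possible way to keep the entity when convert_charrefs is off, by translating it through `html.entities.name2codepoint`.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser

SNIPPET = ('<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
           '<span class="say-no-more"> 2016</span></time>')


class ChunkRecorder(HTMLParser):
    """Records every string that reaches handle_data."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class EntityAwareRecorder(ChunkRecorder):
    """Also records named entities, translated to their characters."""

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.chunks.append(chr(name2codepoint[name]))


def chunks_for(parser_cls, convert_charrefs):
    parser = parser_cls(convert_charrefs=convert_charrefs)
    parser.feed(SNIPPET)
    parser.close()
    return parser.chunks


raw = chunks_for(ChunkRecorder, False)
converted = chunks_for(ChunkRecorder, True)
manual = chunks_for(EntityAwareRecorder, False)

# ascii() keeps the en dash visible as the escape \u2013.
print(ascii(raw))        # ['29 July ', ' 01 Aug. ', ' 2016']
print(ascii(converted))  # ['29 July \u2013 01 Aug. ', ' 2016']
print(ascii(manual))     # ['29 July ', '\u2013', ' 01 Aug. ', ' 2016']
```

With the entity dropped, `'-'.join(raw)` gives `29 July - 01 Aug. - 2016`, so the hyphen used for joining happens to stand in for the missing dash.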
