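Test code structure:
The demo does not actually crawl any web pages; its main purpose is to demonstrate the duplicate-import problem.
The spider directory holds the individual business spiders, which submit tasks to the crawler.
The crawler keeps a task queue that collects the tasks submitted by each business spider, and then performs the actual crawling in a separate thread.
main starts the crawler and each business spider.
Both main.py and base_spider.py import crawler. Because the two files import it in different ways (under different module names), the module is imported twice and each import creates its own crawler instance. The spiders add tasks to one instance while the crawler thread reads from the other, so the task queue it sees is always empty.
The demo code is as follows: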
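The mechanism described above can be reproduced in isolation. The following is a minimal sketch, not the project's own code: it writes a throwaway package to a temporary directory and makes the same file importable under two names. The names `demo_pkg` and `demo_crawler` are hypothetical and chosen only for illustration. Python caches modules in `sys.modules` by module name, not by file path, so the file is executed twice and produces two separate instances.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: <tmp>/demo_pkg/demo_crawler.py
# (demo_pkg and demo_crawler are hypothetical names for this sketch).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "demo_crawler.py").write_text(
    "class Crawler:\n"
    "    def __init__(self):\n"
    "        self.tasks = []\n"
    "\n"
    "crawler = Crawler()\n"
)

# Make the same file reachable under two different module names.
sys.path.insert(0, str(root))  # enables: import demo_pkg.demo_crawler
sys.path.insert(0, str(pkg))   # enables: import demo_crawler
importlib.invalidate_caches()

by_short_name = importlib.import_module("demo_crawler")
by_full_name = importlib.import_module("demo_pkg.demo_crawler")

same_file = (
    Path(by_short_name.__file__).resolve() == Path(by_full_name.__file__).resolve()
)
same_instance = by_short_name.crawler is by_full_name.crawler

# A task added through one name is invisible through the other.
by_full_name.crawler.tasks.append(1)

print("same file:", same_file)          # same file: True
print("same instance:", same_instance)  # same instance: False
print("tasks seen via short name:", by_short_name.crawler.tasks)  # []
```

Both names point at the same source file, yet `sys.modules` holds two entries, each with its own `crawler` object. Importing the module through one spelling everywhere avoids the split.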
main.py
# main.py

# This import style differs from the one used in base_spider.
# It imports the crawler module a second time, so the crawler here
# is not the same instance as the crawler in base_spider.
# from crawler import crawler

# Import the crawler instance the same way base_spider does,
# so that only one crawler instance exists.
from import_test.crawler import crawler
from spider.tb_spider import TbSpider

if __name__ == '__main__':
    crawler.run()
    tb_spider = TbSpider()
    tb_spider.run()

'''
Output when crawler is imported inconsistently:
crawler run
crawler task length: 0
TbSpider add task: 1
TbSpider add task: 2
TbSpider add task: 3
TbSpider add task: 4
crawler task length: 0
crawler task length: 0
crawler task length: 0
crawler task length: 0
'''

'''
Output when crawler is imported consistently:
crawler run
crawler task length: 0
TbSpider add task: 1
TbSpider add task: 2
TbSpider add task: 3
TbSpider add task: 4
crawler task length: 4
4
crawler task length: 3
3
crawler task length: 2
2
crawler task length: 1
1
crawler task length: 0
crawler task length: 0
'''
crawler.py
# crawler.py
import threading
import time


class Crawler:
    def __init__(self):
        self._task_queue = []

    def length(self):
        return len(self._task_queue)

    def add_task(self, task):
        self._task_queue.append(task)

    def do_task(self):
        # Poll the queue forever; sleep briefly whenever it is empty.
        while True:
            print('crawler task length: ', self.length())
            if self.length() <= 0:
                time.sleep(2)
                continue
            print(self._task_queue.pop())

    def run(self):
        print('crawler run')
        crawler_thread = threading.Thread(target=self.do_task, name='crawler_thread')
        crawler_thread.start()


# Module-level instance: a new one is created each time this file
# is executed under a different module name.
crawler = Crawler()
base_spider.py
# base_spider.py
from import_test.crawler import crawler


class BaseSpider:
    def __init__(self):
        # Every spider shares the crawler instance imported above.
        self.crawler = crawler

    def run(self):
        raise NotImplementedError()
tb_spider.py
# tb_spider.py
from spider.base_spider import BaseSpider


class TbSpider(BaseSpider):
    def run(self):
        # Submit four dummy tasks (1 to 4) to the shared crawler.
        for i in range(1, 5):
            print('TbSpider add task: ', i)
            self.crawler.add_task(i)
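When the queue stays empty as in the first output above, one way to confirm that a duplicate import is the cause is to look for source files that appear in `sys.modules` under more than one module object. The helper below is a rough diagnostic sketch, not part of the original demo; `find_duplicate_modules`, `load_as`, and the `demo_dup_*` names are hypothetical. It simulates the problem by loading one temporary file under two names and then reports it.

```python
import importlib.util
import sys
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_modules():
    """Return {source file: [module names]} for files loaded as more than one module object."""
    by_file = defaultdict(dict)
    for name, module in list(sys.modules.items()):
        path = getattr(module, "__file__", None)
        if not isinstance(path, str):
            continue  # built-in modules and namespace packages have no file
        by_file[str(Path(path).resolve())][name] = id(module)
    # Plain aliases (one module object under two names) are not reported.
    return {
        path: sorted(names)
        for path, names in by_file.items()
        if len(set(names.values())) > 1
    }


def load_as(name, path):
    """Execute the file at `path` and register it in sys.modules under `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


# Simulate the bug: one file, two module names.
src = Path(tempfile.mkdtemp()) / "demo_dup_crawler.py"
src.write_text("crawler = object()\n")
load_as("demo_dup_crawler", src)
load_as("demo_dup_pkg.demo_dup_crawler", src)

duplicates = find_duplicate_modules()
for path, names in duplicates.items():
    print(path, names)
```

Calling `find_duplicate_modules()` near the end of `main.py` in the inconsistent-import case would be expected to list the crawler file under both of its module names, assuming both imports have already run.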