Scraping Qiushibaike with Scrapy: Data Storage and Crawling Multiple Pages [Python Web Scraping: Beginner to Advanced] (17) - CFANZ Programming Community

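Preface

The previous article, "Scrapy Quick Start, Using Qiushibaike as an Example [Python Web Scraping: Beginner to Advanced] (16)", gave a brief introduction to crawling with the Scrapy framework. It only covered scraping a single page, however, and left out how to save the data and how to crawl multiple pages. This article covers both.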

Returning Data

import scrapy


class SpiderQsbkSpider(scrapy.Spider):
    # Name of the spider
    name = 'spider_qsbk'
    # Domains the spider is allowed to crawl
    allowed_domains = ['qiushibaike.com']
    # Starting page
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        print(type(response))
        # response.xpath() returns a SelectorList
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        print(type(div_list))
        for div in div_list:
            # Each element of the list is a Selector
            author = div.xpath('.//h2/text()').get().strip()
            print(author)
            content = div.xpath('.//div[@class="content"]//text()').getall()
            content = "".join(content).strip()
            duanzi = {'author': author, 'content': content}
            yield duanzi

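The two lines below are the important part. They put the author and the content into a dictionary and hand it back through a generator (yield).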

 duanzi = {'author': author, 'content': content}
 yield duanzi

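This is roughly equivalent to collecting every item in a list and returning the list at the end:

    items = []
    for div in div_list:
        ...  # build duanzi as above
        items.append(duanzi)
    return items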
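
To make the comparison concrete, here is a minimal standalone sketch (no Scrapy involved; the function names and sample data are made up for illustration). It shows that a generator and a list-returning function hand back the same items. The difference is that the generator produces them one at a time, so each item can be passed on as soon as it has been parsed.

```python
# Hypothetical sample data standing in for the parsed <div> blocks.
ROWS = [("Alice", "first joke"), ("Bob", "second joke")]


def parse_with_yield(rows):
    """Hand back items one at a time, like the spider's parse()."""
    for author, content in rows:
        yield {"author": author, "content": content}


def parse_with_list(rows):
    """Collect every item first, then return the whole list."""
    items = []
    for author, content in rows:
        items.append({"author": author, "content": content})
    return items


# Both approaches produce the same items in the same order.
print(list(parse_with_yield(ROWS)) == parse_with_list(ROWS))
```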

Storing Data

In Scrapy, the code that stores data lives in pipelines.py; in other words, the pipeline receives the items returned by the spider.

QsbkPipeline

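The QsbkPipeline class has three methods:

open_spider(self, spider): called when the spider is opened.
process_item(self, item, spider): called whenever the spider passes an item to the pipeline.
close_spider(self, spider): called when the spider is closed.
To activate the pipeline, set ITEM_PIPELINES in settings.py.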
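
As a sketch, the corresponding entry in settings.py might look like the following. The module path qsbk.pipelines.QsbkPipeline assumes the project package is named qsbk, which the article does not state, so adjust it to your own project. The number is the pipeline's priority: pipelines with lower values run first, and values are conventionally kept in the 0 to 1000 range.

```python
# settings.py (fragment)
# Assumption: the Scrapy project package is called "qsbk".
ITEM_PIPELINES = {
    "qsbk.pipelines.QsbkPipeline": 300,
}
```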

import json


class QsbkPipeline:
    def __init__(self):
        # Open the output file
        self.fp = open('duanzi.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started.....')

    def process_item(self, item, spider):
        # Serialize the item and write one JSON string per line
        item_json = json.dumps(item, ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished.....')

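The constructor opens a file named duanzi.json to store the items passed in by the spider. json.dumps converts a dictionary into a JSON string, and self.fp.write(item_json + '\n') starts a new line after each JSON string is written.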
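
The effect of ensure_ascii=False, and the one-object-per-line layout, can be checked with the standard library alone. The snippet below is a small sketch with made-up sample data; it does not touch the real duanzi.json file.

```python
import json

# Made-up sample item containing non-ASCII text.
item = {"author": "张三", "content": "一个段子"}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(item)
# ensure_ascii=False keeps the original characters readable in the file.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Each line written by the pipeline is an independent JSON document,
# so the output can be read back one line at a time.
lines = readable + "\n" + readable + "\n"
restored = [json.loads(line) for line in lines.splitlines()]
print(restored[0] == item)
```

Because the pipeline opens the file with encoding='utf-8', the unescaped characters are written as UTF-8 and stay human-readable.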

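Improving Data Transfer: Using Item as the Data Model

The item returned by the spider above is a plain dictionary, which is not the approach Scrapy recommends. The recommended way is to wrap the data in a scrapy.Item subclass. In this example the Item class is QsbkItem.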

import scrapy
class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

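Next, in the SpiderQsbkSpider spider, build a QsbkItem object to hold the returned data.

  item = QsbkItem(author=author, content=content)
  # Return it through the generator
  yield item

The benefit of this approach is that it standardizes the item's data structure, and the code is more concise.
In QsbkPipeline, only the serialization step needs a small change: item_json = json.dumps(dict(item), ensure_ascii=False).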

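Improving Data Storage: Using Scrapy's Exporter Classes

Using the JsonItemExporter Class

First, the JsonItemExporter class. It writes bytes, so the file passed to it must be opened in binary write mode (wb) in the call to open. The encoding is set to utf-8 and ASCII escaping is turned off (ensure_ascii=False).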

from scrapy.exporters import JsonItemExporter


class QsbkPipeline:
    def __init__(self):
        # Open the output file in binary mode and start the export
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonItemExporter(self.fp, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print('Spider started.....')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Finish the export before closing the file
        self.exporter.finish_exporting()
        self.fp.close()
        print('Spider finished.....')

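Call self.exporter.start_exporting() before any item is exported, and self.exporter.finish_exporting() once the spider has finished, so that the export is opened and closed properly.
After a run, duanzi.json contains all items as a single JSON array.
In this mode, start_exporting writes the opening bracket of the array, each export_item call appends one item, and finish_exporting writes the closing bracket. The file is therefore valid JSON only after finish_exporting has run, and a program that later reads it usually has to parse the whole array into memory at once, which becomes costly for large data sets. There is another approach that handles this better.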

Using JsonLinesItemExporter

JsonLinesItemExporter writes each item to the file as soon as export_item is called. There is no need to call start_exporting or finish_exporting.

from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline:
    def __init__(self):
        # Open the output file in binary mode
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, encoding='utf-8', ensure_ascii=False)

    def open_spider(self, spider):
        print('Spider started.....')

    def process_item(self, item, spider):
        # Each item is written as its own line
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished.....')

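Summary

When saving JSON data, the JsonItemExporter and JsonLinesItemExporter classes make the job simpler.

JsonItemExporter: writes all items into one JSON array. The advantage is that the finished file is a valid JSON document. The drawback is that the file is only complete once finish_exporting has run, and reading a large file back typically means loading the whole array into memory.
JsonLinesItemExporter: writes one item per line each time export_item is called. The drawback is that the file as a whole is not a single valid JSON document, since each line is a separate dictionary. The advantage is that every line is self-contained, so the data can be processed line by line with little memory, and items already written remain usable if the crawl is interrupted.

Scrapy also provides other exporter classes, such as XmlItemExporter and CsvItemExporter.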
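
The difference between the two output formats matters mostly when the file is read back. The sketch below uses only the standard json module and made-up items; the strings it builds have the same shape as the two exporters' output (one array versus one object per line), though the exact whitespace Scrapy writes may differ.

```python
import io
import json

# Made-up items for illustration.
items = [{"author": "a", "content": "x"}, {"author": "b", "content": "y"}]

# JSON array: the whole file is one document.
array_text = json.dumps(items, ensure_ascii=False)

# JSON Lines: one document per line.
lines_text = "".join(json.dumps(i, ensure_ascii=False) + "\n" for i in items)

# Reading the array means parsing the whole document at once.
from_array = json.loads(array_text)

# Reading JSON Lines can be done one record at a time.
from_lines = [json.loads(line) for line in io.StringIO(lines_text)]

print(from_array == from_lines)
```

Passing the JSON Lines text to json.loads in one go fails, because the parser finds extra data after the first object; that is the sense in which the file is not a single JSON document.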

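Crawling Multiple Pages

With scraping and storing a single page done, the next step is crawling multiple pages. The idea is to extract the link to the next page from the current page and keep following it until the last page is reached.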

import scrapy

# Assumes the default Scrapy project layout, with QsbkItem defined in items.py
from ..items import QsbkItem


class SpiderQsbkSpider(scrapy.Spider):
    base_domain = 'https://www.qiushibaike.com'
    # Name of the spider
    name = 'spider_qsbk'
    # Domains the spider is allowed to crawl
    allowed_domains = ['qiushibaike.com']
    # Starting page
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        print(type(response))
        # response.xpath() returns a SelectorList
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        print(type(div_list))
        for div in div_list:
            # Each element of the list is a Selector
            author = div.xpath('.//h2/text()').get().strip()
            print(author)
            content = div.xpath('.//div[@class="content"]//text()').getall()
            content = "".join(content).strip()
            item = QsbkItem(author=author, content=content)
            # Return it through the generator
            yield item
        # Get the link to the next page
        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if not next_url:
            return
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)

The key part is the code below.

  next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
  if not next_url:
        return
  else:
        yield scrapy.Request(self.base_domain + next_url, callback=self.parse)

The next-page link the spider extracts is /text/page/2/, so the domain has to be prepended to form a complete URL.
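
String concatenation works here because the extracted href always starts with a slash. As a hedged alternative, the standard library's urljoin resolves an href against the current page URL and also handles hrefs that are already absolute. The sketch below is standalone and uses the URLs from this article as sample values.

```python
from urllib.parse import urljoin

current_page = "https://www.qiushibaike.com/text/page/1/"
next_href = "/text/page/2/"

# Plain concatenation, as in the spider above.
by_concat = "https://www.qiushibaike.com" + next_href

# urljoin resolves the href against the current page URL.
by_join = urljoin(current_page, next_href)

# An href that is already absolute is returned unchanged.
absolute = urljoin(current_page, "https://www.qiushibaike.com/text/page/3/")

print(by_join)
print(absolute)
```

Inside a Scrapy callback, response.urljoin(next_url) performs the same resolution against the URL of the response, which removes the need for the base_domain attribute.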
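Each call to parse yields scrapy.Request(self.base_domain + next_url, callback=self.parse), which requests the next page, so the crawl keeps going until no next link is found. The callback parameter specifies the method that handles the response.
You also need to set DOWNLOAD_DELAY = 1 in settings.py so that there is a 1-second delay between downloads.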
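
The way a callback keeps scheduling itself can be illustrated with a toy model. This is only a sketch of the idea, not how Scrapy's engine is implemented (the real engine is asynchronous); FAKE_SITE, the tuple-based request, and the crawl function are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical site: each "page" has some items and an optional next link.
FAKE_SITE = {
    "/text/page/1/": {"items": ["joke 1", "joke 2"], "next": "/text/page/2/"},
    "/text/page/2/": {"items": ["joke 3"], "next": "/text/page/3/"},
    "/text/page/3/": {"items": ["joke 4"], "next": None},
}


def parse(url):
    """Yield the page's items, then a follow-up request if there is one."""
    page = FAKE_SITE[url]
    for text in page["items"]:
        yield {"content": text}
    if page["next"]:
        # Stand-in for scrapy.Request(url, callback=self.parse).
        yield ("request", page["next"], parse)


def crawl(start_url):
    """Toy scheduler: run callbacks until no requests remain."""
    collected = []
    queue = deque([(start_url, parse)])
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, tuple) and result[0] == "request":
                queue.append((result[1], result[2]))
            else:
                collected.append(result)
    return collected


print([item["content"] for item in crawl("/text/page/1/")])
```

The crawl stops on its own at the last page because parse yields no further request when there is no next link, which mirrors the "if not next_url: return" branch in the spider.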

Conclusion

This article briefly showed how to store scraped data and crawl multiple pages of Qiushibaike with the Scrapy framework.
