Note

Scrapy: An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

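These are my notes from learning Scrapy, the Python crawling framework. They record common operations and pitfalls so I don't have to dig through scattered material again later, and beginners can use them to get started quickly.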

Initialization and basic operations

Install the base library

shell
pip3 install scrapy

Initialize the crawler project

shell
scrapy startproject <myspider (project name)>

Create a spider

shell
scrapy genspider <blogspider (spider name)> <xxx.com (target site)>

Run the spider

shell
# --nolog disables all log output, including error logs
scrapy crawl <myspider (spider name)> --nolog

XPath

Tip: you can copy an element's XPath directly from the browser's developer tools.

python
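# // searches the whole document
# Match a specific attribute value
response.xpath('//div[@class="post-card-title"]')
# When the target element has several classes, use contains() instead:
response.xpath('//div[contains(@class,"post-card-title")]')

# Extract data
node.xpath('./a/div/text()')[0].extract()  # raises IndexError if nothing matches
node.xpath('./a/div/text()').extract_first()  # returns None instead of failing on an empty result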

Crawl logic: spider/spiders/xxx.py

python
import scrapy

class CharerblogSpider(scrapy.Spider):
    name = "charerblog"
    allowed_domains = ["blog.charer.info"]
    start_urls = ["https://blog.charer.info/"]

    def parse(self, response):
        # print(response.body)
        titlelist = response.xpath('//div[contains(@class,"post-card-title")]')
        for node in titlelist:
            temp = {}
            temp['title'] = node.xpath('./a/div/text()').extract_first()
            # Hand the item over to the pipeline
            yield temp

Data processing and storage

Enable the pipeline (the ITEM_PIPELINES dict)

python

# settings.py: uncomment this setting (around line 67)

ITEM_PIPELINES = {
   "spider.pipelines.SpiderPipeline": 300,  # the number sets the execution order; lower values run first
}

spider/pipelines.py implements the storage logic

python
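import json


class SpiderPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open("blog.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs once for every item the spider yields
        item = dict(item)
        line = json.dumps(item, ensure_ascii=False) + ',\n'
        self.file.write(line)
        # Return the item to the engine so later pipelines can use it
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()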
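
Because every item is written with a trailing `,\n`, blog.json as a whole is not valid JSON and cannot be loaded back with `json.load`. One possible alternative, sketched below, is to write JSON Lines (one object per line). The class name `JsonLinesPipeline`, the default file name `blog.jsonl`, and the helper `read_items` are hypothetical names chosen for this example; they are not part of Scrapy.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="blog.jsonl"):
        # The default path is an assumption; the pipeline also works when
        # it is constructed without arguments
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON document per line, without a trailing comma
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()


def read_items(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

To try it, the class would be registered in ITEM_PIPELINES in the same way as SpiderPipeline. The plain class can also be exercised without running a crawl by calling open_spider, process_item, and close_spider by hand.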

Troubleshooting

Error

AttributeError: 'AsyncioSelectorReactor' object has no attribute '_handleSignals'

Install a pinned version of Twisted, and confirm that the environment you run the spider in actually uses that version.

shell
# Uninstall the currently installed version
pip3 uninstall Twisted
# Install the pinned version
pip3 install Twisted==22.10.0
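
To confirm which version the current environment really uses, a small check with the standard library can help. This is only a sketch: `installed_version` is a helper name made up for this example, and it needs to be run with the same Python interpreter that runs Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print what this interpreter sees for the two relevant packages
    for name in ("Twisted", "Scrapy"):
        print(name, installed_version(name))
```

If the printed Twisted version differs from the one just installed, pip3 and the interpreter running the spider most likely belong to different environments.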
