Scraping Classical Chinese Poetry

Requirement: scrape classical poems from the gushiwen.org site, including each poem's title, author, dynasty, and body text.

  • Create the project
Open a terminal:
cd <your working folder>
scrapy startproject gsw
scrapy genspider gs gushiwen.org
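After these commands, scrapy startproject generates roughly the following layout (the exact template may vary slightly across Scrapy versions):

    gsw/
        scrapy.cfg           # deploy configuration
        gsw/
            __init__.py
            items.py         # item definitions (edited below)
            middlewares.py
            pipelines.py     # item pipelines (edited below)
            settings.py      # project settings (edited below)
            spiders/
                __init__.py
                gs.py        # created by scrapy genspider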
  • Create the spider

    • Set the spider name
    • Set allowed_domains
    • Set start_urls
    • Configure settings.py: log level, pipeline registration, and request headers; then define the items
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
        'Accept-Language': 'en',
    }

    LOG_LEVEL = 'WARNING'
    ITEM_PIPELINES = {
        'gsw.pipelines.GswPipeline': 300,
    }
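Note that Scrapy projects default to ROBOTSTXT_OBEY = True, which can cause requests to be filtered out depending on the site's robots.txt. If your requests are being dropped, a common addition to settings.py is the following (an assumption about your setup, not part of the original walkthrough):

    # Assumption: only disable this if the crawl is blocked by robots.txt
    # and the site's terms of use permit scraping
    ROBOTSTXT_OBEY = False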
    import scrapy


    class GswItem(scrapy.Item):
        # Define the fields to scrape up front. Note: mark the Scrapy project
        # directory as the root/source root so "from gsw.items import GswItem" resolves.
        title = scrapy.Field()    # poem title
        chaodai = scrapy.Field()  # dynasty
        zuozhe = scrapy.Field()   # author
        zw = scrapy.Field()       # body text
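Item objects behave like dictionaries, which is what the pipeline relies on later when it calls dict(item). A minimal sketch (the field values here are made-up examples):

    item = GswItem(title='静夜思', chaodai='唐代', zuozhe='李白', zw='床前明月光……')
    print(item['title'])  # dict-style access
    print(dict(item))     # plain dict, as used in the pipeline below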
  • Implement the data-extraction method

  • Save the data in the pipeline

  • The spider file (gs.py)

import scrapy
from gsw.items import GswItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['http://gushiwen.org/']

    def parse(self, response):
        sons = response.xpath('//div[@class="left"]//div[@class="sons"]')
        for son in sons:
            title = son.xpath('.//div[1]/p[1]/a/b/text()').extract_first()
            try:
                chaodai = son.xpath('.//div[1]/p[2]/a[1]/text()').extract_first()
                zuozhe = son.xpath('.//div[1]/p[2]/a[2]/text()').extract_first()
                # extract() returns a list of text nodes
                zhengwen = son.xpath('.//div[1]/div[@class="contson"]/text()').extract()
                zw = ''.join(zhengwen).strip()  # join the lines and trim surrounding whitespace
                item = GswItem(
                    title=title, chaodai=chaodai, zuozhe=zuozhe, zw=zw
                )
                yield item
            except Exception:
                print(title)  # log the poem whose fields could not be extracted
        # Pagination: the href looks fully qualified when printed in the browser,
        # but the value actually extracted has no domain
        next_href = response.xpath('//a[@id="amore"]/@href').extract_first()
        if next_href:
            next_url = response.urljoin(next_href)  # joins the relative href onto response.url
            yield scrapy.Request(next_url)
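response.urljoin simply joins the extracted relative href onto response.url. The same behavior can be sketched with the standard library (the URLs below are made-up examples of what the site might return):

    from urllib.parse import urljoin

    page_url = 'https://www.gushiwen.org/default_1.aspx'  # hypothetical current page
    next_href = '/default_2.aspx'                         # hypothetical extracted href
    print(urljoin(page_url, next_href))  # https://www.gushiwen.org/default_2.aspx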
  • The pipeline file (pipelines.py)
import json


class GswPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('spider started')

    def process_item(self, item, spider):
        print(item)
        # json.dumps needs a plain dict, so convert the Item first
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.f.close()  # release the file handle when the crawl ends
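Because each item is written as one JSON object per line (JSON Lines), reading the results back only requires parsing line by line. A minimal sketch, assuming the crawl has already produced demo.json:

    import json

    with open('demo.json', encoding='utf-8') as f:
        poems = [json.loads(line) for line in f if line.strip()]

    print(len(poems), 'poems loaded')
    if poems:
        print(poems[0]['title'], poems[0]['zuozhe'])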
  • The start file (start.py)
from scrapy import cmdline

# Equivalent form: cmdline.execute('scrapy crawl gs'.split())
cmdline.execute(['scrapy', 'crawl', 'gs'])
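Alternatively, Scrapy's built-in feed export can write the items without a custom pipeline; passing -o appends items to the named file (newer Scrapy versions also accept -O to overwrite). A sketch:

    from scrapy import cmdline

    # Feed export instead of the custom pipeline; both can also run together
    cmdline.execute(['scrapy', 'crawl', 'gs', '-o', 'poems.json'])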