Scraping Classical Chinese Poetry

Requirement: scrape classical poems from the gushiwen.org site, including each poem's title, author, dynasty, and body text.

  • Create the project
Open a terminal:
cd <your working folder>
scrapy startproject gsw
scrapy genspider gs gushiwen.org
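After these commands, scrapy startproject generates roughly the following layout (the exact template may vary slightly across Scrapy versions):

    gsw/
        scrapy.cfg           # deploy configuration
        gsw/
            __init__.py
            items.py         # item definitions (edited below)
            middlewares.py
            pipelines.py     # item pipelines (edited below)
            settings.py      # project settings (edited below)
            spiders/
                __init__.py
                gs.py        # created by scrapy genspider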
  • Create the spider

    • Set the spider name
    • Set allowed_domains
    • Set start_urls
    • Configure settings.py: log level, pipeline registration, and request headers; then define the items
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
        'Accept-Language': 'en',
    }

    LOG_LEVEL = 'WARNING'
    ITEM_PIPELINES = {
        'gsw.pipelines.GswPipeline': 300,
    }
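Note that Scrapy projects default to ROBOTSTXT_OBEY = True, which can cause requests to be filtered out depending on the site's robots.txt. If your requests are being dropped, a common addition to settings.py is the following (an assumption about your setup, not part of the original walkthrough):

    # Assumption: only disable this if the crawl is blocked by robots.txt
    # and the site's terms of use permit scraping
    ROBOTSTXT_OBEY = False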
    import scrapy


    class GswItem(scrapy.Item):
        # Define the fields to scrape up front. Note: mark the Scrapy project
        # directory as the root/source root so "from gsw.items import GswItem" resolves.
        title = scrapy.Field()    # poem title
        chaodai = scrapy.Field()  # dynasty
        zuozhe = scrapy.Field()   # author
        zw = scrapy.Field()       # body text
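Item objects behave like dictionaries, which is what the pipeline relies on later when it calls dict(item). A minimal sketch (the field values here are made-up examples):

    item = GswItem(title='静夜思', chaodai='唐代', zuozhe='李白', zw='床前明月光……')
    print(item['title'])  # dict-style access
    print(dict(item))     # plain dict, as used in the pipeline below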
  • Implement the data-extraction method

  • Save the data in the pipeline

  • The spider file (gs.py)

import scrapy
from gsw.items import GswItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['http://gushiwen.org/']

    def parse(self, response):
        sons = response.xpath('//div[@class="left"]//div[@class="sons"]')
        for son in sons:
            title = son.xpath('.//div[1]/p[1]/a/b/text()').extract_first()
            try:
                chaodai = son.xpath('.//div[1]/p[2]/a[1]/text()').extract_first()
                zuozhe = son.xpath('.//div[1]/p[2]/a[2]/text()').extract_first()
                # extract() returns a list of text nodes
                zhengwen = son.xpath('.//div[1]/div[@class="contson"]/text()').extract()
                zw = ''.join(zhengwen).strip()  # join the lines and trim surrounding whitespace
                item = GswItem(
                    title=title, chaodai=chaodai, zuozhe=zuozhe, zw=zw
                )
                yield item
            except Exception:
                print(title)  # log the poem whose fields could not be extracted
        # Pagination: the href looks fully qualified when printed in the browser,
        # but the value actually extracted has no domain
        next_href = response.xpath('//a[@id="amore"]/@href').extract_first()
        if next_href:
            next_url = response.urljoin(next_href)  # joins the relative href onto response.url
            yield scrapy.Request(next_url)
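response.urljoin simply joins the extracted relative href onto response.url. The same behavior can be sketched with the standard library (the URLs below are made-up examples of what the site might return):

    from urllib.parse import urljoin

    page_url = 'https://www.gushiwen.org/default_1.aspx'  # hypothetical current page
    next_href = '/default_2.aspx'                         # hypothetical extracted href
    print(urljoin(page_url, next_href))  # https://www.gushiwen.org/default_2.aspx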
  • The pipeline file (pipelines.py)
import json


class GswPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('spider started')

    def process_item(self, item, spider):
        print(item)
        # json.dumps needs a plain dict, so convert the Item first
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.f.close()  # release the file handle when the crawl ends
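Because each item is written as one JSON object per line (JSON Lines), reading the results back only requires parsing line by line. A minimal sketch, assuming the crawl has already produced demo.json:

    import json

    with open('demo.json', encoding='utf-8') as f:
        poems = [json.loads(line) for line in f if line.strip()]

    print(len(poems), 'poems loaded')
    if poems:
        print(poems[0]['title'], poems[0]['zuozhe'])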
  • The start file (start.py)
from scrapy import cmdline

# Equivalent form: cmdline.execute('scrapy crawl gs'.split())
cmdline.execute(['scrapy', 'crawl', 'gs'])
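Alternatively, Scrapy's built-in feed export can write the items without a custom pipeline; passing -o appends items to the named file (newer Scrapy versions also accept -O to overwrite). A sketch:

    from scrapy import cmdline

    # Feed export instead of the custom pipeline; both can also run together
    cmdline.execute(['scrapy', 'crawl', 'gs', '-o', 'poems.json'])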