scrapy

Scrapy workflow

  • First, the Spider (crawler) hands the URLs that need to be requested, via the ScrapyEngine (engine), to the Scheduler.
  • After sorting and enqueuing the requests, the Scheduler sends them back through the ScrapyEngine to the DownloaderMiddlewares (user-agent, cookie, proxy) and on to the Downloader.
  • The Downloader issues the requests to the internet and receives the responses; each response travels back through the ScrapyEngine to the SpiderMiddlewares and is handed to the Spiders.
  • The Spiders process the response, extract the data, and pass it through the ScrapyEngine to the ItemPipeline (pipeline), which saves it.
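
The flow is easiest to see in a minimal spider sketch. Everything below is illustrative only (the spider name, URL, and XPath are placeholders, not the project built later in these notes): requests yielded by the spider travel engine → scheduler → downloader middlewares → downloader, the response comes back through the spider middlewares into parse(), and every yielded item is forwarded to the item pipeline.

import scrapy

class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'                    # hypothetical spider name
    start_urls = ['http://example.com/']  # placeholder URL

    def parse(self, response):
        # the downloader fetched this response; extract data from it
        for title in response.xpath('//a/text()').getall():
            yield {'title': title}        # items go to the ItemPipeline
        # yielding a Request sends it back through the engine to the scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)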

Getting started with Scrapy

  • Open the Terminal panel in PyCharm

  • Change into the target folder: cd xxx

  • scrapy startproject xx (creates a Scrapy project named xx)

  • cd xx

  • scrapy genspider xxxx douban.com (xxxx is the spider name, not the project name; douban.com becomes the allowed domain)
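    • After these two commands the project layout should look roughly like this (assuming the project is called myscrapy and the spider is called db, matching the code used later in these notes):

    myscrapy/
        scrapy.cfg
        myscrapy/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                db.py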

  • Enable the pipeline in settings.py

    ITEM_PIPELINES = {
        'myscrapy.pipelines.MyscrapyPipeline': 300,
    }
  • scrapy crawl xxxx starts the crawl (xxxx is the spider name; if the site returns 403, it can be fixed by setting request headers)

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    'Accept-Language': 'en',
}  # copy these headers from your own browser and put them in settings.py
  • A second way to start the crawl: create a start.py file under the project folder xx
from scrapy import cmdline

# cmdline.execute('scrapy crawl xxxx'.split())
cmdline.execute(['scrapy', 'crawl', 'xxxx'])
  • The crawler produces a lot of log output; it can be filtered with a setting in settings.py
LOG_LEVEL = 'WARNING'
  • The response passed to parse() is of class <class 'scrapy.http.response.html.HtmlResponse'>; the class comes from Scrapy's own module and can be imported when needed
from scrapy.http.response.html import HtmlResponse
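  • A common reason to import HtmlResponse is to annotate the response argument of parse(), so the editor knows its type and can autocomplete xpath() and the other response attributes. A minimal sketch, assuming the import is used only for IDE support and changes nothing at runtime:

import scrapy
from scrapy.http.response.html import HtmlResponse

class DbSpider(scrapy.Spider):
    name = 'db'
    start_urls = ['http://douban.com/']

    def parse(self, response: HtmlResponse):
        # the annotation lets the editor autocomplete response.xpath(), response.url, etc.
        li_list = response.xpath('//div[@class="side-links nav-anon"]/ul/li')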
  • The response object has XPath support built in
li_list = response.xpath('//div[@class="side-links nav-anon"]/ul/li')
  • In the generated spider file:
import scrapy
from scrapy.http.response.html import HtmlResponse

class DbSpider(scrapy.Spider):
    name = 'db'
    allowed_domains = ['douban.com']
    start_urls = ['http://douban.com/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="side-links nav-anon"]/ul/li')
        # use a dict to hold the data
        # extracting the tag content:
        #   get() returns the matched node together with its tags
        #   extract_first() on a text() selector returns just the text
        item = {}
        for li in li_list:
            item['name'] = li.xpath('a/em/text()').extract_first()
            if item['name'] is None:
                item['name'] = li.xpath('a/text()').extract_first()
            yield item  # yield hands the item to the pipeline for processing
  • And in pipelines.py:

import json

class MyscrapyPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')  # create the output file

    def open_spider(self, spider):
        print('spider started')

    def process_item(self, item, spider):
        print(item)
        item_json = json.dumps(item, ensure_ascii=False)  # ensure_ascii=False keeps non-ASCII text readable
        self.f.write(item_json + '\n')  # write one JSON object per line
        return item

    def close_spider(self, spider):
        print('spider finished')
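  • One detail the version above leaves open: demo.json is opened in __init__ but never closed. A minimal sketch of a pipeline that closes the file in close_spider (trimmed to the essentials; the prints are omitted):

import json

class MyscrapyPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()  # release the file handle when the spider shuts down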
  • Pipeline execution order: several pipelines can be configured in settings.py, for example as below; the lower the number, the higher the priority
ITEM_PIPELINES = {
    'myscrapy.pipelines.MyscrapyPipeline': 300,
    'myscrapy.pipelines.MyscrapyPipeline1': 301,
}  # the lower the number, the higher the priority
  • The corresponding pipelines.py:

import json

class MyscrapyPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('spider started')

    def process_item(self, item, spider):  # spider.name tells you which spider the item came from
        # print(spider.name)  # would print the spider's name attribute, e.g. 'db'
        item['hello'] = 'world'  # if 'hello': 'world' appears in the output, this pipeline ran first
        item_json = json.dumps(item, ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        print('spider finished')

class MyscrapyPipeline1:
    def process_item(self, item, spider):
        print(item)
        return item
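  • Because process_item receives the spider, spider.name can also be used to route items when several spiders share one pipeline. A minimal sketch of that pattern (checking for 'db', the spider defined earlier; this is an illustration, not part of the project above):

class MyscrapyPipeline1:
    def process_item(self, item, spider):
        if spider.name == 'db':  # only handle items produced by the db spider
            print(item)
        return item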
  • The logging module can be used to record warnings from the program

    • Configuring this in settings.py writes the log out to a file

      LOG_FILE = './log.log'
    • Using the logger inside the spider:

import scrapy
import logging  # import the logging module
from scrapy.http.response.html import HtmlResponse

logger = logging.getLogger(__name__)  # create a logger named after this module

class DbSpider(scrapy.Spider):
    name = 'db'
    allowed_domains = ['douban.com']
    start_urls = ['http://douban.com/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="side-links nav-anon"]/ul/li')
        # use a dict to hold the data
        # extracting the tag content:
        #   get() returns the matched node together with its tags
        #   extract_first() on a text() selector returns just the text
        logger.warning('this is warning')  # write a warning to the log
        item = {}
        for li in li_list:
            item['name'] = li.xpath('a/em/text()').extract_first()
            if item['name'] is None:
                item['name'] = li.xpath('a/text()').extract_first()
            yield item
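  • The same logging.getLogger(__name__) pattern works outside the spider as well, for example in pipelines.py; a minimal sketch (the warning condition is just an illustration):

import logging

logger = logging.getLogger(__name__)  # logger named after the pipelines module

class MyscrapyPipeline:
    def process_item(self, item, spider):
        if not item.get('name'):
            logger.warning('item without a name from spider %s: %s', spider.name, item)
        return item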

Scrapy crawling workflow summary

  • Create the project
  • Create the spider
    • set the spider's name
    • set allowed_domains
    • set start_urls
    • set up logging
  • Implement the data-extraction method
  • Save the data in a pipeline