Scrapy Example

Goal: crawl the job postings and their responsibilities from the page below.

https://careers.tencent.com/search.html

Scrapy crawl workflow

  • Create the project
  • Create the spider (commands for both steps are shown after this list)
    • set the spider name
    • set allowed_domains
    • set start_urls
    • configure the log
  • Implement the data-extraction methods
  • Save the data in a pipeline
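As a sketch, the first two steps use Scrapy's standard commands; the project name myscrapy and spider name hr below match the spider code later in this post, and LOG_LEVEL in settings.py is one common way to configure the log:

scrapy startproject myscrapy
cd myscrapy
scrapy genspider hr tencent.com

# settings.py (optional): only show WARNING and above
LOG_LEVEL = 'WARNING'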
Fetching the page data through the site's data API directly gives the following results:
Page 1:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1605963271594&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
Page 2:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1605963271594&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn

Detail page:
https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1605963490669&postId=1128980808770523136&language=zh-cn  # change postId to switch detail pages
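Before writing the spider, the two endpoints can be sanity-checked directly; here is a minimal sketch using the requests library (assuming the API answers plain GET requests; the field names Data, Posts, PostId, Responsibility match what the spider code below relies on):

import requests

list_url = ('https://careers.tencent.com/tencentcareer/api/post/Query?'
            'timestamp=1605963271594&countryId=&cityId=&bgIds=&productId='
            '&categoryId=&parentCategoryId=&attrId=&keyword='
            '&pageIndex=1&pageSize=10&language=zh-cn&area=cn')
data = requests.get(list_url).json()
for job in data['Data']['Posts']:
    # print id, title, and location for each of the 10 posts on page 1
    print(job['PostId'], job['RecruitPostName'], job['LocationName'])

detail_url = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId?'
              'timestamp=1605963490669&postId=1128980808770523136&language=zh-cn')
detail = requests.get(detail_url).json()
print(detail['Data']['Responsibility'])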
import scrapy
import json
from myscrapy.items import MyscrapyItem


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    # list-page API url; pageIndex is filled in later
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1606137466198&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # detail-page API url; postId is filled in later
    two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1605963490669&postId={}&language=zh-cn'
    start_urls = [one_url.format(1)]

    def parse(self, response):
        # request the first 10 list pages; the responses go to parse_one
        for page in range(1, 11):
            url = self.one_url.format(page)
            yield scrapy.Request(url=url, callback=self.parse_one)

    def parse_one(self, response):
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            # create a fresh item per job so the detail requests don't share state
            item = MyscrapyItem()
            item['area'] = job['LocationName']
            item['type'] = job['RecruitPostName']
            post_id = job['PostId']
            # build the detail-page url and pass the partial item along via meta
            detail_url = self.two_url.format(post_id)
            yield scrapy.Request(
                url=detail_url,
                callback=self.parse_two,
                meta={'item': item}
            )

    def parse_two(self, response):
        # two equivalent ways to read the value passed through meta:
        # item = response.meta['item']
        item = response.meta.get('item')
        data = json.loads(response.text)
        item['Responsibility'] = data['Data']['Responsibility']
        item['Requirement'] = data['Data']['Requirement']
        print(item)
        yield item
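The spider assumes MyscrapyItem declares the four fields it fills in; the original post does not show items.py, so this is an inferred minimal version:

import scrapy

class MyscrapyItem(scrapy.Item):
    area = scrapy.Field()            # LocationName from the list API
    type = scrapy.Field()            # RecruitPostName from the list API
    Responsibility = scrapy.Field()  # from the detail API
    Requirement = scrapy.Field()     # from the detail API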
import json

class MyscrapyPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('spider started')

    def process_item(self, item, spider):
        print(item)
        # Item objects are not directly JSON-serializable, so convert to dict first
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.f.close()
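For the pipeline to run, it has to be enabled in settings.py; a minimal sketch assuming the default myscrapy project layout (300 is just a priority value):

ITEM_PIPELINES = {
    'myscrapy.pipelines.MyscrapyPipeline': 300,
}

The spider is then started with scrapy crawl hr, and the scraped jobs end up in demo.json, one JSON object per line.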