Scraping Images with Scrapy

Method 1: Downloading directly

  • carcar
```python
import scrapy


class CarcarSpider(scrapy.Spider):
    name = 'carcar'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/46488/6027826.html#pvareaid=3454450']

    def parse(self, response):
        ul = response.xpath('//ul[@id="imgList"]/li')
        for li in ul:
            item = {}
            item['src'] = 'https:' + li.xpath('./a/img/@src').extract_first()
            # print(item)
            yield item
```
  • pipeline
```python
from itemadapter import ItemAdapter
from urllib import request  # used to download and save the images
import os


class CarPipeline:
    def process_item(self, item, spider):
        # the image URL
        src = item['src']
        # derive a file name from the URL (the part after '__')
        img_name = item['src'].split('__')[-1]
        # target folder:     E:\Project\spider\day22\pic\images
        # this file lives in E:\Project\spider\day22\pic\pic\pipelines
        # one dirname up:    E:\Project\spider\day22\pic\pic
        # two dirnames up:   E:\Project\spider\day22\pic
        # joined 'images':   E:\Project\spider\day22\pic\images
        # one way to build that path:
        file_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        os.makedirs(file_path, exist_ok=True)  # make sure the folder exists
        request.urlretrieve(src, file_path + '/' + img_name)
        return item
```
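For Method 1 to actually run, this pipeline has to be enabled in settings.py. A minimal fragment, assuming the project is named car as in the imports above:

```python
# settings.py — enable the custom pipeline from Method 1
ITEM_PIPELINES = {
    'car.pipelines.CarPipeline': 300,
}
```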

Method 2: Using Scrapy's built-in file-download pipelines

  • 1: avoids re-downloading data that was recently downloaded

  • 2: makes it easy to specify where files are stored

  • 3: can convert downloaded images to a common format, such as PNG or JPG

  • 4: makes it easy to generate thumbnails

  • 5: can check image width and height to enforce minimum size limits

  • 6: downloads asynchronously, which is very efficient

Downloading files with the Files Pipeline

To download files with the Files Pipeline, follow these steps:

  • Define an Item with two fields: file_urls and files. file_urls stores the URLs of the files to download and must be given a list. (The Images Pipeline works the same way with image_urls and images, which is what the example below uses.)

```python
import scrapy


class CarItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    images = scrapy.Field()      # filled in with the saved-image info
    image_urls = scrapy.Field()  # the image URLs, as a list
```
  • When a file has finished downloading, details of the download are stored in the item's files field, such as the saved path, the original URL, and the file checksum

  • Set FILES_STORE in settings.py to configure where downloads are saved (the Images Pipeline uses IMAGES_STORE):

```python
import os
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')  # the images folder must be created in advance
```
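The nested os.path.dirname calls walk two directories up from settings.py, just as in Method 1's pipeline. A quick sanity check with a made-up POSIX path (the E:\ paths shown earlier follow the same pattern):

```python
import os

# hypothetical location of settings.py inside a Scrapy project
settings_file = '/project/pic/pic/settings.py'

package_dir = os.path.dirname(settings_file)      # /project/pic/pic
project_dir = os.path.dirname(package_dir)        # /project/pic
images_dir = os.path.join(project_dir, 'images')  # /project/pic/images
print(images_dir)
```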
  • Enable the pipeline: set scrapy.pipelines.files.FilesPipeline: 1 in ITEM_PIPELINES (this example enables the image variant, scrapy.pipelines.images.ImagesPipeline)

```python
ITEM_PIPELINES = {
    # 'car.pipelines.CarPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
```
  • carcar

```python
import scrapy
from car.items import CarItem


class CarcarSpider(scrapy.Spider):
    name = 'carcar'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/46488/6027826.html#pvareaid=3454450']

    def parse(self, response):
        ul = response.xpath('//ul[@id="imgList"]/li')
        for li in ul:
            # item = {}
            item = CarItem()
            # item['src'] = 'https:' + li.xpath('./a/img/@src').extract_first()  # Method 1
            item['image_urls'] = ['https:' + li.xpath('./a/img/@src').extract_first()]  # note: pass a list
            # print(item)
            yield item
```
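By default the Images Pipeline saves files under full/&lt;sha1-of-url&gt;.jpg rather than the original names. If you want Method 1's naming (the part of the URL after '__'), you can override file_path in a subclass. A sketch, not from the original post: image_name_from_url is a hypothetical helper, and the subclass must be registered in ITEM_PIPELINES in place of the stock pipeline.

```python
def image_name_from_url(url: str) -> str:
    """Reproduce Method 1's naming: keep the part after the last '__'."""
    return url.split('__')[-1]

# Plugging it into a subclass of scrapy.pipelines.images.ImagesPipeline:
#
#   class CarImagesPipeline(ImagesPipeline):
#       def file_path(self, request, response=None, info=None, *, item=None):
#           return image_name_from_url(request.url)

print(image_name_from_url('https://example.com/t_autohomecar__abc123.jpg'))  # → abc123.jpg
```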