Scraping Images with Scrapy
Method 1: Direct Download
- carcar: the spider extracts each image's `src` URL and yields it in an item; see the sketch below.
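A minimal sketch of that direct-download spider, assuming a plain dict item with a `src` key and reusing the page URL and XPath that appear in the Method 2 spider later in this post:

```python
import scrapy


class CarcarSpider(scrapy.Spider):
    name = 'carcar'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/46488/6027826.html#pvareaid=3454450']

    def parse(self, response):
        # Walk every <li> in the photo list and hand the image URL to the pipeline.
        for li in response.xpath('//ul[@id="imgList"]/li'):
            item = {}
            item['src'] = 'https:' + li.xpath('./a/img/@src').extract_first()
            yield item
```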
- pipeline: the item pipeline takes the `src` URL from each item and saves the image to disk itself; see the sketch below.
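A minimal sketch of the matching pipeline, assuming images go into an `images` folder next to the project and are named after the last URL segment:

```python
import os
from urllib import request

from itemadapter import ItemAdapter


class CarPipeline:
    def open_spider(self, spider):
        # Keep downloads in an "images" folder beside the project (created if missing).
        self.images_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        os.makedirs(self.images_dir, exist_ok=True)

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        url = adapter['src']
        filename = url.split('/')[-1]  # use the last URL segment as the file name
        request.urlretrieve(url, os.path.join(self.images_dir, filename))
        return item
```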
Method 2: Using Scrapy's Built-in File Download Support
1. Avoids re-downloading data that has recently been downloaded.
2. Makes it easy to specify where files are stored.
3. Can convert downloaded images to common formats such as PNG or JPG.
4. Can generate thumbnails.
5. Can check image width and height to make sure they meet a minimum size.
6. Downloads asynchronously, which is very efficient.
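A sketch of how some of these features map onto settings.py options (the values here are only illustrative):

```python
# settings.py (illustrative values)
IMAGES_STORE = 'images'      # where downloaded images are kept
IMAGES_EXPIRES = 90          # skip re-downloading images fetched within the last 90 days
IMAGES_THUMBS = {            # generate thumbnails alongside the originals
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_WIDTH = 110       # drop images smaller than the minimum size
IMAGES_MIN_HEIGHT = 110
```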
Downloading files with the Files Pipeline
To download files with the Files Pipeline, follow these steps:
Define an Item with two fields, `file_urls` and `files`. `file_urls` holds the URLs of the files to download and must be given as a list. For images, the Images Pipeline uses `image_urls` and `images` instead, which is what the item below declares.
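For the Files Pipeline itself, the item declares exactly those two fields; the class name in this sketch is only for illustration:

```python
import scrapy


class DownloadFileItem(scrapy.Item):
    file_urls = scrapy.Field()  # list of URLs for the Files Pipeline to fetch
    files = scrapy.Field()      # filled in with download results (path, URL, checksum)
```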
```python
import scrapy


class CarItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    images = scrapy.Field()      # filled in with the saved image info (path, URL, checksum)
    image_urls = scrapy.Field()  # image URLs to download; must be a list
```
- When a file has finished downloading, the download details, such as the storage path, the source URL, and the file checksum, are stored in the item's `files` attribute (`images` for the Images Pipeline).

- Configure `FILES_STORE` in settings.py to set the download directory; since this project uses the Images Pipeline, the setting is `IMAGES_STORE`:

  ```python
  import os
  # The images folder must be created in advance.
  IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
  ```
Enable the pipeline: add `scrapy.pipelines.files.FilesPipeline: 1` to ITEM_PIPELINES (here the image variant, `scrapy.pipelines.images.ImagesPipeline: 1`, is used):
```python
ITEM_PIPELINES = {
    # 'car.pipelines.CarPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
```

- carcar

```python
import scrapy
from car.items import CarItem


class CarcarSpider(scrapy.Spider):
    name = 'carcar'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/46488/6027826.html#pvareaid=3454450']

    def parse(self, response):
        ul = response.xpath('//ul[@id="imgList"]/li')
        for li in ul:
            # item = {}
            item = CarItem()
            # item['src'] = 'https:' + li.xpath('./a/img/@src').extract_first()  # method 1
            item['image_urls'] = ['https:' + li.xpath('./a/img/@src').extract_first()]  # note: must be a list
            # print(item)
            yield item
```
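Once the spider runs, the Images Pipeline fills each item's `images` field with one entry per downloaded file. A minimal sketch of a follow-up pipeline that logs where each image was stored (the class name and its pipeline order are assumptions):

```python
from itemadapter import ItemAdapter


class CarResultPipeline:
    """Runs after ImagesPipeline; register it with a larger number, e.g. 300."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        for result in adapter.get('images', []):
            # Each result is a dict like:
            # {'url': 'https://...', 'path': 'full/<sha1 of url>.jpg', 'checksum': '...'}
            spider.logger.info('saved %s -> %s', result['url'], result['path'])
        return item
```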