Scraping Cnblogs with Scrapy + Python 3

Target site: https://www.cnblogs.com/

Framework: Scrapy + Python 3

Output: for each post on the Cnblogs article list, capture the author, publish time, comment count, view count, title, and recommendation count, and write them to a MySQL database at a specified address.

Requirement: the crawler should take a configurable parameter so it can crawl page n on demand.

The task fixes both the framework and the language, so the first step is to set up the environment.

  • I'm on a Mac, so I'll install there (Windows & Linux are much the same, so I won't demonstrate them). Grab the installer from the official Python website.

  • I downloaded Python 3.7.4; just follow the installer wizard, clicking "Next" until it finishes.

  • Upgrade pip, then install Scrapy

python3 -m pip install --upgrade pip
pip3 install scrapy
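  • A quick sanity check that both installed correctly (exact version numbers will differ on your machine):
python3 --version
scrapy version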

Working with Scrapy generally comes down to these steps:

  • Creating a new Scrapy project
  • Writing a spider to crawl a site and extract data
  • Exporting the scraped data using the command line
  • Changing spider to recursively follow links
  • Using spider arguments

So let's get started.


  • Initialize the project
scrapy startproject jasonSpider
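  • startproject generates the standard Scrapy skeleton, which looks roughly like this:
jasonSpider/
    scrapy.cfg            # deploy configuration file
    jasonSpider/          # the project's Python module
        __init__.py
        items.py          # item definitions (renamed below)
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for our spiders
            __init__.py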
  • Rename items.py to JasonSpiderItem.py, then define the six fields we want to scrape
import scrapy

class JasonSpiderItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    release_time = scrapy.Field()
    comment_count = scrapy.Field()
    view_count = scrapy.Field()
    recommended_count = scrapy.Field()
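  • A Scrapy Item behaves like a dict that only accepts the declared fields; assigning anything else raises a KeyError. A quick illustration:
item = JasonSpiderItem()
item['title'] = 'hello scrapy'   # fine: 'title' is a declared Field
item['summary'] = 'oops'         # KeyError: JasonSpiderItem does not support field: summary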
  • Then create cnblogs.py under spiders/

  • Define the spider's name as a class attribute

name = 'cnblogs'
  • Define the base URL
url = 'https://www.cnblogs.com/'
  • Define the method that parses the response
def parse(self, response):
  • Use XPath to locate the content we want, e.g. the post titles
xpath('//div[@class="post_item_body"]/h3/a/text()')
  • The full spider code:
# -*- coding: utf-8 -*-
import scrapy
from jasonSpider.JasonSpiderItem import JasonSpiderItem

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['www.cnblogs.com']

    def start_requests(self):
        url = 'https://www.cnblogs.com/'
        # "page" is supplied on the command line via -a page=N; default to page 1
        page = getattr(self, 'page', 1)
        url = url + '/sitehome/p/' + str(page)
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        items = []

        # There is a single <div id="post_list"> per page, so this loop runs once;
        # each field below comes back as a list of 20 values (one per post on the page),
        # and the pipelines later work on those parallel lists.
        for each in response.xpath('//div[@id="post_list"]'):
            item = JasonSpiderItem()
            item['title'] = each.xpath('//div[@class="post_item_body"]/h3/a/text()').extract()
            item['author'] = each.xpath('//div[@class="post_item_foot"]/a/text()').extract()
            item['recommended_count'] = each.xpath('//span[@class="diggnum"]/text()').extract()
            item['release_time'] = each.xpath('//div[@class="post_item_foot"]/text()').extract()
            item['comment_count'] = each.xpath('//span[@class="article_comment"]/a/text()').extract()
            item['view_count'] = each.xpath('//span[@class="article_view"]/a/text()').extract()

            items.append(item)

        return items
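  • Note that the XPath expressions inside the loop start with //, so they match against the whole page rather than relative to each; the result is a single item whose fields are parallel lists of 20 values, which is exactly what the log output below shows. If you preferred one item per post, a minimal alternative sketch would iterate over each post container and use relative paths. The post_item class name here is my assumption (it does not appear in the selectors above), and the pipelines below would need adjusting to handle scalar fields:
    # Hypothetical per-post variant; a drop-in replacement for CnblogsSpider.parse
    def parse(self, response):
        for post in response.xpath('//div[@class="post_item"]'):   # assumed container class
            item = JasonSpiderItem()
            item['title'] = post.xpath('.//div[@class="post_item_body"]/h3/a/text()').extract_first()
            item['author'] = post.xpath('.//div[@class="post_item_foot"]/a/text()').extract_first()
            item['recommended_count'] = post.xpath('.//span[@class="diggnum"]/text()').extract_first()
            item['comment_count'] = post.xpath('.//span[@class="article_comment"]/a/text()').extract_first()
            item['view_count'] = post.xpath('.//span[@class="article_view"]/a/text()').extract_first()
            # release_time needs the same regex cleanup done in CleanPipeline, so it is omitted here
            yield item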
  • Next, process the scraped data in item pipelines: clean it up and store it in the database

  • I define two pipelines:

    1. CleanPipeline
    2. MySQLPipeline
# -*- coding: utf-8 -*-

import re
import pymysql.cursors

class CleanPipeline(object):

    def process_item(self, item, spider):

        # Clean item['comment_count']: keep only the digits in each value
        comment = []
        for value in item['comment_count']:
            comment_count = re.findall('[0-9]+', value)[0]
            comment.append(int(comment_count))
        item['comment_count'] = comment

        # Clean item['release_time']: the post_item_foot text nodes alternate, and the
        # odd-indexed ones contain the 'YYYY-MM-DD HH:MM' timestamps
        release = []
        for i in range(len(item['release_time'])):
            if i % 2 != 0:
                release_time = re.findall('[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}', item['release_time'][i])[0]
                release.append(release_time)
        item['release_time'] = release

        # Clean item['view_count']: keep only the digits in each value
        view = []
        for i in range(len(item['view_count'])):
            view_count = re.findall('[0-9]+', item['view_count'][i])[0]
            view.append(int(view_count))
        item['view_count'] = view

        # Clean item['recommended_count']: cast the digit strings to int
        recommended = []
        for i in range(len(item['recommended_count'])):
            recommended.append(int(item['recommended_count'][i]))
        item['recommended_count'] = recommended

        return item

class MySQLPipeline(object):
    def __init__(self):

        # Connect to the database
        self.connect = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            db='cnblogs',
            user='username',
            passwd='password',
            charset='utf8',
            use_unicode=True
        )

        # Get a cursor for executing SQL
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # The item fields are parallel lists (one entry per post), so insert row by row
        for k in range(len(item['title'])):
            self.cursor.execute(
                '''
                INSERT INTO posts(title, author, release_time, comment_count, view_count, recommended_count)
                VALUES (%s, %s, %s, %s, %s, %s)
                ''', (str(item['title'][k]), str(item['author'][k]), str(item['release_time'][k]), int(item['comment_count'][k]), int(item['view_count'][k]), int(item['recommended_count'][k]))
            )

            # Commit the insert
            self.connect.commit()
        return item
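  • MySQLPipeline assumes the cnblogs database already contains a posts table. The original post does not show the schema, so here is a minimal sketch of what it could look like (the column types are my guess), runnable once against the same connection parameters:
# One-off script to create the target table (schema is an assumption, not from the original post)
import pymysql

connect = pymysql.connect(host='127.0.0.1', port=3306, db='cnblogs',
                          user='username', passwd='password', charset='utf8')
with connect.cursor() as cursor:
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS posts (
            id                INT AUTO_INCREMENT PRIMARY KEY,
            title             VARCHAR(255),
            author            VARCHAR(64),
            release_time      DATETIME,
            comment_count     INT,
            view_count        INT,
            recommended_count INT
        ) DEFAULT CHARSET = utf8
    ''')
connect.commit()
connect.close()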
  • Register the pipelines in settings.py; the lower number runs first, so CleanPipeline (300) cleans each item before MySQLPipeline (400) writes it
ITEM_PIPELINES = {
    'jasonSpider.pipelines.CleanPipeline': 300,
    'jasonSpider.pipelines.MySQLPipeline': 400,
}
  • Run it from the command line; -a page=3 is the spider argument that getattr(self, 'page', 1) picks up in start_requests
scrapy crawl cnblogs -a page=3
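  • If you just want to eyeball the result without MySQL, Scrapy can also export straight from the command line (the "Exporting the scraped data using the command line" step listed earlier), e.g.:
scrapy crawl cnblogs -a page=3 -o posts.json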
  • And here's our data:
2019-09-20 16:53:46 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: jasonSpider)
2019-09-20 16:53:46 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.2 (default, Jan 14 2019, 21:25:23) - [Clang 10.0.0 (clang-1000.11.45.5)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Darwin-18.7.0-x86_64-i386-64bit
2019-09-20 16:53:46 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'jasonSpider', 'FEED_EXPORT_ENCODING': 'UTF8', 'NEWSPIDER_MODULE': 'jasonSpider.spiders', 'SPIDER_MODULES': ['jasonSpider.spiders']}
2019-09-20 16:53:46 [scrapy.extensions.telnet] INFO: Telnet Password: 514758ce9c75a2ed
2019-09-20 16:53:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-09-20 16:53:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-09-20 16:53:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-09-20 16:53:46 [scrapy.middleware] INFO: Enabled item pipelines:
['jasonSpider.pipelines.CleanPipeline', 'jasonSpider.pipelines.MySQLPipeline']
2019-09-20 16:53:46 [scrapy.core.engine] INFO: Spider opened
2019-09-20 16:53:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-09-20 16:53:46 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-09-20 16:53:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.cnblogs.com/sitehome/p/3> from <GET https://www.cnblogs.com//sitehome/p/3>
2019-09-20 16:53:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cnblogs.com/sitehome/p/3> (referer: None)
2019-09-20 16:53:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/sitehome/p/3>
{'author': ['FlyLolo',
            'JMCui',
            'Rest探路者',
            '温一壶清酒',
            '鹿呦呦',
            'Miku~',
            '秃桔子',
            'Xenny',
            'smileNicky',
            'Jacian',
            '大数据江湖',
            '叙帝利',
            '平头哥的技术博文',
            'quellanan',
            '小熊餐馆',
            'baby_duoduo',
            '辉是暖阳辉',
            'stoneFang',
            '大史不说话',
            '奋进的小样'],
 'comment_count': [0, 1, 1, 0, 0, 4, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'recommended_count': [8, 3, 1, 1, 3, 6, 0, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
 'release_time': ['2019-09-20 08:02',
                  '2019-09-20 07:46',
                  '2019-09-20 04:30',
                  '2019-09-20 00:45',
                  '2019-09-20 00:40',
                  '2019-09-20 00:27',
                  '2019-09-19 23:40',
                  '2019-09-19 23:38',
                  '2019-09-19 23:36',
                  '2019-09-19 23:35',
                  '2019-09-19 22:38',
                  '2019-09-19 22:28',
                  '2019-09-19 22:21',
                  '2019-09-19 22:13',
                  '2019-09-19 22:05',
                  '2019-09-19 22:04',
                  '2019-09-19 21:57',
                  '2019-09-19 21:56',
                  '2019-09-19 20:51',
                  '2019-09-19 20:33'],
 'title': ['ASP.NET Core 2.2 : 二十二. 多样性的配置方式',
           '多线程编程学习十一(ThreadPoolExecutor 详解).',
           'Java多线程(十四):Timer',
           'Genymotion模拟器的安装及脚本制作',
           '《即时消息技术剖析与实战》学习笔记7——IM系统的消息未读',
           '[3]尝试用Unity3d制作一个王者荣耀(持续更新)->选择英雄-(中)',
           'JVM垃圾回收?看这一篇就够了!',
           'SQL手工注入基础篇',
           'MySQL实现Oracle rank()排序',
           '为什么StringBuilder是线程不安全的?StringBuffer是线程安全的?',
           'Java 中的 syncronized 你真的用对了吗',
           '代码美化的艺术',
           '观察者模式,从公众号群发说起',
           '二、springBoot 整合 mybatis 项目实战',
           'rocketmq学习(一)  rocketmq介绍与安装',
           '深入理解Three.js中透视投影照相机PerspectiveCamera',
           'Burpsuit构造测试数据',
           '如何做一个职业的程序员-《麦肯锡方法》读书笔记',
           'Stanford公开课《编译原理》学习笔记(1~4课)',
           'Mysql优化总结(一)'],
 'view_count': [572, 180, 135, 100, 200, 403, 196, 315, 76, 319, 143, 251, 54, 159, 81, 71, 43, 160, 128, 196]}
2019-09-20 16:53:47 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-20 16:53:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 453,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 13096,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'elapsed_time_seconds': 0.409017,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 9, 20, 8, 53, 47, 38918),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'memusage/max': 50388992,
 'memusage/startup': 50388992,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 9, 20, 8, 53, 46, 629901)}
2019-09-20 16:53:47 [scrapy.core.engine] INFO: Spider closed (finished)
  • The rows were also written to the database
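  • A quick way to verify from the MySQL client (using the same credentials and the posts table assumed above):
mysql -u username -p cnblogs -e "SELECT title, author, release_time FROM posts LIMIT 5"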



If you don't act,
the best case is merely where you are now.
If you do act,
the worst case is also merely where you are now.
So, what are you worried about?