通过关键词抓取京东数据

实战   后端笔记   1个月前

写了个脚本抓取了京东的数据,下面是实战的过程还有解决踩到的坑。

需求

现在有几份 2W 多条商品关键词的 Excel 文件,然后要拿关键词在京东搜索,并把该关键词的总商品数和前八个搜索结果的价格和评论数记录回 Excel 文件。

分析一下

  1. 要把关键词从 Excel 中取出,用了 PHPOffice/PhpSpreadsheet
  2. 要在京东搜索,所以需要一个 http 客户端,我比较喜欢用 guzzle
  3. 把京东的搜索页抓下来,然后要用 html-parser 来解析一下我们需要的几个数据

开始动手

经过上面的分析,需要三个包,用 composer 安装一下

php composer.phar require guzzlehttp/guzzle
php composer.phar require phpoffice/phpspreadsheet
php composer.phar require bupt1987/html-parser

然后把自动加载引进来

require 'vendor/autoload.php';

为了让脚步正常运行,修改一下 php 的配置

ini_set('memory_limit', '1024M');

首先把关键词从 Excel 里面拿到,然后 new 一个 http 客户端, 开始抓取,因为磁盘 IO 的速度相对比较慢,所以每抓一百个关键词后再写入文件。

function dosheet() {
    $inputFileName = '/Users/jasonvv/jdData/kw.xlsx';
    $reader = new Xlsx();
    $spreadsheet = $reader->load($inputFileName);
    $worksheet = $spreadsheet->getActiveSheet();
    $client = new Client();
    $count = 0;
    for ($i = 1; $i <= 23682; $i++) {
        $key = $worksheet->getCell('A'.$i)->getFormattedValue();
        echo '正在抓取第 '.($i - 1).' 个关键词'.PHP_EOL;
        $spreadsheet = gojd($key,$i,$spreadsheet,$client);
        $count++;
        if ($count == 100) {
            $writer = IOFactory::createWriter($spreadsheet, 'Xlsx');
            $writer->save('/Users/jasonvv/jdData/kw.xlsx');
            unset($writer);
            echo "正在写入数据100条,已写数据总共".($i - 1).'条'.PHP_EOL;
            $count = 0;
        }
    }
    echo "写入完成";
}

然后来写一下抓取的方法,抓取关键词的总商品数和前八个搜索结果的价格都没问题,但是商品的评论数拿不到,什么原因呢,打开 chrome 的控制台观察了一下,发现 Network 里面有一个特殊的 ajax 请求,拿到它的Request URL:

https://club.jd.com/comment/productCommentSummaries.action?referenceIds=36766083425&callback=jQuery3479769&_=1567248103508

打开发现是一个 json 对象:

{
    "CommentsCount": [
        {
            "SkuId": 36766083425,
            "ProductId": 36766083425,
            "ShowCount": 70,
            "ShowCountStr": "70+",
            "CommentCountStr": "1.3万+",
            "CommentCount": 13000,
            "AverageScore": 5,
            "DefaultGoodCountStr": "1.1万+",
            "DefaultGoodCount": 11000,
            "GoodCountStr": "6000+",
            "GoodCount": 6000,
            "AfterCount": 10,
            "OneYear": 0,
            "AfterCountStr": "10+",
            "VideoCount": 10,
            "VideoCountStr": "10",
            "GoodRate": 0.99,
            "GoodRateShow": 99,
            "GoodRateStyle": 148,
            "GeneralCountStr": "20+",
            "GeneralCount": 20,
            "GeneralRate": 0.007,
            "GeneralRateShow": 1,
            "GeneralRateStyle": 2,
            "PoorCountStr": "20+",
            "PoorCount": 20,
            "SensitiveBook": 0,
            "PoorRate": 0.003,
            "PoorRateShow": 0,
            "PoorRateStyle": 0
        }
    ]
}

找到这个 api 就很好办了,先拿到商品的 data-sku ,然后把它赋给 referenceIds,拿 http 客户端请求一下,拿到这个 json 然后取我们要的 CommentCount ,数据基本都拿到了,贴一下代码:

function gojd($key, $row, $spreadsheet, $client) {
    $worksheet = $spreadsheet->getActiveSheet();
    $data = urlencode($key);
    $base_uri = 'https://search.jd.com/Search?keyword='.$data.'&enc=utf-8';

    $response = $client->request('GET', $base_uri, [
        'headers' => [
            'authority' => 'search.jd.com',
            'upgrade-insecure-requests' => '1',
            'user-agent' => 'balabalabala...',
            'referer' => 'https://search.jd.com/Search',
            'accept-encoding' => 'gzip, deflate, br',
            'accept-language' => 'zh-CN,zh;q=0.9,en;q=0.8',
            'cookie' => 'balabalabala...'
        ]
    ]);
    $res = $response->getBody()->getContents();

    $html_dom = new ParserDom($res);

    try{
        $resCount = $html_dom->find('#J_resCount',0)->getPlainText();
    } catch (Error $e) {
        return $spreadsheet;
    }

    $price = [];
    $Comment = [];
    $id = [];

    for ($i = 0; $i < 8; $i++) {
        try{
            $html_dom->find('#J_goodsList', 0)->find('li.gl-item', $i)->find('div.p-price', 0)->find('i', 0);
        } catch (Error $e) {
            break;
        }
        $price[] = $html_dom->find('#J_goodsList', 0)->find('li.gl-item', $i)->find('div.p-price', 0)->find('i', 0)->getPlainText();
        $id[] = $html_dom->find('#J_goodsList', 0)->find('li.gl-item', $i)->getAttr('data-sku');
    }

    $res = [];
    $requests = function ($total) use ($id){
        $uri = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds=';
        for ($i = 0; $i < $total; $i++) {
            yield new Request('GET', $uri.$id[$i]);
        }
    };

    $pool = new Pool($client, $requests(count($id)), [
        'concurrency' => count($id),
        'fulfilled' => function ($response, $index) use(&$res,&$Comment) {
            $res[$index] = $response->getBody()->getContents();
        },
        'rejected' => function ($reason, $index) {
            echo '!!!'.$index.':'.$reason;
        },
    ]);

    $promise = $pool->promise();
    $promise->wait();

    ksort($res);
    array_map(function ($item) use (&$Comment){
        $Comment[] = json_decode(utf8_encode($item))->CommentsCount[0]->CommentCount;
    },$res);

    unset($html_dom);

    $worksheet->getCell(strtoupper(chr(66)).$row)->setValue($resCount);
    echo 'resCount: '.$resCount.PHP_EOL;
    for ($i = 0, $col=71; $i < count($id); $i++) {
        echo 'price:'.$price[$i].' Comment:'.$Comment[$i].PHP_EOL;
        $worksheet->getCell(strtoupper(chr($col++)).$row)->setValue($price[$i]);
        $worksheet->getCell(strtoupper(chr($col++)).$row)->setValue($Comment[$i]);
    }
    echo PHP_EOL;
    return $spreadsheet;
}

跑了脚本后发现,由于网络原因或者其他因素的影响,有一些关键词没抓到,这个时候就想到要再写一个脚本,修复一下这些数据,贴一下代码:

function docheck() {
    $inputFileName = '/Users/jasonvv/jdData/kw.xlsx';
    $reader = new Xlsx();
    $spreadsheet = $reader->load($inputFileName);
    $worksheet = $spreadsheet->getActiveSheet();
    $client = new Client();
    $checks = [];
    $count = 0;
    for ($i = 0; $i <= 23682; $i++) {
        $point = $worksheet->getCell('G'.$i)->getFormattedValue();
        $key = $worksheet->getCell('A'.$i)->getFormattedValue();
        if ($point==null) {
            $checks[] = $i;
            $checks[] = $key;
        }
    }
    $isWrite = 0;

    for ($j = 0; $j < count($checks); $j=$j+2) {
        $row = $checks[$j];
        $key = $checks[$j+1];
        echo '正在抓取第 '.($row - 1).' 个关键词'.PHP_EOL;
        $spreadsheet = gojd($key,$row,$spreadsheet,$client);
        $isWrite = 0;
        $count++;
        if ($count == 100) {
            $writer = IOFactory::createWriter($spreadsheet, 'Xlsx');
            $writer->save('/Users/jasonvv/jdData/kw.xlsx');
            unset($writer);
            echo "正在写入数据100条".PHP_EOL;
            $isWrite = 1;
            $count = 0;
        }
    }
    if ($isWrite==0) {
        $writer = IOFactory::createWriter($spreadsheet, 'Xlsx');
        $writer->save('/Users/jasonvv/jdData/kw.xlsx');
    }
    echo "修复完成";
}

跑了这个修复的脚本,数据就完整了。




如果你不行动
最好的情况就只是现在
如果你行动了
最坏的情况也不过是现在
所以,你在担心什么?