Installing PhantomJS for Python and Working with MySQL

PhantomJS download page: http://phantomjs.org/download.html

Here we choose phantomjs-2.1.1-windows.zip.

The download is a zip archive. Extract it to the desktop so it is easy to work with.

After extracting, open the bin folder and copy its full path.

Add the copied path to the PATH environment variable.

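Note: entries in the Windows PATH variable are separated by semicolons. When you paste the path, make sure a semicolon separates it from the neighboring entries: if the path already ends with \, add the semicolon right after it; if it does not, append \ and then the semicolon. Otherwise phantomjs will not be found and the code later on will raise an error.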
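
Windows splits PATH on semicolons, so a missing separator silently merges two directories into one invalid entry. The sketch below illustrates this; the directories are hypothetical placeholders, not the author's actual paths.

```python
def split_windows_path(path_value):
    """Split a Windows PATH string into its individual directory entries."""
    return [entry for entry in path_value.split(";") if entry]


# Hypothetical PATH values, for illustration only.
with_separator = r"C:\Windows\system32;C:\phantomjs\bin"
without_separator = r"C:\Windows\system32C:\phantomjs\bin"

print(split_windows_path(with_separator))     # two separate directories
print(split_windows_path(without_separator))  # one merged, invalid directory
```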

Once the installation is complete, type phantomjs in a terminal to test it.
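
The same check can be made from Python. This is a minimal sketch using only the standard library; the helper name find_phantomjs is an assumption made for illustration.

```python
import shutil


def find_phantomjs(executable="phantomjs"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


location = find_phantomjs()
if location is None:
    print("phantomjs was not found on PATH; re-check the environment variable")
else:
    print("phantomjs found at", location)
```

If the result is None right after editing the environment variable, opening a new terminal usually helps, because a running terminal keeps the old PATH.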

Next, we crawl a website and store the results in MySQL.
Target site: http://www.ygdy8.net/html/gndy/index.html

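We create the project from the command line:

1. Change into the folder where the project should live.

2. Run scrapy startproject tiantangdianying (天堂电影, "Movie Paradise").

3. Once the project is created, enter it: cd tiantangdianying

4. Inside the project, create the spider: scrapy genspider tiantang ygdy8.net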

The project is now in place, so let's go straight to tiantang.py and scrape the data we want.

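First, the goals of this crawl:

1. Find the classic movies listed on the site.

2. Get each "More" link, follow it, and collect the detail-page links of all movies.

3. On each detail page, extract the movie title and the download link.

4. Save the extracted data to a MySQL database.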

Open tiantang.py and change start_urls to the page we want to crawl, http://www.ygdy8.net/html/gndy/index.html.

Getting the "More" links:

# -*- coding: utf-8 -*-
import scrapy
from ..items import TiantangdianyingItem


class TiantangSpider(scrapy.Spider):
    # The name must match the one used in "scrapy crawl tiantang".
    name = 'tiantang'
    allowed_domains = ['ygdy8.net']
    start_urls = ['http://www.ygdy8.net/html/gndy/index.html']

    def parse(self, response):
        # Each category header on the index page carries a "More" link.
        more_hrefs = response.xpath('//div[@class="title_all"]/p/em/a/@href').extract()
        for href in more_hrefs:
            # Resolve the relative href against the current page URL.
            more_url = response.urljoin(href)
            print(more_url)

Running the spider prints one absolute URL per "More" link.

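Getting the detail-page link of every movie:

    def get_detail_page_url(self, response):
        # The second link in each listing entry points to the movie's detail page.
        detail_hrefs = response.xpath('//div[@class="co_area2"]//ul//td//a[2]/@href').extract()
        for href in detail_hrefs:
            detail_url = response.urljoin(href)
            print(detail_url)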

This prints the absolute URL of each movie's detail page.

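Getting the next-page link:

        # "下一页" is the text of the site's "next page" link.
        next_href = response.xpath('//div[@class="x"]//a[text()="下一页"]/@href').extract_first('')
        if next_href:
            # The href is relative to the current listing page, so resolve it
            # against response.url instead of a hard-coded category path.
            next_url = response.urljoin(next_href)
            print(next_url)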

This prints the URL of the next listing page, or nothing on the last page.
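
A "next page" href is relative to the listing page it appears on, so prefixing a fixed category path only works for that one category. The sketch below shows the resolution with the standard library's urljoin; the page URLs, the href list_4_2.html, and the helper name next_page_url are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve a relative "next page" href; return None when there is no next page."""
    if not next_href:
        return None
    return urljoin(current_url, next_href)


# Hypothetical listing pages and href, for illustration only.
china_page = "http://www.ygdy8.net/html/gndy/china/index.html"
oumei_page = "http://www.ygdy8.net/html/gndy/oumei/index.html"
next_href = "list_4_2.html"

print(next_page_url(china_page, next_href))  # http://www.ygdy8.net/html/gndy/china/list_4_2.html
print(next_page_url(oumei_page, next_href))  # http://www.ygdy8.net/html/gndy/oumei/list_4_2.html
print(next_page_url(china_page, ""))         # None
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response.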

Next, we extract the movie title and the download address:

    def get_title_and_href(self, response):
        title = response.xpath('//div[@class="title_all"]//font/text()').extract_first()
        href = response.xpath('//td[@style="WORD-WRAP: break-word"]/a/@href').extract_first()
        print(title)
        print(href)

This prints the title and the download link of each movie.

We imported TiantangdianyingItem at the top of the spider. Now we use it at the end of get_title_and_href:

        item = TiantangdianyingItem()
        item['title'] = title
        item['href'] = href
        yield item
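
extract_first() returns None when an XPath matches nothing, so a detail page with an unexpected layout would yield an item with empty fields. Below is a minimal sketch of a guard; the helper name build_movie_record is hypothetical, and it uses a plain dict in place of the Scrapy item so that it runs on its own.

```python
def build_movie_record(title, href):
    """Return a cleaned {'title', 'href'} dict, or None if either value is missing."""
    if not title or not href:
        return None
    return {"title": title.strip(), "href": href.strip()}


# Hypothetical values standing in for the XPath results.
print(build_movie_record("  Example Movie ", "ftp://example.com/movie.mkv"))
print(build_movie_record(None, "ftp://example.com/movie.mkv"))  # None
```

In the spider, one option would be to skip the page when the helper returns None and otherwise copy the two values into a TiantangdianyingItem before yielding it.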

Configure items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TiantangdianyingItem(scrapy.Item):
    # Fields filled in by the spider.
    title = scrapy.Field()  # movie title
    href = scrapy.Field()   # download link

Configure settings.py. At a minimum, register the pipeline in ITEM_PIPELINES; otherwise the items yielded by the spider never reach pipelines.py.
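
A minimal sketch of the relevant settings.py fragment follows. The dotted path assumes the default layout created by scrapy startproject tiantangdianying, and 300 is an assumed order value (lower numbers run first); any other changes the author made to this file are not recoverable here.

```python
# settings.py (fragment)
# Assumed module path: <project package>.pipelines.<pipeline class>
ITEM_PIPELINES = {
    "tiantangdianying.pipelines.TiantangdianyingPipeline": 300,
}
```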

Then open MySQL and create the database and the table. The pipeline below connects to a database named dianying and inserts into a table named new_table.

Create two columns in the table, name and href.
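
The table can also be created from Python. This is a sketch under stated assumptions: the column types VARCHAR(255) and TEXT are guesses, since the original schema is not shown, and the helper name create_movie_table is hypothetical. It accepts any DB-API connection, so it does not import a driver itself, and the dianying database is assumed to exist already.

```python
# Assumed column types; adjust them to the real schema.
CREATE_TABLE_SQL = (
    "CREATE TABLE IF NOT EXISTS new_table ("
    "name VARCHAR(255), "
    "href TEXT"
    ")"
)


def create_movie_table(connection):
    """Create new_table through a DB-API connection, if it does not exist yet."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
        connection.commit()
    finally:
        cursor.close()
```

With PyMySQL, the connection would come from pymysql.connect(host='localhost', user='root', password='123456', database='dianying', port=3306), matching the values used in the pipeline.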

Configure the pipeline in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

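import pymysql


class TiantangdianyingPipeline(object):
    def __init__(self):
        # Connection settings for the local MySQL server; adjust them to your setup.
        # utf8mb4 lets Chinese titles be sent to the server correctly.
        self.connect = pymysql.connect(host='localhost', user='root', password='123456',
                                       database='dianying', port=3306, charset='utf8mb4')
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Let the driver escape the values; building the SQL with str.format
        # breaks on titles that contain quotes and allows SQL injection.
        self.cursor.execute('INSERT INTO new_table (name, href) VALUES (%s, %s)',
                            (item['title'], item['href']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()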

Finally, run scrapy crawl tiantang in the terminal.

The scraped data is now saved in MySQL.
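
To confirm that rows were written, count them after the crawl. This is a minimal sketch; the helper name count_saved_movies is hypothetical, and it assumes a DB-API connection whose cursor returns rows as tuples, which is PyMySQL's default cursor behavior.

```python
def count_saved_movies(connection):
    """Return the number of rows currently stored in new_table."""
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT COUNT(*) FROM new_table")
        return cursor.fetchone()[0]
    finally:
        cursor.close()
```

A count of zero after a crawl usually means the pipeline is not enabled in settings.py or the inserts were never committed.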

Here is the complete code of tiantang.py:

# -*- coding: utf-8 -*-
import scrapy
from ..items import TiantangdianyingItem


class TiantangSpider(scrapy.Spider):
    # The name must match the one used in "scrapy crawl tiantang".
    name = 'tiantang'
    allowed_domains = ['ygdy8.net']
    start_urls = ['http://www.ygdy8.net/html/gndy/index.html']

    def parse(self, response):
        # Follow the "More" link of every category on the index page.
        more_hrefs = response.xpath('//div[@class="title_all"]/p/em/a/@href').extract()
        for href in more_hrefs:
            yield scrapy.Request(url=response.urljoin(href), callback=self.get_detail_page_url)

    def get_detail_page_url(self, response):
        # Request the detail page of every movie in the current listing.
        detail_hrefs = response.xpath('//div[@class="co_area2"]//ul//td//a[2]/@href').extract()
        for href in detail_hrefs:
            yield scrapy.Request(url=response.urljoin(href), callback=self.get_title_and_href)
        # Follow the "下一页" (next page) link until the last page is reached.
        next_href = response.xpath('//div[@class="x"]//a[text()="下一页"]/@href').extract_first('')
        if next_href:
            yield scrapy.Request(url=response.urljoin(next_href), callback=self.get_detail_page_url)

    def get_title_and_href(self, response):
        # Extract the movie title and its download link from the detail page.
        title = response.xpath('//div[@class="title_all"]//font/text()').extract_first()
        href = response.xpath('//td[@style="WORD-WRAP: break-word"]/a/@href').extract_first()
        item = TiantangdianyingItem()
        item['title'] = title
        item['href'] = href
        yield item