Installing the Python crawler framework Scrapy on Linux

624c95384278 2023-01-10


http://scrapy.org/

Scrapy is one of the easiest-to-use crawler frameworks for Python. Requirement: Python 2.7.x.

1. Ubuntu 14.04

1.1 Check whether pip is already installed

# pip --version

If pip is not installed, install it:

$ sudo apt-get install python-pip

1.2 Install Scrapy

Import the GPG key used to sign Scrapy packages into APT keyring:

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7


Create /etc/apt/sources.list.d/scrapy.list file using the following command:



$ echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list


Update package lists and install the scrapy package:



$ sudo apt-get update && sudo apt-get install scrapy
$ pip install service_identity --timeout 10000


Install pyasn1-0.1.8:

$ wget https://pypi.python.org/packages/source/p/pyasn1/pyasn1-0.1.8.tar.gz#md5=7f6526f968986a789b1e5e372f0b7065
$ tar -zxvf pyasn1-0.1.8.tar.gz
$ cd pyasn1-0.1.8
$ sudo python setup.py install
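The `#md5=...` fragment in the download URLs in this guide carries the checksum published for the tarball, and a plain download does not appear to check it. A minimal sketch of verifying it by hand before unpacking, assuming the file is in the current directory; the helper names `md5_of_file` and `matches_md5` are made up for this illustration:

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # Read fixed-size chunks so large tarballs are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare the file's digest with the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.lower()
```

For the pyasn1 tarball this could be called as matches_md5('pyasn1-0.1.8.tar.gz', '7f6526f968986a789b1e5e372f0b7065'); a False result would suggest a corrupted or altered download.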


2. RHEL 6.4

2.1 Install pip

# wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate
# tar -xzvf pip-1.5.4.tar.gz
# cd pip-1.5.4
# python2.7 setup.py install

2.2 Install Scrapy

# yum install python-devel     (possibly optional; provides the Python headers used when compiling C extensions)
# yum install libxslt-devel    (needed to build lxml)
# pip install scrapy --timeout 10000

Note:


    On Linux, Scrapy needs the dependencies below (Ubuntu resolves them automatically; on RHEL 6 they must be installed by hand):

        lxml => libxslt-devel

        cryptography => libffi-devel


2.3 Install libffi and cryptography

Edit /etc/yum.repos.d/rpmforge.repo so that it reads as follows:

# http://rpmforge.net/
[rpmforge]
name=Red Hat Enterprise $releasever - RPMforge.net - dag
mirrorlist=http://apt.sw.be/redhat/el5/en/mirrors-rpmforge
enabled=1
gpgcheck=0


Then:


# yum install libffi-devel
# pip install cryptography
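Once the packages are installed, a quick import check can confirm that nothing was missed. The following is only a sketch: the helper name `missing_modules` and the `REQUIRED` list are assumptions made for this illustration, built from the packages named in this guide.

```python
import importlib


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Import names of the packages mentioned in this guide (an assumed list;
# adjust it to your own setup).
REQUIRED = ['lxml', 'cryptography', 'pyasn1', 'service_identity', 'scrapy']
```

Running print(missing_modules(REQUIRED)) with the same interpreter that pip installed into should print an empty list when everything is importable.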


3. Example

3.1 Create a spider, stackoverflow.py

#!/usr/bin/python2.7
# -*- coding: UTF-8 -*-
# stackoverflow.py
#
import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Follow the link of every question on the listing page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Emit one item (a dict) per question page.
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
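Two details of the spider are worth a closer look. `response.urljoin` resolves a relative link against the page URL, much like the standard library's `urljoin` (Scrapy may additionally honour an HTML base tag), and `.extract()[0]` raises IndexError when a selector matches nothing. The sketch below illustrates both without Scrapy; the helper `first` and the sample `href` are hypothetical.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2.7


def first(values, default=None):
    """Return the first element of `values`, or `default` if it is empty."""
    return values[0] if values else default


page = 'http://stackoverflow.com/questions?sort=votes'
href = '/questions/1/example-question'   # hypothetical relative link
full_url = urljoin(page, href)           # absolute URL on the same host

# An empty result list no longer raises IndexError:
title = first([], default='')
```

In the spider, first(response.css('h1 a::text').extract()) would yield None instead of failing when the page layout changes. Scrapy releases that provide `extract_first()` on selectors offer the same behaviour built in.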


3.2 Run the spider

$ scrapy runspider stackoverflow.py -o top-ques.json

3.3 Paste the contents of top-ques.json into

http://www.json.cn/

to see what the spider collected.
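The output can also be inspected locally. Below is a minimal sketch, assuming the file holds a JSON array of items shaped like the dict yielded by the spider; the function name `summarize` and the sample data are made up for illustration.

```python
import json


def summarize(items):
    """Return one 'votes  title' line per item, highest vote count first."""
    def vote_count(item):
        # The spider stores votes as text, so convert before sorting.
        return int(item['votes'])

    ordered = sorted(items, key=vote_count, reverse=True)
    return ['%6d  %s' % (vote_count(item), item['title']) for item in ordered]


# Made-up sample in the same shape as the spider's output.
sample = json.loads(
    '[{"title": "Question A", "votes": "12", "tags": ["python"]},'
    ' {"title": "Question B", "votes": "40", "tags": ["linux"]}]'
)
lines = summarize(sample)
```

For the real file, load it with json.load(open('top-ques.json')) and print each line of summarize(...).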

Enjoy!

