Web Crawler Basics - CFANZ Programming Community

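What is a web crawler?

A program that automatically fetches information from the internet.

Why crawlers are useful:

They collect the internet data you need.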

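A simple crawler architecture:

Crawler scheduler (monitors how the crawler is running)
URL manager (tracks the URLs that have not been crawled yet and the URLs that have already been crawled)
Page downloader (e.g. urllib2)
Page parser (e.g. BeautifulSoup; extracts the valuable information)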
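
To make the architecture concrete, here is a minimal sketch of how the four parts could fit together in one loop. It is written for Python 3, and everything in it is an illustrative assumption rather than code from this post: the function name crawl, the download and parse callables, and the fake three-page site used so that it runs without network access.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=10):
    """Scheduler: run a minimal crawl loop starting from root_url.

    download(url)    -> HTML string, or None on failure  (the downloader)
    parse(url, html) -> (new_urls, data)                 (the parser)
    """
    to_visit = deque([root_url])   # URL manager: URLs not crawled yet
    seen = {root_url}              # URL manager: every URL ever queued
    results = []

    while to_visit and len(results) < max_pages:
        url = to_visit.popleft()
        html = download(url)
        if html is None:
            continue               # skip pages that failed to download
        new_urls, data = parse(url, html)
        results.append((url, data))
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                to_visit.append(new_url)
    return results


# Hypothetical stand-ins for a real downloader and parser.
FAKE_SITE = {
    "http://example.com/a": ("Page A", ["http://example.com/b", "http://example.com/c"]),
    "http://example.com/b": ("Page B", ["http://example.com/a"]),
    "http://example.com/c": ("Page C", []),
}


def fake_download(url):
    # Pretend the page body is just the URL itself.
    return url if url in FAKE_SITE else None


def fake_parse(url, html):
    title, links = FAKE_SITE[url]
    return links, title
```

Calling crawl("http://example.com/a", fake_download, fake_parse) visits pages a, b and c once each, even though page b links back to page a, because the URL bookkeeping filters out anything already queued.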

URL manager

It can be implemented in three ways: in memory, in a relational database, or in a cache database.
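
As an illustration of the in-memory option, the sketch below keeps two Python sets, one for pending URLs and one for crawled URLs. The class and method names are hypothetical, chosen only for this example.

```python
class UrlManager:
    """In-memory URL manager: tracks pending and already-crawled URLs."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs that have already been crawled

    def add_new_url(self, url):
        # Ignore empty values and URLs already known in either set.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one pending URL to the crawled set and return it.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Sets make the "have I seen this URL before?" check cheap. A relational or cache database would play the same role when the URL list has to survive restarts or be shared between processes.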

Page downloader

urllib2 (a module in the Python 2 standard library), requests (a third-party package)
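
For readers on Python 3, where urllib2 was folded into urllib.request, a downloader might look roughly like the sketch below. The helper names build_request and download, the default User-Agent string, and the 10-second timeout are assumptions made for this example.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def download(url, timeout=10):
    """Return (status_code, body_bytes) for url."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.getcode(), response.read()


# Usage (needs network access):
#     status, body = download("http://www.baidu.com")
#     print(status, len(body))
```

Note that read() returns bytes in Python 3, so the body has to be decoded (for example with body.decode("utf-8")) before it is treated as text.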

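What the page parser does

It extracts valuable data from a page: the HTML string is run through the parser, which yields both the valuable information and new URLs to crawl.
Types of page parser: regular expressions, html.parser, BeautifulSoup (a third-party library), and lxml (a third-party library). Regular expressions do fuzzy pattern matching on the raw text, while the other three do structured parsing.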
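
The difference between fuzzy matching and structured parsing shows up quickly on a small example. The sketch below (Python 3, standard library only) extracts links from the same HTML in both ways. The sample HTML and the names LinkParser, links_by_parser and links_by_regex are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag (structured parsing)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_by_parser(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def links_by_regex(html):
    # Pattern matching on raw text: it knows nothing about tags or comments.
    return re.findall(r'href="([^"]+)"', html)


SAMPLE_HTML = """
<html><body>
  <a href="/page1">First</a>
  <!-- <a href="/hidden">commented out</a> -->
  <link href="/style.css" rel="stylesheet">
  <a href="/page2">Second</a>
</body></html>
"""

print(links_by_parser(SAMPLE_HTML))  # ['/page1', '/page2']
print(links_by_regex(SAMPLE_HTML))   # ['/page1', '/hidden', '/style.css', '/page2']
```

The regular expression also picks up the commented-out link and the stylesheet, because it only sees characters. The parser understands the document structure and returns just the real <a> links.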

Data analysis workflow

About the BeautifulSoup page parser

Official site:
http://www.crummy.com/software/BeautifulSoup/
Documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Chinese documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Installation tutorial (Baidu Jingyan):
http://jingyan.baidu.com/article/afd8f4de6197c834e386e96b.html

Basic usage

Opening a URL:

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means success
print response.getcode()

# Read and print the page body
cont = response.read()
print cont

# Output:
200
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索"

Downloading a page:

#!/usr/bin/python
# coding: utf8
# Three ways to download a page

import urllib2
import cookielib

url = 'http://www.baidu.com'

# Method 1: call urlopen() directly on the URL
print 'No.1: '
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

# Method 2: build a Request and add a header so it looks like a browser
print 'No.2: '
request = urllib2.Request(url)
request.add_header("User-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

# Method 3: use an opener that handles cookies
print 'No.3:'
# Create a cookie container
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Install the opener so that urllib2.urlopen() uses it
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print response3.read()

# Output:
No.1: 
200
99236
No.2: 
200
99236
No.3:
200
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索"
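
The cookie-based method above uses the Python 2 modules urllib2 and cookielib. In Python 3 these live in urllib.request and http.cookiejar, so a rough equivalent could look like the sketch below. The helper names make_cookie_opener and fetch_with_cookies are hypothetical, and the sketch calls opener.open() directly instead of install_opener() so that it does not change global state.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, cookie_jar); the opener stores cookies in cookie_jar."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def fetch_with_cookies(url, timeout=10):
    """Download url and return (status_code, body_bytes, cookie_names)."""
    opener, cookie_jar = make_cookie_opener()
    with opener.open(url, timeout=timeout) as response:
        status, body = response.getcode(), response.read()
    # Any cookies set by the server are now in the jar.
    return status, body, [cookie.name for cookie in cookie_jar]


# Usage (needs network access):
#     status, body, cookie_names = fetch_with_cookies("http://www.baidu.com")
#     print(status, len(body), cookie_names)
```

Reusing the same opener for later requests sends the stored cookies back to the server, which is what makes this approach useful for pages that require a session.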