Target site
https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=10&type=T
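Looking at the URL, the part that changes from page to page is the start parameter (start=10 in the URL above).
start is the offset of the first book shown on the page. The code below treats each page as 20 books, so the first page is start=0, the second is start=20, and so on.
We request the pages one at a time and collect the data from each.
The loop stops before start=100, so 5 pages are crawled.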
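As a quick check of the paging arithmetic, here is a minimal sketch that only builds the page URLs and makes no network request. The helper name page_urls and the constant PAGE_SIZE are made up for illustration; the step of 20 is an assumption taken from the range(0, 100, 20) loop in the full script.

```python
BASE_URL = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'
PAGE_SIZE = 20  # assumed step, matching range(0, 100, 20) in the full script


def page_urls(pages):
    """Return the listing URLs for the first `pages` pages."""
    return [f'{BASE_URL}?start={start}&type=T'
            for start in range(0, pages * PAGE_SIZE, PAGE_SIZE)]


urls = page_urls(5)
for u in urls:
    print(u)
```

With 5 pages the offsets are 0, 20, 40, 60 and 80, which is why an upper bound of 100 gives 5 requests.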
Once we have the page source, we use a regular expression (the re module) to extract the book titles.
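To see what the pattern captures, the sketch below runs it against a small hand-written HTML snippet. The snippet, the subject ids and the titles are invented for illustration and only imitate the link markup the pattern expects; the real page may differ.

```python
import re

# Hypothetical markup, written by hand to imitate a listing page (not real data)
sample_html = '''
<a href="https://book.douban.com/subject/1000001/" title="Book One">Book One</a>
<a href="https://book.douban.com/subject/1000002/" title="Book Two">Book Two</a>
<a href="https://book.douban.com/tag/other">not a book link</a>
'''

# Raw string so the backslashes reach the regex engine unchanged;
# (.*?) lazily captures everything up to the next double quote
TITLE_RE = re.compile(
    r'<a href="https://book\.douban\.com/subject/\d+/" title="(.*?)"')

titles = TITLE_RE.findall(sample_html)
print(titles)
```

Because the pattern has a single capture group, findall returns a plain list of the captured title strings.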
Once the printed output looks right, save it to a CSV file.
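The layout of the saved file can be previewed without touching the disk. This is a minimal sketch that writes to an in-memory buffer instead of a real file; the titles are made up, and the layout (one row for the page label, one row for that page's titles) follows the full script.

```python
import csv
import io

# Made-up titles standing in for two scraped pages
pages = [['Book One', 'Book Two'], ['Book Three']]

# newline='' leaves the csv module's own line endings untouched
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
for page_no, page_titles in enumerate(pages, start=1):
    writer.writerow([f'第{page_no}页'])  # page label row
    writer.writerow(page_titles)         # all titles of that page in one row

csv_text = buffer.getvalue()
print(csv_text)
```

Each page therefore takes two lines in the file, and the titles of one page end up as columns of a single row.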
The complete code follows.
import requests
import re
import csv
# Open the csv file (append mode; utf-8-sig so Excel displays Chinese correctly)
f = open('豆瓣小说.csv',mode='a',newline='',encoding='utf-8-sig')
# Request headers
headers = {'Referer':'https://read.douban.com/',
'sec-ch-ua': '" Not;A Brand";v="99", "Microsoft Edge";v="103", "Chromium";v="103"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37'}
p=0
# p = re.findall('<a href="/tag/小说\?start=\d+&type=T">\d*</a>')
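# Create the csv writer once, before the loop
writer = csv.writer(f)
# Step start by 20 and stop before 100, i.e. 5 pages
for pg in range(0, 100, 20):
    p = p + 1
    url = f'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={pg}&type=T'
    response = requests.get(url=url, headers=headers)
    html_data = response.text
    # Raw string so that \d reaches the regex engine unchanged
    title = re.findall(r'<a href="https://book.douban.com/subject/\d*/" title="(.*?)"', html_data)
    print(title)
    # Save: one row for the page label, one row for that page's titles
    writer.writerow([f'第{p}页'])
    writer.writerow(title)
# Close the file so everything is flushed to disk
f.close()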
Here is the result.
Thanks for reading.