HTML Data Extraction
In this article, we will learn how to extract data from HTML tags.
First, define the target URL and a basic request header carrying a User-Agent:
url = 'https://cs.lianjia.com/ershoufang/rs/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36'
}
The complete request below also carries a Referer and a Cookie copied from the browser:
import requests
headers = {
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
"referer": "https://hip.lianjia.com/",
"cookie": "select_city=430100; lianjia_ssid=18717f43-a946-4e0f-a774-7965c8aa73ed; lianjia_uuid=0741e41c-75be-4e7b-9bd0-ee4002203371; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22192752ebbdd1116-0ed7d9ff3e4d3e-26001051-1474560-192752ebbded87%22%2C%22%24device_id%22%3A%22192752ebbdd1116-0ed7d9ff3e4d3e-26001051-1474560-192752ebbded87%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; _jzqckmp=1; _ga=GA1.2.1019355060.1728542865; _gid=GA1.2.344285568.1728542865; hip=vxXEKT3nj90U2y57pSYlx074hLYy_KITMLesLlcpGMh9Slf8YMXZvf6QAv4ZcLnkesWaYhh-nRlvZVlTmOjIT-Y-WHy-FZE7aLT3xJpXyb0QTDqy9NYovzxBro7rK3M3NlC9A0EE8fu1MXkSBUrVvLakvFPxaOW6Z1j6eke9A-aD546zbHzuKX3T4Q%3D%3D; Hm_lvt_46bf127ac9b856df503ec2dbf942b67e=1728542848,1728543853; Hm_lpvt_46bf127ac9b856df503ec2dbf942b67e=1728543853; HMACCOUNT=7AB3E94A75916BE3; _qzja=1.151913534.1728542860751.1728542860751.1728543853745.1728542860751.1728543853745.0.0.0.2.2; _qzjc=1; _qzjto=2.2.0; _jzqa=1.489454228011104260.1728542851.1728542851.1728543854.2; _jzqc=1; _jzqx=1.1728543854.1728543854.1.jzqsr=hip%2Elianjia%2Ecom|jzqct=/.-; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiYmIwMmYyYzU1ODZjMjNhZGJjOGVmZTZmYmEyYzVlODQ0OGRkNGYxYjJjZjliZGY4MDZjMmExOTgwOThkYjVkMzFkOTEwOTc5NTliOGNlMzBmMWNhZGJiN2NhYTY3ZTE0OTQ3NDc2YTg4N2JmYTBiOTRhODJlMTZiYjdmY2UxMjdhMDljY2UxYTY0M2RhMTlhMzQyN2ZlYTc5MTFkZTdkMmY5NzQyZmRjMTRmYTRmNjk0NGNmYmM4ZjYzOTBlMjE4YThhYWQ2ZGUyZTRkZmE5ZjU2OGIxZmJmNzBiZGQzY2E5ZWEyYzEzZmY2ZTMyOTlkOGFkMDUzNDQ0NmNiZTVhZFwiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCI5YzhiMDQ2OVwifSIsInIiOiJodHRwczovL2NzLmxpYW5qaWEuY29tL2Vyc2hvdWZhbmcvcnMvIiwib3MiOiJ3ZWIiLCJ2IjoiMC4xIn0=; _jzqb=1.1.10.1728543854.1; _qzjb=1.1728542860751.2.0.0.0; _ga_4JBJY7Y7MX=GS1.2.1728542865.1.1.1728543871.0.0.0"
}
url = 'https://cs.lianjia.com/ershoufang/rs/'
res = requests.get(url,headers=headers)
print(res.text)
with open('链家.html', 'w', encoding='utf-8') as f:
    f.write(res.text)
Opening the saved file:
Next we parse the HTML with lxml. Installation command:
pip install lxml
from lxml import etree
# To avoid sending repeated requests, the page to parse was saved to a local HTML file in advance.
# To extract data, read the file's contents back as a string.
with open('链家.html', 'r', encoding='utf-8') as f:
    html_code = f.read()
tree = etree.HTML(html_code)
XPath usage:
A single slash (/) selects step by step, starting from the root:
tree.xpath('/html')
print(tree.xpath('/html/head/title'))
A double slash (//) matches the tag anywhere in the document:
print(tree.xpath('//title'))
print(tree.xpath('//div'))
This returns every div tag on the page.
Use text() to read the value inside the matched tags (that is, their text content):
print(tree.xpath('//span/text()'))
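To see the difference between /, //, and text() without needing the saved 链家.html file, here is a minimal self-contained sketch; the markup is made up for illustration:

```python
from lxml import etree

# Hypothetical stand-in page so the queries can run on their own.
html_code = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div><span>first</span></div>
    <div><span>second</span></div>
  </body>
</html>
"""
tree = etree.HTML(html_code)

# /  : absolute path, matched step by step from the root
titles = tree.xpath('/html/head/title/text()')
# // : matches the tag anywhere in the document
spans = tree.xpath('//span/text()')

print(titles)  # ['Demo']
print(spans)   # ['first', 'second']
```

Note that text() always returns a list of strings, even when only one tag matches.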
How do we locate exactly the tag we want?
Method 1: narrow the match by adding parent tags to the path:
print(tree.xpath('//h2/span/text()'))
Method 2:
Syntax: tag[@attribute="value"]
For example, we can write:
print(tree.xpath('//a[@href="https://cs.lianjia.com/ershoufang/104113837527.html"]/text()'))
Here is another example solved with the same attribute syntax.
Code:
print(tree.xpath('//div[@data-price="16495"]/span/text()'))
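The attribute predicate can also be tried on a small self-contained snippet; the hrefs, prices, and link texts below are invented for demonstration:

```python
from lxml import etree

# Hypothetical listing snippet mirroring the structures used above.
html_code = """
<ul>
  <li><a href="/house/1.html">Bright 3-room flat</a></li>
  <li><a href="/house/2.html">Renovated 2-room flat</a></li>
  <li><div data-price="16495"><span>16,495</span> per sqm</div></li>
</ul>
"""
tree = etree.HTML(html_code)

# tag[@attribute="value"] keeps only tags whose attribute matches exactly
link_text = tree.xpath('//a[@href="/house/2.html"]/text()')
price = tree.xpath('//div[@data-price="16495"]/span/text()')

print(link_text)  # ['Renovated 2-room flat']
print(price)      # ['16,495']
```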
Getting an attribute's value from a tag
Syntax: tag/@attribute
For example, suppose a div has the class name info clear (class="info clear") and contains an a tag, and we want the value of that a tag's href attribute:
print(tree.xpath('//div[@class="info clear"]//a/@href'))
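A runnable sketch of the tag/@attribute form, using a made-up URL in place of a real listing:

```python
from lxml import etree

# Hypothetical snippet with the same class structure as the real page.
html_code = """
<div class="info clear">
  <div class="title"><a href="https://example.com/104113837527.html">Listing</a></div>
</div>
"""
tree = etree.HTML(html_code)

# tag/@attribute returns the attribute's value instead of the tag's text
hrefs = tree.xpath('//div[@class="info clear"]//a/@href')
print(hrefs)  # ['https://example.com/104113837527.html']
```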
That covers the XPath basics, so let's move on to a case study:
The case study uses the second-hand housing site we opened earlier. We will use XPath to scrape data from the site; the goal is to extract each listing's title and total price.
Code:
from lxml import etree
# To avoid sending repeated requests, the page to parse was saved to a local HTML file in advance.
# To extract data, read the file's contents back as a string.
with open('链家.html', 'r', encoding='utf-8') as f:
    html_code = f.read()
tree = etree.HTML(html_code)
lis = tree.xpath('//ul[@class="sellListContent"]/li')  # the 30 listing blocks
for li in lis:
    # On each pass, li is one listing block, and li.xpath matches
    # only inside that li tag.
    # The leading . means "the current tag".
    title = li.xpath('.//div[@class="title"]/a/text()')[0]
    price = li.xpath('.//div[@class="totalPrice totalPrice2"]//text()')
    # [' ', '220', '万'] --> 220万
    price = ''.join(price)
    # print the title and total price
    print(title, price)
Result:
Hands-on exercise:
Try it yourself first; don't jump straight to the answer.
Reference answer:
from lxml import etree
with open('链家.html', 'r', encoding='utf-8') as f:
    html_code = f.read()
tree = etree.HTML(html_code)
info_ershoufang = tree.xpath('//div[@class="info clear"]')
print(info_ershoufang)
for i in info_ershoufang:
    title = i.xpath('.//div[@class="title"]//text()')
    print("Title:", ''.join(title))
    address = i.xpath('.//div[@class="positionInfo"]//text()')
    print("Location:", ''.join(address).replace(" ", ""))
    house_info = i.xpath('.//div[@class="houseInfo"]//text()')
    print("Details:", ''.join(house_info).replace("|", "、").replace(" ", ""))
    unit_price = i.xpath('.//div[@class="unitPrice"]//text()')
    print("Unit price:", ''.join(unit_price))
    totalPrice = i.xpath('.//div[@class="totalPrice totalPrice2"]//text()')
    print("Total price:", ''.join(totalPrice).replace(" ", ""))
    print()
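The key idea in both loops above is the leading dot in .//: it restricts each query to the current listing block, whereas // would search the whole page every time. A minimal sketch with two made-up listings shows the pattern end to end:

```python
from lxml import etree

# Hypothetical two-listing page; names and prices are invented.
html_code = """
<ul class="sellListContent">
  <li>
    <div class="title"><a>Flat A</a></div>
    <div class="totalPrice totalPrice2"><span>220</span>万</div>
  </li>
  <li>
    <div class="title"><a>Flat B</a></div>
    <div class="totalPrice totalPrice2"><span>180</span>万</div>
  </li>
</ul>
"""
tree = etree.HTML(html_code)

rows = []
for li in tree.xpath('//ul[@class="sellListContent"]/li'):
    # .// searches only inside the current li, so each row stays paired
    title = li.xpath('.//div[@class="title"]/a/text()')[0]
    price = ''.join(li.xpath('.//div[@class="totalPrice totalPrice2"]//text()'))
    rows.append((title, price))

print(rows)  # [('Flat A', '220万'), ('Flat B', '180万')]
```

If the dot were dropped, every iteration would return the titles and prices of all listings, mixing the rows together.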