Python Development Examples (3) Web Crawlers: Writing a Crawler to Extract Data from a Website - CFANZ Programming Community

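A web crawler is a program that extracts data from web pages. When you use one, comply with the site's terms of use and respect its robots.txt file. In this example, we will use Python's requests library to fetch the page content and the BeautifulSoup library to parse it and extract data.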
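
Respecting robots.txt can be checked in code before any page is requested. The following is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content, the `example.com` URLs, the `my-crawler` user agent, and the helper names `build_parser` and `is_allowed` are all hypothetical and chosen so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "my-crawler") -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/news"))          # True
print(is_allowed(parser, "https://example.com/private/data"))  # False
```

Against a live site, you would instead call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`, and skip any URL for which `can_fetch` returns False.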

First, make sure the requests and beautifulsoup4 libraries are installed. If they are not, install them with the following commands:

pip install requests
pip install beautifulsoup4

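Next, we will write a simple web crawler that extracts data from a single page. Here we extract the titles of the top news stories as an example: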

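import requests
from bs4 import BeautifulSoup

def get_top_news():
    """Return a list of front-page story titles, or None if the request fails."""
    url = "https://news.ycombinator.com/"  # Hacker News is used as the example; you can swap in another site

    try:
        # A timeout keeps the script from hanging if the server never responds
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    except requests.exceptions.RequestException as e:
        print(f"Network request error: {e}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    # Hacker News no longer uses the "storylink" class; each title link now sits
    # inside <span class="titleline">. Adjust the selector if the markup changes again.
    news_titles = []
    for item in soup.select(".titleline > a"):
        news_titles.append(item.get_text())

    return news_titles

if __name__ == "__main__":
    top_news = get_top_news()
    if top_news:
        print("Top news titles:")
        for index, title in enumerate(top_news, 1):
            print(f"{index}. {title}")
    else:
        print("Could not retrieve the top news titles.")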

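In this example, we used Hacker News (https://news.ycombinator.com/) as the target site and extracted its top news titles. You can point the `url` variable at whichever site you want to scrape, but you will also need to change the CSS selector to match that site's HTML structure.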

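The code above prints the list of top news titles from the site. Make sure you use this crawler lawfully, avoid placing excessive load on the target site, and follow the site's terms of use and policies.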
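
One simple way to avoid overloading a site is to pause between consecutive requests. The sketch below illustrates the idea; `fetch_politely`, its parameters, and the stand-in `fake_fetch` function are hypothetical names introduced for this example, and the one-second default delay is an assumption rather than a recommended value for any particular site.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url: str) -> str:
    """Stand-in for a real download so the example runs without network access."""
    return f"<html>content of {url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))  # 2
```

In a real crawler, `fetch` could be a small function that wraps `requests.get(url, timeout=10)` and returns `response.text`. Passing `sleep` as a parameter is a design choice that lets the delay be replaced with a recorder in tests.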
