How to Automate Filtering Second-Hand Housing Listings with Python's requests and BeautifulSoup Libraries

Channel: Rental and Sales Listings | Date: 2025-02-11 00:12:37

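Hi everyone! I'm 鱼子酱, a Python developer who loves data analysis. Today I want to share how to use Python's requests and BeautifulSoup libraries to automate filtering second-hand housing listings. With a simple scraper we can collect listing data in bulk and filter it by whatever criteria we need. Pretty neat, right? Let's dig in!

1. Installation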

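First, install the required Python libraries:

pip install requests
pip install beautifulsoup4
pip install pandas
pip install openpyxl

Make sure you are running Python 3.6 or later; developing inside a virtual environment is recommended. The openpyxl package is what pandas uses to write .xlsx files.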

2. Basic Concepts

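We mainly rely on two powerful libraries:

requests: sends HTTP requests and retrieves the page content

BeautifulSoup: parses the HTML and extracts the data we need

Together they work like a smart assistant that we send out to browse web pages and take notes for us!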
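As a minimal sketch of how the parsing half works, the snippet below runs BeautifulSoup over a small hand-written HTML string. The markup, the class names (`house-item`, `price`), and the helper `parse_titles_and_prices` are made up for illustration; in a real script the string would come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a downloaded page (hypothetical markup).
SAMPLE_HTML = """
<div class="house-item"><h3> Two-bedroom near the park </h3><span class="price">300</span></div>
<div class="house-item"><h3>Studio downtown</h3><span class="price">150</span></div>
"""

def parse_titles_and_prices(html):
    """Return a list of (title, price_text) tuples found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for item in soup.find_all('div', class_='house-item'):
        # get_text(strip=True) removes surrounding whitespace from the tag text
        title = item.find('h3').get_text(strip=True)
        price = item.find('span', class_='price').get_text(strip=True)
        results.append((title, price))
    return results

if __name__ == '__main__':
    for title, price in parse_titles_and_prices(SAMPLE_HTML):
        print(title, price)
```

Note that the extracted price is still text at this point; turning it into a number is a separate cleaning step.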

3. Code Example

Let's start with a simple example:

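import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_text(item, name, class_name=None):
    """Return the stripped text of the first matching tag, or None if it is missing."""
    tag = item.find(name, class_=class_name) if class_name else item.find(name)
    return tag.get_text(strip=True) if tag else None

def fetch_house_info(url):
    """Download one listing page and return its listings as a DataFrame."""
    # Set a request header so the request looks like it comes from a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    # Send the request; stop early on a timeout or an HTTP error status
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the listing fields (the tag and class names depend on the target site)
    houses = []
    for item in soup.find_all('div', class_='house-item'):
        houses.append({
            'title': get_text(item, 'h3'),
            'price': get_text(item, 'span', 'price'),
            'area': get_text(item, 'span', 'area'),
            'location': get_text(item, 'div', 'location'),
        })

    # Naming the columns keeps them present even when no listings are found
    return pd.DataFrame(houses, columns=['title', 'price', 'area', 'location'])

# Save the data (writing .xlsx files requires the openpyxl package)
def save_to_excel(df, filename):
    df.to_excel(filename, index=False)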

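4. Usage Tips

Keep the following points in mind when running the scraper:

Request interval: pause between requests so your IP does not get blocked

Exception handling: catch network errors and parsing failures

Data cleaning: handle special characters and outliers

Crawl speed: use multithreading to fetch pages faster, while keeping the overall request rate polite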
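The four tips above can be sketched roughly as follows. This is only one possible arrangement, not a definitive implementation: the helper names (`fetch_html`, `clean_number`, `crawl`) and the default values for retries, pause, and worker count are assumptions chosen for illustration.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

def fetch_html(url, retries=3, pause=1.0, timeout=10):
    """Fetch a page with a pause before each attempt; return None if all attempts fail."""
    for attempt in range(1, retries + 1):
        time.sleep(pause)  # request interval: wait before every request
        try:
            response = requests.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            # Covers connection errors, timeouts, and HTTP error statuses
            print(f'Attempt {attempt} for {url} failed: {exc}')
    return None

def clean_number(text):
    """Extract the first number from text such as ' 89.5㎡ '; return None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

def crawl(urls, fetch=fetch_html, max_workers=3):
    """Fetch several pages with a small thread pool, keeping the input order."""
    # Each worker still pauses before its own requests, so a small pool
    # speeds things up without removing the request interval entirely.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Passing the fetch function as a parameter makes `crawl` easy to try out without touching the network, for example `crawl(['a', 'b'], fetch=str.upper)`.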

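5. Practical Application

Below is a complete example that fetches the listings, filters them by the given criteria, and saves the result: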

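def filter_houses(df, max_price, min_area, locations):
    """Return the listings that match the criteria, sorted by price."""
    # Scraped fields are text and may carry units (e.g. '300万'), so pull out the number first
    number = r'(\d+(?:\.\d+)?)'
    price = pd.to_numeric(df['price'].astype(str).str.extract(number)[0], errors='coerce')
    area = pd.to_numeric(df['area'].astype(str).str.extract(number)[0], errors='coerce')
    mask = (price <= max_price) & (area >= min_area) & df['location'].isin(locations)
    # Store the numeric values so the sort is numeric rather than alphabetical
    return df[mask].assign(price=price, area=area).sort_values('price')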

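# Usage example
if __name__ == '__main__':
    url = 'https://example.com/houses'
    df = fetch_house_info(url)

    # Set the filter criteria
    filtered_houses = filter_houses(
        df,
        max_price=300,  # maximum price; 300 is assumed to be in units of 10,000 CNY (万元)
        min_area=90,    # minimum area of 90 square meters
        locations=['朝阳', '海淀']  # preferred districts (Chaoyang, Haidian)
    )

    # Save the result
    save_to_excel(filtered_houses, 'filtered_results.xlsx')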
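For the data analysis side, here is a small sketch of one thing you might do with the filtered result: summarize it per district. It assumes the `price` and `area` columns have already been converted to numbers; the function name `summarize_by_location` and the sample values are invented for illustration.

```python
import pandas as pd

def summarize_by_location(df):
    """Return listing count, mean price, and mean unit price for each location."""
    # Assumes 'price' and 'area' are numeric columns
    with_unit = df.assign(unit_price=df['price'] / df['area'])
    return (
        with_unit.groupby('location')
        .agg(
            listings=('price', 'count'),
            avg_price=('price', 'mean'),
            avg_unit_price=('unit_price', 'mean'),
        )
        .round(2)
        .sort_values('avg_unit_price')
    )

if __name__ == '__main__':
    # Made-up sample data, not real listings
    sample = pd.DataFrame({
        'location': ['朝阳', '朝阳', '海淀'],
        'price': [280.0, 300.0, 290.0],
        'area': [100.0, 120.0, 100.0],
    })
    print(summarize_by_location(sample))
```

Sorting by average unit price puts the districts that are cheapest per square meter first, which is often more informative than the total price alone.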