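Building a WeChat Official Account Crawler in Python

WeChat Official Accounts form one of the largest content platforms in China and host a vast number of articles. This article shows how to build a crawler for them in Python so that you can collect and analyze account content.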

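I. Challenges of Crawling WeChat Official Accounts

Compared with crawling ordinary web pages, crawling Official Accounts poses several distinct challenges:

WeChat's anti-crawling measures are strict
Official Account articles have no public listing page
Dynamically loaded content has to be handled
You need to simulate a login or call an interface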

II. Ways to Implement a WeChat Official Account Crawler

1. Via Sogou WeChat Search

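import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_article_links(public_name):
    """Return article links from Sogou WeChat search results for a query."""
    url = "https://weixin.sogou.com/weixin"
    # Sogou uses type=2 for article search and type=1 for account search;
    # passing the query via params lets requests URL-encode it
    params = {'type': '2', 'query': public_name}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    }
    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Result links may be relative, so resolve them against the page URL
    links = [urljoin(response.url, a['href']) for a in soup.select('div.txt-box h3 a[href]')]
    return links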
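
The function above returns an empty list both when a query has no results and when Sogou serves a verification page instead of results. Here is a minimal sketch for telling the two apart. It assumes that blocked requests are redirected to a URL whose path contains "antispider"; that is an assumption based on commonly observed behavior, not something Sogou documents, and the name is_blocked is hypothetical.

```python
from urllib.parse import urlparse


def is_blocked(final_url):
    """Guess whether a request ended on a verification page.

    Assumption: a blocked request is redirected to a path that contains
    "antispider". Adjust the marker if the site behaves differently.
    """
    return "antispider" in urlparse(final_url).path.lower()
```

With requests, the URL after redirects is available as response.url, so a caller could check is_blocked(response.url) before parsing and pause or switch IP when it returns True.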

2. Using the WeChat Official Account API (authorization required)

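import requests

def get_official_account_articles(token, fakeid, cookie):
    """List an account's articles through the mp.weixin.qq.com backend."""
    # The list_ex action with a fakeid is the request shape used by the
    # mp.weixin.qq.com admin console, not a documented api.weixin.qq.com
    # endpoint. token and cookie come from a logged-in admin session.
    url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
    params = {
        "token": token,
        "lang": "zh_CN",
        "f": "json",
        "ajax": "1",
        "action": "list_ex",
        "begin": "0",
        "count": "5",
        "fakeid": fakeid,
        "type": "9",
        "query": ""
    }
    headers = {"Cookie": cookie}
    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()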
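
The JSON returned by the function above still has to be unpacked. The sketch below flattens it into simple records. It assumes the articles arrive under an "app_msg_list" key with "title", "link" and "create_time" (Unix seconds) fields; these field names are assumptions based on commonly reported responses and should be checked against a real payload. The name extract_articles is hypothetical.

```python
from datetime import datetime, timezone


def extract_articles(payload):
    """Flatten a list response into title/link/published records."""
    articles = []
    # Assumption: entries live under "app_msg_list"; a missing key yields [].
    for item in payload.get("app_msg_list", []):
        created = item.get("create_time")
        published = None
        if created is not None:
            # Assumption: create_time is a Unix timestamp in seconds.
            published = datetime.fromtimestamp(created, tz=timezone.utc).isoformat()
        articles.append({
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "published": published,
        })
    return articles
```

Returning an empty list when the key is missing keeps the caller's loop simple, but it also hides error responses, so it may be worth logging the raw payload when nothing is extracted.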

3. Using the Third-Party Library WechatSogou

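from wechatsogou import WechatSogouAPI

ws_api = WechatSogouAPI()
# The call returns a dict: account info under 'gzh', the article list under 'article'
result = ws_api.get_gzh_article_by_history('account name')
for article in result['article']:
    print(article['title'], article['content_url'])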

III. A Complete Crawler Example

Below is a fairly complete implementation of a WeChat Official Account crawler:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
import random

class WeChatSpider:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
            'Cookie': 'your-cookie-here'
        }
        self.session = requests.Session()
        # Attach the headers once so every request through the session sends them
        self.session.headers.update(self.headers)

    @staticmethod
    def _text(node):
        """Return a tag's stripped text, or '' when the selector matched nothing."""
        return node.get_text().strip() if node else ''

    def search_public_account(self, name):
        """Search for an official account by name."""
        url = "https://weixin.sogou.com/weixin"
        params = {
            'type': '1',
            'query': name,
            'ie': 'utf8'
        }
        response = self.session.get(url, params=params, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        accounts = []
        for item in soup.select('.news-list2 li'):
            link_tag = item.select_one('a[href]')
            if link_tag is None:
                continue  # skip entries without a usable link
            account = {
                # Assumption: the account title may sit under p.tit rather than h3, so try both
                'name': self._text(item.select_one('.txt-box .tit, .txt-box h3')),
                # '微信号：' is the on-page "WeChat ID:" label
                'wechat_id': self._text(item.select_one('.txt-box .info label')).replace('微信号：', ''),
                # Result links may be relative, so resolve them against the page URL
                'link': urljoin(response.url, link_tag['href'])
            }
            accounts.append(account)
        return accounts

    def get_article_list(self, account_link):
        """Fetch the article list of an official account."""
        # The selectors below depend on the current page layout and may need updating
        response = self.session.get(account_link, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = []
        for item in soup.select('.news-list li'):
            title_tag = item.select_one('.txt-box h3 a[href]')
            if title_tag is None:
                continue  # skip entries without a title link
            date_tag = item.select_one('.txt-box .s-p')
            article = {
                'title': self._text(title_tag),
                'link': urljoin(response.url, title_tag['href']),
                'date': date_tag.get('t') if date_tag else None
            }
            articles.append(article)
        return articles

    def get_article_content(self, article_link):
        """Fetch the content of one article."""
        # Normalize WeChat article links from http to https
        if article_link.startswith('http://mp.weixin.qq.com'):
            article_link = article_link.replace('http://', 'https://', 1)

        response = self.session.get(article_link, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')

        content = {
            'title': self._text(soup.select_one('#activity-name')),
            'author': self._text(soup.select_one('#meta_content > span.rich_media_meta.rich_media_meta_text')),
            'publish_time': self._text(soup.select_one('#publish_time')),
            'content': self._text(soup.select_one('#js_content'))
        }
        return content

    def run(self, public_name):
        """Run the crawler."""
        # Search for the account
        accounts = self.search_public_account(public_name)
        if not accounts:
            print(f"Account not found: {public_name}")
            return

        # Take the first matching account
        account = accounts[0]
        print(f"Found account: {account['name']} ({account['wechat_id']})")

        # Fetch the article list
        articles = self.get_article_list(account['link'])
        print(f"Found {len(articles)} articles")

        # Fetch the content of each article
        for article in articles:
            try:
                print(f"Fetching: {article['title']}")
                content = self.get_article_content(article['link'])
                # Code that saves the content can be added here
                time.sleep(random.uniform(1, 3))  # random delay to reduce the risk of being blocked
            except Exception as e:
                print(f"Failed to fetch article: {e}")
                continue

if __name__ == '__main__':
    spider = WeChatSpider()
    spider.run('人民日报')  # People's Daily
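
The loop in run gives up on an article after a single failure. One possible refinement is to retry with exponential backoff, sketched below. The helper name fetch_with_retry and its parameters are hypothetical, and the delay values are illustrative rather than tuned for WeChat or Sogou.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay * 2^(attempt - 1) seconds plus up to 1 s of jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))
```

Inside run, the call could become content = fetch_with_retry(self.get_article_content, article['link']). The sleep parameter exists only so the waiting can be replaced in tests.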

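IV. Things to Note

Comply with laws and regulations: crawling Official Account content must comply with the Cybersecurity Law and related rules, and the data must not be used for illegal purposes
Respect copyright: the content you collect remains the original authors' work and must not be used commercially
Control the request rate: set a reasonable interval between requests to avoid putting excessive load on the servers
Handle anti-crawling measures: WeChat's anti-crawling measures are strong, so you may have to deal with CAPTCHAs, IP bans and similar problems
Data storage: store the crawled data sensibly, preferably in a database such as MySQL or MongoDB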
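
As an illustration of the storage point, here is a minimal sketch that uses SQLite from the standard library as a stand-in for MySQL or MongoDB, so that it runs without a database server. The table layout and the function names open_store and save_article are assumptions; the columns mirror the dictionary returned by get_article_content.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the articles table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "link TEXT PRIMARY KEY, title TEXT, author TEXT, "
        "publish_time TEXT, content TEXT)"
    )
    return conn


def save_article(conn, link, content):
    """Insert one article, overwriting any earlier row with the same link."""
    with conn:  # commits on success, rolls back if the statement fails
        conn.execute(
            "INSERT OR REPLACE INTO articles "
            "(link, title, author, publish_time, content) VALUES (?, ?, ?, ?, ?)",
            (link, content.get("title"), content.get("author"),
             content.get("publish_time"), content.get("content")),
        )
```

Using the link as the primary key means that crawling the same article twice updates the existing row instead of creating a duplicate. Pass a file path to open_store to keep the data between runs.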

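V. Advanced Features

Scheduled crawling: use a library such as APScheduler to run the crawler on a schedule
Multithreaded crawling: use multiple threads to improve crawling throughput
Data cleaning: clean and analyze the crawled content
Sentiment analysis: run sentiment analysis on the article text
Visualization: use a library such as PyEcharts to present the analysis results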
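
The multithreading item can be sketched with the standard library's thread pool. The helper name fetch_all is hypothetical, and fetch stands for any function that takes a link and returns its content, such as get_article_content. A small worker count is used on purpose, because more threads also mean more requests per second against the site.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(fetch, links, max_workers=4):
    """Fetch links concurrently; return (results, errors) dicts keyed by link."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the link it was submitted for.
        futures = {pool.submit(fetch, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                results[link] = future.result()
            except Exception as exc:
                # Keep going: one failed article should not stop the others.
                errors[link] = exc
    return results, errors
```

If fetch relies on a shared requests.Session, giving each worker its own session is the more cautious choice, since sharing one session across threads is not clearly guaranteed to be safe.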

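VI. Summary

Building a WeChat Official Account crawler means overcoming several technical hurdles. This article covered several approaches and one fairly complete example. In practice you will still need to adapt and tune the code to your own requirements. Always comply with the relevant laws, regulations and platform rules when running a crawler.

I hope this article helps you build your own WeChat Official Account crawler and collect useful data!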
