Crawl target: https://tieba.baidu.com/p/4959928798. Viewing the page source in Chrome, I see this snippet:
<a class="pb_nameplate j_nameplate j_self_no_nameplate" href="/tbmall/propslist?category=112&ps=24" data-field='{"props_id":"1120050972","end_time":"1512731564","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b","optional_word":["\u7684","\u4e4b","\u306e"],"pattern":["1","1","1","2","3","3"]}' target="_blank">海贼王的右手</a>
Based on: class="pb_nameplate j_nameplate j_self_no_nameplate"
I wrote a regex: (?<=pb_nameplate\sj_nameplate\sj_self_nameplate)[\s\S]*?(?=)
When I ran it, it simply would not match, so I saved the fetched page to a file to inspect it:
# -*- coding: utf-8 -*-
__author__ = 'duohappy'

import requests


def get_info_from(url):
    # Send a browser-like User-Agent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
    }
    web_data = requests.get(url, headers=headers)
    web_data.encoding = 'utf-8'
    content = web_data.text
    # Dump the HTML exactly as requests received it. Write it as UTF-8
    # explicitly so the platform's default encoding is not used.
    with open('./test.txt', 'w', encoding='utf-8') as f:
        f.write(content)


if __name__ == '__main__':
    url = 'http://tieba.baidu.com/p/4959928798'
    get_info_from(url)
Only then did I notice:
<a class="pb_nameplate j_nameplate j_self_nameplate" href="/tbmall/propslist?category=112&ps=24" data-field='{"props_id":"1120050972","end_time":"1512731564","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b","optional_word":["\u7684","\u4e4b","\u306e"],"pattern":["1","1","1","2","3","3"]}' target="_blank">海贼王的右手</a>
In the saved file, class="pb_nameplate j_nameplate j_self_no_nameplate" has become class="pb_nameplate j_nameplate j_self_nameplate".
What technique is this, or am I doing something wrong?
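Whatever causes the difference, one way to stop depending on the exact class is a pattern that accepts both variants. The following is only a minimal sketch: the two test strings are shortened copies of the snippets above (the data-field attribute is left out for brevity), and the names NAMEPLATE_RE and extract_nameplates are made up for illustration.

```python
import re

# Shortened copies of the two snippets from this post (data-field omitted).
CHROME_HTML = (
    '<a class="pb_nameplate j_nameplate j_self_no_nameplate" '
    'href="/tbmall/propslist?category=112&ps=24" target="_blank">海贼王的右手</a>'
)
REQUESTS_HTML = (
    '<a class="pb_nameplate j_nameplate j_self_nameplate" '
    'href="/tbmall/propslist?category=112&ps=24" target="_blank">海贼王的右手</a>'
)

# (?:no_)? makes the "no_" part optional, so either class variant matches.
# [^>]* skips the remaining attributes up to the end of the opening tag,
# and the capture group takes the link text up to the closing </a>.
NAMEPLATE_RE = re.compile(
    r'class="pb_nameplate\sj_nameplate\sj_self_(?:no_)?nameplate"[^>]*>([^<]*)</a>'
)


def extract_nameplates(html):
    """Return the link text of every nameplate anchor found in html."""
    return NAMEPLATE_RE.findall(html)


if __name__ == '__main__':
    print(extract_nameplates(CHROME_HTML))
    print(extract_nameplates(REQUESTS_HTML))
```

A capture group is used here instead of the lookbehind/lookahead pair because Python's re module only accepts fixed-width lookbehinds, which an optional "no_" would violate. Regexes on HTML stay fragile, so if the markup changes further an HTML parser may be the safer choice.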