The page content is generated dynamically from JSON. How do I crawl it?


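I'm using Scrapy. Crawling the page URL only returns the site's overall frame; the actual content is generated dynamically from JSON. If I request the JSON URL directly, I get an "illegal request" error. How should I crawl a site like this?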

JSON

URL

generate

10 replies • 2014-10-31 09:00:07 +08:00

cdxem713

Oct 30, 2014

Does the JSON URL expect POST data?

ljcarsenal

Oct 30, 2014

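@cdxem713 I've found the cause now. When requesting the JSON, the HTTP Referer header has to match the URL of the frame page. So now the question is: how do I set headers dynamically?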

xunyu

Oct 30, 2014

Heh, scrapy + ghost.py

cdxem713

Oct 30, 2014

@ljcarsenal I haven't used Scrapy, so I'm not sure, but you can usually set headers, typically as key-value pairs. Try setting it to the site's home page URL.

fxbird

Oct 30, 2014

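Try PhantomJS. It can fetch the page source and lets you manipulate the DOM with JS. I'm just learning it myself: I used it to change a form's POST submission to GET, but it always jumps to a different page.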

Oct 30, 2014

JSON is great, you don't even have to structure the data yourself...

pynix

Oct 30, 2014

You can set that even with urllib.
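A rough sketch of the same idea with only the standard library, assuming Python 3's `urllib.request`. The function names and the example URLs in the usage note are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def build_json_request(json_url, frame_url):
    # Attach the frame page's URL as the Referer header.
    return Request(json_url, headers={"Referer": frame_url})


def fetch_json(json_url, frame_url, timeout=10):
    # Send the request and decode the JSON body (assumes a UTF-8 response).
    with urlopen(build_json_request(json_url, frame_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `fetch_json("http://example.com/api/items.json", "http://example.com/list.html")`, with both URLs swapped for the real ones.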

R4rvZ6agNVWr56V0

Oct 31, 2014

The "illegal request" error is most likely because you didn't send the headers the server expects.

konakona

Oct 31, 2014

Just show us the page and everyone will see what's going on.

zhyu

Oct 31, 2014

Use PhantomJS.