Looking for a good Python library for web page noise removal - V2EX


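I need to scrape a lot of news sites, but many of their pages use HTML in very nonstandard ways. How can I automatically extract the main text from pages like these?
I've tried a few libraries, but they all still have some problems. Recommendations welcome.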

Supplement 1 · May 27, 2015

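It seems most people either haven't done anything like this or didn't understand what I meant.
I mean extracting the main body text of a web page, the way Pocket does.
Libraries I've already tried:
* [Goose](https://github.com/grangier/python-goose)
* [python-readability](https://github.com/buriy/python-readability)

Papers I've read:
* [《基于行块分布函数的通用网页正文抽取算法》 (A General Web Page Main-Text Extraction Algorithm Based on the Line-Block Distribution Function)](http://cx-extractor.googlecode.com/files/%E5%9F%BA%E4%BA%8E%E8%A1%8C%E5%9D%97%E5%88%86%E5%B8%83%E5%87%BD%E6%95%B0%E7%9A%84%E9%80%9A%E7%94%A8%E7%BD%91%E9%A1%B5%E6%AD%A3%E6%96%87%E6%8A%BD%E5%8F%96%E7%AE%97%E6%B3%95.pdf)

Has anyone used or read about anything else?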
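
The line-block idea behind the paper listed above can be sketched with only the standard library. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: the function name `extract_main_text` and the parameters `block_size`, `threshold`, and `min_line` are hypothetical, and the default values are guesses that would need tuning for real sites.

```python
import re


def extract_main_text(html, block_size=3, threshold=60, min_line=20):
    """Return the densest run of text lines (simplified line-block heuristic).

    All parameter names and defaults are illustrative assumptions.
    """
    # Remove script/style bodies and comments, then strip the remaining tags.
    # Line breaks in the source are kept, since the heuristic works per line.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    text = re.sub(r"(?s)<[^>]*>", "", html)
    lines = [" ".join(line.split()) for line in text.split("\n")]

    # block_len[i] = number of characters in lines i .. i + block_size - 1.
    n = len(lines)
    block_len = [sum(len(s) for s in lines[i:i + block_size]) for i in range(n)]

    best_start, best_end, best_chars = 0, 0, 0
    i = 0
    while i < n:
        if block_len[i] < threshold:
            i += 1
            continue
        # Extend the run while consecutive blocks stay dense.
        j = i
        while j < n and block_len[j] >= threshold:
            j += 1
        # Trim short lines (menus, blank lines) from both ends of the run.
        start, end = i, j
        while start < end and len(lines[start]) < min_line:
            start += 1
        while end > start and len(lines[end - 1]) < min_line:
            end -= 1
        chars = sum(len(s) for s in lines[start:end])
        if chars > best_chars:
            best_start, best_end, best_chars = start, end, chars
        i = j
    return "\n".join(s for s in lines[best_start:best_end] if s)
```

Because it never builds a DOM, a sketch like this does not care whether tags are closed or nested correctly. The trade-off is that it depends on how the source is broken into lines, so minified pages or pages with long navigation text would likely need different thresholds or a different approach.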

20 replies • 2015-05-27 22:17:12 +08:00

1

shierji

May 27, 2015 via Android

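Well, XPath selection works okay.

The main issue I run into is site redesigns, but I think that can be handled with some extra logic.
Another problem I have is that old links on many news sites are islands, so they can't be reached by crawling recursively back from the current date. Has the OP run into this?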

2

Valyrian

May 27, 2015

Handle each site individually. That's exactly what I did at my last internship, and there's no good way around it.

3

binux

May 27, 2015

Any reasonably decent HTML library these days can cope with nonstandard HTML.
Why not try lxml?
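
To illustrate the point about lenient parsers, here is a minimal sketch that pulls visible text out of badly formed markup. As an assumption made to keep the example dependency-free, it uses the standard library's `html.parser` rather than lxml; the names `TextCollector` and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text fragments, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the list of text fragments found in possibly malformed HTML."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.parts
```

The parser accepts unclosed and misnested tags without raising. Note that lenient parsing only gets the text out; it does not by itself decide which part of the page is the article.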

4

fy

May 27, 2015

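@shierji That's off topic. The OP is asking for a library that automatically analyzes a page and guesses roughly where the main text is, not saying that XPath selection is inaccurate.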

5

fy

May 27, 2015

= = Actually, maybe that's not it either. If so, lxml's XPath really is enough.

6

alexapollo

May 27, 2015

web extractor

7

binux

May 27, 2015

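@fy It really is a "guess roughly where the main text is" kind of library. For something this dependent on strategy, if you want good results, write your own.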

8

TuxcraFt

May 27, 2015

You need some AI black magic... (runs away)

9

zts1993

May 27, 2015 via Android

Hire some interns.

10

simo

May 27, 2015

Take a look at the QQ收藏 (QQ Favorites) web clipper extension. It should be possible to decompile the plugin.

11

nbndco

May 27, 2015

libextract

12

hewigovens

May 27, 2015

Diffbot?

13

xixijun

May 27, 2015 via iPhone

I'm not sure what exactly the OP means by nonstandard.
bootstrap can fill in the missing parts automatically.

14

zog

May 27, 2015

pip install html2text

15

zhicheng

May 27, 2015

https://github.com/rodricios/eatiht

16

13k

May 27, 2015

https://github.com/codelucas/newspaper

17

zztt168

May 27, 2015 via Android

I'm learning to write crawlers. Thanks to the OP and everyone above for sharing!

18

bigbook

May 27, 2015

https://github.com/buriy/python-readability
This one is about the best to use.

What specific problem did you run into?

19

pango

May 27, 2015

What small problem did the OP run into? Please be specific.
I've been using python-goose to crawl YouTube all along and have never had any trouble.

20

shiznet

May 27, 2015 via iPhone

Evernote's Chrome extension can do something similar.
