
As a beginner I tried writing a crawler with aiohttp, but with too many tasks (over 1000) it fails with "Too many open files". How can I fix this?

    xiongshengyao · Mar 19, 2018 · 9357 views

    Full code

    import time
    import asyncio
    
    import aiohttp
    from bs4 import BeautifulSoup as bs
    
    BASE_URL = "http://www.biqudu.com"
    TITLE2URL = dict()
    CONTENT = list()
    
    
    async def fetch(url, callback=None, **kwarags):
        headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        sem = asyncio.Semaphore(5)  
        with (await sem):
            async with aiohttp.ClientSession() as session: 
                async with session.get(url, headers=headers) as res:
                    page = await res.text()
                    if callback:
                        callback(page, **kwarags)
                    else:
                        return page
    
    
    def parse_url(page):
        soup = bs(page, "lxml")
        dd_a_doc = soup.select("dd > a")
        for a_doc in dd_a_doc:
            article_page_url = a_doc['href']
            article_title = a_doc.get_text()
            if article_page_url:
                TITLE2URL[article_title] = article_page_url
    
    
    def parse_body(page, **kwarags):
        title = kwarags.get('title', '')
        print("{}".format(title))
        soup = bs(page, "lxml")
        content_doc = soup.find("div", id="content")
        content_text = content_doc.get_text().replace('readx();', '').replace('    ', "\r\n")
        content = "%s\n%s\n\n" % (title, content_text)
        CONTENT.append(content)
    
    
    def main():
        t0 = time.time()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(fetch(BASE_URL+"/43_43074/", callback=parse_url))
        tasks = [fetch(BASE_URL + page_url, callback=parse_body, title=title) for title, page_url in TITLE2URL.items()]
        loop.run_until_complete(asyncio.gather(*tasks[:500]))
        loop.close()
        elapsed = time.time() - t0
        print("cost {}".format(elapsed))
    
    
    if __name__ == "__main__":
        main()
    

    Error message

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 797, in _wrap_create_connection
        return (yield from self._loop.create_connection(*args, **kwargs))
      File "/usr/lib/python3.5/asyncio/base_events.py", line 695, in create_connection
        raise exceptions[0]
      File "/usr/lib/python3.5/asyncio/base_events.py", line 662, in create_connection
        sock = socket.socket(family=family, type=type, proto=proto)
      File "/usr/lib/python3.5/socket.py", line 134, in __init__
        _socket.socket.__init__(self, family, type, proto, fileno)
    OSError: [Errno 24] Too many open files
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 58, in <module>
        main()
      File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 52, in main
        loop.run_until_complete(asyncio.gather(*tasks[:500]))
      File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
        return future.result()
      File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
        raise self._exception
      File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
        result = coro.send(None)
      File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 18, in fetch
        async with session.get(url, headers=headers) as res:
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 690, in __aenter__
        self._resp = yield from self._coro
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 267, in _request
        conn = yield from self._connector.connect(req)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 402, in connect
        proto = yield from self._create_connection(req)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 749, in _create_connection
        _, proto = yield from self._create_direct_connection(req)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 860, in _create_direct_connection
        raise last_exc
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 832, in _create_direct_connection
        req=req, client_error=client_error)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 804, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc
    aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.biqudu.com:443 ssl:True [Too many open files]
    

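    The approaches I can think of so far are:

    • Raise Linux's limit on the maximum number of open files
    • Slice the tasks and run them in several batches

    But both of these feel too clumsy. Is there a better way?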
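
For illustration, the second idea in the list above (running the tasks in slices) could look roughly like the sketch below. `run_in_batches` and `fake_fetch` are made-up names, and `fake_fetch` only stands in for the real request so the example runs without network access. `asyncio.run` assumes Python 3.7 or later.

```python
import asyncio


async def run_in_batches(coros, batch_size):
    """Await the coroutines slice by slice, so at most batch_size run at once."""
    results = []
    for start in range(0, len(coros), batch_size):
        batch = coros[start:start + batch_size]
        # A coroutine object does nothing until it is awaited, so the
        # later slices open no sockets while this one is running.
        results.extend(await asyncio.gather(*batch))
    return results


async def fake_fetch(i):
    # Hypothetical stand-in for fetch(url, ...); it only yields to the loop.
    await asyncio.sleep(0)
    return i


async def main():
    coros = [fake_fetch(i) for i in range(12)]
    return await run_in_batches(coros, batch_size=5)


results = asyncio.run(main())
print(results)
```

Each slice has to finish completely before the next one starts, so one slow request holds up the whole batch. That is one reason this approach can feel clumsy compared with a shared limit.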

    15 replies    2018-09-21 10:46:49 +08:00
    #1 · WuMingyu · Mar 19, 2018 via iPhone
    Consider having multiple requests share a single ClientSession. Each ClientSession holds at least one connection.
    #2 · WuMingyu · Mar 19, 2018 via iPhone
    #3 · xiongshengyao (OP) · Mar 19, 2018
    @WuMingyu OK, I'll give it a try.
    #4 · janxin · Mar 19, 2018
    Have you looked into Linux's limit on open file handles?
    #5 · zhengwenk · Mar 19, 2018
    For "Too many open files", shouldn't you just raise the maximum number of open files?
    #6 · xiongshengyao (OP) · Mar 19, 2018
    @janxin I looked into it and it does work, but that kind of fix doesn't feel elegant...
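    #7 · ipwx · Mar 19, 2018
    Actually, these two lines of yours puzzle me:

    sem = asyncio.Semaphore(5)
    with (await sem):

    Are you relying on the Semaphore to limit concurrency? If every fetch creates its own Semaphore, what is actually limiting concurrency?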
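
To make the point above concrete, here is a minimal sketch in which every task waits on one shared Semaphore. The `stats` dictionary, the `crawl` helper, and the `/page/N` paths are made up for the demonstration, and `asyncio.sleep` stands in for the real HTTP request. It also uses `async with sem`, since the `with (await sem)` form was removed in Python 3.9.

```python
import asyncio


async def fetch(sem, url, stats):
    # Every task waits on the same semaphore object, so at most `limit`
    # of them are inside this block at any moment.
    async with sem:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        try:
            await asyncio.sleep(0.01)  # stand-in for the real HTTP request
            return url
        finally:
            stats["running"] -= 1


async def crawl(urls, limit=5):
    sem = asyncio.Semaphore(limit)  # created once and passed to every fetch
    stats = {"running": 0, "peak": 0}  # bookkeeping for the demo only
    results = await asyncio.gather(*(fetch(sem, url, stats) for url in urls))
    return results, stats["peak"]


results, peak = asyncio.run(crawl(["/page/%d" % i for i in range(50)]))
print(len(results), "pages fetched, peak concurrency:", peak)
```

With the Semaphore created inside `fetch`, as in the original code, each task would acquire its own fresh semaphore immediately and nothing would ever wait.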
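    #8 · xiongshengyao (OP) · Mar 19, 2018
    @zhengwenk That treats the symptom, not the cause... This time there are a bit over 1000 links. If the next crawl has 10000, I would have to change the limit again... I've fixed it following reply #1...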
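    #9 · ipwx · Mar 19, 2018
    Also, @WuMingyu's point stands: why does every fetch create its own ClientSession?

    In fact, either a Semaphore or a ClientSession can limit concurrency. A Semaphore limits how many tasks run at the same time, while a ClientSession can cap the maximum number of connections (you do have to pass a parameter for that). Either way, all tasks must share the same object.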
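
A sketch of the ClientSession variant, assuming aiohttp 3.x: one session is created for the whole crawl, and the connection cap is set through the `limit` argument of `aiohttp.TCPConnector`. The helper names `make_session` and `crawl` are hypothetical, and the value 5 simply mirrors the number in the original Semaphore.

```python
import asyncio

import aiohttp

HEADERS = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}


def make_session(limit=5):
    """Build the single session shared by the whole crawl.

    Call this from inside a coroutine so the event loop is already running.
    """
    # `limit` caps how many connections the session keeps open at once.
    connector = aiohttp.TCPConnector(limit=limit)
    return aiohttp.ClientSession(connector=connector, headers=HEADERS)


async def fetch(session, url):
    # The session is passed in instead of being created per request.
    async with session.get(url) as res:
        return await res.text()


async def crawl(urls, limit=5):
    async with make_session(limit) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

It could be driven with something like `asyncio.run(crawl([BASE_URL + "/43_43074/"]))`. Requests beyond the cap wait inside the connector for a free connection, so the number of open sockets stays bounded no matter how many tasks are scheduled.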
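    #10 · lfzyx · Mar 19, 2018
    Elegance has nothing to do with it. Every distribution ships a different default open files limit, and on cloud hosts the provider has long since raised it to 65535 or higher.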
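
To see what the limit is on a given machine, the standard-library `resource` module (Unix only) can read it and raise the soft limit for the current process. This is only a sketch: `nofile_limits` and `raise_soft_nofile` are made-up helper names, and an unprivileged process cannot go beyond the hard limit.

```python
import resource


def nofile_limits():
    """Return (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Raise the soft limit toward target, never past the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
```

Calling `raise_soft_nofile(4096)` at the start of the crawler would affect only that process. It is the programmatic counterpart of `ulimit -n` in the shell.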
    #11 · xiongshengyao (OP) · Mar 19, 2018
    @ipwx
    @lfzyx
    @zhengwenk
    @janxin
    @WuMingyu
    Thanks, everyone. I now have an idea of how to solve this. Thank you all!
    #12 · CSM · Mar 19, 2018 via Android
    The replies above are right: the aiohttp documentation says one ClientSession per application is enough. You can pass the session to fetch as a parameter.
    #13 · bestehen · Jun 17, 2018
    @xiongshengyao Your code only runs 500 tasks here, so where does 1000 come from? *tasks[:500]))
    #14 · handan · Sep 20, 2018
    May I ask whether you ended up finding a good solution to this problem?
    #15 · xiongshengyao (OP) · Sep 21, 2018
    @handan Reply #1