
Web crawler - How do I crawl the images in a Blog Park (cnblogs) blog post with Python?
某草草
某草草 2017-05-18 10:45:39

I wrote a small piece of code to crawl the images in a cnblogs (Blog Park) blog post. It works for some links, but other links throw an error as soon as the crawl starts. What is the reason?

#coding=utf-8

import urllib
import re
from lxml import etree

# fetch the page at the given address
def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

# fetch the page and build the parse tree
url = "http://www.cnblogs.com/fnng/archive/2013/05/20/3089816.html"
html = getHtml(url)
html = html.decode("utf-8")
tree = etree.HTML(html)

# save the images locally
reg = r'src="(.*?)" alt'
imgre = re.compile(reg)
imglist = re.findall(imgre, html)
x = 0
for imgurl in imglist:
    urllib.urlretrieve(imgurl, '%s.jpg' % x)
    x += 1

As shown in the screenshot, the images are crawled correctly with this URL.

If you change the url to

url = "http://www.cnblogs.com/baronzhang/p/6861258.html"

then an error is reported immediately.
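To make it easier to see which image URL actually triggers the error, the download loop can be wrapped in a try/except that prints the failing src value. This is only a rough diagnostic sketch based on the loop above (Python 2, same as the code in the question):

# -*- coding: utf-8 -*-
# Diagnostic sketch (Python 2): report which src values fail instead of crashing.
import urllib


def download_images(imglist):
    x = 0
    for imgurl in imglist:
        try:
            urllib.urlretrieve(imgurl, '%s.jpg' % x)
            x += 1
        except IOError as e:
            # relative paths such as "/images/xxx.gif" end up here
            print("could not download %r: %s" % (imgurl, e))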

Could someone help explain this? Thank you!


Replies (1)
我想大聲告訴你

The error message is already quite clear. If you look at the page source, the first image the regex matches is a GIF, and its src is a relative path, so it cannot be downloaded, which is why you get an IOError. Even if it could be downloaded, you save everything with a .jpg extension, so the file would not open correctly. All you need to do is add a check and filter it out:

for imgurl in imglist:
    # skip the GIFs, whose relative src paths cannot be downloaded directly
    if "gif" not in imgurl:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1

Look at the check I added. It is only the simplest possible filter, but it keeps your second run from reporting an error, and it should give you the idea.
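If you also want to handle the relative paths mentioned above instead of skipping them, here is a rough sketch (Python 2, same as the code in the question) that resolves each src against the page URL with urljoin and keeps each image's original extension; page_url is assumed to be the blog post URL you are crawling:

# -*- coding: utf-8 -*-
# Sketch: resolve relative src values and keep each image's own extension.
import os
import urllib
import urlparse


def save_images(imglist, page_url):
    x = 0
    for imgurl in imglist:
        # turn a relative path like "/images/xxx.gif" into a full URL
        full_url = urlparse.urljoin(page_url, imgurl)
        # keep the original extension (.gif, .png, .jpg, ...); default to .jpg
        ext = os.path.splitext(urlparse.urlsplit(full_url).path)[1] or '.jpg'
        urllib.urlretrieve(full_url, '%s%s' % (x, ext))
        x += 1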
