html - regular expression python crawler

怪我咯 2017-06-22 11:51:19


import re
import urllib.request

req = urllib.request.urlopen('http://search.jd.com/Search?k...')

req
Out[3]: <http.client.HTTPResponse at 0x52bf6d8>

buf = req.read()

buf = buf.decode('utf-8')

urllist = re.findall(r'//img. .png',buf)
This works as expected and returns the image URLs ending in .png.
urllist = re.findall(r'//img. .jpg ',buf)
This also works as expected for URLs ending in .jpg.
urllist = re.findall(r'//img. .(png|jpg)',buf)
But this returns only the file extensions, like this:
'.jpg',
'.jpg',
'.png',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
Why is this?


All replies (2)

阿神 2017-06-22 11:53:19 2 floor

This comes from how re.findall treats groups. When the pattern contains no capturing group, re.findall returns the full text of every match. When the pattern contains a capturing group (), it returns only what that group captured, which is why you see a list of extensions instead of links. To get the full links back, wrap the whole pattern in () so the entire link is captured, and write the alternation as (?:png|jpg). That part only needs to match png or jpg, not capture it, so it should use the non-capturing group syntax.

# Modified code (the dot before the extension is escaped so it matches a literal ".")
urllist = re.findall(r'(//img.+?\.(?:png|jpg))',buf)
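The behaviour described above can be checked on a small string without fetching the page. This is only a sketch: the `html` sample and its `example.com` hosts are made up for illustration and stand in for the decoded `buf`.

```python
import re

# Hypothetical sample standing in for the decoded page (buf); one link per line.
html = (
    '<img src="//img10.example.com/a/1.jpg">\n'
    '<img src="//img11.example.com/b/2.png">'
)

# No group in the pattern: findall returns the full text of each match.
full = re.findall(r'//img.+?\.png', html)

# One capturing group: findall returns only what the group captured.
ext_only = re.findall(r'//img.+?\.(png|jpg)', html)

# Non-capturing group: the alternation still works, full matches come back.
links = re.findall(r'//img.+?\.(?:png|jpg)', html)

print(full)
print(ext_only)
print(links)
```

With a single capturing group, re.findall returns a list of that group's text; with several groups it returns a list of tuples. Making the group non-capturing avoids both cases.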

For more about capturing and non-capturing groups, see the linked reference.


代言 2017-06-22 11:53:19 1 floor

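(png|jpg) is a capturing group, so re.findall returns only the text it captures. Note that [png|jpg] is not an equivalent replacement: it is a character class that matches a single character (p, n, g, | or j). Use a non-capturing group, (?:png|jpg), instead:

import re
import requests

r = requests.get('http://search.jd.com/Search?keyword=%E6%96%87%E8%83%B8&enc=utf-8&wq=%E6%96%87%E8%83%B8&pvid=4anf50si.fbrh68')
# Lazy .+? stops at the first extension; \. matches a literal dot
print(re.findall(r'//img.+?\.(?:png|jpg)', r.text))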
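To make the difference between a character class and a group concrete, here is a small comparison on one line of text. It is a sketch under assumptions: the `text` sample and its `example.com` hosts are invented for illustration and are not taken from the page above.

```python
import re

# Hypothetical single-line sample containing two image links.
text = 'a //img10.example.com/1.jpg b //img11.example.com/2.png c'

# [png|jpg] is a character class: exactly one of p, n, g, | or j.
# Combined with a greedy .+, the match runs from the first "//img"
# to the last such character on the line, swallowing both links.
greedy_class = re.findall(r'//img.+.[png|jpg]', text)

# A lazy .+? with a non-capturing alternation stops at each extension,
# so every link is returned separately.
lazy_group = re.findall(r'//img.+?\.(?:png|jpg)', text)

print(greedy_class)
print(lazy_group)
```

On real HTML, where many links can share a line, the greedy version tends to return a few very long strings rather than one entry per image.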
