URL:http://www.baidu.com/s?wd=site:www.cnblogs.com
Code:
require 'net/http'
require 'uri'

# Fetch the page at url and print the response body.
def get_html(url)
  uri = URI(url)
  p resp = Net::HTTP.get(uri)
end
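But what I get back is the source of the Baidu home page, not the search results for site:www.cnblogs.com.
Also, are there any good books on network programming in Ruby?
I've only just started with Ruby and don't know where to look for a lot of things (so far I've just been reading the docs on the official site).
I did a simple implementation in PHP: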
<?php
set_time_limit(0);

// Return a random 26-character string of digits and lowercase letters.
function _rand()
{
    $length = 26;
    $chars = "0123456789abcdefghijklmnopqrstuvwxyz";
    $max = strlen($chars) - 1;
    $string = '';
    for ($i = 0; $i < $length; $i++) {
        $string .= $chars[mt_rand(0, $max)];
    }
    return $string;
}

// Generated here but never sent with the request below.
$HTTP_SESSION = _rand();
$HTTP_Server = "www.baidu.com";
$HTTP_URL = "/s?wd=site:www.cnblogs.com";

// Fetch the search page with a browser-like User-Agent and print the body.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://" . $HTTP_Server . $HTTP_URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res = curl_exec($ch);
curl_close($ch);
print_r($res);
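For comparison, the main visible difference between the PHP version and the Ruby one is that the PHP request sends a browser-like User-Agent. Here is a minimal Ruby sketch of the same idea. The helper names browser_user_agent and build_request are made up for illustration, and whether Baidu actually returns search results for such a request has not been verified here.

```ruby
require 'net/http'
require 'uri'

# Browser-like User-Agent string, copied from the PHP example.
# (browser_user_agent is a hypothetical helper name.)
def browser_user_agent
  'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ' \
    '.NET CLR 1.1.4322; .NET CLR 2.0.50727)'
end

# Build a GET request for url that carries the User-Agent header.
# Returns the parsed URI together with the request object.
# (build_request is a hypothetical helper name.)
def build_request(url, user_agent: browser_user_agent)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  [uri, request]
end

# Send the request and return the response body as a string.
def get_html(url)
  uri, request = build_request(url)
  response = Net::HTTP.start(uri.hostname, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.body
end

# Usage (performs a real network call):
#   puts get_html('http://www.baidu.com/s?wd=site:www.cnblogs.com')
```

Keeping the request construction separate from the network call makes it possible to inspect the headers without hitting the server.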
No matter which language you use to crawl it, Baidu's content is no longer this easy to scrape.
Baidu is not the Baidu it used to be. Unless the request passes its various cookie checks, you won't get the page at all. You'd be better off checking whether there is an API. Baidu's front-end code is full of twists and turns precisely to keep its pages from being scraped.
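If cookies really are the obstacle, one thing to try is to request the home page first and send the cookies it sets along with the search request. The sketch below only shows the cookie-handling part. cookie_header is a hypothetical helper, the cookie names in the usage note are invented, and nothing here confirms that Baidu accepts a request built this way.

```ruby
# Turn Set-Cookie header lines into a single Cookie header value.
# Only the leading "name=value" pair of each line is kept; attributes
# such as Path or Expires are dropped.
# (cookie_header is a hypothetical helper name.)
def cookie_header(set_cookie_lines)
  Array(set_cookie_lines)
    .map { |line| line.split(';', 2).first.to_s.strip }
    .reject(&:empty?)
    .join('; ')
end

# Usage with Net::HTTP (performs real network calls):
#   first = Net::HTTP.get_response(URI('http://www.baidu.com/'))
#   request = Net::HTTP::Get.new(URI('http://www.baidu.com/s?wd=site:www.cnblogs.com'))
#   request['Cookie'] = cookie_header(first.get_fields('Set-Cookie'))
```

For example, the lines "SID=abc123; Path=/" and "LANG=en; Max-Age=3600" (invented values) would be combined into a single header value containing just the two name=value pairs.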
http://www.baidu.com/s?wd=www.cnblogs.com&rsv_bp=0&ch=&tn=19045005_5_pg&bar=&rsv_spt=3&ie=utf-8&rsv_n=2&rsv_sug3=1&rsv_sug4=57&rsv_sug2=0&inputT=635
OP, you probably only get the results back if you send that whole pile of parameters, right?
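To see exactly what that long URL sends, and to experiment with which parameters are actually needed, the query string can be taken apart and rebuilt with Ruby's standard URI module. This is only a sketch: query_params and with_params are hypothetical helper names, and which parameters Baidu really requires is not something this code can tell you.

```ruby
require 'uri'

# Parse the query string of url into a Hash of name => value.
# Parameters with an empty value (such as "ch=") map to "".
# (query_params is a hypothetical helper name.)
def query_params(url)
  URI.decode_www_form(URI(url).query.to_s).to_h
end

# Return url with its query string replaced by the given params.
# Values are form-encoded, so ":" becomes "%3A".
# (with_params is a hypothetical helper name.)
def with_params(url, params)
  uri = URI(url)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Usage:
#   params = query_params(long_url)   # long_url is the URL quoted above
#   params.each { |name, value| puts "#{name} = #{value.inspect}" }
#   puts with_params('http://www.baidu.com/s', 'wd' => params['wd'], 'ie' => params['ie'])
```

Dropping parameters one at a time and re-requesting the page is a simple way to find out which ones matter.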