When I crawled a web page, I noticed that its page turning was implemented by such a function. After turning the page, the page URL did not change:
<input class="buttonJump" name="goto2" onclick="dirGroupMblogToPage(document.getElementById('dirGroupMblogcp2').value)" type="button" value="Go"/>
</input>
function dirGroupMblogToPage(currentPage){
jQuery.post("dirGroupMblog.action", {"page.currentPage":currentPage,gid:MI.TalkBox.gid}, function(data){$("#talkMain").html(data);
window.scrollTo(0, $css.getY(MI.talkList._body)-65);
});
}
Written a function like this to try to achieve page turning:
def login_page(login_url, content_url, usr_name="******@126.com", passwd="******"):
# 實(shí)現(xiàn)登錄, 返回Session對(duì)象和獲得的頁面
post_data = {'r': 'on', 'u': usr_name, 'p': passwd}
s = requests.Session()
s.post(login_url, post_data)
r = s.get(content_url)
return s, r
def turn_page(s, next_page, content_url):
post_url = "http://sns.icourses.cn/dirGroupMblog.action"
post_headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"X-Requested-With":"XMLHttpRequest"}
post_data = {"page.currentPage": next_page, "gid": 2632}
s.post(post_url, data=post_data, headers = post_headers)
res = s.get(content_url)
return res
But page turning failed after calling turn_page(). How should we solve this problem? Also, what kind of knowledge do we need to learn by ourselves to solve this kind of problem? Thank you!
Following the voice in heart.
Recommended to use selenium
For example, if you need to click the next page button on the interface, or you need to enter the up, down, left, and right keys, the page can be turned, selenium webdriver can do it, and give a reference (I used to crawl the novels on Qidian Chinese website )
Selenium can interact with the page, click, double-click, enter, wait for the page to load (implicit wait, and explicit wait). . . .
from selenium import webdriver
# from selenium.webdriver.common.keys import Keys
#driver = webdriver.PhantomJS(executable_path="D:\phantomjs-2.1.1-windows\bin\phantomjs")
# 我的windows 已配置環(huán)境變量,不需指定 executable_path,使用 Chrome需要對(duì)應(yīng)的瀏覽器以及驅(qū)動(dòng)程序
driver = webdriver.Chrome()
# url 為你需要加載的頁面url
url = 'http://sns.icourses.cn/*****'
# 打開頁面
driver.get(url)
# 在你的例子中,是需要點(diǎn)擊 button ,通過class 屬性獲取到button,然后執(zhí)行單擊 .click()
# 如果需要準(zhǔn)確定位,可以自行搜索其他的 find_
driver.find_element_by_class_name("buttonJump").click()
# selenium webdriver 還有很多其它高級(jí)的用法,自行谷歌,你這個(gè)問題,搜索應(yīng)該是能得到答案的,
There are several situations,
1. The page can be turned by sliding or clicking through the js effect;
2. The page can be turned by clicking on the hyperlink;
You can use the network analysis in Chrome's developer tools to get the result, whether it is an html page or a feedback json rendering.
json is easier to handle, just get the result directly. Ordinary html pages need to use regular matching to page breaks. Then put the link into the pool to be crawled.
/a/11...