Python Web Scraping: Summary, Case Studies, and Common Tools

呆萌胖頭魚 2019-12-11 14:08

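Writing a scraper in Python is easy and pleasant: there are plenty of relevant libraries, they are convenient to use, and a dozen or so lines of code are enough for a working scraper. However, sites with anti-scraping measures, sites that load content dynamically with JS, and data collection from apps take more thought, and distributed or high-performance crawlers call for careful design.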

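Common tools for building scrapers in Python

requests: HTTP request library for Python.
pyquery: HTML DOM parsing library for Python with a jQuery-like syntax.
BeautifulSoup: HTML and XML parsing library for Python.
selenium: browser automation and testing framework for Python; also usable for scraping.
phantomjs: headless browser; paired with selenium it can fetch content loaded dynamically by JS.
re: Python's built-in regular expression module.
fiddler: packet capture tool that works as a proxy server; it can also capture traffic from mobile phones.
anyproxy: proxy server that lets you write your own rules to intercept requests or responses; typically used for collecting data from client apps.
celery: distributed task framework for Python; can be used to build distributed crawlers.
gevent: coroutine-based networking library for Python; can be used to build high-performance crawlers.
grequests: asynchronous requests.
aiohttp: asynchronous HTTP client/server framework.
asyncio: Python's built-in asynchronous I/O and event loop library.
uvloop: a very fast event loop that works well with asyncio.
concurrent: Python's built-in package for running tasks concurrently.
scrapy: scraping framework for Python.
Splash: a JavaScript rendering service, roughly a lightweight browser; pages are rendered through its HTTP API, driven by Lua scripts.
Splinter: open-source Python tool for automated web testing.
pyspider: crawler system for Python.

How to approach scraping a page

Can the data be read straight from the HTML? The data is embedded directly in the page's HTML structure.
Is the data rendered into the page dynamically by JS? The data sits inside JS code and is then loaded into the page by JS or rendered via Ajax.
Does the page require authentication? The page can only be accessed after logging in.
Can the data be fetched directly from an API? Some data is available directly through an API, which saves parsing HTML; most APIs return data as JSON.
How can data from client apps be collected? For example, the WeChat app and the WeChat client.

Dealing with anti-scraping measures

Don't overdo it: throttle the crawler's request rate so that you don't bring the site down, which would hurt both sides.
Use proxies to hide your real IP and get around anti-scraping measures.
Make the crawler look like a human user by selectively setting the following HTTP headers:
    Host: www.baidu.com
    Connection: keep-alive
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36
    Referer: /file/2019/12/11/20191211140532_20268.jpg
    Accept-Encoding: gzip, deflate
    Accept-Language: zh-CN,zh;q=0.8
Check the site's cookies; in some cases a request must carry cookies to pass server-side checks.

Case studies

Static page parsing (fetching a WeChat official account article)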

    import re

    import pyquery


    def weixin_article_html_parser(html):
        """
        Parse a WeChat article and return a dict describing it.

        :param html: HTML source of the article
        :return: dict of article fields
        """
        pq = pyquery.PyQuery(html)

        article = {
            "weixin_id": pq.find("#js_profile_qrcode "
                                 ".profile_inner .profile_meta").eq(0).find("span").text().strip(),
            "weixin_name": pq.find("#js_profile_qrcode .profile_inner strong").text().strip(),
            "account_desc": pq.find("#js_profile_qrcode .profile_inner "
                                    ".profile_meta").eq(1).find("span").text().strip(),
            "article_title": pq.find("title").text().strip(),
            "article_content": pq("#js_content").remove('script').text().replace("\r\n", ""),
            "is_orig": 1 if pq("#copyright_logo").length > 0 else 0,
            "article_source_url": pq("#js_sg_bar .meta_primary").attr('href') if pq(
                "#js_sg_bar .meta_primary").length > 0 else '',
        }

        # Use regular expressions to pull values out of the page's inline JS.
        # Each key is a marker searched for in a line; the regexp extracts the value.
        match = {
            "msg_cdn_url": {"regexp": r'(?<=").*(?=")', "value": ""},  # cover image
            "var ct": {"regexp": r'(?<=")\d{10}(?=")', "value": ""},  # publish timestamp
            "publish_time": {"regexp": r'(?<=")\d{4}-\d{2}-\d{2}(?=")', "value": ""},  # publish date
            "msg_desc": {"regexp": r'(?<=").*(?=")', "value": ""},  # article summary
            "msg_link": {"regexp": r'(?<=").*(?=")', "value": ""},  # article link
            "msg_source_url": {"regexp": r"(?<=').*(?=')", "value": ""},  # original source link
            "var biz": {"regexp": r'(?<=")\w{1}.+?(?=")', "value": ""},
            "var idx": {"regexp": r'(?<=")\d{1}(?=")', "value": ""},
            "var mid": {"regexp": r'(?<=")\d{10,}(?=")', "value": ""},
            "var sn": {"regexp": r'(?<=")\w{1}.+?(?=")', "value": ""},
        }
        count = 0
        for line in html.split("\n"):
            for item, value in match.items():
                if item in line:
                    m = re.search(value["regexp"], line)
                    if m is not None:
                        count += 1
                        match[item]["value"] = m.group(0)
                    break
            # Stop scanning once every marker has been matched
            if count >= len(match):
                break

        article["article_short_desc"] = match["msg_desc"]["value"]
        article["article_pos"] = int(match["var idx"]["value"])
        article["article_post_time"] = int(match["var ct"]["value"])
        article["article_post_date"] = match["publish_time"]["value"]
        article["article_cover_img"] = match["msg_cdn_url"]["value"]
        article["article_source_url"] = match["msg_source_url"]["value"]
        article["article_url"] = "https://mp.weixin.qq.com/s?__biz={biz}&mid={mid}&idx={idx}&sn={sn}".format(
            biz=match["var biz"]["value"],
            mid=match["var mid"]["value"],
            idx=match["var idx"]["value"],
            sn=match["var sn"]["value"],
        )

        return article


    if __name__ == '__main__':
        from pprint import pprint
        import requests

        url = ("https://mp.weixin.qq.com/s?__biz=MzI1NjA0MDg2Mw==&mid=2650682990&idx=1"
               "&sn=39419542de39a821bb5d1570ac50a313&scene=0#wechat_redirect")
        pprint(weixin_article_html_parser(requests.get(url).text))

    # {'account_desc': '夜聽，讓更多的家庭越來越幸福。',
    #  'article_content': '文字：安夢 \xa0 \xa0 聲音：劉筱 得到了什麼？又失去了什麼？',
    #  'article_cover_img': '/file/2019/12/11/20191211140532_20272.jpg',
    #  'article_pos': 1,
    #  'article_post_date': '2017-07-02',
    #  'article_post_time': 1499002202,
    #  'article_short_desc': '週日 來自劉筱的晚安問候。',
    #  'article_source_url': '',
    #  'article_title': '【夜聽】走到這裡',
    #  'article_url': 'https://mp.weixin.qq.com/s?__biz=MzI1NjA0MDg2Mw==&mid=2650682990&idx=1&sn=39419542de39a821bb5d1570ac50a313',
    #  'is_orig': 0,
    #  'weixin_id': 'yetingfm',
    #  'weixin_name': '夜聽'}
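The example above fetches the page with a bare requests.get(url). Below is a minimal sketch of how the earlier advice (throttle the request rate, send browser-like headers) might be applied to that call. The names Throttle, polite_get and BROWSER_HEADERS are made up for illustration, and session is assumed to be any object with a requests-style get(url, headers=..., timeout=...) method, such as requests.Session().

```python
import time
from typing import Any, Callable, Dict, Optional

# Browser-like headers taken from the list earlier in the article.
BROWSER_HEADERS: Dict[str, str] = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/59.0.3071.104 Safari/537.36"),
    "Accept": ("text/html,application/xhtml+xml,application/xml;q=0.9,"
               "image/webp,image/apng,*/*;q=0.8"),
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
}


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep just long enough to respect the interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self._min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def polite_get(session: Any, url: str, throttle: Throttle,
               extra_headers: Optional[Dict[str, str]] = None,
               timeout: float = 10.0) -> Any:
    """Wait for the throttle, then GET the URL with browser-like headers."""
    throttle.wait()
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    return session.get(url, headers=headers, timeout=timeout)
```

With the real library this would be called as polite_get(requests.Session(), url, Throttle(2.0)); the two-second interval is an arbitrary choice and should be tuned to what the target site can tolerate.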

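Using phantomjs to parse JS-rendered pages: Weibo search

Some pages rely on complex JS logic, including all kinds of Ajax requests, some of which involve encryption. Working through that JS logic to reproduce the page and get at the data is extremely hard; without a solid JS foundation and familiarity with the various JS frameworks, it is not realistic. Rendering the page the way a browser does and reading the resulting HTML is far easier.

For example, the search results at /file/2019/12/11/20191211140533_20273.jpg are rendered dynamically by JS, so fetching the HTML directly does not return them. We therefore need to execute the page's JS, wait for the page to finish rendering, and then take its HTML for parsing.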

    #!/usr/bin/env python3
    # encoding: utf-8
    import time
    from urllib import parse

    from pyquery import PyQuery
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # webdriver.PhantomJS was removed in Selenium 4, so this example needs Selenium 3.x.


    def weibo_user_search(url: str):
        """Fetch the HTML of the search page through phantomjs."""
        desired_capabilities = DesiredCapabilities.CHROME.copy()
        desired_capabilities["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                                                     "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                                     "Chrome/59.0.3071.104 Safari/537.36")
        desired_capabilities["phantomjs.page.settings.loadImages"] = True
        # Custom headers
        desired_capabilities["phantomjs.page.customHeaders.Upgrade-Insecure-Requests"] = 1
        desired_capabilities["phantomjs.page.customHeaders.Cache-Control"] = "max-age=0"
        desired_capabilities["phantomjs.page.customHeaders.Connection"] = "keep-alive"

        driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs",  # path to the phantomjs binary
                                     desired_capabilities=desired_capabilities,
                                     service_log_path="ghostdriver.log")
        # Timeout for locating elements
        driver.implicitly_wait(1)
        # Timeout for the page to load completely, including rendering and
        # execution of all synchronous and asynchronous scripts
        driver.set_page_load_timeout(60)
        # Timeout for asynchronous scripts
        driver.set_script_timeout(60)

        driver.maximize_window()
        try:
            driver.get(url=url)
            time.sleep(1)
            try:
                # Interact with the page after it opens
                company = driver.find_element_by_css_selector("p.company")
                # perform() is required; without it the queued action never runs
                ActionChains(driver).move_to_element(company).perform()
            except WebDriverException:
                pass
            html = driver.page_source
            pq = PyQuery(html)
            person_lists = pq.find("div.list_person")
            if person_lists.length > 0:
                for index in range(person_lists.length):
                    person_ele = person_lists.eq(index)
                    print(person_ele.find(".person_name > a.W_texta").attr("title"))
            return html
        except Exception as e:  # includes TimeoutException
            print(e)
        finally:
            driver.quit()


    if __name__ == '__main__':
        weibo_user_search(url="/file/2019/12/11/20191211140533_20273.jpguser/%s" % parse.quote("新聞"))

    # 央視新聞
    # 新浪新聞
    # 新聞
    # 新浪新聞客戶端
    # 中國新聞週刊
    # 中國新聞網
    # 每日經濟新聞
    # 澎湃新聞
    # 網易新聞客戶端
    # 鳳凰新聞客戶端
    # 皇馬新聞
    # 網路新聞聯播
    # CCTV5體育新聞
    # 曼聯新聞
    # 搜狐新聞客戶端
    # 巴薩新聞
    # 新聞日日睇
    # 新垣結衣新聞社
    # 看看新聞KNEWS
    # 央視新聞評論
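The script above waits a fixed time.sleep(1) for the JS to finish rendering, which may be too short on a slow page and wasteful on a fast one. One possible alternative, sketched here with a made-up helper named wait_until (it is not part of selenium), is to poll until the rendered content appears or a timeout expires.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]],
               timeout: float = 10.0,
               interval: float = 0.5,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call probe() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = clock() + timeout
    while True:
        result = probe()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        sleep(interval)
```

In the Weibo example this could replace the fixed sleep with something like html = wait_until(lambda: driver.page_source if "list_person" in driver.page_source else None). Selenium also ships its own helper for this purpose, WebDriverWait in selenium.webdriver.support.ui.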

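Logging in with Python to obtain cookies

Some sites are a pain: you usually have to log in before you can get any data. Below is a simple example that logs into a site and obtains its cookies, which can then be used for other requests.

Note that this only covers the case with no CAPTCHA; SMS, image, or email verification would need a separate design.

Target site: http://www.newrank.cn, as of 2017-07-03. If the site's structure changes, the code below will need to be updated.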

    #!/usr/bin/env python3
    # encoding: utf-8
    from pprint import pprint
    from time import sleep

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # webdriver.PhantomJS was removed in Selenium 4, so this example needs Selenium 3.x.


    def login_newrank():
        """Log in to newrank and fetch its cookies."""
        desired_capabilities = DesiredCapabilities.CHROME.copy()
        desired_capabilities["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                                                     "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                                     "Chrome/59.0.3071.104 Safari/537.36")
        desired_capabilities["phantomjs.page.settings.loadImages"] = True

        # Custom headers
        desired_capabilities["phantomjs.page.customHeaders.Upgrade-Insecure-Requests"] = 1
        desired_capabilities["phantomjs.page.customHeaders.Cache-Control"] = "max-age=0"
        desired_capabilities["phantomjs.page.customHeaders.Connection"] = "keep-alive"

        # Fill in your own account to test
        user = {
            "mobile": "user",
            "password": "password"
        }

        print("login account: %s" % user["mobile"])

        driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs",
                                     desired_capabilities=desired_capabilities,
                                     service_log_path="ghostdriver.log")

        # Timeout for locating elements
        driver.implicitly_wait(1)
        # Timeout for the page to load completely, including rendering and
        # execution of all synchronous and asynchronous scripts
        driver.set_page_load_timeout(60)
        # Timeout for asynchronous scripts
        driver.set_script_timeout(60)

        driver.maximize_window()

        try:
            driver.get(url="/file/2019/12/11/20191211140533_20275.jpg.html")
            # Switch to the account/password login tab
            driver.find_element_by_css_selector(".login-normal-tap:nth-of-type(2)").click()
            sleep(0.2)
            driver.find_element_by_id("account_input").send_keys(user["mobile"])
            sleep(0.5)
            driver.find_element_by_id("password_input").send_keys(user["password"])
            sleep(0.5)
            driver.find_element_by_id("pwd_confirm").click()
            sleep(3)
            # Flatten the cookie list into a simple name -> value dict
            cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}
            pprint(cookies)

        except TimeoutException as exc:
            print(exc)
        except WebDriverException as exc:
            print(exc)
        finally:
            driver.quit()


    if __name__ == '__main__':
        login_newrank()

    # login account: 15395100590
    # {'CNZZDATA1253878005': '1487200824-1499071649-%7C1499071649',
    #  'Hm_lpvt_a19fd7224d30e3c8a6558dcb38c4beed': '1499074715',
    #  'Hm_lvt_a19fd7224d30e3c8a6558dcb38c4beed': '1499074685,1499074713',
    #  'UM_distinctid': '15d07d0d4dd82b-054b56417-9383666-c0000-15d07d0d4deace',
    #  'name': '15395100590',
    #  'rmbuser': 'true',
    #  'token': 'A7437A03346B47A9F768730BAC81C514',
    #  'useLoginAccount': 'true'}

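Once obtained, the cookies can be attached to subsequent requests. Because cookies expire, they need to be refreshed periodically. One way to do this is a cookie pool: log in a batch of accounts on a schedule, store the resulting cookies in a database (Redis, MySQL, and so on), and at request time take one usable cookie from the database and attach it to the request.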
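The cookie pool idea can be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: CookiePool and CookieEntry are made-up names, a plain dict stands in for the Redis or MySQL storage, login is assumed to be a function you supply (for example a wrapper around login_newrank that returns the cookie dict for one account), and the fixed ttl is a guess at how long a cookie stays valid.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class CookieEntry:
    account: str
    cookies: Dict[str, str]
    expires_at: float  # timestamp after which the cookies are treated as stale


class CookiePool:
    """Keep one set of cookies per account and hand out the non-expired ones."""

    def __init__(self, login: Callable[[str], Dict[str, str]], ttl: float,
                 clock: Callable[[], float] = time.time):
        self._login = login   # assumed: account -> cookie dict
        self._ttl = ttl       # assumed cookie lifetime in seconds
        self._clock = clock
        self._entries: Dict[str, CookieEntry] = {}

    def refresh(self, accounts: Iterable[str]) -> None:
        """Log in every account whose cookies are missing or expired."""
        now = self._clock()
        for account in accounts:
            entry = self._entries.get(account)
            if entry is None or entry.expires_at <= now:
                cookies = self._login(account)
                self._entries[account] = CookieEntry(account, cookies, now + self._ttl)

    def get(self) -> Dict[str, str]:
        """Return the cookies of a randomly chosen account that is still valid."""
        now = self._clock()
        valid = [e for e in self._entries.values() if e.expires_at > now]
        if not valid:
            raise LookupError("no valid cookies in the pool; call refresh() first")
        return random.choice(valid).cookies
```

A scheduler (a cron job or a celery beat task, for instance) would call refresh() periodically, and each request would call get() and pass the result along, for example as requests.get(url, cookies=pool.get()).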

Trying a scrape with PyQt5 (PyQt 5.9.2)

    import csv
    import sys

    import pyquery
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    from PyQt5.QtWidgets import QApplication


    class Browser(QWebEngineView):

        def __init__(self):
            super(Browser, self).__init__()
            self.__results = []
            # Parse the page once it has finished loading
            self.loadFinished.connect(self.__result_available)

        @property
        def results(self):
            return self.__results

        def __result_available(self):
            # toHtml is asynchronous: the HTML is passed to the callback
            self.page().toHtml(self.__parse_html)

        def __parse_html(self, html):
            pq = pyquery.PyQuery(html)
            for rows in [pq.find("#table_list tr"), pq.find("#more_list tr")]:
                for row in rows.items():
                    columns = row.find("td")
                    d = {
                        "avatar": columns.eq(1).find("img").attr("src"),
                        "url": columns.eq(1).find("a").attr("href"),
                        "name": columns.eq(1).find("a").attr("title"),
                        "fans_number": columns.eq(2).text(),
                        "view_num": columns.eq(3).text(),
                        "comment_num": columns.eq(4).text(),
                        "post_count": columns.eq(5).text(),
                        "newrank_index": columns.eq(6).text(),
                    }
                    self.__results.append(d)

            # newline="" is what the csv module expects for files it writes
            with open("results.csv", "a+", encoding="utf-8", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=["name", "fans_number", "view_num", "comment_num", "post_count",
                                                       "newrank_index", "url", "avatar"])
                writer.writerows(self.results)

        def open(self, url: str):
            self.load(QUrl(url))


    if __name__ == '__main__':
        app = QApplication(sys.argv)
        browser = Browser()
        browser.open("/file/2019/12/11/20191211140533_20276.jpg.html")
        browser.show()
        app.exec_()

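To be continued:

5. Packet capture and analysis with Fiddler

Capturing browser traffic
Capturing mobile traffic with fiddler

6. Capturing client data with anyproxy: client-side data collection

7. Summary of developing high-performance crawlers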

Original article: /file/2019/12/11/20191211140534_20277.jpg.html
