Scraping a movie box-office database that requires a CAPTCHA login (and please don't tamper with their data)

地表嘴強程式設計師, 2020-12-23 13:24

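A quick requirements analysis

We needed movie box-office data for a practice project.

It was our first attempt and we had little experience, so we picked a movie box-office database site as the target.

Only after visiting it did we find out that the site requires a login, and that the login form comes with a math CAPTCHA.

So we got to work.

Once we knew this, the team started debating how to proceed.

In the end, one side argued for extracting the CAPTCHA, filling it in, and obtaining the cookies that way; the other side argued for logging in first and then taking the cookies. Both approaches worked. Only mine, which logged in with selenium to obtain the cookies, came back with garbled text when crawling.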
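As a rough illustration of the "log in first, then take the cookies" route: selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys, and these can be joined into a single `Cookie` header value for requests. The helper name `cookies_to_header` and the sample cookie names and values below are hypothetical, and this sketch does not address the garbled-text problem mentioned above.

```python
def cookies_to_header(cookies):
    """Join selenium-style cookie dicts into one Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


# Hypothetical sample, shaped like the return value of driver.get_cookies()
sample = [
    {"name": "sessionid", "value": "abc123", "domain": "58921.com"},
    {"name": "token", "value": "xyz"},
]

header_value = cookies_to_header(sample)
print(header_value)
```

The resulting string could then be appended to the `cookie` list that the crawler picks from at random.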

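Code implementation

I know, nobody clicked through to hear me ramble; most of you just want to say: hand over the code already!

The following code was written by team member TopTab.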

import random
import re
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import etree

# Fill in a dozen or so User-Agent strings yourself
user_agent = []
# Add your own cookies below; several are recommended
cookie = []
list_urls = []


def geturl(page):
    """Fetch one page of the all-time ranking and collect the movie links on it."""
    headers = {
        'Cookie': random.choice(cookie),
        'User-Agent': random.choice(user_agent),
    }
    time.sleep(1)
    response = requests.get("http://58921.com/alltime?page={}".format(int(page)), headers=headers)
    # Keep a copy of the fetched page for debugging
    with open("test.html", 'wb') as f:
        f.write(response.content)
    xpath_data = etree.HTML(response.content)
    list_urls_raw = xpath_data.xpath('//*[@id="content"]/div[3]/table/tbody/tr/td[3]/a/@href')
    for url in list_urls_raw:
        list_urls.append(url)
    return list_urls


def get_number(url_half):
    """Fetch a movie's box-office page and extract the latest figure and the title."""
    headers = {
        'User-Agent': random.choice(user_agent),
    }
    # headers must be passed by keyword: the second positional argument of requests.get is params
    html = requests.get("http://58921.com" + url_half + "/boxoffice", headers=headers).content.decode("utf-8")
    pattern_number = re.compile(r'\(最新票房 (.+?)\)')
    pattern_name = re.compile(r'<h3 class="panel-title">(.*)票房統計\(.*\)</h3>')
    number = pattern_number.findall(html)[0]
    name = pattern_name.findall(html)[0]
    print(number, name)
    return number, name


with ThreadPoolExecutor(max_workers=2) as executor_first:
    for i in range(1, 30):  # adjust the number of pages as needed
        executor_first.submit(geturl, i)

print(list_urls)
print(len(list_urls))

with ThreadPoolExecutor(max_workers=2) as executor_second:
    executor_second.map(get_number, list_urls)
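One fragile spot in the script is `findall(...)[0]`, which raises `IndexError` whenever a page does not match, for example when the site returns a login page instead of the box-office page. A minimal sketch of a more forgiving parser follows, reusing the same two patterns. The function name `parse_boxoffice` and both sample HTML snippets are made up for illustration and are not taken from the real site.

```python
import re

# Same patterns as in the script above
PATTERN_NUMBER = re.compile(r'\(最新票房 (.+?)\)')
PATTERN_NAME = re.compile(r'<h3 class="panel-title">(.*)票房統計\(.*\)</h3>')


def parse_boxoffice(html):
    """Return (number, name), or None when the page does not match."""
    number = PATTERN_NUMBER.search(html)
    name = PATTERN_NAME.search(html)
    if number is None or name is None:
        return None
    return number.group(1), name.group(1)


# Hypothetical snippets, written only to exercise the patterns
sample_ok = '<h3 class="panel-title">某電影票房統計(最新票房 1.23億)</h3>'
sample_bad = '<html><body>login required</body></html>'

print(parse_boxoffice(sample_ok))
print(parse_boxoffice(sample_bad))
```

Returning `None` instead of raising lets the caller skip or retry a page, which matters here because `executor.map` only surfaces a worker's exception when its results are iterated.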

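So where do the cookies come from? I kept failing to get valid cookies myself, which is why I could not reproduce the results for so long; here is that problem solved for you.

Got it?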
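One common way to get a working cookie, which may or may not be the one meant here, is to log in by hand in a browser, open the developer tools, and copy the `Cookie` request header from a request to the site. The sketch below assumes you paste one such value per line and turns them into the `cookie` list the crawler expects. The helper name `clean_cookie_lines` and the sample values are hypothetical.

```python
def clean_cookie_lines(raw_text):
    """Turn pasted lines into Cookie header values, one per non-empty line."""
    values = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop a leading "Cookie:" label if the whole header line was pasted
        if line.lower().startswith("cookie:"):
            line = line[len("cookie:"):].strip()
        values.append(line)
    return values


# Hypothetical pasted text; real values come from your own logged-in session
pasted = """
Cookie: sessionid=abc123; token=xyz

sessionid=def456; token=uvw
"""

cookie = clean_cookie_lines(pasted)
print(cookie)
```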
