Super detailed: a step-by-step guide to building a Feihualing mini program in 20 lines of Python

CDA Data Analyst, 2021-02-24 15:24

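Feihualing (飛花令) is a drinking game (行酒令) that people often played in ancient times. It is one of the traditional Chinese drinking games and counts as a refined one (雅令). The term "flying flowers" (飛花) comes from the line 春城無處不飛花 ("in the spring city, petals fly everywhere") in the poem 《寒食》 by the Tang poet Han Hong (韓翃). Players may recite lines from shi and ci poetry, or from qu, but the chosen lines usually have no more than 7 characters.

The TV show 《中國詩詞大會》 (Chinese Poetry Conference) adapted the game: it no longer uses only the character 花 (flower) but adds other high-frequency characters from classical poetry, such as 雲 (cloud), 春 (spring), 月 (moon), and 夜 (night). Contestants take turns reciting lines that contain the keyword until a winner is decided.

Today we will use Python to build our own Feihualing tool: given a keyword (a single character or a word), it returns many lines of poetry that contain it, so you never need to fear losing to your friends again!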

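Page analysis

To do this with a web scraper, we first need a suitable website. Here we choose Gushiwen (古詩文網).

Type a keyword such as 酒 (wine) into the search box in the top-right corner, and the site returns the matching poems.

Notice that each result is an entire poem, while the keyword appears in only one of its sentences. We will need to filter for that sentence when scraping.

Paging through the results shows that only the first 2 pages are accessible; on page 3 a prompt appears instead.

In other words, retrieving the complete results requires downloading the app. To keep things simple, this article scrapes only the first 2 pages; scraping the app may be covered in a later post. While paging, note how the URL changes: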

Page 1: https://so.gushiwen.cn/search.aspx?value=酒

Page 2: https://so.gushiwen.cn/search.aspx?type=title&page=2&value=酒

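Testing shows that type=title can be dropped, and page=2 is clearly the page number. So does page=1 return the first page?

It does. There is therefore no need for a POST request with requests; a plain GET on the following URL reaches any given page: https://so.gushiwen.cn/search.aspx?page=PAGE_NUMBER&value=KEYWORD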
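As a minimal sketch of that URL pattern, the snippet below assembles the search URL from a keyword and a page number. The helper name build_search_url and the constant BASE_URL are illustrative and do not come from the original article; urlencode from the standard library is used here so that a non-ASCII keyword is percent-encoded explicitly.

```python
from urllib.parse import urlencode

# Search endpoint taken from the URL analysis above
BASE_URL = "https://so.gushiwen.cn/search.aspx"


def build_search_url(keyword, page=1):
    """Return the search URL for one results page (hypothetical helper)."""
    # urlencode percent-encodes the keyword as UTF-8
    query = urlencode({"page": page, "value": keyword})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    # Only the first 2 pages are reachable without the app
    for page in (1, 2):
        print(build_search_url("酒", page))
```

Writing the raw keyword directly into the URL string, as the article's code does, generally works too, because requests encodes the URL before sending it.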

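With this rough analysis done, we can start writing code.

Code implementation

First, import the libraries and set the request headers: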

    import requests
    from lxml import html

    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

Taking the keyword 酒 as an example, try to fetch the full content of the first page:

    import requests
    from lxml import html

    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

    html_data = requests.get('https://so.gushiwen.cn/search.aspx?page=1&value=酒', headers=headers).text
    print(html_data)

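The returned text contains the content we need, which confirms that the assembled request works. Next we parse the text to extract the specific fields; this article uses XPath: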

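    selector = html.fromstring(html_data)
    # Each search result (one poem) sits in a div with class "sons"
    poets = selector.xpath("/html/body/div[2]/div[1]/div[@class='sons']")
    for poet in poets:
        # Relative path: the title is the bold link text in the first paragraph
        title = ''.join(poet.xpath("div[1]/p[1]/a/b//text()")).strip()
        print(title)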
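To try the XPath expressions without hitting the site, here is a small offline sketch. The SAMPLE markup is a made-up, simplified stand-in that only mimics the nesting those expressions expect; it is not the real page source. It also selects the poem containers by class with //div[@class='sons'], which is an alternative to the absolute path used above, not the article's own selector.

```python
from lxml import html

# Hypothetical, simplified markup: NOT the real page source
SAMPLE = """
<html><body>
  <div class="main">
    <div class="sons"><div class="cont">
      <p><a href="#"><b>月下獨酌</b></a></p>
      <p class="source"><a>李白</a><a>〔唐代〕</a></p>
    </div></div>
    <div class="sons"><div class="cont">
      <p><a href="#"><b>將進酒</b></a></p>
      <p class="source"><a>李白</a><a>〔唐代〕</a></p>
    </div></div>
  </div>
</body></html>
"""


def extract_titles(page_source):
    """Return the title of every poem container found in page_source."""
    selector = html.fromstring(page_source)
    # Class-based lookup anywhere in the document
    poets = selector.xpath("//div[@class='sons']")
    # Paths without a leading slash are resolved relative to each container
    return ["".join(p.xpath("div[1]/p[1]/a/b//text()")).strip() for p in poets]


if __name__ == "__main__":
    print(extract_titles(SAMPLE))
```

A class-based selector tends to survive small layout changes better than a positional path such as /html/body/div[2]/div[1], though either one can break if the site is redesigned.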

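Extracting the poet and dynasty the same way leaves them split across two lines, which means there are line breaks and spaces between them. A list comprehension that calls .strip() removes them: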

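    for poet in poets:
        title = ''.join(poet.xpath("div[1]/p[1]/a/b//text()")).strip()
        source = ''.join(poet.xpath('div[1]/p[2]//text()'))
        # source is a str, so this visits one character at a time;
        # whitespace characters strip to '' and disappear
        source = ''.join([i.strip() for i in source])
        print(title, source)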
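Because source is a single string, the comprehension above walks through it character by character and drops every whitespace character. The sketch below shows that behaviour on a made-up sample string (an assumption about what the raw text roughly looks like, not captured output) and compares it with the equivalent ''.join(text.split()) idiom. The function names are illustrative.

```python
def squash_per_char(text):
    """The article's approach: strip each character, dropping whitespace ones."""
    return "".join([ch.strip() for ch in text])


def squash_with_split(text):
    """Equivalent idiom: split on any whitespace run, then rejoin."""
    return "".join(text.split())


if __name__ == "__main__":
    # Made-up stand-in for the raw poet/dynasty text, including a line break
    raw = "\n李白\n  〔唐代〕\n"
    print(squash_per_char(raw))
    print(squash_with_split(raw))
```

Both versions remove whitespace inside the text as well as at the ends, which is what joins the poet and dynasty onto one line.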

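Last comes parsing the poem text. To find the sentence that actually contains the keyword, we split the whole poem into complete sentences at full stops, question marks, and line breaks: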

    for poet in poets:
        title = ''.join(poet.xpath("div[1]/p[1]/a/b//text()")).strip()
        source = ''.join(poet.xpath('div[1]/p[2]//text()'))
        source = ''.join([i.strip() for i in source])
        # Normalise line breaks and question marks to full stops, then split on them
        contents = ''.join(poet.xpath('div[1]/div[@class="contson"]//text()')).strip().replace('\n', '。').replace('？', '。').split('。')
        print(title, source, contents)
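The chained replace and split calls can also be expressed as one regular-expression split. This is a sketch of an alternative, not the article's code: the helper name split_sentences is illustrative, and the sample text is the opening of Li Bai's 《月下獨酌》 typed in by hand rather than scraped.

```python
import re


def split_sentences(text):
    """Split poem text at full stops, question marks, and line breaks."""
    parts = re.split(r"[。？\n]", text)
    # Drop the empty pieces left between adjacent delimiters
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    poem = "花間一壺酒，獨酌無相親。\n舉杯邀明月，對影成三人。"
    for sentence in split_sentences(poem):
        print(sentence)
```

Unlike the chained version, this one filters out empty strings up front, so later steps never see blank "sentences".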

For every poem, check each sentence for the keyword:

    for poet in poets:
        title = ''.join(poet.xpath("div[1]/p[1]/a/b//text()")).strip()
        source = ''.join(poet.xpath('div[1]/p[2]//text()'))
        source = ''.join([i.strip() for i in source])
        contents = ''.join(poet.xpath('div[1]/div[@class="contson"]//text()')).strip().replace('\n', '。').replace('？', '。').split('。')
        content_lst = []
        for i in contents:
            if '酒' in i:
                content = i.strip() + '。'
                content_lst.append(content)
                # A poem may have two sentences containing the keyword; both are wanted
        if not content_lst:  # Only the title contains the keyword: skip this poem
            continue
        for j in list(set(content_lst)):  # The two matching sentences may be identical, so deduplicate
            print(j, title, source)
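One detail worth knowing: list(set(...)) removes duplicates but does not guarantee that the sentences come back in their original order. If order matters, dict.fromkeys is a common order-preserving alternative. The sketch below is an illustration under that assumption, with a hypothetical helper name and a hand-written sentence list in which one line is repeated on purpose.

```python
def matching_sentences(sentences, keyword):
    """Return sentences containing keyword, deduplicated in first-seen order."""
    hits = [s.strip() + "。" for s in sentences if keyword in s]
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(hits))


if __name__ == "__main__":
    # Hand-written input; the first line is repeated to simulate a refrain
    sentences = [
        "勸君更盡一杯酒，西出陽關無故人",
        "渭城朝雨浥輕塵，客舍青青柳色新",
        "勸君更盡一杯酒，西出陽關無故人",
    ]
    print(matching_sentences(sentences, "酒"))
```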

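Most requirements are now met. All that remains is to build the URLs in a loop so that multiple pages are visited, and to read the keyword interactively with input. The code is as follows: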

    import requests
    from lxml import html

    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

    def poet_content(keyword, num, url):
        html_data = requests.get(url, headers=headers).text
        selector = html.fromstring(html_data)
        poets = selector.xpath("/html/body/div[2]/div[1]/div[@class='sons']")
        for poet in poets:
            title = ''.join(poet.xpath("div[1]/p[1]/a/b//text()")).strip()
            source = ''.join(poet.xpath('div[1]/p[2]//text()'))
            source = ''.join([i.strip() for i in source])
            contents = ''.join(poet.xpath('div[1]/div[@class="contson"]//text()')).strip().replace('\n', '。').replace('？', '。').split('。')
            content_lst = []
            for i in contents:
                if keyword in i:
                    content = i.strip() + '。'
                    content_lst.append(content)
            if not content_lst:
                continue
            for j in list(set(content_lst)):
                print(num, j)
                print(f'<{title}>', source)
                print('')
                num += 1
        # Return the running counter so numbering continues on the next page
        return num

    if __name__ == '__main__':
        keyword = input('> Enter a keyword: ')
        print('')
        num = 1
        for i in range(1, 3):  # pages 1 and 2
            url = f'https://so.gushiwen.cn/search.aspx?page={i}&value={keyword}'
            num = poet_content(keyword, num, url)

And with that, we have built a Feihualing tool with a Python scraper. Interested readers can give it a try!
