Working with lxml in Python

pandastar, 2020-12-22 22:25

1. Introduction

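In Python web scraping, BeautifulSoup is commonly used to parse HTML, but it is prone to running out of memory. This article introduces another tool, lxml, shows how to use it to extract HTML elements, and compares it with the BeautifulSoup approach.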

2. Usage

The code below shows the approach directly; see the code comments for details.

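#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
from lxml import etree, html
from lxml.cssselect import CSSSelector
from lxml.html import soupparser

# CSS selector shared by methods 1 to 3.
SELECTOR = '#headLineSichuan > ul:nth-child(1) > li:nth-child(1)'


def get_html():
    """Download the page and return the raw response body as bytes."""
    res = requests.get('http://www.ifeng.com/')
    html_bytes = res.content
    return html_bytes


def main():
    # Method 1: parse the HTML with lxml.html, which uses etree as the parser.
    page = html.fromstring(get_html())
    eles = page.cssselect(SELECTOR)
    content = eles[0].text_content()
    print(content)

    # Method 2: parse the HTML with lxml.html, using BeautifulSoup as the
    # parser. This handles character encodings well.
    page = soupparser.fromstring(get_html())
    eles = page.cssselect(SELECTOR)
    content = eles[0].text_content()
    print(content)

    # Method 3: parse the HTML the XML way, with etree.HTML.
    # Plain etree elements have no cssselect() method (only lxml.html
    # elements do), so the selector is compiled with CSSSelector instead.
    page = etree.HTML(get_html())
    eles = CSSSelector(SELECTOR)(page)
    content = eles[0].xpath('string(.)')
    print(content)

    # Method 4: parse the HTML with BeautifulSoup directly. Note that the
    # child-position selectors are written differently here (nth-of-type;
    # older BeautifulSoup versions do not support nth-child).
    # The parser is named explicitly to avoid bs4's "no parser specified" warning.
    page = BeautifulSoup(get_html(), 'lxml')
    eles = page.select('#headLineSichuan > ul:nth-of-type(1) > li:nth-of-type(1)')
    content = eles[0].text
    print(content)


if __name__ == '__main__':
    main()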
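
Because the script above depends on a live page whose markup may change, the two lxml entry points it uses (html.fromstring and etree.HTML) can also be tried offline. The following is only a minimal sketch under stated assumptions: the SAMPLE markup and both function names are hypothetical and not part of the original script, and XPath is used in place of CSS selectors so that only lxml itself is required.

```python
from lxml import etree, html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = b"""
<html><body>
<div id="headLineSichuan">
  <ul>
    <li><a href="#">First <b>headline</b></a></li>
    <li><a href="#">Second headline</a></li>
  </ul>
</div>
</body></html>
"""

# XPath counterpart of
# '#headLineSichuan > ul:nth-of-type(1) > li:nth-of-type(1)':
# ul[1] / li[1] count position among same-named siblings.
XPATH = '//*[@id="headLineSichuan"]/ul[1]/li[1]'


def first_headline_lxml_html(data):
    """Parse with lxml.html and read the text via text_content()."""
    page = html.fromstring(data)
    return page.xpath(XPATH)[0].text_content().strip()


def first_headline_etree(data):
    """Parse with etree.HTML and read the text via XPath string(.)."""
    page = etree.HTML(data)
    return page.xpath(XPATH)[0].xpath('string(.)').strip()


if __name__ == '__main__':
    print(first_headline_lxml_html(SAMPLE))
    print(first_headline_etree(SAMPLE))
```

Both functions should return the same text for the same element: text_content() on an lxml.html element and string(.) in XPath each concatenate all descendant text nodes.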
