
Python crawler data parsing (XPath)

Editor: Python

XPath is the most commonly used data-parsing method for Python crawlers. I think it is also the simplest, and it is very versatile. Why is it the simplest? Because there are only two main steps.

1. Instantiate an etree object and load the page source to be parsed into it.

2. Call the etree object's xpath method with an XPath expression to locate tags and extract their text content or attributes.

How do you instantiate an etree object? First install the lxml library and import etree from it. Then either load the source of a local HTML document into the etree object, or load page source fetched at runtime into it.

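from lxml import etree

# Load the source of a local HTML document into an etree object.
# etree.parse uses a strict XML parser by default, so an HTML parser is
# passed explicitly to cope with real-world HTML.
tree = etree.parse('./douban.html', etree.HTMLParser())
print(tree.xpath('/html/head/title'))
>> [<Element title at 0x1458ddbbc80>]

from lxml import etree
import requests

# url and headers are assumed to be defined beforehand: the address of the
# page to crawl and a dict of request headers (for example a User-Agent).
# Load the web page source code data into the object
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')
print(li_list)
>> [<Element li at 0x1458ddbbc80>]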

Use etree.parse for a local file and etree.HTML for page source held in a string.

Note that xpath does not return the HTML content of the local or web document. It returns a list of Element objects; in the first example, the Element for title holds the corresponding text content. If several nodes match, the list contains several Elements.
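
To make the point about return values concrete, here is a minimal sketch. The HTML string is made up for illustration (it is not the Douban page used above). It shows that xpath returns a list of Element objects, and that the text has to be read from each element.

```python
from lxml import etree

# Made-up HTML used only for illustration.
html = ("<html><head><title>Demo page</title></head>"
        "<body><ul><li>one</li><li>two</li></ul></body></html>")
tree = etree.HTML(html)

# xpath() returns a list, even when only one node matches.
titles = tree.xpath('/html/head/title')
title_text = titles[0].text  # read the text from the Element itself

# Several matches come back as several Element objects in one list.
items = tree.xpath('//li')
item_texts = [li.text for li in items]

print(titles[0].tag, title_text)
print(item_texts)
```

Reading .text from each Element is one option; appending /text() to the expression, as described in the rules below, is another.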

Rules for XPath expressions:

/: Represents a single level. At the start of an expression, it locates from the root node.

//: Represents multiple levels. Locating can start from any node.

Attribute positioning: //div[@class="title"] Put @ before the attribute name.

Index positioning: //div[@class="title"]/a[1] The index starts at 1, not 0.

/text(): Gets only the text directly inside the tag.

//text(): Gets all the text inside the tag, including text in nested tags.

/@attrName: Put @ before the attribute name to get the attribute's value.
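
The rules above can be tried out on a small document. This is only a sketch: the HTML and the variable names are made up for illustration, and each expression exercises one rule.

```python
from lxml import etree

# Made-up HTML used only for illustration.
html = """
<html><body>
  <div class="title">
    <a href="/first">First <b>link</b></a>
    <a href="/second">Second</a>
  </div>
  <div class="other"><a href="/third">Third</a></div>
</body></html>
"""
tree = etree.HTML(html)

# '/' walks one level at a time, starting from the root node.
divs_from_root = tree.xpath('/html/body/div')

# '//' starts from any node; [@class="title"] is attribute positioning.
title_links = tree.xpath('//div[@class="title"]/a')

# Index positioning: a[1] is the first <a>, because indexes start at 1.
first_link = tree.xpath('//div[@class="title"]/a[1]')[0]

# /text() returns only the text directly inside the tag.
direct_text = tree.xpath('//div[@class="title"]/a[1]/text()')

# //text() also returns the text inside nested tags such as <b>.
all_text = tree.xpath('//div[@class="title"]/a[1]//text()')

# /@href returns the attribute values.
hrefs = tree.xpath('//div[@class="title"]/a/@href')

print(len(divs_from_root), len(title_links), first_link.get('href'))
print(direct_text, all_text, hrefs)
```

Comparing direct_text with all_text shows the difference between /text() and //text(): only the second one includes the text inside the nested b tag.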

Next, here is how to write an XPath path quickly.

Open the developer tools on the page you want to crawl (right-click and choose Inspect, or press F12, which is Fn+F12 on some keyboards). Then, in the Elements panel, find the data you want to crawl, right-click it, choose Copy, and then Copy XPath. That is all there is to it, which is very convenient.
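
One caveat when pasting a copied path: the browser builds it from its live DOM, which can differ from the raw source that a crawler downloads. A common case, as far as I know, is that browsers add a tbody element to tables that lack one, while lxml keeps the source as written. The sketch below uses made-up HTML, and the "copied" path is a hypothetical value standing in for what the developer tools would report.

```python
from lxml import etree

# Made-up page source; note that the table has no <tbody> tag.
html = "<html><body><table><tr><td>cell</td></tr></table></body></html>"
tree = etree.HTML(html)

# Hypothetical path as a browser would typically report it.
copied = '/html/body/table/tbody/tr/td/text()'
# The same path adjusted to match the raw source.
adjusted = '/html/body/table/tr/td/text()'

copied_result = tree.xpath(copied)
adjusted_result = tree.xpath(adjusted)
print(copied_result, adjusted_result)
```

If a copied path returns an empty list, comparing it against the downloaded source (rather than the Elements panel) is a reasonable first check.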

The above covers the more commonly used XPath features. XPath offers many more; if you are interested, consult the relevant documentation.

