
Python Web Scraping: More on Urllib && Using the Jsonpath Library

Editor: Python

Event page: CSDN 21-Day Learning Challenge

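The following covers how to use the Jsonpath library in Python.
🥧 This continues the earlier Urllib article and adds the Jsonpath library.
🥧 Come on, let's start scraping together.

The more you scrape, the more fun it gets.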

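Jsonpath

Introduction
- Scraping steps
- JsonPath vs. XPath syntax
- More on Urllib (see the earlier Urllib article)
  - Urllib in Python: scraping JSON backend data from a front-end/back-end separated site via Ajax GET (by example; other sites are similar)
  - Urllib in Python: scraping JSON backend data from a front-end/back-end separated site via Ajax GET, with a dynamic number of pages (by example; other sites are similar)
  - Urllib in Python: scraping restaurant information via Ajax POST, with a dynamic number of pages
- Basic Jsonpath usage in Python
- Jsonpath in Python: scraping data from dianying.taobao.com, then extracting the wanted fields with Jsonpath
Summary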

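Introduction

JSONPath is an information-extraction library: a tool for pulling specified information out of a JSON document. Implementations exist in several languages, including JavaScript, Python, PHP and Java. JsonPath is to JSON what XPath is to XML.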

Scraping steps

1. Decide what you want to scrape.
2. Identify the data type.
3. Find the endpoint.
4. Fetch the data.

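JsonPath vs. XPath syntax

JSON has a clear structure. It is much more concise than XML, more readable, less complex, and very easy to match against, so you can see at a glance what a document contains. The two syntaxes compare as follows.

XPath    JSONPath            Description
/        $                   The root object/element.
.        @                   The current object/element.
/        . or []             Child operator.
..       n/a                 Parent operator.
//       ..                  Recursive descent. JSONPath borrows this syntax from E4X.
*        *                   Wildcard. All objects/elements regardless of their names.
@        n/a                 Attribute access. JSON structures have no attributes.
[]       []                  Subscript operator. XPath uses it to iterate over element collections and for predicates. In JavaScript and JSON it is the native array operator.
|        [,]                 The union operator in XPath produces a combination of node sets. JSONPath allows alternate names or array indices as a set.
n/a      [start:end:step]    Array slice operator borrowed from ES4.
[]       ?()                 Applies a filter (script) expression.
n/a      ()                  Script expression, using the underlying script engine.
()       n/a                 Grouping in XPath.

Official site: https://goessner.net/articles/JsonPath/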

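More on Urllib (see the earlier Urllib article)

Urllib in Python: scraping JSON backend data from a front-end/back-end separated site via Ajax GET (by example; other sites are similar)

Note: by default, open() uses the platform's locale encoding, which is GBK on Chinese-language Windows. To save Chinese characters reliably, pass the encoding explicitly to open(), for example encoding='utf-8'.

I. After fetching the JSON data, reformat it in the IDE with Ctrl + Alt + L.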

II. Two ways to save the data locally:

Method 1: fs = open(file_name, 'mode', ...)
fs.write(data_to_write)
fs.close()

Method 2: with open(file_name, 'mode', ...) as fs:
    fs.write(data_to_write)
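
As a minimal sketch of the two methods above, the following writes the same text once with an explicit open/close and once with a with block, then reads both files back. The temporary directory, the file names and the sample string are made up for illustration. encoding='utf-8' is passed explicitly so that Chinese characters are saved the same way on every platform.

```python
import os
import tempfile

text = '{"title": "動作"}'  # sample content containing Chinese characters

# Hypothetical output location, used only for this illustration
out_dir = tempfile.mkdtemp()
path1 = os.path.join(out_dir, 'method1.json')
path2 = os.path.join(out_dir, 'method2.json')

# Method 1: open the file, write, and close it yourself
fs = open(path1, 'w', encoding='utf-8')
fs.write(text)
fs.close()

# Method 2: the with statement closes the file automatically
with open(path2, 'w', encoding='utf-8') as fs:
    fs.write(text)

# Read both files back with the same encoding to confirm the round trip
with open(path1, 'r', encoding='utf-8') as f1, open(path2, 'r', encoding='utf-8') as f2:
    saved1, saved2 = f1.read(), f2.read()
print(saved1 == saved2 == text)
```

Method 2 is usually the safer choice, because the file is closed even if write() raises an exception.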

Steps

1. Check that the data is what we want.

2. Copy the endpoint URL.

3. Once you have the endpoint, you can scrape it.

Code demo:

import urllib.request

# Endpoint URL
url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0&genres=%E5%8A%A8%E4%BD%9C'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
# 1. Build a request object with custom headers
request = urllib.request.Request(url=url, headers=headers)
# 2. Fetch the response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# 3. Save the data locally (the target directory must already exist)
with open('6、python之urllib_ajax_get請求_爬電影/doBan.json', 'w', encoding='utf-8') as fs:
    fs.write(content)

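[Figure: the scrape succeeded]

Come on, scrape along with me!

Urllib in Python: scraping JSON backend data from a front-end/back-end separated site via Ajax GET, with a dynamic number of pages (by example; other sites are similar)

Step: find the pagination pattern

As I scroll down, the page keeps loading new data (via Ajax).

The endpoint called on each refresh looks like this: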

https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0&genres=%E5%8A%A8%E4%BD%9C
https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=20&genres=%E5%8A%A8%E4%BD%9C
https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=40&genres=%E5%8A%A8%E4%BD%9C

Only the start parameter changes between requests, so that is where we begin:

start=0
start=20
start=40
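
The start values above go up by 20 per page, so page n maps to start = (n - 1) * 20. A minimal sketch of that mapping follows. The helper name build_page_url and the constants are illustrative and not from the original. The genres value is the decoded form of %E5%8A%A8%E4%BD%9C in the URLs above.

```python
import urllib.parse

BASE_URL = 'https://movie.douban.com/j/new_search_subjects'
PAGE_SIZE = 20  # start grows by 20 per page in the URLs above


def build_page_url(page):
    """Return the endpoint URL for a 1-based page number."""
    params = {
        'sort': 'U',
        'range': '0,10',
        'tags': '',
        'start': (page - 1) * PAGE_SIZE,
        'genres': '动作',  # urlencode turns this into %E5%8A%A8%E4%BD%9C
    }
    # safe=',' keeps the comma in range=0,10 unescaped, as in the captured URLs
    return BASE_URL + '?' + urllib.parse.urlencode(params, safe=',')


for page in (1, 2, 3):
    print(build_page_url(page))
```

Building the whole query with urlencode avoids hand-concatenating URL fragments and keeps the percent-encoding in one place.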

Code demo:

import urllib.request
import urllib.parse


# Step 1: build the request object
def create_request(page):
    url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&'
    url1 = '&genres=%E5%8A%A8%E4%BD%9C'
    index = 20  # 20 movies per page
    data = {
        'start': (page - 1) * index
    }
    # This is a GET request, not POST, so the query string stays a str (no .encode('utf-8'))
    data = urllib.parse.urlencode(data)
    url = url + data + url1  # assemble the full URL
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request


# Step 2: fetch the response data
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


# Step 3: save the data (the target directory must already exist)
def download(page, content):
    with open('6、python之urllib_ajax_get請求_爬電影/doBanS' + str(page) + '.json', 'w', encoding='utf-8') as fs:
        fs.write(content)


# Entry point
if __name__ == '__main__':  # runs only when the file is executed directly, not when it is imported
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    # range() excludes its upper bound, hence the + 1
    for page in range(start_page, end_page + 1):
        # Build the request
        request = create_request(page)
        # Fetch the response
        content = get_content(request)
        # Save the data
        download(page, content)

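[Figure: the scrape succeeded]

Come on, scrape along with me!

Urllib in Python: scraping restaurant information via Ajax POST, with a dynamic number of pages

Step: find the pagination pattern

When I click through to the next page, the data changes accordingly.

The endpoint and the form data sent on each refresh look like this: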

http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
cname: 汕頭
pid:
pageIndex: 1
pageSize: 10
http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
cname: 汕頭
pid:
pageIndex: 2
pageSize: 10

Only pageIndex changes between requests, so that is where we begin:

pageIndex: 1
pageIndex: 2
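
Because this is a POST request, the form fields travel in the request body rather than in the URL, and urllib.request.Request expects that body as bytes. A minimal sketch of building the body for a given page follows. The helper name build_form_data is illustrative and not from the original. The field names are the ones captured above.

```python
import urllib.parse


def build_form_data(city, page, page_size=10):
    """Encode the form fields as bytes for a POST request body."""
    fields = {
        'cname': city,
        'pid': '',  # empty in the captured request
        'pageIndex': page,
        'pageSize': page_size,
    }
    # urlencode returns a str; Request(data=...) needs bytes, hence .encode()
    return urllib.parse.urlencode(fields).encode('utf-8')


body = build_form_data('汕頭', 2)
print(body)
```

The only difference from the GET case is the final .encode('utf-8') and the fact that the result is passed as data= instead of being appended to the URL.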

Code demo:

import urllib.request
import urllib.parse
import json


# Step 1: build the request object
def create_request(page):
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {
        'cname': '廣州',
        'pid': '',  # empty in the captured request
        'pageIndex': page,
        'pageSize': '10',
    }
    # POST data must be bytes, so encode the query string
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/102.0.0.0 Safari/537.36',
    }
    request = urllib.request.Request(url=url, headers=headers, data=data)
    return request


# Step 2: fetch the response data
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
    # Deserialize, keep only the 'Table1' list, and serialize it back to JSON text
    result = json.dumps(json.loads(content)['Table1'], ensure_ascii=False)
    return result


# Step 3: save the data (the target directory must already exist)
def download(page, content):
    with open('7、python之urllib_ajax_get請求_爬地址_前十頁/kfc' + str(page) + '.json', 'w', encoding='utf-8') as fs:
        fs.write(content)


# Entry point
if __name__ == '__main__':
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        content = get_content(request)
        download(page, content)

[Figure: the scrape succeeded]

Basic Jsonpath usage in Python

Install: pip install jsonpath (the library is small, so a mirror is not needed)


Sample data to query with JsonPath:

{
    "firstName": "John",
    "lastName": "doe",
    "age": 26,
    "address": {
        "streetAddress": "naist street",
        "city": "Nara",
        "postalCode": "630-0192"
    },
    "phoneNumbers": [
        {
            "type": "iPhone",
            "number": "0123-4567-8888"
        },
        {
            "type": "home",
            "number": "0123-4567-8910"
        },
        {
            "type": "home"
        }
    ]
}

Code demo:

import json
import jsonpath

with open('jsonPath.json', 'r', encoding='utf-8') as f:
    obj = json.load(f)

request = jsonpath.jsonpath(obj, '$.address.*')
request1 = jsonpath.jsonpath(obj, '$.phoneNumbers[*]')
request2 = jsonpath.jsonpath(obj, '$.phoneNumbers[*]..type')
request3 = jsonpath.jsonpath(obj, '$.phoneNumbers[(@.length-3)]')
request4 = jsonpath.jsonpath(obj, '$.phoneNumbers[0,2]')
request5 = jsonpath.jsonpath(obj, '$.phoneNumbers[0:]')
request6 = jsonpath.jsonpath(obj, '$.phoneNumbers[?(@.type)]')
# All values under address
print(request)
# All phone entries
print(request1)
# Every type value
print(request2)
# The entry at index length-3 (index 0 here, because the list has 3 entries)
print(request3)
# The entries at index 0 and index 2
print(request4)
# Every entry from index 0 onward
print(request5)
# Filter: only the entries that have a type key
print(request6)
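
To make the recursive-descent operator (..) less of a black box, here is a plain-Python sketch of what an expression such as '$..type' does conceptually: it walks every dict and list in the document and collects the values stored under the given key. The function find_all is a made-up illustration using only the standard library. It is not how the jsonpath package is implemented, and unlike this sketch the package is generally reported to return False rather than an empty list when nothing matches.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth, like '$..key'."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: the value itself may contain more matches
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


sample = {
    "firstName": "John",
    "phoneNumbers": [
        {"type": "iPhone", "number": "0123-4567-8888"},
        {"type": "home", "number": "0123-4567-8910"},
        {"type": "home"},
    ],
}
types = find_all(sample, "type")
print(types)  # ['iPhone', 'home', 'home']
```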

[Figure: the wanted data was extracted]

Jsonpath in Python: scraping data from dianying.taobao.com, then extracting the wanted fields with Jsonpath

Code demo:

import urllib.request
import jsonpath
import json

url = "https://dianying.taobao.com/cityAction.json?activityId&_ksTS=1659517543500_108&jsoncallback=jsonp109&action" \
      "=cityAction&n_s=new&event_submit_doGetAllRegion=true"
headers = {
'accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
# 'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh,zh-CN;q=0.9,en;q=0.8',
'bx-v': '2.2.2',
'cookie': 't=b4c2e30684a007ae1b99fcc29f106fbc; cookie2=1861202b7779067257f60da8045ea2bc; v=0; _tb_token_=e88733b173ee1; cna=LyhxGzY8sCMCAbcHsQ0Ms5ot; xlly_s=1; tfstk=cQFdBAZwwGjnRgkjFyBiUYLTdC7cZiLKqeiJw7owg7LVt2ARiYN0MVhCd4oqvNC..; l=eBjmQEz7L7nUI0y8BOfwlurza77tcIRAguPzaNbMiOCP_LCH7F9cW6xwE18MCnGNh62JR3Wrj_IwBeYBqC2sjqj2nAHOrKHmn; isg=BM7Oll5OINPphJT3EqpdZWGYH6SQT5JJT641pPgXXVGUW261YN46WUYVk483w4ph',
'referer': 'https://dianying.taobao.com/index.htm?n_s=new',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'x-requested-with': 'XMLHttpRequest',
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# The response is JSONP, e.g. jsonp109({...}); keep only the JSON between the outer parentheses
content = content[content.index('(') + 1:content.rindex(')')]
with open('套票飄飄.json', 'w', encoding='utf-8') as fs:
    fs.write(content)
with open('套票飄飄.json', 'r', encoding='utf-8') as fs:
    results = json.load(fs)
# Collect every regionName value at any depth
receive = str(jsonpath.jsonpath(results, '$..regionName'))
print(receive)
with open('淘票票地區.text', 'w', encoding='utf-8') as fs:
    fs.write(receive)
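
The response from this endpoint is JSONP (the URL carries jsoncallback=jsonp109), that is, JSON wrapped in a callback call, so the wrapper has to be removed before the text can be parsed. A minimal sketch follows that cuts between the first '(' and the last ')', which stays correct even when the payload itself contains parentheses. The helper name strip_jsonp and the sample string are made up for illustration and are not real output from the site.

```python
import json


def strip_jsonp(text):
    """Return the parsed JSON payload from a 'callback({...})' response."""
    start = text.index('(') + 1   # just past the first opening parenthesis
    end = text.rindex(')')        # the last closing parenthesis
    return json.loads(text[start:end])


# Made-up response body in the same shape as a JSONP reply
sample = 'jsonp109({"returnCode": "0", "note": "a (b) c", "items": [1, 2]});'
data = strip_jsonp(sample)
print(data["note"])
```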

[Figure: the data was retrieved successfully]