
[Pitfalls] Batch downloading from Kingsoft Docs (Jinshan document): a record of the tricks (Python + JS)


Table of contents

  • Preface
  • Problems with requests
  • Selenium code analysis
  • JS download code
  • Summary


Preface

As of this writing, the web version of Kingsoft Docs requires a paid membership if you want to download more than 2 files at a time. The obvious idea is to write a crawler that downloads the files one by one, which amounts to batch downloading.

Packet capturing shows that this is not complicated. Select any single file, click download, and watch the XHR requests; the following key packet can be captured:

Figure 1: Response of the single-file download request

The url field in the response on the right of the figure is the file's download address. Note that the download address does not require login; it can be accessed and downloaded in any state (but it has an expiry time).

So this looks very simple. Let's look at the request fields of that packet:

Figure 2: Request fields of the single-file download packet

As marked in the figure, the request URL contains only two variable fields: groupid and fileid. groupid is easy to get; it can be read directly from the URL of the current page. There are several ways to read fileid; in fact, one of the <script> tags in the source code of the current page stores the fileid of every file (in the variable window.__API_CACHED__, which you can also inspect by typing window.__API_CACHED__ in the console):
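For illustration, here is a rough Python sketch of that idea. The exact way window.__API_CACHED__ is assigned inside the <script> tag is an assumption on my part, so the regular expression may need adjusting against the real page source:

import json
import re

def extract_api_cached(page_html: str):
    # Try to pull the object assigned to window.__API_CACHED__ out of the page source.
    # Assumes an assignment of the form `window.__API_CACHED__ = {...};` inside a <script> tag.
    match = re.search(r'window\.__API_CACHED__\s*=\s*(\{.*?\})\s*(?:;|</script>)', page_html, re.S)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))  # may fail if the value is not plain JSON
    except json.JSONDecodeError:
        return None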

Figure 3: window.__API_CACHED__ in the page source

This window.__API_CACHED__ is rather long, so here is the complete content under that <script> tag:

Figure 4: Complete data structure of window.__API_CACHED__
As shown in the figure above, the information of all files (including fileid) is saved in the field value inside the red box (collapsed in the screenshot).

If reading fileid this way feels too troublesome, you can also call the API interface shown in the red box (it can also be found by packet capture):

https://drive.kdocs.cn/api/v5/groups/{group_id}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}

Here {group_id} is the groupid mentioned above, and {count} is the number of files to fetch; in general you can simply pass the number of files in the folder. Note that this interface does require a logged-in session. The data it returns is shown in the figure below (file names are blurred; the three red boxes show groupid, file name and fileid):

Figure 5: Information of all files returned by the interface
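To make the later code easier to follow, here is a trimmed, hypothetical sketch of the JSON this interface returns. Only the three fields the script below actually uses are shown, and the values are made up:

# Hypothetical, trimmed example of the files-interface response (real responses contain many more fields)
json_response = {
    "files": [
        {"id": 100000001, "groupid": 1841861380, "fname": "report.docx"},
        {"id": 100000002, "groupid": 1841861380, "fname": "notes.xlsx"},
    ]
}
for file_ in json_response["files"]:
    print(file_["groupid"], file_["fname"], file_["id"])  # the three red boxes in Figure 5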

With the groupid and fileid of every file available, the problem seems solved. But since this became a blog post, it is of course not that simple.


Problems with requests

As mentioned in the preface, the key request is the one shown in Figure 2:

Figure 2 (repeated): Request fields of the single-file download packet

Based on past experience, although this request can only be made while logged in, in theory it should be enough to attach all the request headers in the lower-right box to get a response like Figure 1 (i.e. the file's download address).

Resending this request in the browser still returns the Figure 1 result, which proves it is not burn-after-reading. But if you simply replay it with the requests library, you cannot get the same response:

# Turn the cookie list returned by driver.get_cookies() into a "name=value; ..." string
def cookie_to_string(cookies: list) -> str:
    string = ''
    for cookie in cookies:
        string += '{}={}; '.format(cookie['name'], cookie['value'])
    return string.strip()

# Turn a request-header string (one "Key: Value" per line) into a dictionary for requests
def headers_to_dict(headers: str) -> dict:
    lines = headers.splitlines()
    headers_dict = {}
    for line in lines:
        key, value = line.strip().split(':', 1)
        headers_dict[key.strip()] = value.strip()
    return headers_dict

url = f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files/{file_id}/download?isblocks=false&support_checksums=md5,sha1,sha224,sha256,sha384,sha512'
cookies = driver.get_cookies()
headers_string = f"""Host: drive.kdocs.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: {cookie_to_string(cookies=cookies)}
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
r = requests.get(url, headers=headers_to_dict(headers=headers_string))  # Fails: the expected response is not returned

This seemed very strange. I tried for a long time, including switching to requests.Session, and it still failed, which shows that Kingsoft Docs' anti-crawling really is in place. Judging from the crawler that eventually worked, the Cookie does contain all the login information, so my guess is that Kingsoft Docs imposes some middleware restriction on the access flow, or restricts cross-origin requests; either way it is very crawler-unfriendly.

But the problem still had to be solved, so we turn to the all-purpose selenium.


Selenium code analysis

This part walks through the code, because there are many pits here; they are also a good exercise in crawling skills:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: [email protected]

import re
import json
import time
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

def get_download_urls(group_id=1841861380, count=50):

    # firefox_profile = webdriver.FirefoxProfile(r'C:\Users\caoyang\AppData\Roaming\Mozilla\Firefox\Profiles\sfwjk6ps.default-release')
    # driver = webdriver.Firefox(firefox_profile=firefox_profile)
    driver = webdriver.Firefox()
    driver.get('https://account.wps.cn/')  # login page
    WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//*[contains(text(), "VIU")]').is_displayed())

    driver.get('https://www.kdocs.cn/latest')
    WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//span[contains(text(), "分享")]').is_displayed())  # "分享" is the "Share" button text in the Chinese UI

    def cookie_to_string(cookies: list) -> str:
        string = ''
        for cookie in cookies:
            string += '{}={}; '.format(cookie['name'], cookie['value'])
        return string.strip()

    def headers_to_dict(headers: str) -> dict:
        lines = headers.splitlines()
        headers_dict = {}
        for line in lines:
            key, value = line.strip().split(':', 1)
            headers_dict[key.strip()] = value.strip()
        return headers_dict

    # driver.get(f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}')
    # time.sleep(3)
    # html = driver.page_source
    # windows = driver.window_handles
    # print(html)
    # print(len(windows))
    # print(driver.current_url)

    # https://drive.kdocs.cn/api/v5/groups/1841861380/files?include=acl,pic_thumbnail&with_link=true&offset=0&count=50

    cookies = driver.get_cookies()
    headers_string = f"""Host: drive.kdocs.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: {cookie_to_string(cookies=cookies)}
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
    r = requests.get(f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}', headers=headers_to_dict(headers=headers_string))
    html = r.text

    json_response = json.loads(html)
    files = json_response['files']
    print(f'{len(files)} files in total')

    download_urls = []
    filenames = []
    for file_ in files:
        group_id = file_['groupid']
        file_id = file_['id']
        filename = file_['fname']
        print(filename, group_id, file_id)
        url = f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files/{file_id}/download?isblocks=false&support_checksums=md5,sha1,sha224,sha256,sha384,sha512'
        # driver.get(url)
        # time.sleep(3)
        # html = driver.page_source
        cookies = driver.get_cookies()
        headers_string = f"""Host: drive.kdocs.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: {cookie_to_string(cookies=cookies)}
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
        r = requests.get(url, headers=headers_to_dict(headers=headers_string))
        html = r.text
        # print(html)
        json_response = json.loads(html)
        download_url = json_response['url']
        print(download_url)
        download_urls.append(download_url)
        filenames.append(filename)

    with open('d:/download_urls.txt', 'w') as f:
        for download_url, filename in zip(download_urls, filenames):
            f.write(filename + '\t' + download_url + '\n')

    driver.quit()

get_download_urls()

It is recommended to copy the code above first; the walkthrough below refers to it by line number (do not delete the commented-out lines, those are the pits).

  1. First, look at lines 14-18:

    At first I wondered whether importing the browser's user profile (see my earlier blog on using user profiles in crawlers) would let me skip the Kingsoft Docs login. Note that I did not test Chrome; with Firefox it really does not work. Even if I first open a window and log into Kingsoft Docs (at which point every new window I open is in the logged-in state) and only then start selenium with the imported profile, it still gets stuck on the login page. So lines 14-15 have to stay commented out.

    Line 17 then opens the login page, and line 18 waits for the login to succeed (during the wait you can click WeChat login and scan the QR code to confirm).

    Anyone who has actually used selenium knows that if you time.sleep for too long, selenium tends to crash, and if you operate the page manually (clicking, scrolling, typing text and so on) it also tends to break. I used to think the browser could not be touched at all once selenium starts; now I find that as long as you write a WebDriverWait (the XPath here searches for my user name, VIU), you can click and scan the QR code to log in by hand, which is very convenient (see the sketch after this list).

  2. Next, lines 20-21:

    This is a special pit. Even when logged in, if you directly open the interface of line 37 (i.e. Figure 5), it does not show the Figure 5 result; it just tells you The user is not logged in. So you have to visit the Kingsoft Docs homepage first. I suspect this is exactly the kind of access-flow restriction used as an anti-crawling measure.

  3. Lines 23-35 are two helper functions: cookie_to_string turns the cookies returned by driver.get_cookies() (a list of the form [{'name': name, 'value': value}, ...]) into a string that can be put into the Cookie request header, and headers_to_dict rewrites a request-header string copied from the browser into a dictionary (for the headers parameter of requests.get).

  4. Lines 37-61:

    Here comes the pit. Opening the Figure 5 interface with the driver.get of line 37 does show the Figure 5 data in the browser, yet line 39's driver.page_source returns the HTML of the Kingsoft Docs homepage from line 20, which is painful. In lines 40-43 I ran a few tests and confirmed that there really is only one window (len(windows) is 1) and that driver.current_url is indeed not the homepage.

    This blocked me for quite a while; I could not find any driver method for reading JSON data from the page. I had never noticed before that a page whose response is JSON cannot be read through driver.page_source, so I was forced to rewrite this logic with the requests library (lines 47-61).

    You might object: didn't the preface say this response cannot be obtained with requests? Indeed, if you use lines 47-61 alone to access the Figure 5 interface, it still returns The user is not logged in. But here I use the Cookie returned by driver.get_cookies() to replace the Cookie copied directly from the browser in the request headers, and it miraculously works. To be honest I don't fully understand why, and I don't know how the Kingsoft Docs backend decides whether a request comes from a crawler.

  5. Lines 63-66: parse the file information of Figure 5 out of the response.

  6. Lines 68-99:

    What happens here is essentially the same as lines 37-61. We want the Figure 1 response (which is also JSON data). If you access the interface with driver, driver.page_source still gives the homepage HTML, so the same trick is used (rewriting with requests), and line 96 finally yields the file download address.

    Likewise, as the preface showed, accessing the Figure 2 request directly with requests does not work, but after completing the login with selenium it does.
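As promised in point 1, here is a minimal sketch of the "let selenium wait while I log in by hand" pattern. It uses the same selenium 3 element API as the code above; "VIU" is my user name, so substitute any text that only appears once you are logged in:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('https://account.wps.cn/')  # open the login page
# Block (up to 300 seconds) until an element containing the user name is visible.
# While this wait is running you are free to click the WeChat login button and
# scan the QR code in the browser; once logged in, the script simply continues.
WebDriverWait(driver, 300).until(
    lambda d: d.find_element_by_xpath('//*[contains(text(), "VIU")]').is_displayed()
)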

All file download addresses are written to d:/download_urls.txt. Since the download addresses work even without login, the finishing step is trivial:

with open('d:/download_urls.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    filename, url = line.split('\t')
    r = requests.get(url)
    with open(f'd:/{filename}', 'wb') as f:
        f.write(r.content)
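As a side note (this is not part of the original code), for large files it might be safer to stream the download to disk instead of holding the whole response in memory; a sketch:

import requests

def download_file(url: str, path: str, chunk_size: int = 1 << 20) -> None:
    # Stream the response body to disk one chunk at a time
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)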

JS download code

Here is an equivalent piece of JS code. In theory you can run it directly in the console to download everything, but in practice it hits cross-origin request errors, so it is still not quite usable; I'd be glad if someone can solve that.

let groups = "1842648021";
let count = 54;
let res = await fetch(`https://drive.kdocs.cn/api/v5/groups/${groups}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count=${count}&orderby=fname&order=ASC&filter=folder`);
let files = await res.json();
files = files.files;
let urls = [];
let fid, info, url;
for (let f of files) {
    fid = f.id;
    res = await fetch(`https://drive.kdocs.cn/api/v5/groups/${groups}/files/${fid}/download?isblocks=false&support_checksums=md5,sha1,sha224,sha256,sha384,sha512`, {
        "method": "GET",
        "mode": "cors",
        "credentials": "include"
    });
    info = await res.json();
    url = info.url;
    urls.push(url);
}
console.log("Number of files to download:", urls.length);
for (let i = 0; i < urls.length; i++) {
    let url = urls[i];
    let fname = files[i].fname;
    fetch(url).then(res => res.blob().then(blob => {
        let a = document.createElement('a');
        let url = window.URL.createObjectURL(blob);
        let filename = fname;
        a.href = url;
        a.download = filename;
        a.click();
        window.URL.revokeObjectURL(url);
    }))
}

Summary

The pressing open problem is how selenium should read the data of a page whose response is JSON. One rather crude idea is to use from selenium.webdriver.common.keys import Keys and simply Ctrl+A, Ctrl+C to copy the page content into a string; clumsy, but it ought to work.
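For what it's worth, an untested sketch of that Ctrl+A / Ctrl+C idea. It additionally assumes the third-party pyperclip package for reading the clipboard, and Firefox's built-in JSON viewer may change what actually gets copied:

import json

import pyperclip  # assumption: third-party clipboard package, `pip install pyperclip`
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

def read_json_page(driver, url):
    # Open a JSON-returning URL in an already logged-in selenium session,
    # select everything with Ctrl+A, copy it with Ctrl+C, then read the clipboard.
    driver.get(url)
    ActionChains(driver).key_down(Keys.CONTROL).send_keys('a').send_keys('c').key_up(Keys.CONTROL).perform()
    return json.loads(pyperclip.paste())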

Another question is whether there is any way to batch-download from Kingsoft Docs with requests alone, without relying on selenium, and why the approach in the preface fails; that still bothers me.

All in all, mixing requests and selenium is not very elegant; I am sure someone can come up with a cleaner solution.

