
Python Crawler from 0 to 1: urllib Request Object Customization (Anti-Crawling Strategy)


Request object customization

In this article, let's study the customization of the request object in urllib together.

1. Introduction to the UA

UA (User-Agent), or user agent, is a special string header that enables the server to identify the operating system and version used by the client, the CPU type, the browser and its version, the browser kernel, the rendering engine, the browser language, browser plug-ins, and so on.

How to find the UA in the browser is shown in the figure below.
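Before customizing anything, it helps to see what urllib sends by default. The following is a minimal sketch (not from the original article) showing the default User-Agent that a plain opener attaches, which is exactly what anti-crawling checks look for:

import urllib.request

# The default opener attaches a User-Agent such as 'Python-urllib/3.10',
# which servers can easily recognize as a script rather than a real browser.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.10')]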

2. urllib.request.Request

The urlopen() method can issue only the most basic request. If you want to add headers or other information, you can use the Request class to construct the request.
The syntax is as follows:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Note: because of the parameter order, data sits between url and headers. If you write Request(url, headers) directly, the headers value is passed as data by default, so we need to pass headers as a keyword argument!
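Here is a short sketch (not in the original article) of what actually goes wrong when headers is passed positionally:

import urllib.request

headers = {'User-Agent': 'Mozilla/5.0'}

# Wrong: the dict lands in the data slot, so the request silently becomes
# a POST with no User-Agent header at all.
# (sending it with urlopen would then fail, since data must be bytes)
bad = urllib.request.Request('https://www.baidu.com', headers)
print(bad.get_method())   # 'POST'
print(bad.headers)        # {} -- the UA was never set

# Right: pass headers by keyword so it ends up where it belongs.
good = urllib.request.Request('https://www.baidu.com', headers=headers)
print(good.get_method())               # 'GET'
print(good.get_header('User-agent'))   # 'Mozilla/5.0'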

Let's take an example (reading the Baidu homepage source code):

import urllib.request

url = 'https://www.baidu.com'
# Composition of a URL, e.g. https://www.baidu.com/s?wd=Jackson Yi
# 1. scheme (http/https)  2. host (www.baidu.com)  3. port (80/443)
# 4. path (s)  5. query parameters (wd=Jackson Yi)  6. fragment (anchor)
# Common default ports:
# http(80)  https(443)  mysql(3306)  oracle(1521)  redis(6379)  mongodb(27017)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}

# Because of the parameter order (url, data, headers), headers must be
# passed as a keyword argument; otherwise it would be interpreted as data.
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)  # without the UA header, Baidu's anti-crawling strategy returns only part of the data
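As a side note on the URL components listed in the comments above, the standard library's urllib.parse can split a URL into exactly those pieces. A minimal sketch follows (the explicit :443 port and the #top fragment are added here purely for illustration):

from urllib.parse import urlparse, parse_qs

parts = urlparse('https://www.baidu.com:443/s?wd=Jackson Yi#top')
print(parts.scheme)           # 'https'
print(parts.hostname)         # 'www.baidu.com'
print(parts.port)             # 443
print(parts.path)             # '/s'
print(parse_qs(parts.query))  # {'wd': ['Jackson Yi']}
print(parts.fragment)         # 'top'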

Running the crawler script above prints the full HTML source of the Baidu homepage.
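To see the anti-crawling strategy in action, here is a quick comparison sketch (not from the original article; the exact sizes depend on Baidu's behavior at the time you run it):

import urllib.request

url = 'https://www.baidu.com'
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'

# Without a browser UA, the server recognizes the default Python UA
# and returns only a stripped-down page.
plain = urllib.request.urlopen(url).read().decode('utf-8')

# With a browser UA, the full homepage source comes back.
request = urllib.request.Request(url, headers={'User-Agent': ua})
full = urllib.request.urlopen(request).read().decode('utf-8')

print(len(plain), len(full))  # the second number should be much larger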

That's all for the customization of the request object in a Python crawler, which is how we get around this basic anti-crawling strategy. Follow me, and in the next issue we will cover some questions about encoding and decoding!

