
100 Days to Master Python (Crawlers) - Day 44: Summary of the requests Library

Editor: Python

Table of contents

  • Preface
  • I. Summary of the requests module
    • 1. Download and installation
    • 2. Common attributes and methods
    • 3. The difference between response.text and response.content
    • 4. Sending a request with the headers parameter
    • 5. Sending a request with parameters
    • 6. Carrying a cookie in the headers parameter
    • 7. Using the timeout parameter
    • 8. Using the proxies parameter
    • 9. Sending a POST request

Preface

  • About the author: a quality creator in the Python field, Huawei Cloud sharing expert, Alibaba Cloud expert blogger, and Top 6 in the 2021 CSDN Blog Star ranking

  • This article is part of the Python full-stack series column "100 Days to Master Python: From Beginner to Employment"
  • The column is a complete course prepared for absolute beginners. It progresses step by step from 0 to 100, and all the knowledge points are linked together
  • Subscribers can read all 100 articles in the column and can also message the author privately to join the 200-member Python full-stack discussion group (hands-on teaching and problem solving). Group members receive 80 GB of Python full-stack tutorial videos and 300 computer books covering basics, Web, crawlers, data analysis, visualization, machine learning, deep learning, artificial intelligence, algorithms, interview questions, and more.
  • Join me to learn and make progress together. One person can walk fast, but a group can go further!


I. Summary of the requests module

This article covers requests, an HTTP module used mainly to send requests and receive responses. There are many alternatives, such as the urllib module, but requests is the one used most at work. Its code is concise and easy to understand; compared with the more cumbersome urllib, a crawler written with requests needs less code, and common functions are simpler to implement. It is therefore recommended that you master this module.

1. Download and installation

1. On a Windows computer, press Win + R and enter: cmd

2. Install requests by entering the pip command pip install requests. I had already installed it, so pip reports the existing version; on a fresh install it reports that the installation succeeded
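
To confirm that the installation worked, a quick check like the following can be run in Python. It is only a sketch: it imports the module and prints its version string, and the variable name is my own.

```python
# Quick check that the requests module can be imported after installation
import requests

# __version__ holds the installed version as a string
installed_version = requests.__version__
print("requests version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment than the one running the script.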

2. Common attributes and methods

Method / attribute                Description
response = requests.get(url)      Sends a GET request and returns the response object (the most commonly used)
response = requests.post(url)     Sends a POST request and returns the response object
response.url                      URL of the response; it sometimes differs from the requested URL
response.status_code              Response status code, such as 200 or 404
response.request.headers          Headers of the request that produced the response
response.headers                  Response headers
response.request._cookies         Cookies of the request that produced the response; returns a CookieJar type (note the leading underscore)
response.cookies                  Cookies of the response (after Set-Cookie has been applied); returns a CookieJar type
response.json()                   Automatically converts a JSON string response body into a Python object (dict or list)
response.text                     Returns the response content as str
response.content                  Returns the response content as bytes

A simple code example: use requests to send a request to the Baidu home page and get the page source

import requests
# Target URL
url = "http://www.baidu.com/"
# Send the request and get the response
response = requests.get(url)
# View the type of the response object
print(type(response))
# Check the response status code
print(response.status_code)
# View the type of the response content
print(type(response.text))
# View the cookies
print(response.cookies)
# View the response content as str
print(response.text)
# View the response content as bytes
print(response.content)

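Output (the Chinese text printed from response.text is garbled because requests guessed the wrong encoding; the final block is the bytes from response.content):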

<class 'requests.models.Response'>
200
<class 'str'>
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç¾åº¦ä¸€ä¸‹ï¼Œä½ 就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç¾åº¦ä¸€ä¸‹ class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç»å½•</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç»å½•</a>');</script> <a href=https://www.baidu.com/more/ name=tj_briicon class=bri >更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äºŽç¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用ç¾åº¦å‰å¿
读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');</script> <a href=https://www.baidu.com/more/ name=tj_briicon class=bri >\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a>&nbsp;\xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

3. The difference between response.text and response.content

response.text

  • Type: str
  • Decoding: the requests module makes an educated guess at the response encoding based on the HTTP headers and decodes the text with the guessed encoding

response.content

  • Type: bytes
  • Decoding: none specified; you choose the encoding yourself

Decode response.content yourself to fix garbled Chinese text

  • response.content.decode(): defaults to utf-8
  • response.content.decode('GBK')

Common character encodings

  • utf-8
  • gbk
  • gb2312
  • ascii (pronounced "ASK-ee")
  • iso-8859-1
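
The garbled title in the earlier output can be reproduced without any network access. The sketch below is an illustration rather than the article's code: it assumes the page title bytes are UTF-8 (as the page's own charset declaration says) and decodes them with iso-8859-1, which is the fallback requests uses for text responses whose header names no charset.

```python
# Bytes as they would arrive in response.content (UTF-8 encoded page title)
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong charset produces garbled text, not an error
garbled = raw.decode("iso-8859-1")

# Decoding with the right charset recovers the original text
correct = raw.decode("utf-8")

print(garbled)
print(correct)

# The same kind of text encoded as GBK is a different byte sequence,
# so it has to be decoded with 'gbk' rather than 'utf-8'
gbk_raw = "百度一下".encode("gbk")
print(gbk_raw.decode("gbk"))
```

This is why setting response.encoding or calling response.content.decode() with the correct charset fixes the output: the bytes were never damaged, only decoded with the wrong table.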

Code demonstration:

import requests
# Target URL
url = "http://www.baidu.com/"
# Send the request and get the response
response = requests.get(url)
# Set the encoding manually
response.encoding = 'utf-8'
# Print the page source as str
print(response.text)
# response.content holds the response data as bytes; decode it explicitly
print(response.content.decode('utf-8'))

Output:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title> use Baidu Search , You will know </title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value= use Baidu Search class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav> Journalism </a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav> Map </a> <a href=http://v.baidu.com name=tj_trvideo class=mnav> video </a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav> tieba </a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb> Sign in </a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb"> Sign in </a>');</script> <a href=https://www.baidu.com/more/ name=tj_briicon class=bri > More products </a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com> About Baidu </a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/> Read Before Using Baidu </a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback> Feedback </a>&nbsp; Beijing ICP Prove 030173 Number &nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title> use Baidu Search , You will know </title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value= use Baidu Search class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav> Journalism </a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav> Map </a> <a href=http://v.baidu.com name=tj_trvideo class=mnav> video </a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav> tieba </a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb> Sign in </a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb"> Sign in </a>');</script> <a href=https://www.baidu.com/more/ name=tj_briicon class=bri > More products </a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com> About Baidu </a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/> Read Before Using Baidu </a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback> Feedback </a>&nbsp; Beijing ICP Prove 030173 Number &nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

4. Sending a request with the headers parameter

1) View the browser's request headers

  • 1. Open Google Chrome, right-click the page and choose Inspect, then click the refresh button in the top-left corner
  • 2. Click Network, find the request for the page's address, scroll down to User-Agent, and copy it

2) Code explanation

requests.get(url, headers=headers)
  • The headers parameter receives the request headers as a dictionary
  • Each request header field name is a key, and the field's value is the corresponding value

3) Code implementation

import requests
# Target URL
url = "http://www.baidu.com/"
# Build the request headers dictionary; the most important field is User-Agent
# If you need other request headers, just add them to the headers dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
# Send the request and get the response
response = requests.get(url, headers=headers)
print(response.text)

Output: the full source code of the web page.
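
To see what the headers parameter changes, one option is to prepare a request without sending it and inspect the headers that would go out. This is a sketch for illustration: Session.prepare_request and Request are part of requests, while the variable names are my own. Nothing is sent over the network.

```python
import requests

url = "http://www.baidu.com/"
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

session = requests.Session()

# Without custom headers, requests identifies itself as python-requests/<version>
default_prepared = session.prepare_request(requests.Request("GET", url))
print(default_prepared.headers["User-Agent"])

# With the headers dictionary, the browser User-Agent replaces the default
custom_prepared = session.prepare_request(
    requests.Request("GET", url, headers=custom_headers)
)
print(custom_prepared.headers["User-Agent"])
```

The default python-requests User-Agent is an easy signal for a site to recognize a script, which is why the browser's value is copied in.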

5. Sending a request with parameters

How do you remove redundant parameters from a web page address?

  • Search Baidu for python and you will see that the URL is particularly complex
  • Delete the parameters one by one, refreshing after each deletion, until only the ones the page actually needs remain

Method 1: the URL contains the parameters

import requests
# Target URL
url = "https://www.baidu.com/s?wd=python"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
# Send the request and get the response
response = requests.get(url, headers=headers)
print(response.text)

Method 2: build a parameter dictionary and pass it through params

import requests
# Target URL
url = "https://www.baidu.com/s"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
# The request parameters are a dictionary
kw = {
    'wd': 'python'
}
# Pass the parameter dictionary when sending the request, and get the response
response = requests.get(url, headers=headers, params=kw)
print(response.text)
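
The two methods produce the same URL. A minimal sketch of how to check this, assuming only that the request is prepared and never sent (the variable names are my own):

```python
import requests

base_url = "https://www.baidu.com/s"

# Prepare (but do not send) a GET request to see the final URL
prepared = requests.Request("GET", base_url, params={"wd": "python"}).prepare()
print(prepared.url)  # https://www.baidu.com/s?wd=python

# Non-ASCII values are percent-encoded as UTF-8 automatically
prepared_cn = requests.Request("GET", base_url, params={"wd": "爬虫"}).prepare()
print(prepared_cn.url)
```

Letting params build the query string avoids hand-encoding special characters, which matters as soon as the search term is not plain ASCII.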

6. Carrying a cookie in the headers parameter

Websites often use the Cookie field in the request headers to maintain the user's session state, so we can add Cookie to the headers parameter to simulate an ordinary user's request. Cookies expire, so they need to be replaced after a period of time

  • 1. Open Google Chrome, right-click the page and choose Inspect, then click the refresh button in the top-left corner

  • 2. Click Network, find the request for the page's address, scroll down to Cookie, and copy it

  • 3. Add the Cookie field to the headers dictionary

    headers = {
    
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Cookie': 'BAIDUID=157D064FDE25DE5DD0E68AF62CBC3627:FG=1; BAIDUID_BFESS=157D064FDE25DE5DD0E68AF62CBC3627:FG=1; BIDUPSID=157D064FDE25DE5DD0E68AF62CBC3627; PSTM=1655611179; BD_UPN=12314753; ZFY=Cs:BflL5Del98YBOjx2EyRPzQE3QCyolFKzgVTguBEHI:C; BD_HOME=1; H_PS_PSSID=36548_36626_36673_36454_31254_36452_36690_36165_36693_36696_36569_36657_26350_36469; BA_HECTOR=85850gag05ak0l040h1hbg5st14; delPer=0; BD_CK_SAM=1; PSINO=7; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_645EC=0e08fXgvc5rDJVK1jRjlqmZ7pLp5r%2Fmn9jlENTs3CQ4%2FbhzUL09Y%2F%2FYtCGA; baikeVisitId=e10d7983-547d-4f34-a8d8-ec98dbcba8e4; COOKIE_SESSION=115_0_2_2_1_2_1_0_2_1_0_0_0_0_0_0_1655611189_0_1656233437%7C3%230_0_1656233437%7C1'
    }
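
As an alternative to putting the raw Cookie string in headers, requests.get also accepts a cookies parameter that takes a dictionary. The sketch below is only an illustration: the helper name cookie_string_to_dict is hypothetical, and the sample string reuses three short pairs from the cookie above. The request is prepared, not sent.

```python
import requests


def cookie_string_to_dict(cookie_string):
    """Split a 'k1=v1; k2=v2' Cookie string into a dict (hypothetical helper)."""
    pairs = (item.split("=", 1) for item in cookie_string.split("; ") if "=" in item)
    return {name.strip(): value for name, value in pairs}


cookie_string = "BD_HOME=1; delPer=0; PSINO=7"
cookies = cookie_string_to_dict(cookie_string)
print(cookies)

# Preparing the request shows the Cookie header that would be sent
prepared = requests.Request("GET", "https://www.baidu.com/", cookies=cookies).prepare()
print(prepared.headers["Cookie"])
```

In a real request the dictionary would be passed as requests.get(url, headers=headers, cookies=cookies).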
    

7. Using the timeout parameter

When browsing the web we often run into network fluctuations, and a request may wait a long time and still get no result. In a crawler, a request that hangs for a long time makes the whole project very inefficient, so we need to force the request to return a result within a specific time or else raise an error.

1. How to use the timeout parameter

response = requests.get(url, timeout=3)

2. timeout=3 means that if the server has not responded within 3 seconds of the request being sent, an exception is raised

3. Practical code:

import requests
# Target URL
url = "https://www.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
try:
    # First attempt, with the timeout set to 10 seconds
    response = requests.get(url, headers=headers, timeout=10)
except requests.exceptions.RequestException:
    # On failure, request the site again in a loop with a longer timeout
    # Note: an exception raised by one of these retries is not caught here
    for i in range(4):
        response = requests.get(url, headers=headers, timeout=20)
        if response.status_code == 200:
            break
html_str = response.text
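
The retry idea above can also be wrapped in a small helper so that every attempt, not just the first, is protected by exception handling. This is a sketch under stated assumptions, not the article's method: the function name fetch_with_retry and its get argument are hypothetical, and get exists only so the helper can be exercised without a network connection.

```python
import requests


def fetch_with_retry(url, headers=None, retries=3, timeout=10, get=requests.get):
    """Return a successful response, retrying on timeouts and other request errors.

    `get` defaults to requests.get; it is injectable (an assumption of this
    sketch) so the retry logic can be tested offline.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url, headers=headers, timeout=timeout)
            # Treat 4xx/5xx status codes as failures as well
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as exc:
            # Remember the error and move on to the next attempt
            last_error = exc
    # Every attempt failed: re-raise the most recent error
    raise last_error
```

A call such as fetch_with_retry(url, headers=headers, retries=4, timeout=10) either returns a response with a successful status code or raises the last exception, so the caller never reads response.text from a failed request.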

8. Using the proxies parameter

Use proxy IPs so that the server does not see every request as coming from the same client, and so that frequent requests to one domain do not get your IP blocked

Syntax:

response = requests.get(url, proxies=proxies)

proxies takes the form of a dictionary:

proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

Note: if the proxies dictionary contains multiple key-value pairs, the proxy IP is chosen according to the protocol of the URL being requested
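
The protocol-based selection can be observed without sending anything. The sketch below uses requests.utils.select_proxy, the utility function requests uses to pick a proxy for a URL; the proxy addresses are placeholders, not working proxies.

```python
import requests

# Placeholder proxy addresses (assumed for illustration; they are not real proxies)
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

# select_proxy returns the proxy URL that would be used for a given target URL
http_proxy = requests.utils.select_proxy("http://www.baidu.com/", proxies)
https_proxy = requests.utils.select_proxy("https://www.baidu.com/", proxies)

print(http_proxy)
print(https_proxy)
```

An http:// target picks the "http" entry and an https:// target picks the "https" entry, so a dictionary that has only one of the two keys leaves the other protocol unproxied.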

9. Sending a POST request

Apart from data, the parameters of the requests module's post function are exactly the same as those of the get function

Syntax

response = requests.post(url, data)  # the data parameter receives a dictionary

How do you find the form data?

  • Taking Baidu Translate as an example: find the corresponding request, click Payload, and expand the Form Data section

  • Build the data dictionary in code

    import requests
    url = "https://fanyi.baidu.com/"
    data = {
        'query': 'love'
    }
    # Pass the dictionary through data so the form is actually sent
    response = requests.post(url, data=data)
    print(response.text)

  • This returns the full web page
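
To check what the data dictionary turns into on the wire, the POST request can be prepared without being sent. This is a sketch for illustration; the URL and field name are the ones used above, and the variable names are my own.

```python
import requests

url = "https://fanyi.baidu.com/"
data = {"query": "love"}

# Prepare (but do not send) the POST request
form_request = requests.Request("POST", url, data=data).prepare()

# A dictionary passed through data is sent as a URL-encoded form body
print(form_request.method)                   # POST
print(form_request.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_request.body)                     # query=love
```

The body printed here should match the Form Data shown in the browser's Payload tab; if it does not, the dictionary is missing fields that the site expects.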

