
Python crawler: the Requests library


Table of contents

  • The crawler library Requests
    • 1. Installation
    • 2. Sending a request
      • GET requests
      • POST requests
      • Complex request modes
    • 3. Getting the response

The crawler library Requests


Requests is a very practical HTTP client library for Python that fully meets the needs of today's web crawlers. Compared with Urllib, Requests not only provides all of Urllib's functionality, but its syntax is also easy to understand and completely in line with Python's elegant, concise style. It is fully compatible with both Python 2 and Python 3, widely applicable, and more user-friendly to operate.

1. Installation

As a third-party Python library, Requests can be installed with pip, as shown below:

  • Windows: pip install requests
  • Linux: sudo pip install requests

Besides installing with pip, you can also download a whl file and install it, but the steps are more complicated and are not covered here.

It is worth mentioning that Requests is an open-source library whose source code is hosted on GitHub: https://github.com/kennethreitz/requests. If you want the latest version, you can download the Requests source code directly from GitHub at https://github.com/kennethreitz/requests/releases. Decompress the source package, go into the unzipped folder, and run the setup.py file (python setup.py install).
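After installation, a quick check (a minimal sketch) confirms that the library can be imported and used:

import requests

# Print the installed version to confirm the import works
print(requests.__version__)
# A trivial smoke-test request; any reachable site will do
print(requests.get('https://www.baidu.com/').status_code)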


2. Sending a request

The most common HTTP requests are GET and POST, and Requests provides a separate method for each.

import requests

url = 'https://baidu.com/'
# GET request (params, headers, proxies, verify and cookies are optional
# keyword arguments, each explained in the sections below)
r = requests.get(url, params=params, headers=headers, proxies=proxies, verify=True, cookies=cookies)
# POST request (data and files carry the request body)
r = requests.post(url, data=data, files=files, headers=headers, proxies=proxies, verify=True, cookies=cookies)

GET requests

A GET request takes one of two forms, without parameters or with parameters, for example:

# Without parameters
https://www.baidu.com/
# With the parameter wd
https://www.baidu.com/s?wd=python

Whether a URL carries parameters can be judged by the symbol "?": if the address after the domain name contains "?", the URL carries request parameters; otherwise it has none. The GET parameters above break down as follows:

  1. wd is the parameter name; parameter names are defined by the website (server).
  2. python is the parameter value, which can be set by the user.
  3. If a URL has multiple parameters, they are joined with "&".

Requests offers two ways to send a GET request to a URL with parameters:

import requests

# The first way: the parameters are embedded directly in the URL
r = requests.get('https://www.baidu.com/s?wd=python')
# The second way: the parameters are passed as a dictionary
url = 'https://www.baidu.com/s'
params = {'wd': 'python'}
r = requests.get(url, params=params)
# Output the generated URL
print(r.url)

Both ways work and produce the same result; the second passes the parameter names and values as a dictionary. The first is recommended in actual development because the code is more concise, and if a parameter changes dynamically you can build the URL with string formatting, for example: 'https://www.baidu.com/s?wd=%s' % ('python').
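As a minimal sketch of that dynamic formatting (the keyword list here is made up for illustration):

import requests

# Hypothetical list of search keywords to crawl in turn
keywords = ['python', 'requests', 'crawler']
for kw in keywords:
    # Build the URL dynamically with string formatting
    r = requests.get('https://www.baidu.com/s?wd=%s' % kw)
    print(r.url, r.status_code)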

POST requests

A POST request is what we usually call submitting a form; the form's data content becomes the POST request's parameters. To make a POST request with Requests, set the request data via the data parameter. The data format can be a dictionary, a tuple list, a list, or a JSON string; each format has its own advantages.

import json
import requests

# Dictionary
data = {'key1': 'value1', 'key2': 'value2'}
# Tuple list (useful when a key appears more than once)
data = (('key1', 'value1'), ('key2', 'value2'))
# JSON: convert the dictionary to a JSON string
data = json.dumps({'key1': 'value1', 'key2': 'value2'})
# Send the POST request
r = requests.post('https://www.baidu.com/', data=data)
print(r.text)
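The practical difference between the formats is how the body is encoded on the wire. A small sketch against the public echo service httpbin.org (an assumption: that site must be reachable) shows how the server sees each variant:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Sent as form-encoded data (Content-Type: application/x-www-form-urlencoded)
r = requests.post('https://httpbin.org/post', data=payload)
print(r.json()['form'])   # {'key1': 'value1', 'key2': 'value2'}

# Sent as a JSON body (requests sets Content-Type: application/json)
r = requests.post('https://httpbin.org/post', json=payload)
print(r.json()['json'])   # {'key1': 'value1', 'key2': 'value2'}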

Complex request modes

Complex requests often involve request headers, proxy IPs, certificate verification, Cookies, and so on. Requests simplifies this whole family of complex requests: each feature is passed as a parameter of the request and applied to it.

(1) Adding request headers. Build the request headers as a dictionary, then set the headers parameter in the request you send so that it points to the defined headers.

headers = {
    # Placeholder values; fill in the real header fields you need
    'User-Agent': '......',
    '...': '...',
}
requests.get('https://www.baidu.com/', headers=headers)

Adding request headers via the headers parameter is one way of getting around anti-crawling measures on the requested server: it disguises the crawler as an ordinary browser visit rather than a program scraping data. This solves the cases where the crawled page's response text contains a "Sorry" message, denies access, and so on.
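As a concrete sketch (the User-Agent string below is just one example of a common desktop-browser identifier):

import requests

headers = {
    # A typical desktop-browser User-Agent; any current browser string works
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/90.0.4430.93 Safari/537.36'),
}
r = requests.get('https://www.baidu.com/', headers=headers)
print(r.status_code)
# Inspect the headers that were actually sent
print(r.request.headers['User-Agent'])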

(2) Using proxy IPs. A proxy IP is used in the same way as the request headers: just set the proxies parameter.

import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('https://www.baidu.com/', proxies=proxies)

When we crawl a website with a Python crawler, a single fixed IP visits at a very high frequency, which does not match human behavior: a person cannot make such frequent requests within a few milliseconds. Some websites therefore set a threshold on per-IP access frequency, and an IP that exceeds it is judged to be a crawler rather than a person. Our IP may then be blocked, and our own IP can also be traced. In such cases we can use some good proxy IPs to meet the larger demand.
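A common mitigation is to rotate through a pool of proxies; a minimal sketch follows (the proxy addresses below are made-up placeholders):

import random
import requests

# Hypothetical proxy pool; replace with real, working proxy addresses
proxy_pool = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]

# Pick a random proxy for each request so no single IP is hammered
for url in ['https://www.baidu.com/'] * 3:
    proxies = random.choice(proxy_pool)
    try:
        r = requests.get(url, proxies=proxies, timeout=5)
        print(r.status_code)
    except requests.exceptions.RequestException as e:
        print('request failed:', e)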

(3) Certificate verification. Verification can generally be turned off. Setting the parameter verify=False in the request disables certificate verification; the default is True. If you need to supply a certificate file, set the verify parameter to the certificate's path.

Web pages generally have valid certificates, so this parameter is usually not added to a request. If a page's certificate cannot be verified, however, it may still be possible to visit it by turning certificate verification off. Note that running with verification disabled pops up a warning, but the request still works and is otherwise unaffected.
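A minimal sketch of disabling verification and silencing the resulting InsecureRequestWarning (suppressing the warning hides a real security signal, so do this only when you understand the risk):

import requests
import urllib3

# Suppress the InsecureRequestWarning that unverified HTTPS requests emit
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# verify=False skips certificate validation entirely
r = requests.get('https://www.baidu.com/', verify=False)
print(r.status_code)

# Alternatively, point verify at a certificate / CA bundle path
# r = requests.get('https://www.baidu.com/', verify='/path/to/certfile')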

(4) Timeout. After a request is sent, the network, the server, and other factors cause a delay between the request and the response. If you don't want the program to wait too long, or want to bound the waiting time, you can set timeout to a number of seconds to wait, and stop waiting for a response after that time. If the server does not respond within timeout seconds, an exception is raised.

requests.get('https://www.baidu.com', timeout=5)
requests.post('https://www.baidu.com', timeout=5)
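The exception can be caught so the crawler degrades gracefully instead of crashing; a minimal sketch:

import requests

try:
    r = requests.get('https://www.baidu.com', timeout=5)
    print(r.status_code)
except requests.exceptions.Timeout:
    # Raised when no response arrives within the timeout window
    print('the request timed out')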

(5) Setting Cookies. To use Cookies during a request, just set the cookies parameter. Cookies identify the user; in Requests they are passed as a dictionary or a RequestsCookieJar object. They are usually obtained either by reading them from the browser or generated by running the program.

import requests

url = 'https://www.baidu.com/'
test_cookies = 'JSESSIONID=2C30FCABDBF3B92E358E3D4FCB69C95B; Max-Age=172800;'
cookies = {}
# Split the string on ';' and turn each 'key=value' pair into a dict entry
for item in test_cookies.split(';'):
    item = item.strip()
    if not item:  # skip the empty piece left by the trailing ';'
        continue
    key, value = item.split('=', 1)
    cookies[key] = value
r = requests.get(url, cookies=cookies)
print(r.text)

When the program sends a request without the cookies parameter, Requests automatically generates a RequestsCookieJar object to store the Cookies information.

import requests

url = 'https://www.baidu.com/'
r = requests.get(url)
# r.cookies is a RequestsCookieJar object
print(r.cookies)
thecookies = r.cookies
# Convert the RequestsCookieJar into a dictionary
cookie_dict = requests.utils.dict_from_cookiejar(thecookies)
print(cookie_dict)
# Convert a dictionary into a RequestsCookieJar
cookie_jar = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
print(cookie_jar)
# Add a Cookies dictionary to an existing RequestsCookieJar object
print(requests.utils.add_dict_to_cookiejar(thecookies, cookie_dict))
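When Cookies need to persist across several requests (for example, to stay logged in), requests.Session manages the CookieJar automatically; a minimal sketch:

import requests

# A Session keeps Cookies from earlier responses and sends them
# back automatically on later requests
s = requests.Session()
s.get('https://www.baidu.com/')
print(s.cookies)                      # Cookies set by the first response
r = s.get('https://www.baidu.com/')   # reuses those Cookies automatically
print(r.status_code)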

3. Getting the response

When a request is sent to a website (server), the site returns a corresponding response object containing the server's response information. Requests provides the following ways to access the response content; a short sketch exercising several of them follows the list.

  • r.status_code: the response status code.
  • r.content: the response body as bytes; it still needs to be decoded.
  • r.text: the response body as a string, automatically decoded according to the character encoding of the response headers.
  • r.raw: the raw response body; read it with r.raw.read().
  • r.encoding: the encoding format.
  • r.headers: the server's response headers as a dictionary-like object; this dictionary is special in that its keys are case-insensitive, and looking up a key that does not exist returns None.
  • r.json(): Requests' built-in JSON decoder.
  • r.raise_for_status(): raises an exception if the request failed (a 4xx or 5xx response).
  • r.cookies: the Cookies after the request.
  • r.url: the request URL.
  • r.history: with the request parameter allow_redirects set to True (redirection allowed), r.history shows the history, i.e. all the redirected requests made before the final successful one.
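A minimal sketch (the echo endpoint https://httpbin.org/get is an assumption; any URL that returns JSON will do):

import requests

r = requests.get('https://httpbin.org/get')
r.raise_for_status()                 # raises if the response is 4xx/5xx
print(r.status_code)                 # e.g. 200
print(r.encoding)                    # encoding used to build r.text
print(r.headers['content-type'])     # header keys are case-insensitive
print(r.json()['url'])               # decoded JSON body
print(r.url, r.history)              # final URL and any redirects on the way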

Note: r.text is the usual way to read the response content, but sometimes decoding fails and you get garbled text. This happens when the detected encoding format is wrong: you can check it with r.encoding, and you can also assign r.encoding = '…' to specify the correct encoding. A typical web page is encoded as utf-8 (though it may be gbk).

This manual approach is a bit clumsy, though. A simpler way is chardet, a good string/file encoding detection module.

Install the module first: pip install chardet

After installation, chardet.detect() returns a dictionary in which confidence is the detection accuracy and encoding is the detected encoding.

import chardet
import requests

r = requests.get('http://www.baidu.com')
print(chardet.detect(r.content))
# Assign the encoding detected by chardet to r.encoding so r.text decodes correctly
r.encoding = chardet.detect(r.content)['encoding']
print(r.text)
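Requests itself exposes a similar facility: r.apparent_encoding runs charset detection on the response body, so the same fix can be written without importing chardet directly:

import requests

r = requests.get('http://www.baidu.com')
# apparent_encoding is requests' own charset detection of the body
r.encoding = r.apparent_encoding
print(r.text)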

