Python 3 web crawler development in practice: the `urllib` library contains the following four basic modules:
- `request`: the most basic HTTP request module, used to simulate sending a request.
- `error`: the exception-handling module.
- `parse`: a tool module that provides URL splitting, parsing, merging, and other functions.
- `robotparser`: mainly used to parse a website's robots.txt file, in which crawler permissions are set, i.e. which crawlers the server allows to crawl which pages.
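As a quick illustration of the last module, `urllib.robotparser` can answer whether a given crawler may fetch a given URL. The rules below are a made-up example fed in as text, not fetched from any real site:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse robots.txt rules supplied as a list of lines (hypothetical example rules)
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'https://example.com/public/index.html'))   # True
print(rp.can_fetch('*', 'https://example.com/private/data.html'))   # False
```

In a real crawler, `set_url()` and `read()` would be used instead of `parse()` to download the site's actual robots.txt.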
This post records the basic use of some API functions in the `request` module.
`urllib.request.urlopen()` sends a request to a web page. Its signature:

```python
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
```

Parameters:
- `url`: the URL to request.
- `data`: data sent to the specified URL. When this parameter is given, the request method becomes POST; otherwise it is GET. The parameter must be converted to byte-stream-encoded content with the `bytes()` method, as shown in an example below.
- `timeout`: the timeout. If no response is received within the set time, an exception is raised.
- `cafile`, `capath`: a CA certificate and its path, respectively. `cadefault` and `context` are not covered here.

Example of use:
```python
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(type(response))                   # data type of the response object
print(response.read().decode('utf-8'))  # HTML source of the page
```

After calling `urlopen()`, the object returned by the server is stored in `response`. Printing its type shows that it is an `http.client.HTTPResponse` object.
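Besides `read()`, the `HTTPResponse` object exposes the status code and response headers. A small sketch, assuming network access to the same URL as above:

```python
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(response.status)                # HTTP status code, e.g. 200
print(response.getheader('Server'))   # value of one response header
print(response.getheaders())          # all headers as (name, value) tuples
```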
- To add data to the request, you can use the `data` parameter. Example of use:

```python
import urllib.request
import urllib.parse

dic = {'name': 'Tom'}
data = bytes(urllib.parse.urlencode(dic), encoding='utf-8')
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
```

The dictionary passed through the `data` parameter must first be converted to a string with `urllib.parse.urlencode()` and then transcoded to the byte type with the `bytes()` method.
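The two-step conversion can also be seen in isolation (the extra `age` key here is just for illustration):

```python
from urllib.parse import urlencode

dic = {'name': 'Tom', 'age': 20}
s = urlencode(dic)                 # percent-encodes values and joins pairs with '&'
data = bytes(s, encoding='utf-8')  # byte string suitable for the data parameter
print(s)     # name=Tom&age=20
print(data)  # b'name=Tom&age=20'
```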
`timeout` specifies the timeout period, in seconds:

```python
response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
```

If no response is received from the server within 0.01 seconds, an exception is raised.
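In a real crawler the exception is usually caught rather than allowed to propagate. A sketch reusing the URL and timeout from the text; `urlopen()` typically wraps the low-level timeout in a `urllib.error.URLError`, though depending on the Python version it may also surface as a bare `socket.timeout`/`TimeoutError`:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    # the wrapped reason tells a timeout apart from other failures
    if isinstance(e.reason, socket.timeout):
        print('request timed out')
except socket.timeout:
    print('request timed out')
```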
`urlopen()` accepts only a few parameters, which also means there is little we can set in the request headers. To construct a more complete request, use a `urllib.request.Request` object. This object encapsulates the request, including its headers; with a `Request` object we can set the headers separately, instead of passing only a URL as in the previous method.
- The constructor of `Request`:

```python
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
```

Example of use:

```python
from urllib import request, parse

url = 'https://www.httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'www.httpbin.org'
}
dic = {'name': 'Tom'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```

A `Request` object is constructed in advance and then passed as an argument to the `urlopen()` method. (Note that the object should not be named `request`, as in some versions of this example, since that would shadow the imported `urllib.request` module.)
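Headers do not have to be supplied at construction time; `Request.add_header()` can add them afterwards. A short sketch reusing the URL and form data from the example above, without actually sending the request:

```python
from urllib import request, parse

data = bytes(parse.urlencode({'name': 'Tom'}), encoding='utf-8')
req = request.Request('https://www.httpbin.org/post', data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(req.get_method())              # 'POST'
print(req.get_header('User-agent'))  # header names are stored capitalized
```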