
Python Crawler Learning Notes: the Crawling Process and the requests Module


Python Crawler Learning

Day 1

HTTP/HTTPS Protocols

  1. HTTP protocol

HTTP is the protocol that governs data exchange between a client and a server.

Common request header information:

  • User-Agent: identifies the client making the request. For example, when Chrome sends a request, the User-Agent string includes the browser, the current operating system, and other details.
  • Connection: whether to close the connection or keep it alive after the request completes; typical values are keep-alive and close.
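To see what these request headers look like in practice, the sketch below builds (but does not send) a request with the requests module and inspects the headers it would transmit. The User-Agent string "MyCrawler/1.0" is an illustrative value, not a real browser's.

```python
import requests

# Build a GET request carrying custom headers, then "prepare" it
# to see exactly what would be sent over the wire.
req = requests.Request(
    "GET",
    "https://www.sogou.com/",
    headers={"User-Agent": "MyCrawler/1.0", "Connection": "close"},
)
prepared = req.prepare()

print(prepared.headers["User-Agent"])  # MyCrawler/1.0
print(prepared.headers["Connection"])  # close
```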

Common response header information:

  • Content-Type: the type of the data the server sends back to the client.
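To observe a Content-Type response header without depending on a live website, the sketch below starts a throwaway local HTTP server (an assumption made only for this illustration) and reads the header from the requests response object:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a tiny HTML page and declare its type in the response headers.
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# Port 0 asks the OS for any free port; server_port tells us which one.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = requests.get(f"http://127.0.0.1:{server.server_port}/")
print(response.headers["Content-Type"])  # text/html; charset=utf-8
server.shutdown()
```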
  2. HTTPS protocol

HTTPS is secure HTTP (Hypertext Transfer Protocol); the security comes from data encryption.

Encryption schemes:

  • Symmetric key encryption
  • Asymmetric key encryption

With asymmetric key encryption alone, there is no guarantee that the public key the client receives actually came from the server (an attacker could substitute its own).
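The defining property of symmetric key encryption is that the same shared key both encrypts and decrypts. The toy sketch below uses XOR to illustrate that property only; XOR like this is not a secure cipher.

```python
# Toy symmetric cipher: XOR each byte of the data with a repeating key.
# Applying the SAME key twice returns the original data, which is the
# essence of symmetric key encryption (and its key-distribution problem:
# both sides must somehow share this key safely).
def xor_cipher(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


key = b"secret"
ciphertext = xor_cipher(b"hello server", key)
plaintext = xor_cipher(ciphertext, key)  # same key reverses the operation
print(plaintext)  # b'hello server'
```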

  • Certificate key encryption (HTTPS)

The server first submits its public key to a certificate authority (CA). After auditing, the CA digitally signs the public key, wraps it into a certificate, and the server sends that certificate to the client. Once the client verifies the certificate, it can trust that the public key really was provided by the server.
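The certificate flow above can be sketched as follows. This is only a conceptual toy: real HTTPS certificates use asymmetric signatures (e.g. RSA or ECDSA) and the client holds only the CA's public key, whereas this sketch stands in HMAC for the signature and assumes the client shares the CA secret purely for illustration.

```python
import hashlib
import hmac

ca_secret = b"ca-private-secret"  # assumption: stand-in for the CA's signing key
server_public_key = b"server-public-key-bytes"  # illustrative key material

# CA side: "sign" the server's public key and bundle both into a certificate.
signature = hmac.new(ca_secret, server_public_key, hashlib.sha256).hexdigest()
certificate = {"public_key": server_public_key, "signature": signature}

# Client side: recompute the signature over the certificate's public key
# and compare before trusting that the key came from the server.
expected = hmac.new(ca_secret, certificate["public_key"], hashlib.sha256).hexdigest()
trusted = hmac.compare_digest(expected, certificate["signature"])
print(trusted)  # True
```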

The requests module

Python offers two modules for network requests: urllib (older and more cumbersome) and requests (concise and efficient).

Features of requests:

  • Very powerful
  • Simple and convenient
  • Very efficient

Purpose of requests:

  • Simulates a browser sending a request.

Coding workflow with the requests module:

It strictly follows the steps a browser takes when sending a request.

  • Specify the URL (uniform resource locator)
  • Initiate the request with the requests module
  • Get the response data
  • Persist the data to storage

Environment installation: pip install requests

Example code:

  • Crawl the Sogou search home page
import requests

if __name__ == "__main__":
    # step 1: specify the URL
    url = "https://www.sogou.com/"
    # step 2: initiate the request; get() returns a Response object
    response = requests.get(url=url)
    # step 3: get the response data; .text returns it as a string
    page_text = response.text
    print(page_text)
    # step 4: persistent storage
    with open('./sougou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Finished crawling the data!')
