
Python crawler Prelude [Chengxin notes]


The process of sending an HTTP request from a browser:

1. When the user enters a URL and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests mainly use two methods: GET and POST.

2. When we type http://www.baidu.com into the browser, the browser sends a Request to fetch the HTML file at http://www.baidu.com, and the server sends the Response file object back to the browser.

3. The browser analyzes the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends new Requests to fetch those images, CSS files, and JS files.

4. Once all the files have downloaded successfully, the web page is rendered in full according to its HTML structure.
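Step 3 above can be sketched with the standard library: a tiny HTML parser that collects the image, CSS, and JS references a browser would go on to fetch. The sample HTML string and its paths are invented for illustration.

```python
# A minimal sketch of step 3: scanning fetched HTML for the extra
# resources (images, CSS, JS) a browser would request next.
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Collects the URLs of images, stylesheets, and scripts."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.resources.append(attrs.get("href"))
        elif tag == "script" and "src" in attrs:
            self.resources.append(attrs["src"])

# Made-up HTML standing in for a server Response
html = """
<html><head>
  <link rel="stylesheet" href="/static/site.css">
  <script src="/static/app.js"></script>
</head><body>
  <img src="/images/logo.png">
</body></html>
"""

finder = ResourceFinder()
finder.feed(html)
print(finder.resources)  # ['/static/site.css', '/static/app.js', '/images/logo.png']
```

A real browser (or crawler) would then issue one more Request per entry in that list.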

# URL explained:

URL is short for Uniform Resource Locator.

A URL consists of the following parts:

scheme://host:port/path/?query-string=xxx#anchor

●scheme: the protocol used for access, usually http or https, or sometimes ftp.

●host: the host name or domain name, such as www.baidu.com.

●port: the port number. When you visit a website, the browser uses port 80 by default.

●path: the path. For example, in www.jianshu.com/trending/now, the part after the domain, trending/now, is the path.

●query-string: the query string. For example, in www.baidu.com/s?wd=python, the part after the ?, wd=python, is the query string.

●anchor: the anchor. The backend does not need to care about it; the front end uses it for positioning within the page.
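The parts listed above can be pulled apart with `urllib.parse.urlparse` from the standard library. The URL below is a made-up example that combines every piece:

```python
# Splitting a URL into the components described above.
from urllib.parse import urlparse

parts = urlparse("https://www.example.com:8080/trending/now?wd=python#comments")
print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /trending/now
print(parts.query)     # wd=python
print(parts.fragment)  # comments
```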

When the browser requests a URL, it first encodes the URL: apart from English letters, digits, and a few safe symbols, every other character is replaced by a percent sign plus the hexadecimal value of its bytes.

# Common request header parameters:

In the HTTP protocol, a request sent to the server carries its data in three places: the first is in the URL, the second is in the body (in a POST request), and the third is in the headers. Here are some request header parameters that are often used in web crawlers:

  1. User-Agent: the browser name. This is used constantly in web crawlers. When a web page is requested, this parameter tells the server which browser sent the request. If we send a request through a crawler without setting it, our User-Agent identifies us as Python, and a website with an anti-crawler mechanism can easily judge that the request is a crawler. Therefore, we should always set this value to that of a real browser to disguise our crawler.

  2. Referer: indicates which URL the current request came from. This can also be used as an anti-crawler technique: if the request does not come from the expected page, the server simply does not return the relevant response.

  3. Cookie: the HTTP protocol is stateless; that is, when the same person sends two requests, the server has no way of knowing whether they come from the same person. Cookies are therefore used as a marker. In general, for a website that can only be accessed after logging in, you need to send the cookie information.
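A minimal sketch of attaching these three headers with the standard library's `urllib.request`. The header values and URL are made up, and the request is only built here, never sent:

```python
# Attaching User-Agent, Referer, and Cookie headers to a request.
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # disguise as a browser
    "Referer": "https://www.example.com/",                      # claim we came from this page
    "Cookie": "sessionid=abc123",                               # made-up session cookie
}
req = urllib.request.Request("https://www.example.com/page", headers=headers)

# Nothing has been sent yet; urllib.request.urlopen(req) would perform
# the actual request with these headers attached.
print(req.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

Note that `Request` normalizes stored header names with `str.capitalize()`, which is why the lookup key is `"User-agent"`.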

Common request methods:

The HTTP protocol defines eight request methods. Here are the two most common ones: the GET request and the POST request.

  1. GET request: in general, use a GET request when you only fetch data from the server and do not affect server resources.
  2. POST request: use a POST request when you send data to the server (e.g. logging in) or upload files, i.e. when the request affects server resources.

These are the conventional uses in website development, and sites generally follow them. But some websites and servers, in order to build anti-crawler mechanisms, do not always play by the rules: a request that by convention should use GET may have to be sent as POST instead. It depends on the situation.
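The practical difference can be sketched with `urllib.request`: a GET carries its data in the URL's query string, while a POST carries it in the request body. The URLs and form fields below are made up, and nothing is actually sent:

```python
# GET vs POST: where the data travels.
import urllib.request
from urllib.parse import urlencode

# GET: the data rides in the URL's query string
get_req = urllib.request.Request("https://www.example.com/s?" + urlencode({"wd": "python"}))
print(get_req.get_method())  # GET

# POST: the data rides in the body as bytes, e.g. a login form
post_req = urllib.request.Request(
    "https://www.example.com/login",
    data=urlencode({"user": "alice", "password": "secret"}).encode("utf-8"),
)
print(post_req.get_method())  # POST
```

`urllib.request` infers the method from whether `data` is present; passing a body switches the request to POST.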

Common response status codes:

200: the request succeeded; the server returned the data normally.

301: permanent redirect. For example, visiting www.jingdong.com redirects you to www.jd.com.

302: temporary redirect. For example, when you visit a page that requires login while not logged in, you are redirected to the login page.

404: the requested URL could not be found on the server; in other words, the request URL is wrong.

403: the server refused access; you do not have sufficient permissions.

500: internal server error. This is probably a bug on the server side.
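The standard library's `http.HTTPStatus` maps these codes to their official reason phrases, which is handy when logging crawler responses:

```python
# Looking up the official phrase for each status code above.
from http import HTTPStatus

for code in (200, 301, 302, 403, 404, 500):
    print(code, HTTPStatus(code).phrase)
# 200 OK
# 301 Moved Permanently
# 302 Found
# 403 Forbidden
# 404 Not Found
# 500 Internal Server Error
```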

