
Python3 crawler overview

Editor: Python

Crawler Basics

Contents

  • Crawler Basics
    • Crawler Overview
    • Session and Cookie Overview
      • 1. Session
      • 2. Cookie
      • 3. About Session
  • References

Crawler Overview

Simply put, a crawler is an automated program that extracts information from web pages and saves it.

  • What a crawler does:

    • Fetch the web page: The crawler first retrieves the page source, which all later analysis works on. In Python this is done with libraries such as urllib and requests.

    • Parse the page and extract the target information: Once it has the page source, the crawler parses it and extracts the target information.

    • Save the data: The extracted information is saved for later use.
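The three steps above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the function names (`fetch`, `extract_links`, `save`), the choice of links as the "target information", and the sample HTML are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Step 1: download the page source (performs a real network request)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 2: collect the href of every <a> tag in the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save(links, path):
    """Step 3: store the extracted data as JSON for later use."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f, ensure_ascii=False, indent=2)


# Offline demonstration on a hard-coded page, so no network is needed.
SAMPLE_HTML = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'
sample_links = extract_links(SAMPLE_HTML)
```

On a real site the flow would be `save(extract_links(fetch(url)), "links.json")`, where `url` and the output file name are chosen by the caller.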

Session and Cookie Overview

When we visit a website, we may need to enter a login name and password. After we close a site we have logged in to and later open the page again, we often do not need to enter the login information (login name, password, and so on) again. This is the result of Session and Cookie working together.

Let's first introduce some prerequisite concepts:

  1. Static web pages and dynamic web pages :
  • What is a static web page? In website design, a page written in pure HTML is usually called a "static web page". Another definition contrasts it with dynamic web pages: a static web page has no back-end database, contains no program code, and is not interactive.

  • Advantages and disadvantages of static web pages:

Advantages: fast loading, simple to write.

Disadvantages: poor maintainability; the displayed content cannot change flexibly based on the URL.

  • What is a dynamic web page? It is a web programming technique that is the counterpart of static web pages. Its main difference from static web pages is that it allows data exchange between the user and the back-end server.

  • Advantages and disadvantages of dynamic web pages:

Advantages: more flexible, more features. (The page can parse changes in the URL parameters dynamically and present different content.)

Disadvantages: ① no advantage in access speed; ② no advantage in being indexed by search engines.

Note: whether a page is "dynamic" or "static" does not depend on whether the displayed content moves (carousels, scrolling captions, and so on), but on whether the page can exchange data with a back-end database.
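To make the idea of "different content for different URL parameters" concrete, here is a small sketch of what a dynamic page does on the server side. The `ARTICLES` dictionary stands in for a back-end database, and the `render` function and the `id` parameter are hypothetical names chosen for this example.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for a back-end database.
ARTICLES = {"1": "Crawler basics", "2": "Session and Cookie"}


def render(url):
    """Return page content chosen by the `id` query parameter."""
    query = parse_qs(urlparse(url).query)
    article_id = query.get("id", [""])[0]
    title = ARTICLES.get(article_id, "Not found")
    return f"<h1>{title}</h1>"
```

The same path with `?id=1` and `?id=2` produces two different pages, which a static HTML file could not do.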

  2. Stateless HTTP:

HTTP being stateless means that the HTTP protocol has no memory of the transactions it processes; in other words, the server does not know the state of the client.

For example: we log in to a website, so our state is "logged in". Because HTTP is stateless, when we request the site again the server does not know whether we are logged in, so the request must again carry our login information. This causes some information to be sent repeatedly.

Therefore, techniques for maintaining HTTP connection state appeared, namely Session and Cookie.

1. Session

A Session literally means a conversation. Its original meaning is a series of actions and messages with a beginning and an end. For example, in a phone call, the process from picking up the phone and dialing to hanging up can be regarded as one Session.

A Session object stores the attributes and configuration information needed for a user's Session. In effect, the Session object saves the state of the current session.

Sessions are stored on the server. When a user sends a request to the server and no corresponding Session object exists, the server creates a new one.

2. Cookie

Cookie, sometimes written in the plural as Cookies, is a "small text file": data (usually encrypted) that some websites store on the user's local terminal in order to identify the user and perform Session tracking. It is information stored temporarily or permanently by the user's client computer.

  • Maintaining user state

When the user first sends a request to the server, the server returns a response with a Set-Cookie header field, which is used to mark the user. The client saves the Cookie, and the next time it sends a request to this server it puts the saved Cookie in the request header.

When the server first responds to a client's request, it creates a corresponding Session. The client's Cookie stores the ID of that Session. By parsing the Cookie sent by the client, the server can locate the corresponding Session and obtain the client's state.
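The round trip described above can be simulated in a few lines without any real network traffic. This is only a sketch under assumptions: `handle_request`, the in-memory `SESSIONS` dictionary, the cookie name `session_id`, and the `visits` counter are invented for illustration and are not taken from any particular web framework.

```python
import uuid
from http.cookies import SimpleCookie

SESSIONS = {}  # server-side store: session ID -> session data


def handle_request(cookie_header=None):
    """Return (response_headers, session) for one request.

    cookie_header is the value of the client's Cookie request header, if any.
    """
    session_id = None
    if cookie_header:
        cookie = SimpleCookie()
        cookie.load(cookie_header)
        if "session_id" in cookie:
            session_id = cookie["session_id"].value

    headers = {}
    if session_id not in SESSIONS:
        # Unknown client: create a Session and tell the client its ID.
        session_id = uuid.uuid4().hex
        SESSIONS[session_id] = {"visits": 0}
        headers["Set-Cookie"] = f"session_id={session_id}; Path=/; HttpOnly"

    session = SESSIONS[session_id]
    session["visits"] += 1
    return headers, session


# First request: no cookie, so the server creates a Session and sends Set-Cookie.
first_headers, first_session = handle_request()

# The client stores the cookie and sends it back on the next request.
stored = SimpleCookie()
stored.load(first_headers["Set-Cookie"])
cookie_header = f"session_id={stored['session_id'].value}"
second_headers, second_session = handle_request(cookie_header)
```

On the second request the server sends no new Set-Cookie header and finds the same Session object, which is how the "logged in" state survives across stateless HTTP requests.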

  • Cookie attributes

Taking the Google Chrome browser as an example, open a web page (such as Zhihu) and press F12 to enter developer mode. In the Application panel, the Cookies subitem under Storage on the left shows the details of each Cookie.

  • Name: The name of the Cookie. It cannot be changed after creation.

  • Value: The value of the Cookie.

  • Domain: The domain that can access the Cookie. For example, if it is set to .zhihu.com, every domain ending in zhihu.com can access it.

  • Path: The path on which the Cookie is used.

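  • Max-Age: The time until the Cookie expires, in seconds. Some server-side APIs (for example Java's Cookie.setMaxAge) use a negative value to mean that the Cookie is discarded when the browser is closed; in the HTTP header itself, a Cookie with neither Max-Age nor Expires behaves this way.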

  • Size: The size of the Cookie.

  • HTTP: Omitted here.

  • Secure: Omitted here.

The lifetime of a Cookie is determined by the Max-Age/Expires field.
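The attributes listed above can also be read programmatically. The sketch below parses a Set-Cookie value with the standard library's `http.cookies` module; the header string itself (cookie name `token`, value `abc123`, and the attribute values) is made up for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, made up for illustration.
header = "token=abc123; Domain=.zhihu.com; Path=/; Max-Age=3600; Secure; HttpOnly"

cookie = SimpleCookie()
cookie.load(header)
morsel = cookie["token"]  # one Morsel object per cookie name

attributes = {
    "Name": morsel.key,
    "Value": morsel.value,
    "Domain": morsel["domain"],
    "Path": morsel["path"],
    "Max-Age": morsel["max-age"],    # kept as a string by the parser
    "Secure": morsel["secure"],      # flag attributes are parsed as True
    "HttpOnly": morsel["httponly"],
}
```

Note that `Size` is not a header attribute; browser developer tools compute it from the cookie's name and value.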

3. About Session

When the user closes the browser, the corresponding Session on the server does not disappear immediately. Only after the Session lifetime configured on the server has elapsed is the Session deleted by the server, to save storage space.
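A server-side store with a lifetime per Session might look roughly like the sketch below. The `SessionStore` class and its methods are hypothetical, and refreshing the timer on every access (a "sliding" lifetime) is an assumption of this example; real servers may instead use a fixed lifetime or a background cleanup task.

```python
import time


class SessionStore:
    """In-memory Session store that drops entries after ttl_seconds of inactivity."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._sessions = {}  # session ID -> (last access time, data)

    def create(self, session_id, now=None):
        now = time.time() if now is None else now
        self._sessions[session_id] = (now, {})

    def get(self, session_id, now=None):
        """Return the session data, or None if it is missing or expired."""
        now = time.time() if now is None else now
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        last_access, data = entry
        if now - last_access > self.ttl:
            del self._sessions[session_id]  # expired: free the storage
            return None
        self._sessions[session_id] = (now, data)  # refresh the timer on access
        return data
```

The optional `now` argument exists only so the behaviour can be checked without waiting; in normal use it is left out and the current time is used.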

References

  • https://baike.baidu.com/item/%E5%8A%A8%E6%80%81%E7%BD%91%E9%A1%B5/6327050#6
  • https://baike.baidu.com/item/%E9%9D%99%E6%80%81%E7%BD%91%E9%A1%B5/6327183
  • https://baike.baidu.com/item/cookie/1119
  • Python3 Web Crawler Development Practice (by Cui Qingcai)
