
100 Days of Mastering Python (Crawler) - Day 43: Introduction to Crawlers


Table of Contents

  • Preface
  • I. Crawler Overview
    • 1. Why learn crawlers?
    • 2. Crawlers and Python
    • 3. Are crawlers legal?
    • 4. The spear and shield of crawlers
    • 5. Crawler schematic and flow chart
  • II. Related Technology Introduction
    • 1. HTML and CSS
    • 2. URLs explained
    • 3. HTTP and HTTPS
      • (1) Common request methods
      • (2) Common request headers
      • (3) Common response status codes
    • 4. Analyzing a site with the Chrome browser
    • 5. Session and Cookie
    • 6. Ajax requests
  • III. How to make learning more efficient?

Preface

  • About the author: quality creator in the Python field, Huawei Cloud Sharing Expert, Alibaba Cloud Expert Blogger, Top 6 in the 2021 CSDN Blog Star ranking

  • This article is included in the Python full-stack column "100 Days to Master Python: From Beginner to Employment"
  • The column is a complete course prepared for absolute beginners in Python, advancing step by step from 0 to 100, with all knowledge points linked together
  • Subscribe to the column to read the 100+ articles of "Python from Beginner to Employment"; you can also join a 200-person Python full-stack chat group (hands-on teaching and problem solving); group members receive 80 GB of Python full-stack tutorial videos and 300 computer e-books covering the basics, web development, crawlers, data analysis, visualization, machine learning, deep learning, artificial intelligence, algorithms, interview questions, and more
  • Join me to learn and make progress together: one person can walk fast, but a group of people can go further!


I. Crawler Overview

1. Why learn crawlers?

For individuals

  • When you see outstanding pictures that get your blood pumping, you want to save them as future desktop wallpapers

  • When you browse important data (from all walks of life), you want to keep it to support your future sales activities; when you browse strange and impressive videos, you want to save them to your hard disk for later viewing

  • When you hear an excellent song, you want to keep it to add a bright spot to an otherwise dull life

For crawler engineers

  • The company needs data

  • Data analysis

  • Training data for intelligent products

2. Crawlers and Python

Do crawlers have to be written in Python? No. Java works, and so does C. Remember: a programming language is just a tool. Capturing data is your goal, and you can use any tool to reach it. It is like eating: you can use a fork or chopsticks, and either way you end up fed. So why do most people prefer Python? Because writing crawlers in Python is simple. Not convinced? Then why do you eat rice with chopsticks instead of a knife and fork? Because chopsticks are easier to use!

Among the many programming languages, Python is the quickest for a beginner to pick up and has the simplest syntax. More importantly, there are plenty of third-party libraries a crawler can use. To put it bluntly, not only are you eating with chopsticks, you also have a servant helping you eat. Isn't that easier?

3. Are crawlers legal?

First of all, crawlers are not prohibited by law; in other words, the law allows crawlers to exist. However, crawlers also carry the risk of breaking the law, just like kitchen knives: the law allows kitchen knives to exist, but if you use one to hurt someone, no one will make excuses for you. As Wang Xin put it, technology itself is innocent; what matters is what you do with it. For example, combining a crawler with hacking techniques to hammer a target site 18,000 times per second is definitely not allowed.

Crawlers can be divided into well-intentioned crawlers and malicious crawlers:

  • Well-intentioned crawlers: do not damage the resources of the crawled website (normal access, generally low frequency, no theft of user privacy)
  • Malicious crawlers: affect the normal operation of the website (ticket-grabbing and flash-sale bots that monopolize website resources and bring the site down)

To sum up, to stay out of trouble we should stick to the proper path and keep optimizing our crawler programs so that they do not interfere with the normal operation of the website. When using crawled data, if we discover sensitive content involving user privacy or trade secrets, we must stop crawling and spreading it immediately.

4. The spear and shield of crawlers

Anti-crawling mechanism: a portal website can formulate corresponding policies or technical measures to prevent crawlers from scraping its data.

Anti-anti-crawling strategy: a crawler program can use corresponding policies or technical measures to break through a portal website's anti-crawling mechanism and obtain the data it wants. Personal advice: don't push anti-anti-crawling too hard, or you may already be in malicious-crawler territory.

robots.txt protocol: a gentleman's agreement that specifies which data on a website may be crawled and which may not.
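As a quick way to respect that agreement in code, the standard library's urllib.robotparser can check whether a given URL may be fetched. A minimal sketch, using Baidu Baike only as an example target:

```python
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt (a gentleman's agreement, not an enforcement mechanism)
rp = RobotFileParser("https://baike.baidu.com/robots.txt")
rp.read()

# Ask whether a crawler with this User-Agent may fetch the given URL
print(rp.can_fetch("MyCrawler", "https://baike.baidu.com/item/865"))
```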

5. Crawler schematic and flow chart

Crawler schematic diagram: (figure omitted)

Crawler flow chart: (figure omitted)
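The flow chart boils down to four steps: send a request, receive the response, parse out the data, and store it. Below is a minimal sketch of that loop, assuming the third-party requests and beautifulsoup4 packages are installed and using example.com purely as a placeholder target:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder target

# 1. Send the request (a real crawler would set headers, handle errors, and control frequency)
response = requests.get(url, timeout=10)

# 2. Get the response body
html = response.text

# 3. Parse out the data we want (here: the page title)
soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text() if soup.title else ""

# 4. Store the result
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)
```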

II. Related Technology Introduction

1. HTML and CSS

(1) HTML's full name is "HyperText Markup Language". Unlike a programming language, it has no logical structure; it builds web pages with markup, and each tag is enclosed in <>.


Common HTML tags are as follows:

  • <p> — paragraph
  • <a> — hyperlink (href: the hyperlink address)
  • <img> — image (src: the image path)
  • <span> — inline element
  • <li> — list item
  • <div> — divides the HTML page into blocks
  • <table> — table
  • <tr> — table row
  • <td> — table cell
  • <h1>~<h6> — headings

(2) CSS basics

  • CSS stands for Cascading Style Sheets
  • It controls the style and layout of HTML pages
  • Style definitions are enclosed in {}

CSS selectors:

  • Element selector (just know it exists): selects HTML elements by element name
  • id selector (important): uses the id attribute of an HTML element to select one unique, specific element
  • Class selector (very important): selects HTML elements that have a specific class attribute
  • Example (selecting a title): first find the div with class='p12', then take the text of the a tag below it (see the sketch below)
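The example in the last bullet can be expressed directly as a CSS selector. A small sketch using the third-party beautifulsoup4 package; the HTML fragment and the class name p12 follow the bullet above but are otherwise illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div class="p12">
  <a href="https://example.com/post/1">Example article title</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Class selector: find the div with class 'p12', then take the text of the <a> tag below it
for a in soup.select("div.p12 a"):
    print(a.get_text(strip=True), a["href"])
```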

2. URLs explained

Example URL: https://baike.baidu.com/item/%E8%99%8E/865?fromtitle=%E8%80%81%E8%99%8E&fromid=65781

URL is the abbreviation of Uniform Resource Locator. A URL consists of the following parts:

  • 1. Protocol type, e.g. https
  • 2. Host name / domain name, e.g. baike.baidu.com
  • 3. Port number
  • 4. Path, e.g. /item/%E8%99%8E/865
  • 5. Query parameters (everything after the ?), e.g. fromtitle=%E8%80%81%E8%99%8E&fromid=65781, given as key=value pairs, with multiple pairs separated by &
  • 6. Fragment (anchor), used by the front end for in-page positioning; some front-end/back-end separated projects also use the anchor for routing
    • In-page positioning: https://baike.baidu.com/item/%E5%88%98%E8%8B%A5%E8%8B%B1#2
    • Anchor routing: what changes is the part after the #, and data is requested according to the anchor, e.g. https://music.163.com/#/friend
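These parts can be pulled apart programmatically with the standard library's urllib.parse; a quick sketch on the example URL above:

```python
from urllib.parse import urlparse, parse_qs

url = ("https://baike.baidu.com/item/%E8%99%8E/865"
       "?fromtitle=%E8%80%81%E8%99%8E&fromid=65781")

parts = urlparse(url)
print(parts.scheme)           # protocol type: https
print(parts.netloc)           # host name / domain: baike.baidu.com
print(parts.port)             # explicit port: None (443 is implied by https)
print(parts.path)             # path: /item/%E8%99%8E/865
print(parts.query)            # query string: fromtitle=...&fromid=65781
print(parse_qs(parts.query))  # query parsed into key/value pairs
print(parts.fragment)         # anchor / fragment, empty in this URL
```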

3. HTTP and HTTPS

HTTP protocol: full name HyperText Transfer Protocol, the hypertext transfer protocol. It is a method for publishing and receiving HTML (HyperText Markup Language) pages. The default server port is 80.

HTTPS protocol: full name Hyper Text Transfer Protocol over Secure Socket Layer, the encrypted version of HTTP, with an SSL layer added underneath HTTP. The default server port is 443.

For more on these topics, see: illustrated network protocols.

(1) Common request methods

The HTTP protocol requires the browser to choose an interaction method when exchanging data with the server. The HTTP protocol defines eight request methods; the most common are the GET request and the POST request.

GET request: used when you only fetch data from the server and the request will not affect server resources.

POST request: used to send data to the server (e.g. logging in), upload files, and so on, i.e. whenever the request affects server resources. The request parameters are carried in the Form Data.
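A minimal sketch of both methods with the third-party requests library; httpbin.org is a public echo service used here purely for demonstration:

```python
import requests

# GET: only fetch data; the parameters are appended to the URL's query string
r = requests.get("https://httpbin.org/get", params={"keyword": "python"}, timeout=10)
print(r.url)          # https://httpbin.org/get?keyword=python

# POST: send data to the server (e.g. a login form); the parameters go into the Form Data
r = requests.post("https://httpbin.org/post",
                  data={"user": "demo", "pwd": "123456"}, timeout=10)
print(r.json()["form"])   # the form data the server received
```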

(2) Common request headers

In the HTTP protocol, when a request is sent to the server, the data is split into three parts:

  • The first is the data placed in the URL
  • The second is the data placed in the body (for POST requests)
  • The third is the data placed in the headers

Common request header fields:

  • User-Agent: the browser identification string
  • Referer: indicates which URL the current request came from
  • Cookie: HTTP is a stateless protocol, meaning that if the same person sends two requests, the server has no way of knowing whether they came from the same person. Carrying a cookie lets the server recognize a logged-in user or tie the two requests together
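A sketch of attaching these header fields with the requests library; the User-Agent string and the cookie value below are made-up examples, and httpbin.org simply echoes back what it received:

```python
import requests

headers = {
    # Pretend to be a normal browser instead of the default python-requests User-Agent
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"),
    # Tell the server which page the request came from
    "Referer": "https://www.example.com/",
    # Carry an (illustrative) login state
    "Cookie": "sessionid=xxxxxx",
}

r = requests.get("https://httpbin.org/headers", headers=headers, timeout=10)
print(r.json()["headers"])   # the headers the server actually received
```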

(3) Common response status codes

  • 200: the request succeeded and the server returned data normally
  • 301: permanent redirect. For example, visiting http://www.360buy.com (JD's old domain) redirects to https://www.jd.com/
  • 404: the requested URL cannot be found on the server; in other words, the requested URL is wrong
  • 418: the request ran into server-side anti-crawler measures and the server refused to return data
  • 500: internal server error, possibly a bug on the server
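In code, the status code can be read from the response object. A small sketch; the httpbin.org/status/418 endpoint is used only to simulate a 418, and whether a real site returns 418 to a bare crawler depends on that site's anti-crawling rules:

```python
import requests

r = requests.get("https://httpbin.org/status/418", timeout=10)

print(r.status_code)   # 418
if r.status_code == 200:
    print("OK, data returned normally")
elif r.status_code == 404:
    print("URL not found on the server")
elif r.status_code == 418:
    print("Blocked by server-side anti-crawler rules")

# Or raise an exception for any 4xx/5xx response:
# r.raise_for_status()
```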

4. Analyzing a site with the Chrome browser

Open the Google Chrome browser, right-click → Inspect:

  • Elements: helps us analyze the structure of the page and locate the data we want
  • Console: prints out some of the site's information, for example recruitment notices some sites hide in the console
  • Sources: works like a folder holding all the source files the page needs; besides the source code shown in Elements, there are also CSS files, JS files, and so on
  • Network: shows every network request the page sends. When you want to inspect the details of a particular request, look here

5. Session and Cookie

Session and Cookie are techniques used to maintain state across HTTP connections.

Session

  • A Session represents a conversation between the server and the browser.
  • A Session is a server-side mechanism; the Session object stores the information needed for a specific user's session.
  • Sessions are generated by the server and kept in the server's memory, cache, hard disk, or database.


Cookie: generated by the server and sent to the client (usually a browser); cookies are always stored on the client side.

The basic workflow of a Cookie:

  • 1. The server creates the Cookie
  • 2. The client stores the Cookie
  • 3. The client sends the Cookie with subsequent requests
  • 4. The server reads the Cookie
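With the requests library, a Session object keeps the cookies returned by the server and sends them back automatically on later requests, which mirrors the workflow above. A minimal sketch using httpbin.org's cookie demo endpoints:

```python
import requests

s = requests.Session()

# The server's Set-Cookie response is stored in the session automatically
s.get("https://httpbin.org/cookies/set/sessionid/abc123", timeout=10)
print(s.cookies.get("sessionid"))   # abc123

# Later requests made through the same session carry the cookie back to the server
r = s.get("https://httpbin.org/cookies", timeout=10)
print(r.json())                     # {'cookies': {'sessionid': 'abc123'}}
```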

6. Ajax requests

  • Ajax is used for asynchronous data exchange between the browser and the web server. It lets a web page request a small amount of data from the server instead of reloading the whole page.

  • Ajax is independent of the browser and the platform.

  • An Ajax endpoint generally returns JSON; send a GET or POST directly to the Ajax URL and you get JSON data back.

  • To judge whether data is produced by Ajax, watch whether the page refreshes when you scroll: if new data appears without a page refresh, the data is loaded automatically, i.e. it is rendered onto the page by Ajax (see the sketch below)
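Because an Ajax endpoint returns JSON directly, a crawler can skip the HTML page and request the endpoint itself. A sketch under the assumption that the request found in the Chrome Network panel looks like the one below; the URL, parameters, and response fields are hypothetical:

```python
import requests

# URL of the Ajax request copied from the Chrome Network panel (illustrative)
ajax_url = "https://www.example.com/api/articles"
params = {"page": 1, "size": 20}
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(ajax_url, params=params, headers=headers, timeout=10)

# The endpoint returns JSON, so it can be decoded directly without any HTML parsing
data = r.json()
for item in data.get("list", []):
    print(item.get("title"))
```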

III. How to Make Learning More Efficient?

If you feel the blogger updates too slowly, you can practice problems on Nowcoder (牛客) on your own.

1. Programming beginners

Many novice programmers have learned the basic syntax, but don't know what the syntax is for, how to deepen their understanding, or how to improve themselves. At this stage it is very important to solve one problem independently every day. You can try the beginner programming training on Nowcoder. The problems are at the entry level, suitable for beginners who have just learned the syntax, and cover basic syntax, basic program structure, and so on. Each problem has a practice mode and an exam mode; the exam mode simulates a real test, and you can also drill in practice mode.

Link: Nowcoder | Beginner programming training

2. Advanced programmers

Once the basic practice has given you a good grasp of the key knowledge points, move on to the special exercises to learn data structures, algorithm fundamentals, computer science fundamentals, and so on. Start with the easy problems, and if you feel up to it, tackle medium and then hard ones. These three areas are must-ask topics in interviews; only by practicing every day, refusing to lie flat, and continuously improving yourself can you land an offer from a company you are satisfied with.

Link: Nowcoder | Special exercises

Hurry up, let's aim for the big companies together. If you have any questions, leave a message in the comment section and I will answer!

