
What exactly is a Python crawler?


Some time ago, my mother suddenly asked me: "Son, what is a crawler?" I was both surprised and embarrassed: surprised that my mother was curious about crawlers, and embarrassed because I had no idea how to explain them to her.

One. Introduction to crawlers

1. What is a crawler

A web crawler (crawler for short) is a program that fetches information from the Internet according to a set of rules. Since it is just a program, how does it differ from an ordinary user visiting a page? The difference is scale: a user fetches a small amount of information slowly, while a crawler fetches a large amount of information quickly.

Note that crawlers are not exclusive to Python: Java, JavaScript, C, PHP, Shell, Ruby and other languages can all be used to write one. So why are Python crawlers so popular? I think that, compared with other languages, Python has more complete and easier-to-use libraries for crawling, which keeps the community active. An active community in turn helps the Python crawler ecosystem mature, and that maturity attracts more users. This virtuous circle is why Python crawlers are more popular than crawlers written in other languages.

The hello-world-level Python crawler is one that does the equivalent of searching Baidu for the keyword python.
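A minimal sketch of what such a crawler could look like, using only the Python standard library. The search URL pattern `https://www.baidu.com/s?wd=<keyword>`, the function names `build_search_url` and `fetch`, and the User-Agent string are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed Baidu search endpoint; the keyword goes in the "wd" query parameter.
BAIDU_SEARCH = "https://www.baidu.com/s"


def build_search_url(keyword: str) -> str:
    """Build the URL a browser would request when searching for `keyword`."""
    return f"{BAIDU_SEARCH}?{urlencode({'wd': keyword})}"


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as text."""
    # Illustrative User-Agent value; any string can be sent here.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (hello-world crawler)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch(build_search_url("python"))` performs the actual request and returns the HTML of the result page. What comes back depends on the site and may be a verification page rather than search results.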

2. Crawler use cases

Since a crawler fetches web pages in bulk, does that make crawlers a bad thing? Of course not. In fact, we could hardly use the Internet without crawlers. Why? Here are a few everyday applications of crawlers:

  1. Search engines: Google, Baidu, Yahoo, Sogou, Bing and many other search engines are essentially one (or more) giant crawlers. They work in four steps: page collection -> page analysis -> page ranking -> responding to keyword queries. That is, the engine first saves a large number of pages from the Internet to its servers, then analyzes their content to build a keyword index, and finally, when a user enters a keyword, looks up the matching content and sorts it by relevance (Baidu's paid ranking aside). The first step, page collection, is crawling. To see how many pages of a site Baidu has indexed, search Baidu for site: followed by the site you want to check, for example site:blog.csdn.net.
![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/2551817dd260447f96e065a62569df7d~tplv-k3u1fbpfcp-zoom-1.image)
  2. Ticket-grabbing software: Many people complain that 12306 is slow, but few realize that 12306 handles traffic comparable to Taobao's Double 11 almost every day. What site could cope with that? Why is its traffic so high every day? Crawlers, of course. Ticket-grabbing software works by constantly refreshing and checking whether any tickets are left, and with so many ticket-grabbing apps, large and small, you can imagine the request volume. Many companies used to offer ticket-grabbing browser plug-ins, including Baidu, 360, Kingsoft and Sogou, but the Ministry of Railways asked them all to take the plug-ins offline. Ticket-grabbing apps are popular now instead. Why are apps allowed when plug-ins were not? Perhaps because apps are easier to manage and control.
  3. Huihui Shopping Assistant: a service that compares prices across multiple shopping sites and shows the lowest one. It also relies on crawlers, which fetch product prices in bulk and store them so that a price history chart can be drawn, helping you see the lowest price a product has reached.
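The search-engine steps in the first item (collect pages, analyze them into a keyword index, rank, answer queries) can be illustrated with a toy inverted index. This is a deliberately tiny sketch: the pages are hard-coded stand-ins for crawled content, the URLs are made up, and ranking by raw term count is an assumption chosen for simplicity, not how any real search engine ranks results.

```python
import re
from collections import Counter, defaultdict

# Stand-in for the "page collection" step: URL -> page text (hypothetical data).
PAGES = {
    "https://example.com/a": "Python crawler tutorial: write a crawler in Python",
    "https://example.com/b": "Java crawler basics",
    "https://example.com/c": "Cooking recipes",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Page analysis: map each word to the pages containing it, with counts."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index


def search(index, query):
    """Ranking and query response: score pages by how often query words occur."""
    scores = Counter()
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties broken by URL so the order is deterministic.
    return [url for url, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


INDEX = build_index(PAGES)
```

With this data, `search(INDEX, "python crawler")` ranks the page that mentions both words most often first and leaves out the page that mentions neither.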

Two. The value of crawlers

As the examples above show, the value of crawlers to the Internet as a whole is hard to overstate. But what value can crawlers bring to us as individuals?

1. Invisible wings

If you ask me which skill to learn after finishing the Python basics, I will say crawlers without hesitation. Why crawlers?

  1. Crawlers are relatively easy to learn, and the results are immediately visible, which gives you a sense of achievement.
  2. Crawlers are the cornerstone of many other skills, because they are a source of data. In this era, whoever has the data is king, so being able to crawl will definitely add to your strengths.
  3. In China, many companies expect you to be a generalist, so crawler skills are a good bonus when you apply for a job.

2. The invisible business war

Workplace conversation :

Boss: Xiao Ming, I have an important task for you.

Xiao Ming: I will do it even if it means working 996! (This is the first time he has received a task directly from the boss.)

Boss: Can you get the prices of our competitors' products?

Xiao Ming: No problem (brag first, figure it out later), piece of cake!

Boss: It is no small matter. As long as you can keep fetching our competitors' prices, we can always set ours slightly lower. Over time customers will know that our prices are always lower and will come straight to us. You will be the hero of the celebration banquet (dangling a carrot first).

Xiao Ming: Brilliant, boss! Very wise!

3. If you can crawl, you can start a business

Many people use their spare time after work to build their own things or projects. It may look like a toy at the beginning, but if you keep enriching it, it may grow into a mature product.

Crawlers make it easy to build a product of your own, and if it does well you can turn it into a business. Here Charlie offers a few simple startup ideas, purely as food for thought.

To make a good product, start from what users need and build a product or service that solves an existing problem. Your product may be the next Toutiao.

Three. Catch me if you can

Since crawlers are so powerful, can they do whatever they want?

A digression: I have long wondered why Internet companies like to use animals and plants as names or logos. For example: Ant Financial, Tmall (a cat), Cainiao (a bird), JD's dog, Tencent's penguin, Baidu's bear paw, Sogou (a dog), Tuniu (a cow), Meituan's kangaroo... the list goes on. Is it just because they are easy to remember? I think memorability is one reason, but the root cause is the influence of the programming world. Think of how many animals and plants it has: Java (coffee), Python (a python), Go (a gopher), PHP (an elephant), Linux (a penguin), Perl (a camel), MySQL (a dolphin), and so on. Charlie does not know exactly why the programming world likes animals and plants so much. If you do, please leave a comment!

What I want to say is that everything in the world keeps something else in check, and balance prevents disaster. Crawlers are no exception. Here are a few things that constrain crawlers.

1. The robots protocol

Anyone who has built a website probably knows that a file named robots.txt is usually placed in the site's root directory. What is this file for?

The robots protocol, also known as the crawler protocol or robot protocol, is formally called the Robots Exclusion Protocol. Through it, a website tells search engines which pages may be crawled and which may not.

A robots.txt file must sit in the root directory of the website to take effect. The file is optional: if a site has no robots.txt, crawlers generally treat every page as allowed.

Let's take Baidu as an example and look at its robots.txt file:

At the bottom of Baidu's robots.txt there is this rule:

User-agent: *
Disallow: /

This means that any crawler other than those named earlier in the file is not allowed to crawl anything on Baidu!
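A well-behaved crawler checks these rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser` on a made-up robots.txt with the same shape as the rule above: one named crawler gets its own rules and everyone else is shut out. `FriendlyBot`, `MyCrawler` and the `example.com` URLs are hypothetical, and the sample text is not Baidu's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler has its own rules,
# and the final catch-all block disallows everything for everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: FriendlyBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = make_parser(SAMPLE_ROBOTS_TXT)

# The named crawler may fetch public pages but not /private/.
friendly_public = rules.can_fetch("FriendlyBot", "https://example.com/index.html")
friendly_private = rules.can_fetch("FriendlyBot", "https://example.com/private/data")
# Any other crawler falls under "User-agent: *" and is refused everywhere.
stranger_public = rules.can_fetch("MyCrawler", "https://example.com/index.html")
```

In a real crawler you would point the parser at the live file with `set_url(...)` and `read()` instead of parsing a string. Parsing a string here keeps the example free of network access.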

2. The law

We all know that the User-agent header of a request can be customized. That means a crawler can simply ignore the robots protocol, and can also get around anti-crawler rules that are keyed on User-agent. The robots protocol is therefore more of a gentleman's agreement. So is there any law in our country that applies? Let's look at the crime of illegally intruding into a computer information system.
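To see why User-agent offers no real protection, note that the header is just a string chosen by the client. A minimal sketch with the standard library's `urllib.request.Request` follows; nothing is sent over the network, and the helper name `build_request`, the URLs and both User-Agent strings are assumptions made for illustration.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Create a request object that carries a caller-chosen User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# The same URL, presented under two different identities (both made up).
browser_like = build_request("https://example.com/", "Mozilla/5.0 (X11; Linux x86_64)")
honest_bot = build_request(
    "https://example.com/", "MyCrawler/0.1 (+https://example.com/bot-info)"
)

# urllib stores header names capitalized, hence "User-agent" when reading back.
browser_identity = browser_like.get_header("User-agent")
bot_identity = honest_bot.get_header("User-agent")
```

The server only ever sees the string the client chose to send, which is why rules keyed on User-agent depend on crawlers identifying themselves honestly.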

Article 285, crime of illegally intruding into a computer information system: Whoever, in violation of state regulations, intrudes into a computer information system other than those specified in the preceding paragraph, or uses other technical means, to obtain data stored, processed or transmitted in that computer information system, or to exercise illegal control over it, shall, if the circumstances are serious, be sentenced to fixed-term imprisonment of not more than three years or criminal detention, and shall also, or shall only, be fined; if the circumstances are especially serious, the sentence shall be fixed-term imprisonment of not less than three years but not more than seven years, together with a fine.

The key point is this: it is illegal to intrude into a computer to obtain data. In other words, crawler technology itself is innocent, because it accesses public information and does not intrude into a computer. But if you use crawled data for commercial purposes, that may constitute a crime!

I find that these cases have three things in common: 1. the nature of the company; 2. a competitor is involved; 3. someone found evidence.

Finally, a reminder to all technologists: keep to your bottom line, and never do anything that violates national laws and regulations!

3. Anti-crawler engineers

I originally wanted to interview an anti-crawler engineer at Ctrip, but he said that an interview would be inconvenient because his work is confidential. We have to respect his decision, and I apologize to you all!

Four. The current state of crawlers

Charlie said earlier that more than 50% of traffic comes from crawlers. Let's talk about the current state of crawlers!

1. Technology

Anti-crawler techniques were born at almost the same time as crawlers, and the two have a love-hate relationship. Without crawlers there would be no anti-crawler techniques, and anti-crawler techniques in turn drive the development of crawler techniques.

  1. Interaction challenges: all kinds of perverse CAPTCHAs are everywhere, especially on 12306, where they make me want to swear every minute, and they will only get more perverse...
  2. JavaScript obfuscation: a recently popular anti-crawler technique. To write crawlers you now have to learn JavaScript, and anti-crawler engineers then lay all kinds of traps in their JavaScript, in an endless back-and-forth...
  3. IP limits: limit the number of requests a single IP address may make within a given period.
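The IP limit in the last item can be sketched from the website's side as a sliding-window counter. This is one common way to implement such a limit, not necessarily what any particular site uses; the class name `IpRateLimiter`, its parameters and the example addresses are hypothetical. Timestamps are passed in explicitly so that the behaviour is easy to follow.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most `max_requests` per IP within any `window_seconds` span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the site would block or challenge
        hits.append(now)
        return True


# Three requests per 60 seconds per IP (illustrative numbers).
limiter = IpRateLimiter(max_requests=3, window_seconds=60)
first_three = [limiter.allow("203.0.113.5", t) for t in (0, 1, 2)]
fourth = limiter.allow("203.0.113.5", 3)        # same IP, still inside the window
other_ip = limiter.allow("198.51.100.7", 3)     # a different IP has its own budget
after_window = limiter.allow("203.0.113.5", 60)  # the oldest request has expired
```

A limit like this is why crawlers that hammer a site from one address get blocked quickly, while slower request rates stay under the threshold.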

Charlie has only introduced a few anti-crawler techniques here. There are ready-made solutions for them, of course, but the most important thing for a crawler developer is not using tools or frameworks to deal with anti-crawler measures. It is working out how to beat them through your own thinking and exploration, because anti-crawler techniques change fast and come in many forms.

2. Employment

The employment analysis covers three aspects: job openings, salary, and the job market. I gathered some information online and organized it into charts for your reference.

Data source: Zhiyouji (literally "working friends"): https://dwz.cn/6PeU46QY

3. Prospects

Many people today are pessimistic about the future of crawler work. If you only ever write crawlers, let your skills stay at their current level, and stop learning new things, then there is indeed no future, and one day you will be left behind by the times. But that is true of any other position as well.

Every profession can develop horizontally and vertically, that is, in breadth and in depth. First, if you dig deep enough and your crawlers are powerful, high-performance and scalable, the work is still very promising. Second, crawlers are a source of data, so there are many directions to grow in, such as big data analysis, data visualization and machine learning. The possibilities are limitless. In this era of big data, you hold the entrance to the data, so why worry about finding a direction? Crawlers may be just a starting point and a springboard, a cornerstone of your career, and one day you will "marry a bai fu mei" (Chinese Internet slang for making it big)!

Five. Summary

This article covered what crawlers are, the value of crawlers, the legality of crawlers, and the current state of crawlers. Finally, and in any case, thank you very much for reading my article!

