
What Is a Crawler? How Does a Python Crawler Work?


Preface

In short, the Internet is a large network made up of sites and network devices. We visit a site through a browser; the site returns HTML, JS, and CSS code to the browser, which parses and renders that code into the colorful web pages we see.

1. What is a crawler?

If we compare the Internet to a big spider web, with data stored at each node of the web, then a crawler is a little spider that grabs its prey (data) as it travels along the web. In other words, a crawler is a program that sends requests to a website, obtains the resources, then analyzes them and extracts useful data.

Technically speaking, a crawler simulates a browser's requests to a site, fetches what the site returns (HTML code / JSON data / binary data such as images and video) to the local machine, then extracts the data it needs and stores it for use.


2. Basic workflow of a crawler

How users access network data:

Way 1: the browser submits a request ---> downloads the page code ---> parses it into a page

Way 2: simulate a browser sending a request (get the page code) -> extract the useful data -> store it in a database or file

What a crawler does is way 2.
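The "way 2" flow can be sketched end to end. This is a minimal illustration, not a real crawler: fetch_html is a stand-in stub returning a static HTML string so the example runs without network access; a real crawler would use requests.get(url).text instead.

```python
import re

# Minimal sketch of way 2: request -> extract -> store.
def fetch_html(url):
    # stand-in (assumption) for: requests.get(url).text
    return ('<html><a class="items" href="/p-1.html">1</a>'
            '<a class="items" href="/p-2.html">2</a></html>')

def extract_links(html):
    # pull every href out of the page with a regular expression
    return re.findall(r'href="(.*?)"', html)

def store(links, path):
    # persist the extracted data to a file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(links))

links = extract_links(fetch_html('http://www.example.com/'))
store(links, 'links.txt')
print(links)  # ['/p-1.html', '/p-2.html']
```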

1. Initiate a request

Use an HTTP library to make a request to the target site, i.e. send a Request.

A Request contains the request headers, the request body, etc.

Limitation of the requests module: it cannot execute JS or CSS code.

2. Get the response content

If the server responds normally, you get a Response.

A Response may contain HTML, JSON, images, video, etc.

3. Parse the content

Parsing HTML data: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup, pyquery, etc.

Parsing JSON data: the json module.

Parsing binary data: write it to a file in 'wb' mode.
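Each of the three parsing cases fits in a few lines of standard-library Python (the sample HTML and JSON strings below are made up for illustration):

```python
import re
import json

# 1. HTML data: regular expressions (the re module)
html = '<div class="items"><a href="/video/1.html">clip</a></div>'
links = re.findall(r'href="(.*?)"', html)
print(links)         # ['/video/1.html']

# 2. JSON data: the json module
raw = '{"status": 200, "urls": ["/a.mp4", "/b.mp4"]}'
data = json.loads(raw)
print(data['urls'])  # ['/a.mp4', '/b.mp4']

# 3. Binary data: write it to a file in 'wb' mode
payload = b'\x00\x01binary-bytes'
with open('out.bin', 'wb') as f:
    f.write(payload)
```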

4. Save the data

Database (MySQL, MongoDB, Redis)

File
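A dependency-free way to see the saving step is SQLite from the standard library (MySQL, MongoDB, and Redis each need a running server plus a driver; the table name and columns here are invented for the sketch):

```python
import sqlite3

# store scraped (title, url) rows; the schema is an illustrative assumption
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pages (title TEXT, url TEXT)')
rows = [('home', 'http://www.example.com/'),
        ('video', 'http://www.example.com/v/')]
conn.executemany('INSERT INTO pages VALUES (?, ?)', rows)
conn.commit()

saved = conn.execute('SELECT title, url FROM pages').fetchall()
print(saved)
```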

3. The HTTP protocol: Request and Response

Request: the user sends their information through the browser (socket client) to the server (socket server).

Response: the server receives the request, analyzes the request information sent by the user, and returns data (the returned data may contain links to other resources, such as images, JS, CSS, etc.).

Note: after receiving the Response, a browser parses the content and displays it to the user, while a crawler, after simulating the browser's request and receiving the Response, extracts the useful data from it.

4. The Request

1. Request method:

Common request methods: GET / POST

2. Request URL

A URL (uniform resource locator) uniquely identifies a resource on the Internet. For example, an image, a file, or a video can each be pinned down by a URL.

URL encoding

For example: https://www.baidu.com/s?wd=图片

A non-ASCII query word such as 图片 ("picture") is URL-encoded before being sent (see the sample code).
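The encoding can be reproduced with the standard library's urllib.parse; non-ASCII query words are percent-encoded as their UTF-8 bytes:

```python
from urllib.parse import urlencode, quote, unquote

# a Chinese query word like 图片 ("picture") must be percent-encoded
query = urlencode({'wd': '图片'})
print(query)                                  # wd=%E5%9B%BE%E7%89%87
print('https://www.baidu.com/s?' + query)

# quote/unquote do the same for a single value
assert quote('图片') == '%E5%9B%BE%E7%89%87'
assert unquote('%E5%9B%BE%E7%89%87') == '图片'
```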

How a web page loads:

When a page is loaded, the document file is usually fetched first.

While parsing the document, each time a link (to an image, for instance) is encountered, another request is initiated to download that resource.

3. Request headers

User-Agent: if the client sends no User-Agent header, the server may treat you as an illegitimate user.

Cookie: cookies are used to keep login state.

Note: a crawler generally adds request headers.

Parameters to pay attention to in the request headers:

(1) Referer: where the visit came from (some large websites use the Referer header for anti-hotlinking, so crawlers should simulate it too)

(2) User-Agent: the visiting browser (add it, or you will be treated as a crawler)

(3) Cookie: carry it along with the request headers
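These headers can be attached to a request explicitly. The sketch below builds a stdlib urllib Request and inspects it without sending anything over the network; the header values are illustrative assumptions.

```python
from urllib.request import Request

headers = {
    'User-Agent': 'Mozilla/5.0',           # pretend to be a browser
    'Referer': 'http://www.example.com/',  # where the visit came from
    'Cookie': 'session=abc123',            # keep login state
}
req = Request('http://www.example.com/page', headers=headers)

print(req.get_method())              # GET
print(req.get_header('User-agent'))  # Mozilla/5.0  (urllib capitalizes header names)
```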

4. Request body

If the method is GET, the request body is empty (a GET request's parameters are placed at the end of the URL, where they are directly visible).

If the method is POST, the request body contains the form data.

Note:

1. Login forms, file uploads, and the like attach their information to the request body.

2. To observe this, log in with a wrong username and password and submit; you can then capture the POST. With a correct login the page usually redirects, so the POST cannot be captured.
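The difference shows up when preparing both kinds of request with the stdlib (the field values are invented for the sketch): a GET puts the parameters in the URL, while a POST carries them as encoded bytes in the body.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'username': 'alice', 'password': 'secret'}

# GET: parameters go at the end of the URL; the body is empty
get_req = Request('http://www.example.com/login?' + urlencode(params))
print(get_req.get_method(), get_req.data)    # GET None

# POST: the same parameters are form-encoded into the request body
post_req = Request('http://www.example.com/login',
                   data=urlencode(params).encode('utf-8'))
print(post_req.get_method(), post_req.data)  # POST b'username=alice&password=secret'
```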

5. The Response

1. Response status codes

200: success

301: redirect

404: file not found

403: access forbidden

502: server error
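The standard library's http.HTTPStatus enumerates these codes, which is handy when branching on a response:

```python
from http import HTTPStatus

for code in (200, 301, 403, 404, 502):
    status = HTTPStatus(code)
    print(code, status.phrase)

# a typical crawler branch: only parse successful responses
def should_parse(status_code):
    return status_code == HTTPStatus.OK  # 200

print(should_parse(200), should_parse(404))  # True False
```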

2. Response headers

Parameters to pay attention to in the response headers:

(1) Set-Cookie: BDSVRTM=0; path=/ — there may be more than one; it tells the browser to save the cookie.

(2) Location: if the server's response headers contain Location, the browser will visit another page after receiving the response.

3. The preview pane (in the browser's dev tools) shows the response body, i.e. the page source:

JSON data

the page's HTML, images

binary data, etc.

6. Summary

1. The crawler workflow in three steps:

crawl ---> parse ---> store

2. Crawling tools:

Request libraries: requests, selenium (selenium can drive a real browser to parse and render CSS and JS, but at a performance cost: it loads every page in full, useful or not)

Parsing libraries: regular expressions, BeautifulSoup, pyquery

Storage: files, MySQL, MongoDB, Redis

3. Example: crawling videos from xiaohuar.com

Finally, some sample code as a bonus.

Basic version:

import re
import requests

response = requests.get('http://www.xiaohuar.com/v/')
# print(response.status_code)  # status code of the response
# print(response.content)      # raw bytes
# print(response.text)         # decoded text
urls = re.findall(r'class="items".*?href="(.*?)"', response.text, re.S)  # re.S lets . match across newlines
url = urls[5]
result = requests.get(url)
mp4_url = re.findall(r'id="media".*?src="(.*?)"', result.text, re.S)[0]
video = requests.get(mp4_url)
with open(r'D:\a.mp4', 'wb') as f:
    f.write(video.content)


Encapsulated into functions:

import re
import requests
import hashlib
import time

def get_index(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

def parse_index(res):
    urls = re.findall(r'class="items".*?href="(.*?)"', res, re.S)  # re.S lets . match across newlines
    return urls

def get_detail(urls):
    for url in urls:
        if not url.startswith('http'):
            url = 'http://www.xiaohuar.com%s' % url
        result = requests.get(url)
        if result.status_code == 200:
            mp4_url_list = re.findall(r'id="media".*?src="(.*?)"', result.text, re.S)
            if mp4_url_list:
                mp4_url = mp4_url_list[0]
                print(mp4_url)
                # save(mp4_url)

def save(url):
    video = requests.get(url)
    if video.status_code == 200:
        m = hashlib.md5()
        m.update(url.encode('utf-8'))
        m.update(str(time.time()).encode('utf-8'))
        filename = '%s.mp4' % m.hexdigest()
        filepath = r'D:\%s' % filename
        with open(filepath, 'wb') as f:
            f.write(video.content)

def main():
    for i in range(5):
        res1 = get_index('http://www.xiaohuar.com/list-3-%s.html' % i)
        res2 = parse_index(res1)
        get_detail(res2)

if __name__ == '__main__':
    main()


Concurrent version (if there are 30 videos to crawl in total, 30 threads work on them at once; the total time taken is roughly the time of the slowest single task):

import re
import requests
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

p = ThreadPoolExecutor(30)  # create a thread pool that holds up to 30 threads

def get_index(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

def parse_index(future):
    res = future.result()  # the callback receives a Future; .result() yields the actual return value
    urls = re.findall(r'class="items".*?href="(.*?)"', res, re.S)  # re.S lets . match across newlines
    for url in urls:
        p.submit(get_detail, url)  # submit each detail page to the thread pool

def get_detail(url):  # downloads a single video
    if not url.startswith('http'):
        url = 'http://www.xiaohuar.com%s' % url
    result = requests.get(url)
    if result.status_code == 200:
        mp4_url_list = re.findall(r'id="media".*?src="(.*?)"', result.text, re.S)
        if mp4_url_list:
            mp4_url = mp4_url_list[0]
            print(mp4_url)
            # save(mp4_url)

def save(url):
    video = requests.get(url)
    if video.status_code == 200:
        m = hashlib.md5()
        m.update(url.encode('utf-8'))
        m.update(str(time.time()).encode('utf-8'))
        filename = '%s.mp4' % m.hexdigest()
        filepath = r'D:\%s' % filename
        with open(filepath, 'wb') as f:
            f.write(video.content)

def main():
    for i in range(5):
        p.submit(get_index, 'http://www.xiaohuar.com/list-3-%s.html' % i).add_done_callback(parse_index)
        # 1. Submit the index-page task (get_index) to the thread pool asynchronously.
        # 2. When get_index finishes, add_done_callback() fires, passing the finished
        #    Future to parse_index (call future.result() to get the real return value).
        # 3. parse_index loops over the extracted links and submits each
        #    get_detail() task back to the thread pool.

if __name__ == '__main__':
    main()


Related knowledge: multithreading and multiprocessing

CPU-bound tasks: use multiple processes. Because Python has the GIL, only multiple processes can take advantage of multiple CPU cores.

IO-bound tasks: use multiple threads; switching between tasks during IO waits shortens overall execution time (concurrency).

Thread pool
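The effect of a thread pool on IO-bound work can be seen with simulated IO (time.sleep stands in for a network wait; the numbers are illustrative): ten 0.1-second "downloads" finish in roughly 0.1 seconds instead of 1 second.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(i):
    time.sleep(0.1)  # stands in for waiting on the network
    return i * i

start = time.time()
with ThreadPoolExecutor(10) as pool:  # 10 threads, one per task
    results = list(pool.map(fake_download, range(10)))
elapsed = time.time() - start

print(results)                       # [0, 1, 4, ..., 81]
print('elapsed: %.2fs' % elapsed)    # roughly 0.1s, not 1s
```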

