
Python Data Mining - Crawler Foundation

Editor: Python

    • Anti-crawling methods
      • 1. User-Agent
      • 2. Proxy IP
      • 3. CAPTCHA access
      • 4. Dynamically loaded web pages
      • 5. Data encryption
    • urllib library
    • Request object customization
    • Regular Expressions

Anti-crawling methods

1. User-Agent

The User-Agent, or UA for short, is a special request header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on.
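
Because the server reads this header on every request, an unmodified script is easy to recognize. The following minimal sketch shows the User-Agent that Python's own urllib attaches by default, read from the `addheaders` attribute of an opener built with `urllib.request.build_opener()`. The variable names are illustrative only.

```python
import urllib.request

# An opener carries the default headers that urllib attaches to each request.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default value has the form "Python-urllib/<version>", which a server
# can easily tell apart from the User-Agent of a real browser.
default_user_agent = default_headers["User-agent"]
print(default_user_agent)
```

A site that filters on the User-Agent can reject this value outright, which is why crawlers usually replace it with a browser-like string.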

2. Proxy IP

Proxy IPs can be obtained from proxy-list sites such as Xici proxy ("Western proxy") and Kuaidaili ("Fast proxy").
What are highly anonymous, anonymous, and transparent proxies, and how do they differ?
1. With a transparent proxy, the target server knows that you are using a proxy and also knows your real IP.
2. With an anonymous proxy, the target server knows that you are using a proxy but does not know your real IP.
3. With a highly anonymous proxy, the target server does not know that you are using a proxy, let alone your real IP.
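
As a rough illustration of how a crawler routes traffic through a proxy, the sketch below builds an opener with `urllib.request.ProxyHandler`. The proxy address is a hypothetical placeholder from a documentation-only IP range, and `fetch_via_proxy` is a helper name invented for this example. No request is sent until the helper is called.

```python
import urllib.request

# Hypothetical proxy address (documentation-only IP range); replace it with
# a proxy you are actually allowed to use.
proxies = {"http": "http://203.0.113.10:8080"}

# ProxyHandler tells urllib which proxy to use for each URL scheme.
proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)


def fetch_via_proxy(url, timeout=10):
    """Send a GET request through the configured proxy and return the body as bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Whether the target server sees the proxy as transparent, anonymous, or highly anonymous depends on how the proxy server itself forwards the request, not on this client code.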

3. CAPTCHA access

CAPTCHA-solving platforms
Cloud-based CAPTCHA-solving platforms
super

4. Dynamically loaded web pages

The website returns JavaScript that builds the page in the browser, not the actual data shown on the page.
Selenium can drive a real browser to send the requests, so the scripts run and the real data is loaded.
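
A minimal sketch of the Selenium approach follows. It assumes the third-party `selenium` package is installed and that Chrome with a matching driver is available on the machine. `fetch_rendered_html` is a hypothetical helper name, and the import sits inside the function so the file can still be loaded where Selenium is missing.

```python
def fetch_rendered_html(url):
    """Open url in a real Chrome browser and return the HTML after scripts have run."""
    # Imported here so this module loads even when Selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)            # the browser executes the page's JavaScript
        return driver.page_source  # HTML of the page as the browser currently sees it
    finally:
        driver.quit()              # always close the browser
```

Pages that load data some time after the initial page load may need an explicit wait before `page_source` is read.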

5. Data encryption

Analyze the site's JavaScript code to work out how the data is encrypted.

urllib library

urllib.request.urlopen(): simulates a browser sending a request to the server
response: the data returned by the server
The type of response is HTTPResponse
bytes --> string: decode()
string --> bytes: encode()
read(): reads the body as bytes. Note: read(5) returns the first 5 bytes
readline(): reads one line
readlines(): reads line by line until the end
getcode(): gets the status code
geturl(): gets the URL
getheaders(): gets the response headers
urllib.request.urlretrieve(): downloads a resource to a local file, for example:
a web page
an image
a video
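
The calls listed above can be tried without touching a real website. The sketch below is one possible demonstration. It starts a small local HTTP server as a stand-in for a site, then uses `urlopen()`, the response methods, `decode()`, and `urlretrieve()` against it. The server, the page content, and all variable names are assumptions made for this example.

```python
import os
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in page served locally; \u00e9 is a non-ASCII character (e with acute accent).
PAGE = "<html>\n<body>caf\u00e9</body>\n</html>\n".encode("utf-8")  # string --> bytes


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# Ignore any system-wide proxy settings so the request reaches the local server.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

with urllib.request.urlopen(url) as response:
    status = response.getcode()             # status code, 200 here
    final_url = response.geturl()           # URL that was actually fetched
    headers = dict(response.getheaders())   # response headers as a dict
    first_bytes = response.read(5)          # first 5 bytes of the body
    rest_of_line = response.readline()      # remainder of the first line
    remaining = response.readlines()        # all remaining lines, as bytes

text = b"".join(remaining).decode("utf-8")  # bytes --> string

# urlretrieve() saves the response body straight to a local file.
target = os.path.join(tempfile.mkdtemp(), "page.html")
urllib.request.urlretrieve(url, target)

server.shutdown()
server.server_close()
```

In Python 3.9 and later, `getcode()` and `geturl()` are documented as deprecated in favor of the `status` and `url` attributes of the response, although the methods still work.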

Request object customization

Syntax: request = urllib.request.Request()
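
A customized request typically carries a browser-like User-Agent. The sketch below builds such a `Request` object. The URL and the User-Agent string are placeholders chosen for illustration, and `fetch` is a hypothetical helper that only sends the request when called.

```python
import urllib.request

# Placeholder target URL and browser-like User-Agent string.
url = "https://example.com/"
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}

# Use keyword arguments: the second positional parameter of Request is
# data, not headers, so Request(url, headers) would be wrong.
request = urllib.request.Request(url=url, headers=headers)


def fetch(req, timeout=10):
    """Send the customized request and return the decoded body."""
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Passing the `Request` object to `urlopen()` instead of a plain URL string is what makes the custom headers take effect.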

Regular Expressions
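
In a crawler, regular expressions are commonly used to pull fields out of downloaded HTML. The following is a small sketch using Python's built-in `re` module. The HTML snippet and the pattern are made up for illustration and are not taken from any real site.

```python
import re

# Made-up HTML fragment standing in for a downloaded page.
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# Non-greedy groups capture the link target and the link text.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

links = link_pattern.findall(html)
print(links)  # [('/page/1', 'First'), ('/page/2', 'Second')]
```

A pattern like this suits simple, regular markup. For irregular or deeply nested HTML, a dedicated parser is usually more robust.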

