
Python 3.6: introductory notes on learning web crawlers

Editor: Python

Web crawlers

  • Prerequisite knowledge:
    • URLs
    • The HTTP protocol
    • Web front end: HTML, CSS, JS
    • Ajax
    • re (regular expressions), XPath
    • XML

Definition of a web crawler

  • See Baidu for a detailed introduction
  • Three steps:
    • Download the page
    • Extract the relevant information
    • Follow links to other pages according to some rule, and repeat the two steps above on each of them
  • Crawler classification
    • General-purpose crawlers
    • Focused (special-purpose) crawlers
  • Introduction to Python networking packages
    • 2.x ----
    • 3.x ---- urllib, urllib3, httplib2, requests
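The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `FAKE_WEB` is a hypothetical in-memory stand-in for real pages, so `download` does a dictionary lookup instead of an HTTP request, and the "information" extracted is just the links on each page.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> HTML. A real crawler would
# download these pages over HTTP instead.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')


def download(url):
    """Step 1: fetch the page (here: look it up in FAKE_WEB)."""
    return FAKE_WEB.get(url, "")


def extract_links(html):
    """Step 2: pull the information we care about out of the page."""
    return LINK_RE.findall(html)


def crawl(start_url, max_pages=10):
    """Step 3: follow the extracted links and repeat steps 1 and 2."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(download(url)):
            if link not in seen:      # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("http://example.com/")
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.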

urllib

  • Modules it contains
    • urllib.request: opens and reads URLs
    • urllib.error: contains the exceptions raised by urllib.request; catch them with try/except
    • urllib.parse: functions for parsing URLs
    • urllib.robotparser: parses robots.txt files
    • Case study V1
  • Handling page encoding
    • chardet can automatically detect a page's encoding, but it may guess wrong
    • It needs to be installed: conda install chardet
    • Case study V2
  • Methods of the object returned by urlopen
    • geturl: returns the URL of the resource that was actually retrieved
    • info: returns the response's meta information (the headers)
    • getcode: returns the HTTP status code of the response
  • Sending request data
    • Two ways to access the network
      • GET: passes information to the server as URL parameters; put the parameters in a dict and encode them with urllib.parse
      • POST: the usual way of passing parameters to the server
        • POST sends the information in the request body rather than in the URL (this is not encryption; the data is only protected when HTTPS is used)
        • To send information with POST, use the data parameter
        • Using POST means the HTTP request headers may need to change:
          • Content-Type: application/x-www-form-urlencoded
          • Content-Length: the length of the data
          • In other words, once the request method changes, make sure the other request headers still fit
        • urllib.parse.urlencode converts the dict above into a string in the format the protocol expects
        • Case study V4
      • Case study V4
      • When more of the request needs to be configured, urlopen alone is not convenient enough
      • Use the request.Request() class instead
      • Case study V6
    • urllib.error:
      • Causes of URLError:
        • No network connection
        • The connection to the server failed
        • The specified server cannot be found
        • URLError is a subclass of OSError
        • Case study V7
    • HTTPError: a subclass of URLError
    • Case study V8
    • UserAgent
      • UserAgent: the user agent, UA for short, is part of the headers; the server uses the UA to identify who is visiting
    • The UA can be set in two ways:
    • headers
    • add_header
    • Case study V9
  • ProxyHandler (proxy servers)
    • Using proxy IPs is a common crawler technique
    • Where to get proxy server addresses:
      • www.xicidaili.com
      • www.goubanjian.com
    • A proxy hides the real source of the requests. Proxies also do not allow frequent access to one fixed website, so many proxies are needed
    • Basic steps for using a proxy:
      • Set the proxy address
      • Create a ProxyHandler
      • Create an opener
      • Install the opener
      • Case study V10
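The urllib pieces described in this section fit together roughly as follows. This is a sketch, not the original case studies: the example.com URL, the form field `kw`, the User-Agent string and the proxy address are all placeholders, and nothing is sent over the network unless `fetch` is called.

```python
from urllib import parse, request, error

# GET: encode a dict of parameters and append it to the URL.
params = {"wd": "python crawler"}
get_url = "http://www.baidu.com/s?" + parse.urlencode(params)

# POST: the encoded form must be bytes and is passed via the data argument.
form = {"kw": "girl"}
post_data = parse.urlencode(form).encode("utf-8")
headers = {
    "User-Agent": "Mozilla/5.0 (notes-example)",  # placeholder UA
    "Content-Type": "application/x-www-form-urlencoded",
}
req = request.Request("http://example.com/api", data=post_data, headers=headers)
# A header can also be added after construction.
req.add_header("Accept", "application/json")

# Proxy: set address -> ProxyHandler -> opener (-> optionally install it).
proxy_handler = request.ProxyHandler({"http": "127.0.0.1:8888"})  # placeholder
opener = request.build_opener(proxy_handler)
# request.install_opener(opener)  # would make urlopen() use the proxy globally


def fetch(req_or_url, opener=None, timeout=5):
    """Open a URL or Request; return (status, final_url, body) or None."""
    open_fn = opener.open if opener else request.urlopen
    try:
        with open_fn(req_or_url, timeout=timeout) as rsp:
            return rsp.getcode(), rsp.geturl(), rsp.read()
    except error.HTTPError as e:   # must come first: subclass of URLError
        print("HTTP error:", e.code, e.reason)
    except error.URLError as e:    # no network, unknown host, refused...
        print("URL error:", e.reason)
    return None
```

Because a Request with a non-empty `data` argument defaults to POST, no explicit method needs to be given. Calling `fetch(req)` or `fetch(get_url, opener=opener)` would perform the actual request.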

cookie & session

  • Because HTTP is stateless, cookies and sessions were adopted as a supplementary mechanism to make up for it
  • A cookie is the half of the information that is sent to the user; a session is the other half, stored on the server. Both are used to record state
  • Differences between cookies and sessions:
    • They are stored in different places
    • Cookies are not secure
    • A session is kept on the server for a while and then expires
    • A single cookie holds no more than 4 KB, and many browsers limit a site to at most 20 cookies
  • Where sessions are stored
    • On the server
    • Typically, sessions are stored in a database
  • Logging in with cookies
  • Simulated login to Renren
  • Case study V11
  • Using a cookie to log in
    • Copy the cookie directly from the browser and put it into the request headers by hand
    • Case study V12
    • The http package includes a cookie module (http.cookiejar) that handles cookies automatically
      • CookieJar
        • Manages and stores cookies, and adds cookies to outgoing HTTP requests
        • Cookies are stored in memory; once the CookieJar instance is garbage-collected, the cookies are gone
      • FileCookieJar
        • Manages cookies using a file
        • filename is the file the cookies are saved to
      • MozillaCookieJar
        • Creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt
      • LWPCookieJar
        • Creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 format
      • Their relationship: CookieJar --> FileCookieJar --> MozillaCookieJar & LWPCookieJar
      • Using CookieJar to access Renren
      • Case study 13
        • Log in using cookies automatically
        • Open the login page and log in automatically with the account and password
        • Automatically extract the cookies sent back
        • Use the extracted cookies to access the private page
        • handler is an instance of Handler
          • Commonly used
          • Create a cookie instance
        • cookie = cookiejar.CookieJar()
        • Create the cookie handler
        • cookie_handler = request.HTTPCookieProcessor(cookie)
      • Create the HTTP request handler
        • http_handler = request.HTTPHandler()
      • Create the HTTPS request handler
        • https_handler = request.HTTPSHandler()
      • Create the opener (request manager)
        • opener = request.build_opener(http_handler, https_handler, cookie_handler)
      • After the handlers are created, open URLs with the opener; it then uses the corresponding handlers
      • Print the cookie as a variable
      • Case study V14
      • Cookie attributes
        • name: the name
        • value: the value
        • domain: the domains that can access this cookie
        • path: the page paths that can access this cookie
        • expires: expiry information
        • size: the size
        • http field
      • Saving cookies: FileCookieJar
        • Case study 15
      • Reading cookies
        • Case study 16
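A minimal sketch of the cookie workflow described above, using only the standard library. The cookie itself (`sid=abc123` for example.com), the helper `make_cookie` and the temporary file name are made up for illustration so the example runs without a real login; in practice the jar would be filled by `opener.open(...)` on a login request.

```python
import os
import tempfile
from http import cookiejar
from urllib import request

# In-memory jar: cookies vanish when the jar is garbage-collected.
jar = cookiejar.CookieJar()
cookie_handler = request.HTTPCookieProcessor(jar)
opener = request.build_opener(cookie_handler)
# opener.open(login_request) would store any Set-Cookie headers in `jar`,
# and later opener.open(...) calls would send them back automatically.


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# File-backed jar: save to and load from a Mozilla-style cookies.txt.
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
file_jar = cookiejar.MozillaCookieJar(path)
file_jar.set_cookie(make_cookie("sid", "abc123", "example.com"))
# ignore_discard keeps session cookies that would otherwise be skipped.
file_jar.save(ignore_discard=True, ignore_expires=True)

loaded = cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)

for c in loaded:  # each item exposes name, value, domain, path, expires
    print(c.name, c.value, c.domain, c.path, c.expires)
```

Passing `loaded` to `request.HTTPCookieProcessor` would let a new opener reuse the saved cookies in a later run.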

SSL

  • An SSL certificate is a digital server certificate that complies with the SSL (Secure Sockets Layer) protocol
  • A CA (Certificate Authority) is a digital certification center
  • How to handle an SSL certificate that is not trusted
  • Case study V17
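One way to handle an untrusted certificate with urllib is to pass an SSL context that skips verification. This is a sketch of that approach and not necessarily what case V17 does; `fetch_untrusted` is a hypothetical helper name. Skipping verification removes protection against man-in-the-middle attacks, so it only makes sense for sites you deliberately choose to trust.

```python
import ssl
from urllib import request

# Context that skips certificate verification.
ctx = ssl.create_default_context()
ctx.check_hostname = False       # must be disabled before changing verify_mode
ctx.verify_mode = ssl.CERT_NONE


def fetch_untrusted(url, timeout=5):
    """Open an HTTPS URL without validating its certificate."""
    with request.urlopen(url, context=ctx, timeout=timeout) as rsp:
        return rsp.read()
```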

JS encryption

http://tool.oschina.net

  • Some anti-crawler strategies use JS to encrypt the transmitted data, usually as an MD5 value
  • What is sent is ciphertext, but the encryption function or process must run in the browser, which means the JS code is exposed to the user
  • By reading the encryption algorithm, you can simulate the encryption process and so get past it
  • Case study V18
  • Compare V18 with V19
  • Remember: save the JS locally first, then look for the encryption algorithm in it
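Once the algorithm has been read out of the JS, reproducing it in Python is usually a few lines of hashlib. The sketch below assumes a made-up scheme, md5(secret + keyword + salt) with a salt built from a millisecond timestamp plus one random digit; the secret string, the field names `q`, `salt`, `sign`, and the formula itself are hypothetical and must be replaced by whatever the target site's script actually does.

```python
import hashlib
import random
import time


def make_sign(keyword, salt, secret="hypothetical-secret"):
    """Reproduce a JS-style signature: md5(secret + keyword + salt)."""
    raw = secret + keyword + salt
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Assumed salt scheme: millisecond timestamp followed by one random digit.
salt = str(int(time.time() * 1000)) + str(random.randint(0, 9))
form = {"q": "hello", "salt": salt, "sign": make_sign("hello", salt)}
print(form)
```

The resulting `form` dict would then be URL-encoded and sent as the POST body, exactly as the browser does.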

AJAX

  • In essence it is a piece of JS code that lets the web page make asynchronous requests
  • It always has a URL and a request method
  • It generally uses JSON format
  • Case study 20
  • GET requests generally send their data as URL parameters
  • POST sends its data as a form, which is also convenient for encryption
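Crawling an AJAX endpoint therefore comes down to rebuilding the URL the page's JS requests and decoding the JSON it returns. In this sketch the endpoint, the query parameters and the response body are all invented; `body` stands in for what the server would send back.

```python
import json
from urllib import parse

# Hypothetical endpoint and parameters, as they might appear in the
# browser's network panel for an XHR request.
base = "http://example.com/j/list"
query = {"type": "movie", "start": 0, "limit": 2}
ajax_url = base + "?" + parse.urlencode(query)

# Stand-in for the response body the server would return.
body = '{"items": [{"title": "A", "score": 9.1}, {"title": "B", "score": 8.7}]}'

data = json.loads(body)            # JSON text -> Python dict/list
titles = [item["title"] for item in data["items"]]
print(ajax_url)
print(titles)
```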

The Requests module: a module made for humans

  • Provides everything urllib does
  • Built on top of urllib3
  • Open source
  • Has a Chinese-language site
  • Install: pip install requests
  • GET requests:
    • requests.get(url)
    • requests.request("get", url)
    • Can take headers and params arguments
    • Case study 21
  • What a GET returns
    • Case study 22
  • POST
    • rsp = requests.post(url, data=data)
    • Case study 23
    • data and headers must be dicts
  • Proxies
    • proxy = {
      "http": "address",
      "https": "address"
      }
      rsp = requests.request("get", "http::…", proxies=proxy)
  • User authentication
    • Proxy authentication
      • The proxy may use HTTP Basic Auth, which can be handled like this
      • The format is username:password@proxy address:port
      • proxy = {"http": "china:[email protected]:8888"}
      • res = requests.get("http://www.baidu.com", proxies=proxy)
  • Web client authentication
    • If authentication is required, add auth=(username, password)
    • auth = ("username", "password")
    • res = requests.get("http://www.baidu.com", auth=auth)
  • cookie
    • requests handles cookie information automatically
      • rsp = requests.get(url)
      • If the server sends back cookies, they are available through the response's cookies attribute, which returns a cookie jar instance
      • cookieJar = rsp.cookies
      • It can be converted to a dict
      • cookiedict = requests.utils.dict_from_cookiejar(cookieJar)
  • session
    • Not the same thing as a session on the server
    • It simulates one conversation: from the client browser connecting to the server until the client disconnects
    • It keeps certain parameters across requests, for example cookies are kept between requests made from the same session instance
      • Once a session is created, it can store cookie values
      • ss = requests.session()
      • headers = {"User-Agent": "XXXXXXx"}
      • data = {"name": "XXXXXXx"}
      • From this point the created session manages the requests and is responsible for sending them
      • ss.post("http://www.baidu.com", data=data, headers=headers)
      • rsp = ss.get("XXXXXX")
  • HTTPS: verifying SSL certificates
    • The verify parameter indicates whether the SSL certificate should be verified; the default is True
    • If certificate verification is not needed, set it to False
      • rsp = requests.get("https:", verify=False)
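The calls listed in this section can be combined as in the sketch below, which assumes the third-party `requests` package is installed. The User-Agent string and the helper name `fetch` are placeholders; the request is only prepared, not sent, so the example shows how `params` end up in the URL without touching the network.

```python
import requests

# Build (but do not send) a GET request to see how params are encoded.
req = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "python"},
    headers={"User-Agent": "notes-example/0.1"},  # placeholder UA
)
prepared = req.prepare()
print(prepared.url)

# A Session keeps headers and cookies across the requests it sends.
ss = requests.Session()
ss.headers.update({"User-Agent": "notes-example/0.1"})
cookie_dict = requests.utils.dict_from_cookiejar(ss.cookies)  # empty so far


def fetch(url, proxies=None, auth=None, verify=True, timeout=5):
    """Send a GET through the session; returns the Response object."""
    rsp = ss.get(url, proxies=proxies, auth=auth, verify=verify, timeout=timeout)
    rsp.raise_for_status()   # turn 4xx/5xx responses into exceptions
    return rsp
```

Calling `fetch(url, proxies=proxy, auth=("username", "password"), verify=False)` would exercise the proxy, authentication and certificate options from the notes in one request.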

Processing of crawler data

  • Structured data: the structure comes first, then the data
    • JSON files
      • JSONPath
      • Convert to the corresponding Python types and work with those (json module)
    • XML
      • Convert to Python types (xmltodict)
      • XPath
      • CSS selectors
      • Regular expressions
  • Unstructured data: the data comes first, then the structure
    • Text
    • Phone numbers
    • Email addresses
    • This kind of data is usually processed with regular expressions
    • HTML files
      • Regular expressions
      • XPath
      • CSS selectors

Regular expressions

  • A set of rules for searching, replacing and similar operations on string text
  • Case study 24: basic rules of regular expressions
  • Case study: basic use of match
  • Common methods:
    • match: matches from the start of the string; matches only once
    • search: searches from any position; returns one match
    • findall: finds all matches and returns a list
    • finditer: finds all matches and returns an iterator
    • split: splits the string and returns a list
    • sub: replaces matches
  • Matching Chinese
    • The Unicode range for Chinese characters is mainly [\u4e00-\u9fa5]
    • Case study V27
  • Greedy vs. non-greedy
    • Greedy mode: as long as the whole expression still matches, match as much as possible
    • Non-greedy mode: as long as the whole expression still matches, match as little as possible
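The methods and modes listed above can be compared side by side on small inputs. The sample strings are made up for illustration.

```python
import re

text = "one12two345three"

m = re.match(r"[a-z]+", text)            # anchored at the start
s = re.search(r"\d+", text)              # first match anywhere
all_nums = re.findall(r"\d+", text)      # list of all matches
spans = [x.span() for x in re.finditer(r"\d+", text)]  # iterate match objects
parts = re.split(r"\d+", text)           # split on the pattern
masked = re.sub(r"\d+", "#", text)       # replace every match

# Chinese characters: the commonly used range \u4e00-\u9fa5.
chinese = re.findall(r"[\u4e00-\u9fa5]+", "hello 你好 world 世界")

# Greedy vs. non-greedy on the same input.
html = "<b>x</b><b>y</b>"
greedy = re.findall(r"<b>.*</b>", html)   # .* grabs as much as possible
lazy = re.findall(r"<b>.*?</b>", html)    # .*? stops at the first </b>

print(m.group(), s.group(), all_nums, spans, parts, masked)
print(chinese, greedy, lazy)
```

The greedy pattern returns the whole string as a single match, while the non-greedy one returns each `<b>...</b>` pair separately.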

XML

  • XML (eXtensible Markup Language)
  • http://www.w3cschool
  • Case study V28
  • Concepts: parent node, child node, ancestor node, sibling node, descendant node

Xpath

  • XPath (XML Path Language)
  • w3school
  • Common path expressions
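A few common path expressions, tried on a made-up document. To stay within the standard library this sketch uses xml.etree.ElementTree, which supports only a subset of XPath; lxml's etree accepts fuller XPath through its `xpath()` method, and the document here is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
xml_text = """
<bookstore>
  <book category="web"><title lang="en">Learning XML</title><price>39.95</price></book>
  <book category="kids"><title lang="en">Harry Potter</title><price>29.99</price></book>
</bookstore>
"""

root = ET.fromstring(xml_text)

titles = [t.text for t in root.findall("./book/title")]   # child path
all_prices = [p.text for p in root.findall(".//price")]   # any descendant
web_books = root.findall("./book[@category='web']")       # attribute predicate
first_title = root.find("./book[1]/title").text           # positional predicate

print(titles, all_prices, len(web_books), first_title)
```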

lxml library

  • Case study 29
  • Parsing HTML
  • Reading HTML from a file
  • Using etree together with XPath
  • Case study V31

CSS selectors: BeautifulSoup4

Comparison of several tools

- Regular expressions: fast, hard to use, no installation needed
- BeautifulSoup: slow, easy to use
- lxml: fairly fast
  • A case using BeautifulSoup
  • Case study V32

BeautifulSoup

  • Four objects
    • Tag
    • NavigableString
    • BeautifulSoup
    • Comment
  • Tag
    • Corresponds to an HTML tag
    • Accessed via soup.tag_name
    • Two important attributes of a tag
      • name
      • attrs (the tag's attributes)
    • Case study V33
  • NavigableString
    • Corresponds to the text content inside a tag
  • BeautifulSoup
    • Represents the whole content of a document
  • Comment
    • A special kind of NavigableString object
    • When it is output, the content does not include the comment markers
  • Traversing objects
    • contents: the list of a tag's child nodes
    • children: the child nodes as an iterator
    • descendants: all descendant nodes
    • string
    • Case study 34
  • Searching the document
    • find_all(name, attrs, recursive, text, **kwargs)
    • name: the tag name to search by; it can be one of
      • a string
      • a regular expression
      • a list
    • keyword arguments: match on tag attributes
    • text: matches the text value of a tag
  • CSS selectors
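The objects and search methods above, applied to a small made-up page. This sketch assumes the third-party `beautifulsoup4` package is installed and uses the standard library's `html.parser` backend so that lxml is not required; the HTML is purely illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
<p class="intro" id="first">Hello <b>world</b></p>
<a href="http://example.com/1" class="link">one</a>
<a href="http://example.com/2" class="link">two</a>
<!-- a comment -->
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")   # stdlib parser backend

tag = soup.p                          # first <p> Tag
tag_name = tag.name                   # the tag's name
tag_attrs = tag.attrs                 # dict of the tag's attributes
title_text = soup.title.string        # NavigableString inside <title>

children = list(soup.p.children)      # direct children: a string and a <b> tag
links = [a["href"] for a in soup.find_all("a")]               # by tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # regex on name
by_list = len(soup.find_all(["a", "b"]))                      # list of names
by_attr = soup.find_all(id="first")                           # keyword = attribute
css = [a.get_text() for a in soup.select("a.link")]           # CSS selector

print(tag_name, tag_attrs, title_text, links, by_regex, css)
```

`select` takes a CSS selector string, so `a.link` means "every a tag with class link", which is often shorter than the equivalent `find_all` call.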
