Python Crawler: Scrapy (I)



    • I. Scrapy Overview
    • II. 58.com Project Case
    • III. Autohome Case
    • IV. scrapy shell

I. Scrapy Overview

Scrapy is an application framework for crawling websites and extracting structured data. It can be used for data mining, information processing, storing historical data, and a range of other applications.

(1) Installation

pip install scrapy -i https://pypi.douban.com/simple

You may see this warning:

WARNING: You are using pip version 21.3.1; however, version 22.1.2 is available.
You should consider upgrading via the 'D:\PythonCode\venv\Scripts\python.exe -m pip install --upgrade pip' command.

Solution: run python -m pip install --upgrade pip
(2) Basic Usage

  1. Create a crawler project: scrapy startproject scrapy_baidu_01
    Note:
    (1) You must run this from the folder where scrapy.exe lives (the Scripts folder of the environment where Scrapy was installed);
    (2) The project name cannot start with a digit and cannot contain Chinese characters;
    (3) Be sure to add the scrapy.exe directory (here, D:\PythonCode\venv\Scripts) to the PATH environment variable, restarting the computer if necessary.
  2. Create the crawler file inside the spiders folder, e.g. D:\PythonCode\venv\Scripts\scrapy_baidu_01\scrapy_baidu_01\spiders>
    Command: scrapy genspider <crawler file name> <page to crawl>, e.g. scrapy genspider baidu www.baidu.com
  3. Run the crawler: scrapy crawl <crawler name>, where the name is the one defined in the spider file (name = 'baidu')
import scrapy

class BaiduSpider(scrapy.Spider):
    # Name of the spider, used when running the crawler
    name = 'baidu'
    # Domains the spider is allowed to visit
    allowed_domains = ['www.baidu.com']
    # Initial URLs, the first addresses to be requested
    # start_urls is allowed_domains with http:// prepended and / appended
    start_urls = ['http://www.baidu.com/']

    # Method executed after each URL in start_urls has been fetched;
    # the response parameter is the returned object, equivalent to
    # response = urllib.request.urlopen()
    # response = requests.get()
    def parse(self, response):
        print('ssssss')
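
A common gotcha when testing this spider: if parse() never prints anything, Scrapy may be dropping the request to comply with baidu's robots.txt. A frequently used tweak during local testing is to disable the check in the project's settings.py:

    # settings.py: disable robots.txt compliance (for local testing only)
    ROBOTSTXT_OBEY = False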

II. 58.com Project Case

1. Scrapy project structure
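
For reference, the layout generated by scrapy startproject scrapy_baidu_01 looks roughly like this:

    scrapy_baidu_01/
        scrapy.cfg                # deployment configuration
        scrapy_baidu_01/
            __init__.py
            items.py              # definitions of the data structures to scrape
            middlewares.py        # spider and downloader middlewares
            pipelines.py          # item pipelines (cleaning, storage)
            settings.py           # project settings
            spiders/              # crawler files are created here
                __init__.py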

2. Properties and methods of response
    response.text        get the response body as a string
    response.body        get the response body as binary data
    response.xpath()     parse the response content directly with an XPath expression; returns a list of Selector objects
    .extract()           extract the data values from the Selector objects returned by xpath()
    .extract_first()     extract the first item in the Selector list
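
A minimal sketch of these calls inside a spider's parse method (the variable names are illustrative):

    def parse(self, response):
        html_str = response.text                            # whole page as a str
        raw_bytes = response.body                           # whole page as bytes
        titles = response.xpath('//title/text()')           # SelectorList of matches
        all_titles = titles.extract()                       # list of strings
        first_title = titles.extract_first()                # first string, or None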

III. Autohome Case

import scrapy

class CarSpider(scrapy.Spider):
    name = 'car'
    # allowed_domains should hold only the domain, not a full URL path
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/price/brand-15.html']

    def parse(self, response):
        # Model names and prices on the brand listing page
        name_list = response.xpath('//div[@class="main-title"]/a/text()')
        price_list = response.xpath('//div[@class="main-lever"]//span/span/text()')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            price = price_list[i].extract()
            print(name, price)
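
To run it, execute the following from the project directory (the name 'car' comes from the spider above):

    scrapy crawl car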

How Scrapy works [super important!!]

  1. The engine asks the spiders for URLs;
  2. The engine passes the requested URLs to the scheduler;
  3. The scheduler wraps each URL in a request object and places it in its queue;
  4. A request is dequeued from the queue;
  5. The engine passes the request to the downloader;
  6. The downloader sends the request and fetches the data from the internet;
  7. The downloader returns the data to the engine;
  8. The engine hands the data to the spiders;
  9. The spiders parse the data with XPath, obtaining data items or new URLs;
  10. The spiders return the parsed result to the engine;
  11. If the parsed result is data, it is handed to the pipeline; if it is a URL, it goes back to the scheduler for the next cycle (a pipeline sketch follows this list).
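
A minimal sketch of the pipeline half of step 11, assuming a hypothetical CarPipeline registered in settings.py under ITEM_PIPELINES; it simply appends each item the engine delivers to a file:

    # pipelines.py: hypothetical pipeline; enable it in settings.py with
    # ITEM_PIPELINES = {'scrapy_baidu_01.pipelines.CarPipeline': 300}
    class CarPipeline:
        def open_spider(self, spider):
            # Called once when the spider starts
            self.fp = open('car.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # Called for every item the engine hands to the pipeline
            self.fp.write(str(item) + '\n')
            return item

        def close_spider(self, spider):
            # Called once when the spider closes
            self.fp.close()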

IV. scrapy shell


Run it directly from the command line: scrapy shell www.baidu.com
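
Inside the shell, the same response object described in section II is available interactively; a sketch of a session (outputs omitted):

    $ scrapy shell www.baidu.com
    >>> response.url                                        # URL actually fetched
    >>> response.text[:100]                                 # first 100 chars of the page
    >>> response.xpath('//title/text()').extract_first()    # page title as a string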

