Programmer's World (程式師世界) is a platform where programming enthusiasts help one another, share, and learn. It's better with you here!

Creating a Scrapy crawler project in Python

  1. Install Scrapy: `pip install scrapy -i https://mirrors.aliyun.com/pypi/simple/` (the trailing `-i https://mirrors.aliyun.com/pypi/simple/` points pip at a domestic mirror, which speeds up the download).
  2. Open Cmd or PyCharm's Terminal.
  3. Change to the path where the crawler project should be created and run `scrapy startproject <project name>` to create the project.
  4. Enter the project directory and run `scrapy genspider <spider name> "<host address>"` to create the spider file.
  5. Configure settings.py (in PyCharm):
(1) Set `ROBOTSTXT_OBEY = False`. For an explanation of the robots protocol, see https://blog.csdn.net/wz947324/article/details/80633668 (some sites disallow crawlers; if the robots agreement is obeyed, those sites cannot be crawled).
(2) Enable `DOWNLOAD_DELAY = 3`: wait 3 seconds between requests to the server, to simulate a human user's access pattern.
(3) Enable `DEFAULT_REQUEST_HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en',}`. Default request headers go here; delete the placeholder content and set `User-Agent` and `Cookie` as needed.
(4) Enable `DOWNLOADER_MIDDLEWARES = {'zhaobiao.middlewares.ZhaobiaoDownloaderMiddleware': 543,}` (zhaobiao is the project name). Downloader middleware is where a proxy IP can be configured.
(5) Enable `ITEM_PIPELINES = {'zhaobiao.pipelines.ZhaobiaoPipeline': 300,}`. The pipeline entry points at the pipelines.py file.
(6) Running the Scrapy project. Method 1: create a start file containing `from scrapy import cmdline` and `cmdline.execute('scrapy crawl bilian'.split())` (bilian is the spider file name). Method 2: in the Terminal, run `scrapy crawl bilian`.
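Collected in one place, the five settings above can be sketched as a settings.py excerpt. This is a sketch using the article's example project name `zhaobiao`; the middleware and pipeline class names are the ones Scrapy generates from that project name, and the header values mirror Scrapy's defaults.

```python
# settings.py (excerpt) -- the five changes from step 5
BOT_NAME = "zhaobiao"  # project name from the article's example

# (1) Ignore robots.txt so pages disallowed by it can still be fetched
ROBOTSTXT_OBEY = False

# (2) Wait 3 seconds between requests to mimic a human visitor
DOWNLOAD_DELAY = 3

# (3) Default request headers; add User-Agent / Cookie here as needed
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# (4) Downloader middleware, e.g. for configuring proxy IPs
DOWNLOADER_MIDDLEWARES = {
    "zhaobiao.middlewares.ZhaobiaoDownloaderMiddleware": 543,
}

# (5) Item pipeline, pointing at the project's pipelines.py
ITEM_PIPELINES = {
    "zhaobiao.pipelines.ZhaobiaoPipeline": 300,
}
```

The numeric values (543, 300) are priorities: lower numbers run closer to the engine for pipelines, and middleware order is relative to Scrapy's built-ins.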
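For step 6, method 1, the start file hands `cmdline.execute` an argv list rather than a single string; a quick look at what `.split()` produces (pure Python, no Scrapy needed) also shows why method 2's terminal command is simply `scrapy crawl bilian`:

```python
# The start file runs: cmdline.execute("scrapy crawl bilian".split())
# str.split() breaks the command string on whitespace into the argv list
# that Scrapy's command-line parser expects.
argv = "scrapy crawl bilian".split()
print(argv)  # ['scrapy', 'crawl', 'bilian']
```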
Copyright © 程式師世界 All Rights Reserved