
A simple and fast Python scraping tool: SmartScraper

Editor: Python

Hello everyone.

Today I will introduce SmartScraper, a simple, automatic, and fast Python scraping tool. SmartScraper makes it easy to grab page data without first learning locator packages such as pyquery or beautifulsoup. We only need to provide a URL and a few sample values, and it learns the rules for locating that data in the page.

1. Installation

pip install smartscraper

2. Quick start

2.1 Get similar results

For example, suppose we want to get the titles and publication information of the 20 books listed on each page of the Douban Books "novel" tag:

  • P1  https://book.douban.com/tag/小说?start=0&type=T
  • P2  https://book.douban.com/tag/小说?start=20&type=T
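The two links differ only in the start offset, which grows by 20 per page. Below is a minimal sketch of generating such URLs, assuming the same pattern holds for later pages (the article only shows P1 and P2). The helper name tag_page_url and its parameters are hypothetical, chosen for illustration.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://book.douban.com/tag/"


def tag_page_url(tag, page, page_size=20):
    """Return the URL of a 1-based page of a tag listing.

    Assumption: every page holds `page_size` books and the `start`
    query parameter is the zero-based offset of the first book.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    query = urlencode({"start": (page - 1) * page_size, "type": "T"})
    # quote() percent-encodes non-ASCII tag names such as 小说
    return f"{BASE_URL}{quote(tag)}?{query}"


if __name__ == "__main__":
    for page in (1, 2):
        print(f"P{page}", tag_page_url("小说", page))
```

The tag is percent-encoded here, which is the form a browser sends on the wire; it refers to the same page as the links written with the raw tag name.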

We use the P1 link to train two fields: title and publication information.

from smartscraper import SmartScraper

# URL of the page to train on
url = 'https://book.douban.com/tag/小说?start=0&type=T'

# Desired fields: one sample value per field, copied exactly as it appears on the page
wanted_dict = {
    "title": ["活着"],  # "To Live"
    "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"],  # Yu Hua / Writers Publishing House / 2012-8-1 / 20.00 yuan
}

# Train: learn from the page at url the rules that locate the wanted_dict values
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

Running the code collects the following results (titles and publication details are shown here in English translation):

{'title': ['To Live',
           "Fang Siqi's First Love Paradise",
           'Journey Under the Midnight Sun',
           'Solaris',
           'Contempt',
           ...],
 'pub': ['Yu Hua / Writers Publishing House / 2012-8-1 / 20.00 yuan',
         'Lin Yihan / Beijing United Publishing Company / 2018-2 / 45.00 yuan',
         '[Japan] Keigo Higashino / Liu Zijun / Nanhai Publishing Company / 2013-1-1 / CNY 39.50',
         '[Poland] Stanisław Lem / Jing Zhenzhong / Yilin Press / 2021-8 / 49.00 yuan',
         '[Italy] Alberto Moravia / Shen Emei, Liu Xirong / Jiangsu Phoenix Literature and Art Publishing House / 2021-7 / 62.00',
         ...]
}
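To give a feel for how a locating rule can be learned from a single sample value, here is a toy sketch using only the standard library. It is not SmartScraper's implementation, which the article does not describe. It assumes a rule is simply the chain of tags and classes leading to the sample text, and all names in it (TextPathCollector, learn_rules, apply_rules, the sample pages) are hypothetical.

```python
from html.parser import HTMLParser


class TextPathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, class) pairs
        self.nodes = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def parse(html):
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_rules(html, wanted_dict):
    """Map each field to the path of the first node holding a sample value."""
    nodes = parse(html)
    rules = {}
    for field, samples in wanted_dict.items():
        for path, text in nodes:
            if text in samples:
                rules[field] = path
                break
    return rules


def apply_rules(html, rules):
    """Collect, per field, the text of every node sharing the learned path."""
    nodes = parse(html)
    return {field: [text for path, text in nodes if path == rule]
            for field, rule in rules.items()}


# Made-up pages standing in for P1 and P2
PAGE_1 = """
<ul>
  <li class="item"><h2><a>Book A</a></h2><div class="pub">Author A / 2012</div></li>
  <li class="item"><h2><a>Book B</a></h2><div class="pub">Author B / 2018</div></li>
</ul>
"""
PAGE_2 = """
<ul>
  <li class="item"><h2><a>Book C</a></h2><div class="pub">Author C / 2021</div></li>
</ul>
"""

rules = learn_rules(PAGE_1, {"title": ["Book A"], "pub": ["Author A / 2012"]})
print(apply_rules(PAGE_1, rules))  # finds Book B as well as the sample
print(apply_rules(PAGE_2, rules))  # the same rules applied to an unseen page
```

One sample per field is enough here because every book on the made-up pages shares the same markup. Real pages are messier, which is the work a dedicated tool takes off your hands.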

Now use the scraper we just trained to get the titles and publication information from the P2 link.

results_p2 = scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')
print(results_p2)
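When several pages are scraped, their results usually need to be combined. The sketch below merges per-page results, assuming each one has the same dict-of-lists shape as the build output shown earlier (the article does not show what get_result_similar returns). The helper merge_results is hypothetical, and the sample data is made up.

```python
def merge_results(pages):
    """Merge per-page {field: [values]} dicts into one, keeping page order."""
    merged = {}
    for page in pages:
        for field, values in page.items():
            merged.setdefault(field, []).extend(values)
    return merged


# Made-up stand-ins for the P1 and P2 results. With a trained scraper the
# list could instead be built as:
#   pages = [scraper.get_result_similar(u) for u in urls]
page_1 = {"title": ["Book A", "Book B"], "pub": ["Author A / 2012", "Author B / 2018"]}
page_2 = {"title": ["Book C"], "pub": ["Author C / 2021"]}

combined = merge_results([page_1, page_2])
print(combined["title"])
print(combined["pub"])
```

The input dicts are left unmodified, so each page's results can still be inspected separately afterwards.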

2.2 Save the model

A trained SmartScraper model can be saved and then loaded directly later on.

scraper.save('douban_Book.pkl')

Code to load the model:

scraper = SmartScraper()  # in a new session, create a scraper before loading
scraper.load('douban_Book.pkl')
