Hello everyone.
Today I will introduce SmartScraper, a simple, automatic and fast Python web scraping tool. SmartScraper makes it easy to grab page data: there is no need to learn locator packages such as pyquery or beautifulsoup. We only need to give it a URL and some sample data, and it learns the positioning rules of the web page by itself.
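Before installing anything, the idea of learning positioning rules from one sample value can be illustrated with a toy sketch that uses only the standard library. This is not how SmartScraper is implemented; the names `learn_rule` and `apply_rule`, the (tag, class) rule format and the two small HTML pages are all made up for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Record (tag, class, text) for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open elements as (tag, class) pairs
        self.records = []   # (tag, class, stripped text)

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore unmatched end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.records.append((tag, cls, text))


def _collect(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.records


def learn_rule(html, sample_text):
    """Return the (tag, class) of the first element whose text equals the sample."""
    for tag, cls, text in _collect(html):
        if text == sample_text:
            return (tag, cls)
    return None


def apply_rule(html, rule):
    """Return the text of every element matching a learned (tag, class) rule."""
    return [text for tag, cls, text in _collect(html) if (tag, cls) == rule]


# Hypothetical pages, standing in for two listing pages with the same layout.
PAGE_1 = """
<ul>
  <li><h2 class="title">Alive</h2><div class="pub">Yu Hua / 2012-8-1</div></li>
  <li><h2 class="title">Solaris</h2><div class="pub">Stanislaw Lem / 2021-8</div></li>
</ul>
"""
PAGE_2 = """
<ul>
  <li><h2 class="title">Contempt</h2><div class="pub">Alberto Moravia / 2021-7</div></li>
</ul>
"""

rule = learn_rule(PAGE_1, "Alive")       # learn from one sample value
titles_p1 = apply_rule(PAGE_1, rule)     # all similar items on the training page
titles_p2 = apply_rule(PAGE_2, rule)     # reuse the rule on another page
print(rule, titles_p1, titles_p2)
```

The sketch only shows the general shape of the workflow: one sample value is enough to find a rule, and the rule can then be reused on pages with the same layout.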
pip install smartscraper
For example, suppose we want to get the titles and publication information of the 20 books on the "novel" tag page of Douban Books.
We use the page 1 (P1) link to train two fields: title and publication information.
from smartscraper import SmartScraper
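# Link to the web page used for training (小说 is the "novel" tag)
url = 'https://book.douban.com/tag/小说?start=0&type=T'
# Define the desired fields, with one sample value per field.
# The samples are shown here in English translation; when running the code,
# copy each value exactly as it appears on the page.
wanted_dict = {"title": ["Alive"],
               "pub": ["Yu Hua / Writers Publishing House / 2012-8-1 / 20.00 yuan"]
               }
# Training: learn from the page at url the rules that locate the wanted_dict values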
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)
Running the code gives the following results:
{'title': ['Alive',
           "Fang Siqi's First Love Paradise",
           'Journey Under the Midnight Sun',
           'Solaris',
           'Contempt',
           ...],
 'pub': ['Yu Hua / Writers Publishing House / 2012-8-1 / 20.00 yuan',
         'Lin Yihan / Beijing United Publishing Company / 2018-2 / 45.00 yuan',
         '[Japan] Keigo Higashino / Liu Zijun / Nanhai Publishing Company / 2013-1-1 / CNY 39.50',
         '[Poland] Stanisław Lem / Jing Zhenzhong / Yilin Press / 2021-8 / 49.00 yuan',
         '[Italy] Alberto Moravia / Shen Emei, Liu Xirong / Jiangsu Phoenix Literature and Art Publishing House / 2021-7 / 62.00',
         ...]
}
Now use the scraper we just trained to get the titles and publication information from the page 2 (P2) link:
scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')
A trained SmartScraper model can be saved and then called directly later:
scraper.save('douban_Book.pkl')
To import a saved model:
scraper.load('douban_Book.pkl')
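The two links used in this post differ only in the start parameter (0 for P1, 20 for P2), so the links for further pages can probably be generated rather than typed by hand. Below is a minimal sketch under that assumption; `tag_page_urls` and `to_rows` are hypothetical helper names, the tag segment is assumed to be the Chinese word 小说 ("novel"), and the sample dictionary only imitates the shape of the results printed earlier.

```python
from urllib.parse import quote, urlencode


def tag_page_urls(tag, pages, page_size=20):
    """Build listing-page URLs for a Douban Books tag: start=0, 20, 40, ..."""
    base = "https://book.douban.com/tag/" + quote(tag)
    return [
        base + "?" + urlencode({"start": page * page_size, "type": "T"})
        for page in range(pages)
    ]


def to_rows(results):
    """Pair each title with its publication string, in page order."""
    return list(zip(results["title"], results["pub"]))


# Assumption: 20 books per page, as in the P1 and P2 links of this post.
urls = tag_page_urls("小说", pages=3)
for page_url in urls:
    print(page_url)

# Made-up sample with the same keys as the trained fields.
sample = {
    "title": ["Alive", "Solaris"],
    "pub": ["Yu Hua / 2012-8-1", "Stanislaw Lem / 2021-8"],
}
rows = to_rows(sample)
print(rows)
```

Each generated link could then be passed to scraper.get_result_similar in a loop, ideally with a pause between requests so the site is not hit too quickly.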