
Python dynamic web crawler: crawling Jingdong Mall (JD.com)


1. Static web pages and dynamic web pages

A static web page exists on the server as a ready-made html or htm document and is sent unchanged by the web server to the client.

A dynamic web page relies on client-side and server-side scripts, which together render the final document that is displayed.

Client-side scripts:

Mainly JavaScript, which allows the client to respond to server-side events.

Server-side scripts:

Many scripting languages run on the server side, including PHP, ASP, ASP.NET, JSP, ColdFusion and Perl, all of which allow the server to respond to web page submission events.

2. Dynamic web crawling tools: Selenium and PhantomJS

2.1 A brief introduction to Selenium

Selenium is a web automation testing tool. It drives real browsers through their drivers, and it can also work with headless browsers (ones with no graphical user interface) such as PhantomJS.

Install Selenium:

pip install selenium

Selenium also needs a browser driver to run. Download the driver for your browser; I use the Chrome driver:

Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Note that the chromedriver version must match the version of the Chrome browser installed on this computer.

Then add the driver's location to the system Path variable.
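
If editing Path is inconvenient, the driver location can also be passed to Selenium directly. The snippet below is a minimal sketch using the Selenium 3 style API that the rest of this article uses; the driver path is only a placeholder, so adjust it to wherever you saved chromedriver:

from selenium import webdriver

# The path below is a placeholder; point it at your own chromedriver
driver = webdriver.Chrome(executable_path=r"C:\tools\chromedriver.exe")
driver.get("http://quotes.toscrape.com/js/")
print(driver.title)   # should print "Quotes to Scrape"
driver.quit()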

2.2 PhantomJS

PhantomJS is a headless browser that can be scripted with JavaScript.

Download PhantomJS: https://phantomjs.org/download.html

After downloading, you only need to copy the .exe file from the bin directory into the Windows/System32 directory.
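
Once the executable is in place, a minimal sketch looks like the following. Note that the PhantomJS driver is only available in older Selenium releases (roughly the 3.x line); newer versions removed it:

from selenium import webdriver

# Requires an older Selenium (3.x); later releases removed PhantomJS support
driver = webdriver.PhantomJS()
driver.get("http://quotes.toscrape.com/js/")
print(driver.title)
driver.quit()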

3. Preparation before coding

3.1 Web page analysis

Web address: http://quotes.toscrape.com/js/

This is a clean-looking page; my goal is to grab the quotes on the first few pages.

Next, look at its source code:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" >Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<script src="/static/jquery.js"></script>
<script>
var data = [
{
"tags": [
"change",
"deep-thoughts",
"thinking",
"world"
],
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
},
{
"tags": [
"abilities",
"choices"
],
"author": {
"name": "J.K. Rowling",
"goodreads_link": "/author/show/1077326.J_K_Rowling",
"slug": "J-K-Rowling"
},
"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
},
{
"tags": [
"inspirational",
"life",
"live",
"miracle",
"miracles"
],
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
},
{
"tags": [
"aliteracy",
"books",
"classic",
"humor"
],
"author": {
"name": "Jane Austen",
"goodreads_link": "/author/show/1265.Jane_Austen",
"slug": "Jane-Austen"
},
"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
"tags": [
"be-yourself",
"inspirational"
],
"author": {
"name": "Marilyn Monroe",
"goodreads_link": "/author/show/82952.Marilyn_Monroe",
"slug": "Marilyn-Monroe"
},
"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
},
{
"tags": [
"adulthood",
"success",
"value"
],
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"
},
{
"tags": [
"life",
"love"
],
"author": {
"name": "Andr\u00e9 Gide",
"goodreads_link": "/author/show/7617.Andr_Gide",
"slug": "Andre-Gide"
},
"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
},
{
"tags": [
"edison",
"failure",
"inspirational",
"paraphrased"
],
"author": {
"name": "Thomas A. Edison",
"goodreads_link": "/author/show/3091287.Thomas_A_Edison",
"slug": "Thomas-A-Edison"
},
"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
},
{
"tags": [
"misattributed-eleanor-roosevelt"
],
"author": {
"name": "Eleanor Roosevelt",
"goodreads_link": "/author/show/44566.Eleanor_Roosevelt",
"slug": "Eleanor-Roosevelt"
},
"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
},
{
"tags": [
"humor",
"obvious",
"simile"
],
"author": {
"name": "Steve Martin",
"goodreads_link": "/author/show/7103.Steve_Martin",
"slug": "Steve-Martin"
},
"text": "\u201cA day without sunshine is like, you know, night.\u201d"
}
];
for (var i in data) {
var d = data[i];
var tags = $.map(d['tags'], function(t) {
return "<a class='tag'>" + t + "</a>";
}).join(" ");
document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}
</script>
<nav>
<ul class="pager">
<li class="next">
<a href="/js/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
</ul>
</nav>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
</p>
</div>
</footer>
</body>
</html>

The quotes on this page are rendered by front-end JavaScript: the quote data exists only as a script embedded in the HTML file, not as ready-made elements.

In the HTML, a JavaScript script builds the quote markup:

for (var i in data) {
var d = data[i];
var tags = $.map(d['tags'], function(t) {
return "<a class='tag'>" + t + "</a>";
}).join(" ");
document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}

The code for the "next page" link is:

<nav>
<ul class="pager">
<li class="next">
<a href="/js/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
</ul>
</nav>
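
To see why a plain HTTP fetch is not enough here, a quick check (a sketch, assuming the requests library is installed) confirms that the raw HTML contains no rendered quote elements; the quotes only exist inside the embedded script until a browser executes it:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML without executing any JavaScript
html = requests.get('http://quotes.toscrape.com/js/').text
static_soup = BeautifulSoup(html, 'html.parser')
# No <div class="quote"> elements exist before the script runs
print(len(static_soup.find_all('div', {'class': 'quote'})))   # expected: 0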

3.2 Program code

# Import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')             # needed when running without a display (e.g. in Docker)
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Start Chrome (or PhantomJS); pass options=chrome_options to apply the flags above
driver = webdriver.Chrome()
#driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.PhantomJS()

Get the page source:

driver.get('http://quotes.toscrape.com/js/')
content=driver.page_source

Page-turning (pagination) code:

host='http://quotes.toscrape.com'
biaoyus=[]
next='http://quotes.toscrape.com/js/'
for i in range(4):
    # Use the driver to load the page
    driver.get(next)
    content=driver.page_source
    # Parse the page with BeautifulSoup and collect the quote elements
    eles=soup(content,'html.parser')
    biaoyus.append(eles.find_all("div",{"class":"quote"}))
    print(len(biaoyus))
    # Build the URL of the next page
    next=host+eles.find('li',{'class':'next'}).find('a')['href']
    print(next)
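
One caveat: if the loop runs past the last page, eles.find('li',{'class':'next'}) returns None and the line that builds the next URL raises an AttributeError. A defensive variant of that step (a sketch, meant to replace the last two lines inside the loop) is:

# On the last page the pager has no "next" element, so stop instead of crashing
next_li=eles.find('li',{'class':'next'})
if next_li is None:
    break
next=host+next_li.find('a')['href']
print(next)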

Complete code:

# Import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup as soup

# Start Chrome (or PhantomJS)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()

# Host and start URL
host='http://quotes.toscrape.com'
biaoyus=[]
next='http://quotes.toscrape.com/js/'
for i in range(4):
    # Use the driver to load the page
    driver.get(next)
    content=driver.page_source
    # Parse the page with BeautifulSoup and collect the quote elements
    eles=soup(content,'html.parser')
    biaoyus.append(eles.find_all("div",{"class":"quote"}))
    print(len(biaoyus))
    # Build the URL of the next page
    next=host+eles.find('li',{'class':'next'}).find('a')['href']
    print(next)
    #input()

for biaoyu in biaoyus:
    for quote in biaoyu:
        print(quote.find(class_='text').getText())
        print(quote.find(class_='author').getText())
        print(quote.find(class_='tags').getText())
        print('\n')

4. Crawling JD.com (Jingdong Mall)

I'm going to crawl the first 200 books that JD.com returns for the keyword search "python".

Web address: https://search.jd.com/Search?keyword=python&enc=utf-8&wq=python&pvid=3e6f853b03a64d86b17638dc2de70fdf

Looking at the page and its source code, each book in the result list is rendered as an li element.

This page uses lazy loading (a sliding-fill approach) to display the books: only a few books are shown at first, and the rest are loaded only after the user scrolls the browser. The relevant markup is:

<span class="clr"></span>
<div id="J_scroll_loading" class="notice-loading-more"><span> Loading , Please later ~~</span></div>
<div class="page clearfix"><div id="J_bottomPage" class="p-wrap"></div></div>

4.1 Use Selenium to locate the "next page" element and simulate a click

To crawl information on 200 books, one page is not enough, so we use the click simulation that Selenium provides to jump across multiple pages.

# Locate the "next page" button by its class name
next=driver.find_element_by_class_name('pn-next')
# Simulate a click
next.click()
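
Fixed sleeps work, but as an alternative sketch you can wait explicitly until the button is clickable, using the WebDriverWait helpers that ship with Selenium:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "next page" button to become clickable, then click it
next_btn=WebDriverWait(driver,10).until(
    EC.element_to_be_clickable((By.CLASS_NAME,'pn-next'))
)
next_btn.click()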

4.2 Complete code

# Import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
import json

# Start Chrome (or PhantomJS)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()

# Start URL of the search results
next='https://search.jd.com/Search?keyword=python'
# Use the driver to load the page
driver.get(next)
booksstore=[]
# Open the file used to save the data
fi=open("books.txt","a",encoding='utf-8')
for j in range(4):
    # Scroll the page so the lazily loaded books appear
    for i in range(2):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        # Wait for the page to load
        time.sleep(4)
    content=driver.page_source
    # Parse the page with BeautifulSoup and collect the book elements
    eles=soup(content,'html.parser')
    books=eles.find_all('li',{'class':'gl-item'})
    print(len(books))
    for book in books:
        name=book.find('div',{'class':'p-name'}).find('a').find('em').getText()
        price=book.find('div',{'class':'p-price'}).find('i').getText()
        commit='https:'+book.find('div',{'class':'p-commit'}).find('a')['href']
        shop=book.find('div',{'class':'p-shopnum'}).find_all('a')
        print(name)
        print(price)
        print(commit)
        book={'Book name':name,'Book price':price,'Purchase address':commit}
        if(len(shop)!=0):
            shopaddress=shop[0]['href']
            shopname=shop[0]['title']
            print("http:"+shopaddress)
            print(shopname)
            book['Store address']="http:"+shopaddress
            book['Shop name']=shopname
        booksstore.append(book)
        #booksstore.append('\n')
        fi.write(json.dumps(book,ensure_ascii=False))
        fi.write("\n")
    # Go to the next page
    next=driver.find_element_by_class_name('pn-next')
    print(next.text)
    next.click()
    time.sleep(4)

print(len(booksstore))
print(booksstore)
fi.close()

Crawling results: each book is printed to the console and appended to books.txt as one JSON object per line.
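
To check the saved data afterwards, the file can be read back line by line (a small sketch):

import json

# books.txt contains one JSON object per line
with open("books.txt",encoding='utf-8') as f:
    records=[json.loads(line) for line in f if line.strip()]
print(len(records))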


