Crawling Lagou Job Listings with Python

Hello everyone, I'm Ning Yi. Today we're going to talk about Python web scraping: using Python to crawl job-listing data from Lagou. Lagou's anti-crawler measures are quite strong, and ordinary requests with custom headers keep coming back with a "requested too frequently" message.

So instead we'll mainly use the selenium package to crawl the data. It simulates a real user's actions, automatically clicking through pages and reading their content. It's much less efficient than firing requests with custom headers, but it gets past the anti-crawler checks.
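For contrast, here is a minimal sketch of the header-based approach that Lagou tends to reject; the header value and response handling are illustrative assumptions, not a working scraper:

import requests  # plain HTTP requests, no browser involved

url = "https://www.lagou.com/jobs/list_Python/p-city_2"
headers = {"User-Agent": "Mozilla/5.0"}  # a typical spoofed header
resp = requests.get(url, headers=headers)
# Instead of job data, the body usually carries an anti-crawler notice
# saying the page was requested too frequently.
print(resp.status_code)
print(resp.text[:200])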

Here's how I did it; let's walk through the code.

Table of contents

          • 1. Importing the packages
          • 2. Downloading chromedriver
            • (1) Check your Chrome version
            • (2) Download chromedriver
          • 3. Defining and requesting the target URL
          • 4. Scraping each page and clicking through with the Next button
          • 5. Writing the getData method to extract the job title, salary, and other key fields
          • 6. Complete code

1. Importing the packages

First we import four packages, used respectively to simulate a real user in the browser, to parse the page HTML, to handle timing, and to work with files and paths.

from selenium import webdriver  # simulate a real user operating the browser
import pyquery as pq            # parse the page HTML
import time                     # timing / delays
import os                       # files and paths
2. Downloading chromedriver
(1) Check your Chrome version

In Chrome, open Help -> About Google Chrome to view the version information. My version is 81.0.4044.138.

(2) Download chromedriver

Open http://npm.taobao.org/mirrors/chromedriver/, pick the driver that matches your Chrome version, download and unzip it, and put the chromedriver binary in the same directory as the Python file we're editing.

Then wire chromedriver into the code; this driver opens the Chrome window that Selenium will control.

path = os.getcwd()
driver = webdriver.Chrome(executable_path=(path + "/chromedriver"))
# Implicit wait: give elements up to 10 seconds to appear before failing
driver.implicitly_wait(10)
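Note that executable_path (and the find_element_by_* calls used below) belong to Selenium 3. If you have Selenium 4 installed, the driver is constructed through a Service object instead; a minimal sketch, assuming the same chromedriver location:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: the driver path goes into a Service object
service = Service(executable_path=path + "/chromedriver")
driver = webdriver.Chrome(service=service)
driver.implicitly_wait(10)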
3. Defining and requesting the target URL

This URL opens the listing page for new-graduate Python positions.

# Target URL
lagou_http = "https://www.lagou.com/jobs/list_Python/p-city_2?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1#filterBox"
# Empty list to collect the scraped records
data = []
driver.get(lagou_http)
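The query string is easier to read once decoded; the gx parameter, for example, is the percent-encoded form of 全职 ("full-time"). A small sketch using the standard library (the rebuilt URL below is illustrative):

from urllib.parse import unquote, urlencode

print(unquote("%E5%85%A8%E8%81%8C"))  # -> 全职, i.e. "full-time"

# Rebuilding the query string, e.g. to tweak the filters:
params = {"px": "default", "gx": "全职", "isSchoolJob": 1}
print("https://www.lagou.com/jobs/list_Python/p-city_2?" + urlencode(params))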
4. Scraping each page and clicking through with the Next button

We locate the Next-page button and read its CSS class to check that it is still enabled, grab the current page's HTML, parse it with the getData method (written in the next step), append the returned records to the data list, and then click the button automatically.

When the button's class shows it is disabled (i.e. there is no next page), we print the collected data and break out of the loop.

while True:
    # Find the Next-page button and read its CSS class
    next_html = driver.find_element_by_css_selector(".pager_next").get_attribute('class')
    # The class is exactly 'pager_next ' while the button is enabled;
    # it changes once the last page is reached
    if next_html == 'pager_next ':
        # Parse the job cards on the current page
        items = pq.PyQuery(driver.page_source).find(".con_list_item")
        # print(items)
        data += getData(items)
        time.sleep(2)
        # Click the Next-page button
        driver.find_element_by_xpath("//span[@action='next']").click()
    else:
        print('Data crawling finished')
        print(data)
        break
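The fixed time.sleep(2) works, but an explicit wait is the more robust Selenium idiom: it blocks only until the element is actually ready. A minimal sketch of the same click with WebDriverWait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the Next-page button to become clickable,
# then click it, instead of sleeping a fixed 2 seconds
next_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//span[@action='next']"))
)
next_button.click()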

Here is part of the page HTML we get back; fragments like these are the input that getData turns into our final structured data.

<div>
  <div class="p_bot">
    <div class="li_b_l">
      <span class="money">10k-20k</span>
      <!--<i></i>--> Experience: fresh graduate / Bachelor's degree
    </div>
  </div>
</div>
<div class="company">
  <div class="industry">
    Enterprise services, data services / Series C / 150-500 employees
  </div>
</div>
<div class="list_item_bot">
  <div class="li_b_l">
    <span>Server side</span>
    <span>Linux/Unix</span>
    <span>Hadoop</span>
    <span>Scala</span>
  </div>
  <div class="li_b_r">"Rapid growth, artificial intelligence, big data, good benefits"</div>
</div>
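To see how pyquery handles fragments like this before wiring it into the crawler, you can feed a sample straight in; a self-contained sketch (the HTML string is abbreviated from the snippet above):

import pyquery as pq

sample = '''
<div class="company">
  <div class="industry">Enterprise services, data services / Series C / 150-500 employees</div>
</div>
<div class="list_item_bot">
  <div class="li_b_l"><span>Server side</span><span>Linux/Unix</span></div>
</div>
'''

doc = pq.PyQuery(sample)
print(doc.find(".industry").text())       # company description line
print(doc.find(".li_b_l > span").text())  # skill tags, space-separated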
5. Writing the getData method to extract the job title, salary, and other key fields
def getData(items):
    datalist = []
    for item in items.items():
        temp = dict()
        # Each job card exposes its key fields as data-* attributes
        temp['Job title'] = item.attr('data-positionname')
        temp['Salary'] = item.attr('data-salary')
        temp['Company name'] = item.attr('data-company')
        # The remaining fields come from the card's inner HTML
        temp['Company description'] = pq.PyQuery(item).find(".industry").text()
        temp['Work experience'] = pq.PyQuery(item).find(".p_bot>.li_b_l").remove(".money").text()
        datalist.append(temp)
    return datalist
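For a card like the sample above, one element of the returned list would look roughly like this (the title, salary, and company values are illustrative, since the sample snippet does not show the data-* attributes):

# One record as produced by getData (illustrative values)
{
    'Job title': 'Python Developer',
    'Salary': '10k-20k',
    'Company name': 'Some Company',
    'Company description': 'Enterprise services, data services / Series C / 150-500 employees',
    'Work experience': 'Experience: fresh graduate / Bachelor\'s degree',
}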
6. Complete code
# coding=utf-8
from selenium import webdriver  # simulate a real user operating the browser
import pyquery as pq            # parse the page HTML
import time                     # timing / delays
import os                       # files and paths

path = os.getcwd()
driver = webdriver.Chrome(executable_path=(path + "/chromedriver"))
# Implicit wait: give elements up to 10 seconds to appear before failing
driver.implicitly_wait(10)
print("Starting to crawl data")
# Target URL (the Python new-graduate listing used throughout this article)
lagou_http = "https://www.lagou.com/jobs/list_Python/p-city_2?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1#filterBox"
# Empty list to collect the scraped records
data = []
driver.get(lagou_http)
def getData(items):
    datalist = []
    for item in items.items():
        temp = dict()
        # Each job card exposes its key fields as data-* attributes
        temp['Job title'] = item.attr('data-positionname')
        temp['Salary'] = item.attr('data-salary')
        temp['Company name'] = item.attr('data-company')
        temp['Company description'] = pq.PyQuery(item).find(".industry").text()
        temp['Work experience'] = pq.PyQuery(item).find(".p_bot>.li_b_l").remove(".money").text()
        datalist.append(temp)
    return datalist
while True:
    # Find the Next-page button and read its CSS class
    next_html = driver.find_element_by_css_selector(".pager_next").get_attribute('class')
    if next_html == 'pager_next ':
        # Button still enabled: parse the job cards on the current page
        items = pq.PyQuery(driver.page_source).find(".con_list_item")
        # print(items)
        data += getData(items)
        time.sleep(2)
        # Click the Next-page button
        driver.find_element_by_xpath("//span[@action='next']").click()
    else:
        print('Data crawling finished')
        print(data)
        break
# Finally, save the collected data to a.txt
with open(path + "/a.txt", "w") as file:
    file.write(str(data))
print('File written successfully')
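Writing str(data) dumps a Python repr, which is awkward to load back later. If you want machine-readable output instead, JSON is a one-line change; a sketch using the standard library (the jobs.json filename is just an example):

import json

# ensure_ascii=False keeps any Chinese field values readable in the file
with open(path + "/jobs.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)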
