Programmer's World (程序師世界) is a platform where programming enthusiasts help each other, share, and learn.

[Python in practice] Batch-collecting sticker packs: be the star of your group chat~

Editor: Python

Goal:

Use Python to crawl a large number of sticker (meme) images.

Highlights:

  1. Systematic analysis of the target page
  2. Methods for parsing data out of HTML tags
  3. Saving a large batch of images with one click

Environment:

  • Python 3.8
  • PyCharm

Modules used:

  • requests >>> pip install requests
  • parsel >>> pip install parsel


Installing the modules:

To install a third-party Python module:

  1. Press Win + R, type cmd, click OK, then run the install command pip install <module name> (e.g. pip install requests) and press Enter
  2. In PyCharm, click Terminal and run the same install command there

Why installation may fail:

Failure 1:

pip is not recognized as an internal or external command

Solution:

Add Python's Scripts directory to the PATH environment variable

Failure 2:

Lots of red error output (read time out)

Solution:

The network connection timed out, so switch to a mirror source:
Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/
For example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>
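Passing -i on every command gets tedious; recent versions of pip can also store the mirror in their config file, so every later install uses it automatically. A sketch using the Tsinghua mirror from the list above:

```shell
# One-off install through a mirror (same form as the example above)
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple requests

# Or store the mirror permanently, so every later `pip install` uses it
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```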

Failure 3:

cmd says the module is already installed (or the install succeeds), but it still cannot be imported in PyCharm

Solution:

You may have multiple Python versions installed (e.g. both Anaconda and a standalone Python; keep just one), so uninstall the extra one.
Or the Python interpreter in PyCharm is not configured.

How to configure the Python interpreter in PyCharm?

  1. Choose File >>> Settings >>> Project >>> Python Interpreter
  2. Click the gear icon and choose Add
  3. Add the path to your Python installation

How to install plugins in PyCharm?

  1. Choose File >>> Settings >>> Plugins
  2. Click Marketplace and enter the name of the plugin you want, e.g. type "translation" for a translation plugin
  3. Select the matching plugin and click Install
  4. After the install succeeds, a prompt to restart PyCharm pops up; click OK, and the plugin takes effect after the restart

Basic idea and workflow of the crawler:

1. Data source analysis

  1. Determine what to get (determine the requirement)
    The sticker images on the sticker website
  2. Use the browser developer tools to capture traffic and find where the data (image URL and image title) comes from
    This shows the site is a static web page: the content we want is in its page source

2. Code implementation steps:

  1. Send a request to the sticker-pack list page
  2. Get the data content returned by the server (the response body)
  3. Parse the data and extract what we want (the image URL and the image title)
  4. Save the data to a local folder
  5. Crawl multiple pages by finding the pattern in how the request URL changes

Code

Import the modules

import os        # built-in module, used to create the save folder, no install needed
import re        # regular expressions, built-in module, no install needed
import requests  # HTTP request module, third-party: pip install requests (unused imports show grayed out in PyCharm)
import parsel    # data-parsing module, third-party: pip install parsel

# Make sure the output folder exists before saving into it
os.makedirs('img', exist_ok=True)

# 1. Send a request
# Note: confirm the request URL, the request method, and the header parameters
# (some sites also need a cookie or an anti-hotlinking Referer header)
for page in range(12, 21):
    print(f'Crawling page {page}')
    url = f'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
    # headers disguise the Python code: the crawler pretends to be a browser sending the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # 2. Get the data returned by the server; response.text is the response body as text
    # 3. Parse the data: extract from what the server actually returned, not from the Elements panel
    # print(response.text)  # response.text is an HTML string; extracting from the raw string would need re
    selector = parsel.Selector(response.text)  # convert the string into a Selector object
    # CSS selectors extract data by tag and attribute
    divs = selector.css('div.ui.segment.imghover div')  # get all matching div tags
    for index in divs:
        # a::attr(title) reads the title attribute of the a tag; get() returns the first match
        title = index.css('a::attr(title)').get()
        title = re.sub(r'[\\/:*?"<>|\n]', '', title)  # strip characters illegal in file names
        img_url = index.css('img::attr(data-original)').get()
        # split cuts the string at every '.'; [-1] is the last piece, i.e. the file extension
        img_name = img_url.split('.')[-1]
        # response.content is the binary body; images/video/audio are saved as binary data
        img_content = requests.get(url=img_url, headers=headers).content
        with open('img\\' + title + '.' + img_name, mode='wb') as f:
            f.write(img_content)
        print(title, 'saved successfully')
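Two small string tricks in the loop above decide the file name: re.sub strips characters Windows forbids in file names, and split('.')[-1] takes the extension from the URL. They can be checked in isolation (the title and URL here are invented):

```python
import re

title = 'why me?<crying cat>\n'                     # invented sticker title
img_url = 'https://example.com/bmiddle/abc123.gif'  # invented image URL

# Remove characters that are illegal in Windows file names, plus newlines
safe_title = re.sub(r'[\\/:*?"<>|\n]', '', title)

# split('.') cuts at every dot; [-1] is the part after the last dot,
# which for these URLs is the file extension
ext = img_url.split('.')[-1]

filename = safe_title + '.' + ext
print(filename)
```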

Multithreaded code

import os
import re
import time
import requests
import parsel
import concurrent.futures

def change_title(title):
    # replace characters that are illegal in file names with '_'
    mode = re.compile(r'[\\/:*?"<>|\n]')
    new_title = re.sub(mode, '_', title)
    return new_title

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    }
    response = requests.get(url=html_url, headers=headers)
    return response

def save(name, title, img_url):
    img_content = get_response(img_url).content
    with open('img\\' + title + '.' + name, mode='wb') as f:
        f.write(img_content)
    print('Saving:', title)

def main(html_url):
    html_data = get_response(html_url).text
    selector = parsel.Selector(html_data)
    divs = selector.css('#container div.tagbqppdiv')
    for div in divs:
        title = div.css('img::attr(title)').get()
        img_url = div.css('img::attr(data-original)').get()
        name = img_url.split('.')[-1]
        new_title = change_title(title)
        if len(new_title) > 255:
            # over-long names are not valid file names; truncate them
            new_title = new_title[:10]
        save(name, new_title, img_url)

if __name__ == '__main__':
    os.makedirs('img', exist_ok=True)
    start_time = time.time()
    exe = concurrent.futures.ThreadPoolExecutor(max_workers=7)
    for page in range(1, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        exe.submit(main, url)
    exe.shutdown()
    use_time = int(time.time()) - int(start_time)
    print(f'Total time: {use_time} seconds')
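The threading pattern above (submit one task per page, then shutdown() to wait for them all) can be reduced to a minimal sketch, with a dummy task standing in for main:

```python
import concurrent.futures

def fetch_page(page):
    # stand-in for main(html_url): pretend to crawl one page
    return f'page {page} done'

# the with-block calls shutdown() on exit, waiting for every submitted task
with concurrent.futures.ThreadPoolExecutor(max_workers=7) as exe:
    futures = [exe.submit(fetch_page, page) for page in range(1, 6)]

results = [f.result() for f in futures]
print(results)
```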
