
How to Extract Specific Data from a Web Page with Python

Editor: Python

1、 Overview

This article uses Python throughout: the requests library fetches the web page content, BeautifulSoup parses the HTML text to locate the data we want, and pandas persists the extracted data to a txt text file.

3、 Preview

Before starting the example, let's look at the main API calls:

# url: address of the page to request

# headers: dictionary of request headers

# return: a Response object holding the page content

requests.get(url=url, headers=headers)

# 'dl': name of the tag to search for

# attrs: dictionary of attribute names and values the tag must have

# return: all nodes that match

bs1.find_all('dl',attrs={'class':'roleIntrodcution-descritpion'})

Note: find_all is a method of a BeautifulSoup object, so the text returned by requests must first be parsed with BeautifulSoup.
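To make the note above concrete, here is a minimal sketch of parsing first and then calling find_all. The HTML fragment is a hypothetical stand-in for response.text, so the example runs without a network request; only the class names come from this article.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for response.text (not the real page).
html = """
<dl class="roleIntrodcution-descritpion">
  <div class="role-name">Role A</div>
  <div class="role-actor">Actor One</div>
</dl>
<dl class="other">ignored</dl>
"""

# Parse the raw text first; a plain string has no find_all method.
soup = BeautifulSoup(html, "html.parser")

# Only <dl> tags whose class matches are returned.
blocks = soup.find_all("dl", attrs={"class": "roleIntrodcution-descritpion"})

# Search inside each matching block for the role name.
names = [
    div.get_text(strip=True)
    for block in blocks
    for div in block.find_all("div", attrs={"class": "role-name"})
]
print(names)
```

find_all returns a list-like result, so it can be iterated directly or checked with len() before use.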

4、 Analysis

We need to analyze the page's HTML. Press F12 to open the browser's developer tools and inspect the role name, actor name, introduction and picture to find each node's tag name and its class attribute or id value; these tell us where the data lives. Inspecting the elements shows that everything except the image URL sits under a parent node whose tag is dl and whose class is roleIntrodcution-descritpion. The nodes for the role name, actor name and introduction can be found the same way.

# Role name

find_all('div',attrs={'class':'role-name'})

# Actor name

find_all('div',attrs={'class':'role-actor'})

# Description

find_all('dd',attrs={'class':'role-description'})

# Picture

find_all('a',attrs={'class':'roleIntrodcution-picture'})
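One way to use this structure is to look up the fields inside each parent dl node, so that each role's values stay together even when a field is missing. The sketch below is an illustration, not the article's method: the function extract_roles, the helper text_of and the sample HTML are hypothetical, and placing the dd description inside the dl is an assumption based on the analysis above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the second role deliberately has no actor node.
SAMPLE = """
<dl class="roleIntrodcution-descritpion">
  <div class="role-name">Role A</div>
  <div class="role-actor">Actor One</div>
  <dd class="role-description">First description.</dd>
</dl>
<dl class="roleIntrodcution-descritpion">
  <div class="role-name">Role B</div>
  <dd class="role-description">Second description.</dd>
</dl>
"""


def text_of(node):
    """Return the stripped text of a node, or '' when the node is missing."""
    return node.get_text(strip=True) if node is not None else ""


def extract_roles(html):
    """Build one dictionary per parent <dl> block."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("dl", attrs={"class": "roleIntrodcution-descritpion"}):
        records.append({
            "name": text_of(block.find("div", attrs={"class": "role-name"})),
            "actor": text_of(block.find("div", attrs={"class": "role-actor"})),
            "description": text_of(block.find("dd", attrs={"class": "role-description"})),
        })
    return records


records = extract_roles(SAMPLE)
print(records)
```

Because find returns None when nothing matches, the helper turns a missing field into an empty string instead of raising an AttributeError.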

5、 Implementation

# Import the libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
url = 'https://baike.baidu.com/item/%E5%8A%9F%E5%8B%8B/24265662?fr=aladdin#8'
response = requests.get(url=url, headers=headers)
# Get the response data through the text attribute of the response object
page_text = response.text
bs1 = BeautifulSoup(page_text, 'html.parser')
# Role names
name = []
# Actor names
real_name = []
# Role introductions
instruct = []
for i in bs1.find_all('dl', attrs={'class': 'roleIntrodcution-descritpion'}):
    for j in i.find_all('div', attrs={'class': 'role-name'}):
        name.append(j.text.replace('\n', ''))
    for j in i.find_all('div', attrs={'class': 'role-actor'}):
        # Strip the "actor" label that precedes the name. The live page is in
        # Chinese, so this label string presumably needs to match its wording.
        real_name.append(j.text.replace('\n', '').replace(' actor ', ''))
for i in bs1.find_all('dd', attrs={'class': 'role-description'}):
    instruct.append(i.text.replace('\n', ''))
# Picture URLs
pic = []
for i in bs1.find_all('a', attrs={'class': 'roleIntrodcution-picture'}):
    pic.append(i.find('img').get('src'))
data = {'Role name': name, 'Actor': real_name, 'Role profile': instruct, 'Role still': pic}
df = pd.DataFrame(data, columns=['Role name', 'Actor', 'Role profile', 'Role still'])
# Write the table to a space-separated text file without the row index
df.to_csv('data.txt', sep=' ', index=False)
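pd.DataFrame raises a ValueError when the lists in the dictionary have different lengths, which can happen here if, for example, one role has no picture. The following sketch shows one possible guard; pad_columns is a hypothetical helper, not part of pandas or of the original code.

```python
def pad_columns(columns, fill=""):
    """Return a copy of `columns` with every list padded to the longest length."""
    longest = max((len(values) for values in columns.values()), default=0)
    return {
        key: list(values) + [fill] * (longest - len(values))
        for key, values in columns.items()
    }


# Illustrative data: the "Actor" list is one entry short.
data = {"Role name": ["Role A", "Role B"], "Actor": ["Actor One"]}
padded = pad_columns(data)
print(padded)
```

Padding only prevents the length error. It cannot tell which role was missing a value, so rows may still be misaligned; collecting the fields per parent node avoids that problem at the source.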

6、 Result

# 'data.txt': name of the output file

# sep: column separator

# index: whether to write the row index

df.to_csv('data.txt', sep=' ',index=False)
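Since role names and descriptions can themselves contain spaces, the choice of sep affects how the file looks. The sketch below, using made-up data and a hypothetical helper to_text, writes the same table to an in-memory buffer with a space and with a tab as separator. With a space, pandas quotes any value that contains the separator.

```python
import io

import pandas as pd

# Made-up rows; every value contains a space.
df = pd.DataFrame({
    "role": ["Role A", "Role B"],
    "actor": ["Actor One", "Actor Two"],
})


def to_text(frame, sep):
    """Serialize the frame with to_csv into a string instead of a file."""
    buf = io.StringIO()
    frame.to_csv(buf, sep=sep, index=False)
    return buf.getvalue()


space_lines = to_text(df, " ").splitlines()
tab_lines = to_text(df, "\t").splitlines()

print(space_lines[1])  # values are wrapped in double quotes
print(tab_lines[1])    # values are written as-is
```

A tab separator keeps the text file free of quoting as long as the values contain no tabs, which may be easier to read back later with pd.read_csv(..., sep='\t').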

