Hello everyone, I'm Chen Chen.
A few days ago, a certain "Ya" was fined 1.341 billion yuan for tax evasion. Once the news broke, it caused an uproar online and netizens exploded, many of them lamenting that their own income doesn't even amount to a fraction of that fine.
So I crawled the data under this Weibo topic and did a simple public-opinion analysis!
Since it is easier to crawl Weibo from the mobile site, this time we crawl the mobile version (m.weibo.cn).
This is where we normally type keywords to search Weibo content.
Observing this page in the browser's developer tools, I found that every time a keyword search request is made, an XHR response is returned.
Now that we have located the page where the data actually lives, we can proceed with the usual crawling steps: request that page and extract the data from it.
By inspecting the request headers, it is not hard to construct the request code.
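The request code below references a headers dictionary that is not shown here. Here is a minimal sketch of what it might contain; the User-Agent and Referer values are just examples, and in practice you would copy the headers recorded for the XHR request in the developer tools:

# Minimal example headers; copy the real values from the XHR request in developer tools
headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1',
    'Referer': 'https://m.weibo.cn/',
}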
The construction code is as follows:
import requests
import re
import csv

key = input("Please enter the keyword to crawl: ")
for page in range(1, 10):
    params = (
        ('containerid', f'100103type=1&q={key}'),
        ('page_type', 'searchall'),
        ('page', str(page)),
    )
    # headers: see the sketch above, copied from the browser's request
    response = requests.get('https://m.weibo.cn/api/container/getIndex',
                            headers=headers, params=params)

From the observation above, this response could also be converted into a dictionary for parsing, but after testing I found regular expressions to be the simplest and most convenient way to pull out the fields, so regex extraction is used here. Interested readers can try extracting the data as a dictionary instead.
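For reference, here is a minimal sketch of the dictionary (JSON) approach just mentioned, using the response object from the request above. The field names mirror what the regex patterns below target, but the exact nesting of the m.weibo.cn payload may differ, so the access is kept defensive:

data = response.json()          # parse the XHR response as a dict
for card in data.get('data', {}).get('cards', []):
    mblog = card.get('mblog')   # post cards should carry an 'mblog' dict
    if not mblog:
        continue
    print(mblog.get('text'), mblog.get('comments_count'), mblog.get('attitudes_count'))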
The regex extraction code is as follows:

# note: to keep every page, run the extraction and storage below inside the page loop
r = response.text
title = re.findall('"page_title":"(.*?)"', r)
comments_count = re.findall('"comments_count":(.*?),', r)
attitudes_count = re.findall('"attitudes_count":(.*?),', r)
# reposts_count and created_at are pulled out the same way; the storage step below needs them
reposts_count = re.findall('"reposts_count":(.*?),', r)
created_at = re.findall('"created_at":"(.*?)"', r)
for i in range(len(title)):
    # eval here just decodes the \u escape sequences left in the raw title text
    print(eval(f"'{title[i]}'"), comments_count[i], attitudes_count[i])

The data has now been parsed, so we can store it directly. Here I save it to a CSV file; the code is as follows:
for i in range(len(title)):
    try:
        with open(f'{key}.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            # created_at looks like "Wed Dec 22 10:30:00 +0800 2021":
            # split()[-1] = year, [1] = month, [2] = day, [0] = weekday, [3] = time
            writer.writerow([eval(f"'{title[i]}'"), comments_count[i], attitudes_count[i],
                             reposts_count[i],
                             created_at[i].split()[-1], created_at[i].split()[1],
                             created_at[i].split()[2], created_at[i].split()[0],
                             created_at[i].split()[3]])
    except:
        pass
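One note on the CSV: the pandas steps below access columns by name (month, day, hour, and so on), so the file needs a header row, which the loop above never writes. A minimal sketch of writing one once before the crawl loop; the column names are assumptions chosen to match the columns used later:

# Write the header once, before appending rows in the loop above
with open(f'{key}.csv', 'w', newline='') as f:
    csv.writer(f).writerow(
        ['title', 'comments_count', 'attitudes_count', 'reposts_count',
         'year', 'month', 'day', 'weekday', 'hour'])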
After the data has been collected, it needs to be cleaned so that it meets the analysis requirements before we do any visual analysis. Use pandas to read the crawled data and preview it.
import pandas as pd

df = pd.read_csv('weiya.csv', encoding='gbk')
print(df.head(10))

We can see that the month column uses English abbreviations, which we need to convert to numbers. The code is as follows:
c = []
for i in list(df['month']):
    if i == 'Nov':
        c.append(11)
    elif i == 'Dec':
        c.append(12)
    elif i == 'Apr':
        c.append(4)
    elif i == 'Jan':
        c.append(1)
    elif i == 'Oct':
        c.append(10)
    else:
        c.append(7)  # remaining rows in this dataset are assumed to be 'Jul'
df['month'] = c
df.to_csv('weiya.csv', encoding='gbk', index=False)

Check the field types and missing values; the data already meets the analysis requirements, so no additional processing is needed.
df.info()
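As an aside, the if/elif chain above can be replaced with a simple mapping. A minimal sketch, assuming the same column name and that every abbreviation in the data appears in the map:

# Equivalent to the loop above: map month abbreviations straight to numbers
month_map = {'Jan': 1, 'Apr': 4, 'Jul': 7, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['month'] = df['month'].map(month_map)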
Now let's analyze the data visually.
Here we only crawled the most recent 100 or so pages of data, which may be why there are fewer posts shown for the 20th and 21st.
The code is as follows:
from pyecharts.charts import Bar
from pyecharts import options as opts
from collections import Counter  # count word frequency (not used in this block)

# Collect the December days that appear in the data
c = []
d = {}
a = 0
for i in list(df['month']):
    if i == 12:
        if list(df['day'])[a] not in c:
            c.append(list(df['day'])[a])
    a += 1
# Count how many posts fall on each of those days
a = 0
for i in c:
    d[i] = 0
for i in list(df['month']):
    if i == 12:
        d[list(df['day'])[a]] += 1
    a += 1
columns = []
data = []
for k, v in d.items():
    columns.append(k)
    data.append(v)
bar = (
    Bar()
    .add_xaxis(columns)
    .add_yaxis("Number of posts", data)
    .set_global_opts(title_opts=opts.TitleOpts(title="Daily number of Weibo posts"))
)
bar.render("word_frequency.html")
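As a side note, the counting loops above can be collapsed with pandas. A minimal sketch, assuming the cleaned DataFrame from earlier:

# Posts per day in December, computed directly from the DataFrame
december = df[df['month'] == 12]
daily_counts = december['day'].value_counts().sort_index()
print(daily_counts)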
Looking at the comments, we found that the one posted by doutujun starfish has the most replies and likes, 75,000+. Let's take a look at it to see what makes users like it so much. Perhaps it got so many likes because it was posted early and sits near the top; another reason may be that its content says what everyone was thinking.
Analyzing the posting times of all the comments, we found that 21:00 (9 p.m.) has the most comments, which is roughly when the topic hit the hot-search list. It seems that whether or not something makes the hot-search list still has a big effect on how far a Weibo post spreads.
The code is as follows:
import pandas as pd

df = pd.read_csv('weiya.csv', encoding='gbk')
# Collect the distinct posting hours, then count how often each occurs
c = []
d = {}
for i in list(df['hour']):
    if i not in c:
        c.append(i)
for i in c:
    d[i] = 0
for i in list(df['hour']):
    d[i] += 1
print(d)
from collections import Counter  # count word frequency (see the sketch after this block)
from pyecharts.charts import Bar
from pyecharts import options as opts
columns = []
data = []
for k, v in d.items():
    columns.append(k)
    data.append(v)
bar = (
    Bar()
    .add_xaxis(columns)
    .add_yaxis("Time", data)
    .set_global_opts(title_opts=opts.TitleOpts(title="Time distribution"))
)
bar.render("word_frequency.html")
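Incidentally, Counter is imported above but never used; it can produce the same hour counts directly. A minimal sketch, assuming the same column name:

from collections import Counter

# Count how many comments fall in each posting hour
hour_counts = Counter(df['hour'])
print(hour_counts.most_common())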
From the word cloud we can see that "tax evasion" appears a great deal, which fits the theme; next come words like "ban", "block", and "jail". It seems people really do detest illegal behavior. The code is as follows:
from imageio import imread
import jieba
from wordcloud import WordCloud, STOPWORDS

# Read the crawled comment text and a Chinese stop-word list
with open('weiya.txt', encoding='utf-8') as f:
    job_title_1 = f.read()
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stop_word = f.read()

# Segment the text with jieba and drop stop words
word = jieba.cut(job_title_1)
words = []
for i in list(word):
    if i not in stop_word:
        words.append(i)
contents_list_job_title = " ".join(words)

STOPWORDS.add("一")  # add() mutates the set in place and returns None, so call it separately
wc = WordCloud(stopwords=STOPWORDS, collocations=False,
               background_color="white",
               font_path=r"K:\chinese_font.ttf",  # path to any local Chinese font file
               width=400, height=300, random_state=42,
               mask=imread('xin.jpg', pilmode="RGB"))
wc.generate(contents_list_job_title)
wc.to_file("wordcloud.png")

As public figures, internet celebrities and stars should set an example; you cannot enjoy the fame and fortune your fans bring you while breaking the law at the same time.