
How to scrape and analyze the bullet comments of 《Yearning for Life》 in Python


《Yearning for Life》 is a heart-warming reality show on Hunan Satellite TV. Its third season is currently airing, and Zhang Zifeng has joined as a regular guest, to the audience's delight. The show's Douban score has reached 7.9. Built around celebrities experiencing village life, it blends food, farm work, and humor, so viewers feel immersed, as if they had really stepped into the "life they yearn for".

Douban score of Yearning for Life

While watching the show these days and seeing the lively discussion in the bullet comments (danmaku), I wondered on a whim whether I could scrape all of them for analysis. On the one hand, I wanted to find out whether scraping bullet-comment data involves anything special; on the other, the comments are a way to gauge how the show is received. Below, episode 5, which was released just last Friday, serves as the example. The code mainly uses the requests library, and the results are stored in a CSV file.

Page analysis

Open episode 5 in the web version of Mango TV, wait for the ad to finish loading, and open the Network tab of Chrome Developer Tools. There are many requests, and more keep arriving over time, so I cleared the list first and then waited. Most of the requests loaded at the start are images, which are not what we want. After a while a suspicious request appeared (see the figure below); clicking on it showed that it does contain bullet comments. Its interval field is 60, which I guessed means a new request is sent every 60 s. Filtering for requests whose names start with "rdb" confirmed that they are all bullet-comment requests, and that next is always a multiple of 60000, presumably 60000 milliseconds, that is, 60 seconds.

Locating the bullet-comment request

Filtering the bullet-comment requests

Next we need to work out how the bullet comments are paged, that is, the common pattern behind these request links. A good tool for analyzing web requests is Postman: it can inspect a request's parameters and also generate request code in several languages that works after minor changes. Paste the link we just found into Postman. As the figure shows, the request's parameters are listed, and clicking the Send button shows the response.

There are many parameters, so I tried removing the ones that are not needed. It turns out that keeping just three, vid, cid, and time, is enough. My guess is that vid and cid identify the show and the video, and that time is the requested moment, as a relative value. In the response, the time of every bullet comment is larger than the requested time. Putting this together, each request returns the bullet comments in the 60 s that follow the requested time, so all of them can be fetched by varying time. The smallest time value should be 0, and the largest should be the multiple of 60000 milliseconds closest to the video's length, which here is 89:49. I verified that this holds, so the next step is to implement it in code.

Testing the request parameters in Postman

Testing the time request parameter in Postman
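The paging rule worked out above can be written down as a small helper. This is only a minimal sketch: the function name `time_offsets` is my own, and it assumes, as inferred above, that `time` is a millisecond offset that advances in steps of 60000.

```python
def time_offsets(duration, step_ms=60000):
    """Return every `time` value needed to cover a video of the given length.

    `duration` is a "MM:SS" string such as "89:49".
    """
    minutes, seconds = (int(part) for part in duration.split(":"))
    total_ms = (minutes * 60 + seconds) * 1000
    # Offsets 0, 60000, ... up to the last multiple of step_ms below the duration
    return list(range(0, total_ms, step_ms))


offsets = time_offsets("89:49")
print(len(offsets), offsets[0], offsets[-1])
```

For the 89:49 episode this gives 90 offsets, from 0 to 5340000, one per request.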

Code implementation

Use requests to build the network requests, and a loop to page through and scrape all the bullet comments. The returned JSON is parsed and stored in a CSV file with pandas. The full code, about 45 lines, is as follows.

import datetime
import time

import pandas as pd
import requests
from fake_useragent import UserAgent

ua = UserAgent()
url = "https://galaxy.bz.mgtv.com/rdbarrage"
# One list per column; turned into a DataFrame at the end
rdb_content = {'id': [], 'type': [], 'uid': [], 'content': [], 'add_time': [], 'ups': []}
count = 0
print("Crawl start time: {}".format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))

# Each request returns 60 s of bullet comments; the episode is 89:49 long
for i in range(0, 91):
    querystring = {"version": "2.0.0", "vid": "5683459", "cid": "328724", "time": i * 60000}
    headers = {'User-Agent': ua.random}  # random User-Agent for every request
    try:
        response = requests.get(url, headers=headers, params=querystring, timeout=10).json()
        items = response['data']['items']
        if items is None:
            # No items: we are past the end of the video
            print("Crawl finished! Number of bullet comments: {}".format(count))
            break
        for item in items:
            rdb_content['id'].append(item.get('id'))            # bullet comment id
            rdb_content['type'].append(item.get('type'))        # bullet comment type
            rdb_content['uid'].append(item.get('uid'))          # user id
            rdb_content['content'].append(item.get('content'))  # text of the bullet comment
            rdb_content['add_time'].append(item.get('time'))    # time of the bullet comment
            rdb_content['ups'].append(item.get('up', 0))        # likes on the bullet comment
            count += 1
        print("Crawled minute {}, bullet comments so far: {}".format(i + 1, count))
        time.sleep(5)  # pause between requests
    except Exception as exc:
        print("Minute {} failed ({}), bullet comments so far: {}".format(i + 1, exc, count))
        continue

rdb_df = pd.DataFrame(rdb_content)
rdb_df.to_csv('rdb.csv', index=False)

Screenshot of the run:

Output of the run
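The parsing step of the crawler above can also be pulled out into a pure function, which makes it possible to check the field mapping without touching the network. The sketch below is one possible way to do that; the function name `parse_items` and the sample payloads are hypothetical, and only the field names and the `data` / `items` nesting come from the code above.

```python
def parse_items(payload):
    """Turn one response payload into a list of row dicts (empty if no items)."""
    items = (payload.get("data") or {}).get("items") or []
    return [
        {
            "id": item.get("id"),
            "type": item.get("type"),
            "uid": item.get("uid"),
            "content": item.get("content"),
            "add_time": item.get("time"),
            "ups": item.get("up", 0),  # a missing "up" is treated as 0 likes
        }
        for item in items
    ]


# Hypothetical payloads, shaped like the response described above
sample = {"data": {"items": [{"id": 1, "uid": 7, "content": "hahaha", "time": 61000}]}}
rows = parse_items(sample)
print(rows)
print(parse_items({"data": {"items": None}}))
```

A crawler built on this helper could stop paging as soon as it gets an empty list back.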

This crawl collected close to 30,000 bullet comments. The episode had been out for less than 2 days at that point, which says something about the show's popularity. Next, let's analyze the bullet-comment data in more depth and look at the show from the data's perspective.

Data visualization

Some fields are missing in the scraped data, but the proportion is very small, so those rows are simply deleted, leaving 28,602 valid records.

Data preprocessing - Delete duplicate values
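The cleaning code itself is not shown in the article. A minimal pandas sketch of such a step might look like the following; since the text mentions missing fields and the caption mentions duplicates, it does both. The helper name `drop_incomplete` and the tiny sample frame are made up for illustration.

```python
import pandas as pd


def drop_incomplete(df):
    """Drop rows with any missing field, then drop exact duplicate rows."""
    return df.dropna().drop_duplicates().reset_index(drop=True)


# Tiny made-up frame standing in for pd.read_csv('rdb.csv')
raw = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "uid": [10, None, 12, 12],
    "content": ["ha", "nice", "so warm", "so warm"],
    "ups": [0, 3, 5, 5],
})
clean = drop_incomplete(raw)
print(len(raw), "->", len(clean))
```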

01 Number of bullet comments over time

The episode is about 90 minutes long. We count the bullet comments in intervals of 1 minute and of 10 minutes. The count fluctuates over time, but overall it never swings violently at any point, which suggests that the episode holds viewers' attention throughout: "every minute is wonderful".

Bar chart: number of bullet comments per minute

Bar chart: number of bullet comments per ten minutes
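The per-minute and per-ten-minute counts can be computed with integer division on the comment time. The sketch below assumes that the stored time is a millisecond offset from the start of the video, which matches the request analysis but is not confirmed by the article; the function name and the sample timestamps are made up.

```python
import pandas as pd


def comments_per_bucket(times_ms, bucket_minutes=1):
    """Count comments in consecutive buckets of `bucket_minutes` minutes."""
    bucket = times_ms // (bucket_minutes * 60000)  # bucket index, starting at 0
    return bucket.value_counts().sort_index()


# Made-up timestamps (assumed to be ms from the start of the video)
times = pd.Series([0, 30000, 60000, 600000, 5389000])
per_minute = comments_per_bucket(times, 1)
per_ten = comments_per_bucket(times, 10)
print(per_minute.to_dict())
print(per_ten.to_dict())
```

The resulting Series can be passed straight to a bar-chart call of whichever plotting library is used.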

02 Distribution of bullet-comment lengths

Bar chart: number of bullet comments by length

Most bullet comments are around 10 characters long and tend to be colloquial. This matches intuition: about 10 characters is enough to express a viewer's mood or opinion. Some viewers go to more trouble and write over 30 characters, and a very small number of comments exceed 50. Out of curiosity, we can look at what the comments longer than 50 characters say (see the figure below); they give some sense of how attentively the audience is watching.

Bullet comments longer than 50 characters
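A length analysis like this one needs only a character count per comment. The following is a rough sketch on made-up data; the column name `length` is my own addition, while `content` is the column written by the crawler.

```python
import pandas as pd

# Made-up comments standing in for the 'content' column of rdb.csv
df = pd.DataFrame({"content": ["hahaha", "so warm", "x" * 55, "nice"]})

df["length"] = df["content"].str.len()                    # characters per comment
length_counts = df["length"].value_counts().sort_index()  # comments per length
long_comments = df[df["length"] > 50]                     # the outliers worth reading

print(length_counts.to_dict())
print(len(long_comments))
```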

03 Distribution of likes on bullet comments

Distribution of like counts by range

Nearly a quarter of the bullet comments received no likes. Nearly 60% have fewer than 20 likes, and fewer than 20% have 20 or more. We can also look at what the comments with more than 300 likes say; they convey the show's overall cheerful atmosphere.

Bullet comments with more than 300 likes
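Shares per like range can be obtained by binning the like counts. The sketch below uses `pd.cut`; the bin edges and labels are my own choice, picked to roughly mirror the thresholds mentioned above (0, 20, 300), and the sample values are made up.

```python
import pandas as pd

LABELS = ["0", "1-19", "20-299", "300+"]


def like_shares(ups):
    """Share of comments falling in each like-count range."""
    bins = [-1, 0, 19, 299, float("inf")]  # right-inclusive bin edges
    ranges = pd.cut(ups, bins=bins, labels=LABELS)
    return ranges.value_counts(normalize=True).sort_index()


# Made-up like counts standing in for the 'ups' column
ups = pd.Series([0, 0, 5, 19, 20, 300, 1000, 2])
shares = like_shares(ups).to_dict()
print(shares)
```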

04 Comments posted, likes, and total characters per user

Our data contains 28,602 bullet comments posted by 17,268 users. Sorting users by likes in descending order and taking the top 10, we compare each user's number of comments, number of likes, and total number of characters. Users with many likes also tend to post many comments and write many characters.

Comparison of bullet-comment activity per user
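A per-user comparison like this is a group-by over the user id. The following sketch shows one way to build the top-10 table with pandas named aggregation; the output column names (`comments`, `likes`, `chars`) and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the cleaned data
df = pd.DataFrame({
    "uid": [1, 1, 2, 3, 3, 3],
    "content": ["ha", "hahaha", "nice", "so warm", "lol", "yes"],
    "ups": [10, 5, 40, 1, 2, 3],
})
df["length"] = df["content"].str.len()

per_user = (
    df.groupby("uid")
      .agg(comments=("content", "count"),  # comments posted
           likes=("ups", "sum"),           # likes received in total
           chars=("length", "sum"))        # characters written in total
      .sort_values("likes", ascending=False)
      .head(10)                            # top 10 users by likes
)
print(per_user)
```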

05 Emoji used in bullet comments

Emoji usage in bullet comments

06 Word cloud

After segmenting the bullet comments into words, we draw the following word cloud.

Word cloud of the bullet comments
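The article does not say which segmenter or word-cloud library was used. Whatever the tools, the step in between is a word-frequency count, sketched below with the tokenizer passed in as a parameter. The function name `word_frequencies` is hypothetical, and whitespace splitting is only a stand-in for a real Chinese segmenter.

```python
from collections import Counter


def word_frequencies(texts, tokenize, min_len=2):
    """Count tokens across all texts, skipping tokens shorter than `min_len`."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if len(tok) >= min_len)
    return counts


# Whitespace splitting is only a stand-in for the demo; for Chinese text a
# real segmenter (for example jieba.lcut, an assumption here) would be passed.
comments = ["ha ha haha so warm", "haha so funny", "warm"]
freqs = word_frequencies(comments, str.split)
print(freqs.most_common(3))
```

A frequency mapping of this kind is what word-cloud libraries typically take as input; for Chinese text, a font with CJK glyphs usually has to be supplied as well.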

One look at this word cloud and the joy spills off the screen; you can almost hear the "ha ha ha ha" ringing in your ears. The audience has a sharp eye, and it is no surprise that a show that makes people this happy has become a hit.

With that, we have finished scraping the bullet comments of episode 5 of 《Yearning for Life》 and a simple visual analysis of them. There are more interesting angles left for you to explore. I also tried Baidu's sentiment analysis API to gauge the sentiment of the bullet comments, but the results did not look very good, so I left them out.

