
[Graduation Project Tutorial] Implementing a Crawler in Python


Brief introduction

Crawlers are often used in the data collection stage of a graduation project, and many students have asked about them, so this article walks through how crawlers are used, with analysis and a worked example.

A so-called crawler is code that extracts the data you want from web pages. The quality of that code determines whether you can accurately capture the data you want, and whether the data can then be analyzed intuitively and correctly.

Python is arguably the language best suited to writing crawlers. Python itself is very simple, but to really use it well you need to learn a number of third-party libraries. For example, matplotlib, a powerful plotting library modeled on MATLAB, lets you display the data you have crawled intuitively as pie charts, line charts, scatter plots, and even 3D figures.
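As a quick taste, here is a minimal matplotlib sketch; the keywords and counts are invented purely for illustration:

import matplotlib.pyplot as plt

labels = ['Python', 'crawler', 'data']              # hypothetical keywords
counts = [40, 35, 25]                               # hypothetical frequencies
plt.pie(counts, labels=labels, autopct='%1.1f%%')   # pie chart of keyword share
plt.title('Keyword frequency')
plt.show()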

Third-party Python libraries can be installed manually, but it is easier to type a single command on the command line that automatically finds and installs the package. The installer is also smart enough to detect your platform and pick the most suitable version.

pip install + the third-party library you need

or: easy_install + the third-party library you need

We recommend pip install, because pip can both install and uninstall packages, while easy_install can only install. If you ever want to switch to a new version of a third-party library, pip's advantage becomes clear.
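For example, using matplotlib as the package:

pip install matplotlib      # install
pip uninstall matplotlib    # remove it again (easy_install has no equivalent)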

🧿 Topic selection guidance and project sharing:

https://gitee.com/dancheng-senior/project-sharing-1/blob/master/%E6%AF%95%E8%AE%BE%E6%8C%87%E5%AF%BC/README.md

Interactive interface

from tkinter import *

def web():
    root = Tk()
    Label(root, text='Please enter the URL:').grid(row=0, column=0)        # lay the labels out in a grid
    Label(root, text='Please enter the User-Agent:').grid(row=1, column=0)
    v1 = StringVar()                           # variables bound to the entry widgets
    v2 = StringVar()
    e1 = Entry(root, textvariable=v1)          # input boxes for the URL and User-Agent
    e2 = Entry(root, textvariable=v2)
    e1.grid(row=0, column=1, padx=10, pady=5)  # grid layout for the entries
    e2.grid(row=1, column=1, padx=10, pady=5)

    def submit():
        url = e1.get()     # read the URL typed into the first box
        head = e2.get()    # read the User-Agent typed into the second box
        print(url, head)   # hand these values to the crawler

    # Note: reading e1.get() immediately (as the original did) runs before the
    # window is shown and always returns an empty string; a button callback fixes that.
    Button(root, text='Start', command=submit).grid(row=2, column=1)
    root.mainloop()
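Calling the function brings up the window:

if __name__ == '__main__':
    web()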

The crawler part

We use the crawler to fetch a blog and run jieba word segmentation over all of its articles, extracting keywords and analyzing how frequently the blogger uses currently popular Internet vocabulary.

First write a function download() that fetches the page for a given url, then a function parse_descrtion() that parses the returned html, and finally run jieba segmentation.

import re
import requests
import jieba.analyse
from bs4 import BeautifulSoup

titles = set()   # article titles collected by the parser

def download(url):   # fetch the raw page for the given url
    if url is None:
        return None
    try:
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'})
        if response.status_code == 200:
            return response.content
        return None
    except requests.RequestException:
        return None

def parse_descrtion(html):
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")   # build a BeautifulSoup tree from the html string
    links = soup.find_all('a', href=re.compile(r'/forezp/article/details'))
    for link in links:
        titles.add(link.get_text())

def jiebaSet():
    strs = ''
    if len(titles) == 0:
        return
    for item in titles:
        strs = strs + item
    tags = jieba.analyse.extract_tags(strs, topK=100, withWeight=True)   # top 100 keywords with weights
    for item in tags:
        print(item[0] + '\t' + str(int(item[1] * 1000)))

The first function needs no further explanation.

The second function uses BeautifulSoup to analyze the page and collect the text of every <a> tag whose href matches re.compile(r'/forezp/article/details').

The third function runs jieba word segmentation. Here is a brief introduction to jieba.

jieba supports three segmentation modes.

Precise mode: cuts the sentence as accurately as possible; suitable for text analysis.

Full mode: scans out all the words in the sentence that could form words; very fast, but it cannot resolve ambiguity.

Search engine mode: based on precise mode, it re-splits long words to increase recall; suitable for search engine segmentation.

For instance, segmenting the sentence "我来到北京清华大学" ("I came to Tsinghua University in Beijing"):

【Full mode】: 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学

【Precise mode】: 我 / 来到 / 北京 / 清华大学
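These modes map directly onto jieba's API. A minimal sketch reproducing the example above (jieba.cut returns a generator of tokens):

import jieba

sentence = "我来到北京清华大学"
print("Full mode:    " + "/".join(jieba.cut(sentence, cut_all=True)))    # all candidate words
print("Precise mode: " + "/".join(jieba.cut(sentence, cut_all=False)))   # default, most accurate cut
print("Search mode:  " + "/".join(jieba.cut_for_search(sentence)))       # re-splits long words for recall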

Data storage

I use the MongoDB database here; you can choose whichever database you are familiar with or that meets your requirements.

client = pymongo.MongoClient("localhost", 27017)

This line connects using the given host and port. pymongo's Connection() method is deprecated; the officially recommended replacement is MongoClient().

db = client['local']

This line assigns "local", one of the two databases MongoDB creates by default, to db, so that db represents the local database in the rest of the program.

posts = db.pymongo_test
post_id = posts.insert_one(data).inserted_id

This assigns the collection "pymongo_test" inside local to posts and uses the insert_one method to insert a single document (recent pymongo versions have removed the older insert method). The call then goes back inside the loop in the jieba segmentation step, inserting the data record by record.
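Putting the pieces together, a minimal storage sketch. It assumes the titles set filled by parse_descrtion() above; the document shape with keyword and weight fields is my own illustration, not fixed by the original:

import pymongo
import jieba.analyse

client = pymongo.MongoClient("localhost", 27017)   # connect to the local server
posts = client['local'].pymongo_test               # the "pymongo_test" collection

text = ''.join(titles)   # titles collected by parse_descrtion() above
for word, weight in jieba.analyse.extract_tags(text, topK=100, withWeight=True):
    posts.insert_one({'keyword': word, 'weight': int(weight * 1000)})   # hypothetical document shape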

That is the core code for connecting to the database. Next, how to start the mongoDB database itself. (I couldn't connect at first; it turned out the database simply wasn't running. Alas, plenty of silly things happen in programming.)

Press the Windows key + R, type cmd, change into the mongodb installation path, then run the mongod command to start the server, using --dbpath to point the data storage location at a "db" folder.
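For example, assuming the data folder lives on the E drive as mentioned below:

mongod --dbpath E:\db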


[Figure: starting mongoDB]

I put it on the E drive; set the path according to your own needs. Finally, check whether it succeeded: the startup log in the screenshot shows that mongodb is using port 27017, so type http://localhost:27017 into a browser. mongodb then tells us that adding 1000 to 27017 gives an HTTP view of mongodb's management information.

