
Python crawler practice: automatically crawling Quora text and filtering it with regular-expression matching


Web crawling and regular-expression matching

How it works

Use requests to fetch the page's HTML, then use re regular expressions to match and clean up the text.
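Before the full script, here is the matching step in isolation. The HTML is hard-coded so the snippet runs without network access; in the real crawler it would come from `requests.get(url).text`:

```python
import re

# Stand-in for the HTML that requests.get(url).text would return.
html = '<html><head><title>Is online education overrated?</title></head></html>'

# re.findall returns every non-overlapping match of the capture group.
title = ' '.join(re.findall('<title>(.*?)</title>', html))
print(title)  # -> Is online education overrated?
```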

Code

# -*- coding: utf-8 -*-
# The line above declares the source encoding, so non-ASCII text in this file is handled correctly.
import re
import requests

response = requests.get('https://www.quora.com/Is-online-education-overrated')  # page to crawl
f = open("words.txt", "a")  # open the output file in append mode
data = response.text  # page source as a string (the encoding can be changed via response.encoding)
title = ' '.join(re.findall('<title>(.*?)</title>', data))  # page title
# The regular expression below is fairly involved: it looks for the contents of
# "text" fields embedded in the page. Which pattern to search for depends on the
# structure of the page's HTML.
result_list = re.findall('"text": "(.*?)."', data) + \
    re.findall(r'''\\\\\\"text\\\\\\": \\\\\\"(.*?).\\\\\\",''', data)
f.write('\n')  # start on a fresh line
print(title)  # print the title
for result in result_list:  # post-process each match
    result = result.replace(r'\u2019', "'")  # restore the escaped apostrophe
    result = result.replace('\\\\\\', "")  # remove doubled backslashes
    result = result.replace('/', "")  # remove forward slashes
    result = result.replace(r'\n', "")  # remove literal newline escapes
    result = result.replace('\\', "")  # remove remaining single backslashes
    check = result.split()  # split into a list of words
    for ele in check[:]:  # iterate over a copy: removing from the list being iterated skips elements
        if '"modifiers": {"image": ' in ele or len(ele) >= 13:
            check.remove(ele)  # drop irrelevant tokens
    result = ' '.join(check)  # join back into a single string
    f.write(result + ", " + title + "\n")  # write it to the file
    print(result)  # print the cleaned content
f.close()  # always close the file when done
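One caveat worth knowing about the filtering loop above: calling `check.remove(ele)` while iterating over `check` itself skips elements. A list comprehension does the same filtering safely in a single pass; the sample tokens below are made up for illustration:

```python
# Hypothetical sample tokens standing in for one split answer.
check = ['good', '"modifiers": {"image": ', 'reasonable', 'averyverylongtoken']

# Keep a token only if it is short and carries no image-modifier markup.
filtered = [ele for ele in check
            if '"modifiers": {"image": ' not in ele and len(ele) < 13]
print(' '.join(filtered))  # -> good reasonable
```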

Room for improvement

Because the point here is to demonstrate crawling and regular matching, the code is kept simple and the matching is not perfect. Still, most of the results are quite accurate, for example:

Culture of Qualit
"On quality Terry Anderson emphasized that ""learning- knowledge- assessment- and educational experiences will result in high levels of learning by all He also believes that the ""integration of the new tools and affordances of the educational Semantic Web and emerging social software solutions will further enhance and make more accessible and affordable quality online learning experiences"
Since I have titled this observation as
ODeL Xperitu
[from Latin experitu = experienced tested proven] let me say that learning must progress to maturity; to function well as social innovators promoting excellence through Capacity Building and Development. Yes this is the Quality Assurance (QA) principle that defines and determ
"Michael Moore even says that this is a fact of distance education wherein ""teaching is hardly ever an individual act but a process joining together the expertise of a number of specialists."

However, some strange characters still slip through the cleanup:

#NAME?
(the line above is repeated five times in the output)
For any Query or Enquiry Please Call u2013 Hiren Harwani - 9712186969 (you can join us in Whatsapp also

See whether you can improve this program yourself!
Also, if you are crawling Chinese web pages, remember to set the encoding to utf-8.
There are also many other third-party libraries worth exploring, such as Beautiful Soup 4.
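As an illustrative note on the utf-8 tip: with requests you can force the decoding by setting `response.encoding = 'utf-8'` before reading `response.text`. The round trip below shows the underlying idea with plain bytes (the sample string is just an example):

```python
# Pages arrive over the network as bytes; they must be decoded with the
# right codec. Forcing utf-8 (as response.encoding = 'utf-8' does in
# requests) avoids garbled text on Chinese pages.
raw = '正则表达式'.encode('utf-8')  # bytes as received from the network
text = raw.decode('utf-8')          # decode explicitly as utf-8
print(text)  # -> 正则表达式
```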

