程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Be sure to know Python crawler skills. Search the name to download full-text novels?

編輯:Python

Preface

          hello, I haven't seen you for a long time , I don't know if I want to UP ah . Recently, due to personal reasons, I have no time to share new case tutorials with you , So sorry, everyone , Now the problem has been solved , It will be updated every day , I hope you like it .

        So what is the case to share with you today , That is a case of a crawler that can download text by inputting the name of a novel in just 20 lines of code , Relatively speaking, it is very simple , Can let everyone get started well , And then it follows up Let's work together to realize such a function

Project analysis

Xiaobian used to love reading novels , I haven't read a novel in the past few years , I don't know which websites can be crawled , So I just found one .、 We first come to the novel of the website top Check the web page source code

 

You can still see clearly top All the novels in the list are <li> In the tag, this brings great convenience to our crawlers , Just get each li The content in the label can be completed .

Plus, let's find out where to download the file , We continue to click to break through the sky to the following page , Why choose to break this novel , That's because this is up The first finished novel in my life , It is also the most impressive novel . I don't know if you feel the same , I remember it was hard to wait for updates at that time .

  We continue to click in and see the download link of the file , In order to make the crawler code simpler, let's take a look at this link and the previous li What is the difference between the novel links in the tag

 

 

You can see that they have the same serial number , In this way, we only need to get the number of each novel to get the download links of all novels , Let's improve our code .

Code writing

The library we need for this project is requests library ,BeautifulSoup library

After sorting out the above ideas, we first establish our crawler framework :

 

I drew a flow chart to let you better understand our structure  

 

 

OK, let's see what the code for each part is :

The web page request code is almost universal. Every time a crawler can copy and paste it intact ( This code cannot solve the problem of anti - crawling )

def getHTMLtext(url):
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}
try:
r = requests.get(url=url, headers=headers)
r.raise_for_status()
r.encoding=r.apparent_encoding
return r.text
except:
print(' The visit to fail ')

The following is to get the information of each novel on the page and construct a download link

def parseHTMLtext(html,down_list):
soup=BeautifulSoup(html,'html.parser')
li_list=soup.find('div',class_='chu').find_all('li')
for li in li_list:
li_url=li.find('a').attrs['href']
li_url='https://www.555x.org/home/down/txt/id/'+li_url[-10:-5:1]
name=li.a.string
down_list.append([li_url,name])
pass

The following is the file saving function

def savetxt(down_list):
root='D:\\ A novel \\'
for down in down_list:
path=root+down[1]+'.txt'
txt=requests.get(down[0]).content
print(down[0])
with open(path,'wb') as f:
f.write(txt)
f.close()
pass

Next is the main function

def main():
url='https://www.555x.org/top.html'
html=getHTMLtext(url)
down_list = []
parseHTMLtext(html,down_list)
savetxt(down_list)
print(' Download successful ')
pass

OK, this is our code framework. Let's see how the code crawler works :

 

All the children who have read these novels should have soy sauce

 

We can see that the effect is good !!

Well, this is the whole content of this issue. I hope you can try it by typing the code yourself , I am also python A little white in the reptile , The code may not be perfect, but I still hope to get your support , In the future, I will also launch more articles to study together !!

In the next issue, we will share how to crawl web images , You can pay attention to it !!!

Finally, I'll give you a complete code. If you want to read a novel immediately, you can copy and paste it directly :

import requests
from bs4 import BeautifulSoup
# Web request
def getHTMLtext(url):
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}
try:
r = requests.get(url=url, headers=headers)
r.raise_for_status()
r.encoding=r.apparent_encoding
return r.text
except:
print(' The visit to fail ')
# Web content analysis
def parseHTMLtext(html,down_list):
soup=BeautifulSoup(html,'html.parser')
li_list=soup.find('div',class_='chu').find_all('li')
for li in li_list:
li_url=li.find('a').attrs['href']
li_url='https://www.555x.org/home/down/txt/id/'+li_url[-10:-5:1]
name=li.a.string
down_list.append([li_url,name])
pass
# file save
def savetxt(down_list):
root='D:\\ A novel \\'
for down in down_list:
path=root+down[1]+'.txt'
txt=requests.get(down[0]).content
print(down[0])
with open(path,'wb') as f:
f.write(txt)
f.close()
pass
# The main function
def main():
url='https://www.555x.org/top.html'
html=getHTMLtext(url)
down_list = []
parseHTMLtext(html,down_list)
savetxt(down_list)
print(' Download successful ')
pass
main()

  The source code is posted , It can also be further optimized , But considering that everyone's level is different , So let's stop here today , By the way, you can pay attention to the official account below . If you have any questions you don't understand, please leave a message . I'll see you next

 

 


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved