
Use Python's requests and Beautiful Soup to analyze web pages


2022-06-28 13:29

Follow this Python tutorial to easily extract information from web pages.

Browsing the web probably takes up a good part of your day. But it is an awfully manual process, isn't it? You have to open a browser, go to a website, click buttons, move the mouse... It's a lot of work. Wouldn't it be nicer to interact with the Internet through code?

With the help of Python's requests module, you can use Python to get data from the Internet:

import requests
DATA = "https://opensource.com/article/22/5/document-source-code-doxygen-linux"
PAGE = requests.get(DATA)
print(PAGE.text)

In this code example, you first import the requests module. Next, you create two variables. One of them, called DATA, holds the URL you want to download. In later versions of the code, you will be able to provide a different URL each time you run the application. For now, though, it is easiest to "hard code" a test URL for demonstration purposes.

The other variable is PAGE. The code reads the URL stored in DATA, passes it as an argument to the requests.get function, and stores the function's return value in PAGE. The requests module and its .get function "read" an Internet address (a URL), access the Internet, and download whatever is located at that address.

Of course, there are a lot of steps involved in that. Luckily, you don't have to work them out yourself, which is exactly why Python modules exist. Finally, you tell Python to print everything that requests.get stored in the .text field of the PAGE variable.
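A download can fail (a mistyped address, a server error, no network), so a slightly more defensive version of the same request may be useful. The following is only a sketch: the function name fetch_text and the 10-second timeout are assumptions chosen for illustration, not part of the original example.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the body of `url` as text, or None if the download fails.

    `fetch_text` and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error statuses (4xx and 5xx) into exceptions.
        response.raise_for_status()
    except requests.RequestException as err:
        # RequestException is the base class of the errors requests raises.
        print("Download failed:", err)
        return None
    return response.text
```

Calling fetch_text(DATA) then returns the same text as PAGE.text when the download succeeds, and None when it does not.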

Beautiful Soup

If you run the sample code above, you get the entire contents of the example URL dumped indiscriminately into your terminal. That is because the only thing the code does with the data that requests gathered is print it. Parsing the text is more interesting.

Python can "read" text with its most basic functions, but parsing text lets you search for patterns, specific words, HTML tags, and so on. You could parse the text returned by requests yourself, but using a specialized module is much easier. For HTML and XML, there is the Beautiful Soup library.

The following code accomplishes the same thing, except that it uses Beautiful Soup to parse the downloaded text. Because Beautiful Soup recognizes HTML elements, you can use some of its built-in features to make the output easier on the eye.

For instance, instead of printing the raw text at the end of the program, you can run the text through Beautiful Soup's .prettify function:

from bs4 import BeautifulSoup
import requests
PAGE = requests.get("https://opensource.com/article/22/5/document-source-code-doxygen-linux")
SOUP = BeautifulSoup(PAGE.text, 'html.parser')
# Run only when the script is executed directly, not when it is imported.
if __name__ == '__main__':
    # Print the parsed page, one tag per line, indented by nesting depth.
    print(SOUP.prettify())

The output of this version of the program puts every opening HTML tag on its own line, indented to show which tags are nested inside which. Beautiful Soup understands HTML tags in more ways than just how it prints them out.
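As a small illustration of what "understanding" tags can mean, a parsed page lets you reach tags, their attributes, and their text by name. This sketch uses a made-up HTML snippet (the HTML string and its contents are assumptions) so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A small, made-up page used only for illustration.
HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Hello</h1>
    <p>First paragraph with a <a href="https://example.com/docs">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

print(soup.title.string)   # the text inside <title>
print(soup.h1["id"])       # the value of an attribute
print(soup.a.get("href"))  # the target of the first link
print(soup.p.get_text())   # the paragraph text with the tags stripped
```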

Instead of printing the whole page, you can single out a specific kind of tag. For instance, try changing the print statement from print(SOUP.prettify()) to this:

print(SOUP.p)

This prints just one <p> tag. Specifically, it prints only the first <p> tag it encounters. To print all <p> tags, you need a loop.
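The difference between the first match and all matches can be seen on a tiny, made-up document (the HTML string below is an assumption used for illustration). Accessing a tag name as an attribute behaves like calling find, and both give None when the page has no such tag.

```python
from bs4 import BeautifulSoup

# Made-up markup with two paragraphs and no table.
HTML = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(HTML, "html.parser")

first = soup.p                 # the first <p> only
same = soup.find("p")          # the explicit way to ask for the first match
missing = soup.table           # no <table> on this page, so this is None
every = soup.find_all("p")     # a list-like result holding both paragraphs

print(first)
print(missing)
print(len(every))
```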

Looping

With Beautiful Soup's find_all function, you can create a for loop that cycles over the entire page held in the SOUP variable. Because you may be interested in tags other than <p>, it is best to build the loop as a custom function, which Python designates with the def keyword (short for "define").

def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG)

The name of the temporary variable TAG is arbitrary. You can call it ITEM or i or whatever you like. On each pass through the loop, TAG holds the next result of the find_all function. In this code, the function searches for <p> tags.

A function does not run unless you explicitly call it. You can call this function at the end of your code:

# Run only when the script is executed directly, not when it is imported.
if __name__ == '__main__':
    # Print every <p> tag on the page.
    loopit()

Run your code to see all the <p> tags and their contents.
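Since other tags may be of interest too, one possible variation is to pass the tag name in as a parameter rather than fixing it to 'p'. This is a sketch only: the name loop_tags, its parameters, and the sample HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def loop_tags(soup, tag_name):
    """Print every `tag_name` element in `soup` and return how many were found."""
    count = 0
    for tag in soup.find_all(tag_name):
        print(tag)
        count += 1
    return count


# Made-up markup so the sketch runs without downloading anything.
HTML = "<h2>Title</h2><p>one</p><p>two</p>"
sample = BeautifulSoup(HTML, "html.parser")

loop_tags(sample, "p")    # prints both paragraphs
loop_tags(sample, "h2")   # prints the heading
```

Passing the parsed page in as an argument, instead of reading the global SOUP, also makes the function easy to try on small test snippets like this one.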

Getting just the content

You can leave the tags themselves out of the output by asking for only the "string" (the programming term for "words") of each tag.

def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG.string)

Of course, once you have the text of a web page, you can parse it further with Python's standard string functions. For instance, you can get a word count using len and split:

def loopit():
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            print(len(TAG.string.split()))

This prints the number of words in each paragraph element, skipping paragraphs whose .string is None (which happens when a tag does not contain a single string, for example when it is empty or mixes text with nested tags). To get a grand total, use a variable and some basic math:

def loopit():
    NUM = 0
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            NUM = NUM + len(TAG.string.split())
    print("Grand total is ", NUM)
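Paragraphs that mix text with nested tags (a link or some emphasis, say) have no single string, so the count above skips them. If those words should be counted as well, one option is Beautiful Soup's get_text method, which joins all the text inside a tag. The sketch below assumes a made-up HTML snippet, and the function name count_words is hypothetical.

```python
from bs4 import BeautifulSoup


def count_words(soup):
    """Return the total number of whitespace-separated words in all <p> tags."""
    total = 0
    for tag in soup.find_all("p"):
        # get_text() includes text found inside nested tags such as <em>.
        total += len(tag.get_text().split())
    return total


# Made-up markup: the second paragraph mixes text with a nested tag.
HTML = "<p>plain text here</p><p>text with <em>nested</em> markup</p>"
sample = BeautifulSoup(HTML, "html.parser")
nested = sample.find_all("p")[1]

print(nested.string)         # None, because the paragraph has several children
print(count_words(sample))   # counts the words of both paragraphs
```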

Python Homework

There is a lot more information you can extract with Beautiful Soup and Python. Here are some ideas on how to improve your application:

  • Accept input, so that you can specify the URL to download and analyze when you launch the application.
  • Count the number of images (<img> tags) on a page.
  • Count the number of images (<img> tags) within another tag (for instance, only the images inside the <main> div, or only the images that follow a </p> tag).
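As a possible starting point for the first ideas in this list, the sketch below reads a URL from the command line with the standard argparse module and counts <img> tags, optionally only those nested inside another tag. The names parse_args and count_images, the inside parameter, and the sample HTML are all assumptions made for illustration; counting images that follow a </p> tag is left as part of the exercise.

```python
import argparse

from bs4 import BeautifulSoup


def parse_args(argv=None):
    """Read the URL to analyze from the command line (or from `argv`)."""
    parser = argparse.ArgumentParser(description="Count images on a web page")
    parser.add_argument("url", help="address of the page to analyze")
    return parser.parse_args(argv)


def count_images(soup, inside=None):
    """Count <img> tags, optionally only those within the first `inside` tag."""
    scope = soup.find(inside) if inside else soup
    if scope is None:
        # The page has no such enclosing tag, so it holds no images either.
        return 0
    return len(scope.find_all("img"))


# Made-up markup so the sketch runs without downloading anything.
HTML = """
<header><img src="logo.png"></header>
<main><p>text</p><img src="a.png"><div><img src="b.png"></div></main>
"""
sample = BeautifulSoup(HTML, "html.parser")

print(count_images(sample))           # every image on the page
print(count_images(sample, "main"))   # only the images inside <main>
```

In a complete application, the url attribute returned by parse_args() would be passed to requests.get, and the downloaded text would be parsed with BeautifulSoup in place of the sample HTML.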
