
Python knowledge: removing HTML tags from a string

Editor: Python

Sometimes, when we store strings in a database, they are stored together with their HTML tags. However, some websites need to render those strings as plain text, without any of the HTML markup held in the database. In this tutorial, we will look at different ways to remove HTML tags from a string in Python.

1 Remove HTML tags from a string using regular expressions in Python

1.1 Sample code

A regular expression is a sequence of characters that defines a search pattern. From Python's regular expression module we use the sub() function, which replaces every substring that matches a given pattern with another string. The code below uses a regular expression to remove HTML tags from a string.

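import re

# Matches "<", then one or more characters other than ">", then ">"
regex = re.compile(r'<[^>]+>')

def remove_html(string):
    # Replace every tag-like match with an empty string
    return regex.sub('', string)

text = input("Enter String:")
new_text = remove_html(text)
print(f"Text without html tags: {new_text}")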

Output 1:

Enter String:<div class="header"> Welcome to my website </div>
Text without html tags: Welcome to my website

Output 2:

Enter String:<h1> Hello </h1>
Text without html tags: Hello

1.2 How the above code works

First, we import Python's regular expression module, which is called "re".
Then we call re.compile(). The compile() function builds a regular expression pattern object from the pattern string passed to it. That pattern object can then be reused to search for matches in different target strings. In the pattern '<[^>]+>', the '<' and '>' match the opening and closing angle brackets of a tag, and '[^>]+' matches one or more characters that are not '>'.
A common alternative pattern is '<.*?>'. Here '.*' stands for zero or more characters. By default, quantifiers in regular expressions are greedy: they try to match as many repetitions as possible and backtrack only if the rest of the pattern fails, so '<.*>' would match everything from the first '<' to the last '>' in the string. To turn a greedy quantifier into a non-greedy one, we add the "?" character after it. It then matches as few repetitions as possible and extends the match only when needed.
Next, we call the pattern object's sub() method to replace every match with an empty string.
Finally, we call the remove_html function to remove the HTML tags from the input string.
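
To see the difference between greedy and non-greedy matching in practice, here is a small sketch. The sample string and the variable names are made up for illustration; only re.sub() and the patterns themselves come from the explanation above.

```python
import re

# Hypothetical sample string with two pairs of tags
sample = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs from the first "<" to the last ">", so everything is removed
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy: ".*?" stops at the nearest ">", so only the tags are removed
lazy = re.sub(r"<.*?>", "", sample)

# Negated character class: "[^>]+" can never run past a ">", same result here
negated = re.sub(r"<[^>]+>", "", sample)

print(repr(greedy))   # ''
print(repr(lazy))     # 'bold and italic'
print(repr(negated))  # 'bold and italic'
```

Note that "." does not match a newline unless the re.DOTALL flag is set, so '<.*?>' will miss a tag that spans several lines, while '[^>]+' has no such restriction.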

2 Remove HTML tags from a string without using built-in functions

The code below removes HTML tags from a string without using any library functions.

def remove_html(string):
    tag = False    # True while we are inside a tag
    quote = False  # True while we are inside a quoted attribute value
    output = ""
    for ch in string:
        if ch == '<' and not quote:
            tag = True
        elif ch == '>' and not quote:
            tag = False
        elif (ch == '"' or ch == "'") and tag:
            quote = not quote
        elif not tag:
            # Only characters outside tags are kept
            output = output + ch
    return output

text = input("Enter String:")
new_text = remove_html(text)
print(f"Text without html tags: {new_text}")

Output:

Enter String:<div class="header"> Welcome to my website </div>
Text without html tags: Welcome to my website

How the above code works
In the code above, we keep two flags, called tag and quote. The tag flag records whether we are currently inside a tag, and the quote flag records whether we are inside a quoted attribute value (single or double quotation marks). We use a for loop to go through the string one character at a time. If the character is '<' and we are not inside quotes, tag is set to True; if it is '>' and we are not inside quotes, tag is set to False. If the character is a single or double quotation mark inside a tag, the quote flag is toggled. Otherwise, if we are not inside a tag, the character is appended to the output string. As a result, in the output of the code above the div tags are removed and only the original text is left.
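
The quote flag matters when an attribute value itself contains '>'. The following sketch compares the character-by-character approach with a simple regular expression on such input. The input string and the names strip_tags_scan and TAG_RE are hypothetical, chosen only for this illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # simple tag pattern, for comparison only


def strip_tags_scan(string):
    """Character scan using the same tag/quote flags as described above."""
    tag = False
    quote = False
    pieces = []
    for ch in string:
        if ch == "<" and not quote:
            tag = True
        elif ch == ">" and not quote:
            tag = False
        elif ch in ('"', "'") and tag:
            quote = not quote
        elif not tag:
            pieces.append(ch)
    return "".join(pieces)


# Hypothetical input: the attribute value contains a ">" character
sample = '<a title="a > b">link</a>'

scan_result = strip_tags_scan(sample)
regex_result = TAG_RE.sub("", sample)

print(repr(scan_result))   # 'link'
print(repr(regex_result))  # ' b">link'
```

Because a single flag tracks both kinds of quotation mark, a value such as title="it's" would leave this scanner stuck inside the tag. Remembering which quote character opened the value would be one way to handle that case.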

3 Remove HTML tags from a string using Python's XML module

XML is a markup language used to store and transport data. Python ships with built-in modules that help us parse XML documents. An XML document is made up of units called elements, each delimited by a start tag and an end tag (<...>). Everything between the start tag and the end tag is the content of the element, and an element can contain other elements, called child elements. With the ElementTree module in Python, we can work with such documents easily. Because this approach relies on an XML parser, the input string has to be well-formed XML. The code below uses the XML module to remove HTML tags from a string.

import xml.etree.ElementTree

def remove_html(string):
    # Parse the string as XML and join the text of every element
    return ''.join(xml.etree.ElementTree.fromstring(string).itertext())

text = input("Enter String:")
new_text = remove_html(text)
print(f"Text without html tags: {new_text}")

Output:

Enter String:<p class="intro"> I love Coding </p>
Text without html tags: I love Coding

How the above code works

First, we import the xml.etree.ElementTree module.
We use the fromstring() function to parse the string into an XML element. To walk over that element and everything nested inside it, we use the itertext() method. It iterates over the element and its descendants and yields the inner text of each one in document order.
We use join to concatenate the pieces of inner text, with an empty string as the separator, and return the final output string.
Finally, we call the remove_html function to remove the HTML tags from the input string.
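
One thing to keep in mind is that fromstring() expects well-formed XML, which a lot of real HTML is not. The sketch below shows both cases: nested elements are flattened by itertext(), while input that is not well-formed raises ParseError. The helper name remove_html_or_none and the sample inputs are hypothetical, used only for illustration.

```python
import xml.etree.ElementTree as ET


def remove_html_or_none(string):
    """Return the text content, or None if the string is not well-formed XML."""
    try:
        return "".join(ET.fromstring(string).itertext())
    except ET.ParseError:
        return None


nested = '<p class="intro">I love <b>Coding</b> a lot</p>'
unclosed = "<p>line one<br>line two</p>"  # <br> is never closed
two_roots = "<p>a</p><p>b</p>"            # more than one top-level element

print(remove_html_or_none(nested))     # I love Coding a lot
print(remove_html_or_none(unclosed))   # None
print(remove_html_or_none(two_roots))  # None
```

Returning None is only one possible design choice. Depending on the application, falling back to one of the other approaches in this tutorial may be more useful.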

This concludes the tutorial on how to remove HTML tags from a string in Python.