程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Python regular expressions

編輯:Python


List of articles

  • sketch
  • Ordinary character
  • Metacharacters
    • \ The backslash
    • `[]` square brackets
    • ^
    • $ dollar
    • . spot
    • | or
    • ? question mark
    • * The stars
    • + plus
    • {m, n} Curly braces
    • (< Regular expressions >) – Group
  • Special sequence
    • Special sequence list
  • Python Regular expression module in
    • re.findall()
    • re.compile()
    • re.split()
    • re.sub()
    • re.subn()
    • re.escape()
    • re.search()
    • A match object
      • Get string and regular expression
      • Get the index of the matching object
    • Get the matching substring
  • Reference material :

sketch

Regular expressions (RegEx) It's a special sequence of characters , It uses a search pattern to find a string or group of strings . It can detect the existence of text by matching the text with a specific pattern , You can also split a schema into one or more sub schemas .Python Provides a re modular , Support in Python Using regular expressions . Its main function is to provide search , It uses regular expressions and strings . ad locum , It either returns the first match , Either no match is returned .

Example :

import re
s = 'GeeksforGeeks: A computer science portal for geeks'
match = re.search(r'portal', s)
print('Start Index position :', match.start())
print('End Index position :', match.end())

Output

The above code gives the start index and end index of the string portal .

Be careful : there r character (r’portal’) Represents the original , Not regular expressions . The original string is slightly different from the regular string , It will not \ Characters are interpreted as escape characters . This is because the regular expression engine uses \ Character to achieve its own escape purpose .

At the beginning of use Python Before the regular expression module , Let's see how to actually write regular expressions using metacharacters or special sequences .

Ordinary character

Regular expressions are made up of ordinary characters ( Such as character a To z) And special characters ( be called " Metacharacters ") The composition of the text pattern . Pattern describes one or more strings to match when searching for text . Regular expression as a template , Match a character pattern to the string being searched .

Normal characters include all printable and nonprintable characters that are not explicitly specified as metacharacters . This includes all uppercase and lowercase letters 、 All figures 、 All punctuation and some other symbols .

Metacharacters

In order to understand RE Method , Metacharacters are very useful , Important , And will be used in modules re The function of . The following is a list of metacharacters .

Metacharacters describe \ Used to remove the special meaning after a character [] Represents a character class ^ Match the beginning $ Match the end . Match any character except newline | Express OR( Matches any character separated by it .? Match zero or one time * Any number of occurrences ( Include 0 time )+ One or more instances {} Indicates the number of occurrences of the preceding regular expression to match .() Enclose a set of regular expressions

Let's discuss each of these metacharacters in detail

\ The backslash

The backslash \ Ensure that characters are not handled in a special way . This can be thought of as a way to escape metacharacters . for example , If you want to search in a string dot(.), Then you will find dot(.) Will be treated as a special character , Just like one of the metacharacters ( As shown in the table above ). therefore , In this case , We will be in dot(.) Use backslash before (\), So that it will lose its character . See the following example for a better understanding of .

example :

import re
s = 'geeks.forgeeks'
# without using \
match = re.search(r'.', s)
print(match)
# using \
match = re.search(r'.', s)
print(match)

Output :

[] square brackets

square brackets [] Represents a character class consisting of a set of characters that we want to match . for example , Character class [abc] Will match any single a、b or c.

We can also use the... In square brackets – Specify the character range . for example

  • [0, 3] As a sample
  • [a-c] And [abc] identical

We can also use caret ^ Reverse character class . for example

  • [^0-3] Express Division 0、1、2 or 3 Any number other than
  • [^a-c] Express Division a、b or c Any character other than

^

Insertion mark ^ The symbol matches the beginning of the string , That is, check whether the string starts with the given character . for example ——

  • ^g The string will be checked to see if it starts with g start , for example geeks, globe, girl, getc etc. .
  • ^ge The string will be checked to see if it starts with ge start , for example geeks、geeksforgeeks etc. .

$ dollar

Dollar$ The symbol matches the end of the string , That is, check whether the string ends with the given character . for example ——

  • s$ Will check to geeks、ends、s And so on .
  • ks$ Will check to ks a null-terminated string , for example geeks、geeksforgeeks、ks etc. .

. spot

spot . The symbol matches only the division newline character \n A single character other than . for example ——

  • a.b A string containing any characters at the point will be checked , for example acb,acbd,abbb etc.
  • .. Will check if the string contains at least 2 Characters

| or

| Symbol as or Operator , This means that it checks the string for the presence of or Patterns before or after symbols . for example ——

  • a|b Will match any containing a or b String , Such as acd、bcd、abcd etc. .

? question mark

question mark Check whether the string before the question mark in the regular expression appears at least once or does not appear at all . for example ——

  • ab?c Will match the string ac、acb、dabc, But it doesn't match abbc, Because there are two b. Again , It will not be with abdc matching , because b There is no c.

* The stars

asterisk * Symbols match * Zero or more matches of the regular expression preceding the symbol . for example ——

  • ab*c Will match the string ac、abc、abbbc、dabc etc. , But it doesn't match abdc, because b There is no c.

+ plus

plus + matching + One or more matches of the regular expression preceding the symbol . for example ——

  • ab+c Will match the string abc、abbc、dabc, But it doesn't match ac,abdc because ac There is no b, and abdc There is no c.

{m, n} Curly braces

{} Match any repetition before regular expression , from m To n( Include m and n). for example ——

  • a{2, 4} Will match the string aaab、baaaac、gaad, But it doesn't match abc、bc Etc , Because in both cases there is only one a Or not a.

(< Regular expressions >) – Group

Group symbols are used to group sub patterns . for example ——

  • (a|b)cd Will match acd、abcd、gacd Etc .

Special sequence

The special sequence does not match the actual characters in the string , Instead, it tells the search string where a match must occur . It makes it easier to write common patterns .

Special sequence list

Special sequence describe Example \A If the string starts with a given character , The match \A forfor geeks\b If a word begins or ends with a given character , The match .\b( character string ) The beginning of the word will be checked ,( character string )\b The end of the word will be checked \b gegeeks\B It is associated with \b contrary , That is, the string should not start or end with the given regular expression \B geforge\d With any decimal number , This is equivalent to setting the class [0-9]\d123\D Match any non numeric characters , This is equivalent to the collection class [^0-9]\Dgeeks\s Match any space character \sgee ks\S Match any non whitespace characters \Sa bd\w Match any alphanumeric character , This is equivalent to class [a-zA-Z0-9_]\w123\W Match any non alphanumeric characters .\W>$\Z If the string ends with a given regular expression , The match ab\Zabcdab

Python Regular expression module in

Python There is one named re Module , be used for Python Regular expressions in . We can use import Statement to import this module .

example : stay Python Import re modular

importre

Let's look at the various functions provided by this module , In order to be in Python Using regular expressions .

re.findall()

Returns all non overlapping matches of patterns in a string as a string list . Scan string from left to right , And return the matching items in the order they are found .

example : Find all matches of the pattern

# findall()
import re
# Sample text string , Where regular expressions
# Searched .
string = """Hello my Number is 123456789 and my friend's number is 987654321"""
# Example regular expressions for finding numbers .
regex = '\d+'
match = re.findall(regex, string)
print(match)

Output

re.compile()

Regular expressions are compiled into pattern objects , These objects have methods for various operations , For example, search for pattern matching or perform string replacement .

Example 1:


import re
# compile() Create regular expression
# Character class 【a-e】,
# This is equivalent to [abcde].
# class [abcde] Will match the string
# 'a', 'b', 'c', 'd', 'e'.
p = re.compile('[a-e]')
# findall() Search for regular expressions
# And return to the list after finding
print(p.findall("Aye, said Mr. Gibenson Stark"))

Output :

Understand the output :

  • For the first time “Aye” Medium “e”, instead of “A”, Because it is case sensitive .
  • The next occurrence is “said” Medium “a”, And then there was “said” Medium “d”, And then there was “Gibenson” Medium “b” and “e”, the last one “a” And “Stark” matching .
  • Metacharacter backslash ’‘ Has a very important role , Because it sends out various sequences of signals . If you want to use a backslash instead of its special meaning as a metacharacter , Please use ’\’

re.split()

Splits a string by the number of occurrences of a character or pattern , When the pattern is found , The remaining characters in the string will be returned as part of the result list .

grammar :

re.split(pattern, string, maxsplit=0, flags=0)

The first parameter ,pattern Represents a regular expression ,string Is the given string , Where the pattern is searched and split , If not provided maxsplit, Think maxsplit zero “0”, If any non-zero value is provided , Many splits occur at most . If maxsplit = 1, Then the string will be split only once , The resulting length is 2 A list of . The logo is very useful , Can help shorten code , They are not required parameters , for example :flags = re.IGNORECASE, In this split , Case write , That is, lowercase or uppercase will be ignored .

Example 1:

from re import split
# '\W+' Represents non alphanumeric characters
# Or the character group when searching ','
# Or blank ' ', split(), Split the string starting at this point
print(split('\W+', 'Words, words , Words'))
print(split('\W+', "Word's words Words"))
# here ':', ' ' ,',' So it's not alphanumeric ,
# The point where the split occurs
print(split('\W+', 'On 12th Jan 2016, at 11:02 AM'))
# '\d+' Represents a numeric character or group of
# Character splitting occurs at “12”, '2016',
# '11', '02' only
print(split('\d+', 'On 12th Jan 2016, at 11:02 AM'))

Output :

Example 2:

import re
# Splitting will only occur in “12” Occurs once at , The length of the returned list is 2
#
print(re.split('\d+', 'On 12th Jan 2016, at 11:02 AM', 1))
# 'Boy' and 'boy' When flags = re.IGNORECASE when , Will be considered the same . Ignore the case
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))

Output :

re.sub()

Function “sub” representative SubString, In a given string ( The first 3 Parameters ) Search for a regular expression pattern in , And the substring pattern is found to be repl( The first 2 Parameters ) After replacement , Count and check the number of times this happens .

grammar :

 re.sub(pattern, repl, string, count=0, flags=0)

Example 1:

import re
# Regular expression patterns “ub” matching “Subject” and “Uber” String at . As the case may be
# Has been ignored , Use the tag “ub” Should match the string twice when matching ,
# “Subject” Medium “ub” Replace with “~*”, and “Uber” Medium “ub” Replace with “ub”.
print(re.sub('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))
# Consider case sensitivity ,“Uber” Medium “Ub” Will not be replaced .
print(re.sub('ub', '~*', 'Subject has Uber booked already'))
# because count The value of is 1, Therefore, the maximum number of replacements is 1
print(re.sub('ub', '~*', 'Subject has Uber booked already',
count=1, flags=re.IGNORECASE))
# Before the mode “r” Express RE,\s Represents the beginning and end of a string .
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam',
flags=re.IGNORECASE))

Output

re.subn()

subn() In all respects with sub() be similar , In addition to the way the output is provided . It returns a tuple , It contains the total number of replacement and new strings , Not just strings .

grammar :

 re.subn(pattern, repl, string, count=0, flags=0)

example :

import re
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
t = re.subn('ub', '~*', 'Subject has Uber booked already',
flags=re.IGNORECASE)
print(t)
print(len(t))
# This will provide with sub() Same output
print(t[0])

Output

re.escape()

Returns a string of all non alphanumeric backslashes , If you want to match any text string that may contain regular expression metacharacters , This will be very useful .

grammar :

re.escape(string)

example :

import re
# escape() Returns a backslash before each non alphanumeric character “\” String
# In the first case only “”, In the second case, it is non alphanumeric “”, Insert symbol “^”、“-”、“[]”、“\”
# It's not alphanumeric
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

Output

re.search()

This method returns None( If the patterns don't match ), Or return re. The matching object contains information about the matching part of the string . This method stops after the first match , So this is best for testing regular expressions rather than extracting data .

example : The number of occurrences of the search pattern

import re
# Let's use regular expressions to match date strings , The format is month name followed by days
regex = r"([a-zA-Z]+) (\d+)"
match = re.search(regex, "I was born on June 24")
if match != None:
# When the expression “([a-zA-Z]+)(\d+) When matching the date string , We got here .
# This will print [14,21], Because it's indexing 14 There's a match , stay 21 At the end of .
print("Match at index %s, %s" % (match.start(), match.end()))
# We use group() Method to get all matches
# Captured groups . These groups contain matching values .
# especially :
# matching .group(0) Always return an exact matching string
# matching .group(1) matching .group(2)... Return capture
# Group in the input string from left to right
# matching .group() Equivalent to match. Group (0)
# So this will print “6 month 24 Japan ”
print("Full match: %s" % (match.group(0)))
# So this will print “June”
print("Month: %s" % (match.group(1)))
# So this will print "24"
print("Day: %s" % (match.group(2)))
else:
print("The regex pattern does not match.")

Output

A match object

Match Object contains all the information about the search and results , If no match is found , Will return to None. Let's take a look match Some common methods and properties of objects .

Get string and regular expression

match.re Property returns the passed regular expression ,match.string Property returns the passed string .

example : Get the string and regular expression of the matching object

import re
s = "Welcome to GeeksForGeeks"
# here x It's a match
res = re.search(r"\bG", s)
print(res.re)
print(res.string)

Output

Get the index of the matching object

  • start() Method returns the starting index of the matching substring
  • end() Method returns the end index of the matching substring
  • span() Method returns a tuple , It contains the start index and end index of the matching substring

example : Get the index of the matching object

import re
s = "Welcome to GeeksForGeeks"
# here x It's a match
res = re.search(r"\bGee", s)
print(res.start())
print(res.end())
print(res.span())

Output

Get the matching substring

group() Method returns the string portion of the pattern match . See the following example for a better understanding of .

example : Get the matching substring

import re
s = "Welcome to GeeksForGeeks"
# here x It's a match
res = re.search(r"\D{2} t", s)
print(res.group())

Output

Reference material :

https://docs.python.org/2/library/re.html


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved