
Author | Dongge takes off
Source | Python Data Science
This article introduces pandas' common text-processing methods.

Text columns come in two main dtypes: string and object. Unless you request the dedicated string dtype explicitly, text is generally stored as object.

Text operations go through the str accessor, which is very powerful, but note the following points before using it:

1. The accessor works only on Series. Besides regular column variables such as df.col, it can also be used on index objects: df.index and df.columns.
2. Make sure the values being accessed are of str type. If they are not, convert first with astype(str); otherwise an error is raised.
3. Accessor calls can be chained, e.g. df.col.str.lower().str.upper(), on the same principle as one-line method chaining on a DataFrame.

The operations introduced below cover roughly 95% of day-to-day data-cleaning needs, organized into 8 scenarios.
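The three cautions above can be sketched quickly. This is a minimal example with throwaway values, independent of the df defined below; note in particular that astype(str) turns NaN into the literal string 'nan':

```python
import numpy as np
import pandas as pd

# the accessor works on a Series; numeric values must be converted first
s = pd.Series([10, 20, np.nan])
# caution: astype(str) turns NaN into the literal string 'nan'
print(s.astype(str).str.len().tolist())        # [4, 4, 3]

# accessor calls can be chained: each str method returns a new Series
mixed = pd.Series(['Hello', 'WORLD'])
print(mixed.str.lower().str.upper().tolist())  # ['HELLO', 'WORLD']
```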

The operations below are all based on the following data (the email addresses are reconstructed from the outputs shown later in the article):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi', 'Amei'],
    'Age': [18, 30, 45, 23, 45, 62],
    'level': ['high', 'Low', 'M', 'L', 'middle', np.nan],
    'Email': ['jordon@sohu.com', 'Mike@126.cn', 'KelvinChai@gmail.com',
              'xiaoli@163.com', np.nan, 'amei@qq.com']})
--------------------------------------------
     name  Age   level                 Email
0  jordon   18    high       jordon@sohu.com
1    MIKE   30     Low           Mike@126.cn
2  Kelvin   45       M  KelvinChai@gmail.com
3  xiaoLi   23       L        xiaoli@163.com
4    qiqi   45  middle                   NaN
5    Amei   62     NaN           amei@qq.com

Case conversion
# all characters to lowercase
s.str.lower()
# all characters to uppercase
s.str.upper()
# capitalize the first letter of each word
s.str.title()
# capitalize the first letter of the string
s.str.capitalize()
# swap upper and lower case
s.str.swapcase()

These methods are straightforward, so a single example suffices: lower-casing the column labels.
df.columns.str.lower()
--------------------------------------------------------
Index(['name', 'age', 'level', 'email'], dtype='object')
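On single words, title and capitalize look identical; a quick sketch with made-up multi-word values shows the difference, plus swapcase:

```python
import pandas as pd

s = pd.Series(['new york', 'LOS angeles'])

print(s.str.title().tolist())       # ['New York', 'Los Angeles']
print(s.str.capitalize().tolist())  # ['New york', 'Los angeles']
print(s.str.swapcase().tolist())    # ['NEW YORK', 'los ANGELES']
```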
Format judgment

The following methods are tests, so they return Boolean values.
s.str.isalpha()    # alphabetic characters only?
s.str.isnumeric()  # numeric (0-9) only?
s.str.isalnum()    # letters and digits only?
s.str.isupper()    # all uppercase?
s.str.islower()    # all lowercase?
s.str.isdigit()    # digits only?
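A small sketch of the judgment methods on made-up values:

```python
import pandas as pd

s = pd.Series(['abc', 'abc123', '123', 'ABC'])

print(s.str.isalpha().tolist())    # [True, False, False, True]
print(s.str.isnumeric().tolist())  # [False, False, True, False]
print(s.str.isalnum().tolist())    # [True, True, True, True]
print(s.str.isupper().tolist())    # [False, False, False, True]
```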
Alignment

# center, width 8, padded with '*'
s.str.center(8, fillchar='*')
# left-justify, width 8, padded with '*'
s.str.ljust(8, fillchar='*')
# right-justify, width 8, padded with '*'
s.str.rjust(8, fillchar='*')
# general padding: width, side ('left', 'right' or 'both') and fill character are all configurable
s.str.pad(width=8, side='both', fillchar='*')

# example
df.name.str.center(8, fillchar='*')
-------------
0 *jordon*
1 **MIKE**
2 *Kelvin*
3 *xiaoLi*
4 **qiqi**
5    **Amei**
Counting and encoding

s.str.count('b')       # occurrences of 'b' in each string
s.str.len()            # string length
s.str.encode('utf-8')  # encode str to bytes
s.str.decode('utf-8')  # decode bytes back to str
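A quick sketch of the counting and encoding methods on made-up values:

```python
import pandas as pd

s = pd.Series(['banana', 'bob', 'apple'])

print(s.str.count('b').tolist())  # [1, 2, 0]
print(s.str.len().tolist())       # [6, 3, 5]

# encode yields a Series of bytes; decode turns it back into str
b = s.str.encode('utf-8')
print(b.str.decode('utf-8').tolist())  # ['banana', 'bob', 'apple']
```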
Splitting

The split method splits text on a specified delimiter. The expand parameter expands the split pieces into separate columns, and the n parameter limits the number of splits, which controls how many columns are produced.

Below, the Email column is split on @.
# Usage method
s.str.split('x', expand=True, n=1)
# give an example
df.Email.str.split('@')
----------------------------
0         [jordon, sohu.com]
1             [Mike, 126.cn]
2    [KelvinChai, gmail.com]
3          [xiaoli, 163.com]
4                        NaN
5             [amei, qq.com]

# expand=True expands the split pieces into separate columns
df.Email.str.split('@', expand=True)
----------------------------
0 1
0      jordon   sohu.com
1        Mike     126.cn
2  KelvinChai  gmail.com
3      xiaoli    163.com
4         NaN        NaN
5        amei     qq.com

More complex splits can be done with regular expressions. For example, to split on both @ and ., write:
df.Email.str.split(r'@|\.', expand=True)
----------------------------
0 1 2
0      jordon   sohu  com
1        Mike    126   cn
2  KelvinChai  gmail  com
3      xiaoli    163  com
4         NaN    NaN  NaN
5        amei     qq  com
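The n parameter is mentioned above but not demonstrated. A sketch, rebuilding two of the Email values from the split outputs above: without n, splitting on @ or . yields three columns; with n=1, splitting stops after the first match.

```python
import pandas as pd

email = pd.Series(['jordon@sohu.com', 'KelvinChai@gmail.com'])

# split on every '@' or '.': three columns
print(email.str.split(r'@|\.', expand=True).shape)       # (2, 3)

# n=1 stops after the first split: two columns
print(email.str.split(r'@|\.', expand=True, n=1).shape)  # (2, 2)
```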
Replacement

There are several ways to replace text: replace, slice_replace and repeat.

replace

replace is the most commonly used. Its parameters:

pat: the content to replace, either a plain string or a regular expression
repl: the replacement, either a string or a callable
regex: whether pat is treated as a regular expression; this defaulted to True in older pandas, but since pandas 2.0 the default is False, so pass regex=True explicitly when pat is a pattern
# replace 'com' in Email with 'cn'
df.Email.str.replace('com', 'cn')
------------------------
0         jordon@sohu.cn
1            Mike@126.cn
2    KelvinChai@gmail.cn
3          xiaoli@163.cn
4                    NaN
5             amei@qq.cn

For something more complex, pat can be written as a regular expression.
# replace everything before the @ with xxx
df.Email.str.replace('(.*?)@', 'xxx@', regex=True)
------------------
0    xxx@sohu.com
1      xxx@126.cn
2   xxx@gmail.com
3     xxx@163.com
4             NaN
5      xxx@qq.com

Alternatively, repl can be written as a callable.
df.Email.str.replace('(.*?)@', lambda x: x.group().upper(), regex=True)
-------------------------
0         JORDON@sohu.com
1             MIKE@126.cn
2    KELVINCHAI@gmail.com
3          XIAOLI@163.com
4                     NaN
5             AMEI@qq.com

Slice replacement
slice_replace replaces by position: slicing determines which characters are kept and which are replaced. Its parameters:

start: start position
stop: end position
repl: the replacement content

The characters from position start up to (but not including) stop are replaced. If stop is not set, everything from start onward is replaced; likewise, if start is not set, everything before stop is replaced.
df.Email.str.slice_replace(start=1,stop=2,repl='XX')
-------------------------
0         jXXrdon@sohu.com
1             MXXke@126.cn
2    KXXlvinChai@gmail.com
3          xXXaoli@163.com
4                      NaN
5             aXXei@qq.com
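The omitted-start and omitted-stop cases described above can be sketched like this (using two of the Email values):

```python
import pandas as pd

s = pd.Series(['jordon@sohu.com', 'amei@qq.com'])

# no stop: everything from position 6 onward is replaced
print(s.str.slice_replace(start=6, repl='...').tolist())
# ['jordon...', 'amei@q...']

# no start: everything before position 4 is replaced
print(s.str.slice_replace(stop=4, repl='#').tolist())
# ['#on@sohu.com', '#@qq.com']
```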
Repeat replacement

repeat repeats each string; the repeats parameter sets the number of repetitions.
df.name.str.repeat(repeats=2)
-------------------------
0 jordonjordon
1 MIKEMIKE
2 KelvinKelvin
3 xiaoLixiaoLi
4 qiqiqiqi
5        AmeiAmei

Concatenation

Text concatenation is done with the cat method. Its parameters:

others: the sequence(s) to concatenate; if not set, the values of the current Series are joined into a single string
sep: the separator
na_rep: missing values are ignored by default; set their replacement string here
join: how indexes are aligned, one of left, right, outer, inner; the default is left

There are several ways to concatenate.
1. Join a single Series into one string

As mentioned above, when the others parameter is not set, cat joins the values of the current Series into a single new string.
df.name.str.cat()
-------------------------------
'jordonMIKEKelvinxiaoLiqiqiAmei'
# set the separator to '-'
df.name.str.cat(sep='-')
-------------------------------
'jordon-MIKE-Kelvin-xiaoLi-qiqi-Amei'
# replace missing values with '*'
df.level.str.cat(sep='-',na_rep='*')
-----------------------
'high-Low-M-L-middle-*'

2. Concatenate a Series with another list-like object into a new Series

First concatenate the name column with '*', then with the level column, producing a new Series.
# chained str.cat calls concatenate multiple columns
df.name.str.cat(['*']*6).str.cat(df.level)
----------------
0 jordon*high
1 MIKE*Low
2 Kelvin*M
3 xiaoLi*L
4 qiqi*middle
5 NaN
3. Concatenate a Series with several columns at once

# others can also be a list of Series
df.name.str.cat([df.level, df.Email], na_rep='*')
--------------------------------
0      jordonhighjordon@sohu.com
1             MIKELowMike@126.cn
2    KelvinMKelvinChai@gmail.com
3          xiaoLiLxiaoli@163.com
4                    qiqimiddle*
5               Amei*amei@qq.com
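The join parameter is listed above but not demonstrated. A sketch with made-up, deliberately shuffled indexes: cat aligns others by index label, not by position.

```python
import pandas as pd

a = pd.Series(['x', 'y', 'z'], index=[0, 1, 2])
b = pd.Series(['1', '2', '3'], index=[2, 1, 0])

# join='left' (the default) aligns b on a's index
print(a.str.cat(b, join='left').tolist())   # ['x3', 'y2', 'z1']

# join='inner' keeps only the shared index labels
c = pd.Series(['9'], index=[1])
print(a.str.cat(c, join='inner').tolist())  # ['y9']
```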
Extraction

Text extraction is done mainly with the extract method. Its parameters:

pat: a regular expression whose capture groups define what to extract
flags: flags from the re module, such as re.IGNORECASE
expand: when the pattern has a single capture group, expand=True returns a DataFrame, otherwise a Series

# extract two pieces from Email
df.Email.str.extract(pat=r'(.*?)@(.*)\.com')
--------------------
            0      1
0      jordon   sohu
1         NaN    NaN
2  KelvinChai  gmail
3      xiaoli    163
4         NaN    NaN
5        amei     qq
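The expand parameter described above can be sketched like this (two reconstructed Email values; with a single capture group, expand controls the return type):

```python
import pandas as pd

email = pd.Series(['jordon@sohu.com', 'amei@qq.com'])

# expand=True (the default): a one-column DataFrame
out_df = email.str.extract(r'(.*?)@', expand=True)
print(type(out_df).__name__)  # DataFrame

# expand=False: a plain Series
out_s = email.str.extract(r'(.*?)@', expand=False)
print(out_s.tolist())         # ['jordon', 'amei']
```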
Lookup

Lookup is done with the find and findall methods.

find is simple: pass the substring to search for, and it returns its position within each string, or -1 if it is not found.
df['@position'] = df.Email.str.find('@')
df[['Email','@position']]
-------------------------------------
                  Email  @position
0       jordon@sohu.com        6.0
1           Mike@126.cn        4.0
2  KelvinChai@gmail.com       10.0
3        xiaoli@163.com        6.0
4                   NaN        NaN
5           amei@qq.com        4.0

The example above returns the position of @ within each Email value.

The other lookup method is findall. Its parameters:
pat: what to look for, as a regular expression
flags: flags from the re module, such as re.IGNORECASE

findall differs from find in that it supports regular expressions and returns full match details. In that sense it resembles extract and can also be used for extraction, though extract is more convenient.

df.Email.str.findall(r'(.*?)@(.*)\.com')
--------------------------
0         [(jordon, sohu)]
1                       []
2    [(KelvinChai, gmail)]
3          [(xiaoli, 163)]
4                      NaN
5             [(amei, qq)]

The example above returns both captured parts of the pattern, as a list of tuples per row.

Containment

Containment tests use the contains method, which returns Boolean values and is typically combined with loc row selection. Its parameters:

pat: the string or regular expression to match
case: whether matching is case sensitive; the default True is case sensitive
flags: flags from the re module, such as re.IGNORECASE
na: fill value for missing entries
regex: whether pat is a regular expression; the default is True
df.Email.str.contains('jordon|com',na='*')
----------
0 True
1 False
2 True
3 True
4 *
5 True
# combined with loc to filter rows
df.loc[df.Email.str.contains('jordon|com', na=False)]
------------------------------------------
     name  Age level                 Email  @position
0  jordon   18  high       jordon@sohu.com        6.0
2  Kelvin   45     M  KelvinChai@gmail.com       10.0
3  xiaoLi   23     L        xiaoli@163.com        6.0
5    Amei   62   NaN           amei@qq.com        4.0

One caveat: when combining with loc, the Boolean mask must not contain missing values, or an error is raised. Setting na=False treats missing values as non-matches and lets the query run.

Dummy variables

get_dummies automatically turns a column into dummy (indicator) variables, a method often used in feature derivation.
df.name.str.get_dummies()
-------------------------------
   Amei  Kelvin  MIKE  jordon  qiqi  xiaoLi
0     0       0     0       1     0       0
1     0       0     1       0     0       0
2     0       1     0       0     0       0
3     0       0     0       0     0       1
4     0       0     0       0     1       0
5     1       0     0       0     0       0

That's all for this share.
