
A detailed look at flashtext, a data cleaning tool for Python

Editor: Python

Contents

1. Set up the flashtext environment

2. Add keywords

3. Extract keywords

4. Replace keywords

5. Get all keywords

6. Add keywords in batch

7. Delete keywords in batch

8. Comparison of execution efficiency

In ordinary small-scale data filtering and cleaning, regular expressions are the most commonly used tool. As the data scale increases, however, regular expressions start to fall short.

Searching for 15k keywords in a 10k-word corpus takes a regular expression almost 0.165 seconds, whereas Flashtext needs only about 0.002 seconds. On this task, Flashtext is therefore roughly 82 times faster than regular expressions.

The performance comparison shows that as the number of keywords to search for grows, the processing time of regular expressions increases almost linearly, while that of Flashtext stays almost constant.
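One way to see why the lookup cost can stay flat is a trie (prefix tree) of keywords. The following is a minimal pure-Python sketch of that idea, not flashtext's actual implementation: the names build_trie, find_keywords and the END marker are made up for illustration, and only single-word keywords are handled. Each token in the text is checked by walking at most its own length in the trie, so the work per token does not depend on how many keywords are stored.

```python
import re

END = '_end_'  # hypothetical marker key meaning "a keyword ends at this node"


def build_trie(keywords):
    """Store each keyword character by character in nested dicts."""
    root = {}
    for word in keywords:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = word  # remember the original spelling
    return root


def find_keywords(text, trie):
    """Scan the text once, walking the trie for every word-like token."""
    found = []
    for token in re.findall(r'\w+', text.lower()):
        node = trie
        for ch in token:
            node = node.get(ch)
            if node is None:  # no keyword starts with these characters
                break
        else:
            if END in node:  # the whole token is a stored keyword
                found.append(node[END])
    return found


trie = build_trie(['Python', 'Scala', 'Go'])
print(find_keywords('I like Python and Scala.', trie))
# ['Python', 'Scala']
```

A regular expression built as one large alternation, by contrast, may have to try many alternatives at each position, which is where the growth with keyword count comes from.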

1. Set up the flashtext environment

Install flashtext with pip (other installation methods also work). The command below uses the Tsinghua University mirror.

pip install flashtext -i https://pypi.tuna.tsinghua.edu.cn/simple

With the flashtext environment ready, let's go through the main steps of using flashtext, which will help us carry out data cleaning more effectively.

2. Add keywords

Here keywords are added to the keyword processor one at a time, using the add_keyword function. The first parameter is the keyword to add, and the optional second parameter is an alias for that keyword. When the keyword is found in text, it is reported under its alias; if no alias is given, it is reported under its original name.

from flashtext import KeywordProcessor
# Initialize the keyword processor
processor = KeywordProcessor()
# Add a keyword in the normal way
processor.add_keyword('Python')
# Add a keyword with an alias: 'Scala' will be reported as 'Java'
processor.add_keyword('Scala', 'Java')

The required keywords have now been added to the keyword processor, one in each of the two ways.

3. Extract keywords

The keywords added in the previous step are now stored in the keyword processor, so extract_keywords can be used to extract them from text.

# Extract keywords from a string
found = processor.extract_keywords('I like Python and Scala.')
# Result
print(found)
# ['Python', 'Java']

The result is as expected, and Scala is reported as Java.

4. Replace keywords

Use the replace_keywords function to replace keywords. Each keyword found is replaced with its alias, so a visible change requires the keyword to have an alias in the keyword processor, just as Scala is displayed as Java above.

For example, replace the keyword Scala in a string: because the alias of Scala is Java, Scala in the string should be replaced by Java.

replaced = processor.replace_keywords('I like Scala.')
# Result
print(replaced)
# I like Java.
# Scala, if present, is replaced by Java.

5. Get all keywords

Sometimes you may not remember which keywords have been added to a KeywordProcessor. In that case, use the get_all_keywords function to get all the current keywords.

all_keywords = processor.get_all_keywords()
# Result
print(all_keywords)
# {'python': 'Python', 'scala': 'Java'}

6. Add keywords in batch

When the keyword processor needs many keywords, you can add them in batches from a list or a dictionary. The corresponding functions are add_keywords_from_list and add_keywords_from_dict.

# Initialize a dictionary for batch addition (alias -> list of keywords)
dict_ = {
    'java': ['java_ee', 'java_se', 'java_me'],
    'python': ['pandas', 'all']
}
# Add keywords in batch from the dictionary
processor.add_keywords_from_dict(dict_)
# Match against the keywords that were added in batch
result = processor.extract_keywords('looking for java_ee and pandas.')
# Result
print(result)
# ['java', 'python']
# Add keywords in batch from a list
processor.add_keywords_from_list(['scala', 'python', 'scala', 'go'])
# Use get_all_keywords to look at all the keywords
all_keywords = processor.get_all_keywords()
# Result
print(all_keywords)
# {'python': 'python', 'pandas': 'python', 'scala': 'scala', 'java_ee': 'java', 'java_se': 'java', 'java_me': 'java', 'all': 'python', 'go': 'go'}

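All the keywords have been added to the keyword processor, and duplicates are not added a second time. Note that re-adding python and scala from the list also reset their displayed names to python and scala, replacing the earlier Python and Java.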

7. Delete keywords in batch

There are likewise two ways to delete keywords from the keyword processor in batch: from a list or from a dictionary. The corresponding functions are remove_keywords_from_list and remove_keywords_from_dict.

# Remove keywords in batch from a list
processor.remove_keywords_from_list(['python', 'java_ee', 'java_me'])
# Remove keywords in batch from a dictionary
processor.remove_keywords_from_dict({'python': ['pandas', 'all']})
# Use get_all_keywords to look at all the keywords
all_keywords = processor.get_all_keywords()
# Result
print(all_keywords)
# {'scala': 'scala', 'java_se': 'java', 'go': 'go'}

All the keywords that were to be removed have been removed.

8. Comparison of execution efficiency

To show the difference more vividly, here are two charts comparing the efficiency of flashtext and regular expressions when searching for and replacing keywords, which make it clear at a glance.

Search efficiency: flashtext vs. regular expressions

Search-and-replace efficiency: flashtext vs. regular expressions

That concludes this detailed look at flashtext, a data cleaning tool for Python.


