
Fine-grained Chinese sentence segmentation in Python (based on regular expressions), and HarvestText: a text mining and preprocessing tool


1. Fine-grained Chinese sentence segmentation in Python (based on regular expressions)

Splitting Chinese text into sentences looks like a very simple job at first glance: in general, breaking at typical sentence-ending punctuation such as 【。!?】 seems to be enough.
For simple text this approach is already workable. For example, a concise implementation appears in this article:

Natural language processing learning 3: Chinese sentence splitting with re.split(), jieba word segmentation and word frequency statistics with FreqDist (zhuzuwei's blog, CSDN)

NLTK usage notes (NLTK is a commonly used Python natural language processing library)
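
As a rough illustration of that simple approach, splitting on the sentence-ending punctuation with `re.split()` might look like the sketch below. The linked article's own code is not reproduced here; the helper name `naive_cut` and the sample sentence are made up for this example.

```python
import re

def naive_cut(text):
    """Split on every 。!? and drop the empty pieces left behind."""
    # re.split() discards the delimiters, so the terminators themselves are lost.
    return [piece for piece in re.split(r'[。!?]', text) if piece]

# "The weather is nice today. Let's go to the park! What do you think?"
sample = '今天天气很好。我们去公园吧!你觉得呢?'
print(naive_cut(sample))
# ['今天天气很好', '我们去公园吧', '你觉得呢']
```

On plain prose like this the result is usable, although the punctuation is dropped from each piece.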

However, when I processed novel texts, I found holes in this approach:

  • For sentences containing double quotation marks, the break should be postponed until after the closing quotation mark, for example:

This morning, I went to the “secret base”.

  • The ellipsis is also a common sentence terminator, but it spans more than one character, which makes re.split() slightly awkward to use.
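
Both problems can be reproduced with a plain `re.split()` call. This is only a sketch: the two sample sentences are invented for illustration and do not come from the original post.

```python
import re

# "He said: “The weather is really nice today.” Then he left."
quoted = '他说:“今天天气真好。”然后他走了。'
quoted_pieces = re.split(r'[。!?]', quoted)
print(quoted_pieces)
# ['他说:“今天天气真好', '”然后他走了', '']
# The closing quote is cut off from its sentence and starts the next piece.

# "Then he left…… I said nothing."
with_ellipsis = '然后他走了……我没有说话。'
ellipsis_pieces = re.split(r'[。!?…]', with_ellipsis)
print(ellipsis_pieces)
# ['然后他走了', '', '我没有说话', '']
# Each … counts as its own delimiter, so an empty piece appears between the two.
```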

Therefore, here is a more refined solution that handles the problems above:

# Written for Python 3; on Python 2, prefix the string literals with u
import re

def cut_sent(para):
    """Split a Chinese paragraph into sentences, keeping closing quotes attached."""
    # Single-character terminators (。!? or ASCII ?) not followed by a closing quote
    para = re.sub(r'([。!?\?])([^”’])', r"\1\n\2", para)
    # English ellipsis: six ASCII dots
    para = re.sub(r'(\.{6})([^”’])', r"\1\n\2", para)
    # Chinese ellipsis: two … characters
    para = re.sub(r'(…{2})([^”’])', r"\1\n\2", para)
    # If a terminator comes right before a closing quote, the quote is the end of the
    # sentence, so the break \n goes after the quote. The rules above deliberately
    # leave such closing quotes attached.
    para = re.sub(r'([。!?\?][”’])([^,。!?\?])', r'\1\n\2', para)
    # Drop any extra \n left at the end of the paragraph
    para = para.rstrip()
    # Many rule sets also treat the semicolon as a terminator; it is ignored here, as are
    # dashes and English double quotes. Adjust the patterns if you need them.
    return para.split("\n")

Testing the result
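
A small run gives a feel for the behaviour. This is a sketch under two assumptions: the sample sentence is invented for this example, and the character classes use the full-width punctuation 。!?, as Chinese text does. The function is repeated so the snippet runs on its own.

```python
import re

def cut_sent(para):
    # Same rules as the function above, repeated so this snippet is self-contained.
    para = re.sub(r'([。!?\?])([^”’])', r'\1\n\2', para)   # single-character terminators
    para = re.sub(r'(\.{6})([^”’])', r'\1\n\2', para)       # English ellipsis
    para = re.sub(r'(…{2})([^”’])', r'\1\n\2', para)        # Chinese ellipsis
    # Terminator followed by a closing quote: break after the quote.
    para = re.sub(r'([。!?\?][”’])([^,。!?\?])', r'\1\n\2', para)
    return para.rstrip().split('\n')

# "He said: “The weather is really nice today.” Then he left…… I said nothing."
sample = '他说:“今天天气真好。”然后他走了……我没有说话。'
for sentence in cut_sent(sample):
    print(sentence)
# 他说:“今天天气真好。”
# 然后他走了……
# 我没有说话。
```

The closing quote stays with the quoted sentence and the two-character ellipsis is treated as a single terminator, which are the two cases a plain split on punctuation gets wrong.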

2. HarvestText: a text mining and preprocessing tool

HarvestText is a library that focuses on unsupervised (or weakly supervised) methods and can integrate domain knowledge (such as entity types and aliases) to process and analyze text from specific domains simply and efficiently. It suits many text preprocessing and preliminary exploratory analysis tasks, and has potential applications in novel analysis, web text, professional literature and other fields.

When processing data, besides splitting sentences you may first need to clean up special data formats, such as Weibo posts, HTML code, URLs and email addresses.

Conveniently, the developer of the HarvestText library has integrated a number of commonly used data preprocessing and cleaning operations into it.

GitHub: https://github.com/blmoistawinde/HarvestText

Gitee: https://gitee.com/dingding962285595/HarvestText

Documentation: Welcome to HarvestText's documentation! (HarvestText 0.8.1.7 documentation)

2.1 Text cleaning example:

from harvesttext import HarvestText

print("Text cleaning examples")
ht0 = HarvestText()
# The default settings are suited to cleaning Weibo text
text1 = "reply @Qian Xuming QXM:[Hee hee][Hee hee] //@Qian Xuming QXM: Brother Yang [good][good]"
print("Clean Weibo text [@ mentions and emoticons]")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1))
# Remove URLs
text1 = "【#Zhao Wei#: Preparing for the next movie But it's not a youth movie ....http://t.cn/8FLopdQ"
print("Clean URLs")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1, remove_url=True))
# Remove email addresses
text1 = "My email is [email protected], Welcome to contact"
print("Clean email addresses")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1, email=True))
# Decode URL escape characters
text1 = "www.%E4%B8%AD%E6%96%87%20and%20space.com"
print("URL to normal characters")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1, norm_url=True, remove_url=False))
# Encode normal characters as a URL
text1 = "www.中文 and space.com"
print("Normal characters to URL [requests containing Chinese characters and spaces need care]")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1, to_url=True, remove_url=False))
# Decode HTML escape characters (the entities in this sample were already decoded when the page was rendered)
text1 = "<a c> ''"
print("HTML to normal characters")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1, norm_html=True))
# Traditional Chinese to simplified Chinese (the sample below is shown in English translation)
text1 = "Who pays for heartbreak"
print("Traditional Chinese to simplified Chinese")
print("Original:", text1)
print("After cleaning:", ht0.clean_text(text1, t2s=True))

Result:

Text cleaning examples
Clean Weibo text [@ mentions and emoticons]
Original: reply @Qian Xuming QXM:[Hee hee][Hee hee] //@Qian Xuming QXM: Brother Yang [good][good]
After cleaning: Brother Yang
Clean URLs
Original: 【#Zhao Wei#: Preparing for the next movie But it's not a youth movie ....http://t.cn/8FLopdQ
After cleaning: 【#Zhao Wei#: Preparing for the next movie But it's not a youth movie ....
Clean email addresses
Original: My email is [email protected], Welcome to contact
After cleaning: My email is , Welcome to contact
URL to normal characters
Original: www.%E4%B8%AD%E6%96%87%20and%20space.com
After cleaning: www.中文 and space.com
Normal characters to URL [requests containing Chinese characters and spaces need care]
Original: www.中文 and space.com
After cleaning: www.%E4%B8%AD%E6%96%87%20and%20space.com
HTML to normal characters
Original: <a c> ''
After cleaning: <a c> ''
Traditional Chinese to simplified Chinese
Original: Who pays for heartbreak
After cleaning: Who pays for heartbreak
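
To make the URL and HTML conversions in the result above more concrete, here is a minimal standard-library sketch of the same kinds of transformation. It is not HarvestText's implementation: the helper names (`strip_urls`, `url_to_text`, `text_to_url`, `html_to_text`) are hypothetical, and the URL pattern is deliberately simplistic.

```python
import html
import re
from urllib.parse import quote, unquote

def strip_urls(text):
    """Remove http(s) links; anything up to the next whitespace counts as the link."""
    return re.sub(r'https?://\S+', '', text)

def url_to_text(text):
    """Decode percent-escapes such as %20 back to readable characters."""
    return unquote(text)

def text_to_url(text):
    """Percent-encode spaces and non-ASCII characters as UTF-8 bytes."""
    return quote(text)

def html_to_text(text):
    """Replace HTML entities such as &lt; with the characters they stand for."""
    return html.unescape(text)

print(strip_urls("But it's not a youth movie ....http://t.cn/8FLopdQ"))
# But it's not a youth movie ....
print(url_to_text("www.%E4%B8%AD%E6%96%87%20and%20space.com"))
# www.中文 and space.com
print(text_to_url("www.中文 and space.com"))
# www.%E4%B8%AD%E6%96%87%20and%20space.com
print(html_to_text("&lt;a c&gt;"))
# <a c>
```

A library such as HarvestText is still the more convenient choice when several of these steps, plus the Weibo-specific rules, need to be applied together.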


 

