
Python in practice: Chinese email classification with a support vector machine using bag-of-words and TF-IDF features

Editor: Python

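Note: this is a hands-on machine learning project (with data, code, documentation, and a video walkthrough). The data, code, documentation, and video can be obtained at the end of the article.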

 

1. Project background

With the growth of the Internet, more and more users communicate online, and e-mail has become an important tool for everyday communication. Users may receive hundreds of emails every week, most of them spam. According to estimates by Time magazine, people sent 776 billion emails in 1994, 2.6 trillion in 1997, and 6.6 trillion in 2000. The proliferation of email, and of spam in particular, has seriously hampered e-commerce activity. People spend a lot of time processing email with little to show for it, which disrupts normal business. Classifying mail sensibly and picking out the meaningful messages is therefore a pressing need for all users.

Most current email classification methods sort mail into spam and non-spam so that mail can be filtered automatically. This project uses bag-of-words features and TF-IDF features to classify Chinese emails with a support vector machine model. The categories are normal mail and spam.

2. Data collection

The data for this model comes from the Internet. The data items are summarized as follows:

Sample data (partial):

Normal mail:

Spam:

Each line represents an email.

3. Data preprocessing

3.1 Inspecting the data

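# corpus and labels are loaded earlier; remove_empty_docs is a helper from the project code.
print("Total number of samples:", len(labels))
corpus, labels = remove_empty_docs(corpus, labels)  # remove empty documents
print('Sample email:', corpus[10])
print('Label of the sample:', labels[10])
label_name_map = ["spam", "normal mail"]
print('Actual type:', label_name_map[int(labels[10])])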

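The result is shown in the figure:

4. Feature engineering

4.1 Splitting the dataset

The dataset is split into a 70% training set and a 30% test set.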
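
The 70/30 split can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration rather than the project's own code: the toy `corpus` and `labels`, the `random_state` value, and the variable names are all assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real corpus and labels (hypothetical data).
corpus = [f"email {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]  # alternating 0/1 class labels

# test_size=0.3 holds out 30% of the samples for testing.
train_corpus, test_corpus, train_labels, test_labels = train_test_split(
    corpus, labels, test_size=0.3, random_state=42
)

print(len(train_corpus), len(test_corpus))
```

Fixing `random_state` makes the split reproducible. For an imbalanced dataset, passing `stratify=labels` would keep the class proportions the same in both sets.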

4.2 Loading stop words

The stop-word list is as follows (partial):

4.3 Word segmentation

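import jieba

# text is one raw email string.
tokens = jieba.lcut(text)  # segment the text into a list of words
tokens = [token.strip() for token in tokens]  # trim surrounding whitespace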

The result is shown in the figure:

Training set after word segmentation:

Test set after word segmentation:

4.4 Removing special characters

import re
import string

# string.punctuation lists ASCII punctuation; re.escape escapes it for use in a character class.
pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
# Strip punctuation from each token and drop tokens that end up empty.
filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
filtered_text = ' '.join(filtered_tokens)

4.5 Removing stop words
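
The article does not show the stop-word removal code. A minimal sketch, assuming the tokens come from the segmentation step and the stop words are held in a set. The function name `remove_stopwords` and the three sample stop words are hypothetical.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are non-empty and not in the stop-word set."""
    return [tok for tok in tokens if tok and tok not in stopwords]


# Hypothetical stop words; the project loads its own list from a file.
stopwords = {"的", "了", "是"}

tokens = ["我", "的", "郵件", "是", "垃圾", "郵件", ""]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Keeping the stop words in a `set` makes each membership check constant time, which matters when the list has thousands of entries.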

4.6 Normalization

The result is shown in the figure:

Normalized training set:

Normalized test set:

4.7 Bag-of-words feature extraction

from sklearn.feature_extraction.text import CountVectorizer

# min_df: when building the vocabulary, ignore terms whose document frequency is strictly below this threshold.
# ngram_range=(1, 1) means that only single tokens (unigrams) are used.
ngram_range = (1, 1)
vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
features = vectorizer.fit_transform(corpus)  # learn the vocabulary and return the document-term matrix

The result is shown in the figure:

4.8 TF-IDF feature extraction
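
The article does not show the TF-IDF code. A minimal sketch using scikit-learn's `TfidfVectorizer`, assuming each document has already been segmented and joined with spaces as in the earlier steps. The toy corpus and the parameter choices are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: each document is already segmented and joined with spaces
# (hypothetical examples, not taken from the project's dataset).
corpus = ["發票 優惠 公司", "會議 明天 公司", "優惠 發票 聯繫"]

# norm="l2" scales every row to unit length; smooth_idf=True adds one to the
# document frequencies so that no term produces a division by zero.
vectorizer = TfidfVectorizer(min_df=1, norm="l2", smooth_idf=True,
                             use_idf=True, ngram_range=(1, 1))
tfidf_features = vectorizer.fit_transform(corpus)

print(tfidf_features.shape)            # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned terms

# Each row should have length 1 because of the L2 normalisation.
row_norms = np.linalg.norm(tfidf_features.toarray(), axis=1)
print(row_norms)
```

Test-set features would be produced with `vectorizer.transform(...)` so that they reuse the vocabulary and IDF weights learned from the training set. Note that the default `token_pattern` only keeps tokens of two or more word characters, so single-character Chinese words are dropped unless the pattern is changed.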

The result is shown in the figure:

5. Building the support vector machine classifier

5.1 Support vector machine based on bag-of-words features

5.2 Support vector machine based on TF-IDF features

# Support vector machine model based on TF-IDF features
print("Support vector machine model based on TF-IDF features")
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)
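
The helper `train_predict_evaluate_model` and the `svm` object are not defined in the article. The following is a sketch of what such a helper might look like, assuming it fits the classifier, predicts on the test features, prints the accuracy, and returns the predictions. The use of `LinearSVC` and the toy emails are assumptions made only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC


def train_predict_evaluate_model(classifier, train_features, train_labels,
                                 test_features, test_labels):
    """Fit the classifier, predict the test set, and report accuracy."""
    classifier.fit(train_features, train_labels)
    predictions = classifier.predict(test_features)
    print("accuracy:", accuracy_score(test_labels, predictions))
    return predictions


# Toy, pre-segmented emails (hypothetical): 0 = spam, 1 = normal mail.
train_corpus = ["優惠 發票 促銷", "免費 優惠 中獎", "會議 明天 報告", "項目 報告 會議"]
train_labels = [0, 0, 1, 1]
test_corpus = ["免費 發票", "明天 會議"]
test_labels = [0, 1]

vectorizer = TfidfVectorizer()
tfidf_train_features = vectorizer.fit_transform(train_corpus)
tfidf_test_features = vectorizer.transform(test_corpus)  # reuse the vocabulary

svm = LinearSVC()  # assumed classifier; the article does not name the class
svm_tfidf_predictions = train_predict_evaluate_model(
    classifier=svm,
    train_features=tfidf_train_features,
    train_labels=train_labels,
    test_features=tfidf_test_features,
    test_labels=test_labels,
)
```

The bag-of-words variant would call the same helper with features from a `CountVectorizer` in place of the TF-IDF features.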

6. Model evaluation

6.1 Evaluation metrics and results

The evaluation metrics are accuracy, precision, recall, and F1 score.
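
To make these four metrics concrete, here is a small sketch that computes them from the counts of a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
accuracy, precision, recall, f1 = binary_metrics(tp=8, fp=2, fn=2, tn=8)
print(accuracy, precision, recall, f1)
```

Because F1 is the harmonic mean of precision and recall, it is high only when both are high, which makes it more informative than accuracy when the classes are imbalanced.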

As the table above shows, both feature extraction methods reach the same accuracy of 97% and the same F1 score of 0.97, which indicates that the model performs well.

6.2 Classification report

The result is shown in the figure:

Classification report based on bag-of-words features:

The F1 score is 0.97 for spam and 0.97 for normal mail.

Classification report based on TF-IDF features:

The F1 score is 0.97 for spam and 0.97 for normal mail.

6.3 Confusion matrix

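import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Build a data frame from the confusion matrix cm (computed earlier).
# The labels assume predictions on the rows and actual classes on the columns.
cm_matrix = pd.DataFrame(data=cm, columns=['Actual :0', 'Actual :1'],
                         index=['Predict :0', 'Predict :1'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')  # show as a heat map
plt.show()  # display the figure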

The result is shown in the figure:

Confusion matrix based on bag-of-words features:

As the figure above shows, 91 emails predicted as spam are actually normal mail, and 5 emails predicted as normal mail are actually spam.

Confusion matrix based on TF-IDF features:

As the figure above shows, 89 emails predicted as spam are actually normal mail, and 6 emails predicted as normal mail are actually spam.
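
Reading these counts correctly depends on which axis holds the predictions. The article does not show how `cm` is computed. If it comes from scikit-learn's `confusion_matrix(y_true, y_pred)`, the rows are the actual classes and the columns are the predictions, so it would need to be transposed before being labelled with predictions on the rows. The sketch below illustrates this with hypothetical labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = spam, 1 = normal mail.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# scikit-learn convention: rows are actual classes, columns are predictions.
cm_actual_rows = confusion_matrix(y_true, y_pred)

# Transposing puts predictions on the rows, matching a data frame whose index
# is labelled "Predict" and whose columns are labelled "Actual".
cm_predict_rows = cm_actual_rows.T

print(cm_actual_rows.tolist())
print(cm_predict_rows.tolist())
```

In this toy example one spam email is predicted as normal mail, which appears at row 0, column 1 in the scikit-learn layout and at row 1, column 0 after transposing.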

7. Model predictions

Correctly classified emails:

Misclassified emails:

8. Summary and outlook

This project studies Chinese email classification with a support vector machine model using two feature extraction methods. Through data preprocessing, feature engineering, model building, and model evaluation, the final model reaches an F1 score of 0.97, a strong result for text classification that can be applied in practical work.

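The materials for this hands-on machine learning project are available as follows:

Project description:
Link: https://pan.baidu.com/s/1dW3S1a6KGdUHK90W-lmA4w
Extraction code: bcbp

If the network drive link has expired, you can add the author on WeChat: zy10178083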

