
fastText Python Tutorial


諸神緘默不語 - personal CSDN blog post index

Official fastText Python folder on GitHub: fastText/python at main · facebookresearch/fastText
This post is a basic tutorial for the fastText Python package, covering installation and simple usage.

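The sample Chinese text classification data used in this post comes from https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv. Beyond the steps shown here, further preprocessing such as stop-word removal is also possible.
More fastText Python example code: fastText/python/doc/examples at master · facebookresearch/fastText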

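Contents

1. Installing fastText
2. Training and using word vector models
- 2.1 Code
- 2.2 How it works
3. Text classification
- 3.1 Code
- 3.2 How it works
4. Model compression via quantization
5. Model attributes and methods
6. Other references not mentioned in the body or footnotes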

1. Installing fastText

First install numpy, scipy, and pybind11.

I use Anaconda, so the commands below favor conda. Installing with pip works much the same way.
numpy came along when I installed PyTorch, using the command conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch. To install numpy on its own, use conda install numpy.
Install SciPy: conda install scipy
Install pybind11, following the official documentation (Installing the library — pybind11 documentation): conda install -c conda-forge pybind11

With the prerequisites in place, install fastText: pip install fasttext
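Before training anything, it can help to confirm that the prerequisites are actually importable. This is only a minimal sketch using the standard library; the helper name `missing_packages` is my own and not part of fastText.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names as they are imported in Python.
required = ["numpy", "scipy", "pybind11", "fasttext"]
print("missing:", missing_packages(required))
```

An empty list means every prerequisite was found in the current environment.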

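2. Training and using word vector models

I have done this with gensim before; comparing the two packages is something for a later post.
The baseline in the fastText word vector paper is Google's official word2vec package: Google Code Archive - Long-term storage for Google Code Project Hosting.

2.1 Code

Official detailed tutorial: Word representations · fastText (it uses an English Wikipedia corpus; the experiments in this post use a Chinese corpus)

fastText has no built-in Chinese word segmentation, so the text must be segmented beforehand. The data-processing code below can serve as a reference: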

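import csv
import jieba

with open('data/cls/ChnSentiCorp_htl_all.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)  # header row
    # Each element is a [label, review] pair: an int label (0 or 1) and the review text.
    data = [[int(row[0]), row[1]] for row in reader]

tofiledir = 'data/cls'
with open(tofiledir + '/corpus.txt', 'w', encoding='utf-8') as f:
    # One review per line, with the segmented tokens separated by spaces.
    f.writelines([' '.join(jieba.cut(row[1])) + '\n' for row in data])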

Resulting file: [screenshot not preserved]

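Code to learn the word vectors and display them:

import fasttext
model = fasttext.train_unsupervised('data/cls/corpus.txt', model='skipgram')  # the model argument can be changed to 'cbow'
print(model.words[:10])  # print the first 10 words
print(model[model.words[9]])  # print the vector of the 10th word

(A word vector can also be retrieved with get_word_vector(word), which works even for words that do not appear in the data, because a word vector is built from the vectors of its substrings.)

Output: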

Read 0M words
Number of words: 6736
Number of labels: 0
Progress: 100.0% words/sec/thread: 71833 lr: 0.000000 avg.loss: 2.396854 ETA: 0h 0m 0s
['，', '的', '。', ',', '了', '酒店', '是', '</s>', '很', '房間']
[ 1.44523270e-02 -1.14391923e-01 -1.31457284e-01 -1.59686044e-01
-4.57017310e-02 2.04045177e-01 2.00106978e-01 1.63031772e-01
1.71287894e-01 -2.93396801e-01 -1.01871997e-01 2.42363811e-01
2.78942972e-01 -4.99058776e-02 -1.27043173e-01 2.87460908e-02
3.73771787e-01 -1.69842303e-01 2.42533281e-01 -1.82482198e-01
7.33817369e-02 2.21920848e-01 2.17794716e-01 1.68730497e-01
2.16873884e-02 -3.15452456e-01 8.21631625e-02 -6.56387508e-02
9.51113254e-02 1.69942483e-01 1.13980576e-01 1.15132451e-01
3.28856230e-01 -4.43856061e-01 -5.13903908e-02 -1.74580872e-01
4.39242758e-02 -2.22267807e-01 -1.09185934e-01 -1.62346154e-01
2.11286068e-01 2.44934723e-01 -1.95910111e-02 2.33887792e-01
-7.72107393e-02 -6.28366888e-01 -1.30844399e-01 1.01614185e-01
-2.42928267e-02 4.28218693e-02 -3.78409088e-01 2.31552869e-01
3.49486321e-02 8.70033056e-02 -4.75800633e-01 5.37340902e-02
2.29140893e-02 3.87787819e-04 -5.77102527e-02 1.44286081e-03
1.33415654e-01 2.14263964e-02 9.26891491e-02 -2.24226922e-01
7.32692927e-02 -1.52607411e-01 -1.42978013e-01 -4.28122580e-02
9.64387357e-02 7.77726322e-02 -4.48957413e-01 -6.19397573e-02
-1.22236833e-01 -6.12100661e-02 -5.51685333e-01 -1.35704070e-01
-1.66864052e-01 7.26311505e-02 -4.55838069e-02 -5.94963729e-02
1.23811573e-01 6.13824800e-02 2.12341957e-02 -9.38200951e-02
-1.40030123e-03 2.17677400e-01 -6.04508296e-02 -4.68601920e-02
2.30288744e-01 -2.68855840e-01 7.73726255e-02 1.22143216e-01
3.72817874e-01 -1.87924504e-01 -1.39104724e-01 -5.74962497e-01
-2.42888659e-01 -7.35510439e-02 -6.01616681e-01 -2.18178451e-01]

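Checking the quality of the word vectors: search for a word's nearest neighbors (nn) to get an intuitive feel for the semantic information the vectors capture. (In the official tutorial this works even for misspelled English words; I am not sure how to try that with Chinese, so I skipped it.)
(Distances between vectors are computed with cosine similarity.)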

print(model.get_nearest_neighbors('房間'))

Output: [(0.804237425327301, '小房間'), (0.7725597023963928, '房屋'), (0.7687026858329773, '盡頭'), (0.7665393352508545, '第一間'), (0.7633816599845886, '但床'), (0.7551409006118774, '成舊'), (0.7520463466644287, '屋子裡'), (0.750516414642334, '壓抑'), (0.7492958903312683, '油漆味'), (0.7476236820220947, '知')]
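To make the cosine-similarity ranking concrete, here is a small pure-Python sketch of a nearest-neighbor search. The vectors are made up for illustration and the function names are mine; this is not how fastText implements the search internally, only the same idea on toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=3):
    """Rank every word except `query` by cosine similarity to it."""
    scored = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scored, reverse=True)[:k]


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "room": [0.9, 0.1, 0.0],
    "suite": [0.8, 0.2, 0.1],
    "breakfast": [0.1, 0.9, 0.2],
    "pool": [0.0, 0.3, 0.9],
}
for score, word in nearest_neighbors("room", toy_vectors):
    print(f"{score:.3f} {word}")
```

The result has the same (score, word) shape as the list printed by get_nearest_neighbors above.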

Word analogies (predict the word whose relation to the third word mirrors the relation between the first two):

print(model.get_analogies('房間','壓抑','環境'))

Output: [(0.7665581703186035, '優越'), (0.7352521419525146, '地理位置'), (0.7330452799797058, '安靜'), (0.7157530784606934, '周邊環境'), (0.7050396800041199, '自然環境'), (0.6963807344436646, '服務到位'), (0.6960451602935791, '也好'), (0.6948464512825012, '優雅'), (0.6906660795211792, '地點'), (0.6869651079177856, '地理')]
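As far as I understand it, an analogy query boils down to vector arithmetic: take the (normalized) vector of the first word, subtract the second, add the third, and look for the nearest remaining word. The sketch below shows that idea on hand-made vectors; the function names and the data are assumptions of mine, not the library's code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def analogies(word_a, word_b, word_c, vectors, k=1):
    """Rank words by similarity to a - b + c, skipping the three query words."""
    a, b, c = (normalize(vectors[w]) for w in (word_a, word_b, word_c))
    query = [x - y + z for x, y, z in zip(a, b, c)]
    banned = {word_a, word_b, word_c}
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word not in banned]
    return sorted(scored, reverse=True)[:k]


# Toy vectors: three "country" dimensions plus one "is a capital" dimension.
toy = {
    "germany": [1, 0, 0, 0], "berlin": [1, 0, 0, 1],
    "france": [0, 1, 0, 0], "paris": [0, 1, 0, 1],
    "italy": [0, 0, 1, 0], "rome": [0, 0, 1, 1],
}
print(analogies("berlin", "germany", "france", toy))
```

With these toy vectors the top answer for ("berlin", "germany", "france") is "paris", which mirrors the example in the official tutorial.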

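Other functions:

model.save_model(path)
fasttext.load_model(path) returns the model

Other parameters of fasttext.train_unsupervised():

dim: vector dimension (default 100; values from 100 to 300 are common)
minn, maxn: minimum and maximum subword (character n-gram) lengths (default 3 to 6)
epoch (default 5)
lr: a higher learning rate converges faster but risks overfitting (default 0.05; the usual range is [0.01, 1])
thread: number of threads (the official tutorial gives a default of 12; the parameter list below gives the number of CPUs)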

 input # training file path (required)
model # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr # learning rate [0.05]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [5]
minn # min length of char ngram [3]
maxn # max length of char ngram [6]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [ns]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
verbose # verbose [2]

fastText provides official pretrained 300-dimensional word vectors for many languages: Wiki word vectors · fastText. Newer release: Word vectors for 157 languages · fastText

2.2 How it works

If you use these word vectors in a paper, cite: Enriching Word Vectors with Subword Information

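skipgram and cbow probably need little introduction, as they are common knowledge in NLP. skipgram predicts the target word from a single randomly chosen nearby word, while cbow predicts it from the whole context (the words inside a window, for example by summing their vectors).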
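The difference between the two is easiest to see in the training examples each one produces. The sketch below is a simplification of mine: it uses a fixed window and enumerates every nearby word instead of sampling one, and it leaves out details of real training such as negative sampling.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_examples(tokens, window):
    """One (nearby_word, target) pair per context word."""
    return [(ctx, target)
            for i, target in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window):
    """One (all_context_words, target) example per position."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


sentence = "the hotel room was clean".split()
print(skipgram_examples(sentence, window=1)[:4])
print(cbow_examples(sentence, window=1)[:2])
```

skipgram yields many small one-word examples, while cbow yields one example per position that pools the whole window.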
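fastText word vectors also take subword information into account: a word is represented by combining the representations of its substrings. This captures richer information than word-level vectors alone while remaining fast to train, and it can produce vectors for words that never appear in the training corpus.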
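To show what "substrings" means here, the sketch below extracts character n-grams the way the paper describes: the word is wrapped in the boundary markers < and >, and every substring with a length between minn and maxn is kept. The function name `char_ngrams` is my own, and the real library additionally hashes these n-grams into buckets, which I leave out.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' with lengths from minn to maxn."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("酒店"))          # ['<酒店', '酒店>', '<酒店>']
```

An unseen word still shares n-grams with known words, which is why a vector can be produced for it. Short Chinese words yield very few n-grams with the default lengths.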

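3. Text classification

3.1 Code

Official detailed tutorial: Text classification · fastText (it uses an English cooking dataset from Stack Exchange)

First convert the raw data into fastText's classification format: Chinese text has to be segmented manually, and each label starts with __label__. Because fastText only provides training and testing routines, I split the data into just a training set and a test set. Reference code: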

import csv
import random
import jieba

with open('data/cls/ChnSentiCorp_htl_all.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)  # header row
    # Each element is a list of two strings: the label ('0' or '1') and the review text.
    data = [[row[0], row[1]] for row in reader]

tofiledir = 'data/cls'
# Randomly split into an 80% training set and a 20% test set.
random.seed(14560704)
random.shuffle(data)
split_point = int(len(data) * 0.8)
with open(tofiledir + '/train.txt', 'w', encoding='utf-8') as f:
    train_data = data[:split_point]
    f.writelines([' '.join(jieba.cut(row[1])) + ' __label__' + row[0] + '\n' for row in train_data])
with open(tofiledir + '/test.txt', 'w', encoding='utf-8') as f:
    test_data = data[split_point:]
    f.writelines([' '.join(jieba.cut(row[1])) + ' __label__' + row[0] + '\n' for row in test_data])

Sample of the resulting file: [screenshot not preserved]
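The line format itself is simple enough to sketch without jieba or fastText: space-separated tokens plus one or more tokens that start with the label prefix. The two helpers below are hypothetical names of mine. They follow the layout used in this post, with the label at the end of the line.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label, tokens):
    """Join pre-segmented tokens and append the prefixed label."""
    return " ".join(tokens) + " " + LABEL_PREFIX + str(label)


def parse_fasttext_line(line):
    """Split a line back into (labels, words) based on the label prefix."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, words


line = to_fasttext_line(1, ["酒店", "很", "好"])
print(line)                       # 酒店 很 好 __label__1
print(parse_fasttext_line(line))  # (['1'], ['酒店', '很', '好'])
```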

Train the classifier, evaluate it, and print the test results:

import fasttext
model=fasttext.train_supervised('data/cls/train.txt')
print(model.words[:10])
print(model.labels)
print(model.test('data/cls/test.txt'))
print(model.predict('酒店 環境 還 可以 ， 服務 也 很 好 ， 就是 房間 的 衛生 稍稍 馬虎 了 一些 ， 坐便器 擦 得 不是 十分 干淨 ， 其它 方面 都 還好 。 尤其 是 早餐 ， 在 我 住 過 的 四星 酒店 裡 算是 花樣 比較 多 的 了 。 因為 游泳池 是 在 室外 ， 所以 這個 季節 去 了 怕冷 的 人 就 沒有 辦法 游泳 。 補充 點評 2007 年 11 月 16 日 ： 服務 方面 忘 了 說 一點 ， 因為 我落 了 一樣 小東西 在 酒店 ， 還 以為 就算 了 ， 沒想到 昨天 離開 ， 今天 就 收到 郵件 提醒 我 說 我 落 了 東西 ， 問 我 需要 不 需要 他們 給 寄 回來 ， 這 一點 比 有些 酒店 要 好 很多 。'))

Output:

Read 0M words
Number of words: 26133
Number of labels: 2
Progress: 100.0% words/sec/thread: 397956 lr: 0.000000 avg.loss: 0.353336 ETA: 0h 0m 0s
['，', '的', '。', ',', '了', '酒店', '是', '</s>', '很', '房間']
['__label__1', '__label__0']
(1554, 0.8783783783783784, 0.8783783783783784)
(('__label__1',), array([0.83198541]))

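The values returned by test() are, in order: the number of samples, precision@1, and recall@1.
(precision@1 is roughly the fraction of top-scoring predicted labels that are correct; see: IR-ratio: Precision-at-1 and Reciprocal Rank. recall@1 is the fraction of true labels that get predicted. Here every sample has exactly one label and one prediction, which is why the two values are equal.)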
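Here is how I read those two numbers, written out as a tiny pure-Python function. It is a sketch on invented data, not the library's evaluation code. Precision divides the number of correct predictions by the number of predictions made, and recall divides it by the number of true labels.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    n_correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return n_correct / n_predicted, n_correct / n_gold


# Four invented samples with one true label each and a ranked list of predictions.
gold = [{"1"}, {"0"}, {"1"}, {"0"}]
predicted = [["1", "0"], ["1", "0"], ["1", "0"], ["0", "1"]]
print(precision_recall_at_k(gold, predicted, k=1))  # (0.75, 0.75)
print(precision_recall_at_k(gold, predicted, k=2))  # (0.5, 1.0)
```

With one true label per sample and k=1 the two denominators are the same, so precision and recall coincide. With k=2 they diverge.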

predict() also accepts a list of strings.

The k parameter of test() and predict() sets how many labels are returned; the default is 1.
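A rough way to picture k, together with the threshold argument of predict(): sort the labels by probability, drop those below the threshold, and keep at most k of them, where k=-1 means no limit. The function below is my own illustration on invented probabilities. It assumes this is how the selection behaves and does not call fastText.

```python
def select_labels(probabilities, k=1, threshold=0.0):
    """Pick up to k labels with probability >= threshold; k=-1 keeps them all."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, p) for label, p in ranked if p >= threshold]
    if k != -1:
        kept = kept[:k]
    labels = tuple(label for label, _ in kept)
    probs = tuple(p for _, p in kept)
    return labels, probs


scores = {"__label__1": 0.83, "__label__0": 0.17}  # invented probabilities
print(select_labels(scores))                         # (('__label__1',), (0.83,))
print(select_labels(scores, k=-1))                   # both labels, best first
print(select_labels(scores, k=-1, threshold=0.5))    # (('__label__1',), (0.83,))
```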

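Saving and loading model files works the same way as for the word vector model in section 2.

Other parameters of train_supervised():

epoch (default 5)
lr (values from 0.1 to 1 work well)
wordNgrams: use word n-grams instead of only unigrams (important for classification tasks where word order matters, such as sentiment analysis)
bucket
dim
loss
- hs (hierarchical softmax) can replace the standard softmax to speed things up
  hierarchical softmax: I read about it without fully understanding it; roughly, the labels are arranged in a binary tree, so the cost grows logarithmically with the number of labels rather than linearly. fastText uses a Huffman tree, which gives the optimal average lookup time. Official fastText explanation: https://fasttext.cc/docs/en/supervised-tutorial.html#advanced-readers-hierarchical-softmax There is also a YouTube lecture: Neural networks [10.7] : Natural language processing - hierarchical output layer - YouTube
- one-vs-all, or ova: the multi-label setting, where each label is modeled as its own independent binary classification task (a lower learning rate than for the other loss functions is recommended; in predict(), pass k=-1 to return as many predictions as possible and threshold to keep only labels scoring above the threshold; in test(), just pass k=-1)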
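The Huffman-tree idea behind hierarchical softmax can be sketched with the standard library. The code below builds a Huffman tree over invented label counts and reports how deep each label sits. It only shows why frequent labels are cheap to reach and is not fastText's implementation.

```python
import heapq
import itertools


def huffman_code_lengths(frequencies):
    """Return {label: depth in the Huffman tree} for a dict of label counts."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(freq, next(counter), {label: 0}) for label, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every label inside moves one level down.
        freq_a, _, depths_a = heapq.heappop(heap)
        freq_b, _, depths_b = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (freq_a + freq_b, next(counter), merged))
    return heap[0][2]


# Invented label counts, loosely in the spirit of the cooking dataset.
counts = {"baking": 1000, "bread": 500, "sauce": 250, "tea": 125, "salt": 125}
print(huffman_code_lengths(counts))
# {'baking': 1, 'bread': 2, 'sauce': 3, 'tea': 4, 'salt': 4}
```

The most frequent label is one step from the root, so the average path length weighted by frequency stays short even as the number of labels grows.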

 input # training file path (required)
lr # learning rate [0.1]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [1]
minCountLabel # minimal number of label occurrences [1]
minn # min length of char ngram [0]
maxn # max length of char ngram [0]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [softmax]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
label # label prefix ['__label__']
verbose # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []

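3.2 How it works

If you use the text classification feature in a paper, cite: Bag of Tricks for Efficient Text Classification

It strikes me as a fairly intuitive, simple model: compute the word vectors, average them, and compute the output label from that average. Details to be filled in later.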
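Until I fill in the details, here is a toy forward pass of that idea: average the word vectors, apply one linear layer, and take a softmax. The embeddings and weights are invented numbers. The sketch ignores n-gram features, hashing, and training altogether, and it assumes at least one token has a known vector.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, weights, labels):
    """Average the known word vectors, apply a linear layer, return (label, probs)."""
    known = [embeddings[t] for t in tokens if t in embeddings]  # unknown tokens are skipped
    hidden = average(known)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Invented 2-dimensional embeddings and output weights, one weight row per label.
embeddings = {"clean": [1.0, 0.0], "friendly": [0.8, 0.1],
              "dirty": [0.0, 1.0], "noisy": [0.1, 0.9]}
weights = [[2.0, -2.0], [-2.0, 2.0]]
labels = ["__label__1", "__label__0"]
print(classify(["clean", "friendly", "staff"], embeddings, weights, labels))
```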

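4. Model compression via quantization

# Using the supervised `model` trained above.
def print_results(N, p, r):  # small helper, as in the official tutorial
    print("N\t" + str(N))
    print("P@1\t{:.3f}".format(p))
    print("R@1\t{:.3f}".format(r))

model.quantize(input='data/cls/train.txt', retrain=True)  # retrain on the training file from section 3
# then display results and save the new model:
print_results(*model.test('data/cls/test.txt'))
model.save_model("model_filename.ftz")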

5. Model attributes and methods

Methods:

 get_dimension # Get the dimension (size) of a lookup vector (hidden layer).
# This is equivalent to `dim` property.
get_input_vector # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix # Get a copy of the full input matrix of a Model.
get_labels # Get the entire list of labels of the dictionary
# This is equivalent to `labels` property.
get_line # Split a line of text into words and labels.
get_output_matrix # Get a copy of the full output matrix of a Model.
get_sentence_vector # Given a string, get a single vector representation. This function
# assumes to be given a single line of text. We split words on
# whitespace (space, newline, tab, vertical tab) and the control
# characters carriage return, formfeed and the null character.
get_subword_id # Given a subword, return the index (within input matrix) it hashes to.
get_subwords # Given a word, get the subwords and their indices.
get_word_id # Given a word, get the word id within the dictionary.
get_word_vector # Get the vector representation of word.
get_words # Get the entire list of words of the dictionary
# This is equivalent to `words` property.
is_quantized # whether the model has been quantized
predict # Given a string, get a list of labels and a list of corresponding probabilities.
quantize # Quantize the model, reducing the size of the model and its memory footprint.
save_model # Save the model to the given path
test # Evaluate supervised model using file given by path
test_label # Return the precision and recall score for each label.

model.words # equivalent to model.get_words()
model.labels # equivalent to model.get_labels()

model['king'] # equivalent to model.get_word_vector('king')
'king' in model # equivalent to `'king' in model.get_words()`

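6. Other references not mentioned in the body or footnotes

NLP實戰之Fasttext中文文本分類_vivian_ll的博客-CSDN博客_fasttext 中文: this one removes stop words and also covers computing word vectors with the gensim package.
[原創]《使用 fastText 做中文文本分類》文章合集 – 編碼無悔 / Intent & Focused: this one is implemented in Java; because the dataset is large, the author wanted to use map-reduce. The labels were obtained by calling Tencent Cloud's free text classification API……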
關於文本分類（情感分析）的中文數據集匯總_櫻與刀的博客-CSDN博客_情感分析數據集
Python3讀取CSV數據_柿子鐳的博客-CSDN博客_python3讀取csv文件
python讀取csv時skipinitialspace參數的使用_vanlywang的博客-CSDN博客