
[Xinghai essays] A Python recommendation function, starting from zero fundamentals

pip install nltk
pip install cufflinks

nltk is a Python toolkit for natural language processing. It covers tokenization, part-of-speech (POS) tagging, text classification, and more, and is an easy-to-use, ready-made tool. At present, however, its tokenization module only supports English; Chinese word segmentation is not supported.
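
For example, here is a minimal sketch of English tokenization and POS tagging with nltk (the sample sentence is illustrative; the download calls fetch the required models on first run):

import nltk
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model

tokens = nltk.word_tokenize("The hotel overlooks Puget Sound.")
print(tokens)                # ['The', 'hotel', 'overlooks', 'Puget', 'Sound', '.']
print(nltk.pos_tag(tokens))  # e.g. [('The', 'DT'), ('hotel', 'NN'), ('overlooks', 'VBZ'), ...]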

cufflinks binds the Plotly charting library directly to pandas DataFrames, adding an .iplot() method for interactive charts. (It is unrelated to the bioinformatics Cufflinks/cuffdiff suite for analyzing transcript expression, splicing, and promoter use.)

import pandas as pd
import numpy as np
import re
import random
import matplotlib.pyplot as plt   # used for the fallback bar charts below
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import cufflinks
from plotly.offline import iplot
cufflinks.go_offline()

Load the data and take a look at its format:

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()

View row 100 of the data, all columns, then print the desc column of that row in full:

df.iloc[100:101, :]
df['desc'][100]

Count all the words and store the result. CountVectorizer is a word counter (a bag-of-words model):

vec = CountVectorizer().fit(df['desc'])
bag_of_words = vec.transform(df['desc'])
bag_of_words.shape
# (152, 3200): 152 documents, with a vocabulary of 3200 distinct words

For details, see the sklearn API documentation.
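
To see what fit and transform actually produce, here is a minimal sketch on a two-sentence toy corpus (the sentences are illustrative):

toy = ["the quiet hotel", "the hotel near the lake"]
demo_vec = CountVectorizer().fit(toy)
print(demo_vec.vocabulary_)
# e.g. {'the': 4, 'quiet': 3, 'hotel': 0, 'near': 2, 'lake': 1}
print(demo_vec.transform(toy).toarray())
# [[1 0 0 1 1]
#  [1 1 1 0 2]]  -> rows are documents, columns are vocabulary words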

Sum each word's count over all documents, then sort by frequency:

sum_words = bag_of_words.sum(axis=0)   # total count of each word over all documents
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

Wrap the steps above into a function that returns the n most frequent words:

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

Get the top 20 words after sorting:

common_words = get_top_n_words(df['desc'], 20)

Put the result into a DataFrame for easier handling:

df1 = pd.DataFrame(common_words, columns=['desc', 'count'])

A DataFrame can normally be plotted directly with iplot(), but for some reason that failed on my version, so I convert the data to lists and plot with matplotlib instead.

# Convert the two columns to plain lists and draw a horizontal bar chart.
words = list(df1['desc'])
counts = list(df1['count'])
plt.barh(words, counts)
plt.show()
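
For reference, the cufflinks one-liner the text refers to would look roughly like this (a sketch, assuming offline mode was enabled as above and a compatible plotly version):

df1.groupby('desc').sum()['count'].sort_values().iplot(
    kind='barh', title='Top 20 words in hotel descriptions')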

To recap what we did with the data: split each description into words, count the repeated words, sort, and plot the result. You can also replace .iplot with .plot, as in the following code.

common_words = get_top_n_words(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns=['desc', 'count'])
df3.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='top 20 before removing stopwords')

Add a word-count column to the data:

df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
# apply feeds each description in the Series to the function;
# the lambda splits the text on whitespace and returns the word count
df['word_count'].plot(kind='hist', bins=50)
# histogram of description lengths
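
As a quick illustration of what apply with a lambda does here (the toy Series is illustrative):

demo = pd.Series(['quiet hotel', 'hotel near the lake'])
print(demo.apply(lambda x: len(str(x).split())))
# 0    2
# 1    4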

Note: the code below needs the NLTK stopword list downloaded first.

import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus
sub_replace = re.compile('[^0-9a-z #+_]')
stopwords = set(stopwords.words('english'))
def clean_txt(text):
    text = text.lower()               # lowercase first so the regex below matches
    text = sub_replace.sub('', text)  # keep only a-z, 0-9, space, #, + and _
    text = ' '.join(word for word in text.split() if word not in stopwords)
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)
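
A quick check of the cleaner on a made-up string (not from the dataset):

print(clean_txt("Stay at the BEST hotel, near Pike Place Market!"))
# -> 'stay best hotel near pike place market'  ('at' and 'the' are stopwords)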

The stopword filter can also be omitted; the cleaner then only lowercases, strips punctuation, and normalizes whitespace:

def clean_txt(text):
    text = text.lower()
    text = sub_replace.sub('', text)
    text = ' '.join(word for word in text.split())  # collapse repeated whitespace
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)

Set the hotel name as the index:

df.set_index('name',inplace = True)

Vectorize the descriptions with TF-IDF, which weights terms according to the following formula:

tf(t, d) is the term frequency: how often term t occurs in document d. As the formula shows, it depends on both the term and the document.
n_d is the number of documents in the training set.
df(d, t) is the number of documents that contain term t, so the idf value depends on both the total number of documents and on how many of them contain t.
idf improves on a plain frequency vector: it considers not only how often a word occurs within one document but also how common the word is across documents in general. A word that appears in almost every document carries little classification information, for example function words such as "the" or "of".
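
With its defaults (smooth_idf=True), sklearn computes idf(t) = ln((1 + n_d) / (1 + df(t))) + 1. A small sketch to check this against a fitted vectorizer (the toy corpus is illustrative; the learned vocabulary is ordered alphabetically):

toy = ["green hotel", "green lake", "green green view"]
demo_tf = TfidfVectorizer().fit(toy)

n_d = len(toy)                   # 3 documents
df_t = np.array([3, 1, 1, 1])    # document counts for 'green', 'hotel', 'lake', 'view'
manual_idf = np.log((1 + n_d) / (1 + df_t)) + 1
print(np.allclose(manual_idf, demo_tf.idf_))  # True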

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
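
With ngram_range=(1, 3), every unigram, bigram, and trigram left after stop-word removal becomes a feature. A toy illustration (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names):

demo = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
demo.fit(["quiet hotel by the lake"])
print(demo.get_feature_names_out())
# ['hotel' 'hotel lake' 'lake' 'quiet' 'quiet hotel' 'quiet hotel lake']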

Pass desc_clean in and compute the TF-IDF matrix, which assigns a numeric vector to each hotel description:

tfidf_matrix = tf.fit_transform(df['desc_clean'])
# shape: (number of hotels, number of 1- to 3-gram features)

Compute pairwise similarities with a linear kernel:

cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(df.index)   # integer position -> hotel name
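
Because TfidfVectorizer L2-normalizes each row by default (norm='l2'), this plain dot product is already the cosine similarity; a quick sanity check (sklearn's cosine_similarity is aliased to avoid clashing with the variable above):

from sklearn.metrics.pairwise import cosine_similarity as sk_cosine
print(np.allclose(cosine_similarity, sk_cosine(tfidf_matrix)))  # True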

Now write the recommendation function:

def recommendations(name, cosine_similarity):
    recommended_hotels = []
    # position of the requested hotel in the index
    idx = indices[indices == name].index[0]
    # similarity of this hotel to every other hotel, highest first
    score_series = pd.Series(cosine_similarity[idx]).sort_values(ascending=False)
    # skip position 0 (the hotel itself) and keep the next ten
    top_10_indexes = list(score_series[1:11].index)
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])
    return recommended_hotels

Pass in the exact hotel name to get ten recommendations; the second argument is the similarity matrix computed by the algorithm above, which pairs a numeric vector with each exact name.

recommendations('Hilton Garden Seattle Downtown', cosine_similarity)
