Python Project Practice - 00. Sharing Data Analysis Test Questions


  • Python Project Practice - 00. Sharing Data Analysis Test Questions
    • 1. Basics of probability theory and statistics
    • 2. Python
    • 3. Data analysis thinking
    • 4. Titanic Survival Data Feature Processing
      • 4.1 Combine training and test sets
      • 4.2 Handling missing values
      • 4.3 Data processing for different feature fields
      • 4.4 Predicting missing Age values with a random forest
      • 4.5 Ranking features by their correlation coefficient with Survived

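You can follow share16 on Zhihu or on the WeChat official account; this article will be updated there as well.

Foreword
This article summarizes some of the questions I answered incorrectly, together with the related knowledge points, after taking part in the 和鯨社區 (Heywhale community) "Breakthrough data analysis" activity. It is shared for your reference. If anything here infringes on your rights, please contact me promptly.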

1. Basics of probability theory and statistics

The topics covered include the binomial distribution, the hypergeometric distribution, and others.
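As a quick refresher on the two distributions named above, here is a minimal sketch that uses only the standard library. The function names `binomial_pmf` and `hypergeom_pmf` and the example numbers are illustrative assumptions; they are not taken from the original quiz.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials, each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n items without replacement
    from N items, K of which count as successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Exactly 2 heads in 4 fair coin flips
print(binomial_pmf(2, 4, 0.5))     # 0.375
# Exactly 1 red ball when drawing 2 balls from 3 red + 2 blue
print(hypergeom_pmf(1, 5, 3, 2))   # 0.6
```

The practical difference is sampling with replacement (binomial) versus without replacement (hypergeometric).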

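2. Python

The topics covered include perfect numbers, narcissistic numbers, palindromic primes, the Fibonacci sequence, counting-off problems, prime factorization (including the greatest common divisor and least common multiple), the Series.str functions, the Series.dt functions, and binning (cut/qcut).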
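A few of the number puzzles listed above can be sketched in plain Python as follows. The helper names (`is_perfect`, `is_narcissistic`, `prime_factors`, `lcm`) are hypothetical and chosen for illustration; the original quiz answers are not reproduced here.

```python
from math import gcd

def is_perfect(n):
    """A perfect number equals the sum of its proper divisors (6 = 1 + 2 + 3)."""
    return n > 1 and sum(d for d in range(1, n) if n % d == 0) == n

def is_narcissistic(n):
    """A narcissistic number equals the sum of its digits,
    each raised to the power of the digit count (153 = 1^3 + 5^3 + 3^3)."""
    digits = str(n)
    return n == sum(int(d) ** len(digits) for d in digits)

def prime_factors(n):
    """Return the prime factorization of n as a list, smallest factor first."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:          # whatever remains is itself prime
        factors.append(n)
    return factors

def lcm(a, b):
    """Least common multiple, derived from the greatest common divisor."""
    return a * b // gcd(a, b)

print([n for n in range(1, 1000) if is_perfect(n)])          # [6, 28, 496]
print([n for n in range(100, 1000) if is_narcissistic(n)])   # [153, 370, 371, 407]
print(prime_factors(90))                                     # [2, 3, 3, 5]
print(gcd(12, 18), lcm(12, 18))                              # 6 36
```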

3. Data analysis thinking

4. Titanic Survival Data Feature Processing

Titanic dataset: click here to download

4.1 Combine training and test sets

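import pandas as pd
train = pd.read_csv('/xxxxxx/train.csv')
test = pd.read_csv('/xxxxxx/test.csv')
# Stack the two sets row-wise; ignore_index=True gives the result a fresh 0..n-1 index
df = pd.concat([train, test], ignore_index=True)
df.head()
# merge and join are two other ways to combine DataFrames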
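To make the difference between the three combining methods concrete, here is a small sketch on toy data. The frames `left` and `right` and their values are made up for illustration and are unrelated to the Titanic files.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'age': [22, 38, 26]})
right = pd.DataFrame({'id': [2, 3, 4], 'fare': [71.3, 7.9, 53.1]})

# concat stacks the frames; a column missing from one frame is filled with NaN
stacked = pd.concat([left, right], ignore_index=True)
print(stacked.shape)            # (6, 3)

# merge matches rows on a key column, like a SQL join
merged = left.merge(right, on='id', how='inner')
print(merged['id'].tolist())    # [2, 3]

# join matches rows on the index
joined = left.set_index('id').join(right.set_index('id'), how='left')
print(joined.shape)             # (3, 2)
```

For the training and test sets, which share the same columns and describe different passengers, stacking with `concat` is the natural choice.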



4.2 Handling missing values

# Embarked: fill with the mode; Fare: fill with the mean;
# Cabin: cabin number, fill missing values with the placeholder 'no-ticket'
df.Embarked = df.Embarked.fillna(df.Embarked.mode()[0])
df.Fare = df.Fare.fillna(df.Fare.mean())
df.Cabin = df.Cabin.fillna('no-ticket')
# Equivalent alternative:
#df.loc[df.Cabin.isna(),'Cabin'] = 'no-ticket'
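The fill rules above can be checked on a tiny frame. This is only a sketch: the `toy` data below is invented so that the result is easy to verify by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', None],
    'Fare': [10.0, 20.0, None, 30.0],
})

# mode() returns a Series (ties are possible), so [0] picks the first mode
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])
# mean() skips NaN by default, so the fill value is (10 + 20 + 30) / 3
toy['Fare'] = toy['Fare'].fillna(toy['Fare'].mean())

print(toy['Embarked'].tolist())   # ['S', 'C', 'S', 'S']
print(toy['Fare'].tolist())       # [10.0, 20.0, 20.0, 30.0]
```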

4.3 Data processing for different feature fields

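① Fare grading (binning, then categorical encoding)

# Work on a copy so that adding columns does not trigger SettingWithCopyWarning
X = df[['PassengerId','Survived','Pclass','Embarked','Name','Cabin','Ticket','Sex','Fare','Age']].copy()
# Maps each title extracted from Name to a broader group (used in step ②)
dict_name = {
    "Capt":"Officer", "Col":"Officer", "Major":"Officer", "Jonkheer":"Royalty",
    "Don":"Royalty", "Sir":"Royalty", "Dr":"Officer", "Rev":"Officer", "the Countess":"Royalty",
    "Dona":"Royalty", "Mme":"Mrs", "Mlle":"Miss", "Ms":"Mrs", "Mr":"Mr", "Mrs":"Mrs",
    "Miss":"Miss", "Master":"Master", "Lady":"Royalty"}
# Split Fare into 5 equal-frequency bins, then encode each bin as an integer
X['fare_'] = pd.factorize(pd.qcut(X.Fare,5))[0]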
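One detail of the binning step is worth seeing on toy data: `pd.factorize` numbers the bins in order of first appearance, not from the lowest fare range to the highest. The sketch below uses invented fares; `.cat.codes` is shown as one possible alternative if ordered codes are wanted, not as what the article uses.

```python
import pandas as pd

fares = pd.Series([50.0, 5.0, 20.0, 100.0, 10.0, 7.0])
bins = pd.qcut(fares, 3)   # three equal-frequency bins

# factorize numbers the bins in order of first appearance
appearance_codes = pd.factorize(bins)[0]
# cat.codes numbers the bins from the lowest interval to the highest
ordered_codes = bins.cat.codes

print(appearance_codes.tolist())   # [0, 1, 2, 0, 2, 1]
print(ordered_codes.tolist())      # [2, 0, 1, 2, 1, 0]
```

Because the article one-hot encodes `fare_` afterwards, the numbering only serves as a label there and the order does not matter.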

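② Name handling: extract the title (salutation) from the name

# Take the text between the comma and the first period, then map it to a title group
X['name_'] = X.Name.str.split(',').str[1].str.split('.').str[0].str.strip().map(dict_name)
#X['name_'] = X.Name.apply(lambda x:x.split(',')[1].split('.')[0].strip()) # map can be used in place of apply
# Length of the full name in characters
X['name_length'] = X.Name.str.len()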
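The chained string operations are easier to follow on a few sample names in the dataset's "Surname, Title. Given names" format. This is a sketch: `title_map` is a hypothetical, shortened stand-in for `dict_name`.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])
# Hypothetical, shortened mapping for illustration
title_map = {'Mr': 'Mr', 'Miss': 'Miss', 'the Countess': 'Royalty'}

# The text between the first comma and the following period is the title
titles = names.str.split(',').str[1].str.split('.').str[0].str.strip()
groups = titles.map(title_map)

print(titles.tolist())   # ['Mr', 'Miss', 'the Countess']
print(groups.tolist())   # ['Mr', 'Miss', 'Royalty']
```

Any title that is missing from the mapping becomes NaN after `map`, so it is worth checking the result for unexpected missing values.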

③ Cabin handling

# Cabin: 0 for 'no-ticket' (the value was missing), 1 otherwise
X['cabin_'] = X.Cabin.apply(lambda x:0 if x == 'no-ticket' else 1)

④ Ticket handling (keep only the letter prefix, then convert it to an integer code)

X['ticket_'] = X.Ticket.str.split().str[0]
X['ticket_'] = pd.factorize(X.ticket_.apply(lambda x:'U0' if x.isnumeric() else x))[0]
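Here is what the two ticket lines do on a handful of ticket strings. The values are sample-style tickets chosen for illustration, and the variable names are assumptions.

```python
import pandas as pd

tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', 'A/5 3540'])

# Keep the first whitespace-separated token of each ticket
prefix = tickets.str.split().str[0]
# A purely numeric token means the ticket has no prefix; label it 'U0'
prefix = prefix.apply(lambda x: 'U0' if x.isnumeric() else x)
# Turn each distinct prefix into an integer label
codes, uniques = pd.factorize(prefix)

print(prefix.tolist())   # ['A/5', 'PC', 'U0', 'A/5']
print(codes.tolist())    # [0, 1, 2, 0]
```

The integer labels are arbitrary identifiers, so their numeric order carries no meaning.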

⑤ One-hot (dummy) encoding for Embarked, Sex, Pclass, and so on

# Drop the raw columns that have already been turned into features
X = X.drop(columns=['PassengerId','Name','Cabin','Ticket','Fare'])
x = pd.get_dummies(X,columns=['Pclass','Embarked','Sex','fare_','cabin_','name_'])
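The column order that `get_dummies` produces matters for the positional indexing in section 4.4. The sketch below, on an invented three-row frame, shows that the untouched columns come first and the dummy columns are appended after them.

```python
import pandas as pd

toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 26.0],
})
encoded = pd.get_dummies(toy, columns=['Sex'])

# Non-encoded columns keep their order; dummies follow
print(encoded.columns.tolist())   # ['Survived', 'Age', 'Sex_female', 'Sex_male']
```

In the article's frame this is why column 0 is Survived and column 1 is Age after encoding.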

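4.4 Predicting missing Age values with a random forest

# The target, Age, is continuous, so this is a regression task
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=1000,n_jobs=-1)
# Column 0 is Survived and column 1 is Age; the remaining columns are the features
x_train = x[x.Age.notnull()].copy()
x_test = x[x.Age.isnull()].copy()
forest.fit(x_train.iloc[:,2:],x_train.iloc[:,1])
# Fill the missing ages with the model's predictions, rounded to one decimal place
x_test.iloc[:,1] = forest.predict(x_test.iloc[:,2:]).round(1)
x_new = pd.concat([x_train,x_test])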
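The same imputation idea can be tried on synthetic data, selecting columns by name instead of by position. This is a sketch under stated assumptions: the features `f1` and `f2`, the synthetic Age formula, and the model settings are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
toy = pd.DataFrame({'f1': rng.normal(size=50), 'f2': rng.normal(size=50)})
toy['Age'] = 30 + 5 * toy['f1'] + rng.normal(size=50)   # synthetic target
toy.loc[::10, 'Age'] = np.nan                            # blank out 5 values

features = ['f1', 'f2']
missing = toy['Age'].isna()
known = toy[~missing]

# Train on rows with a known Age, then predict the rows where it is missing
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[features], known['Age'])
toy.loc[missing, 'Age'] = model.predict(toy.loc[missing, features]).round(1)

print(toy['Age'].isna().sum())   # 0
```

Writing the predictions back with a boolean mask keeps the rows in their original order, whereas concatenating the two subsets places the imputed rows at the end.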

4.5 Ranking features by their correlation coefficient with Survived

x_new.corr().Survived.abs().sort_values(ascending=False)
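To see how this one-liner ranks features, here is a toy frame with invented columns (`strong`, `inverse`, `weak`). Taking the absolute value means a strong negative correlation ranks as high as a strong positive one.

```python
import pandas as pd

toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1],
    'strong':   [1, 2, 3, 4],   # rises with Survived
    'inverse':  [5, 3, 2, 2],   # falls as Survived rises
    'weak':     [1, 0, 1, 0],   # unrelated to Survived
})
ranking = toy.corr()['Survived'].abs().sort_values(ascending=False)
print(ranking.index.tolist())   # ['Survived', 'strong', 'inverse', 'weak']
```

Survived always ranks first because it correlates perfectly with itself; `ranking.drop('Survived')` leaves only the features.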

Thank you all.
