
Identifying Twitter User Gender with Python

Editor: Python

Resource download: https://download.csdn.net/download/sheziqiong/85705774

This is an introductory project for getting familiar with:

  1. Text feature engineering

  2. Image feature engineering

  3. The basic data cleaning process

  4. The project modeling workflow

Basic dataset information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20050 entries, 0 to 20049
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   gender         19953 non-null  object
 1   description    16306 non-null  object
 2   link_color     20050 non-null  object
 3   profileimage   20050 non-null  object
 4   sidebar_color  20050 non-null  object
 5   text           20050 non-null  object
dtypes: object(6)
memory usage: 940.0+ KB
None

The dataset has 20050 rows and 6 columns. Note that 'gender' and 'description' contain missing values.

Feature descriptions:

  • gender: the user's gender, which is the prediction target

  • description: the user's self-description

  • link_color: the user's theme (link) color

  • profileimage: link to the user's Twitter avatar

  • sidebar_color: the user's sidebar color

  • text: the content of the user's tweet

Data preview: (figure omitted)

Process overview:

  1. Data cleaning
1.1 Filter rows by the 'gender' column
1.2 Filter out rows whose 'description' column is empty
1.3 Filter out rows whose 'link_color' or 'sidebar_color' column is not a valid hexadecimal value
1.4 Clean the text data
1.5 Check whether the avatar image is valid, based on the profileimage link
1.6 Replace male->0, female->1
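
The cleaning steps above might be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names `clean_text` and `clean_dataset`, the exact text-cleaning rules, and the choice to keep only the 'male' and 'female' labels are assumptions. Step 1.5 (checking the avatar link) needs network access and is left out here.

```python
import re

import pandas as pd


def clean_text(s: str) -> str:
    """Lowercase, drop URLs, and keep only letters and single spaces (assumed rules)."""
    s = re.sub(r"https?://\S+", " ", s.lower())
    s = re.sub(r"[^a-z\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    # 1.1 keep only rows labelled male or female (assumption)
    df = df[df["gender"].isin(["male", "female"])]
    # 1.2 drop rows with an empty description
    df = df.dropna(subset=["description"])
    # 1.3 keep rows whose colors are valid 6-digit hexadecimal strings
    for col in ("link_color", "sidebar_color"):
        df = df[df[col].astype(str).str.fullmatch(r"[0-9a-fA-F]{6}")]
    df = df.copy()
    # 1.4 clean the two free-text columns
    for col in ("description", "text"):
        df[col] = df[col].map(clean_text)
    # 1.6 encode the label: male -> 0, female -> 1
    df["gender"] = df["gender"].map({"male": 0, "female": 1})
    return df.reset_index(drop=True)


# Toy rows standing in for the real dataset
raw = pd.DataFrame({
    "gender": ["male", "female", "brand", "female", "male"],
    "description": ["I love Python!", None, "News", "Coffee & code http://t.co/x", "Runner"],
    "link_color": ["0084B4", "FF0000", "0084B4", "9266CC", "0"],
    "sidebar_color": ["C0DEED", "FFFFFF", "C0DEED", "FFFFFF", "C0DEED"],
    "text": ["Hello world", "Nice day", "Breaking news", "Good morning", "5k done"],
})
cleaned = clean_dataset(raw)
print(cleaned[["gender", "description"]])
```

Only the first and fourth toy rows survive: the others fail the gender, description, or color filters.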
  2. Split the dataset, tokenize the text, and remove stop words
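
One possible way to do this step is shown below, assuming scikit-learn is used for the split and for the stop-word list. The `tokenize` helper, the 75/25 split ratio, and the sample sentences are all illustrative assumptions.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split


def tokenize(text: str) -> list[str]:
    """Split into lowercase word tokens and drop English stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


# Toy data standing in for the cleaned 'text' column and the 0/1 gender labels
texts = [
    "this is the best coffee in town",
    "watching football with my friends",
    "new blog post about python",
    "shopping for shoes this weekend",
    "the gym was packed today",
    "baking cookies for the kids",
    "fixing my bike again",
    "love this new makeup palette",
]
labels = [0, 0, 0, 1, 0, 1, 0, 1]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # balanced toy labels so the split can be stratified

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
train_tokens = [tokenize(t) for t in X_train]
print(len(X_train), len(X_test))
print(train_tokens[0])
```

If TF-IDF features are extracted with scikit-learn afterwards, the vectorizer can also remove stop words itself, so a separate tokenization pass is optional.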

  3. Feature engineering

3.1 Training data feature extraction

3.1.1 Text data

Extract TF-IDF features from the description text

Extract TF-IDF features from the tweet text
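
A minimal sketch of the TF-IDF extraction, assuming scikit-learn's `TfidfVectorizer` with one vectorizer per text column. The `max_features` value and the sample documents are hypothetical, not taken from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training data standing in for the cleaned 'description' and 'text' columns
train_desc = ["coffee lover and python developer", "football fan", "mother of two, baker"]
train_text = ["great match tonight", "baking cookies today", "python release announced"]

# One vectorizer per column so each keeps its own vocabulary
desc_vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
text_vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)

desc_tfidf = desc_vectorizer.fit_transform(train_desc)  # sparse matrix, one row per user
text_tfidf = text_vectorizer.fit_transform(train_text)

print(desc_tfidf.shape, text_tfidf.shape)
```

Keeping the two vectorizers separate means the same word can carry a different weight in a profile description than in a tweet.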

3.1.2 Image data

RGB features of the link color

RGB histogram features of the avatar

Combine the text features and image features

Normalize the feature ranges
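
The color and histogram features could be computed along these lines. This is a sketch using NumPy only: the helper names `hex_to_rgb` and `rgb_histogram`, the bin count of 8, and the random array standing in for a decoded avatar image are assumptions.

```python
import numpy as np


def hex_to_rgb(color: str) -> np.ndarray:
    """Convert a 6-digit hex color such as '0084B4' to its three channel values."""
    return np.array([int(color[i:i + 2], 16) for i in (0, 2, 4)], dtype=float)


def rgb_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenate the per-channel histograms of an (H, W, 3) uint8 image."""
    hists = [
        np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    return np.concatenate(hists).astype(float)


rng = np.random.default_rng(0)
avatar = rng.integers(0, 256, size=(48, 48, 3), dtype=np.uint8)  # stand-in for a decoded avatar
text_features = rng.random(5)  # placeholder for one user's TF-IDF values

color_features = hex_to_rgb("0084B4")
image_features = rgb_histogram(avatar)

# Combine text and image features into one row per user
row = np.hstack([text_features, color_features, image_features])
print(color_features)  # [  0. 132. 180.]
print(row.shape)       # (32,)
```

The combined columns have very different scales (TF-IDF values in [0, 1], pixel counts in the thousands), which is why a range normalization step follows.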

3.2 Test data feature extraction: same procedure as for the training set
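
In practice, "same procedure" usually means reusing the transformers fitted on the training data rather than fitting new ones on the test data. A small sketch of that idea, assuming scikit-learn's `TfidfVectorizer` and `MinMaxScaler` (the source does not say which scaler was used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

train_docs = ["coffee and code", "football with friends", "python code every day"]
test_docs = ["coffee with friends", "rust code"]

vectorizer = TfidfVectorizer()
scaler = MinMaxScaler()

# Fit on the training set only
X_train = scaler.fit_transform(vectorizer.fit_transform(train_docs).toarray())
# Apply the already-fitted vocabulary and scaling to the test set
X_test = scaler.transform(vectorizer.transform(test_docs).toarray())

print(X_train.shape, X_test.shape)
```

Words that appear only in the test documents (here "rust") are ignored, so both matrices have the same number of columns.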

3.3 PCA dimensionality reduction
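
A hedged sketch of the PCA step with scikit-learn. The random matrices stand in for the normalized feature matrices, and `n_components=10` is an arbitrary illustrative value, not the one used in the project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((100, 50))  # placeholder for normalized training features
X_test = rng.random((20, 50))    # placeholder for normalized test features

pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)  # learn the projection on training data
X_test_pca = pca.transform(X_test)        # reuse the same projection on test data

print(X_train.shape, "->", X_train_pca.shape)
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```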

  4. Model building and training, comparing results before and after PCA

Using features without PCA applied

Using features with PCA applied

Model: lr_model = LogisticRegression()
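
The comparison could look roughly like this. The synthetic data from `make_classification` is only a stand-in for the real feature matrix, `max_iter=1000` is added here to avoid convergence warnings (the source uses the default `LogisticRegression()`), and `lr_pca_model` is a hypothetical name for the second model.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data in place of the combined features
X, y = make_classification(n_samples=500, n_features=40, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model trained on the features without PCA
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
acc_raw = lr_model.score(X_test, y_test)

# Model trained on the PCA-reduced features
pca = PCA(n_components=10).fit(X_train)
lr_pca_model = LogisticRegression(max_iter=1000)
lr_pca_model.fit(pca.transform(X_train), y_train)
acc_pca = lr_pca_model.score(pca.transform(X_test), y_test)

print(f"accuracy without PCA: {acc_raw:.3f}")
print(f"accuracy with PCA:    {acc_pca:.3f}")
```

Whether PCA helps or hurts depends on the data; on the real features the two accuracies should be compared on the same held-out test set, as done here.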

  5. Model testing

  6. Delete the decompressed data to free up disk space
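
The cleanup step can be done with the standard library. In this sketch, `remove_extracted_data` and the directory layout are hypothetical; the demonstration runs on a temporary directory rather than on real data.

```python
import shutil
import tempfile
from pathlib import Path


def remove_extracted_data(path) -> bool:
    """Delete an extracted-data directory tree; return True if something was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "extracted"
    demo_dir.mkdir()
    (demo_dir / "avatar.jpg").write_bytes(b"")
    removed = remove_extracted_data(demo_dir)
    still_there = demo_dir.exists()

print(removed, still_there)  # True False
```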


