
Python Data Analysis Learning Series 13: Introduction to Python Modeling Libraries


Data comes from the book's GitHub repository (https://github.com/wesm/pydata-book); readers who need it can download it from GitHub themselves.

In this book, I have focused on the programming foundations for doing data analysis in Python. Since data analysts and scientists spend so much time on data wrangling and preparation, the book's emphasis has been on mastering those skills.

Which library you use to develop models depends on the application. Many statistical problems can be solved with simpler techniques like ordinary least squares regression, while other problems may call for more complex machine learning methods. Fortunately, Python has become one of the languages of choice for implementing these analytical methods, so after finishing this book you will be able to explore many tools.

In this chapter, I will review some features of pandas that may come in handy when you are crossing back and forth between pandas-based data wrangling and model fitting and scoring. I will then give short introductions to two popular modeling toolkits, statsmodels and scikit-learn. Since each of these projects is large enough to warrant its own book, I make no effort to be comprehensive and instead direct you to both projects' online documentation, along with other Python-based books on data science, statistics, and machine learning.

13.1 Interfacing Between pandas and Model Code

A common workflow for model development is to use pandas for data loading and cleaning before switching over to a modeling library to build the model itself. An important part of the model development process is called feature engineering in machine learning: any data transformation or analytics that extract information from a raw dataset that may be useful in a modeling context. The data aggregation and GroupBy tools we have explored in this book are used often in a feature engineering context.

While the details of good feature engineering are out of scope for this book, I will show some methods to make switching between data manipulation with pandas and modeling as painless as possible.
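
As a quick, hypothetical illustration of GroupBy-based feature engineering (the column names group and value are illustrative, not from the book's datasets), one common pattern derives a new feature from a within-group statistic:

import pandas as pd

frame = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                      'value': [1.0, 2.0, 3.0, 5.0]})
# transform('mean') broadcasts each group's mean back to row level,
# yielding a "deviation from group mean" feature
frame['value_demeaned'] = (frame['value']
                           - frame.groupby('group')['value'].transform('mean'))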

The point of contact between pandas and other analysis libraries is usually NumPy arrays. To turn a DataFrame into a NumPy array, use the .values attribute:

In [10]: import pandas as pd
In [11]: import numpy as np
In [12]: data = pd.DataFrame({
   ....:     'x0': [1, 2, 3, 4, 5],
   ....:     'x1': [0.01, -0.01, 0.25, -4.1, 0.],
   ....:     'y': [-1.5, 0., 3.6, 1.3, -2.]})
In [13]: data
Out[13]:
x0 x1 y
0 1 0.01 -1.5
1 2 -0.01 0.0
2 3 0.25 3.6
3 4 -4.10 1.3
4 5 0.00 -2.0
In [14]: data.columns
Out[14]: Index(['x0', 'x1', 'y'], dtype='object')
In [15]: data.values
Out[15]:
array([[ 1. , 0.01, -1.5 ],
[ 2. , -0.01, 0. ],
[ 3. , 0.25, 3.6 ],
[ 4. , -4.1 , 1.3 ],
[ 5. , 0. , -2. ]])

To convert back to a DataFrame, you can pass a two-dimensional ndarray, optionally with column names:

In [16]: df2 = pd.DataFrame(data.values, columns=['one', 'two', 'three'])
In [17]: df2
Out[17]:
one two three
0 1.0 0.01 -1.5
1 2.0 -0.01 0.0
2 3.0 0.25 3.6
3 4.0 -4.10 1.3
4 5.0 0.00 -2.0

Note: The .values attribute is best used when your data is homogeneous, for example all numeric types. If you have heterogeneous data, the result will be an ndarray of Python objects:

In [18]: df3 = data.copy()
In [19]: df3['strings'] = ['a', 'b', 'c', 'd', 'e']
In [20]: df3
Out[20]:
x0 x1 y strings
0 1 0.01 -1.5 a
1 2 -0.01 0.0 b
2 3 0.25 3.6 c
3 4 -4.10 1.3 d
4 5 0.00 -2.0 e
In [21]: df3.values
Out[21]:
array([[1, 0.01, -1.5, 'a'],
[2, -0.01, 0.0, 'b'],
[3, 0.25, 3.6, 'c'],
[4, -4.1, 1.3, 'd'],
[5, 0.0, -2.0, 'e']], dtype=object)
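
An aside: newer pandas releases (0.24 and later) recommend the DataFrame.to_numpy() method over the .values attribute; the following should be equivalent to the call above:

df3.to_numpy()  # also an object-dtype ndarray, since the columns are mixed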

For some models, you may wish to use only a subset of the columns. I recommend using loc indexing with values:

In [22]: model_cols = ['x0', 'x1']
In [23]: data.loc[:, model_cols].values
Out[23]:
array([[ 1. , 0.01],
[ 2. , -0.01],
[ 3. , 0.25],
[ 4. , -4.1 ],
[ 5. , 0. ]])

Some libraries have native support for pandas and do some of this work for you automatically: converting to NumPy from DataFrame and attaching model parameter names to the columns of output tables or Series. In other cases, you will have to perform this "metadata management" manually.
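
Here is a minimal sketch of what that manual bookkeeping can look like, using the data DataFrame above (the helper variable names are illustrative, not from the book):

features = ['x0', 'x1']
X_arr = data[features].values
y_arr = data['y'].values

# Ordinary least squares by hand with NumPy; rcond=None silences
# NumPy's FutureWarning about the default rcond
beta_hat, _, _, _ = np.linalg.lstsq(X_arr, y_arr, rcond=None)

# Reattach the column names so the coefficients read as a labeled Series
pd.Series(beta_hat, index=features)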

In Chapter 12, we learned about pandas's Categorical type and the pandas.get_dummies function. Suppose we had a non-numeric column in our dataset:

In [24]: data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
....: categories=['a', 'b'])
In [25]: data
Out[25]:
x0 x1 y category
0 1 0.01 -1.5 a
1 2 -0.01 0.0 b
2 3 0.25 3.6 a
3 4 -4.10 1.3 a
4 5 0.00 -2.0 b

If we wanted to replace the 'category' column with dummy variables, we create the dummy variables, drop the 'category' column, and then join the result:

In [26]: dummies = pd.get_dummies(data.category, prefix='category')
In [27]: data_with_dummies = data.drop('category', axis=1).join(dummies)
In [28]: data_with_dummies
Out[28]:
x0 x1 y category_a category_b
0 1 0.01 -1.5 1 0
1 2 -0.01 0.0 0 1
2 3 0.25 3.6 1 0
3 4 -4.10 1.3 1 0
4 5 0.00 -2.0 0 1

There are some nuances to fitting certain statistical models with dummy variables. It may be simpler and less error-prone to use Patsy (the subject of the next section) when you have more than simple numeric columns.

13.2 Creating Model Descriptions with Patsy

Patsy is a Python library for describing statistical models (especially linear models) with a small string-based "formula syntax", which is inspired by the formula syntax used by the R and S statistical programming languages.

Patsy is well supported for specifying linear models in statsmodels, so I will focus on some of the main features to help you get up and running. Patsy's formulas are a special string syntax that looks like:

y ~ x0 + x1

The syntax a + b does not mean to add a to b, but rather that these are terms in the design matrix created for the model. The patsy.dmatrices function takes a formula string along with a dataset (which can be a DataFrame or a dict of arrays) and produces design matrices for a linear model:

In [29]: data = pd.DataFrame({
   ....:     'x0': [1, 2, 3, 4, 5],
   ....:     'x1': [0.01, -0.01, 0.25, -4.1, 0.],
   ....:     'y': [-1.5, 0., 3.6, 1.3, -2.]})
In [30]: data
Out[30]:
x0 x1 y
0 1 0.01 -1.5
1 2 -0.01 0.0
2 3 0.25 3.6
3 4 -4.10 1.3
4 5 0.00 -2.0
In [31]: import patsy
In [32]: y, X = patsy.dmatrices('y ~ x0 + x1', data)

Now we have:

In [33]: y
Out[33]:
DesignMatrix with shape (5, 1)
y
-1.5
0.0
3.6
1.3
-2.0
Terms:
'y' (column 0)
In [34]: X
Out[34]:
DesignMatrix with shape (5, 3)
Intercept x0 x1
1 1 0.01
1 2 -0.01
1 3 0.25
1 4 -4.10
1 5 0.00
Terms:
'Intercept' (column 0)
'x0' (column 1)
'x1' (column 2)

These Patsy DesignMatrix instances are NumPy ndarrays with some additional metadata:

In [35]: np.asarray(y)
Out[35]:
array([[-1.5],
[ 0. ],
[ 3.6],
[ 1.3],
[-2. ]])
In [36]: np.asarray(X)
Out[36]:
array([[ 1. , 1. , 0.01],
[ 1. , 2. , -0.01],
[ 1. , 3. , 0.25],
[ 1. , 4. , -4.1 ],
[ 1. , 5. , 0. ]])

You might wonder where the Intercept term came from. This is a convention for linear models like ordinary least squares (OLS) regression. You can suppress the intercept by adding the term + 0 to the model:

In [37]: patsy.dmatrices('y ~ x0 + x1 + 0', data)[1]
Out[37]:
DesignMatrix with shape (5, 2)
x0 x1
1 0.01
2 -0.01
3 0.25
4 -4.10
5 0.00
Terms:
'x0' (column 0)
'x1' (column 1)
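
A hedged aside: Patsy's formula grammar also lets you remove terms with a minus sign, so subtracting 1 should drop the intercept just like adding + 0 above:

patsy.dmatrices('y ~ x0 + x1 - 1', data)[1]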

The Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq, which performs an ordinary least squares regression:

In [38]: coef, resid, _, _ = np.linalg.lstsq(X, y)

The model metadata is retained in the design_info attribute, so you can reattach the model column names to the fitted coefficients to obtain a Series, for example:

In [39]: coef
Out[39]:
array([[ 0.3129],
[-0.0791],
[-0.2655]])
In [40]: coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
In [41]: coef
Out[41]:
Intercept 0.312910
x0 -0.079106
x1 -0.265464
dtype: float64
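
To close the loop on this example, a small sketch (using X, y, and coef as defined above) that computes fitted values and residuals from the design matrix and the labeled coefficients:

# DesignMatrix is an ndarray subclass, so plain matrix algebra works
fitted = np.asarray(X) @ coef.values
residuals = np.asarray(y).squeeze() - fitted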

Data Transformations in Patsy Formulas

You can mix Python code into your Patsy formulas; when evaluating the formula, the library will try to find the functions you use in the enclosing scope:

In [42]: y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)
In [43]: X
Out[43]:
DesignMatrix with shape (5, 3)
Intercept x0 np.log(np.abs(x1) + 1)
1 1 0.00995
1 2 0.00995
1 3 0.22314
1 4 1.62924
1 5 0.00000
Terms:
'Intercept' (column 0)
'x0' (column 1)
'np.log(np.abs(x1) + 1)' (column 2)

Some commonly used variable transformations include standardizing (to mean 0 and variance 1) and centering (subtracting the mean). Patsy has built-in functions for this purpose:

In [44]: y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
In [45]: X
Out[45]:
DesignMatrix with shape (5, 3)
Intercept standardize(x0) center(x1)
1 -1.41421 0.78
1 -0.70711 0.76
1 0.00000 1.02
1 0.70711 -3.33
1 1.41421 0.77
Terms:
'Intercept' (column 0)
'standardize(x0)' (column 1)
'center(x1)' (column 2)

As part of a modeling process, you may fit a model on one dataset and then evaluate it on another. This might be a hold-out portion of the data or new data observed later. When applying transformations like centering and standardizing, you should be careful when using the model to form predictions on new data: these are called stateful transformations, because you must use statistics like the mean or standard deviation of the original dataset when transforming the new dataset.

The patsy.build_design_matrices function can apply transformations to new out-of-sample data using the saved information from the original in-sample dataset:

In [46]: new_data = pd.DataFrame({
   ....:     'x0': [6, 7, 8, 9],
   ....:     'x1': [3.1, -0.5, 0, 2.3],
   ....:     'y': [1, 2, 3, 4]})
In [47]: new_X = patsy.build_design_matrices([X.design_info], new_data)
In [48]: new_X
Out[48]:
[DesignMatrix with shape (4, 3)
Intercept standardize(x0) center(x1)
1 2.12132 3.87
1 2.82843 0.27
1 3.53553 0.77
1 4.24264 3.07
Terms:
'Intercept' (column 0)
'standardize(x0)' (column 1)
'center(x1)' (column 2)]

Because the plus symbol (+) in Patsy formulas does not mean addition, when you want to add columns from a dataset by name, you must wrap them in the special I function:

In [49]: y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)
In [50]: X
Out[50]:
DesignMatrix with shape (5, 2)
Intercept I(x0 + x1)
1 1.01
1 1.99
1 3.25
1 -0.10
1 5.00
Terms:
'Intercept' (column 0)
'I(x0 + x1)' (column 1)

Patsy has several other built-in transforms in the patsy.builtins module; see the online documentation for more.

Categorical data has a special class of transformations, explained next.

Categorical Data and Patsy

Non-numeric data can be transformed for a model design matrix in many different ways. A complete treatment of this topic is outside the scope of this book and is best studied along with a statistics course.

When you use non-numeric terms in a Patsy formula, they are converted to dummy variables by default. If there is an intercept, one of the levels will be left out to avoid collinearity:

In [51]: data = pd.DataFrame({
   ....:     'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
   ....:     'key2': [0, 1, 0, 1, 0, 1, 0, 0],
   ....:     'v1': [1, 2, 3, 4, 5, 6, 7, 8],
   ....:     'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]})
In [52]: y, X = patsy.dmatrices('v2 ~ key1', data)
In [53]: X
Out[53]:
DesignMatrix with shape (8, 2)
Intercept key1[T.b]
1 0
1 0
1 1
1 1
1 0
1 1
1 0
1 1
Terms:
'Intercept' (column 0)
'key1' (column 1)

If you omit the intercept from the model, then columns for each category value will be included in the model design matrix:

In [54]: y, X = patsy.dmatrices('v2 ~ key1 + 0', data)
In [55]: X
Out[55]:
DesignMatrix with shape (8, 2)
key1[a] key1[b]
1 0
1 0
0 1
0 1
1 0
0 1
1 0
0 1
Terms:
'key1' (columns 0:2)

Numeric columns can be interpreted as categorical with the C function:

In [56]: y, X = patsy.dmatrices('v2 ~ C(key2)', data)
In [57]: X
Out[57]:
DesignMatrix with shape (8, 2)
Intercept C(key2)[T.1]
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 0
Terms:
'Intercept' (column 0)
'C(key2)' (column 1)

When you're using multiple categorical terms in a model, things can be more complicated, as you can include interaction terms of the form key1:key2, which can be used, for example, in analysis of variance (ANOVA) models:

In [58]: data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})
In [59]: data
Out[59]:
key1 key2 v1 v2
0 a zero 1 -1.0
1 a one 2 0.0
2 b zero 3 2.5
3 b one 4 -0.5
4 a zero 5 4.0
5 b one 6 -1.2
6 a zero 7 0.2
7 b zero 8 -1.7
In [60]: y, X = patsy.dmatrices('v2 ~ key1 + key2', data)
In [61]: X
Out[61]:
DesignMatrix with shape (8, 3)
Intercept key1[T.b] key2[T.zero]
1 0 1
1 0 0
1 1 1
1 1 0
1 0 1
1 1 0
1 0 1
1 1 1
Terms:
'Intercept' (column 0)
'key1' (column 1)
'key2' (column 2)
In [62]: y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
In [63]: X
Out[63]:
DesignMatrix with shape (8, 4)
Intercept key1[T.b] key2[T.zero]
key1[T.b]:key2[T.zero]
1 0 1 0
1 0 0 0
1 1 1 1
1 1 0 0
1 0 1 0
1 1 0 0
1 0 1 0
1 1 1 1
Terms:
'Intercept' (column 0)
'key1' (column 1)
'key2' (column 2)
'key1:key2' (column 3)

Patsy provides other ways to transform categorical data, including transformations for terms with a particular ordering; see the online documentation for more.
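
As one hedged example (relying on Patsy's documented C() options), the Treatment contrast chooses which level is treated as the baseline when dummy-coding:

# 'zero' is a level of data['key2'] after the mapping above; Treatment
# is one of Patsy's built-in contrasts available inside formulas
y, X = patsy.dmatrices("v2 ~ C(key2, Treatment(reference='zero'))", data)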

13.3 Introduction to statsmodels

statsmodels is a Python library for fitting many kinds of statistical models, performing statistical tests, and data exploration and visualization. statsmodels contains many "classical" frequentist statistical methods, but no Bayesian methods or machine learning models.

Some kinds of models found in statsmodels include:

  • Linear models, generalized linear models, and robust linear models
  • Linear mixed effects models
  • Analysis of variance (ANOVA) methods
  • Time series processes and state space models
  • Generalized method of moments

In what follows, we will use a few basic tools in statsmodels and explore how the modeling interfaces work with Patsy formulas and pandas DataFrame objects.

Estimating Linear Models

There are several kinds of linear regression models in statsmodels, from the more basic (e.g., ordinary least squares) to the more complex (e.g., iteratively reweighted least squares).

Linear models in statsmodels have two different main interfaces: array-based and formula-based. These are accessed through the following API module imports:

import statsmodels.api as sm
import statsmodels.formula.api as smf

To show how to use these, we generate a linear model from some random data:

def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)

# For reproducibility
np.random.seed(12345)

N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]
y = np.dot(X, beta) + eps

Here, I wrote down the "true" model with known parameters beta. In this case, dnorm is a helper function for generating normally distributed data with a particular mean and variance. So now we have:

In [66]: X[:5]
Out[66]:
array([[-0.1295, -1.2128, 0.5042],
[ 0.3029, -0.4357, -0.2542],
[-0.3285, -0.0253, 0.1384],
[-0.3515, -0.7196, -0.2582],
[ 1.2433, -0.3738, -0.5226]])
In [67]: y[:5]
Out[67]: array([ 0.4279, -0.6735, -0.0909, -0.4895, -0.1289])

A linear model is generally fitted with an intercept term, as we saw before with Patsy. The sm.add_constant function can add an intercept column to an existing matrix:

In [68]: X_model = sm.add_constant(X)
In [69]: X_model[:5]
Out[69]:
array([[ 1. , -0.1295, -1.2128, 0.5042],
[ 1. , 0.3029, -0.4357, -0.2542],
[ 1. , -0.3285, -0.0253, 0.1384],
[ 1. , -0.3515, -0.7196, -0.2582],
[ 1. , 1.2433, -0.3738, -0.5226]])

The sm.OLS class can fit an ordinary least squares linear regression:

In [70]: model = sm.OLS(y, X)

The model's fit method returns a regression results object containing the estimated model parameters and other diagnostics:

In [71]: results = model.fit()
In [72]: results.params
Out[72]: array([ 0.1783, 0.223 , 0.501 ])

The summary method on results can print a detailed diagnostic output of the model:

In [73]: print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.430
Model: OLS Adj. R-squared: 0.413
Method: Least Squares F-statistic: 24.42
Date: Mon, 25 Sep 2017 Prob (F-statistic): 7.44e-12
Time: 14:06:15 Log-Likelihood: -34.305
No. Observations: 100 AIC: 74.61
Df Residuals: 97 BIC: 82.42
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.1783 0.053 3.364 0.001 0.073 0.283
x2 0.2230 0.046 4.818 0.000 0.131 0.315
x3 0.5010 0.080 6.237 0.000 0.342 0.660
==============================================================================
Omnibus: 4.662 Durbin-Watson: 2.201
Prob(Omnibus): 0.097 Jarque-Bera (JB): 4.098
Skew: 0.481 Prob(JB): 0.129
Kurtosis: 3.243 Cond. No. 1.74
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.

Notice that the parameter names here have been given the generic names x1, x2, and so on. Suppose instead that all of the model parameters are in a DataFrame:

In [74]: data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
In [75]: data['y'] = y
In [76]: data[:5]
Out[76]:
col0 col1 col2 y
0 -0.129468 -1.212753 0.504225 0.427863
1 0.302910 -0.435742 -0.254180 -0.673480
2 -0.328522 -0.025302 0.138351 -0.090878
3 -0.351475 -0.719605 -0.258215 -0.489494
4 1.243269 -0.373799 -0.522629 -0.128941

Now we can use the statsmodels formula API and Patsy formula strings:

In [77]: results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
In [78]: results.params
Out[78]:
Intercept 0.033559
col0 0.176149
col1 0.224826
col2 0.514808
dtype: float64
In [79]: results.tvalues
Out[79]:
Intercept 0.952188
col0 3.319754
col1 4.850730
col2 6.303971
dtype: float64

Observe how statsmodels has returned results as Series with the DataFrame column names attached. We also do not need to use add_constant when using formulas and pandas objects.

Given new out-of-sample data, you can compute predicted values based on the estimated model parameters:

In [80]: results.predict(data[:5])
Out[80]:
0 -0.002327
1 -0.141904
2 0.041226
3 -0.323070
4 -0.100535
dtype: float64

There are many additional tools in statsmodels for analysis, diagnostics, and visualization of linear model results, and there are other kinds of linear models beyond ordinary least squares.
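
As one hedged example of such an alternative (a sketch assuming the statsmodels API imported above), a robust linear model downweights outliers relative to OLS and can be fit to the same simulated data:

# RLM defaults to Huber's T norm for downweighting outlying observations
results_rlm = sm.RLM(y, X_model).fit()
results_rlm.params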

Estimating Time Series Processes

Another class of models in statsmodels is for time series analysis. Among these are autoregressive processes, Kalman filtering and other state space models, and multivariate autoregressive models.

Let's simulate some time series data with an autoregressive structure and noise:

init_x = 4

values = [init_x, init_x]
N = 1000

b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)
for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)

This data has an AR(2) structure (two lags) with parameters 0.8 and -0.4. When you fit an AR model, you may not know the number of lagged terms to include, so you can fit the model with some larger number of lags:

In [82]: MAXLAGS = 5
In [83]: model = sm.tsa.AR(values)
In [84]: results = model.fit(MAXLAGS)

The estimated parameters in the results have the intercept first, followed by the estimates for the first two lags:

In [85]: results.params
Out[85]: array([-0.0062, 0.7845, -0.4085, -0.0136, 0.015 , 0.0143])
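
A hedged compatibility note: sm.tsa.AR, used above, was deprecated and later removed from statsmodels; assuming statsmodels 0.11 or newer, the replacement AutoReg class fits roughly the same model:

from statsmodels.tsa.ar_model import AutoReg

# Fit the same five lags with the newer interface
results_ar = AutoReg(values, lags=MAXLAGS).fit()
results_ar.params  # intercept first, then the lag coefficients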

Deeper details of these models and how to interpret their results are beyond the scope of this book, but there is plenty more to discover in the statsmodels documentation.

13.4 Introduction to scikit-learn

scikit-learn is one of the most widely used general-purpose Python machine learning libraries. It contains a broad selection of standard supervised and unsupervised machine learning methods, with tools for model selection and evaluation, data transformation, data loading, and model persistence. These models can be used for classification, clustering, prediction, and other common tasks.

There are many online and print resources for learning machine learning and how to apply libraries like scikit-learn and TensorFlow to solve real-world problems. In this section, I will give a brief flavor of the scikit-learn API style.

At the time of this writing, scikit-learn did not have deep pandas integration, although some third-party packages were in development. Nevertheless, pandas is very well suited for massaging datasets prior to model fitting.

As an example, I use a now-classic dataset from a Kaggle competition about passenger survival rates on the Titanic. We load the training and test datasets using pandas:

In [86]: train = pd.read_csv('datasets/titanic/train.csv')
In [87]: test = pd.read_csv('datasets/titanic/test.csv')
In [88]: train[:4]
Out[88]:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S

Libraries like statsmodels and scikit-learn generally cannot be fed missing data, so we look at the columns to see if any contain missing values:

In [89]: train.isnull().sum()
Out[89]:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In [90]: test.isnull().sum()
Out[90]:
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64

In statistics and machine learning examples like this one, a typical task is to predict whether a passenger would survive based on features in the data. A model is fitted on a training dataset and then evaluated on an out-of-sample test dataset.

I would like to use Age as a predictor, but it has missing data. There are a number of ways to do missing data imputation; here I use a simple approach and fill in the null values of both tables with the median of the training dataset:

In [91]: impute_value = train['Age'].median()
In [92]: train['Age'] = train['Age'].fillna(impute_value)
In [93]: test['Age'] = test['Age'].fillna(impute_value)

Now we need to specify our models. I add a column IsFemale as an encoded version of the 'Sex' column:

In [94]: train['IsFemale'] = (train['Sex'] == 'female').astype(int)
In [95]: test['IsFemale'] = (test['Sex'] == 'female').astype(int)

Then we decide on some model variables and create NumPy arrays:

In [96]: predictors = ['Pclass', 'IsFemale', 'Age']
In [97]: X_train = train[predictors].values
In [98]: X_test = test[predictors].values
In [99]: y_train = train['Survived'].values
In [100]: X_train[:5]
Out[100]:
array([[ 3., 0., 22.],
[ 1., 1., 38.],
[ 3., 1., 26.],
[ 1., 1., 35.],
[ 3., 0., 35.]])
In [101]: y_train[:5]
Out[101]: array([0, 1, 1, 1, 0])

I make no claims that this is a good model or that these features are engineered properly. We use the LogisticRegression model from scikit-learn and create a model instance:

In [102]: from sklearn.linear_model import LogisticRegression
In [103]: model = LogisticRegression()

Similar to statsmodels, we can fit this model to the training data using the model's fit method:

In [104]: model.fit(X_train, y_train)
Out[104]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

Now, we can form predictions for the test dataset using model.predict:

In [105]: y_predict = model.predict(X_test)
In [106]: y_predict[:10]
Out[106]: array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

If you had the true values for the test dataset, you could compute an accuracy percentage or some other error metric:

(y_true == y_predict).mean()
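
Equivalently, scikit-learn's metrics module provides a helper for this; y_true here is hypothetical, standing in for the true test labels that are not included in this dataset:

from sklearn.metrics import accuracy_score

# Fraction of predictions matching the (hypothetical) true labels
accuracy_score(y_true, y_predict)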

In practice, there are often many additional layers of complexity in model training. Many models have parameters that can be tuned, and there are techniques such as cross-validation that can be used for parameter tuning to avoid overfitting to the training data. This can often yield better predictive performance or robustness on new data.

Cross-validation works by splitting the training data to simulate out-of-sample prediction. Based on a model accuracy score (like mean squared error), you can perform a grid search on model parameters. Some models, like logistic regression, have estimator classes with built-in cross-validation. For example, the LogisticRegressionCV class can be used with a parameter indicating how fine-grained a grid search to do on the model regularization parameter C:

In [107]: from sklearn.linear_model import LogisticRegressionCV
In [108]: model_cv = LogisticRegressionCV(10)
In [109]: model_cv.fit(X_train, y_train)
Out[109]:
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
fit_intercept=True, intercept_scaling=1.0, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

To do cross-validation by hand, you can use the cross_val_score helper function, which handles the data splitting process. For example, to cross-validate our model with four non-overlapping splits of the training data, we can do this:

In [110]: from sklearn.model_selection import cross_val_score
In [111]: model = LogisticRegression(C=10)
In [112]: scores = cross_val_score(model, X_train, y_train, cv=4)
In [113]: scores
Out[113]: array([ 0.7723, 0.8027, 0.7703, 0.7883])

The default scoring metric is model-dependent, but it is possible to choose an explicit scoring function, as sketched below. Cross-validated models take longer to train but can often yield better model performance.
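
A small sketch of passing an explicit metric, assuming the scoring parameter of cross_val_score: for example, 'roc_auc' scores by area under the ROC curve instead of the default:

# Same four-fold split as above, scored by ROC AUC instead
scores_auc = cross_val_score(model, X_train, y_train, cv=4,
                             scoring='roc_auc')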

13.5 Continuing Your Education

I have only scratched the surface of some Python modeling libraries, and there are more and more frameworks for all kinds of statistics and machine learning, either implemented in Python or with a Python user interface.

This book is focused on data wrangling, but there are many other books dedicated to modeling and data science tools. Some excellent ones are:

  • Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido (O'Reilly)
  • Python Data Science Handbook by Jake VanderPlas (O'Reilly)
  • Data Science from Scratch: First Principles with Python by Joel Grus (O'Reilly)
  • Python Machine Learning by Sebastian Raschka (Packt Publishing)
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O'Reilly)

While books can be valuable resources for learning, they can sometimes grow out of date as the underlying open source software evolves. It's a good idea to stay familiar with the documentation for the various statistics and machine learning frameworks to keep up with the latest features and APIs.

