
Self-Taught Programming Series 5: Getting Started with pandas


Learning pandas

  • 5.1 Series data
  • 5.2 Creating a DataFrame
  • Summary: creation methods
  • 5.3 Index object
  • 5.4 Basic pandas functionality
  • 5.5 Descriptive statistics

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# import pandas_datareader.data as web  # optional: only the commented-out stock example in 5.5 needs it

5.1 Series data

  • Indexes
  • Operations: automatic alignment
  • Missing values and the name attribute
# A Series consists of one-dimensional data plus an index; for time series data, the index is the time
series1 = pd.Series([1,3,5,7])
print(series1)
# Custom index 
series2 =pd.Series([1,3,4,6],index=['a','b','c','d'])
series2
series2.index
# Select values in a Series by index label
value1 = series2['c']
value2 = series2[['a','d','b']]
value2
# Series operations work as in NumPy: Boolean filtering, scalar arithmetic, element-wise functions
series3 = series2[series2>5]
print(series3)
series4 = series2*2
print(series4)
series5 = np.exp(series2)
print(series5)
# A Series can be thought of as an ordered dictionary,
# so you can also create a Series from a dictionary
print('a' in series2)
sdata = {'zhao': 1000, 'qian': 2000, 'sun': 3000, 'zhou': 4000}
series6 = pd.Series(sdata)
print(series6)
# You can also pass an index to arrange the dictionary keys in the desired order
keys = ['sun','zhao','qian','li']
series7 = pd.Series(sdata,index=keys) # a label in keys that is not a dictionary key ('li') gets NaN
# A dictionary key that is not listed in keys ('zhou') is dropped from the Series
series7
# Missing data
pd.isnull(series7) # top-level pandas function
pd.notnull(series7)
series7.isnull() # Series instance method
series7.notnull()
# pandas automatically aligns data by index label
result = series6+series7
print(result) # a label that is missing or NaN in either operand gives NaN in the sum
# name attribute
series7.name = 'salary' # name of the Series
series7.index.name = 'name'
series7
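The alignment rule above can be checked on a small case. The following is a minimal sketch, not part of the original walkthrough: the Series `a` and `b` and their values are made up for illustration. It shows that `+` yields NaN for a label found in only one operand, while `Series.add` with `fill_value=0` treats the absent value as 0.

```python
import pandas as pd

# Two small Series whose indexes only partly overlap (illustrative data).
a = pd.Series({'zhao': 1000, 'qian': 2000, 'sun': 3000})
b = pd.Series({'sun': 30, 'zhao': 10, 'li': 5})

# Plain addition aligns on labels; a label found in only one Series gives NaN.
total = a + b

# Series.add with fill_value substitutes 0 for the side that lacks a label.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

Which form to use depends on whether a missing label should mean "unknown" (keep the NaN) or "nothing to add" (fill with 0).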

5.2 Creating a DataFrame

  • A collection of columns that share a row index and also have a column index; it can be viewed as a dictionary of Series
  • Internally the data is stored as one or more two-dimensional blocks
# A DataFrame is usually created by passing in a dictionary of equal-length lists or arrays
data = {
'state':['ohio','ohio','Nevada','Nevada','Nevada','wang'], # all lists must have the same length
'year':[2000,2001,2002,2003,2003,2004],
'pop': [1.5,1.7,3.6,2.4,2.9,3.0]
}
frame = pd.DataFrame(data)
frame
# Preview a DataFrame with head() and tail()
frame.head()
frame.tail()
# The column order can be specified
df = pd.DataFrame(data,columns=['year','state','pop'])
df
# As with a Series, a requested column that is not in the data is filled with missing values
frame2 = pd.DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four', 'five','six'])
print(frame2.columns)
print(frame2)
# Retrieve a column as a Series, dictionary-style or attribute-style
print(frame2.year) # attribute access requires the column name to be a valid Python identifier
print(frame2['pop']) # more general: frame2.pop would return the DataFrame.pop method, not the column
# A column can be modified by assignment
frame2['debt'] = 14.1
print(frame2)
frame2['debt'] = np.arange(6.0) # np.arange accepts floats, while Python's range accepts only integers
print(frame2)
# A column selected by indexing can be a view on the underlying data rather than a copy, so in-place changes to it may show up in the DataFrame; call .copy() for an independent Series
# del removes a column
# First create a column of Boolean values
frame2['eastern'] = frame2.state == 'ohio'
print(frame2)
del frame2['eastern']
print(frame2)
# Another way to create a DataFrame:
# pass in a nested dictionary
pop = {
'Nevada': {2001: 2.4, 2002: 2.9},
'ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
}
frame3 = pd.DataFrame(pop)
print(frame3) # outer keys become the columns, inner keys become the rows
# The keys of the inner dictionaries are combined to form the row index
# You can also specify the row index explicitly
df = pd.DataFrame(pop,index=[2001,2002,2003])
print(df)
# Transpose a DataFrame
print(frame3.T)
# The dictionary values can also be Series
pdata = {
'ohio': frame3['ohio'][:-1],
'Nevada': frame3['Nevada'][:2]
}
df = pd.DataFrame(pdata)
print(df)
# A list of dictionaries: unlike a nested dictionary, each dictionary becomes a row, and the rows get a default integer index because no row index is given
sdata = [
{'name': 'wang', 'age': 12},
{'name': 'liu', 'age': 22}
]
df = pd.DataFrame(sdata)
print(df)

Summary: creation methods

  • A two-dimensional array
  • A dictionary of arrays, lists, tuples, or Series: each value becomes a column
  • A nested dictionary: each inner dictionary becomes a column, and the inner keys are merged into the row index
  • A list of dictionaries or Series: each item becomes a row of the DataFrame, and the union of the dictionary keys or Series indexes becomes the column labels
  • A list of lists or tuples
  • Another DataFrame
# Set the names of the row index and the column index
df.index.name = 'year'
df.columns.name='features'
print(df)
print(df.values) # returns the data as a two-dimensional ndarray
frame2.values # with columns of different types, a dtype that can hold all of them is chosen
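The creation methods summarized above can be compared side by side. The sketch below is illustrative only: the variable names and the small `name`/`age`/`rank` data are assumptions made for this example. It builds a frame from a dictionary of lists, a list of dictionaries, a two-dimensional array, and a nested dictionary.

```python
import numpy as np
import pandas as pd

# 1. Dictionary of lists: each value becomes a column.
from_dict = pd.DataFrame({'name': ['wang', 'liu'], 'age': [12, 22]})

# 2. List of dictionaries: each item becomes a row.
from_records = pd.DataFrame([{'name': 'wang', 'age': 12},
                             {'name': 'liu', 'age': 22}])

# 3. Two-dimensional array plus explicit column labels.
from_array = pd.DataFrame(np.array([[12, 1], [22, 2]]), columns=['age', 'rank'])

# 4. Nested dictionary: outer keys are columns, inner keys form the row index.
from_nested = pd.DataFrame({'age': {'wang': 12, 'liu': 22}})

# The first two routes describe the same table.
print(from_dict.equals(from_records))
print(from_nested)
```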

5.3 Index object

  • pandas uses Index objects to hold axis labels and other metadata such as axis names
  • Any array or sequence of labels used to build a Series or DataFrame is converted to an Index
  • An Index is immutable: it cannot be modified once it is created
  • An Index supports set operations similar to those of a Python set
obj = pd.Series(range(3),index=['a','b','c'])
index = obj.index
index[1:]
# An Index cannot be modified once it is created
# index[1] = 'd'
# TypeError: Index does not support mutable operations
# Because of that, one Index can safely be created once and shared among several data structures
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5,-2.5,0],index=labels)
obj2.index is labels
frame3
frame3.columns
# Similar to a Python set, except that an Index may contain duplicate labels
labels_index = pd.Index(['foo','foo','bar','bar'])
labels1 = labels_index.append(labels) # append: concatenate with another Index, returning a new Index
print(labels1)
labels2 = labels1.difference(labels_index) # set difference
print(labels2)
labels3 = labels1.intersection(labels) # set intersection
print(labels3)
# union: set union
labels3.delete(2) # new Index without the element at position 2
labels3.drop(1) # new Index without the value 1
# insert: new Index with a value inserted at position i
# unique: the unique values in the Index
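The `union`, `insert`, and `unique` methods are only named in the comments above, so here is a small sketch of them together with the immutability rule. The Index contents (`idx`, `other`, `'qux'`) are made up for illustration.

```python
import pandas as pd

idx = pd.Index(['foo', 'foo', 'bar'])
other = pd.Index(['bar', 'baz'])

# Item assignment on an Index raises TypeError because an Index is immutable.
try:
    idx[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# Each of these returns a new Index and leaves idx untouched.
merged = idx.union(other)        # set union of the labels
inserted = idx.insert(1, 'qux')  # new Index with 'qux' at position 1
distinct = idx.unique()          # distinct labels in order of first appearance

print(mutated, list(merged), list(inserted), list(distinct))
```

Because every method returns a new object, the original `idx` can keep being shared by other Series or DataFrames.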

5.4 Basic pandas functionality

  • Reindexing: whatever the original index is, build a new object that holds exactly the labels you ask for
  • Dropping entries: rows by default; the axis can be changed
  • Indexing and slicing of Series and DataFrame
  • Label-based selection operators: loc and iloc
  • Pitfalls of integer indexes, e.g. [-1]
  • Index alignment and arithmetic operations
  • Operations between a Series and a DataFrame: by row, by column, broadcasting
  • Function application
obj = pd.Series([4.5,7.2,5.3,3.6],index=['d','b','a','c'])
print(obj)
# reindex: create a new object whose data conforms to the new index
obj2 = obj.reindex(['a','b','c','d','e'])
print(obj2)
# For ordered data such as time series, reindex can fill in values
obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])
print(obj3)
obj3.reindex(range(6),method='ffill')
print(obj3) # unchanged: reindex returns a new object and does not modify obj3
print(obj3.reindex(range(6),method='ffill'))
# DataFrame works similarly
df = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['ohio','california','texas'])
print(df)
df1 = df.reindex(['a','b','c','d'])
print(df1)
# Columns can also be reindexed
states = ['ohio','utah','california']
df2 = df.reindex(columns=states)
print(df2)
# Drop entries from an axis: Series
obj = pd.Series(np.arange(5.),index=['a','b','c','d','e'])
print(obj)
obj1 = obj.drop('c')
print(obj1)
obj2 = obj.drop(['b','c']) # pass several labels as a list
print(obj2)
# DataFrame
df1 = pd.DataFrame(
np.arange(16).reshape(4,4),
index = ['ohio','colorado','utah','newyork'],
columns=['one','two','three','four']
)
print(df1)
df2 = df1.drop(['colorado','utah']) # rows are dropped by default; the result is a new object
print(df2)
print(df1)
df3 = df1.drop(['one','three'],axis=1) # set the axis to drop columns
df4 = df1.drop(['two','four'],axis='columns') # axis='columns' is equivalent to axis=1
print(df3)
print(df4)
# With inplace=True, drop modifies the original object
df5 = df1.drop(['colorado','utah'],inplace=True)
print(df5) # None: an in-place drop returns nothing
print(df1) # the rows are gone from the original data, so use inplace with caution
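# Series indexing
obj = pd.Series(np.arange(4.0),index=['a','b','c','d'])
print(obj['b'])
print(obj[['b','d']])
print(obj.iloc[[2,3]]) # by integer position; plain obj[[2,3]] relies on a positional fallback that newer pandas deprecates
print(obj[obj<2])
# Series slicing
print(obj['b':'d']) # unlike Python slicing, a label slice includes the end point
print(obj[2:3]) # an integer slice is positional and excludes the end, as in Python
obj[2:3] =5 # assigning to a slice modifies the Series
print(obj)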
# DataFrame indexing and slicing
df1 = pd.DataFrame(
np.arange(16).reshape((4,4)),
index=['wang','liu','zhao','qian'],
columns=['one','two','three','four'])
print(df1['one'])
print(df1[['one','four']])
print(df1[:2])
print(df1[df1['three']>5]) # all rows where column 'three' is greater than 5, like a filter in Excel
# Boolean DataFrame
df1[df1<5] = 0 # modifies df1 directly
print(df1)
# Label-based and position-based selection: loc and iloc
df1.loc['wang',['one','three']] # select rows and columns by label, with NumPy-like notation
df1.loc[['wang','zhao'],['one','two']] # select a sub-frame
df1.iloc[2,[3,0,1]] # select by integer position
df1.iloc[:,:3] # slices work as in NumPy
# Select a single scalar
df1.at['wang','one'] # by row and column label
df1.iat[0,0] # by integer position
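# Integer indexes are error-prone
ser1 = pd.Series(np.arange(3.0))
ser1
ser1[1] # a label lookup, not a position: returns the value labeled 1
# ser1[-1] would raise KeyError here because there is no label -1, unlike list or tuple indexing
# With a non-integer index there is no such ambiguity
ser1.index=['a','b','c']
ser1
ser1.iloc[-1] # last element by position; plain ser1[-1] relies on a positional fallback that newer pandas deprecates
# So use loc (labels) and iloc (positions) to be explicit
ser1.index=[0,1,2]
ser1
ser1.loc[:1] # label-based: the rows labeled 0 and 1
ser1.iloc[:1] # position-based: only the row at position 0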
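The difference between label-based and position-based access on an integer index is easy to verify. This is a minimal sketch with a made-up Series `ser`, not code from the original walkthrough:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0), index=[0, 1, 2])

by_label = ser.loc[:1]      # label slice: labels 0 and 1, end point included
by_position = ser.iloc[:1]  # position slice: position 0 only, end excluded
last = ser.iloc[-1]         # last element by position

# With an integer index, ser[-1] is a label lookup and there is no label -1.
try:
    ser[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(len(by_label), len(by_position), last, lookup_failed)
```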
# Index alignment
# The result index is the union of the two indexes; a label found in only one operand gives NaN, since NaN + number = NaN
s1 = pd.Series([7.3,-2.5,3.4, 1.5],index=['a','c','d','e'])
s2 = pd.Series([-2.1,3.6,-1.5,4,3.1],index = ['a','c','e','f','g'])
s1
s2
s1+s2
# For a DataFrame, alignment happens on both the rows and the columns
df1 = pd.DataFrame(
np.arange(9.0).reshape((3,3)),columns=list('bcd'),
index = ['ohio','texas','colorado']
)
df2 = pd.DataFrame(
np.arange(12.0).reshape((4,3)),columns=list('bde'),
index= ['utah', 'ohio','texas','oregon']
)
df1
df2
df1+df2
# Labels that are not in both objects produce NaN
# If the two objects share no row or column labels, the result contains only NaN
df3 = pd.DataFrame({'A': [1, 2]})
df4 = pd.DataFrame({'B': [3, 4]})
df3
df4
df3+df4
# DataFrame arithmetic methods
# add with fill_value=0 treats a value that is missing from one operand as 0
df1 = pd.DataFrame(
np.arange(12.0).reshape((3,4)),
columns=list('abcd'))
df2 = pd.DataFrame(
np.arange(16.0).reshape((4,4)),
columns= list('bcde')
)
print(df1)
print(df2)
df1+df2
df1.add(df2,fill_value=0) # positions missing from both operands are still NaN
df1.radd(df2,fill_value=0) # reversed method: df2 + df1
df1.sub(df2,fill_value=0) # subtraction: df1 - df2
df1.rsub(df2,fill_value=0) # reversed subtraction: df2 - df1
# div, rdiv: division; floordiv, rfloordiv: floor division; mul, rmul: multiplication; pow, rpow: exponentiation
# fill_value can also be given when reindexing
df1.reindex(columns=df2.columns,fill_value=0)
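The limit of `fill_value` noted above (a position missing from both operands stays NaN) can be seen on a small case. The frames `left` and `right` below are made up for illustration:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(4.0).reshape((2, 2)), columns=list('ab'))
right = pd.DataFrame(np.arange(6.0).reshape((2, 3)), columns=list('bcd'))
left.loc[0, 'b'] = np.nan   # make one value missing in left
right.loc[0, 'b'] = np.nan  # and missing at the same position in right

# Plain addition: every column found in only one frame becomes NaN.
plain = left + right

# fill_value=0 stands in for a value missing on one side only.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

After the fill, the only NaN left in `filled` is at row 0, column `b`, where neither frame has a value.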
# Operations between a DataFrame and a Series: by default the Series index is matched against the columns and the operation is repeated for each row
# First consider the difference between a two-dimensional array and one of its rows
arr = np.arange(12.).reshape((3,4))
print(arr)
arr[0]
arr-arr[0] # the subtraction is performed once per row: broadcasting
# DataFrame And Series
df1 = pd.DataFrame(
np.arange(12.).reshape((4,3)),
columns=list('abc'),
index= ['utah','ohio','texas','oregon']
)
series1 = df1.iloc[0]
print(df1)
series1
# By default, arithmetic matches the index of the Series against the columns and operates on every row
df1 - series1
# If a label is not found in both the DataFrame columns and the Series index, the result is reindexed to the union of the two
series2 = pd.Series(range(3),index=list('acd'))
df1 - series2
# To match on the rows and broadcast across the columns instead, use an arithmetic method
series3 = df1['a']
print(df1)
series3
df1.sub(series3,axis='index') # pass the axis to match on
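The two broadcasting directions can be compared on a frame small enough to check by hand. This sketch uses a made-up 3x2 frame; the names `frame`, `first_row`, and `col_a` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.0).reshape((3, 2)),
                     columns=list('ab'), index=['utah', 'ohio', 'texas'])

first_row = frame.iloc[0]  # Series indexed by the column labels
col_a = frame['a']         # Series indexed by the row labels

# Default: match the Series index against the columns, repeat for every row.
by_columns = frame - first_row

# axis='index': match the Series index against the rows, repeat for every column.
by_rows = frame.sub(col_a, axis='index')

print(by_columns)
print(by_rows)
```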
# Function application
# Element-wise NumPy functions such as np.abs
df = pd.DataFrame(
np.random.randn(4,3),columns=list('bde'),
index = ['utah','ohio','texas','oregon']
)
print(df)
np.abs(df) # NumPy element-wise functions also work on pandas objects
# apply passes each column (or row) to a function as a one-dimensional Series: here, the range (max - min) of each column
f = lambda x: x.max() - x.min()
df.apply(f)
# Set axis='columns' to call the function once per row
df.apply(f,axis='columns')
f = lambda x: x.sum()
df.apply(f,axis='index') # sum and mean are already DataFrame methods, so apply is rarely needed for them
# f can also return a Series
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
df.apply(f) # this is one way to write your own descriptive statistics
# Element-wise functions
f_str = lambda x: '%.2f' % x # format each value as a string with two decimals
df.applymap(f_str) # applies to every element; newer pandas names this method DataFrame.map
df['e'].map(f_str) # Series.map applies a function to every element of a Series
# Sorting
# Sort by index
ser1 = pd.Series(range(4),index=['d','a','b','c'])
print(ser1)
ser1.sort_index()
df = pd.DataFrame(
np.arange(8).reshape(2,4),index=['three','one'],
columns=['d','a','b','c'])
print(df)
df.sort_index()
df.sort_index(axis=1) # sort by column labels; recent pandas requires axis as a keyword
df.sort_index(ascending=False) # sorting is ascending by default; descending is also possible
df.sort_index(axis=1,ascending=False)
# Sort by values
ser1.sort_values(ascending=False)
df.sort_values(by='b') # sort by the values of one column
df1 = pd.DataFrame({'b': [1, 2, 3, 1], 'c': [-2, -4, 2, 3]})
df2 = df1.sort_values(by=['b','c'],ascending=[False,False]) # where values in b are equal, rows are ordered by c, also descending
df3 = df1.sort_values(by='b',ascending=False) # the common single-key form; rows with equal b can come out in a different order than above
print(df2)
print(df3)
# Ranking: the rank method
# Ranks are assigned by value; by default, tied values get the mean of their ranks
ser1 = pd.Series([7,-5,7,4,2,0,4])
ser1.rank()
# method='first': tied values are ranked in the order they appear in the data
ser1.rank(method='first')
# Descending order
ser1.rank(ascending=False,method='max') # tied values all get the highest rank in their group
# method='min': tied values all get the lowest rank in their group
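The tie-breaking methods are easiest to compare in one table. The sketch below reuses the same seven values as above; the result names (`average`, `first`, `lowest`, `desc_max`) are chosen for this example.

```python
import pandas as pd

ser = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = ser.rank()              # ties share the mean of their ranks
first = ser.rank(method='first')  # ties ranked by order of appearance
lowest = ser.rank(method='min')   # ties all take the lowest rank
# Descending order, ties all take the highest rank of their group.
desc_max = ser.rank(ascending=False, method='max')

print(pd.DataFrame({'value': ser, 'average': average, 'first': first,
                    'min': lowest, 'desc_max': desc_max}))
```

For the two 7s, which occupy ranks 6 and 7 in ascending order, the methods give 6.5 and 6.5 (average), 6 and 7 (first), and 6 and 6 (min).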
# Indexes with duplicate labels
series = pd.Series(range(5),index=['a','a','b','b','c'])
series
series.index.is_unique # the is_unique attribute is False here
# Selecting a duplicated label returns a Series, while a unique label returns a scalar
series['a']
# The result type therefore depends on the label, which complicates data processing
# Code often assumes that index labels are unique; the same holds for DataFrame
df = pd.DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
df
df.loc['a'] # [] selects columns by default; loc selects rows by label, here both rows labeled 'a'
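The type difference caused by duplicate labels can be checked directly. In this sketch, collapsing the duplicates with `groupby(level=0).sum()` is only one possible way to restore a unique index, chosen here as an assumption; whether summing is appropriate depends on the data.

```python
import pandas as pd

ser = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup = ser['a']     # label occurs twice: a Series comes back
single = ser['c']  # label occurs once: a scalar comes back

# One option: collapse duplicates explicitly before relying on scalar lookups.
deduped = ser.groupby(level=0).sum()

print(type(dup).__name__, type(single).__name__, deduped.index.is_unique)
```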

5.5 Descriptive statistics

  • The statistical methods are built to handle missing data: NaN values are skipped by default
df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns = ['one','two'])
print(df) # the nested list is labeled with the given index and columns
df.sum() # sum of each column, returned as a Series
df.sum(axis=1) # sum of each row; NaN is skipped by default
df.mean(axis='columns',skipna=False) # with skipna=False, a single NaN makes the result NaN
# Hierarchical index: several index levels can be defined on one axis, e.g. index=[['a','a','a','b','b','b','c','c','c','d','d','d'],[1,2,3,1,2,3,1,2,3,1,2,3]]; older pandas let the statistics methods group by level through a level parameter, and groupby(level=...) does the same
# Index labels of extreme values
df.idxmax() # row label of the maximum value in each column
# Cumulative sum
df.cumsum()
# Several summary statistics at once
df.describe()
# For non-numeric data
df1 = pd.Series(['a','a','b','c']*4)
df1.describe()
# Summary of descriptive statistics methods
# count: number of non-NaN values
# describe: summary statistics for a Series or for each DataFrame column
# argmin, argmax: integer position of the minimum and maximum
# idxmin, idxmax: index label of the minimum and maximum
# quantile: sample quantile, for a fraction between 0 and 1
# sum, mean, median
# mad: mean absolute deviation (removed in pandas 2.0)
# var, std, skew, kurt
# cumsum, cummin, cummax, cumprod
# diff: first-order difference
# pct_change: percentage change
# # Correlation coefficient and covariance
# all_data = {
# ticker:web.get_data_yahoo(ticker)
# for ticker in ['AAPL','IBM','MSFT','GOOG'] # dict comprehension
# }
# price = pd.DataFrame(
# {ticker:data['Adj Close']
# for ticker,data in all_data.items()}
# )
# volume = pd.DataFrame(
# {ticker:data['Volume']
# for ticker,data in all_data.items()}
# )
# returns = price.pct_change()
# returns.tail()
# returns['MSFT'].corr(returns['AAPL'])
# returns.MSFT.corr(returns['AAPL'])
# returns['MSFT'].cov(returns['AAPL'])
# returns.corr()
# returns.cov()
# # Another Series can also be passed in
# returns.corrwith(returns.IBM)
# # Passing a DataFrame computes the correlation between columns with the same name
# returns.corrwith(volume)
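Because the example above depends on downloading stock data, here is a sketch of the same `corr`, `cov`, and `corrwith` calls on synthetic data. The tickers `AAA`, `BBB`, and `CCC` and the random-walk prices are invented for illustration and are not market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic random-walk prices for three hypothetical tickers.
steps = rng.normal(0, 1, size=(250, 3))
prices = pd.DataFrame(100 + steps.cumsum(axis=0), columns=['AAA', 'BBB', 'CCC'])

returns = prices.pct_change().dropna()  # period-over-period percentage change

pair_corr = returns['AAA'].corr(returns['BBB'])  # correlation of two Series
pair_cov = returns['AAA'].cov(returns['BBB'])    # covariance of two Series
corr_matrix = returns.corr()                     # full correlation matrix
with_aaa = returns.corrwith(returns['AAA'])      # each column against one Series

print(corr_matrix)
print(with_aaa)
```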
# Unique values, value counts, and membership
obj = pd.Series(['c','a','d','a','a','b','b','c','c'])
# Unique values
uniques = obj.unique()
uniques.sort() # sorts the returned array in place; obj itself is unchanged
uniques
# Value counts
obj.value_counts() # frequency of each value
pd.value_counts(obj.values,sort=False) # top-level function for any array or sequence; newer pandas deprecates it in favor of the value_counts method
# Membership: isin
mask = obj.isin(['b','c'])
mask # Boolean Series: True where the value is 'b' or 'c', False otherwise
obj[mask]
# Index(unique_vals).get_indexer(to_match) gives, for each value in to_match, its integer position in unique_vals
to_match = pd.Series(['c','a','b','b','c','a'])
unique_vals = pd.Series(['c','b','a'])
pd.Index(unique_vals).get_indexer(to_match)
# get_indexer works like a match function
data = pd.DataFrame(
{
'Qu1':[1,3,4,3,4],
'Qu2': [2,3,1,2,3],
'Qu3':[1,5,2,4,4]}
)
print(data)
result = data.apply(pd.Series.value_counts).fillna(0) # value counts per column; values absent from a column are filled with 0
print(result)
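The last two techniques, `get_indexer` and per-column value counts, can be checked on data small enough to verify by hand. The `survey` frame below is a made-up, shortened variant of the example above.

```python
import pandas as pd

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Index(['c', 'b', 'a'])

# Position of each to_match value inside unique_vals (-1 would mean not found).
positions = unique_vals.get_indexer(to_match)

survey = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                       'Qu2': [2, 3, 1, 2, 3]})
# Count each value per column; a value a column never contains becomes 0.
counts = survey.apply(pd.Series.value_counts).fillna(0)

print(positions)
print(counts)
```

In `counts`, the row labels are the distinct values found across all columns, and each cell is how often that value occurs in that column.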
