
[Python] NumPy Illustrated: The Inner Workings of Common Functions


Selected from Medium, author: Lev Maximov

Compiled by Heart of the Machine

NumPy, a software library that supports a large number of operations on multidimensional arrays and matrices, is an essential tool for many machine learning developers and researchers. This article walks through commonly used NumPy functions to help you understand the inner mechanics of how NumPy manipulates arrays.

NumPy is a fundamental library that inspired many popular Python data-processing packages, including pandas, PyTorch, TensorFlow, and Keras. Understanding how NumPy works can improve your skills with those libraries as well. In addition, NumPy code can be run on a GPU with few or no code changes.

The core concept of NumPy is the n-dimensional array. The beauty of n-dimensional arrays is that most operations look the same no matter how many dimensions the array has. One and two dimensions are somewhat special, however. This article is divided into three parts:

1. Vectors: one-dimensional arrays

2. Matrices: two-dimensional arrays

3. Three dimensions and above

This article takes Jay Alammar's article "A Visual Intro to NumPy" as a starting point, expands on it, and makes some minor changes.

NumPy arrays vs. Python lists

At first glance, NumPy arrays are similar to Python lists. Both serve as containers with fast element getting and setting and somewhat slower insertion and removal of elements.

The simplest example of NumPy arrays beating lists is arithmetic:
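A minimal sketch of that comparison follows. The variable names are illustrative and do not come from the original figure.

```python
import numpy as np

values = [1, 2, 3]

# With a plain list, element-wise arithmetic needs an explicit loop.
doubled_list = [v * 2 for v in values]

# With a NumPy array, the operation applies to every element at once.
arr = np.array(values)
doubled_arr = arr * 2

# Note that `values * 2` repeats the list instead of doubling its elements.
repeated_list = values * 2
```

The last line is the classic surprise: the same operator means repetition for lists and arithmetic for arrays.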

Besides that, NumPy arrays are:

More compact, especially when there is more than one dimension;

Faster than lists when the operation can be vectorized;

Slower than lists when you append elements to the end;

Usually homogeneous: they work fast only when all elements are of one type.

Here O(N) means that the time needed to complete the operation is proportional to the size of the array, and O*(1) (so-called "amortized O(1)") means that the time generally does not depend on the size of the array.

Vectors: one-dimensional arrays

Vector initialization

One way to create a NumPy array is to convert a Python list. The array's element type is deduced automatically from the types of the list elements.

Make sure the list you pass in is homogeneous, otherwise you will end up with dtype='object', which hurts speed and leaves you with only the syntactic sugar that NumPy offers.

NumPy arrays cannot grow the way Python lists do: no space is reserved at the end of the array for fast appends. The common practice is therefore either to grow a Python list and convert it to a NumPy array once it is ready, or to preallocate the necessary space with np.zeros or np.empty:

It is often necessary to create an empty array that matches an existing one in shape and element type.

In fact, all the functions that create an array filled with a constant value have a _like counterpart:
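Preallocation and the _like variants can be sketched as follows. The shapes and the fill value 7 are arbitrary choices made for illustration.

```python
import numpy as np

# Preallocate instead of appending: the size is fixed up front.
buf = np.zeros(5)        # float64 zeros
raw = np.empty(5)        # uninitialized memory, contents are arbitrary

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

# The *_like variants take shape and dtype from an existing array.
z = np.zeros_like(a)
o = np.ones_like(a)
f = np.full_like(a, 7)
```

np.empty skips the fill step, so it only makes sense when every element is going to be overwritten anyway.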

Two NumPy functions, arange and linspace, initialize an array with a monotonic sequence:

If you need an array of floats such as [0., 1., 2.], you can change the type of the arange output with arange(3).astype(float), but there is a better way. The arange function is type-sensitive: given integer arguments it generates integers, and given floats (for example, arange(3.)) it generates floats.

But arange is not particularly good at handling floats:

To our eyes, 0.1 looks like a finite decimal number, but not to a computer. In binary, 0.1 is an infinite fraction that has to be rounded somewhere, which inevitably introduces an error. That is why passing a step with a fractional part to arange is generally a bad idea: you may run into an off-by-one error. You can make the end of the interval fall on a non-integer number of steps (solution1 in the original illustration), but that reduces the readability and maintainability of the code. This is where linspace comes in handy. It is immune to rounding errors and always generates the number of elements you ask for. There is a common gotcha with linspace, though: it counts points, not intervals, so the last argument, num, is usually one larger than you might expect. That is why the number in the last example of the original illustration is 11, not 10.
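The behavior described above can be sketched like this. The interval endpoints are assumptions chosen for illustration, and no exact output is claimed for the fractional-step call because it depends on floating-point rounding.

```python
import numpy as np

ints = np.arange(3)       # integer arguments give an integer array
floats = np.arange(3.)    # a float argument gives a float array

# With a fractional step the length is computed in floating-point
# arithmetic, so the result can have one element more than expected.
risky = np.arange(0.5, 0.8, 0.1)

# linspace takes the number of points, endpoints included by default:
# 10 intervals between 0 and 1 need 11 points.
grid = np.linspace(0, 1, 11)
```

With linspace the element count is stated explicitly, so it never depends on how the step happens to round.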

For testing purposes it is usually necessary to generate random arrays:
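One possible way to do this uses NumPy's Generator interface. The seed, ranges, and sizes below are arbitrary assumptions, and the original figure may have used different functions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # seeded so that runs are repeatable

ints = rng.integers(0, 10, size=5)     # integers in [0, 10)
unif = rng.uniform(1, 10, size=5)      # floats in [1, 10)
norm = rng.normal(5, 2, size=5)        # mean 5, standard deviation 2
```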

Vector indexing

Once your data is in an array, NumPy offers very clever and easy ways to get it back out:

All the indexing methods presented above except fancy indexing are so-called "views": they do not store the data, and they reflect changes made to the original array after it was indexed.

All of these methods, including fancy indexing, are mutable: they allow the contents of the original array to be modified through assignment, as shown above. This feature breaks the habit of copying Python lists by slicing them:

Python lists vs. NumPy arrays
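The difference between views and copies, and between list slices and array slices, can be sketched as follows. All names are illustrative.

```python
import numpy as np

lst = [1, 2, 3]
lst_copy = lst[:]        # slicing a list makes a copy
lst_copy[0] = 99         # lst is unchanged

arr = np.array([1, 2, 3])
arr_view = arr[:]        # slicing an array makes a view
arr_view[0] = 99         # arr changes too

a = np.arange(1, 6)      # [1 2 3 4 5]
view = a[1:4]            # basic slice: shares memory with a
fancy = a[[1, 2, 3]]     # fancy indexing: an independent copy
a[2] = 100               # visible through `view`, not through `fancy`
a[[0, 4]] = 0            # assigning through a fancy index still writes into a
```

When an independent copy of an array is needed, a.copy() is the explicit way to get one.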

Another super-useful way of getting data out of a NumPy array is boolean indexing, which supports all kinds of logical operators:

any and all act like their Python counterparts but do not short-circuit.

Note, though, that Python's chained ("ternary") comparisons such as 3<=a<=5 are not supported here.

As shown above, boolean indexing is also writable. Two of its common use cases have dedicated functions of their own: the overloaded np.where function and np.clip. Their meanings are as follows:
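A short sketch of boolean indexing together with np.where and np.clip. The sample data and thresholds are assumptions made for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1])

# Combine conditions with & and parentheses instead of 3 <= a <= 5.
mask = (a >= 3) & (a <= 5)
selected = a[mask]

b = a.copy()
b[b > 5] = 0                     # boolean indexing is writable

idx = np.where(a > 5)[0]         # one-argument form: indices where the mask is True
capped = np.where(a > 5, 5, a)   # three-argument form: element-wise choice
clipped = np.clip(a, 2, 5)       # limit the values to the range [2, 5]
```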

Vector operations

Arithmetic is one of the places where NumPy's speed shines most. Vector operators are executed in compiled code, which avoids the cost of slow Python loops. NumPy lets you operate on whole arrays just as you would on ordinary numbers.

As in regular Python, a//b means a divided by b with the result rounded down (floor division), and x**n means xⁿ.

Just as integers are converted to floats when added to or subtracted from floats, scalars are converted to arrays. In NumPy this process is called broadcasting.

Most mathematical functions have NumPy counterparts that can handle vectors:

The scalar product has an operator of its own:
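The vector operations discussed so far can be summarized in one sketch. The two sample vectors are assumptions chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

summed = a + b             # element-wise addition
shifted = a + 1            # the scalar 1 is broadcast to every element
halves = a / 2             # true division always yields floats
quot = b // a              # floor division
squares = a ** 2           # element-wise power
roots = np.sqrt(squares)   # NumPy counterpart of math.sqrt, works on arrays

dot = a @ b                # scalar product: 1*4 + 2*5 + 3*6
```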

You do not need loops for trigonometric functions either:

Arrays can be rounded as a whole:

floor rounds down (toward -∞), ceil rounds up (toward +∞), and around rounds to the nearest integer (with .5 rounded to the nearest even integer).
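The three rounding modes can be compared on a few sample values. The input array is an assumption chosen to include exact halves.

```python
import numpy as np

a = np.array([-2.5, -1.5, 0.5, 1.5, 2.5, 2.7])

down = np.floor(a)       # toward negative infinity
up = np.ceil(a)          # toward positive infinity
nearest = np.around(a)   # to the nearest integer; exact halves go to the even neighbor
```

Note that 0.5 and 2.5 round to 0 and 2, while 1.5 rounds to 2.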

NumPy Can also perform basic statistical operations :

NumPy's sorting functions are less capable than their Python counterparts:

Comparison of the sorting functions of Python lists and NumPy arrays

In the one-dimensional case, the missing reversed keyword is easily compensated for by reversing the result. The two-dimensional case is trickier (the feature has been requested).

Searching for an element in a vector

Unlike Python lists, NumPy arrays have no index method. The feature has been requested for a long time but has not been implemented yet.

Python lists vs. NumPy arrays. The square brackets in index() mean that j, or both i and j, can be omitted.

One way to find an element is np.where(a==x)[0][0], which is neither elegant nor fast, because it has to examine every element of the array even when the target is at the very beginning.

A faster way is to use Numba to accelerate next((i[0] for i, v in np.ndenumerate(a) if v==x), -1).

Once the array is sorted, searching gets much easier: v = np.searchsorted(a, x); return v if v<len(a) and a[v]==x else -1 is very fast, with O(log N) time complexity, but it requires O(N log N) time for sorting first.
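The sorted-array lookup can be wrapped in a small helper. The function name find_sorted is hypothetical, and the bounds check is needed because searchsorted returns len(a) when x is larger than every element.

```python
import numpy as np

def find_sorted(a, x):
    """Return the index of x in the sorted 1D array a, or -1 if x is absent."""
    v = np.searchsorted(a, x)    # leftmost insertion point, O(log N)
    # v == len(a) happens when x exceeds every element, so check it first.
    if v < len(a) and a[v] == x:
        return int(v)
    return -1

a = np.sort(np.array([7, 3, 9, 1, 5]))   # sorting costs O(N log N) once

hit = find_sorted(a, 5)
miss_inside = find_sorted(a, 4)
miss_above = find_sorted(a, 10)
```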

In fact, speeding up the search by implementing it in C is not a problem. The problem is floating-point comparison, a task that simply does not work out of the box for arbitrary data.

Comparing floating-point numbers

The function np.allclose(a, b) compares arrays of floats within a given tolerance.

How np.allclose(a, b) works, by example. There is no universal way!

np.allclose assumes that all the compared numbers are on a typical scale of 1. For example, if you work with nanoseconds, you need to divide the default atol value by 1e9: np.allclose(1e-9, 2e-9, atol=1e-17) == False.

math.isclose makes no assumptions about the numbers being compared and relies on the user to supply a reasonable abs_tol value (for values on a typical scale of 1, the default np.allclose atol of 1e-8 is good enough): math.isclose(0.1+0.2-0.3, 0.0, abs_tol=1e-8) == True.

Besides that, np.allclose has some minor issues in its formula for absolute and relative tolerances. For example, for certain a and b, allclose(a, b) != allclose(b, a). These issues have been resolved in the (scalar) function math.isclose, which was introduced later. For more on this, see the floating-point guide (https://floating-point-gui.de/errors/comparison/) and the corresponding NumPy issue on GitHub.
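The tolerance behavior described in this section can be checked with a few scalar comparisons. The variable names are illustrative. np.allclose tests |a - b| <= atol + rtol * |b| with defaults atol=1e-8 and rtol=1e-5.

```python
import math
import numpy as np

# Exact comparison fails because of rounding error.
exact = (0.1 + 0.2 == 0.3)

# A tolerance-based comparison succeeds.
close = np.allclose(0.1 + 0.2, 0.3)

# For tiny magnitudes the default atol dominates and hides real differences.
tiny_default = np.allclose(1e-9, 2e-9)
tiny_scaled = np.allclose(1e-9, 2e-9, atol=1e-17)

# math.isclose uses abs_tol=0.0 by default, so comparing against zero
# needs an explicit abs_tol.
zero_default = math.isclose(0.1 + 0.2 - 0.3, 0.0)
zero_tol = math.isclose(0.1 + 0.2 - 0.3, 0.0, abs_tol=1e-8)
```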

Matrices: two-dimensional arrays

NumPy used to have a dedicated matrix class, but it is deprecated now, so this article uses the terms "matrix" and "two-dimensional array" interchangeably.

The syntax for initializing a matrix is similar to that for vectors:

Double parentheses are required here because the second positional parameter is reserved for dtype (optional, and it also accepts integers).

Random matrix generation is also similar to that of vectors:

Two-dimensional indexing syntax is more convenient than that of nested lists:

The "view" label means that no copying is actually done when an array is sliced. When the array is modified, the changes are reflected in the slice as well.

The axis argument

In many operations (sum, for example) you need to tell NumPy whether to operate across rows or columns. To have a universal notation that works for any number of dimensions, NumPy introduces the notion of axis: the value of the axis argument is the number of the index in question. The first index is axis=0, the second is axis=1, and so on. So in two dimensions, axis=0 is column-wise and axis=1 is row-wise.
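A small sketch of the axis argument with sum. The 2x3 matrix is an assumption chosen for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = a.sum()             # all elements
col_sums = a.sum(axis=0)    # collapses the first index: one sum per column
row_sums = a.sum(axis=1)    # collapses the second index: one sum per row
```

A handy way to remember it: the axis you name is the one that disappears from the result's shape.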

Matrix arithmetic

In addition to the ordinary operators that act element by element (such as +, -, *, /, //, and **), there is the @ operator, which computes the matrix product:

We covered scalar-to-array broadcasting in the first part. As a generalization of it, NumPy supports mixed operations between a vector and a matrix, and even between two vectors:

Broadcasting in two-dimensional arrays
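Mixed matrix and vector operations can be sketched as follows. The values are assumptions chosen so that the effect of each operand is easy to spot, and multiplication is used here only as an example operator.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
row = np.array([1, 10, 100])   # shape (3,), treated as a row
col = row[:, None]             # shape (3, 1), a column

by_row = m * row     # every row of m is multiplied by `row`
by_col = m * col     # every column of m is multiplied by `col`
outer = col * row    # (3, 1) combined with (3,) broadcasts to (3, 3)
matprod = m @ row    # matrix-vector product, shape (3,)
```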

Row vectors and column vectors

As the example above shows, row vectors and column vectors are treated differently in two dimensions. This contrasts with the usual NumPy practice of having a single kind of one-dimensional array wherever possible (for example, a[:,j], the j-th column of a two-dimensional array a, is a one-dimensional array). By default, one-dimensional arrays are treated as row vectors in two-dimensional operations, so when multiplying a matrix by a row vector you can use either the shape (n,) or (1, n) and the result is the same. If you need a column vector, there are several ways to get one from a one-dimensional array, but, surprisingly, transposition is not one of them.

Two operations turn a one-dimensional array into a two-dimensional one: reshaping with reshape and indexing with newaxis:

Here the -1 argument tells reshape to compute the size of one of the dimensions automatically, and None in square brackets serves as a shortcut for np.newaxis, which adds an empty axis at the designated place.
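Both conversions, and the fact that transposing a one-dimensional array does nothing, can be sketched like this. The names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])      # 1D, shape (3,)

same = a.T                   # transposing a 1D array changes nothing

col_a = a.reshape(-1, 1)     # shape (3, 1); -1 lets NumPy infer the size
col_b = a[:, None]           # same result; None is np.newaxis
row_a = a.reshape(1, -1)     # shape (1, 3)
row_b = a[None, :]           # same result
```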

So NumPy has three kinds of vectors in total: one-dimensional vectors, two-dimensional row vectors, and two-dimensional column vectors. The figure below shows the conversions between them:

Conversions between one-dimensional vectors, two-dimensional row vectors, and two-dimensional column vectors. By the rules of broadcasting, a one-dimensional array is implicitly treated as a two-dimensional row vector, so there is usually no need to convert between the two, which is why the corresponding area is shaded.

Matrix manipulations

There are two main functions for joining arrays, hstack and vstack:

Both work fine when stacking only matrices or only vectors, but when one-dimensional arrays and matrices are mixed, only vstack works as expected: hstack raises a dimension-mismatch error because, as described above, a one-dimensional array is treated as a row vector, not a column vector. The workaround is either to convert it to a column vector or to use the column_stack function, which does this automatically:
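Mixed stacking of a matrix and a one-dimensional array can be sketched as follows. The sample values are assumptions.

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])
v = np.array([5, 6])

below = np.vstack((m, v))               # the 1D array acts as a row: works

try:
    np.hstack((m, v))                   # 2D next to 1D: dimensions do not match
    hstack_failed = False
except ValueError:
    hstack_failed = True

right_a = np.hstack((m, v[:, None]))    # make v a column first
right_b = np.column_stack((m, v))       # or let column_stack do it
```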

The inverse of stacking is splitting:

There are two ways to replicate a matrix: tile acts like copy-pasting, and repeat acts like collated printing:
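The difference between the two is easiest to see on a tiny matrix. The repetition counts are arbitrary.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

tiled = np.tile(a, (2, 3))        # the whole block laid out 2 times down, 3 times across
rep_cols = a.repeat(3, axis=1)    # each column repeated 3 times in place
rep_rows = a.repeat(2, axis=0)    # each row repeated 2 times in place
```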

delete removes specific rows and columns:

The inverse of deletion is insertion, that is, insert:

Like hstack, the append function cannot automatically transpose one-dimensional arrays, so once again you need to either reshape the vector, add a dimension, or use column_stack instead:

In fact, if all you need is to add constant values to the borders of an array, the (slightly overcomplicated) pad function should suffice:
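The deletion, insertion, appending, and padding functions described above can be tried on one small matrix. The indices and fill values are assumptions chosen for illustration.

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)       # 3 rows, 4 columns

no_row1 = np.delete(a, 1, axis=0)        # drop row 1
no_cols = np.delete(a, [1, 3], axis=1)   # drop columns 1 and 3

with_row = np.insert(a, 1, 0, axis=0)    # insert a row of zeros before row 1

# append needs a 2D column here; a plain 1D array would not fit.
ones_col = np.ones((3, 1), dtype=a.dtype)
with_col = np.append(a, ones_col, axis=1)

# One layer of zeros on every side.
padded = np.pad(a, 1, mode="constant", constant_values=0)
```

All of these return new arrays and leave a untouched.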

Meshgrids

Broadcasting rules make working with meshgrids simpler. Suppose you need the following matrix (but of a very large size):

Comparison of creating the matrix the C way and the Python way

Both approaches are slow because they use Python loops. The MATLAB way of dealing with such problems is to create a meshgrid:

Sketch of creating a meshgrid the MATLAB way

The meshgrid function accepts an arbitrary set of indices as input, mgrid accepts only slices, and indices can only generate complete index ranges. fromfunction calls the provided function just once, with the I and J arguments described above.

But actually NumPy has a better way. There is no need to spend memory on the whole I and J matrices. It is enough to store vectors of the right shape, and broadcasting rules take care of the rest.

Sketch of creating a meshgrid the NumPy way

Without the indexing='ij' argument, meshgrid changes the order of the arguments: J, I = np.meshgrid(j, i). This is the 'xy' mode, which is useful for visualizing 3D plots.
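The dense and the broadcasting-based approaches can be compared in a short sketch. The formula 10*I + J and the sizes are assumptions, since the matrix from the original figure is not reproduced here.

```python
import numpy as np

i = np.arange(3)
j = np.arange(4)

# Full index matrices, MATLAB style: both have shape (3, 4).
I, J = np.meshgrid(i, j, indexing='ij')
dense = 10 * I + J

# Memory-friendly variant: one column and one row, combined by broadcasting.
I_s, J_s = np.meshgrid(i, j, indexing='ij', sparse=True)
sparse = 10 * I_s + J_s

# The same thing written by hand.
manual = 10 * i[:, None] + j[None, :]

# The default 'xy' mode swaps the roles of the first two axes.
X, Y = np.meshgrid(i, j)
```

The sparse variant stores only 3 + 4 index values instead of two full 3x4 matrices, which matters once the grid gets large.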

Matrix statistics

Just like sum, the functions min, max, argmin, argmax, mean, std, var, and all the other statistical functions accept the axis argument and act accordingly:

Three examples of statistical functions. To avoid a clash with Python's min, the corresponding NumPy function is also available under the name np.amin.

In two and more dimensions, argmin and argmax return the first occurrence of the minimum and maximum, respectively, and, somewhat inconveniently, they return a flattened index. To convert it to two coordinates, use the unravel_index function:

Example of using unravel_index
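A minimal sketch of recovering two-dimensional coordinates from a flattened index, together with axis-wise statistics. The sample matrix is an assumption.

```python
import numpy as np

a = np.array([[4, 8, 5],
              [9, 3, 1]])

flat = a.argmin()                             # index into the flattened array
row, col = np.unravel_index(flat, a.shape)    # back to 2D coordinates

col_min = a.min(axis=0)                       # minimum of every column
row_argmax = a.argmax(axis=1)                 # position of the maximum in every row
```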

The all and any functions also accept the axis argument:

Example of using all and any

Sorting matrices

As helpful as the axis argument is for the functions listed above, it is of little help for two-dimensional sorting:

Comparison of sorting with Python lists and NumPy arrays

This is usually not what you want when sorting a matrix or a spreadsheet: axis is no replacement for the key argument. Fortunately, NumPy has several helper functions that allow sorting by a column, or by several columns if required:

1. a[a[:,0].argsort()] sorts the array by the first column:

Here argsort returns the array of indices that would sort the original array.

This trick can be repeated, but care must be taken that the next sort does not mess up the results of the previous one:

a = a[a[:,2].argsort()]

a = a[a[:,1].argsort(kind='stable')]

a = a[a[:,0].argsort(kind='stable')]

2. The lexsort function sorts by all the given columns in the manner described above, but it always operates on rows, and the order of the rows to be sorted is reversed (from bottom to top), so its usage is a bit contrived, for example:

- a[np.lexsort(np.flipud(a[:,[2,5]].T))] sorts by column 2 first and then (where the values in column 2 are equal) by column 5.

- a[np.lexsort(np.flipud(a.T))] sorts by all columns, from left to right.

Here flipud flips the matrix in the up-down direction (to be precise, along axis=0, the same as a[::-1,...], where the three dots mean "all other dimensions"), so, perhaps surprisingly, it is flipud, not fliplr, that flips one-dimensional arrays.

3. sort also has an order argument, but it is neither fast nor easy to use if you start with an ordinary (unstructured) array.

4. Doing it in pandas may be a better choice, because this particular operation is far more readable and less error-prone there:

- pd.DataFrame(a).sort_values(by=[2,5]).to_numpy() sorts by column 2 first and then by column 5.

- pd.DataFrame(a).sort_values(by=list(range(a.shape[1]))).to_numpy() sorts by all columns, from left to right.
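The NumPy-only recipes from the list above can be checked against each other on a small matrix. The data is an assumption, chosen to contain ties in the first column.

```python
import numpy as np

a = np.array([[3, 1, 2],
              [1, 2, 9],
              [3, 1, 0],
              [1, 0, 5]])

# Sort by column 0 only; a stable sort keeps tied rows in their original order.
by_first = a[a[:, 0].argsort(kind='stable')]

# Sort by column 0, then 1, then 2: apply stable sorts from the least
# significant key to the most significant one.
b = a[a[:, 2].argsort()]
b = b[b[:, 1].argsort(kind='stable')]
b = b[b[:, 0].argsort(kind='stable')]

# lexsort treats the LAST key as the primary one, hence the flip.
c = a[np.lexsort(np.flipud(a.T))]

# Sort by column 1 first, then by column 2.
d = a[np.lexsort(np.flipud(a[:, [1, 2]].T))]
```

Both b and c should match what Python's sorted() gives for the same rows as nested lists.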

Three dimensions and above

When you create a 3D array by reshaping a one-dimensional vector or by converting a nested Python list, the meaning of the indices is (z,y,x). The first index is the number of the plane, followed by the coordinates within that plane:

Illustration of the (z,y,x) order

This index order is convenient, for example, for keeping a bunch of grayscale images: a[i] is a shortcut for referencing the i-th image.

But this index order is not universal. When working with RGB images, the (y,x,z) order is usually used: first the two pixel coordinates, and last the color coordinate (RGB in Matplotlib, BGR in OpenCV):

Illustration of the (y,x,z) order

This way it is convenient to reference a specific pixel: a[i,j] gives the RGB tuple of the pixel at position (i,j).

So the actual command for creating a given geometry depends on the conventions of the field you are working in:

Creating general 3D arrays and RGB images
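The two conventions can be sketched side by side. The sizes and the pixel value are assumptions chosen for illustration.

```python
import numpy as np

# General 3D array, (z, y, x) order: 2 planes of 3 rows and 4 columns.
vol = np.arange(24).reshape(2, 3, 4)
plane0 = vol[0]                  # the first plane, shape (3, 4)

# RGB image, (y, x, z) order: 3 rows, 4 columns, 3 color channels.
img = np.zeros((3, 4, 3), dtype=np.uint8)
img[1, 2] = (255, 128, 0)        # set one pixel
pixel = img[1, 2]                # the RGB triple of that pixel
```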

Evidently, the hstack, vstack, and dstack functions are not aware of these conventions. They hardcode the (y,x,z) index order, that is, the order of RGB images:

Illustration of how NumPy stacks RGB images using the (y,x,z) order (only two colors here)

If your data is laid out differently, it is more convenient to stack images with the concatenate command, passing it an explicit index number in the axis argument:

Stacking general 3D arrays
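For 3D inputs, the relation between concatenate and the stack helpers can be verified by comparing shapes. The input arrays are placeholders.

```python
import numpy as np

a = np.zeros((2, 3, 4))
b = np.ones((2, 3, 4))

along0 = np.concatenate((a, b), axis=0)   # shape (4, 3, 4)
along1 = np.concatenate((a, b), axis=1)   # shape (2, 6, 4)
along2 = np.concatenate((a, b), axis=2)   # shape (2, 3, 8)

# For 3D inputs each *stack helper is fixed to one axis.
v = np.vstack((a, b))    # same as axis=0
h = np.hstack((a, b))    # same as axis=1
d = np.dstack((a, b))    # same as axis=2
```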

If you are not comfortable thinking in terms of axis numbers, you can convert the array to the form that is hardcoded into hstack and its relatives:

Illustration of converting an array to the form hardcoded into hstack

This conversion is cheap: no actual copying takes place, only the order of the indices is shuffled on the fly.

Another operation that shuffles the index order is array transposition. Examining it may make you more familiar with 3D arrays. Depending on the axis order you have decided on, the actual command for transposing all planes of the array differs: for general arrays it swaps indices 1 and 2, and for RGB images it swaps 0 and 1:

Commands for transposing all planes of a 3D array

Interestingly, though, the default axes argument of transpose (as well as the only mode of operation of a.T) reverses the order of the indices, which agrees with neither of the index order conventions described above.
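The three kinds of transposition can be compared by looking at the resulting shapes. The array sizes are assumptions.

```python
import numpy as np

vol = np.arange(24).reshape(2, 3, 4)    # general array, (z, y, x)
img = np.zeros((3, 4, 3))               # RGB image, (y, x, z)

# Transpose every plane of a general array: swap indices 1 and 2.
planes_t = vol.transpose(0, 2, 1)       # shape (2, 4, 3)

# Transpose an RGB image: swap indices 0 and 1.
img_t = img.transpose(1, 0, 2)          # shape (4, 3, 3)

# The default reverses all indices, matching neither convention.
reversed_all = vol.T                    # shape (4, 3, 2)
```

All three results are views, so no data is copied.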

Finally, there is a function that can save you a lot of Python loops when dealing with multidimensional arrays and make your code more concise: einsum (Einstein summation):

It sums the arrays along the repeated indices. In this particular example, np.tensordot(a, b, axes=1) would suffice in both cases, but in more complex cases einsum may work faster and is generally easier to both write and read, once you understand the logic behind it.
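A minimal einsum sketch follows. The arrays and subscript strings are assumptions, since the example from the original figure is not reproduced here.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# The repeated index j is summed over: this is the matrix product.
e = np.einsum('ij,jk->ik', a, b)
t = np.tensordot(a, b, axes=1)
m = a @ b

# A repeated index within one operand gives the trace.
sq = np.arange(9).reshape(3, 3)
trace = np.einsum('ii', sq)
```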

If you want to test your NumPy skills, there are 100 rather difficult exercises on GitHub: https://github.com/rougier/numpy-100.

What is your favorite NumPy function? Please share it with us!

Original article: https://medium.com/better-programming/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d
