
Multi-label algorithm: MASP theory and Python code analysis


This article is based on the paper published by my tutor and senior: Xue-Yang Min, Kun Qian, Ben-Wen Zhang, Guojie Song, and Fan Min, "Multi-label active learning through serial-parallel neural networks", Knowledge-Based Systems (2022). The paper itself can be consulted directly; this article mainly analyzes the algorithm and studies the code, together with my own understanding.


Catalog

Preface

1. Multi-label concept preparation

1.1 What is multi-label

1.2 The multi-label model

1.3 Other aspects of multi-label learning

2. Active learning in MASP (the learning scenario)

2.1 Cold start

2.2 Active learning (additional queries)

2.3 Query criteria in active learning

3. The neural network in MASP (the learning model)

4. Multi-label evaluation metrics

4.1 The bottleneck of traditional evaluation

4.2 Confusion matrix and derived metrics

4.3 Related evaluation schemes and curves

4.3.1 ROC curve and AUC value

4.3.2 PR curve

4.3.3 F1 curve

5. The main framework of the program

5.1 Overview

5.2 The test_active_learning function

5.3 The Masp constructor

5.4 The MultiLabelData constructor

5.5 The MultiLabelAnn and ParallelAnn constructors

6. Representative functions

6.1 Network training: the one_round_train and bounded_train functions

6.2 Cold start and active learning: the two_stage_active_learn function

· About the parameters

6.2.1 Cold-start batch and active-learning batch computation

6.2.3 Cold start

6.2.4 Active learning

6.3 F1 computation: my_test and compute_f1

Epilogue


Preface

MASP is short for "Multi-label Active learning through Serial-Parallel neural networks". It is a learning model and algorithm built from serial and parallel neural networks, and it takes active learning as the learning scenario to solve multi-label learning problems efficiently.

1. Multi-label concept preparation

1.1 What is multi-label

Most of the machine learning algorithms in my previous articles focused on the iris dataset. As everyone knows, this dataset has three classes:

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

In the learning process we need to predict the unique label of a given data row; this is a multi-class problem. For example, for the following three iris data rows, the correct inference is the label "Iris-setosa".

4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa

However, label prediction in machine learning also has a multi-label situation. In a multi-label problem, the label of a data row is not so single; it may look like the rows below, where each row can carry several labels at once. Therefore, if the total number of labels is \(L\), it is not hard to see that a data row has \(2^L\) possible label subsets.

4.3,3.0,1.1,0.1,Iris-setosa,Iris-versicolor
5.8,4.0,1.2,0.2,Iris-virginica
5.7,4.4,1.5,0.4,Iris-versicolor,Iris-virginica
6.4,3.1,5.5,1.8,None
6.0,3.0,4.8,1.8,Iris-setosa,Iris-virginica

As a special case, take the total number of labels to be \(L = 1\). Then there are \(2^1 = 2\) possibilities and we get a binary classification problem: a binary assertion that each data row is either 1 or 0. The single classifier inside AdaBoost performs exactly this operation; it is also the simplest classifier.

When \(L > 1\) we have the general multi-label problem. If, additionally, the label choices are restricted to be mutually exclusive, we fall back to the ordinary multi-class problem.

1.2 The multi-label model

When we discussed multi-class problems we had a fixed \(N \times 1\) label column whose stored values range over 0 to \(L-1\), each indicating one label. Entering the multi-label field, this stored "column" expands into a "matrix": an \(N \times L\) label matrix. Together with the \(N \times M\) matrix storing the instances in the dataset, it forms a two-tuple:\[S=(\mathbf{X}, \mathbf{Y}) \tag{1}\]

where:

  • \(\mathbf{X}=\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}\right)^{\mathrm{T}}=\left(x_{i j}\right)_{N \times M}\) is the conditional attribute matrix.
  • \(\mathbf{Y}=\left(\mathbf{y}_{1}, \mathbf{y}_{2}, \ldots, \mathbf{y}_{N}\right)^{\mathrm{T}}=\left(y_{i j}\right)_{N \times L}\) is the label matrix.
  • \(N\) is the number of objects, i.e. the number of data rows in the dataset.
  • \(M\) is the number of condition attributes.
  • \(L\) is the number of labels.
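
In code, the two-tuple \(S=(\mathbf{X}, \mathbf{Y})\) is simply a pair of arrays. A minimal NumPy sketch (the values are invented for illustration, reusing the iris-flavored rows above):

```python
import numpy as np

# Conditional attribute matrix X: N = 5 instances, M = 4 attributes (iris-style).
X = np.array([
    [4.3, 3.0, 1.1, 0.1],
    [5.8, 4.0, 1.2, 0.2],
    [5.7, 4.4, 1.5, 0.4],
    [6.4, 3.1, 5.5, 1.8],
    [6.0, 3.0, 4.8, 1.8],
])

# Label matrix Y: N = 5 instances, L = 3 labels; +1 = present, -1 = absent.
Y = np.array([
    [ 1,  1, -1],   # setosa, versicolor
    [-1, -1,  1],   # virginica
    [-1,  1,  1],   # versicolor, virginica
    [-1, -1, -1],   # no label at all
    [ 1, -1,  1],   # setosa, virginica
])

N, M = X.shape
_, L = Y.shape
print(N, M, L)    # 5 4 3
print(2 ** L)     # 8 possible label subsets per data row
```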

Here each label value is not drawn from a large range as in multi-class problems; it is essentially Boolean. A common convention defines \(y_{ij}=1\) as "present" and \(y_{ij}=-1\) as "absent"; some other datasets use \(y_{ij}=0\) to represent absence instead. This is understandable: in real life uncertain data far outnumbers certain data. Acquiring data is easy, but determining what it means, or spending the time to define what it means, is genuinely expensive. This is also one of the motivations behind active learning (Active Learning), which appears later.

1.3 Other aspects of multi-label learning

Multi-label learning has some characteristic issues, for example label correlation, which many multi-label algorithms must address. Common algorithms cut in from different angles: ignoring correlation, considering pairwise correlation, or considering dependencies among more than two labels. For example, BP-MLL considers pairwise correlation, while the LIFT algorithm discards correlation altogether.

BP-MLL original paper: Zhang, M.-L., & Zhou, Z.-H. (2006). Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.

My teacher has a blog post about it: BP-MLL

LIFT original paper: Zhang, M.-L., & Wu, L. (2014). LIFT: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 107–120.
My teacher has a blog post about it: LIFT

In addition, the label value can also be a probability in [0,1] rather than Boolean; that is the label distribution learning problem.

For more on this topic, see the lecture notes on multi-label learning (internal discussion, to be continued) on Fan Min's CSDN blog.

2. Active learning in MASP (the learning scenario)

Given the label-missing phenomenon above and how well it coincides with the motivation of active learning, multi-label active learning is clearly feasible. MASP proposes the following model: cold-start multi-label learning under a limited budget.

  1. Determine the expert query budget \(T\).
  2. Through a certain algorithmic analysis, pick a suitable subset \(N^{\prime}\) of data samples from the \(N\) in the dataset.
  3. Cold-start on the \(N^{\prime}\) instances under a small budget (i.e. begin training the neural network); during the cold start, experts are asked to label the cold-start data, and the cost of this step is kept small enough.
  4. Based on the network trained in the previous round, filter out another batch of data \(Q\) and hand it to the experts for labeling, thereby assisting the network's learning (this operation loops).
  5. Once the expert query budget \(T\) is exhausted, the known labels are no longer updated for subsequent training; the network keeps training on the already-queried labels until the recognition rate no longer increases significantly.
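
The five steps can be sketched as a control-flow skeleton. Everything below is a stand-in, not MASP's actual API (`two_stage_sketch` is my own name, the expert oracle and the selection heuristics are simulated by random choices); only the budgeted two-stage query loop mirrors the description above:

```python
import random

def two_stage_sketch(num_instances, num_labels, T, cold_start_size, batch_size):
    """Control-flow skeleton of the five steps above.

    The representativeness/sparsity/uncertainty heuristics and the expert
    oracle are all replaced by random stand-ins; only the budgeted
    cold-start-then-query loop follows the described scenario."""
    queried = set()                           # Q: (row, label) points asked so far

    # Steps 1-3: cold start -- pick some representative rows, query one label each.
    for i in random.sample(range(num_instances), cold_start_size):
        queried.add((i, 0))

    # Steps 4-5: batch-wise additional queries until the budget T is used up.
    while len(queried) < T:
        candidates = [(i, k) for i in range(num_instances)
                      for k in range(num_labels) if (i, k) not in queried]
        batch = random.sample(candidates, min(batch_size, T - len(queried)))
        queried.update(batch)                 # the expert labels this batch
        # ... here the network would be retrained on the enlarged label set ...
    return queried

q = two_stage_sketch(num_instances=20, num_labels=3, T=15,
                     cold_start_size=5, batch_size=4)
print(len(q))   # exactly the budget T
```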

In this whole process, every part that needs the assistance of the target label matrix is completed via expert queries; that is the core of active learning (in the code, this appears as queries into the target matrix). The most distinctive part of the process is the cold start, which establishes a basic neural network and lets us later extract information about still-unqueried labels, which in turn feeds further features back into the network. I try to describe this process with two pictures.

2.1 Cold start

Cold Start Stage

This picture illustrates the cold-start phase of steps 2 and 3. In the picture, \(Y^{\prime}\) is the label matrix marked by the experts based on \(X\). It has fewer rows than the full label matrix \(Y\), but \(Y\) is the unknowable part, so the known \(Y^{\prime}\) provides the guidance for the next stage of learning. \(X^{\prime}\) in the picture is obtained from the feature matrix \(X\) by preprocessing.

This preprocessing yields the set of training data rows \(U\), while the labels with high sparsity values among the expert-marked data determine the columns \(V\); together they give the cold-start query point set \(\{(i,j)\mid i \in U, j \in V\}\). When the network is then trained, the labels provided by \(Y^{\prime}\) serve as the fitting targets of the network's loss function. After a finite number of training batches we obtain an evaluation quantity: the uncertainty. This quantity once again guides the experts to mark the most critical labels, further pushing the transformation of \(Y\) into \(Y^{\prime}\) and perfecting \(Y^{\prime}\).

2.2 Active learning (additional queries)

The introduction of sparsity in the cold-start phase already carries some flavor of active learning, but the repeated refreshing of uncertainty afterwards shows the active-learning character much more clearly:

Active learning stage

The "uncertainty" information provided after the previous round's cold start has pushed the experts to further improve \(Y^{\prime}\), which also enlarges the queryable part of \(X^{\prime}\); the network is then trained further on the new \(S^{\prime} = (\mathbf{X}^{\prime}, \mathbf{Y}^{\prime})\). Training produces a new uncertainty, which again guides the experts' labeling and updates \(Y^{\prime}\), forming the cycle train -> obtain new uncertainty -> train ... . When the expert queries reach the budget, the final prediction is output.

2.3 Query criteria in active learning

  1. Representativeness: the preprocessing that turns \(X\) into \(X^{\prime}\) is Density Peak, an indicator describing how critical an instance's features are within the data itself. The Density Peak theory comes from the 2014 Nature paper "Clustering by fast search and find of density peaks"; the algorithm it proposes is a powerful clustering scheme. I described an implementation of this evaluation algorithm in my ALEC active learning article (section 2.1); see there for details.
  2. Sparsity: during active learning, when the experts are asked to label, labels that have rarely been judged are offered preferentially. For example, if up to now every picture shown to the experts asked about "dog", "cat" and "mouse" while "rabbit" was hardly ever asked, then next time the probability of querying with "rabbit" increases. The paper provides the formula:\[\psi_{k}=1-\frac{|\mathbf{Q} \cap\{1,2, \ldots, N\} \times\{k\}|}{N} \tag{2}\] Here \(\mathbf{Q}\) is the set of all queried label points, \(\{k\}\) denotes the \(k\)-th column (the \(k\)-th label), and \(\{1,2, \ldots, N\}\) denotes all data rows, so \(\{1,2, \ldots, N\} \times\{k\}\) is the set of the \(N\) label points of the \(k\)-th label. The numerator is therefore the number of these \(N\) label points that have actually been queried. The less a label has been queried, the larger \(\psi_{k}\), the higher its sparsity, and the more readily the algorithm queries it.
  3. Uncertainty: please read Section 3 first to understand the basic network structure before returning to this indicator. The network's prediction is the result of weighing the two values of the dual-port output: one port close to 1 and the other close to 0, matching the 1/0 targets set during backpropagation, so a well-fitted label shows one large end and one small end. Conversely, a poorly fitted point (test data, or an unpenalized missing-label port) has dual-port values that differ little, making the prediction unreliable, so additional fitting is required. We therefore take the absolute difference between the two ports as correlated with the "uncertainty". The original paper gives the label uncertainty formula:\[\eta(i, k)=\begin{cases}0, & \text{if } y_{ik} \text{ is queried,} \\ 1-\left|g_{k}^{-}\left(f\left(\mathbf{x}_{i}\right)\right)-g_{k}^{+}\left(f\left(\mathbf{x}_{i}\right)\right)\right|, & \text{otherwise.}\end{cases}\tag{3}\]
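
Formulas (2) and (3) are both one-liners in code. A small NumPy sketch (the function and variable names are mine, not the project's):

```python
import numpy as np

def label_sparsity(queried, N, L):
    """Formula (2): psi_k = 1 - |Q ∩ ({1..N} x {k})| / N.

    `queried` is the set Q of (row, label) points already asked of the expert."""
    psi = np.ones(L)
    for _, k in queried:
        psi[k] -= 1.0 / N
    return psi

def label_uncertainty(g_neg, g_pos, queried_mask):
    """Formula (3): 0 where queried, else 1 - |g- - g+| of the dual ports."""
    eta = 1.0 - np.abs(g_neg - g_pos)
    eta[queried_mask] = 0.0
    return eta

# Toy example: N = 4 rows, L = 3 labels; label 0 was queried twice, label 1 once.
Q = {(0, 0), (2, 0), (1, 1)}
psi = label_sparsity(Q, N=4, L=3)          # psi values: 0.5, 0.75, 1.0

# Two label points: the first already queried, the second maximally uncertain.
eta = label_uncertainty(np.array([0.9, 0.5]), np.array([0.1, 0.5]),
                        np.array([True, False]))   # eta values: 0.0, 1.0
```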

3. The neural network in MASP (the learning model)

The SP in MASP describes the network it uses: serial-parallel neural networks. Here I use the original figure from the paper to describe it:

serial-parallel neural networks.

The so-called serial network is the leading serial part in the figure. It performs feature extraction: the features shared by all labels are analyzed serially here, which also realizes the handling of label correlation. MASP therefore does consider label correlation; the distinctive point of the paper is that this consideration is completed in the leading, shared part of learning rather than by analyzing the predicted results, which differs from the common label-correlation treatments.

The parallel network is the part before the parallel outputs in the figure. The number of parallel sub-networks equals \(L\); in other words, each parallel network implements a separate binary prediction for one label. The parallel networks in MASP use dual-port outputs with values in [0,1]. When the network predicts, the larger of the two ports is read as 1 and the smaller as 0: a final pairwise = <1, 0> is predicted as a negative label (-1), i.e. the label is absent, and a final pairwise = <0, 1> is predicted as a positive label (1). Another workable scheme is to turn the dual port into a probability of being positive via softmax and then pick a threshold to decide the label's sign.

During training, backpropagation works in the reverse direction: if the target label is negative (-1), a pairwise = <1, 0> is created, and after the forward pass the loss function fits the dual port to this pairwise; if the target label is positive (1), a pairwise = <0, 1> is created and fitted likewise. One extra emphasis: when the target label is missing (0), the algorithm generates pairwise = <\(\hat{y}_{0}\), \(\hat{y}_{1}\)>, i.e. the prediction itself, so the loss function imposes no penalty there. This is MASP's "no penalty for missing labels" characteristic.

Concretely, when the code emulates MASP, a missing label does not mean the label is missing from the dataset's \(Y\); it means the label has not been marked by an expert, i.e. it cannot be found in \(Y^{\prime}\). Although missing, it can still be predicted by a mature network, or be selected via uncertainty and then marked by an expert (in the code: queried), thereby becoming a non-missing label.
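
The "no penalty" trick can be written down directly: for a missing label the fitting target is set to the prediction itself, so its squared error is exactly zero. A NumPy sketch under the conventions above (+1 positive, -1 negative, 0 missing; `masp_style_targets` is my own illustrative name):

```python
import numpy as np

def masp_style_targets(y, y_hat):
    """Build the pairwise <t0, t1> fitting targets described above.

    y     : (N, L) labels with +1 positive, -1 negative, 0 missing (unqueried)
    y_hat : (N, L, 2) current dual-port outputs of the parallel networks
    A missing label copies the prediction itself, so MSE punishes nothing there."""
    t = np.empty_like(y_hat)
    t[..., 0] = np.where(y == -1, 1.0, np.where(y == 1, 0.0, y_hat[..., 0]))
    t[..., 1] = np.where(y == -1, 0.0, np.where(y == 1, 1.0, y_hat[..., 1]))
    return t

y = np.array([[1, -1, 0]])                        # one row, three labels
y_hat = np.array([[[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]])
t = masp_style_targets(y, y_hat)
loss = np.mean((y_hat - t) ** 2, axis=-1)         # per-label MSE
print(loss)   # the third (missing) label contributes exactly 0
```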

In addition, reading the source code reveals some network details: the gradient descent scheme used in backpropagation is Adam (Adaptive Moment Estimation); most layers use the Sigmoid activation, but ReLU is used between the serial and parallel parts; the final loss function is the MSE (mean squared error).
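
Putting the pieces together, a toy forward pass of the serial-parallel structure might look as follows. This is an illustrative sketch with made-up layer sizes and random untrained weights, not the project's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

def serial_parallel_forward(x, W_serial, heads):
    """Forward pass of the serial-parallel shape (sizes are illustrative;
    the real layer sizes come from MASP's per-dataset configuration).

    x        : (M,) feature vector
    W_serial : shared feature-extraction weights, applied with sigmoid
    heads    : list of L per-label weight matrices, each producing 2 ports"""
    h = sigmoid(W_serial @ x)                   # serial part: shared features
    h = relu(h)                                 # ReLU between serial and parallel
    return [sigmoid(Wk @ h) for Wk in heads]    # L dual-port outputs in [0,1]

M, H, L = 4, 8, 3
W = rng.normal(size=(H, M))
heads = [rng.normal(size=(2, H)) for _ in range(L)]
out = serial_parallel_forward(rng.normal(size=M), W, heads)
print(len(out), out[0].shape)   # one <y0, y1> pair per label
```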

4. Multi-label evaluation metrics

MASP adopts the F1 evaluation metric. Readers already familiar with F1 and AUC may jump directly to 4.3.3.

4.1 The bottleneck of traditional evaluation

A multi-class problem has only one label column, so whether a prediction is correct can be expressed as 1/0; extended over many data rows, the fraction of 1s is the familiar accuracy.

But in a multi-label problem a data row has several label columns, and a plain 1/0 notion of "the prediction is correct" becomes unreasonable. First, label correctness in multi-label problems is per-label Boolean: predicting a label as "absent" or "present" may each be reasonable. For a data row with \(L=4\) whose truth is {+1, -1, -1, -1}, I can hardly call the prediction {+1, -1, +1, -1} absolutely wrong, since three of the four labels are right; proportionally, this row is predicted with 75% accuracy. But even that is not meaningful, because real multi-label matrices exhibit severe label sparsity: the negative labels in a data row may easily account for more than 90% of all labels. Then predicting every label of a data row as negative already yields, say, 99% accuracy. Is that reasonable? Whatever picture I show you and ask which animals it contains, you always answer "there are no animals".
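
This pitfall is easy to reproduce numerically: on a sparse label matrix, a predictor that always answers "absent" still achieves high plain accuracy (the numbers below are simulated, not from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
# Sparse label matrix: roughly 5% positive labels, as in many real datasets.
Y = np.where(rng.random((100, 20)) < 0.05, 1, -1)

all_negative = np.full_like(Y, -1)        # "there are no animals", always
accuracy = np.mean(all_negative == Y)
print(accuracy)                            # high (~0.95) despite predicting nothing
```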

Therefore, we need to put forward some new evaluation metrics.

4.2 Confusion matrix and derived metrics

The confusion matrix is the key to evaluating overall prediction quality and the basis of the various evaluation metrics:

Confusion Matrix

The overlapping parts of the table give the four cases where prediction and reality meet:

  • TP: predicted 1, actually 1; correct prediction
  • FP: predicted 1, actually 0; wrong prediction
  • FN: predicted 0, actually 1; wrong prediction
  • TN: predicted 0, actually 0; correct prediction

These four quantities give birth to several corresponding metrics:

  • Precision: \(\frac{TP}{TP+FP}\), the proportion of truly positive samples among all samples predicted positive. Suppose I bought 5 watermelons at the market today, certain they were all sweet, but only 2 really were: the precision is \(\frac{2}{5}\). Note that the denominator of precision grows as the experiment grows, while the numerator need not. If on top of yesterday I buy 10 more melons today, again sure they must be sweet, and none of them is, the precision becomes \(\frac{2}{15}\). If I were lucky and every melon were sweet, the precision would stay at 1; but nobody is lucky forever, so sooner or later an unsweet melon arrives and the overall precision departs from 1.
  • Recall, also called TPR (true positive rate, sensitivity): \(\frac{TP}{TP+FN}\), the proportion of samples predicted positive among all truly positive samples. The denominator of recall is global: once the total data is fixed it never changes, and only the numerator moves. Suppose 10 criminals have fled into a small town among a population of 990 innocent residents, and the police interrogate door to door as if everyone were a suspect (impossible in reality). The base is 10; if the first 50 households are all the wrong people, the recall stays \(\frac{0}{10}\); each criminal found adds 1 to the numerator, until all 10 are found and the recall reaches 1. So recall starts at 0 and, as queries increase, always reaches 1, only at different times: with luck, the first 10 households searched are exactly where the 10 criminals hide; with bad luck, the last criminal is found only at the 1000th query. The former takes only 10 queries, the latter 1000.
  • FPR (false positive rate, i.e. 1 − specificity): \(\frac{FP}{FP+TN}\), the proportion of samples predicted positive among all truly negative samples. Contrast this with TPR: same town, same strategy. The negative base is 990; the first household is arrested as a suspect and interrogated, and turns out innocent, so the FPR moves from \(\frac{0}{990}\) to \(\frac{1}{990}\). As the investigation proceeds, each step either raises TPR to \(\frac{x+1}{10}\) or raises FPR to \(\frac{y+1}{990}\), and eventually both reach 1.
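
The three indicators are direct to compute from +1/-1 labels, e.g.:

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Precision, recall (TPR) and FPR from +1/-1 predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # TPR
    fpr = fp / (fp + tn)
    return precision, recall, fpr

y_true = np.array([ 1,  1, -1, -1, -1,  1])
y_pred = np.array([ 1, -1,  1, -1, -1,  1])
p, r, f = confusion_metrics(y_true, y_pred)
print(p, r, f)   # 2/3, 2/3, 1/3
```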

4.3 Related evaluation schemes and curves

Many evaluation metrics have been derived from the three indicators above; here are a few.

4.3.1 ROC curve and AUC value

First, the ROC curve, in full the "receiver operating characteristic" curve. Simply put, as the surveyed samples increase, TPR and FPR keep growing, and the curve takes them as ordinate and abscissa respectively. When looking at the image, do not read it as a function graph; read it as a map: a traveler stands at (0,0) and in each round can only go north or east, yet however he walks he will finally arrive at (1,1). This reading captures exactly the monotone behavior of TPR and FPR described above.

ROC curve (abscissa: FPR; ordinate: TPR)

So the ROC curve is a staircase line, though it becomes smoother as the samples increase. Meanwhile, the more "convex" the curve, the faster TPR grew: the traveler went north as much as possible at the start and only at the end, with no alternative, went east to reach (1,1). This is the outcome we hope for: in the criminal-catching illustration of TPR, if the criminals can be caught in the first few interrogations, the arrest can end early and the remaining unjust, false and wrong cases are avoided. To measure the degree of "convexity", we define the definite integral of the curve over [0,1] (its area) as the AUC value, an index of how well the samples are ranked.

ROC Curve and AUC value

If the first \(x\) samples drawn are all predicted positive and all predictions are correct, and these happen to be exactly all \(x\) truly positive samples, then over the first \(x\) draws the TPR goes straight from \(\frac{0}{x}\) to \(\frac{x}{x}\). This is the best ROC curve, and its AUC value is 1.
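
AUC can also be computed without drawing the curve, via the equivalent ranking interpretation: the probability that a random positive sample is scored above a random negative one. A small sketch:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative
    (equivalent to the area under the step-wise ROC curve described above)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()   # ties count half
    return wins / (len(pos) * len(neg))

y = np.array([1, 1, -1, -1])
perfect = np.array([0.9, 0.8, 0.2, 0.1])   # all positives ranked first
print(roc_auc(y, perfect))                 # best ranking: AUC = 1.0
print(roc_auc(y, np.array([0.1, 0.2, 0.8, 0.9])))   # worst ranking: AUC = 0.0
```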

4.3.2 PR curve

As introduced above, it is rather unlikely for Precision to start away from 1 (i.e. to start at \(\frac{0}{1}\)): in practice we rank the samples by their predicted probability of being positive, and the top-ranked sample is positive with probability close to 100%. Then, as errors occur during prediction, Precision gradually departs from 1, but since the numerator has accumulated, it will not fall to 0. Recall, on the other hand, climbs from 0 gradually toward 1, reaching it earlier when the samples are well ranked. Finally, with Recall as abscissa and Precision as ordinate, we obtain the PR curve.

To compare how well different models predict on the same dataset, we can draw a family of PR curves; in general, if one curve completely "encloses" another, we may consider that model's classification better than the contrasted model's. But the PR curve is, after all, a coarse description. Regarding Precision and Recall, what we use more is F1, which is also one of the measurement schemes chosen by MASP.

4.3.3 F1 curve

The F1 value is the harmonic mean of P and R:\[F_{1} = \frac{2PR}{P+R} \tag{4}\] where P stands for Precision and R for Recall.

In practice, the final predicted label values obtained through network learning or other means have not, at first, been converted to definite 1/-1; each is a dual-port pair <\(\hat{y}_{0}\), \(\hat{y}_{1}\)> (\(\hat{y}_{p} \in [0,1]\)) that corresponds uniquely, in logic, to its real label value. Both the dual-port pairs and the real label values are maps from the coordinates of the label matrix \(\{(i,j)\mid 0 \leq i < N, 0 \leq j < L\}\) to values. Now flatten both into one-dimensional arrays, keeping the correspondence, and sort the real label values by the softmax probability of the dual-port pair, i.e. rearrange the real labels by how likely the network believes each to be positive. This gives a rearranged array sortedArray (of length N*L, roughly [1,1,1,...,1,1,-1,-1,-1,...,-1,-1], though since predictions are flawed some -1 may be mixed into the run of 1 and some 1 into the run of -1). Then repeatedly take the first \(Q\%\) of sortedArray, predict all those label points positive, and compute F1 by formula (4).
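
The procedure just described is easy to vectorize. A sketch (the name `f1_curve` is mine, following the text rather than the project's code):

```python
import numpy as np

def f1_curve(y_true, scores):
    """F1 after declaring the top-q ranked label points positive, for every q.

    Mirrors the procedure above: flatten, sort the true labels by the predicted
    positive probability, then sweep the cut-off from 1 to N*L points."""
    order = np.argsort(-scores.ravel())
    sorted_true = y_true.ravel()[order]          # the sortedArray of the text
    total_pos = np.sum(sorted_true == 1)
    tp = np.cumsum(sorted_true == 1)             # positives among the top q
    k = np.arange(1, len(sorted_true) + 1)       # number predicted positive
    precision = tp / k
    recall = tp / total_pos
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

y = np.array([1, 1, -1, -1, -1])                 # x = 2 positives, y = 3 negatives
s = np.array([0.9, 0.8, 0.6, 0.4, 0.2])          # perfectly sorted predictions
f1 = f1_curve(y, s)
print(f1.max())   # 1.0 at the cut-off q = x, the peak of the "best" curve
print(f1[-1])     # 2x/(2x+y) = 4/7 at the end, matching the convergence point
```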

F1 Comparison curve

The best curve represents the ideal F1 curve. Suppose the current total amount of data is \(x+y\), of which \(x\) are truly positive samples and \(y\) truly negative. By the formula, in the best case the first \(x\) samples taken are indeed positive, so within the first \(x\) draws Precision stays at 1 (every positive prediction is right); from the \((x+1)\)-th data point on, Precision starts to fall from 1 (no positive samples remain, so every further positive prediction is wrong), until it finally drops to \(\frac{x}{x+y}\). Recall starts at \(\frac{0}{x}\) and increases with the sampling: \(\frac{1}{x}\), \(\frac{2}{x}\), ..., reaching \(\frac{x}{x} = 1\) after the \(x\)-th data point and never changing thereafter. So at the \(x\)-th data point, Precision is 1 for the last time while Recall is 1 for the first time, which gives the peak of the best curve in the figure: \(F_{1} = \frac{2 \times 1 \times 1}{1+1} = 1\). Over the interval [0, \(x\)], Precision stays at 1 while Recall steps from 0 to 1, so the curve in the figure increases there.

Actual data yields the blue "actual" curve: over [0, \(x\)] it still increases for a while, but some wrong predictions may make Precision leave 1 and start falling early, and wrong predictions also slow Recall's growth. So the blue line departs from the best curve in advance, reaches its peak early and falls; at the very end a few successful predictions raise Recall and lift F1 slightly. Readers can analyze the worst and random curves similarly; in short, worst is the result of ranking all the negative labels before the positive ones.

In the end, however troublesome the ranking, after testing the global data Recall must be 1 and Precision must be \(\frac{x}{x+y}\), so every curve finally converges to \((1.0, \frac{2x}{2x+y})\) (simple algebra from formula (4)).

5. The main framework of the program

5.1 Overview

The source code is fairly large and runs several rounds of tests; the scenarios cover supervised learning, semi-supervised learning, and active learning with random queries. As for training data, there are variants that complete training and testing with 5-fold cross validation on a single training set, and variants that use the dataset's given train/test files for training and testing respectively. For simplicity, here I mainly show one branch:

Code selection

In the concrete description I will omit some file-reading and file-writing code; the roles of some variables are introduced along the way, and some common variables are explained at the end of the article. Unlike my usual articles, this time I introduce the code top-down: the main body first, with brief explanations when a particular function comes up. Please check the source code if interested.

The data used in this article to describe the test is the Birds dataset:

Birds dataset (training): the attribute matrix consists of 322 data rows and 230 attributes.
Birds dataset (training): the label matrix consists of 322 data rows and 19 labels (note that the matrix in the figure has been transposed).

(The test matrix is not shown; it has 323 test rows, so the test and training sets are almost the same size.)

I use a diagram to describe the specific process:

Division of labor of each class in the process

Each color in this diagram represents a class; apart from the leftmost functions, you can take test_active_learning as the main function of the whole process. The vertical same-color flow boxes of each class represent the tasks completed in that class's constructor, and below the white box "Attached: implementations of various functions" are some key member functions of the class; a class without this white box has all of its methods already displayed. Clearly, only the ParallelAnn and Properties classes manage that, because they are the smallest. The other three classes are relatively large with many attached methods; it is neither possible nor necessary for this article to introduce them all, so I will only expound some key points. If you are interested, please check the source code.

5.2 The test_active_learning function

```python
def test_active_learning(para_dataset_name: str = 'Emotion'):
    """
    Used to produce the third set of experimental data, comparing with general multi-label active learning algorithms.
    :param para_dataset_name: Dataset name.
    """
    print(para_dataset_name)
    temp_start_time = time.time()
    prop = Properties(para_dataset_name)
    temp_train_data, temp_train_labels, temp_test_data, temp_test_labels = read_data(para_train_filename=prop.filename, param_cross_flag=False)
    prop.train_data_matrix = temp_train_data
    prop.test_data_matrix = temp_test_data
    prop.train_label_matrix = temp_train_labels
    prop.test_label_matrix = temp_test_labels
    prop.num_instances = prop.train_data_matrix.shape[0]
    prop.num_conditions = prop.train_data_matrix.shape[1]
    prop.num_labels = prop.train_label_matrix.shape[1]
    prop.full_connect_layer_num_nodes[0] = prop.num_conditions
    temp_masp = Masp(para_train_data_matrix=prop.train_data_matrix,
                     para_test_data_matrix=prop.test_data_matrix,
                     para_train_label_matrix=prop.train_label_matrix,
                     para_test_label_matrix=prop.test_label_matrix,
                     para_num_instances=prop.num_instances,
                     para_num_conditions=prop.num_conditions,
                     para_num_labels=prop.num_labels,
                     para_full_connect_layer_num_nodes=prop.full_connect_layer_num_nodes,
                     para_parallel_layer_num_nodes=prop.parallel_layer_num_nodes,
                     para_learning_rate=prop.learning_rate,
                     para_mobp=prop.mobp, para_activators=prop.activators)
    temp_init_end_time = time.time()
    temp_masp.two_stage_active_learn(para_instance_selection_proportion=prop.instance_selection_proportion,
                                     para_budget=prop.budget,
                                     para_cold_start_labels_proportion=prop.cold_start_labels_proportion,
                                     para_label_batch=prop.label_batch,
                                     para_instance_batch=prop.instance_batch,
                                     para_dc=prop.dc, para_pretrain_rounds=prop.pretrain_rounds,
                                     para_increment_rounds=prop.increment_rounds,
                                     para_enhancement_threshold=prop.enhancement_threshold)
    temp_acc, temp_f1 = temp_masp.my_test()
    temp_test_end_time = time.time()
    print('Init time: ', temp_init_end_time - temp_start_time)
    print('Cold start time: ', temp_masp.cold_start_end_time - temp_init_end_time)
    print('One round time: ', (temp_masp.multi_round_end_time - temp_masp.cold_start_end_time) / temp_masp.num_additional_queries)
    print('Bounded time: ', temp_masp.final_update_end_time - temp_masp.multi_round_end_time)
    print('Test time: ', temp_test_end_time - temp_masp.final_update_end_time)
```
  • (Line 8) Obtain the prop object through the Properties class. This class encapsulates the hyperparameters of the current run, including the neighborhood radius \(d_c\) used for the representativeness of the data, the momentum factor, the number of regular training rounds, the cold-start budget, the per-dataset network layer arrays, the dataset path, the activator string, the training batch sizes, and more. See the source code for details.
  • (Line 9~18) Read the data file, store it in variables, and record the necessary metadata. Saving the data into the prop object here is important in a large code base: it guarantees that several different classes share the same global data.
  • (Line 20) Determine the number of input neurons of the network from the attribute matrix \(X\); for the concrete variables see the config.json file and the Properties class.
  • (Line 22~32) Construct the Masp object temp_masp. This is the core of the code.
  • (Line 35~42) Run temp_masp's two_stage_active_learn method to carry out the cold start and active learning.
  • (Line 43) Compute Accuracy and F1.
  • (Line 45~49) Print the time analysis of the different stages; this is where Table 7 of the paper comes from.

5.3 Masp Constructors

class Masp:
    """
    Multi-label active learning through serial-parallel networks.
    The main algorithm.
    """

    def __init__(self, para_train_data_matrix, para_test_data_matrix,
                 para_train_label_matrix, para_test_label_matrix,  # the four matrices
                 para_num_instances: int = 0, para_num_conditions: int = 0, para_num_labels: int = 0,  # their sizes
                 para_full_connect_layer_num_nodes: list = None, para_parallel_layer_num_nodes: list = None,
                 para_learning_rate: float = 0.01, para_mobp: float = 0.6, para_activators: str = "s" * 100):
        # Step 1. Accept parameters.
        self.dataset = MultiLabelData(para_train_data_matrix=para_train_data_matrix,
                                      para_test_data_matrix=para_test_data_matrix,
                                      para_train_label_matrix=para_train_label_matrix,
                                      para_test_label_matrix=para_test_label_matrix,
                                      para_num_instances=para_num_instances,
                                      para_num_conditions=para_num_conditions,
                                      para_num_labels=para_num_labels)

        self.representativeness_array = np.zeros(self.dataset.num_instances)
        self.representativeness_rank_array = np.zeros(self.dataset.num_instances)

        self.output_file = None
        self.device = torch.device('cuda')
        self.network = MultiLabelAnn(self.dataset, para_full_connect_layer_num_nodes,
                                     para_parallel_layer_num_nodes, para_learning_rate,
                                     para_mobp, para_activators, self.device).to(self.device)

        self.cold_start_end_time = 0
        self.multi_round_end_time = 0
        self.final_update_end_time = 0
        self.num_additional_queries = 0

A Masp object is declared in the test_active_learning function; this class describes the main flow of the algorithm.

  • (Line 7~10) The constructor takes 11 parameters: the four [train/test] attribute and label matrices, followed by the number of data rows \(N\), the number of condition attributes \(M\), the number of labels \(L\), the serial layer array, the parallel layer array, the learning rate, the momentum factor mobp, and the activator string.
  • (Line 12~17) Construct the MultiLabelData object self.dataset. This object manages the key marking matrices of the whole learning process. Its variables must be distinguished from those of the Properties class: the prop object holds the hyperparameters, i.e. the manually controlled settings used for regulation, whereas self.dataset holds the label arrays, bucket arrays, and temporary arrays that the various stages of the algorithm may use, together with a set of operations on them. That is why Properties is only about 66 lines while MultiLabelData is close to 500 (including some comments).
  • (Line 19~20) Initialize the representativeness array and the sorted representativeness index array, which are needed to select data before the cold start.
  • (Line 24) Choose the CUDA engine; this is torch code that provides the GPU device object for the subsequent execution.
  • (Line 26~27) Construct the MultiLabelAnn object self.network, which contains the construction methods that realize network learning.
  • (Line 29~32) Initialize the timers. After the Masp object is constructed, the member functions that do the heavy computation are about to be called in succession, so the timers must be ready.

5.4 MultiLabelData Constructors

class MultiLabelData:
    """
    Multi-label data.
    This class handles the whole data.
    """

    def __init__(self, para_train_data_matrix, para_test_data_matrix,
                 para_train_label_matrix, para_test_label_matrix,
                 para_num_instances: int = 0, para_num_conditions: int = 0, para_num_labels: int = 0):
        """
        Construct the dataset.
        :param para_train_data_matrix: The training data matrix.
        :param para_test_data_matrix: The testing data matrix. The testing data are not employed for testing;
            they are stacked onto the training data to form the whole data.
        :param para_num_instances:
        :param para_num_conditions:
        :param para_num_labels:
        """
        # Step 1. Accept parameters.
        self.num_instances = para_num_instances
        self.num_conditions = para_num_conditions
        self.num_labels = para_num_labels

        self.data_matrix = para_train_data_matrix
        self.label_matrix = para_train_label_matrix
        self.test_data_matrix = para_test_data_matrix
        self.test_label_matrix = para_test_label_matrix

        # Convert -1 to 0: 1 denotes a positive label, 0 a negative one.
        self.test_label_matrix[self.test_label_matrix == -1] = 0
        self.label_matrix[self.label_matrix == -1] = 0

        # Compress the n*l test label matrix into a vector.
        self.test_label_matrix_to_vector = self.test_label_matrix.reshape(-1)

        # Extend the label matrix to the dual-port form (n*l -> n*2l).
        self.extended_label_matrix = np.zeros((self.num_instances, self.num_labels * 2))
        for i in range(self.num_instances):
            for j in range(self.num_labels):
                # Copy the label matrix.
                if self.label_matrix[i][j] == 0:
                    self.extended_label_matrix[i][j * 2] = 1
                    self.extended_label_matrix[i][j * 2 + 1] = 0
                else:
                    self.extended_label_matrix[i][j * 2] = 0
                    self.extended_label_matrix[i][j * 2 + 1] = 1

        # Step 2. Space allocation for other member variables.
        self.test_predicted_proba_label_matrix = np.zeros((self.num_instances, self.num_labels))
        self.predicted_label_matrix = np.zeros((self.num_instances, self.num_labels))
        self.test_predicted_label_matrix = np.zeros(self.test_label_matrix.size)
        self.label_query_matrix = np.zeros((self.num_instances, self.num_labels))
        self.has_label_queried_array = np.zeros(self.num_instances)
        self.label_query_count_array = np.zeros(self.num_labels)

        self.distance_measure = MyEnum.EUCLIDEAN
        self.device = torch.device('cuda')
        self.pdist = torch.nn.PairwiseDistance(p=2, eps=0, keepdim=True).to(self.device)

As just mentioned, the member variables declared in the MultiLabelData class are the label [arrays/matrices], bucket arrays, and temporary [arrays/matrices] that the various stages of the algorithm may use, plus the operations acting on these structures, in contrast with the hyperparameters established in the Properties class for regulation.

  • (Line 8~9) The constructor parameters are again the classic [train/test] attribute and label matrices, followed by the number of data rows \(N\), the number of condition attributes \(M\), and the number of labels \(L\).
  • (Line 20~27) Receive the function parameters.
  • (Line 30~31) Convert every -1 in the label matrices to 0. At the code level, 1 denotes a positive label and 0 a negative one; -1 is not given a meaning of its own, and later appearances of -1 are only local conventions. This point is important, otherwise some of the code is easy to misread.
  • (Line 33) Compress the test label matrix into one dimension (\(N \times L \to 1 \times NL\)); for the meaning, see the origin of the abscissa of the F1 curve in 4.3.3. Also keep in mind that any meaningful evaluation index is computed on the test set.
  • (Line 35~44) Create the dual-port version of the training label matrix (\(N \times L \to N \times 2L\)).
    Single-port to dual-port conversion
    This operation prepares for network learning: as shown further below, the network has \(M\) inputs and \(2L\) outputs, so the conventional \(N \times L\) label matrix does not really fit, and an \(N \times 2L\) label matrix must be constructed for the loss computation.
  • (Line 47) The probability prediction test matrix is a single-port matrix whose stored values lie in [0, 1]. It is obtained by passing the dual-port Output matrix of the test network through a softmax transform (the individual values of the raw Output matrix already lie in [0, 1], but softmax turns each port pair into probabilities), and it represents the probability that each label is positive.
  • (Line 49~50) Two [train/test] label prediction matrices, derived from the [train/test] dual-port Output of the network via \[\hat{y} = \begin{cases} 1, & \hat{y}_0 < \hat{y}_1 \\ 0, & \hat{y}_0 > \hat{y}_1 \end{cases}\] They represent the labels actually predicted after the data passes through the network.
  • (Line 52) The matrix indicating whether each label is missing (queried): 1 means the label is not missing (has been queried), 0 means it is missing (not queried). This is a key variable: \(Y^{\prime}\) is the image of \(Y\) under this mask, simulating the matrix entries that are known in reality.
  • (Line 54) Used together with the matrix of line 52, this array marks exactly the queried data rows; even if a row has been queried, some of its labels may still be in the missing (unqueried) state.
  • (Line 55) The label query bucket, which records how often each label has been queried. This array is only used during active learning: when selecting labels to query, we prefer labels with a smaller value in self.label_query_count_array to participate in learning. This is the sparsity value mentioned in the theory section; without doubt, the smaller self.label_query_count_array, the larger the sparsity, and the more likely the label is to be queried.
  • (Line 57~62) Complete some preprocessing for torch's built-in distance computation.
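The single-port to dual-port conversion above can also be written in vectorized form. This is a minimal illustrative sketch, not the library's own code; to_dual_port is a hypothetical helper name:

```python
import numpy as np

def to_dual_port(label_matrix: np.ndarray) -> np.ndarray:
    """Expand an N*L 0/1 label matrix into the N*2L dual-port form:
    label 0 -> ports (1, 0), label 1 -> ports (0, 1)."""
    n, l = label_matrix.shape
    extended = np.zeros((n, 2 * l))
    extended[:, 0::2] = 1 - label_matrix  # even columns: negative port
    extended[:, 1::2] = label_matrix      # odd columns: positive port
    return extended

# Two instances with two labels each.
y = np.array([[0, 1],
              [1, 0]])
print(to_dual_port(y))  # [[1. 0. 0. 1.], [0. 1. 1. 0.]]
```

The slicing with step 2 mirrors the `j * 2` / `j * 2 + 1` indexing of the original double loop.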

5.5 MultiLabelAnn And ParallelAnn Constructors

class MultiLabelAnn(nn.Module):
    """
    Multi-label ANN.
    This class handles the whole network.
    """

    def __init__(self, para_dataset: MultiLabelData = None, para_full_connect_layer_num_nodes: list = None,
                 para_parallel_layer_num_nodes: list = None, para_learning_rate: float = 0.01,
                 para_mobp: float = 0.6, para_activators: str = "s" * 100, para_device=None):
        """
        :param para_dataset:
        :param para_full_connect_layer_num_nodes:
        :param para_parallel_layer_num_nodes:
        :param para_learning_rate:
        :param para_mobp:
        :param para_activators:
        :param para_device:
        """
        super().__init__()
        self.dataset = para_dataset
        self.num_parts = self.dataset.num_labels
        self.num_layers = len(para_full_connect_layer_num_nodes) + len(para_parallel_layer_num_nodes)
        self.learning_rate = para_learning_rate
        self.mobp = para_mobp
        self.device = para_device

        # Build the serial (fully connected) part.
        temp_model = []
        for i in range(len(para_full_connect_layer_num_nodes) - 1):
            temp_input = para_full_connect_layer_num_nodes[i]
            temp_output = para_full_connect_layer_num_nodes[i + 1]
            temp_linear = nn.Linear(temp_input, temp_output)
            temp_model.append(temp_linear)
            temp_model.append(get_activator(para_activators[i]))
        self.full_connect_model = nn.Sequential(*temp_model)

        # Build one parallel branch per label.
        temp_parallel_activators = para_activators[len(para_full_connect_layer_num_nodes) - 1:]
        self.parallel_model = [ParallelAnn(para_parallel_layer_num_nodes, temp_parallel_activators).to(self.device)
                               for _ in range(self.dataset.num_labels)]

        self.my_optimizer = torch.optim.Adam(itertools.chain(self.full_connect_model.parameters(),
                                                             *[model.parameters() for model in self.parallel_model]),
                                             lr=para_learning_rate)
        self.my_loss_function = nn.MSELoss().to(para_device)


class ParallelAnn(nn.Module):
    """
    Parallel ANN.
    This class handles one parallel branch.
    """

    def __init__(self, para_parallel_layer_num_nodes: list = None, para_activators: str = "s" * 100):
        super().__init__()
        temp_model = []
        for i in range(len(para_parallel_layer_num_nodes) - 1):
            temp_input = para_parallel_layer_num_nodes[i]
            temp_output = para_parallel_layer_num_nodes[i + 1]
            temp_linear = nn.Linear(temp_input, temp_output)
            temp_model.append(temp_linear)
            temp_model.append(get_activator(para_activators[i]))
        self.model = nn.Sequential(*temp_model)

The MultiLabelAnn class is in charge of a series of operations such as building and running the network. Most of its content is torch code.

  • (Line 9~20) The constructor parameters are the MultiLabelData object dataset, the serial and parallel layer arrays, the learning rate, the momentum parameter, the activator string, and the execution device. As the parameter list shows, they are all hyperparameters needed to build the network; the various intricate marking matrices are all handled by dataset.
  • (Line 21~27) Receive the parameters.
  • (Line 29~37) Build the serial part of the neural network according to the layer widths given in para_full_connect_layer_num_nodes and the per-layer activation functions given in para_activators. For details see torch's nn.Sequential.
  • (Line 39~42) Use a Python comprehension to generate a list of \(L\) ParallelAnn objects, each storing one serial sub-network. This list is connected to the previously built main serial network during forward, finally forming the serial-parallel network.
  • (Line 44~46) Set the gradient descent method Adam (adaptive moment estimation) for every part of the network and generate the optimizer object. optim is torch's optimization package offering a variety of optimizers; Adam is one of them, and lr receives the gradient step size.
  • (Line 48) Generate the loss function object; the loss is MSE, and its computation runs on the GPU.
  • The ParallelAnn class is a stripped-down replica of MultiLabelAnn, mainly used to implement each serial sub-branch of the parallel part. Since the main learning and computation methods are implemented in MultiLabelAnn, ParallelAnn's functionality is heavily cut: it is only responsible for building its network and for the forward computation, while backpropagation is handled uniformly by MultiLabelAnn.

6. List of representative functions

6.1 Network learning : one_round_train And bounded_train function

The one_round_train function is located in the MultiLabelAnn class and implements one forward + backPropagation pass. Because torch is used here, it is very concise (wheels really are fragrant!). For what the wheel's kernel looks like, see my blog: Machine learning self-study notes based on Java (Days 71-73: BP neural networks)_LTA_ALBlack's blog - CSDN Blog

def one_round_train(self, para_input: np.ndarray = None, para_extended_label_matrix: np.ndarray = None,
                    para_label_query_matrix: np.ndarray = None) -> object:
    """
    One round of training. Use instances with queried labels.
    :return: The loss value.
    """
    temp_outputs = self(para_input)
    temp_memory_outputs = temp_outputs.cpu().data
    para_extended_label_matrix = torch.tensor(para_extended_label_matrix, dtype=torch.float)

    # For unqueried (missing) labels, copy the prediction into the target
    # so that their loss contribution is 0.
    i_index, j_index = np.where(para_label_query_matrix == 0)
    para_extended_label_matrix[i_index, j_index * 2] = temp_memory_outputs[i_index, j_index * 2]
    para_extended_label_matrix[i_index, j_index * 2 + 1] = temp_memory_outputs[i_index, j_index * 2 + 1]

    temp_loss = self.my_loss_function(temp_outputs, para_extended_label_matrix.to(self.device))
    self.my_optimizer.zero_grad()
    temp_loss.backward()
    self.my_optimizer.step()
    return temp_loss.item()

def forward(self, para_input: np.ndarray = None):
    temp_input = torch.as_tensor(para_input, dtype=torch.float).to(self.device)
    temp_inner_output = self.full_connect_model(temp_input)
    # Feed the serial output into each parallel branch.
    temp_inner_output = [model(temp_inner_output) for model in self.parallel_model]
    # Concatenate the dual-port outputs of all branches.
    temp_output = temp_inner_output[0]
    for i in range(len(temp_inner_output) - 1):
        temp_output = torch.cat((temp_output, temp_inner_output[i + 1]), -1)
    return temp_output

This function takes three parameters. The first is the attribute matrix to be trained, which provides the input neurons for forward. The second is the extended target label matrix, used in backPropagation to compute the loss function and quantify the penalty that updates the network weights. The last is the label-missing (query) matrix, introduced into training so that the missing labels get special treatment when the loss is computed: their loss is forced to 0, and they are never penalized.

  • (Line 7) This line is very Pythonic and very torch-style. Here self is the MultiLabelAnn object, which inherits from nn.Module; calling the object with a single argument triggers the encapsulated __call__ method, which nn.Module implements so that it dispatches to the object's forward function. So this is essentially equivalent to self.forward(para_input).
  • (Line 24) Entering the forward function: first convert the np.ndarray matrix into a torch tensor, to match the interface of the wheel.
  • (Line 25~26) An object built with nn.Sequential can also receive a parameter directly and trigger its own forward iteration; this is torch's design, and it simply works. The serial network finally outputs the neuron tensor temp_inner_output. Starting from these tensor neurons, each parallel network object in the parallel list runs in turn, and the outputs of all networks are collected into temp_inner_output. At this point temp_inner_output has become a list (Python's comprehension handles the conversion) in which each item is the dual-port Output matrix (a tensor) corresponding to one label.
  • (Line 27~30) Concatenate all the \(k \times 2\) tensors into a single \(k \times 2L\) tensor and return it.
  • (Line 10, 15) Convert the extended matrix in the formal parameter into a tensor and feed it into the loss function; the loss computation runs on the GPU.
  • (Line 11~13) For missing labels (labels not queried), set the target matrix value equal to the predicted value, so that the loss contribution is 0 and no penalty is applied in actual operation.
  • (Line 17~20) Zero the gradients, perform backPropagation, update the weights, and finally return the loss value.
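The zero-loss masking trick for unqueried labels can be demonstrated in isolation. This is a minimal numpy sketch that mimics MSE (the real code uses torch.nn.MSELoss; mse here is a stand-in):

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    # Stand-in for torch.nn.MSELoss: mean of the squared errors.
    return float(np.mean((pred - target) ** 2))

output = np.array([0.9, 0.1, 0.3, 0.7])  # dual-port prediction for labels 0 and 1
target = np.array([1.0, 0.0, 0.0, 1.0])  # dual-port ground truth

# Suppose label 1 (ports 2 and 3) was never queried: copy the prediction
# into the target, so its squared error becomes exactly zero.
masked = target.copy()
masked[2:4] = output[2:4]

print(mse(output, target))  # 0.05  - both labels penalized
print(mse(output, masked))  # 0.005 - only label 0 contributes
```

The masked labels still occupy positions in the matrix, but gradient flowing from them is zero, so the network is never pushed toward a value the expert has not confirmed.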
def bounded_train(self, para_lower_rounds: int = 200, para_checking_rounds: int = 200,
                  para_enhancement_threshold: float = 0.001):
    # Step 1. Obtain the queried part of the data.
    temp_input, temp_extended_label_matrix, temp_label_query_matrix = self.dataset.get_queried_instances()
    print("bounded_train")

    # Step 2. Train a number of rounds.
    for i in range(para_lower_rounds):
        if i % 100 == 0:
            print("round: ", i)
        self.network.one_round_train(temp_input, temp_extended_label_matrix, temp_label_query_matrix)

    # Step 3. Train more rounds until the accuracy no longer improves enough.
    i = para_lower_rounds
    temp_last_training_accuracy = 0
    while True:
        temp_loss = self.network.one_round_train(temp_input, temp_extended_label_matrix, temp_label_query_matrix)
        if i % para_checking_rounds == para_checking_rounds - 1:
            temp_training_accuracy, temp_testing_accuracy, temp_overall_accuracy = self.network.test()
            print("Regular round: ", (i + 1), ", training accuracy = ", temp_training_accuracy)
            if temp_last_training_accuracy > temp_training_accuracy - para_enhancement_threshold:
                break  # No more enhancement.
            else:
                temp_last_training_accuracy = temp_training_accuracy
            print("The loss is: ", temp_loss)
        i += 1

    temp_training_accuracy, temp_testing_accuracy, temp_overall_accuracy = self.network.test()
    print("Training accuracy (learner knows it) = ", temp_training_accuracy,
          ", testing accuracy (learner does not know it) = ", temp_testing_accuracy,
          ", overall accuracy (learner does not know it) = ", temp_overall_accuracy)

In one sentence, bounded_train performs loop training. The first parameter para_lower_rounds gives the number of fixed rounds; para_checking_rounds is the checkpoint width, so that a progress message is printed whenever a checkpoint is reached during the loop; the third parameter para_enhancement_threshold is the accuracy threshold of the current training. After the para_lower_rounds fixed rounds, training enters an accuracy-checked phase and stops as soon as the accuracy improvement after a check falls below this threshold.

  • (Line 4) This line uses the get_queried_instances method of the MultiLabelData class, which provides the training samples for one round of training. In the MASP code, training never calls the main data matrices directly; instead, the query marking matrix/array is consulted to take out the non-missing labels and determine which data rows are involved. These are the materials for the next round of network learning, and reading them out is exactly the task of get_queried_instances.
  • (Line 6~11) The fixed-round training part.
  • (Line 14~27) The accuracy-checked training part.

6.2 Cold start and active learning : two_stage_active_learn function

The cold start and active learning are realized in a single function body, roughly divided into four stages:

  1. Computing the cold-start batches and the active learning (additional query) batches
  2. Rearranging the dataset by representativeness
  3. Cold start
  4. Active (additional query) learning

· About parameters

def two_stage_active_learn(self, para_instance_selection_proportion: float = 1.0,
                           para_budget: float = 0.2,
                           para_cold_start_labels_proportion: float = 0.2,
                           para_label_batch: int = 2,
                           para_instance_batch: int = 2,
                           para_dc: float = 0.12, para_pretrain_rounds: int = 200,
                           para_increment_rounds: int = 100,
                           para_enhancement_threshold: float = 0.001):
    """
    Two stages: the cold start stage and the active learning stage.
    The total number of label queries is num_instances * num_labels * para_budget.
    :param para_instance_selection_proportion: Proportion of the training data actually used.
        The remaining data will never be queried, so it is not kept.
    :param para_budget: Total query proportion, i.e. the fraction of labels of the whole data
        (not only the training set) that may be queried.
    :param para_cold_start_labels_proportion: Fraction of the total queries spent in the cold start stage.
    :param para_label_batch: Number of labels queried per instance in the cold start stage.
    :param para_instance_batch: Number of instances queried per round in the second stage.
    :param para_dc: Radius used to compute the instance density; a ratio w.r.t. the scale.
    :param para_pretrain_rounds: Number of network pre-training rounds.
    :param para_increment_rounds: Number of fixed training rounds after each label query.
    :param para_enhancement_threshold: Training stops when the accuracy improvement is below this threshold.
    :return:
    """

The source code is commented; a few additional points follow.

Most parameters are hyperparameters. budget is a percentage that limits the total number of labels queried, to avoid excessive overhead. para_instance_selection_proportion is a percentage set individually for each dataset when it is defined, used to further limit the amount of data for some very large datasets. para_cold_start_labels_proportion is the ratio that separates cold start from active learning: if this ratio is \(p \%\), then \(p \%\) of the query budget is spent on the cold start and \(1-p \%\) on active learning, preventing the two phases from consuming the same data.

This part of the training uses batch training, so both the cold start and active learning have a batch step; this is the batch mentioned in the formal parameters.


6.2.1  Cold start batch and active learning batch calculation

# Step 1. Reset the dataset to clean the learning information.
print("two_stage_active_learn test 1, para_dc = ", para_dc)
self.dataset.reset()
print("two_stage_active_learn test 2")

# This code should be changed later to suit user requirements.
temp_num_selected = int(self.dataset.num_instances * para_instance_selection_proportion)
print("self.dataset.num_instances = ", self.dataset.num_instances,
      "para_instance_selection_proportion = ", para_instance_selection_proportion,
      "temp_num_selected: ", temp_num_selected)

temp_cold_start_instances = int(self.dataset.num_instances * self.dataset.num_labels
                                * para_budget * 5 / 4
                                * para_cold_start_labels_proportion
                                // para_label_batch)
print("Cold start instances = ", temp_cold_start_instances)

temp_num_additional_queries = int(self.dataset.num_instances * self.dataset.num_labels
                                  * para_budget * 5 / 4
                                  * (1 - para_cold_start_labels_proportion)
                                  / para_instance_batch)
print("temp_num_additional_queries = ", temp_num_additional_queries)
self.num_additional_queries = temp_num_additional_queries

The batch calculation may be easier to understand as a formula (following the definition of "\(p\)" in the parameter explanation):\[\operatorname{ColdstartInstances} = \frac{\frac{5}{4}NL\times \text{budget} \times p}{\text{LabelBatchsize}} \tag{5}\] Why are cold-start batches expressed as "Instances"? Because in the code the cold start is not performed batch by batch (it is not strictly batch training) but in one shot. The code treats one batch of cold-start queries as one row, with an equal number of labels per row: for example, if the numerator evaluates to 15 label queries and each batch (row) checks at most 3 labels, the cold start covers 5 rows. These 5 rows are not trained in 5 separate passes; they are merged into one matrix and trained together. In practice the dataset has been sorted by representativeness, so by default the Top-ColdstartInstances rows are used for the cold start.

In addition, when labels are selected during the cold start, those with fewer past queries are preferred; this follows the sparsity principle.

Cold-start label selection mechanism

\[\operatorname{AdditionalQueries} = \frac{\frac{5}{4}NL\times \text{budget} \times (1-p)}{\text{InstanceBatchsize}} \tag{6}\] The second stage is the formal active learning stage. This part is genuine batch training: the computed number of additional queries serves as the total number of rounds, and the loop runs that many times. Each round queries InstanceBatchsize labels; with these newly queried entries, a new \(X^{\prime}\), \(Y^{\prime}\) is put into training, the uncertainty is updated, and further queries follow.

In this formula, \(\frac{5}{4}NL\times \text{budget}\) is the query ceiling estimated from the size of the full dataset; in other words, it is the maximum total number of labels queried from the expert, the upper bound \(T\) of the MASP active learning process introduced in Section 2.
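Plugging illustrative numbers into formulas (5) and (6) may make the batch calculation concrete. The Emotion dataset has 593 instances and 6 labels, but the budget, proportion, and batch sizes below are assumptions, not the paper's actual configuration:

```python
# Illustrative numbers only; they mirror the two batch formulas in the code.
N, L = 593, 6           # Emotion: instances and labels
budget = 0.2
p = 0.5                 # cold-start proportion of the query budget
label_batch = 2         # labels queried per cold-start row
instance_batch = 3      # labels queried per active-learning round

total_queries = N * L * budget * 5 / 4  # the query ceiling 5/4 * N * L * budget
cold_start_instances = int(total_queries * p // label_batch)        # formula (5)
additional_queries = int(total_queries * (1 - p) / instance_batch)  # formula (6)
print(cold_start_instances, additional_queries)  # 222 148
```

So under these assumptions, 222 representative rows are queried at cold start, and 148 active learning rounds follow.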


6.2.2  Rearrange data sets in a representative way

self.compute_instance_representativeness(para_dc, 0)
temp_selected_array = self.representativeness_rank_array[0: temp_num_selected]
temp_dataset_select = self.dataset.select_data(temp_selected_array)

# Replace the training data with the selected part.
self.dataset = temp_dataset_select
self.network.dataset = temp_dataset_select

First, the data rows must be rearranged by their representativeness. For the concrete scheme, refer to the Master-tree Java scheme used in my earlier article (Sections 3.2~3.3); that version is a bit lengthy at 100+ lines, while Python needs barely 30, which could hardly be neater. The resulting representativeness rank array then guides the reordering of the dataset, yielding a sorted dataset (since MultiLabelAnn holds a copy of dataset, its dataset is rearranged as well).

def compute_instance_representativeness(self, para_dc: float = 0.1, para_scheme: int = 0):
    """
    Compute the representativeness of all instances.
    :param para_dc: The dc ratio.
    :return: An array of instance indices sorted by representativeness.
    """
    # Step 1. Calculate density using the Gaussian kernel.
    temp_dc_square = para_dc * para_dc
    # The condensed distance array has length n(n-1)/2.
    temp_dist2 = torch.nn.functional.pdist(torch.from_numpy(self.dataset.data_matrix)).to(self.device)
    # Convert it to an n*n matrix.
    temp_distances_matrix = scipy.spatial.distance.squareform(temp_dist2.cpu().numpy(), force='no', checks=True)
    # Gaussian density.
    temp_density_array = np.sum(np.exp(np.multiply(-temp_distances_matrix, temp_distances_matrix) / temp_dc_square),
                                axis=0)

    # Step 2. Calculate the distance to the master (the nearest denser instance).
    temp_distance_to_master_array = np.zeros(self.dataset.num_instances)
    for i in range(self.dataset.num_instances):
        temp_index = np.argwhere(temp_density_array > temp_density_array[i])
        if temp_index.size > 0:
            temp_distance_to_master_array[i] = np.min(temp_distances_matrix[i, temp_index])

    # Step 3. Representativeness = density * distance to master.
    self.representativeness_array = temp_density_array * temp_distance_to_master_array

    # Step 4. Sort instances by descending representativeness.
    self.representativeness_rank_array = np.argsort(self.representativeness_array)[::-1]
    return self.representativeness_rank_array

  In a word: compute the pairwise distances among the \(N\) data rows, obtain each point's density through the Gaussian kernel, then let each point find the nearest point denser than itself to obtain its independence, and finally multiply the two to get the representativeness. Pay attention to the matrix-level nature of the operations in this code, especially in line 15.
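The same density-peaks steps can be reproduced on toy data with pure numpy (the real method additionally uses torch and scipy); representativeness_rank is a hypothetical helper written for this illustration:

```python
import numpy as np

def representativeness_rank(data: np.ndarray, dc: float) -> np.ndarray:
    """Density-peaks ranking: Gaussian density times the distance to the
    nearest denser point (the "master"), sorted in descending order."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))          # full n*n distance matrix
    density = np.exp(-(dist ** 2) / (dc * dc)).sum(axis=0)
    master_dist = np.zeros(len(data))
    for i in range(len(data)):
        denser = np.argwhere(density > density[i])
        if denser.size > 0:
            master_dist[i] = np.min(dist[i, denser])
    # Note: the densest point has no master, so its representativeness is 0
    # here, mirroring the method above.
    return np.argsort(density * master_dist)[::-1]

# Two tight clusters and an outlier in 2-D.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0]])
print(representativeness_rank(points, dc=0.5))
```

Points that are both reasonably dense and far from any denser point (the second cluster, then the outlier) come first in the ranking.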


6.2.3  Cold start

# Step 2. Cold start stage.
# Can be revised to support a random label selection scheme. This is not the efficiency bottleneck.
for i in range(temp_cold_start_instances):
    self.dataset.query_label_batch(i, self.dataset.get_scare_labels(para_label_batch))
print("two_stage_active_learn test 3")

self.bounded_train(para_pretrain_rounds, 100, para_enhancement_threshold)
self.cold_start_end_time = time.time()

The cold start stage mainly completes the label queries through query_label_batch, creating a batch of newly non-missing labels, and then trains on this part of the data with the bounded loop training.

def get_scare_labels(self, para_length):
    temp_indices = np.argsort(self.label_query_count_array)
    result_array = np.zeros(para_length)
    for i in range(para_length):
        result_array[i] = temp_indices[i]
    return result_array

The get_scare_labels method returns the set of labels whose sparsity is in the top LabelBatchsize, i.e. the scarcest labels. The principle is simple: each selection sorts the per-label query counter label_query_count_array and takes the first LabelBatchsize indices, those with the lowest counts. This counter is initially declared in the MultiLabelData constructor.
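A toy illustration of this argsort-on-counters idea; the counter contents here are hypothetical, not taken from the MASP source.

```python
import numpy as np

# Hypothetical per-label query counts: label 1 and label 3 are scarcest.
label_query_count = np.array([5, 0, 3, 1, 7])
label_batch = 2

# Same mechanism as get_scare_labels: sort by count, take the first batch.
scarce = np.argsort(label_query_count)[:label_batch]
print(scarce.tolist())  # [1, 3]
```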

Once the current row \(i\) and the label columns within that row are determined, a batch of labels is uniquely identified. Marking those labels as queried (the "expert query") is performed by the dataset.query_label_batch method.
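A minimal sketch of what such an "expert query" amounts to: flipping entries of a 0/1 query mask so those (instance, label) pairs join the training set. The names mirror the source, but this standalone implementation is my own simplification.

```python
import numpy as np

num_instances, num_labels = 4, 5
# 1 means the label has been queried (non-missing), 0 means still missing.
label_query_matrix = np.zeros((num_instances, num_labels), dtype=int)

def query_label_batch(para_instance, para_label_indices):
    # Mark the given labels of one instance as queried.
    label_query_matrix[para_instance, para_label_indices] = 1

query_label_batch(0, np.array([1, 3]))
print(label_query_matrix[0].tolist())  # [0, 1, 0, 1, 0]
```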


6.2.4  Active learning

# Step 3. Active learning stage.
for i in range(temp_num_additional_queries):
    print("Incremental training process: ", i, " out of ", temp_num_additional_queries)
    temp_query_matrix = self.network.get_uncertain_label_batch(para_instance_batch)

    for j in range(para_instance_batch):
        self.dataset.query_label(int(temp_query_matrix[j][0]), int(temp_query_matrix[j][1]))

    temp_input, temp_extended_label_matrix, temp_label_query_matrix = self.dataset.get_queried_instances()

    for i in range(para_increment_rounds):
        self.network.one_round_train(temp_input, temp_extended_label_matrix, temp_label_query_matrix)
self.multi_round_end_time = time.time()
self.bounded_train(200, 100, para_enhancement_threshold)
self.final_update_end_time = time.time()

Active learning is the cycle train -> query the most uncertain labels to extend the dataset -> train -> ..., repeated until the total number of additional queries reaches its upper limit.

  • (Line 4) The get_uncertain_label_batch method returns an InstanceBatchsize * 2 matrix of label coordinates, i.e. the labels to be queried next. Its implementation: feed the whole training set through the network to obtain the dual-port Output predictions, compute the uncertainty of each dual port with the paper's formula (3), flatten and store the values in the array temp_certainty_array, argsort that array by value to get the index array temp_indices, take the first InstanceBatchsize entries of temp_indices, convert each flat index back to a 2-D label coordinate, mark it as non-missing, and store the coordinates in the InstanceBatchsize * 2 matrix that is returned.
  • (Line 6~7) Query each (instance, label) coordinate listed in the InstanceBatchsize * 2 matrix.
  • (Line 9) From the queried labels, the get_queried_instances method extracts the training subset \(X^{\prime}\), \(Y^{\prime}\).
  • (Line 11~12) Train para_increment_rounds times; the number of training rounds per active-learning batch, para_increment_rounds, is another user-defined hyperparameter.
  • (Line 14) After active learning completes, one more bounded training run refines the model before testing its accuracy.
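The selection pattern in (Line 4) — flatten, sort, take the top batch, convert flat indices back to coordinates — can be sketched as below. Note I substitute a simple |p − 0.5| certainty on a toy probability matrix for the paper's formula (3), so this only shows the indexing skeleton, not MASP's actual uncertainty measure.

```python
import numpy as np

# Hypothetical N x L positive-class probabilities from the network.
proba = np.array([[0.9, 0.4],
                  [0.1, 0.55],
                  [0.8, 0.2]])

# Certainty proxy: distance from 0.5. Small value = uncertain label.
certainty = np.abs(proba - 0.5).reshape(-1)

instance_batch = 2
# Most uncertain labels first.
flat_indices = np.argsort(certainty)[:instance_batch]

# Convert flat indices back to (instance, label) coordinates,
# giving an instance_batch * 2 query matrix.
query_matrix = np.column_stack(np.unravel_index(flat_indices, proba.shape))
print(query_matrix.tolist())  # [[1, 1], [0, 1]]
```

Here 0.55 and 0.4 are the two probabilities closest to 0.5, so their coordinates (1, 1) and (0, 1) are queried first.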

6.3 F1 The calculation of my_test And compute_f1

F1 itself was introduced in Section 4.3.3, and the compressed arrays of the MultiLabelData class (Section 5.4, lines 33 and 47) have already been explained; here is a brief overview of the implementation:

def my_test(self):
    temp_test_start_time = time.time()

    temp_input = torch.tensor(self.dataset.test_data_matrix[:], dtype=torch.float, device='cuda:0')
    temp_predictions = self(temp_input)
    temp_switch = temp_predictions[:, ::2] < temp_predictions[:, 1::2]
    self.dataset.test_predicted_label_matrix = temp_switch.int()

    self.dataset.test_predicted_proba_label_matrix = torch.exp(temp_predictions[:, 1::2]) / \
        (torch.exp(temp_predictions[:, 1::2]) + torch.exp(temp_predictions[:, ::2]))

    temp_test_end_time = time.time()

    temp_my_testing_accuracy = self.dataset.compute_my_testing_accuracy()
    temp_testing_f1 = self.dataset.compute_f1()
    print("Test takes time: ", (temp_test_end_time - temp_test_start_time))
    print("-------My test accuracy: ", temp_my_testing_accuracy, "-------")
    print("-------My test f1: ", temp_testing_f1, "-------")
    return temp_my_testing_accuracy, temp_testing_f1
  • (Line 4~7) Run the forward pass to obtain the temp_predictions matrix: a dual-port Output prediction-label matrix.
  • (Line 9~10) Use the softmax formula \(S_1=\frac{\exp(\hat{y}_1)}{\exp(\hat{y}_0)+\exp(\hat{y}_1)}\) to compute the share of the second port, \(S_1\), yielding an \(N\times L\) softmax prediction matrix of \(S_1\) values. Why the second port? In actual prediction, the larger of the two Output ports is predicted as 1 and the smaller as 0, and only pairwise = <0,1> is predicted as a positive label. Correspondingly, the larger the second port's value, the more likely the label is 1. In threshold terms, we assert that labels with \(S_1>0.5\) are predicted positive.
  • (Line 15) Compute F1; see the following code for details.
def compute_f1(self):
    """
    our F1
    """
    temp_proba_matrix_to_vector = self.test_predicted_proba_label_matrix.reshape(-1).cpu().detach().numpy()

    temp = np.argsort(-temp_proba_matrix_to_vector)

    all_label_sort = self.test_label_matrix_to_vector[temp]

    temp_y_F1 = np.zeros(temp.size)
    all_TP = np.sum(self.test_label_matrix_to_vector == 1)

    for i in range(temp.size):
        TP = np.sum(all_label_sort[0:i+1] == 1)
        P = TP / (i+1)
        R = TP / all_TP
        if (P + R) == 0:
            temp_y_F1[i] = 0
        else:
            temp_y_F1[i] = 2.0 * P * R / (P + R)
    return np.max(temp_y_F1)

Read this part together with Section 4.3.3.

  • (Line 5) Compress the softmax probability matrix into the array temp_proba_matrix_to_vector, which serves as the sort key.
  • (Line 7) Generate the descending-order index array over temp_proba_matrix_to_vector. A quick example: temp[0] = 77 means temp_proba_matrix_to_vector[77] is the largest probability value, i.e. the label at flat index 77 is the one most likely to be positive. (Since the width and height of the label matrix are known, 77 can be converted back to a 2-D coordinate.)
  • (Line 9) Reorder the ground-truth array self.test_label_matrix_to_vector by the indices in temp. Continuing the example, the 77th true label moves to the front. One can imagine the rearranged array all_label_sort looking like [1 1 1 1 1 ... 1 1 1 0 0 0 ... 0 0 0] (negative labels have been uniformly converted to 0 in the code), with perhaps a few 0s among the leading 1s and a few 1s among the trailing 0s; otherwise the F1 curve would approximate the best curve.
  • (Line 12) Before reading the for loop, note that F1 needs a fixed notion of "predicted positive": each prefix range taken in the loop is treated as the predicted-positive set. all_TP is the number of 1s in all_label_sort, i.e. the number of positive labels among all true labels; in confusion-matrix terms, all_TP = TP + FN.
  • (Line 14~15) Let \(\text{len} = N \times L\). The loop takes each \(i\) (\(0 \le i < \text{len}\)) in turn and computes F1 over the range \([0, i]\); \(\frac{i+1}{\text{len}}\) is the abscissa \(Q\) of Section 4.3.3. Line 15's TP counts the 1s within the range, i.e. the number of truly positive labels among those currently predicted positive; in confusion-matrix terms, TP = TP.
  • (Line 16) i + 1 is the width of the range [0, i], all of which we predict positive, so i + 1 = TP + FP and the variable P is the Precision.
  • (Line 17) all_TP equals TP + FN, so R is the Recall.
  • (Line 18~22) Compute F1 by formula (4) and keep the peak over the whole curve, the Peak-F1.
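The loop above can be checked on a tiny example with a standalone re-implementation (NumPy only, mirroring the logic of compute_f1 rather than reproducing the class method):

```python
import numpy as np

def peak_f1(proba, labels):
    order = np.argsort(-proba)          # sort scores descending
    sorted_labels = labels[order]
    total_pos = np.sum(labels == 1)     # TP + FN
    best = 0.0
    for i in range(labels.size):
        tp = np.sum(sorted_labels[:i + 1] == 1)
        p = tp / (i + 1)                # precision over the top-(i+1) cut
        r = tp / total_pos              # recall over the same cut
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

proba = np.array([0.9, 0.8, 0.3, 0.6])
labels = np.array([1, 1, 0, 1])
print(peak_f1(proba, labels))  # 1.0: some threshold separates 1s from 0s perfectly
```

In this example sorting yields [1 1 1 0], so the cut after the third label gives precision = recall = 1 and Peak-F1 = 1.0; with impurities in the sorted array the peak drops below 1.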

Why is Peak-F1 a usable measure?

When measuring Peak-F1 we sort the label vector by predicted probability, so that the most confident predictions come first and the leading true 1s best match the labels we predict positive. Naturally, if the leading run of 1s contains no 0 and the trailing run of 0s contains no 1, we approach the best F1 curve.

Such a best curve would mean the dual-port softmax probabilities admit a threshold \(\theta\): every label scoring above \(\theta\) is truly positive, every label below \(\theta\) is truly negative, and re-sorting the label array by predicted value yields a perfect run of 1s followed by a perfect run of 0s.

In other words, \(\theta\) would be a perfect watershed with no impurities on either side, which would mean the network predicts with 100% accuracy. In practice, no multi-label method can find a perfectly separating \(\theta\); there are always impurities on both sides, but a good algorithm finds a \(\theta\) that keeps them as few as possible.

Peak-F1 therefore essentially measures how close a multi-label algorithm can get to such a perfect \(\theta\).

Epilogue

The running results of the algorithm are not reproduced here; refer to the original paper, which compares MASP with a range of multi-label algorithms on supervised, semi-supervised, and active-learning Accuracy, F1, and running time, in ample detail.

(Figure: horizontal comparison of active-learning Accuracy; see the original paper.)

Corrections are welcome if any description in this article is inaccurate; the author is also still learning about multi-label methods!

