程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Introduction to protobuf protocol of Python crawler

編輯:Python

Preface

Do you encounter any of the following types while learning about reptiles . If you are interested in learning or understanding relevant knowledge , I don't think I'm too talented to learn , You can refer to it . Welcome your comments .

First, describe the pattern of the problem .

You may see the following garbled behavior in the request parameters :

Then you will find out content-type The data type is x-protobuf type , Then maybe you need to learn protobuf Agreement to continue your crawler .

Then let's talk about why this problem occurs ?

I don't know if that's right , For reference only , Can provide an idea . Let's start with a normal data content-type The data type is

Under the circumstances . Web page according to utf-8 Encoding decodes data . But if content-type The data type is x-protobuf when , He can't rely on protobuf Protocol to parse , So there will be garbled code .

Let's move on to our main topic .
First of all, this article will introduce you protobuf The definition and resolution of the protocol , So that you can have a deeper understanding of protobuf agreement , In the next section, we will introduce how to encounter protobuf
The practical operation of how the protocol is resolved .

One 、 What is? protobuf agreement ?

protobuf (protocol buffer) Is Google's internal mixed language data standard . By serializing structured data ( Serialization ), Used for communication protocol 、 Language independence in areas such as data storage 、 Platform independent 、 Extensible serialization structure data format .

  • serialize : Transform structural data or objects into a format that can be used for storage and transmission .
  • Deserialization : In other computing environments , Restore the serialized data to structure data and objects .

1.1 The relationship between serialization and deserialization

As shown in the figure, the programmer has written proto Procedure of documents , Then it is programmed into a package suitable for the programming language . This process can be done by downloading the link below .( This process will be described later )
https://github.com/protocolbuffers/protobuf/releases/
take proto File into the package you need . What you need to do is write proto The content of the document .
Then through the compiled package, the data and binary can be converted, which is called serialization and deserialization .

Two 、 To write proto file

2.1 Why write proto file

Some people may be curious , We just want to convert a garbled code into data we can understand. Why should we learn to write this file . Then you can look at the picture above first , If the parameters carried in your request data are garbled , If you want to create such garbled data, you need to learn how to pass proto Compiled package ( Because the description of this article is python Language then this is py file ), To convert data into binary files . To request data normally .

2.2 proto Document preparation process

First, write a simple proto file

Look at the code
syntax = "proto3";
message Panda {
int32 id = 1;
string name = 2;
}
The first line determines proto Protocol used , Now most of them use proto3 instead of proto2 Then define a message body to store some data fields you need , Each of these data fields has a type , A name and a value make up . This value is not in your data , Instead, look for the ID of the type that defines the data field . I have not tried to use non numeric situations . The specific writing format can be according to the following figure proto Write in the format of . You can define the data type you need according to your own situation .

Save the file name as xxx.proto file . So far, we have finished writing proto The process of , Next we need to finish the proto The file needs to be compiled into the package used by our program , download https://github.com/protocolbuffers/protobuf/releases/
Files in the web site , download windows edition

Set the environment variable

function cmd, find proto File office

Translate it into python file

So far we have finished proto The compilation of the file .
Next we start to serialize the data into binary
You need to install protobuf==3.20 google And the deserialization library blackboxprotobuf

So far you can complete proto Compilation of documents , Here's a slightly more complicated code , You can see the code according to the data type table above . The following code can be used to parse some proto The relationship between nested data in

Code display
syntax = "proto3";
message Message
{
int32 id_2 = 2;
int32 id_3 = 3;
Info id_5 = 5;
message Info
{
string id_1 = 1;
repeated int32 id_2 = 2 [packed=false];
int32 id_3 = 3;
Number id_5 = 5;
message Number
{
int32 id_2 = 2;
}
repeated int32 id_8 = 8 [packed=false];
int32 id_6 = 6;
int32 id_7 = 7;
int32 id_9 = 9;
int32 id_11 = 11;
}
string id_6 = 6;
}
Call result display ![image](https://img2022.cnblogs.com/blog/2636039/202206/2636039-20220618184512495-166331078.png) Okay , The above description explains to you how to write proto The process of , The following will show you how to parse binary proto The file of . Take a break first .

3、 ... and 、Protobuf Data format analysis

Reference blog https://www.cxymm.net/article/mine_song/76691817
First we need to understand Varints code , And then through Varints Coding understanding protobuf The decoding process of

3.1 Varints code

Varint It's a compact way to represent numbers . It uses one or more bytes to represent a number , Smaller numbers use fewer bytes . This reduces the number of bytes used to represent numbers .

Varint Every byte in ( Except for the last byte ) The most significant bit is set (msb), This bit indicates that more bytes will appear . Low per byte 7 Bits are used to 7 A binary complement representation of a number in the form of a bit set , First place in the least effective group .

If it doesn't work 1 Bytes , Then the most significant bit is set to 0 , As the following example ,1 One byte can represent , therefore msb by 0.

0000 0001

If you need more than one byte to represent ,msb It should be set to 1 . for example 300, If you use Varint What they say :

1010 1100 0000 0010

If it is calculated according to the normal binary system , This means 88068 (65536 + 16384 + 4096 + 2048 + 4).

But if you follow Varint The way of coding , First look at the first byte :10101100, The highest position is 1, The rest is 0101100,msb by 1, Indicates that there are still bytes left to be read , The second byte 00000010, The highest position is 0, The rest is 0000010,msb by 0, Indicates that there are no bytes left . Put two 7 Put binary numbers together , Is the target value

0000010 0101100 => 4 + 8 + 32 + 256 = 300
Here is the small end mode , Little Endian , Read it first , High position behind , Read it later . therefore 0000010 To be calculated later .

3.2 protobuf analysis

First, let's write a simple protobuf code , As shown in the figure

Then assign a value

So get the binary bit

08 01 12 04 74 65 73 74

Explain our decoding process , In the process of analyzing and decoding , We need to understand Wire Type, Each message item is preceded by a corresponding tag, To resolve the corresponding data type , Express tag The data type of is also Varint.
tag Calculation method of : (field_number << 3) | wire_type
Each data type has a corresponding wire_type:

Wire TypeMeaning Used For0Varint int32, int64, uint32, uint64, sint32, sint64, bool, enum164-bit fixed64, sfixed64, double2Length-delimited string, bytes, embedded messages, packed repeated fields3Start group groups (deprecated)4End group groups (deprecated)532-bit fixed32, sfixed32, float

therefore wire_type At most, it can only support 8 Kind of , There are 6 Kind of .
therefore 08 The corresponding binary is :

After the filling is
0 0001 000
Why do you write like this ?
First, the last three are wire_type The type of 0 , It means int32 Consistent with our definition above

id stay protobuf Is not shown in , Only the following identifiers are displayed 1
As shown in the following figure :

And then because it's int32 Type, so we directly take the value as 01, This is the value we assigned , namely

This starts parsing the string

Division 0 0010 010
That is, the identifier is 2 type Wire Type by Length-delimited string.
The following value is the length of the string 04 Then the next four data are string data
namely 74 65 73 74 ASCII Convert to test This completes the coding process .
You can try to learn the more difficult decoding process , Next section , We describe how to use... In reptiles .


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved