The father of Python has published a new article on replacing the existing parser


A few years ago, someone asked whether Python would ever switch to a PEG parser (or maybe it was a PEG grammar; I don't remember exactly who asked, or when). I looked into the topic a little, but had no clue, and gave up.

Recently I've learned a lot about PEG (Parsing Expression Grammars), and I now think it's an interesting alternative to the home-grown parser generator that I created 30 years ago when I started Python. (That parser generator, dubbed "pgen", was just about the first piece of code I wrote for Python.)

The reason I'm interested in PEG is that I've become a little irritated by pgen's limitations.

pgen uses a variant of LL(1) parsing that I wrote myself. I dislike grammar rules that can produce the empty string, so I disallowed them, which also simplified the algorithm for generating the parsing tables.

At the same time, I invented my own EBNF-like grammar notation (translator's note: EBNF, Extended Backus-Naur Form, is an extension of BNF, a formal notation for describing the syntax of a language), which I still like very much.

Here are some of the problems with pgen that bother me.

The "1" in the name LL(1) means that it uses only a single token of lookahead, and that limits our ability to write nice grammar rules. For example, a Python statement can be an expression or an assignment (or something else, but those all start with a distinguishing keyword such as if or def).

We would like to write the following grammar in pgen's notation. (Note that this example describes a toy language, a tiny subset of Python, as is traditional in language design.)

statement: assignment | expr | if_statement
expr: expr '+' term | expr '-' term | term
term: term '*' atom | term '/' atom | atom
atom: NAME | NUMBER | '(' expr ')'
assignment: target '=' expr
target: NAME
if_statement: 'if' expr ':' statement

A few words about this notation: NAME and NUMBER are tokens, predefined outside of the grammar. Quoted strings such as '+' or 'if' are also tokens. (I'll say more about tokens later.) Grammar rules begin with the rule's name, followed by a : sign, followed by one or more alternatives separated by | symbols.

The problem is that if you write the grammar this way, the parser doesn't work: pgen simply goes on strike.

One reason is that some rules (such as expr and term) are left-recursive, and pgen isn't smart enough to parse them. This usually has to be solved by rewriting the rules, for example (leaving the other rules unchanged):

expr: term ('+' term | '-' term)*
term: atom ('*' atom | '/' atom)*

This shows off part of pgen's EBNF power: you can nest alternatives inside parentheses, and you can create repetition by putting * after the parentheses. So the expr rule here means: a term, followed by zero or more groups, each of which is a plus sign and a term, or a minus sign and a term.

This grammar accepts the same language as the first version, but it no longer reflects the language designer's intent; in particular, it doesn't express that the operators bind to the left, and that matters when you try to generate code.
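
For illustration, here is a minimal hand-written recursive-descent sketch (the function names are made up; this is not the code pgen generates) of how the rewritten expr and term rules are typically parsed: a loop over the repeated group that folds the operands to the left, so the result stays left-associative even though the grammar no longer says so.

# A minimal hand-written sketch (not pgen's generated code) of the rewritten
# rule "expr: term ('+' term | '-' term)*": a loop that folds to the left,
# so "1 - 2 - 3" parses as (1 - 2) - 3 rather than 1 - (2 - 3).

def parse_expr(tokens, pos):
    """Parse an expr; returns (node, new_pos). Tokens are plain strings."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ('+', '-'):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)          # left-associative fold
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ('*', '/'):
        op = tokens[pos]
        right, pos = parse_atom(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_atom(tokens, pos):
    if tokens[pos] == '(':
        node, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ')'
        return node, pos + 1
    return tokens[pos], pos + 1           # NAME or NUMBER, kept as a string

print(parse_expr(['1', '-', '2', '-', '3'], 0)[0])
# -> ('-', ('-', '1', '2'), '3'): the left-bound shape we want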

But there's another annoying problem in this toy language (and in Python).

Because of the single token of lookahead, the parser cannot tell whether it is looking at the start of an expression or of an assignment. At the beginning of a statement, the parser must decide, based only on the first token it sees, which alternative of statement it is looking at. (Why? Because that's how pgen's automatically generated parser works.)

Suppose our program looks like this:

answer = 42

This program is tokenized into three tokens: NAME (with the value answer), '=' and NUMBER (with the value 42). At the start of the program, the only lookahead token we have is NAME. At this point, the rule we are trying to satisfy is statement (the start symbol of this grammar). This rule has three alternatives: expr, assignment and if_statement. We can rule out if_statement, because the lookahead token isn't 'if'.
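
If you want to see those tokens yourself, the standard library's tokenize module will show them (it reports '=' under the generic OP type; this is only to illustrate the token stream, not how pgen works internally):

import io
import tokenize

# Show the token stream for the example program (illustrative only).
for tok in tokenize.generate_tokens(io.StringIO("answer = 42").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'answer', OP '=', NUMBER '42', followed by NEWLINE and ENDMARKER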

But both expr and assignment can start with a NAME token, so the grammar is ambiguous, and pgen rejects it.

(That's not entirely right, because technically the grammar isn't ambiguous; but let's not worry about that, since I can't think of a better word. So how does pgen decide? It computes something called a FIRST set for every grammar rule, and if, at a given point, the FIRST sets of the available alternatives overlap, it complains and refuses the grammar.)
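
Roughly, the FIRST-set computation looks like this. The sketch below is a simplified illustration over the toy grammar (the dict and function names are made up, there are no empty-producing rules, and the trailing operators are ignored since they don't affect FIRST); it is not pgen's actual implementation.

# A simplified sketch of FIRST-set computation for the toy grammar; this is
# not pgen's real implementation, just an illustration of the idea.
GRAMMAR = {
    'statement':    [['assignment'], ['expr'], ['if_statement']],
    'expr':         [['term']],            # trailing operators don't affect FIRST
    'term':         [['atom']],
    'atom':         [['NAME'], ['NUMBER'], ["'('"]],
    'assignment':   [['target', "'='", 'expr']],
    'target':       [['NAME']],
    'if_statement': [["'if'", 'expr', "':'", 'statement']],
}

def first(symbol):
    if symbol not in GRAMMAR:              # a terminal: a token type or quoted literal
        return {symbol}
    result = set()
    for alternative in GRAMMAR[symbol]:
        # No rule can produce the empty string, so the first symbol of each
        # alternative is enough.
        result |= first(alternative[0])
    return result

for alternative in GRAMMAR['statement']:
    print(alternative[0], first(alternative[0]))
# assignment and expr both have NAME in their FIRST sets, so the sets overlap
# and an LL(1) generator like pgen complains.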

So, couldn't we give the parser a larger lookahead buffer to solve this problem?

For our toy language, a second token of lookahead would be enough, because in this grammar the second token of an assignment must be '='.

But in a more realistic language like Python, you may need an unbounded lookahead buffer, because what's to the left of the '=' token can be arbitrarily complex, for example:

table[index + 1].name.first = 'Steven'

There are already ten tokens before the '=' sign here, and if you challenge me, I can make up examples of arbitrary length. The way we solved this in pgen was to change the grammar and add an extra check, so that the parser accepts some illegal programs, but a later check raises a SyntaxError if the left-hand side of the assignment is not a valid target.
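
You can count those ten tokens with the standard library's tokenize module (this is just to illustrate the token stream, not how CPython handles the problem):

import io
import tokenize

# Count the tokens that appear before '=' in the example line (illustrative only).
source = "table[index + 1].name.first = 'Steven'"
before_equals = []
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.OP and tok.string == '=':
        break
    before_equals.append(tok.string)
print(len(before_equals), before_equals)
# 10 ['table', '[', 'index', '+', '1', ']', '.', 'name', '.', 'first']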

For our toy language, that boils down to the following:

statement: assignment_or_expr | if_statement
assignment_or_expr: expr ['=' expr]

(The square brackets indicate an optional part.) Then, in a later stage of compilation (for example, when generating bytecode), we check whether there is an '=', and if there is, we check whether the left-hand side conforms to the target grammar.
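
You can see the effect of that later check in today's CPython: an assignment with an invalid target gets past the grammar but is rejected afterwards with a SyntaxError (the exact wording of the message varies by version):

import ast

# A valid target parses fine; a call as the target is rejected by the
# post-parse check rather than by the grammar itself.
ast.parse("table[index + 1].name.first = 'Steven'")
try:
    ast.parse("f(x) = 1")
except SyntaxError as exc:
    print(exc.msg)    # e.g. "cannot assign to function call" (wording varies by version)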

Keyword arguments in function calls have a similar problem. We would like to write something like this (again, a simplified version of Python's call syntax):

call: atom '(' arguments ')'
arguments: arg (',' arg)*
arg: posarg | kwarg
posarg: expr
kwarg: NAME '=' expr

But a single token of lookahead can't tell the parser whether a NAME at the start of an argument is the start of a posarg (since an expr may start with a NAME) or the start of a kwarg.

Here too, Python's current parser solves the problem with a special rule:

arg: expr ['=' expr]

and sorts things out in a later compilation stage. (We even got this slightly wrong, allowing things like foo((a)=1) and giving them the same meaning as foo(a=1); that wasn't fixed until Python 3.8.)
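
On Python 3.8 or later you can see that fix: the parenthesized form is now rejected with a SyntaxError (the exact message depends on the version):

import ast

ast.parse("foo(a=1)")        # a normal keyword argument: fine
try:
    ast.parse("foo((a)=1)")  # rejected on Python 3.8+; accepted (like foo(a=1)) before
except SyntaxError as exc:
    print(exc.msg)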

So how does a PEG parser solve these problems?

By using an infinite lookahead buffer! The classic implementation of a PEG parser uses something called "packrat parsing" (translator's note: a packrat is a hoarding rodent), which not only loads the entire program into memory before parsing it, but also lets the parser backtrack arbitrarily.

Although the term PEG mainly refers to the grammar notation, parsers generated from a PEG grammar are recursive-descent parsers with unlimited backtracking, and packrat parsing makes them practical by memoizing, for each position, which rules have already been matched there.
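
The core trick can be sketched in a few lines of Python: memoize every (rule, position) pair, so that however much the parser backtracks, no rule is ever re-tried at the same position. The class and rule names below are made up for illustration; this is a toy sketch, not CPython's implementation.

# A toy illustration of packrat parsing, not CPython's implementation:
# memoize each (rule, position) pair so backtracking never repeats work.

def memoize(rule):
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(self, pos)
        return self.memo[key]
    return wrapper

class ToyParser:
    def __init__(self, tokens):
        self.tokens = tokens     # the whole tokenized program, held in memory
        self.memo = {}

    @memoize
    def statement(self, pos):
        # PEG alternatives are ordered: try assignment first, and if it fails,
        # backtrack and try a bare expression from the same position.
        return self.assignment(pos) or self.expr(pos)

    @memoize
    def assignment(self, pos):
        left = self.expr(pos)
        if left and self.tokens[left[1]:left[1] + 1] == ['=']:
            right = self.expr(left[1] + 1)
            if right:
                return ('assign', left[0], right[0]), right[1]
        return None              # failure: the caller backtracks to pos

    @memoize
    def expr(self, pos):
        if pos < len(self.tokens) and self.tokens[pos] != '=':
            return ('expr', self.tokens[pos]), pos + 1
        return None

print(ToyParser(['answer', '=', '42']).statement(0))
# (('assign', ('expr', 'answer'), ('expr', '42')), 3)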

That makes everything easy, but there is a cost: memory.

Thirty years ago I had good reasons to use single-token-lookahead parsing: memory was expensive. LL(1) parsing (and other technologies such as LALR(1), made famous by YACC) uses a state machine and a stack (a "pushdown automaton") to construct the parse tree efficiently.

Fortunately, the computers that run CPython have a lot more memory than they did 30 years ago, and keeping an entire file in memory is no longer a burden. For example, the largest non-test file I could find in the standard library is _pydecimal.py, which is about 223 kilobytes. In a gigabyte world, that's nothing.

That's what made me look at parsing techniques again.

However, there's another thing about CPython's current parser that bugs me.

Compilers are complicated, and CPython's is no exception: although the pgen-driven parser outputs a parse tree, that parse tree is not used directly as input to the code generator; it is first transformed into an abstract syntax tree (AST), and only then compiled into bytecode. (There are more details, but they don't concern me here.)

Why not compile straight from the parse tree? That is in fact how it originally worked, but about 15 years ago we found that the compiler was being complicated by the structure of the parse tree, so we introduced a separate AST and a stage that translates the parse tree into an AST. As Python evolved, the AST stayed more stable than the parse tree, which reduced the chance of bugs in the compiler.

The AST is also easier to work with for third-party code that wants to inspect Python code, and it is exposed through the popular ast module. The ast module also lets you build AST nodes from scratch, or modify existing AST nodes, and then compile the new nodes to bytecode.

The latter capability supports a whole cottage industry of tools that extend the Python language (translator's note: in other words, the ast module makes third-party extensions to Python much easier to build). (The parse tree is also exposed to Python users through the parser module, but it is so cumbersome to use that it has fallen out of favor compared with the ast module.)
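
Here is the ast module in action, parsing source into an AST and compiling the tree straight to bytecode (the exact output of ast.dump varies between Python versions):

import ast

# Parse source into an AST, inspect it, and compile the (possibly modified)
# tree directly to bytecode.
tree = ast.parse("answer = 42")
print(ast.dump(tree))        # Module(body=[Assign(targets=[Name(id='answer', ...)], ...)], ...)

code = compile(tree, "<example>", "exec")
namespace = {}
exec(code, namespace)
print(namespace["answer"])   # 42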

To sum up, my current idea is to see whether I can create a new parser for CPython that uses PEG and packrat parsing to build the AST directly during parsing, skipping the intermediate parse-tree structure and saving as much memory as possible, even though it uses an infinite lookahead buffer.
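
To make the idea a little more concrete, here is a sketch of the idea only (the class name and rule are invented, and this is not the actual prototype): a memoized PEG-style rule that returns ast nodes directly, with no intermediate parse tree.

import ast

class SketchParser:
    """A sketch of the idea only, not the real prototype: rules return ast nodes."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.memo = {}                       # packrat memo: (rule, pos) -> result

    def assignment(self, pos):
        key = ('assignment', pos)
        if key not in self.memo:
            node = None
            if self.tokens[pos + 1:pos + 2] == ['=']:
                target = ast.Name(id=self.tokens[pos], ctx=ast.Store())
                value = ast.Constant(value=int(self.tokens[pos + 2]))
                node = ast.Assign(targets=[target], value=value)
            self.memo[key] = node
        return self.memo[key]

module = ast.Module(body=[SketchParser(['answer', '=', '42']).assignment(0)],
                    type_ignores=[])
ast.fix_missing_locations(module)            # fill in line/column information
namespace = {}
exec(compile(module, "<sketch>", "exec"), namespace)
print(namespace["answer"])                   # 42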

I'm not there yet, but I already have a prototype that can compile a subset of Python to an AST at roughly the same speed as the current CPython parser. It uses more memory, though, so I expect that extending it to the full language will slow the PEG parser down.

But I haven't done any optimization yet, so there is still hope.

The last benefit of converting to PEG is that it provides more flexibility for the future evolution of the language.

People have said that pgen's LL(1) limitations helped keep Python's grammar simple. That may well be true, but we have plenty of other processes in place to prevent the language from growing out of control (mainly the PEP process, helped by very strict backward-compatibility requirements and the new governance structure), so I'm not worried about that.


