C++实现解释器(3):语法分析与普拉特解析¶
参照《Writing An Interpreter/Compiler In Go》,改用C++实现。
引言¶
本篇对应的源码位于目录: src/02/
02
|token
| |token.hpp
|CMakeLists.txt
|test
| |lexer_test.hpp
| |parser_test.hpp
| |ast_test.hpp
| |main.cpp
|lexer
| |lexer.hpp
|repl
| |repl.hpp
|parser
| |parser.hpp
| |parser_tracing.hpp
|ast
| |ast.hpp
|main
| |monkey.cpp
语法分析器
将输入的内容转换为对应的数据结构,具体就是将词法分析器得到的一系列词法Token转换为对应的数据结构,形成抽象语法树(AST)
。
使用yacc
、bison
、ANTLR
等语法分析器生成器工具,可以根据BNF或EBNF形式描述的语法构成,自动生成语法分析器。为了理解语法分析器的工作原理,不使用这些工具。
这里准备编写的语法分析器是递归下降语法分析器
,具体就是基于自上而下的运算符优先级分析法,其发明人是普拉特,又称普拉特语法分析器
。
AST类型¶
AST节点可以分成两类:
语句
:不产生值表达式
: 会产生值
// The base Node interface
struct Node
{
virtual ~Node() {}
virtual std::string TokenLiteral() { return ""; }
virtual std::string String() { return ""; }
virtual bool Good() { return true; }
virtual NodeType GetNodeType() { return ast::NodeType::Base; }
};
// All statement nodes implement this
struct Statement : Node
{
virtual ~Statement() {}
virtual void StatementNode() {}
virtual NodeType GetNodeType() { return ast::NodeType::Statement; }
};
// All expression nodes implement this
struct Expression : Node
{
virtual ~Expression() {}
virtual void ExpressionNode() {}
virtual NodeType GetNodeType() { return ast::NodeType::Expression; }
};
它们作为其它节点的父类,除了let
和return
语句,其余都是表达式处理。
AST根节点¶
struct Program
{
std::vector<std::unique_ptr<Statement>> v_pStatements;
std::string TokenLiteral()
{
if (v_pStatements.size() > 0)
{
return v_pStatements[0]->TokenLiteral();
}
else
{
return "";
}
}
std::string String()
{
std::stringstream oss;
for (auto &pStmt : v_pStatements)
{
oss << pStmt->String();
}
return oss.str();
}
NodeType GetNodeType() { return ast::NodeType::Program; }
};
整个程序就是由这个根节点包含的一系列语句展开的,后续求值的时候就沿着根节点自上而下。
解析let语句¶
语法:
定义:
struct Identifier : Expression
{
token::Token Token; // the token.IDENT token
std::string Value;
Identifier(token::Token tok, std::string literal) : Token(tok), Value(literal) {}
virtual ~Identifier() {}
virtual void ExpressionNode() {}
virtual std::string TokenLiteral() { return Token.Literal; }
virtual std::string String() { return Value; }
virtual NodeType GetNodeType() { return ast::NodeType::Identifier; }
};
struct LetStatement : Statement
{
token::Token Token; // the token.LET token
std::unique_ptr<Identifier> pName;
std::unique_ptr<Expression> pValue;
virtual ~LetStatement() {}
virtual void StatementNode() {}
virtual std::string TokenLiteral()
{
return Token.Literal;
}
virtual std::string String()
{
std::stringstream oss;
oss << TokenLiteral() + " "
<< pName->String()
<< " = ";
if (pValue->Good())
{
oss << pValue->String();
}
oss << ";";
return oss.str();
}
virtual NodeType GetNodeType() { return ast::NodeType::LetStatement; }
};
这里将值也看着表达式,是为了后续更好的支持求值表达式,保持统一。
初步构造AST¶
struct Parser
{
std::unique_ptr<lexer::Lexer> pLexer;
std::vector<std::string> errors;
token::Token curToken;
token::Token peekToken;
void nextToken()
{
curToken = peekToken;
peekToken = pLexer->NextToken();
}
bool curTokenIs(token::TokenType t)
{
return (curToken.Type == t);
}
bool expectPeek(token::TokenType t)
{
if (peekTokenIs(t))
{
nextToken();
return true;
}
else
{
peekError(t);
return false;
}
}
std::unique_ptr<ast::Program> ParseProgram()
{
std::unique_ptr<ast::Program> pProgram = std::make_unique<ast::Program>();
pProgram->v_pStatements.clear();
while (!curTokenIs(token::types::EndOF))
{
std::unique_ptr<ast::Statement> pStmt{parseStatement()};
if (pStmt)
{
pProgram->v_pStatements.push_back(std::move(pStmt));
}
nextToken();
}
return pProgram;
}
std::unique_ptr<ast::Statement> parseStatement()
{
if (curToken.Type == token::types::LET)
{
return parseLetStatement();
}
else if (curToken.Type == token::types::RETURN)
{
return parseReturnStatement();
}
else
{
return parseExpressionStatement();
}
}
};
std::unique_ptr<Parser> New(std::unique_ptr<lexer::Lexer> pLexer)
{
std::unique_ptr<Parser> pParser = std::make_unique<Parser>();
pParser->pLexer = std::move(pLexer);
pParser->errors.clear();
// Read two tokens, so curToken and peekToken are both set
pParser->nextToken();
pParser->nextToken();
return pParser;
}
在ParseProgram中不断的判断下一个词法单元Token类型,从而在parseStatement中决定应该按照哪个语句语法要求来构建。
解析Let语句:
std::unique_ptr<ast::LetStatement> parseLetStatement()
{
std::unique_ptr<ast::LetStatement> pStmt = std::make_unique<ast::LetStatement>();
pStmt->Token = curToken;
if (!expectPeek(token::types::IDENT))
{
return nullptr;
}
pStmt->pName = std::make_unique<ast::Identifier>(curToken, curToken.Literal);
if (!expectPeek(token::types::ASSIGN))
{
return nullptr;
}
nextToken();
pStmt->pValue = parseExpression(LOWEST);
if (peekTokenIs(token::types::SEMICOLON))
{
nextToken();
}
return pStmt;
}
通过expectPeek不断试探下一个词法单元是否合规,然后递归调用。
解析return语句¶
语法:
定义:
struct ReturnStatement : Statement
{
token::Token Token; // the 'return' token
std::unique_ptr<Expression> pReturnValue;
ReturnStatement(token::Token tok) : Token(tok), pReturnValue(nullptr) {}
virtual ~ReturnStatement() {}
virtual void StatementNode() {}
virtual std::string TokenLiteral() { return Token.Literal; }
virtual std::string String()
{
std::stringstream oss;
oss << TokenLiteral() + " ";
if (pReturnValue->Good())
{
oss << pReturnValue->String();
}
oss << ";";
return oss.str();
}
virtual NodeType GetNodeType() { return ast::NodeType::ReturnStatement; }
};
解析return语法:
std::unique_ptr<ast::ReturnStatement> parseReturnStatement()
{
std::unique_ptr<ast::ReturnStatement> pStmt = std::make_unique<ast::ReturnStatement>(curToken);
nextToken();
pStmt->pReturnValue = parseExpression(LOWEST);
if (peekTokenIs(token::types::SEMICOLON))
{
nextToken();
}
return pStmt;
}
解析表达式: 自上而下的运算符优先级分析(普拉特解析)¶
在这个语言中,除了let和return是语句外,其余的都是表达式,所以解析表达式是关键。
解析let和return语句是从左到右处理Token即可,但表达式还要考虑优先级等复杂情况。
表达式定义:
struct ExpressionStatement : Statement
{
token::Token Token; // the first token of the expression
std::unique_ptr<Expression> pExpression;
ExpressionStatement(token::Token tok) : Token(tok), pExpression(nullptr) {}
virtual ~ExpressionStatement() {}
virtual void StatementNode() {}
virtual std::string TokenLiteral() { return Token.Literal; }
virtual std::string String()
{
if (pExpression->Good())
{
return pExpression->String();
}
return "";
}
virtual NodeType GetNodeType() { return ast::NodeType::ExpressionStatement; }
};
普拉特解析法(自上而下解析法)与其它语法分析的主要区别在于,没有将解析函数与语法规则相关联,而是与单个词法单元相关联。这个方法的关键是,每个词法单元类型都具有两个与之关联的解析函数,具体取决于词法单元的位置,比如是中缀还是前缀。
struct Parser
{
using prefixParseFn = std::unique_ptr<ast::Expression> (Parser::*)();
using infixParseFn = std::unique_ptr<ast::Expression> (Parser::*)(std::unique_ptr<ast::Expression>);
std::map<token::TokenType, prefixParseFn> prefixParseFns;
std::map<token::TokenType, infixParseFn> infixParseFns;
void registerPrefix(token::TokenType tokenType, prefixParseFn fn)
{
prefixParseFns[tokenType] = fn;
}
void registerInfix(token::TokenType tokenType, infixParseFn fn)
{
infixParseFns[tokenType] = fn;
}
};
通过映射方式将词法单元和解析函数关联起来,每当遇到词法单元就调用对应的解析函数来生成AST节点。
定义优先级:
enum class Priority
{
LOWEST = 1,
EQUALS, // ==
LESSGREATER, // > or <
SUM, // +
PRODUCT, // *
PREFIX, // -X or !X
CALL // myFunction(X)
};
std::map<token::TokenType, Priority> precedences{{token::types::EQ, Priority::EQUALS},
{token::types::NOT_EQ, Priority::EQUALS},
{token::types::LT, Priority::LESSGREATER},
{token::types::GT, Priority::LESSGREATER},
{token::types::PLUS, Priority::SUM},
{token::types::MINUS, Priority::SUM},
{token::types::SLASH, Priority::PRODUCT},
{token::types::ASTERISK, Priority::PRODUCT},
{token::types::LPAREN, Priority::CALL}};
在创建语法分析器时注册所有词法单元对应的解析函数映射关系:
pParser->prefixParseFns.clear();
pParser->registerPrefix(token::types::IDENT, &Parser::parseIdentifier);
pParser->registerPrefix(token::types::INT, &Parser::parseIntegerLiteral);
pParser->registerPrefix(token::types::BANG, &Parser::parsePrefixExpression);
pParser->registerPrefix(token::types::MINUS, &Parser::parsePrefixExpression);
pParser->registerPrefix(token::types::TRUE, &Parser::parseBoolean);
pParser->registerPrefix(token::types::FALSE, &Parser::parseBoolean);
pParser->registerPrefix(token::types::LPAREN, &Parser::parseGroupedExpression);
pParser->registerPrefix(token::types::IF, &Parser::parseIfExpression);
pParser->registerPrefix(token::types::FUNCTION, &Parser::parseFunctionLiteral);
pParser->infixParseFns.clear();
pParser->registerInfix(token::types::PLUS, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::MINUS, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::SLASH, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::ASTERISK, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::EQ, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::NOT_EQ, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::LT, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::GT, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::LPAREN, &Parser::parseCallExpression);
解析整数对应函数:
std::unique_ptr<ast::Expression> parseIntegerLiteral()
{
std::unique_ptr<ast::IntegerLiteral> pLit = std::make_unique<ast::IntegerLiteral>(curToken);
try
{
long long int value = std::stoll(curToken.Literal);
pLit->Value = value;
}
catch (...)
{
std::string msg = "could not parse " + curToken.Literal + " as integer";
errors.push_back(msg);
return nullptr;
}
return pLit;
}
普拉特解析的工作方式¶
std::unique_ptr<ast::ExpressionStatement> parseExpressionStatement()
{
std::unique_ptr<ast::ExpressionStatement> pStmt = std::make_unique<ast::ExpressionStatement>(curToken);
pStmt->pExpression = parseExpression(Priority::LOWEST);
if (peekTokenIs(token::types::SEMICOLON))
{
nextToken();
}
return pStmt;
}
std::unique_ptr<ast::Expression> parseExpression(Priority precedence)
{
prefixParseFn prefix = prefixParseFns[curToken.Type];
if (prefix == nullptr)
{
noPrefixParseFnError(curToken.Type);
return nullptr;
}
std::unique_ptr<ast::Expression> leftExp = (this->*(prefix))();
while (!peekTokenIs(token::types::SEMICOLON) && precedence < peekPrecedence())
{
infixParseFn infix = infixParseFns[peekToken.Type];
if (!infix)
{
return leftExp;
}
nextToken();
leftExp = (this->*(infix))(std::move(leftExp));
}
return leftExp;
}
这里是关键部分,前面已经做好了两个工作:
- 定义了每个词法单元Token对应的操作函数
- 定义了每个词法单元对应的优先级
那么,接下来就是交给递归,根据优先级递归调用就可以了。
shared_ptr从父类转换为子类¶
std::shared_ptr<ast::Expression> exp;
std::shared_ptr<ast::Identifier> stmt = std::synamic_pointer_cast<ast::Identifier>(exp);
unique_ptr从父类转换为子类¶
std::unique_ptr<ast::Expression> exp;
auto x = reinterpret_cast<ast::Identifier *>(exp.release());
std::unique_ptr<ast::Identifier> stmt(x);
需要先释放父类智能指针并转换为子类普通指针,然后再转换为子类智能指针。
测试用例¶
$ ./test_monkey
[==========] Running 16 tests from 16 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TestNextToken
[ RUN ] TestNextToken.BasicAssertions
[ OK ] TestNextToken.BasicAssertions (0 ms)
[----------] 1 test from TestNextToken (0 ms total)
[----------] 1 test from TestString
[ RUN ] TestString.BasicAssertions
[ OK ] TestString.BasicAssertions (0 ms)
[----------] 1 test from TestString (0 ms total)
[----------] 1 test from TestLetStatements
[ RUN ] TestLetStatements.BasicAssertions
[ OK ] TestLetStatements.BasicAssertions (0 ms)
[----------] 1 test from TestLetStatements (0 ms total)
[----------] 1 test from TestReturnStatements
[ RUN ] TestReturnStatements.BasicAssertions
[ OK ] TestReturnStatements.BasicAssertions (0 ms)
[----------] 1 test from TestReturnStatements (0 ms total)
[----------] 1 test from TestIdentifierExpression
[ RUN ] TestIdentifierExpression.BasicAssertions
[ OK ] TestIdentifierExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIdentifierExpression (0 ms total)
[----------] 1 test from TestIntegerLiteralExpression
[ RUN ] TestIntegerLiteralExpression.BasicAssertions
[ OK ] TestIntegerLiteralExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIntegerLiteralExpression (0 ms total)
[----------] 1 test from TestParsingPrefixExpressions
[ RUN ] TestParsingPrefixExpressions.BasicAssertions
[ OK ] TestParsingPrefixExpressions.BasicAssertions (0 ms)
[----------] 1 test from TestParsingPrefixExpressions (0 ms total)
[----------] 1 test from TestParsingInfixExpressions
[ RUN ] TestParsingInfixExpressions.BasicAssertions
[ OK ] TestParsingInfixExpressions.BasicAssertions (0 ms)
[----------] 1 test from TestParsingInfixExpressions (0 ms total)
[----------] 1 test from TestOperatorPrecedenceParsing
[ RUN ] TestOperatorPrecedenceParsing.BasicAssertions
[ OK ] TestOperatorPrecedenceParsing.BasicAssertions (0 ms)
[----------] 1 test from TestOperatorPrecedenceParsing (0 ms total)
[----------] 1 test from TestBooleanExpression
[ RUN ] TestBooleanExpression.BasicAssertions
[ OK ] TestBooleanExpression.BasicAssertions (0 ms)
[----------] 1 test from TestBooleanExpression (0 ms total)
[----------] 1 test from TestIfExpression
[ RUN ] TestIfExpression.BasicAssertions
[ OK ] TestIfExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIfExpression (0 ms total)
[----------] 1 test from TestIfElseExpression
[ RUN ] TestIfElseExpression.BasicAssertions
[ OK ] TestIfElseExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIfElseExpression (0 ms total)
[----------] 1 test from TestFunctionLiteralParsing
[ RUN ] TestFunctionLiteralParsing.BasicAssertions
[ OK ] TestFunctionLiteralParsing.BasicAssertions (0 ms)
[----------] 1 test from TestFunctionLiteralParsing (0 ms total)
[----------] 1 test from TestFunctionParameterParsing
[ RUN ] TestFunctionParameterParsing.BasicAssertions
[ OK ] TestFunctionParameterParsing.BasicAssertions (0 ms)
[----------] 1 test from TestFunctionParameterParsing (0 ms total)
[----------] 1 test from TestCallExpressionParsing
[ RUN ] TestCallExpressionParsing.BasicAssertions
[ OK ] TestCallExpressionParsing.BasicAssertions (0 ms)
[----------] 1 test from TestCallExpressionParsing (0 ms total)
[----------] 1 test from TestCallExpressionParameterParsing
[ RUN ] TestCallExpressionParameterParsing.BasicAssertions
[ OK ] TestCallExpressionParameterParsing.BasicAssertions (0 ms)
[----------] 1 test from TestCallExpressionParameterParsing (0 ms total)
[----------] Global test environment tear-down
[==========] 16 tests from 16 test suites ran. (7 ms total)
[ PASSED ] 16 tests.
本来写好测试用例后Clang
编译器已经测试通过,后来换成GCC
编译器竟然出现问题,一番查看发现是隐式类型转换引起的。
REPL¶
$ ./monkey
Hello lesliezhu! This is the Monkey-CPP programming language!
Feel free to type in commands
>> let x = 1 * 2 * 3 * 4 * 5
let x = ((((1 * 2) * 3) * 4) * 5);
>> x * y / 2 + 3 * 8 - 123
((((x * y) / 2) + (3 * 8)) - 123)
>> let x 12 * 3
.--. .-" "-. .--.
/ .. \/ .-. .-. \/ .. \
| | '| / Y \ |' | |
| \ \ \ 0 | 0 / / / |
\ '- ,\.-"""""""-./, -' /
''-' /_ ^ ^ _\ '-''
| \._ _./ |
\ \ '~' / /
'._ '-=-' _.'
'-----'
Woops! We ran into some monkey business here!
parser errors:
expected next token to be =, got INT instead
>>
总结¶
因为这次要求代码里面不出现new
和delete
,全部使用智能指针来实现,导致遇到两个不大不小的问题:
- 智能指针必须和移动语义一起使用,导致到处都是
std::move
- 从父类智能指针(
unique_ptr
)转化为子类智能指针没有很直接的方法,导致代码冗余,测试用例代码很明显 - 将在下一步
求值
的时候将unique_ptr
换成shared_ptr
- 微信搜索: 「 MinYiLife 」, 关注公众号!
- 本文链接: https://www.lesliezhu.com/blog/2022/11/23/writing_an_interpreter_in_cpp_3/
- 版权声明: 原创文章,如需转载请注明文章作者和出处。谢谢!