参照《Writing An Interpreter/Compiler In Go》,改用C++实现。

项目源码: https://github.com/LeslieZhu/monkey-cpp

引言

本篇对应的源码位于目录: src/02/

02
 |token
 | |token.hpp
 |CMakeLists.txt
 |test
 | |lexer_test.hpp
 | |parser_test.hpp
 | |ast_test.hpp
 | |main.cpp
 |lexer
 | |lexer.hpp
 |repl
 | |repl.hpp
 |parser
 | |parser.hpp
 | |parser_tracing.hpp
 |ast
 | |ast.hpp
 |main
 | |monkey.cpp

语法分析器将输入的内容转换为对应的数据结构,具体就是将词法分析器得到的一系列词法Token转换为对应的数据结构,形成抽象语法树(AST)

使用yaccbisonANTLR等语法分析器生成器工具,可以根据BNF或EBNF形式描述的语法构成,自动生成语法分析器。为了理解语法分析器的工作原理,不使用这些工具。

这里准备编写的语法分析器是递归下降语法分析器,具体就是基于自上而下的运算符优先级分析法,其发明人是普拉特,又称普拉特语法分析器

AST类型

AST节点可以分成两类:

  • 语句:不产生值
  • 表达式: 会产生值
// The base Node interface
struct Node
{
   virtual ~Node() {}
   virtual std::string TokenLiteral() { return ""; }
   virtual std::string String() { return ""; }
   virtual bool Good() { return true; }        
   virtual NodeType GetNodeType() { return ast::NodeType::Base; }
};

// All statement nodes implement this
struct Statement : Node
{
   virtual ~Statement() {}
   virtual void StatementNode() {}
   virtual NodeType GetNodeType() { return ast::NodeType::Statement; }
};

// All expression nodes implement this
struct Expression : Node
{
   virtual ~Expression() {}
   virtual void ExpressionNode() {}
   virtual NodeType GetNodeType() { return ast::NodeType::Expression; }
};

它们作为其它节点的父类,除了letreturn语句,其余都是表达式处理。

AST根节点

struct Program
{
    std::vector<std::unique_ptr<Statement>> v_pStatements;

    std::string TokenLiteral()
    {
        if (v_pStatements.size() > 0)
        {
            return v_pStatements[0]->TokenLiteral();
        }
        else
        {
            return "";
        }
    }

    std::string String()
    {
        std::stringstream oss;
        for (auto &pStmt : v_pStatements)
        {
            oss << pStmt->String();
        }

        return oss.str();
    }

    NodeType GetNodeType() { return ast::NodeType::Program; }
};

整个程序就是由这个根节点包含的一系列语句展开的,后续求值的时候就沿着根节点自上而下。

解析let语句

语法:

let <标志符> = <表达式>;

定义:

struct Identifier : Expression
{
    token::Token Token; // the token.IDENT token
    std::string Value;

    Identifier(token::Token tok, std::string literal) : Token(tok), Value(literal) {}

    virtual ~Identifier() {}
    virtual void ExpressionNode() {}
    virtual std::string TokenLiteral() { return Token.Literal; }
    virtual std::string String() { return Value; }
    virtual NodeType GetNodeType() { return ast::NodeType::Identifier; }
};

struct LetStatement : Statement
{
    token::Token Token; // the token.LET token
    std::unique_ptr<Identifier> pName;
    std::unique_ptr<Expression> pValue;

    virtual ~LetStatement() {}
    virtual void StatementNode() {}

    virtual std::string TokenLiteral()
    {
        return Token.Literal;
    }

    virtual std::string String()
    {
        std::stringstream oss;
        oss << TokenLiteral() + " "
            << pName->String()
            << " = ";

        if (pValue->Good())
        {
            oss << pValue->String();
        }

        oss << ";";

        return oss.str();
    }
    virtual NodeType GetNodeType() { return ast::NodeType::LetStatement; }
};

这里将值也看着表达式,是为了后续更好的支持求值表达式,保持统一。

初步构造AST

struct Parser
{
    std::unique_ptr<lexer::Lexer> pLexer;
    std::vector<std::string> errors;

    token::Token curToken;
    token::Token peekToken;

    void nextToken()
    {
        curToken = peekToken;
        peekToken = pLexer->NextToken();
    }

    bool curTokenIs(token::TokenType t)
    {
        return (curToken.Type == t);
    }

    bool expectPeek(token::TokenType t)
    {
        if (peekTokenIs(t))
        {
            nextToken();
            return true;
        }
        else
        {
            peekError(t);
            return false;
        }
    }

    std::unique_ptr<ast::Program> ParseProgram()
    {
        std::unique_ptr<ast::Program> pProgram = std::make_unique<ast::Program>();
        pProgram->v_pStatements.clear();

        while (!curTokenIs(token::types::EndOF))
        {
            std::unique_ptr<ast::Statement> pStmt{parseStatement()};

            if (pStmt)
            {
                pProgram->v_pStatements.push_back(std::move(pStmt));
            }
            nextToken();
        }
        return pProgram;
    }

    std::unique_ptr<ast::Statement> parseStatement()
    {
        if (curToken.Type == token::types::LET)
        {
            return parseLetStatement();
        }
        else if (curToken.Type == token::types::RETURN)
        {
            return parseReturnStatement();
        }
        else
        {
            return parseExpressionStatement();
        }
    }
};

std::unique_ptr<Parser> New(std::unique_ptr<lexer::Lexer> pLexer)
{
    std::unique_ptr<Parser> pParser = std::make_unique<Parser>();
    pParser->pLexer = std::move(pLexer);
    pParser->errors.clear();

    // Read two tokens, so curToken and peekToken are both set
    pParser->nextToken();
    pParser->nextToken();

    return pParser;
}

在ParseProgram中不断的判断下一个词法单元Token类型,从而在parseStatement中决定应该按照哪个语句语法要求来构建。

解析Let语句:

std::unique_ptr<ast::LetStatement> parseLetStatement()
{
    std::unique_ptr<ast::LetStatement> pStmt = std::make_unique<ast::LetStatement>();
    pStmt->Token = curToken;

    if (!expectPeek(token::types::IDENT))
    {
        return nullptr;
    }

    pStmt->pName = std::make_unique<ast::Identifier>(curToken, curToken.Literal);

    if (!expectPeek(token::types::ASSIGN))
    {
        return nullptr;
    }
    nextToken();

    pStmt->pValue = parseExpression(LOWEST);

    if (peekTokenIs(token::types::SEMICOLON))
    {
        nextToken();
    }

    return pStmt;
}

通过expectPeek不断试探下一个词法单元是否合规,然后递归调用。

解析return语句

语法:

return <表达式>;

定义:

struct ReturnStatement : Statement
{
    token::Token Token; // the 'return' token
    std::unique_ptr<Expression> pReturnValue;

    ReturnStatement(token::Token tok) : Token(tok), pReturnValue(nullptr) {}
    virtual ~ReturnStatement() {}

    virtual void StatementNode() {}

    virtual std::string TokenLiteral() { return Token.Literal; }

    virtual std::string String()
    {
        std::stringstream oss;
        oss << TokenLiteral() + " ";
        if (pReturnValue->Good())
        {
            oss << pReturnValue->String();
        }
        oss << ";";

        return oss.str();
    }

    virtual NodeType GetNodeType() { return ast::NodeType::ReturnStatement; }
};

解析return语法:

std::unique_ptr<ast::ReturnStatement> parseReturnStatement()
{
    std::unique_ptr<ast::ReturnStatement> pStmt = std::make_unique<ast::ReturnStatement>(curToken);
    nextToken();
    pStmt->pReturnValue = parseExpression(LOWEST);
    if (peekTokenIs(token::types::SEMICOLON))
    {
        nextToken();
    }
    return pStmt;
}

解析表达式: 自上而下的运算符优先级分析(普拉特解析)

在这个语言中,除了let和return是语句外,其余的都是表达式,所以解析表达式是关键。

解析let和return语句是从左到右处理Token即可,但表达式还要考虑优先级等复杂情况。

表达式定义:

struct ExpressionStatement : Statement
{
    token::Token Token; // the first token of the expression
    std::unique_ptr<Expression> pExpression;

    ExpressionStatement(token::Token tok) : Token(tok), pExpression(nullptr) {}
    virtual ~ExpressionStatement() {}

    virtual void StatementNode() {}

    virtual std::string TokenLiteral() { return Token.Literal; }

    virtual std::string String()
    {
        if (pExpression->Good())
        {
            return pExpression->String();
        }

        return "";
    }

    virtual NodeType GetNodeType() { return ast::NodeType::ExpressionStatement; }
};

普拉特解析法(自上而下解析法)与其它语法分析的主要区别在于,没有将解析函数与语法规则相关联,而是与单个词法单元相关联。这个方法的关键是,每个词法单元类型都具有两个与之关联的解析函数,具体取决于词法单元的位置,比如是中缀还是前缀。

struct Parser
{
    using prefixParseFn = std::unique_ptr<ast::Expression> (Parser::*)();
    using infixParseFn = std::unique_ptr<ast::Expression> (Parser::*)(std::unique_ptr<ast::Expression>);

    std::map<token::TokenType, prefixParseFn> prefixParseFns;
    std::map<token::TokenType, infixParseFn> infixParseFns;

    void registerPrefix(token::TokenType tokenType, prefixParseFn fn)
    {
        prefixParseFns[tokenType] = fn;
    }

    void registerInfix(token::TokenType tokenType, infixParseFn fn)
    {
        infixParseFns[tokenType] = fn;
    }
};

通过映射方式将词法单元和解析函数关联起来,每当遇到词法单元就调用对应的解析函数来生成AST节点。

定义优先级:

enum class Priority
 {
     LOWEST = 1,
     EQUALS,      // ==
     LESSGREATER, // > or <
     SUM,         // +
     PRODUCT,     // *
     PREFIX,      // -X or !X
     CALL         // myFunction(X)
 };

 std::map<token::TokenType, Priority> precedences{{token::types::EQ, Priority::EQUALS},
                                             {token::types::NOT_EQ, Priority::EQUALS},
                                             {token::types::LT, Priority::LESSGREATER},
                                             {token::types::GT, Priority::LESSGREATER},
                                             {token::types::PLUS, Priority::SUM},
                                             {token::types::MINUS, Priority::SUM},
                                             {token::types::SLASH, Priority::PRODUCT},
                                             {token::types::ASTERISK, Priority::PRODUCT},
                                             {token::types::LPAREN, Priority::CALL}};

在创建语法分析器时注册所有词法单元对应的解析函数映射关系:

pParser->prefixParseFns.clear();
pParser->registerPrefix(token::types::IDENT, &Parser::parseIdentifier);
pParser->registerPrefix(token::types::INT, &Parser::parseIntegerLiteral);
pParser->registerPrefix(token::types::BANG, &Parser::parsePrefixExpression);
pParser->registerPrefix(token::types::MINUS, &Parser::parsePrefixExpression);
pParser->registerPrefix(token::types::TRUE, &Parser::parseBoolean);
pParser->registerPrefix(token::types::FALSE, &Parser::parseBoolean);
pParser->registerPrefix(token::types::LPAREN, &Parser::parseGroupedExpression);
pParser->registerPrefix(token::types::IF, &Parser::parseIfExpression);
pParser->registerPrefix(token::types::FUNCTION, &Parser::parseFunctionLiteral);

pParser->infixParseFns.clear();
pParser->registerInfix(token::types::PLUS, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::MINUS, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::SLASH, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::ASTERISK, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::EQ, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::NOT_EQ, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::LT, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::GT, &Parser::parseInfixExpression);
pParser->registerInfix(token::types::LPAREN, &Parser::parseCallExpression);

解析整数对应函数:

std::unique_ptr<ast::Expression> parseIntegerLiteral()
{
    std::unique_ptr<ast::IntegerLiteral> pLit = std::make_unique<ast::IntegerLiteral>(curToken);

    try
    {
        long long int value = std::stoll(curToken.Literal);
        pLit->Value = value;
    }
    catch (...)
    {
        std::string msg = "could not parse " + curToken.Literal + " as integer";
        errors.push_back(msg);
        return nullptr;
    }

    return pLit;
}

普拉特解析的工作方式

std::unique_ptr<ast::ExpressionStatement> parseExpressionStatement()
{
    std::unique_ptr<ast::ExpressionStatement> pStmt = std::make_unique<ast::ExpressionStatement>(curToken);
    pStmt->pExpression = parseExpression(Priority::LOWEST);
    if (peekTokenIs(token::types::SEMICOLON))
    {
        nextToken();
    }
    return pStmt;
}

std::unique_ptr<ast::Expression> parseExpression(Priority precedence)
{
    prefixParseFn prefix = prefixParseFns[curToken.Type];
    if (prefix == nullptr)
    {
        noPrefixParseFnError(curToken.Type);
        return nullptr;
    }

    std::unique_ptr<ast::Expression> leftExp = (this->*(prefix))();
    while (!peekTokenIs(token::types::SEMICOLON) && precedence < peekPrecedence())
    {
        infixParseFn infix = infixParseFns[peekToken.Type];
        if (!infix)
        {
            return leftExp;
        }

        nextToken();
        leftExp = (this->*(infix))(std::move(leftExp));
    }
    return leftExp;
}

这里是关键部分,前面已经做好了两个工作:

  • 定义了每个词法单元Token对应的操作函数
  • 定义了每个词法单元对应的优先级

那么,接下来就是交给递归,根据优先级递归调用就可以了。

shared_ptr从父类转换为子类

std::shared_ptr<ast::Expression> exp;

std::shared_ptr<ast::Identifier> stmt = std::synamic_pointer_cast<ast::Identifier>(exp);

unique_ptr从父类转换为子类

std::unique_ptr<ast::Expression> exp;

auto x = reinterpret_cast<ast::Identifier *>(exp.release());

std::unique_ptr<ast::Identifier> stmt(x);

需要先释放父类智能指针并转换为子类普通指针,然后再转换为子类智能指针。

测试用例

$ ./test_monkey
[==========] Running 16 tests from 16 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TestNextToken
[ RUN      ] TestNextToken.BasicAssertions
[       OK ] TestNextToken.BasicAssertions (0 ms)
[----------] 1 test from TestNextToken (0 ms total)

[----------] 1 test from TestString
[ RUN      ] TestString.BasicAssertions
[       OK ] TestString.BasicAssertions (0 ms)
[----------] 1 test from TestString (0 ms total)

[----------] 1 test from TestLetStatements
[ RUN      ] TestLetStatements.BasicAssertions
[       OK ] TestLetStatements.BasicAssertions (0 ms)
[----------] 1 test from TestLetStatements (0 ms total)

[----------] 1 test from TestReturnStatements
[ RUN      ] TestReturnStatements.BasicAssertions
[       OK ] TestReturnStatements.BasicAssertions (0 ms)
[----------] 1 test from TestReturnStatements (0 ms total)

[----------] 1 test from TestIdentifierExpression
[ RUN      ] TestIdentifierExpression.BasicAssertions
[       OK ] TestIdentifierExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIdentifierExpression (0 ms total)

[----------] 1 test from TestIntegerLiteralExpression
[ RUN      ] TestIntegerLiteralExpression.BasicAssertions
[       OK ] TestIntegerLiteralExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIntegerLiteralExpression (0 ms total)

[----------] 1 test from TestParsingPrefixExpressions
[ RUN      ] TestParsingPrefixExpressions.BasicAssertions
[       OK ] TestParsingPrefixExpressions.BasicAssertions (0 ms)
[----------] 1 test from TestParsingPrefixExpressions (0 ms total)

[----------] 1 test from TestParsingInfixExpressions
[ RUN      ] TestParsingInfixExpressions.BasicAssertions
[       OK ] TestParsingInfixExpressions.BasicAssertions (0 ms)
[----------] 1 test from TestParsingInfixExpressions (0 ms total)

[----------] 1 test from TestOperatorPrecedenceParsing
[ RUN      ] TestOperatorPrecedenceParsing.BasicAssertions
[       OK ] TestOperatorPrecedenceParsing.BasicAssertions (0 ms)
[----------] 1 test from TestOperatorPrecedenceParsing (0 ms total)

[----------] 1 test from TestBooleanExpression
[ RUN      ] TestBooleanExpression.BasicAssertions
[       OK ] TestBooleanExpression.BasicAssertions (0 ms)
[----------] 1 test from TestBooleanExpression (0 ms total)

[----------] 1 test from TestIfExpression
[ RUN      ] TestIfExpression.BasicAssertions
[       OK ] TestIfExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIfExpression (0 ms total)

[----------] 1 test from TestIfElseExpression
[ RUN      ] TestIfElseExpression.BasicAssertions
[       OK ] TestIfElseExpression.BasicAssertions (0 ms)
[----------] 1 test from TestIfElseExpression (0 ms total)

[----------] 1 test from TestFunctionLiteralParsing
[ RUN      ] TestFunctionLiteralParsing.BasicAssertions
[       OK ] TestFunctionLiteralParsing.BasicAssertions (0 ms)
[----------] 1 test from TestFunctionLiteralParsing (0 ms total)

[----------] 1 test from TestFunctionParameterParsing
[ RUN      ] TestFunctionParameterParsing.BasicAssertions
[       OK ] TestFunctionParameterParsing.BasicAssertions (0 ms)
[----------] 1 test from TestFunctionParameterParsing (0 ms total)

[----------] 1 test from TestCallExpressionParsing
[ RUN      ] TestCallExpressionParsing.BasicAssertions
[       OK ] TestCallExpressionParsing.BasicAssertions (0 ms)
[----------] 1 test from TestCallExpressionParsing (0 ms total)

[----------] 1 test from TestCallExpressionParameterParsing
[ RUN      ] TestCallExpressionParameterParsing.BasicAssertions
[       OK ] TestCallExpressionParameterParsing.BasicAssertions (0 ms)
[----------] 1 test from TestCallExpressionParameterParsing (0 ms total)

[----------] Global test environment tear-down
[==========] 16 tests from 16 test suites ran. (7 ms total)
[  PASSED  ] 16 tests.

本来写好测试用例后Clang编译器已经测试通过,后来换成GCC编译器竟然出现问题,一番查看发现是隐式类型转换引起的。

REPL

$ ./monkey
Hello lesliezhu! This is the Monkey-CPP programming language!
Feel free to type in commands
>> let x = 1 * 2 * 3 * 4 * 5
let x = ((((1 * 2) * 3) * 4) * 5);
>> x * y  / 2 + 3 * 8 - 123
((((x * y) / 2) + (3 * 8)) - 123)
>> let x 12 * 3

   .--.  .-"     "-.  .--.
  / .. \/  .-. .-.  \/ .. \
 | |  '|  /   Y   \  |'  | |
 | \   \  \ 0 | 0 /  /   / |
  \ '- ,\.-"""""""-./, -' /
   ''-' /_   ^ ^   _\ '-''
       |  \._   _./  |
       \   \ '~' /   /
        '._ '-=-' _.'
           '-----'

Woops! We ran into some monkey business here!
 parser errors:
    expected next token to be =, got INT instead
>>

总结

因为这次要求代码里面不出现newdelete,全部使用智能指针来实现,导致遇到两个不大不小的问题:

  • 智能指针必须和移动语义一起使用,导致到处都是std::move
  • 从父类智能指针(unique_ptr)转化为子类智能指针没有很直接的方法,导致代码冗余,测试用例代码很明显
  • 将在下一步求值的时候将unique_ptr换成shared_ptr