The Ledge Language: Storing and Loading Compiled Bytecode Files

The Ledge language: https://ledge-lang.github.io/zh/

Introduction

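Once a virtual machine's compiler has generated bytecode, that bytecode does not have to stay in memory; it can be saved to a bytecode file, as Java does with .class files and Python does with .pyc files. Such files mainly contain the serialized constant pool and the bytecode itself.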

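This post first takes a quick look at what a .pyc file stores, then implements a first version of bytecode file saving and loading for the Ledge language.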

Python's .pyc files

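When Python serializes bytecode, the first 4 bytes store the magic number. The magic numbers for each version are listed in the CPython source file PC/launcher.c.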

This is easy to verify:

$ python --version
Python 2.7.5

>>> import imp
>>> a = imp.get_magic().strip()
>>> a
'\x03\xf3'
>>> a[0]
'\x03'
>>> a[1]
'\xf3'
>>> a0 = int((a[0]).encode('hex'),16)
>>> a1 = int((a[1]).encode('hex'),16)
>>> a0
3
>>> a1
243
>>> a1 << 8 | a0 & 0xFFFF
62211

62211 is the last magic number of Python 2.7, and 2.7 is also the last release of Python 2.

The same check with Python 3:

$ python --version
Python 3.7.2

>>> import imp
>>> a = imp.get_magic()
>>> a
b'B\r\r\n'
>>> a[1] << 8 | a[0] & 0xFFFF
3394

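Once the magic number matches, the file is almost certainly a Python bytecode file. To guard against magic number collisions, the loader also reads several objects to confirm that it really is Python bytecode.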

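The general idea is to use the first few bytes to decide whether the file is a bytecode file of this language, then to check that the version matches. Everything after that is essentially the constant pool and the bytecode itself.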
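A minimal Python sketch of that first check, assuming the header layout seen in the sessions above (a 16-bit little-endian number followed by `\r\n`). The helper name `decode_magic` is made up for this illustration. It reads `importlib.util.MAGIC_NUMBER` instead of using the `imp` module, which is no longer available in recent Python 3 releases.

```python
import importlib.util


def decode_magic(header: bytes) -> int:
    """Return the 16-bit magic number stored in the first two header bytes."""
    # The sessions above show the magic as two bytes followed by b"\r\n".
    if len(header) < 4 or header[2:4] != b"\r\n":
        raise ValueError("not a .pyc header")
    return int.from_bytes(header[:2], "little")


print(decode_magic(b"\x03\xf3\r\n"))  # 62211 (Python 2.7)
print(decode_magic(b"B\r\r\n"))       # 3394  (Python 3.7)

# The value for the running interpreter depends on its version.
print(decode_magic(importlib.util.MAGIC_NUMBER))
```

A loader would run a check like this before touching the rest of the file, and reject the file if the number is outside the range it accepts.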

Boost serialization

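Ledge uses the serialization facility of the Boost library. It offers a non-intrusive approach, which suits Ledge well while the language is still changing rapidly.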

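Boost can serialize to text, XML, and binary formats. The binary format can run into portability problems because of big-endian versus little-endian byte order, so Ledge currently supports both the text and the binary format.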
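The byte-order concern can be illustrated with a short Python sketch. It is not Boost's actual on-disk format; it only shows how the same integer looks as raw bytes in each byte order compared with decimal text. The value 3394 is borrowed from the magic numbers discussed in this post.

```python
import struct

MAGIC = 3394  # example value borrowed from the magic numbers in this post

little = struct.pack("<i", MAGIC)  # 4-byte int, little-endian
big = struct.pack(">i", MAGIC)     # 4-byte int, big-endian
text = str(MAGIC).encode("ascii")  # decimal digits, independent of byte order

print(little)  # b'B\r\x00\x00'
print(big)     # b'\x00\x00\rB'
print(text)    # b'3394'

# Reading little-endian bytes with the wrong byte order gives a different number.
misread = struct.unpack(">i", little)[0]
print(misread == MAGIC)  # False
```

A text representation avoids the problem at the cost of larger files and slower parsing, which is one reason to keep both formats available.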

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>

#include <boost/serialization/list.hpp>
#include <boost/serialization/shared_ptr.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/serialization/version.hpp>

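#include <boost/iostreams/stream.hpp>

#include <iostream> // std::cout, used by the fallback overload below
#include <memory>   // std::shared_ptr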

namespace boost {
    namespace serialization {

        template <class Archive>
        void serialize([[maybe_unused]] Archive &ar,
                       [[maybe_unused]] std::shared_ptr<objects::Object> obj,
                       [[maybe_unused]] const unsigned int version) {
            std::cout << "un-catch serialize Object" << std::endl;
        }

        template <class Archive>
        void serialize(Archive &ar,
                       std::shared_ptr<objects::Integer> obj,
                       [[maybe_unused]] const unsigned int version) {
            ar & obj->Value;
        }

        template <class Archive>
        void serialize(Archive &ar,
                       std::shared_ptr<objects::Double> obj,
                       [[maybe_unused]] const unsigned int version) {
            ar & obj->Value;
        }

        template <class Archive>
        void serialize(Archive &ar,
                       std::shared_ptr<objects::String> obj,
                       [[maybe_unused]] const unsigned int version) {
            ar & obj->Value;
        }

        template <class Archive>
        void serialize(Archive &ar,
                       std::shared_ptr<objects::CompiledFunction> obj,
                       [[maybe_unused]] const unsigned int version) {
            ar & obj->NumLocals;
            ar & obj->NumParameters;
            ar & obj->Instructions;
        }
    } // namespace serialization
} // namespace boost

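This code adds support for Ledge's object types to Boost in a non-intrusive way. Only three kinds of objects are covered here (numbers, strings, and functions), because Ledge's virtual machine currently rebuilds arrays, hashes, and closures at run time, which hurts runtime performance. Supporting a new type later only requires adding another overload here. The advantage of the non-intrusive approach is that the object classes themselves do not change, and if the serialization scheme is ever replaced, it can be swapped out in one place.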
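The non-intrusive idea can be sketched in Python as well, as an analogy rather than a port of the Boost code. Serializers are registered per type outside the classes, and a fallback handles anything unregistered, much like the `Object` overload above. The class and function names here are hypothetical.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Integer:
    value: int = 0


@dataclass
class String:
    value: str = ""


@singledispatch
def serialize(obj) -> list:
    # Fallback for types without a registered serializer.
    raise TypeError(f"no serializer registered for {type(obj).__name__}")


@serialize.register
def _(obj: Integer) -> list:
    return ["INTEGER", obj.value]


@serialize.register
def _(obj: String) -> list:
    return ["STRING", obj.value]


print(serialize(Integer(7)))      # ['INTEGER', 7]
print(serialize(String("name")))  # ['STRING', 'name']
```

Neither `Integer` nor `String` knows anything about serialization; adding a new type means adding one more registered function.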

Writing the .llc file

using versionPair = std::pair<int, int>;

const static std::map<std::string, versionPair> MagicNumberDict{
    { "0.0.4", { 3390, 3394 } },
};

bool SaveLLCFileToText(const std::string llcFileName, std::shared_ptr<compiler::ByteCode> code) {
    std::ofstream outfile;
    outfile.open(llcFileName);

    if (!outfile.good()) {
        std::cerr << "can not open file: " << llcFileName << std::endl;
        outfile.close();
        return false;
    }

    boost::archive::text_oarchive oa(outfile);

    auto fit = MagicNumberDict.find(repl::LEDGE_VERSION);
    if (fit != MagicNumberDict.end()) {
        oa << fit->second.second;
    } else {
        std::cout << "ERROR: unsupported magic number for version: " << repl::LEDGE_VERSION << std::endl;
        return false;
    }

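    // Write the count as int so it matches the int that the loader reads back
    const int constantSize = static_cast<int>(code->Constants.size());
    oa << constantSize;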

    for (auto &obj : code->Constants) {
        if (!obj) {
            oa << objects::ObjectType::Bad;
            continue;
        }

        auto objType = obj->Type();
        oa << objType;

        if (objType == objects::ObjectType::INTEGER) {
            auto realObj = std::dynamic_pointer_cast<objects::Integer>(obj);
            oa << realObj;
        } else if (objType == objects::ObjectType::COMPILED_FUNCTION) {
            auto realObj = std::dynamic_pointer_cast<objects::CompiledFunction>(obj);
            oa << realObj;
        } else if (objType == objects::ObjectType::STRING) {
            auto realObj = std::dynamic_pointer_cast<objects::String>(obj);
            oa << realObj;
        } else if (objType == objects::ObjectType::DOUBLE) {
            auto realObj = std::dynamic_pointer_cast<objects::Double>(obj);
            oa << realObj;
        } else {
            std::cout << "ERROR: unsupported constant object of type: " << obj->TypeStr() << std::endl;
            return false;
        }
    }

    oa << code->Instructions;

    outfile.close();

    return true;
}

In short, open the output file in text or binary mode and hand the stream to Boost:

boost::archive::text_oarchive oa(ofs);   // text format
boost::archive::binary_oarchive oa(ofs); // binary format (open the stream with std::ios::binary)

Reading the .llc file

Loading is essentially a matter of rebuilding the constant pool:

std::shared_ptr<compiler::ByteCode> LoadLLCFileFromText(const std::string llcFileName) {
        std::ifstream infile;
        infile.open(llcFileName);

        if (!infile.good()) {
            std::cerr << "can not open file: " << llcFileName << std::endl;
            infile.close();
            return nullptr;
        }

        int magicNumberMix = 0;
        int magicNumberMax = 0;
        auto fit = MagicNumberDict.find(repl::LEDGE_VERSION);
        if (fit != MagicNumberDict.end()) {
            magicNumberMix = fit->second.first;
            magicNumberMax = fit->second.second;
        } else {
            std::cout << "ERROR: unsupported magic number for version: " << repl::LEDGE_VERSION << std::endl;
            return nullptr;
        }

        std::shared_ptr<compiler::ByteCode> code = std::make_shared<compiler::ByteCode>();
        int magicNumber = 0;
        int constantSize = 0;

        boost::archive::text_iarchive ia(infile);

        ia >> magicNumber;

        // Keep only the low 16 bits; masking the value does not depend on host byte order
        int magicNumberMatch = magicNumber & 0xFFFF;

        if (magicNumberMatch < magicNumberMix || magicNumberMatch > magicNumberMax) {
            std::cout << "The Ledge Compiled File With Wrong Magic Number: " << magicNumberMatch << std::endl;
            return nullptr;
        }

        ia >> constantSize;

        code->Constants.resize(constantSize);

        objects::ObjectType objType;
        for (int i = 0; i < constantSize; i++) {
            ia >> objType;

            if (objType == objects::ObjectType::INTEGER) {
                auto realObj = std::make_shared<objects::Integer>();
                ia >> realObj;
                code->Constants[i] = realObj;
            } else if (objType == objects::ObjectType::COMPILED_FUNCTION) {
                auto realObj = std::make_shared<objects::CompiledFunction>();
                ia >> realObj;
                code->Constants[i] = realObj;
            } else if (objType == objects::ObjectType::STRING) {
                auto realObj = std::make_shared<objects::String>();
                ia >> realObj;

                code->Constants[i] = realObj;
            } else if (objType == objects::ObjectType::DOUBLE) {
                auto realObj = std::make_shared<objects::Double>();
                ia >> realObj;

                code->Constants[i] = realObj;
            } else {
                std::cout << "ERROR: unsupported constant object type or nullptr" << std::endl;
                code->Constants[i] = nullptr;
            }
        }

        ia >> code->Instructions;

        infile.close();

        return code;
    }

As with writing, open the input file in text or binary mode and hand the stream to Boost:

boost::archive::text_iarchive ia(ifs);   // text format
boost::archive::binary_iarchive ia(ifs); // binary format (open the stream with std::ios::binary)
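The save and load routines above share one field order: magic number, constant count, a type tag followed by a payload for each constant, then the instructions. The Python sketch below round-trips that order using one JSON value per line. This encoding is an assumption chosen for readability and is not what Boost writes. `MAGIC_RANGE` copies the values of `MagicNumberDict`, and the remaining names are hypothetical.

```python
import io
import json

MAGIC_RANGE = {"0.0.4": (3390, 3394)}  # same values as MagicNumberDict
VERSION = "0.0.4"                      # stands in for repl::LEDGE_VERSION
TAGS = {int: "INTEGER", float: "DOUBLE", str: "STRING"}
CASTS = {tag: typ for typ, tag in TAGS.items()}


def save(constants, instructions, out):
    """Write magic number, constant count, tagged constants, then instructions."""
    fields = [MAGIC_RANGE[VERSION][1], len(constants)]
    for const in constants:
        fields += [TAGS[type(const)], const]  # tag first, payload second
    fields.append(list(instructions))
    for field in fields:
        out.write(json.dumps(field) + "\n")   # one field per line


def load(inp):
    """Check the magic number, then rebuild the constant pool in order."""
    fields = (json.loads(line) for line in inp)
    magic = next(fields)
    low, high = MAGIC_RANGE[VERSION]
    if not low <= magic <= high:
        raise ValueError(f"wrong magic number: {magic}")
    constants = []
    for _ in range(next(fields)):
        tag = next(fields)
        constants.append(CASTS[tag](next(fields)))
    return constants, bytes(next(fields))


buf = io.StringIO()
save([3.14159, 1, "name"], b"\x00\x01\x02", buf)
buf.seek(0)
print(load(buf))  # ([3.14159, 1, 'name'], b'\x00\x01\x02')
```

Because the reader consumes fields in exactly the order the writer produced them, the tag is what tells it which kind of object to construct next.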

Results

Test source file (test.ll):

let a = 3.14159;

let b = [1,2,3]; # list or array

let c = {1:1,2:"name",3:true, 4:1+2, true:1024} # map or dict

let fib = fn(x){  # function
        if(x == 0){ 
            return 0;
        }elif(x == 1){ 
            return 1
        }else{
            return fib(x-1) + fib(x-2);
        }
    };

let adder = fn(x){  # closure
        let add = fn(y){ 
              return x + y
        }

        add;
    };

let add2 = adder(2);

print("a:", a);
print("list:", b)
print("dict:", c)
print("fib(5):",fib(15));
print("add2(2):", add2(2));

Compile and run the source file:

$ ./ledge test.ll

serialize Double: 3.14159
serialize Integer: 1
serialize Integer: 2
serialize Integer: 3
serialize Integer: 1
serialize Integer: 1
serialize Integer: 2
serialize String: name
serialize Integer: 3
serialize Integer: 4
serialize Integer: 1
serialize Integer: 2
serialize Integer: 1024
serialize Integer: 0
serialize Integer: 0
serialize Integer: 1
serialize Integer: 1
serialize Integer: 1
serialize Integer: 2
serialize CompiledFunction: CompiledFunction[0x600000979358]
serialize Integer: 2
serialize CompiledFunction: CompiledFunction[0x600000979458]
serialize CompiledFunction: CompiledFunction[0x600000979398]
serialize Integer: 2
serialize String: a:
serialize String: list:
serialize String: dict:
serialize String: fib(5):
serialize Integer: 15
serialize String: add2(2):
serialize Integer: 2

a: 3.141590
list: [1, 2, 3]
dict: {true: 1024, 2: "name", 3: true, 4: 3}
fib(5): 610
add2(2): 4

Run the bytecode file:

$ ./ledge test.llc

read constant: 3.141590
read constant: 1
read constant: 2
read constant: 3
read constant: 1
read constant: 1
read constant: 2
read constant: "name"
read constant: 3
read constant: 4
read constant: 1
read constant: 2
read constant: 1024
read constant: 0
read constant: 0
read constant: 1
read constant: 1
read constant: 1
read constant: 2
read constant: CompiledFunction[0x600001c789d8]
read constant: 2
read constant: CompiledFunction[0x600001c78a58]
read constant: CompiledFunction[0x600001c78a98]
read constant: 2
read constant: "a:"
read constant: "list:"
read constant: "dict:"
read constant: "fib(5):"
read constant: 15
read constant: "add2(2):"
read constant: 2

a: 3.141590
list: [1, 2, 3]
dict: {true: 1024, 2: "name", 3: true, 4: 3}
fib(5): 610
add2(2): 4

Everything works as expected.

Summary

Saving and loading a bytecode file is, at its core, a serialization and deserialization problem.