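8. Python3 Source Code: Code Objects and pyc Files

8.1. How a Python Program Is Executed

Whenever the Python interpreter runs a Python program file, its first step is to compile the Python source code in that file. The main product of compilation is a sequence of Python byte code. The interpreter then hands this result to the Python virtual machine (VM), which executes the byte code one instruction at a time and thereby runs the program.

From the Python compiler's point of view, the PyCodeObject is the real result of compilation; a pyc file is just the on-disk representation of that object. The two are different forms of the same thing: the result of compiling a source file.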

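While the program is running, the compilation result lives in memory as a PyCodeObject. For an imported module, Python also saves that result to a pyc file (this happens when the module is imported, not when the interpreter exits, and the script passed on the command line is not cached this way). The next time the same module is imported, Python rebuilds the in-memory PyCodeObject directly from the pyc file, as long as the pyc is still up to date with the source, instead of compiling the source again.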
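The two forms of the compilation result can be seen from Python itself. The following is a minimal sketch, not the interpreter's own import logic: it compiles a one-line source string (an illustrative example, with "<demo>" as a made-up filename), serializes the code object with marshal as a pyc file does after its header, and then runs the restored object without recompiling.

```python
import marshal

source = "total = sum(range(5))\n"

# Step 1: compile the source text into an in-memory code object.
code = compile(source, "<demo>", "exec")

# Step 2: serialize it; a pyc file stores bytes like these after its header.
payload = marshal.dumps(code)

# Step 3: rebuild the code object from the bytes and run it without recompiling.
restored = marshal.loads(payload)
namespace = {}
exec(restored, namespace)
print(namespace["total"])
```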

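Once the overall flow is clear, it is straightforward to write a tool that parses pyc files generated by Python 3.7 (its source is in section 8.5). The tool renders the contents of a pyc file as JSON:
[Figure: the parsed contents of a pyc file, rendered as JSON]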

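The tool exists only to deepen understanding of the flow. In practice, Python's dis module prints the same information in a more detailed and readable form:
[Figure: output of the dis module]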
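As a rough illustration of what the dis module offers, the sketch below disassembles a small function. The function add is a made-up example, and the exact instruction names depend on the Python version, so no particular listing is assumed here.

```python
import dis

def add(a, b):
    return a + b

# dis.Bytecode wraps a function or code object and yields one Instruction per opcode.
bytecode = dis.Bytecode(add)
opnames = [instr.opname for instr in bytecode]

# dis() returns the same listing that dis.dis(add) would print.
listing = bytecode.dis()
print(listing)
```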

8.2. PyCodeObject Source

// code.h
typedef struct {
    PyObject_HEAD
    int co_argcount;
    int co_kwonlyargcount;
    int co_nlocals;
    int co_stacksize; 
    int co_flags; 
    int co_firstlineno;
    PyObject *co_code;
    PyObject *co_consts;
    PyObject *co_names;
    PyObject *co_varnames;
    PyObject *co_freevars;
    PyObject *co_cellvars;
    Py_ssize_t *co_cell2arg;
    PyObject *co_filename;      
    PyObject *co_name;          
    PyObject *co_lnotab;        
    void *co_zombieframe; 
    PyObject *co_weakreflist;
    void *co_extra;
} PyCodeObject;
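  • Code Block:
    When the Python compiler compiles source code, it creates one PyCodeObject for each Code Block in that code. Entering a new namespace (that is, a new scope) means entering a new Code Block. The code below, for example, has three code blocks: one for the whole test.py file, one for class A, and one for def Fun.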
# test.py
class A:
    pass

def Fun():
    pass

a = A()
Fun()
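  • Namespace:
    A namespace is the context in which a symbol is interpreted; what a symbol means depends on the namespace. More concretely, the value a variable name refers to is not fixed in Python but is determined by the namespace in which the name is looked up. Each Code Block corresponds to one namespace and to one PyCodeObject.
  • The code object in Python:
    At the Python level there is a counterpart to the C-level PyCodeObject: the code object. It is a thin wrapper around PyCodeObject, and through it we can read the fields of the underlying PyCodeObject.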
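The three code blocks of test.py can be inspected through code objects. This is a small sketch that compiles the same source from a string (so no file on disk is assumed) and looks for nested code objects among the constants of the module-level one.

```python
import types

source = '''
class A:
    pass

def Fun():
    pass

a = A()
Fun()
'''

module_code = compile(source, "test.py", "exec")

# Nested code blocks are stored as constants of the enclosing code object.
nested = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]

print(module_code.co_name, module_code.co_filename)
for child in nested:
    print(child.co_name, child.co_firstlineno, child.co_names)
```

The module-level code object plus the two nested ones make up the three code blocks described above.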


8.3. Generating a pyc File

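# pyc_generator.py
# Note: the imp module is deprecated since Python 3.4 and was removed in Python 3.12;
# this script is meant for the Python 3.7 build analyzed here.
import imp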
import sys

def generate_pyc(name):
    fp, pathname, description = imp.find_module(name)
    try:
        imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()

if __name__ == '__main__':
    generate_pyc(sys.argv[1])

Running the following command in a shell generates the pyc file:

$ ./python3.7 pyc_generator.py test
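On interpreters where imp is not available, the standard py_compile module is one way to get the same effect. The sketch below is an alternative to the script above, not the path the rest of this analysis follows; it writes a throwaway test.py into a temporary directory and checks where the pyc lands and what its first four bytes are.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "test.py")
    with open(source_path, "w", encoding="utf-8") as f:
        f.write("class A:\n    pass\n\ndef Fun():\n    pass\n")

    # Default cache location: __pycache__/test.<implementation tag>.pyc
    expected_path = importlib.util.cache_from_source(source_path)

    # py_compile.compile returns the path of the pyc file it wrote.
    pyc_path = py_compile.compile(source_path, doraise=True)

    with open(pyc_path, "rb") as f:
        magic = f.read(4)

    same_location = os.path.samefile(pyc_path, expected_path)
    magic_matches = magic == importlib.util.MAGIC_NUMBER
    print(pyc_path, magic_matches)
```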

8.3.1. How the PyCodeObject and the pyc File Are Produced

Starting from imp.load_module in pyc_generator.py above, the call sequence is:

// imp.py
load_module
=>load_source

// _bootstrap.py[1]
=>_load
=>_load_unlocked

// _bootstrap_external.py
=> exec_module
=> get_code

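The get_code method calls source_to_code to produce the PyCodeObject, calls _code_to_timestamp_pyc to turn the PyCodeObject into binary data, and calls _cache_bytecode to write that binary data to a file.

Note that a real Python build does not call the _load function in _bootstrap.py (marked [1] in the call sequence above). In Lib/importlib/__init__.py: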

# __init__.py
try:
    import _frozen_importlib as _bootstrap
except ImportError:
    from . import _bootstrap
    _bootstrap._setup(sys, _imp)
else:
    # do sth

try:
    import _frozen_importlib_external as _bootstrap_external
except ImportError:
    from . import _bootstrap_external
    _bootstrap_external._setup(_bootstrap)
    _bootstrap._bootstrap_external = _bootstrap_external
else:
   # do sth

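As the code shows, what actually runs is _load from _frozen_importlib, not _load from _bootstrap.py. The contents of that frozen module are defined in Python/importlib.h:
[Figure: excerpt of Python/importlib.h]
The import machinery is frozen into the interpreter binary because it has to be available before any module, importlib included, can be imported from disk. _frozen_importlib is generated from _bootstrap.py, so for the purpose of reading the source this analysis follows _bootstrap.py instead.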
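Which module importlib really uses can be checked at run time. This sketch relies on the private attribute importlib._bootstrap and on CPython behaviour, so treat it as an observation about one implementation rather than a documented API.

```python
import importlib
import sys

# importlib keeps the module it actually uses in the private name _bootstrap.
bootstrap = importlib._bootstrap

# On CPython the frozen copy is registered under this name at startup.
frozen = sys.modules.get("_frozen_importlib")

print(bootstrap.__name__)
print(bootstrap is frozen)
```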

The following sections examine in detail how the PyCodeObject is created, how it is converted to binary data, and how that data is written to a file.

8.3.2. Creating the PyCodeObject

// _bootstrap_external.py
source_to_code

// _bootstrap.py
=>_call_with_frames_removed

// bltinmodule.c
=> builtin_compile_impl

The C source of builtin_compile_impl is:

// bltinmodule.c
static PyObject *
builtin_compile_impl(PyObject *module, PyObject *source, PyObject *filename, const char *mode, int flags, int dont_inherit, int optimize)
{
    PyObject *source_copy;
    const char *str;
    int compile_mode = -1;
    int is_ast;
    PyCompilerFlags cf;
    int start[] = {Py_file_input, Py_eval_input, Py_single_input};
    PyObject *result;

    cf.cf_flags = flags | PyCF_SOURCE_IS_UTF8;

    if (flags &
        ~(PyCF_MASK | PyCF_MASK_OBSOLETE | PyCF_DONT_IMPLY_DEDENT | PyCF_ONLY_AST))
    {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): unrecognised flags");
        goto error;
    }
    /* XXX Warn if (supplied_flags & PyCF_MASK_OBSOLETE) != 0? */

    if (optimize < -1 || optimize > 2) {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): invalid optimize value");
        goto error;
    }

    if (!dont_inherit) {
        PyEval_MergeCompilerFlags(&cf);
    }

    if (strcmp(mode, "exec") == 0)
        compile_mode = 0;
    else if (strcmp(mode, "eval") == 0)
        compile_mode = 1;
    else if (strcmp(mode, "single") == 0)
        compile_mode = 2;
    else {
        PyErr_SetString(PyExc_ValueError,
                        "compile() mode must be 'exec', 'eval' or 'single'");
        goto error;
    }

    is_ast = PyAST_Check(source);
    if (is_ast == -1)
        goto error;
    if (is_ast) {
        // do sth.
    }

    str = source_as_string(source, "compile", "string, bytes or AST", &cf, &source_copy);
    if (str == NULL)
        goto error;

    result = Py_CompileStringObject(str, filename, start[compile_mode], &cf, optimize);
    Py_XDECREF(source_copy);
    goto finally;

error:
    result = NULL;
finally:
    Py_DECREF(filename);
    return result;
}

Here:

  • source_as_string loads the test.py source shown above into memory:
    [Figure: the test.py source loaded into memory]
  • Py_CompileStringObject produces the PyCodeObject:
// pythonrun.c
PyObject *
Py_CompileStringObject(const char *str, PyObject *filename, int start,
                       PyCompilerFlags *flags, int optimize)
{
    PyCodeObject *co;
    mod_ty mod;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;

    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod == NULL) {
        PyArena_Free(arena);
        return NULL;
    }
    if (flags && (flags->cf_flags & PyCF_ONLY_AST)) {
        PyObject *result = PyAST_mod2obj(mod);
        PyArena_Free(arena);
        return result;
    }
    co = PyAST_CompileObject(mod, filename, flags, optimize, arena);
    PyArena_Free(arena);
    return (PyObject *)co;
}

PyParser_ASTFromStringObject builds the syntax tree (AST), and PyAST_CompileObject compiles it into a PyCodeObject. Parsing and compilation are not analyzed further here.
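The same two phases are reachable from Python through the ast module and the compile builtin. The sketch below uses a made-up one-line source and the placeholder filename "<demo>"; it mirrors the C flow only at a high level.

```python
import ast

source = "answer = 6 * 7\n"

# Phase 1: parse the source text into an AST.
tree = ast.parse(source, filename="<demo>", mode="exec")

# Passing PyCF_ONLY_AST to compile() stops after phase 1 and returns the AST,
# which corresponds to the PyCF_ONLY_AST branch in Py_CompileStringObject.
tree_via_flag = compile(source, "<demo>", "exec", ast.PyCF_ONLY_AST)

# Phase 2: compile the AST into a code object.
code = compile(tree, "<demo>", "exec")

namespace = {}
exec(code, namespace)
print(type(tree).__name__, namespace["answer"])
```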

8.3.3. Converting the PyCodeObject to Binary Data

_code_to_timestamp_pyc converts the PyCodeObject to binary data. Its source is:

# _bootstrap_external.py
def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    "Produce the data for a timestamp-based pyc."
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(0))
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data

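A pyc file therefore consists of the following parts:

  • MAGIC_NUMBER: 4 bytes, a 2-byte little-endian integer followed by \r\n. Each bytecode-incompatible version of Python defines its own value (for example 3360 for Python 3.6a0 and 3392 in the Python 3.7 alpha series; the final 3.7 release uses 3394), so that incompatible pyc files are never loaded;
  • 0: the 4-byte flags field introduced by PEP 552 in Python 3.7. A value of 0 marks a timestamp-based pyc, in which the next two fields are mtime and source_size; a hash-based pyc sets the lowest bit and stores a hash of the source instead;
  • mtime: the last modification time of the py source file. On import Python compares it with the current modification time of the source; if they match, the pyc is reused and does not need to be regenerated;
  • source_size: the size of the source file in bytes;
  • marshal.dumps(code): the PyCodeObject serialized as a byte stream.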
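The layout listed above can be reproduced with public modules. The sketch below assumes Python 3.7 or later (a 16-byte header); build_timestamp_pyc is a hypothetical helper written for this illustration, and the mtime value is arbitrary.

```python
import importlib.util
import marshal
import struct

def build_timestamp_pyc(code, mtime=0, source_size=0):
    """Assemble pyc bytes with the layout listed above (hypothetical helper)."""
    header = importlib.util.MAGIC_NUMBER                   # 4 bytes: magic
    header += struct.pack("<I", 0)                         # 4 bytes: flags, 0 = timestamp-based
    header += struct.pack("<I", mtime & 0xFFFFFFFF)        # 4 bytes: source mtime
    header += struct.pack("<I", source_size & 0xFFFFFFFF)  # 4 bytes: source size
    return header + marshal.dumps(code)

source = "value = 40 + 2\n"
code = compile(source, "<demo>", "exec")
data = build_timestamp_pyc(code, mtime=1_500_000_000, source_size=len(source))

# Split the bytes back into the four header fields and the marshalled body.
magic, flags, mtime, size = data[:4], *struct.unpack("<III", data[4:16])
namespace = {}
exec(marshal.loads(data[16:]), namespace)
print(flags, mtime, size, namespace["value"])
```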

marshal.dumps calls marshal_dumps_impl:

// marshal.c
static PyObject *
marshal_dumps_impl(PyObject *module, PyObject *value, int version)
/*[clinic end generated code: output=9c200f98d7256cad input=a2139ea8608e9b27]*/
{
    return PyMarshal_WriteObjectToString(value, version);
}

The source of PyMarshal_WriteObjectToString is:

// marshal.c
PyObject *
PyMarshal_WriteObjectToString(PyObject *x, int version)
{
    WFILE wf;

    memset(&wf, 0, sizeof(wf));
    wf.str = PyBytes_FromStringAndSize((char *)NULL, 50);
    if (wf.str == NULL)
        return NULL;
    wf.ptr = wf.buf = PyBytes_AS_STRING((PyBytesObject *)wf.str);
    wf.end = wf.ptr + PyBytes_Size(wf.str);
    wf.error = WFERR_OK;
    wf.version = version;
    if (w_init_refs(&wf, version)) {
        Py_DECREF(wf.str);
        return NULL;
    }
    w_object(x, &wf);
    w_clear_refs(&wf);
    if (wf.str != NULL) {
        char *base = PyBytes_AS_STRING((PyBytesObject *)wf.str);
        if (wf.ptr - base > PY_SSIZE_T_MAX) {
            Py_DECREF(wf.str);
            PyErr_SetString(PyExc_OverflowError,
                            "too much marshal data for a bytes object");
            return NULL;
        }
        if (_PyBytes_Resize(&wf.str, (Py_ssize_t)(wf.ptr - base)) < 0)
            return NULL;
    }
    if (wf.error != WFERR_OK) {
        Py_XDECREF(wf.str);
        if (wf.error == WFERR_NOMEMORY)
            PyErr_NoMemory();
        else
            PyErr_SetString(PyExc_ValueError,
              (wf.error==WFERR_UNMARSHALLABLE)?"unmarshallable object"
               :"object too deeply nested to marshal");
        return NULL;
    }
    return wf.str;
}
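From Python, this function is reached through marshal.dumps. The sketch below is only an illustration of its observable behaviour: it serializes a small integer with marshal version 2 (chosen so that no reference flag is added to the type byte) and triggers the "unmarshallable object" error path with a plain object().

```python
import marshal

# Version 2 predates TYPE_REF, so no reference flag is mixed into the type byte.
data = marshal.dumps(1, 2)

# One type byte ('i' for TYPE_INT) followed by the value as 4 little-endian bytes.
expected = b"i" + (1).to_bytes(4, "little", signed=True)
print(data, data == expected)

# Objects that marshal cannot serialize surface as ValueError in Python.
error_message = None
try:
    marshal.dumps(object())
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```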

The key function here is w_object, which calls w_complex_object. The actual conversion of a PyCodeObject to binary data happens inside w_complex_object:

// marshal.c
static void
w_complex_object(PyObject *v, char flag, WFILE *p)
{
    // do sth.
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
    // do sth.
}

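From this we can see:

  • A PyCodeObject is written with type TYPE_CODE. The test.py file from section 8.2 produces three PyCodeObject objects, related by nesting: one PyCodeObject contains the other two;
  • Fields such as co_argcount and co_kwonlyargcount are written with w_long (which calls w_byte to write four bytes), while fields such as co_code and co_consts are written with w_object (which ends up calling w_long, w_string, and so on). That is how every field becomes binary data. The meaning of each field will be analyzed in depth later;
  • There is one special type, TYPE_REF, that saves storage space. Take co_filename, the full path of the py file. These are the co_filename values in the pyc file generated for test.py: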
// class A
"co_filename": {
    "type": "unicode",
    "size": 49,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}

// def Fun
"co_filename": {
    "type": "ref",
    "ref": 6
}

// test.py
"co_filename": {
    "type": "ref",
    "ref": 6
}

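This is implemented by w_ref, shown below. It keeps a hash table whose key is the address of an object and whose value is an index. If an object with the same address is already in the table, only TYPE_REF and the index are written, which saves space.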

// marshal.c
static int
w_ref(PyObject *v, char *flag, WFILE *p)
{
    _Py_hashtable_entry_t *entry;
    int w;

    if (p->version < 3 || p->hashtable == NULL) {
        return 0; /* not writing object references */
    }

    /* if it has only one reference, it definitely isn't shared */
    if (Py_REFCNT(v) == 1) {
        return 0;
    }

    entry = _Py_HASHTABLE_GET_ENTRY(p->hashtable, v);
    if (entry != NULL) {
        /* write the reference index to the stream */
        _Py_HASHTABLE_ENTRY_READ_DATA(p->hashtable, entry, w);
        /* we don't store "long" indices in the dict */
        assert(0 <= w && w <= 0x7fffffff);
        w_byte(TYPE_REF, p);
        w_long(w, p);
        return 1;
    } else {
        size_t s = p->hashtable->entries;
        /* we don't support long indices */
        if (s >= 0x7fffffff) {
            PyErr_SetString(PyExc_ValueError, "too many objects");
            goto err;
        }
        w = (int)s;
        Py_INCREF(v);
        if (_Py_HASHTABLE_SET(p->hashtable, v, w) < 0) {
            Py_DECREF(v);
            goto err;
        }
        *flag |= FLAG_REF;
        return 0;
    }
err:
    p->error = WFERR_UNMARSHALLABLE;
    return 1;
}
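The space saving from TYPE_REF can be observed with marshal.dumps. In this sketch the strings are built at run time (an assumption made so that they are ordinary, non-interned objects): one tuple holds the same string object twice, the other holds two equal but distinct strings.

```python
import marshal

# Two equal strings that are distinct objects in memory.
shared = "-".join(["a fairly long shared string"] * 3)
other = "-".join(["a fairly long shared string"] * 3)

same_object_twice = marshal.dumps((shared, shared), 4)  # second slot becomes a reference
two_objects = marshal.dumps((shared, other), 4)         # the text is written twice

restored = marshal.loads(same_object_twice)
print(len(same_object_twice), len(two_objects))
print(restored[0] is restored[1])
```

Loading the shorter stream yields a tuple whose two items are again one and the same object, which is the reader-side half of the mechanism.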

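The reverse process is shown below. If flag is non-zero, the object that was just read is appended to the refs list (the R_REF macro). If the type is TYPE_REF, the index that was read is used to fetch the real object from that list.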

static PyObject *
r_object(RFILE *p)
{
    PyObject *v, *v2;
    Py_ssize_t idx = 0;
    long i, n;
    int type, code = r_byte(p);
    int flag, is_interned = 0;
    PyObject *retval = NULL;

    if (code == EOF) {
        PyErr_SetString(PyExc_EOFError,
                        "EOF read where object expected");
        return NULL;
    }

    p->depth++;

    if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
        p->depth--;
        PyErr_SetString(PyExc_ValueError, "recursion limit exceeded");
        return NULL;
    }

    flag = code & FLAG_REF;
    type = code & ~FLAG_REF;

#define R_REF(O) do{\
    if (flag) \
        O = r_ref(O, flag, p);\
} while (0)

    switch (type) {
      // do sth.
      case TYPE_REF:
        n = r_long(p);
        if (n < 0 || n >= PyList_GET_SIZE(p->refs)) {
            if (n == -1 && PyErr_Occurred())
                break;
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        v = PyList_GET_ITEM(p->refs, n);
        if (v == Py_None) {
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        Py_INCREF(v);
        retval = v;
        break;
      // do sth.
      }
}

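At first glance it may seem odd that w_ref does not consult flag to decide which objects go into the hash table, the way r_object does. The reason is that the writer is where flag is decided: w_ref adds an object to the hash table the first time it sees it (provided its reference count is greater than 1) and sets FLAG_REF on it, while r_object simply follows the flag already stored in the stream.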

8.3.4. Writing the Binary Data to a File

_cache_bytecode writes the binary data to a file. Its source is:

# _bootstrap_external.py
def _cache_bytecode(self, source_path, bytecode_path, data):
    # Adapt between the two APIs
    mode = _calc_mode(source_path)
    return self.set_data(bytecode_path, data, _mode=mode)

The source of set_data is:

    def set_data(self, path, data, *, _mode=0o666):
        """Write bytes data to a file."""
        parent, filename = _path_split(path)
        path_parts = []
        # Figure out what directories are missing.
        while parent and not _path_isdir(parent):
            parent, part = _path_split(parent)
            path_parts.append(part)
        # Create needed directories.
        for part in reversed(path_parts):
            parent = _path_join(parent, part)
            try:
                _os.mkdir(parent)
            except FileExistsError:
                # Probably another Python process already created the dir.
                continue
            except OSError as exc:
                # Could be a permission error, read-only filesystem: just forget
                # about writing the data.
                _bootstrap._verbose_message('could not create {!r}: {!r}',
                                            parent, exc)
                return
        try:
            _write_atomic(path, data, _mode)
            _bootstrap._verbose_message('created {!r}', path)
        except OSError as exc:
            # Same as above: just don't write the bytecode.
            _bootstrap._verbose_message('could not create {!r}: {!r}', path,
                                        exc)

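The key function for writing the file is _write_atomic, shown below. It writes to a temporary file and then renames it, which guarantees one of two outcomes: either an exception occurs and no file appears under the target name, or no exception occurs and the file with the target name is complete.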

def _write_atomic(path, data, mode=0o666):
    """Best-effort function to write data to a path atomically.
    Be prepared to handle a FileExistsError if concurrent writing of the
    temporary file is attempted."""
    # id() is used to generate a pseudo-random filename.
    path_tmp = '{}.{}'.format(path, id(path))
    fd = _os.open(path_tmp,
                  _os.O_EXCL | _os.O_CREAT | _os.O_WRONLY, mode & 0o666)
    try:
        # We first write data to a temporary file, and then use os.replace() to
        # perform an atomic rename.
        with _io.FileIO(fd, 'wb') as file:
            file.write(data)
        _os.replace(path_tmp, path)
    except OSError:
        try:
            _os.unlink(path_tmp)
        except OSError:
            pass
        raise
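The write-then-rename pattern is easy to try outside importlib. The sketch below is a variation, not the importlib code: write_atomic is a hypothetical helper that uses tempfile.mkstemp to pick the temporary name instead of id(path), and the file name demo.pyc is arbitrary.

```python
import os
import tempfile

def write_atomic(path, data):
    """Write data to path via a temporary file plus os.replace (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=os.path.basename(path) + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)
    except OSError:
        # Clean up the partial temporary file, then let the caller see the error.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "demo.pyc")
    write_atomic(target, b"first")
    write_atomic(target, b"second")
    with open(target, "rb") as f:
        content = f.read()
    leftovers = os.listdir(workdir)
    print(content, leftovers)
```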

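8.4. References

  • Python源码剖析 ("Python Source Code Analysis")

8.5. Appendix

With the pyc generation flow understood, the tool mentioned in section 8.1 can be implemented. Its source is: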

# -*- coding:utf-8 -*-
import json
import datetime
import sys

FLAG_REF = ord('\x80')
TYPE_CODE = ord('c')
TYPE_STRING = ord('s')
TYPE_SMALL_TUPLE = ord(')')
TYPE_INT = ord('i')
TYPE_SHORT_ASCII = ord('z')
TYPE_SHORT_ASCII_INTERNED = ord('Z')
TYPE_REF = ord('r')
TYPE_NONE = ord('N')

REFS_HASH = {}

def parse_code(fp):
    code = int.from_bytes(fp.read(1), 'little')
    code_type = code & ~FLAG_REF
    code_flag = code & FLAG_REF

    idx = len(REFS_HASH)
    if code_flag:
        REFS_HASH[idx] = None

    code_dict = {}
    if code_type == TYPE_CODE:
        code_dict['type'] = 'code'
        code_dict['co_argcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_kwonlyargcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_nlocals'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_stacksize'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_flags'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_code'] = parse_code(fp)
        code_dict['co_consts'] = parse_code(fp)
        code_dict['co_names'] = parse_code(fp)
        code_dict['co_varnames'] = parse_code(fp)
        code_dict['co_freevars'] = parse_code(fp)
        code_dict['co_cellvars']  = parse_code(fp)
        code_dict['co_filename']  = parse_code(fp)
        code_dict['co_name']  = parse_code(fp)
        code_dict['co_firstlineno']  = int.from_bytes(fp.read(4), 'little')
        code_dict['co_lnotab']  = parse_code(fp)
    elif code_type == TYPE_STRING:
        code_dict['type'] = 'string'

        length = int.from_bytes(fp.read(4), 'little')
        code_dict['length'] = length

        # co_code and co_lnotab are raw bytes; keep their repr so the result is JSON-serializable
        value = fp.read(length)
        code_dict['value'] = str(value)

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SMALL_TUPLE:
        code_dict['type'] = 'tuple'

        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size

        items = []
        for _ in range(size):
            items.append(parse_code(fp))
        code_dict['items'] = items

        if code_flag:
            REFS_HASH[idx] = code_dict['items']
    elif code_type == TYPE_INT:
        code_dict['type'] = 'long'

        # TYPE_INT holds a signed 32-bit integer
        value = int.from_bytes(fp.read(4), 'little', signed=True)
        code_dict['value'] = value

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SHORT_ASCII:
        code_dict['type'] = 'unicode'

        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size

        code_dict['value'] = fp.read(size).decode()

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SHORT_ASCII_INTERNED:
        code_dict['type'] = 'unicode'

        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size

        code_dict['value'] = fp.read(size).decode()

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_REF:
        code_dict['type'] = 'ref'
        code_dict['ref'] = int.from_bytes(fp.read(4), 'little')
        code_dict['value'] = REFS_HASH[code_dict['ref']]
    elif code_type == TYPE_NONE:
        code_dict['type'] = 'none'
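    else:
        # An unhandled type would desynchronize the stream, so stop here.
        raise ValueError('unsupported type code: {!r}'.format(chr(code_type)))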

    return code_dict

def parse_pyc(file_name):
    pyc_dict = {}

    with open(file_name, 'rb') as fp:
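        # The magic number is a 2-byte little-endian integer followed by b'\r\n'.
        magic_number = int.from_bytes(fp.read(2), 'little')
        # The 16-byte header read below (PEP 552) starts with magic 3392;
        # the final Python 3.7 release uses 3394.
        if 3392 <= magic_number <= 3394:
            pyc_dict['version'] = 'Python 3.7'
        else:
            print('only support Python 3.7')
            sys.exit(1)

        _ = fp.read(2)  # the b'\r\n' that completes the magic number
        _ = fp.read(4)  # flags field (0 for a timestamp-based pyc)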

        timestamp = int.from_bytes(fp.read(4), 'little')
        pyc_dict['modified'] = str(datetime.datetime.fromtimestamp(timestamp))

        source_size = int.from_bytes(fp.read(4), 'little')
        pyc_dict['size'] = source_size
        pyc_dict['code'] = parse_code(fp)

    return pyc_dict

if __name__ == '__main__':
    file_name = sys.argv[1]
    print(json.dumps(parse_pyc(file_name), indent=2))

Parsing test.py gives:
[Figure: JSON output of the tool for test.py]

TYPE_REF is resolved by the tool; the value field below is not present in the actual binary data:

"co_filename": {
    "type": "ref",
    "ref": 6,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}

The tool does not yet decode the instructions in co_code.
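One possible way to cover the instructions without extending the parser is to let marshal and dis do the work. The sketch below takes that shortcut and is separate from the tool above: it compiles a throwaway test.py with py_compile, skips the 16-byte header (Python 3.7 or later is assumed), and decodes the module's bytecode. Instruction names vary between Python versions, so none are assumed.

```python
import dis
import marshal
import os
import py_compile
import tempfile
import types

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "test.py")
    with open(source_path, "w", encoding="utf-8") as f:
        f.write("def Fun():\n    return 1\n\nFun()\n")
    pyc_path = py_compile.compile(source_path, doraise=True)

    with open(pyc_path, "rb") as f:
        f.read(16)  # skip the 16-byte header
        module_code = marshal.loads(f.read())

# get_instructions decodes co_code into one record per instruction.
instructions = [(i.opname, i.argrepr) for i in dis.get_instructions(module_code)]
for opname, argrepr in instructions:
    print(opname, argrepr)
```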
