python – Read a ZIP file from S3 without downloading the entire file

Our ZIP files are 5-10 GB in size. A typical ZIP file contains 5-10 internal files, each 1-5 GB uncompressed.

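I have a nice set of Python tools for reading these files. Basically, I can open a filename, and if it refers to a ZIP file, the tools search the ZIP file and then open the compressed file. It is all fairly transparent.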

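I want to store these files in Amazon S3 as compressed files. I can fetch byte ranges of S3 objects, so it should be possible to fetch the ZIP central directory (it is at the end of the file, so I can just read the last 64 KiB), find the component I want, download it, and stream it directly to the calling process.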

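So my question is: how do I do that through the standard Python ZipFile API? It is not documented how to replace the filesystem transport with an arbitrary object that supports POSIX semantics. Is this possible without rewriting the module?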

Solution:

Here is an approach that does not need to fetch the entire file (a full version is available here).

It does require boto (or boto3), though, unless you can mimic the ranged GETs via the AWS CLI, which I guess is quite possible as well.

# written for Python 3
import sys
import zlib
import zipfile
import io

import boto
from boto.s3.connection import OrdinaryCallingFormat


# range-fetches `length` bytes of an S3 key, starting at byte offset `start`
def fetch(key, start, length):
    end = start + length - 1  # HTTP byte ranges are inclusive
    return key.get_contents_as_string(headers={"Range": "bytes=%d-%d" % (start, end)})


# parses 2 or 4 little-endian bytes into their corresponding integer value
def parse_int(data):
    return int.from_bytes(data, "little")


# Inputs (assumed here to be passed on the command line):
#   bucket: name of the bucket
#   key:    path to zipfile inside bucket
#   entry:  pathname of zip entry to be retrieved (path/to/subdir/file.name)
bucket, key, entry = sys.argv[1:4]

# OrdinaryCallingFormat prevents certificate errors on bucket names with dots
# https://stackoverflow.com/questions/51604689/read-zip-files-from-amazon-s3-using-boto3-and-python#51605244
_bucket = boto.connect_s3(calling_format=OrdinaryCallingFormat()).get_bucket(bucket)
_key = _bucket.get_key(key)

# fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty)
# note: ZIP64 archives (needed once sizes or offsets exceed 4 GiB) store 0xFFFFFFFF
# in the two fields read below and are not handled by this script
size = _key.size
eocd = fetch(_key, size - 22, 22)

# start offset and size of the central directory
cd_start = parse_int(eocd[16:20])
cd_size = parse_int(eocd[12:16])

# fetch central directory, append EOCD, and open as zipfile!
cd = fetch(_key, cd_start, cd_size)
zf = zipfile.ZipFile(io.BytesIO(cd + eocd))


for zi in zf.filelist:
    if zi.filename == entry:
        # in our "mock" zipfile, `header_offset`s are negative (probably because the leading content is missing)
        # so we have to add to it the CD start offset (`cd_start`) to get the actual offset
        header_start = cd_start + zi.header_offset

        # bytes 26-29 of the local file header hold the file name length and the
        # extra field length (so we can reliably skip file name and extra fields)
        file_head = fetch(_key, header_start + 26, 4)
        name_len = parse_int(file_head[0:2])
        extra_len = parse_int(file_head[2:4])

        # the entry's data starts right after the 30-byte fixed header, the name and the extra field
        content = fetch(_key, header_start + 30 + name_len + extra_len, zi.compress_size)

        # now `content` has the file entry you were looking for!
        # decompress it if necessary before passing it to some other program
        if zi.compress_type == zipfile.ZIP_DEFLATED:
            data = zlib.decompressobj(-15).decompress(content)  # -15: raw deflate, no zlib header
        else:
            data = content

        # write raw bytes; print() would show the bytes repr instead
        sys.stdout.buffer.write(data)
        break
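
The question asks how to do this through the standard ZipFile API. One possible route, sketched here rather than taken from the answer's script, is to hand `zipfile.ZipFile` a seekable file-like object whose reads are served by ranged GETs. `zipfile` then locates the central directory on its own, which also covers archives with a comment or with ZIP64 records. The names `RangeReader`, `make_s3_fetch` and `open_remote_zip` are hypothetical, and the use of boto3's `get_object` with a `Range` argument is an assumption about your setup.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by ranged reads.

    `fetch(start, length)` is a caller-supplied function (hypothetical)
    returning `length` bytes that begin at byte offset `start`.
    """

    def __init__(self, fetch, size):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            # real files raise OSError here, and zipfile expects that
            raise OSError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, buf):
        # never ask for bytes past the end of the remote object
        length = min(len(buf), self._size - self._pos)
        if length <= 0:
            return 0
        data = self._fetch(self._pos, length)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


def make_s3_fetch(client, bucket, key):
    """Build a fetch function from a boto3 S3 client (assumed setup)."""
    def fetch(start, length):
        byte_range = "bytes=%d-%d" % (start, start + length - 1)
        resp = client.get_object(Bucket=bucket, Key=key, Range=byte_range)
        return resp["Body"].read()
    return fetch


def open_remote_zip(fetch, size, buffer_size=1 << 20):
    """Open a remote archive; buffering merges zipfile's many small reads."""
    raw = RangeReader(fetch, size)
    return zipfile.ZipFile(io.BufferedReader(raw, buffer_size=buffer_size))
```

With boto3 this might be used as `client = boto3.client("s3")`, then `size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]`, then `open_remote_zip(make_s3_fetch(client, bucket, key), size).open(entry)`, which returns a stream that requests only the ranges it reads.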

In your case you will probably need to write the fetched content to a local file (because of its size), unless memory usage is not a concern.
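
One way to keep memory bounded when writing to a local file is to fetch and inflate the entry chunk by chunk. The following is a minimal sketch, not part of the original script: `stream_entry_to_file` is a hypothetical helper, `fetch(start, length)` is assumed to be a two-argument ranged read, and `data_start` is assumed to be the absolute offset of the entry's compressed data.

```python
import zlib


def stream_entry_to_file(fetch, data_start, compress_size, out_path,
                         deflated=True, chunk_size=1 << 20):
    """Copy one zip entry to out_path without holding it all in memory.

    Returns the number of (uncompressed) bytes written.
    """
    # -15 selects raw deflate, which is how zip stores deflated entries
    inflater = zlib.decompressobj(-15) if deflated else None
    written = 0
    offset = 0
    with open(out_path, "wb") as out:
        while offset < compress_size:
            length = min(chunk_size, compress_size - offset)
            chunk = fetch(data_start + offset, length)
            if not chunk:
                raise EOFError("ranged read returned no data")
            offset += len(chunk)
            if inflater is None:
                # stored (uncompressed) entry: copy bytes through as-is
                written += out.write(chunk)
                continue
            # Cap each inflate step so a highly compressible chunk cannot
            # balloon in memory; leftover input waits in unconsumed_tail.
            while chunk:
                written += out.write(inflater.decompress(chunk, chunk_size))
                chunk = inflater.unconsumed_tail
        if inflater is not None:
            # emit whatever output is still pending inside the decompressor
            written += out.write(inflater.flush())
    return written
```

Applied to the script above, `data_start` would be `cd_start + zi.header_offset + 30 + name_len + extra_len`, `compress_size` would be `zi.compress_size`, and `fetch` would be a small wrapper that binds `_key` to the script's own range-fetch function.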
