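I'm using h5py to save numpy arrays in HDF5 format from Python. Recently I tried to apply compression, and the files I get are bigger...
I went from things (every file has several datasets) like this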
self._h5_current_frame.create_dataset(
    'estimated position', shape=estimated_pos.shape,
    dtype=float, data=estimated_pos)
to things like this
self._h5_current_frame.create_dataset(
    'estimated position', shape=estimated_pos.shape, dtype=float,
    data=estimated_pos, compression="gzip", compression_opts=9)
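In one particular example, the compressed file is 172K and the uncompressed file is 72K (and h5diff reports that the two files are equal). I tried a more basic example and it works as expected... but not in my program.
How is that possible? I don't think the gzip algorithm ever produces a bigger compressed file, so it is probably related to h5py and the way I use it :-/ Any ideas?
Cheers!!
EDIT:
Looking at the output of h5stat, it seems that the compressed version stores a lot more metadata (see the last few lines of the output).
Compressed file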
Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
# of unique groups: 21
# of unique datasets: 56
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 5
File space information for file metadata (in bytes):
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 3798/503
Datasets(exclude compact data): 15904/9254
Datatypes: 0/0
Groups:
B-tree/List: 0
Heap: 0
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 116824
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Small groups (with 0 to 9 links):
# of groups with 1 link(s): 1
# of groups with 2 link(s): 5
# of groups with 3 link(s): 5
# of groups with 5 link(s): 10
Total # of small groups: 21
Group bins:
# of groups with 1 - 9 links: 21
Total # of groups: 21
Dataset dimension information:
Max. rank of datasets: 3
Dataset ranks:
# of dataset with rank 1: 51
# of dataset with rank 2: 3
# of dataset with rank 3: 2
1-D Dataset information:
Max. dimension size of 1-D datasets: 624
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 36
# of datasets with dimension sizes 2: 2
# of datasets with dimension sizes 3: 2
Total # of small datasets: 40
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 40
# of datasets with dimension size 10 - 99: 2
# of datasets with dimension size 100 - 999: 9
Total # of datasets: 51
Dataset storage information:
Total raw data size: 33602
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 2
Dataset layout counts[CHUNKED]: 54
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 2
GZIP filter: 54
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 4
Dataset datatype #0:
Count (total/named) = (20/0)
Size (desc./elmt) = (14/8)
Dataset datatype #1:
Count (total/named) = (17/0)
Size (desc./elmt) = (22/8)
Dataset datatype #2:
Count (total/named) = (10/0)
Size (desc./elmt) = (22/8)
Dataset datatype #3:
Count (total/named) = (9/0)
Size (desc./elmt) = (14/8)
Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small # of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Summary of file space information:
File metadata: 136526 bytes
Raw data: 33602 bytes
Unaccounted space: 5111 bytes
Total space: 175239 bytes
Uncompressed file
Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
# of unique groups: 21
# of unique datasets: 56
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 5
File space information for file metadata (in bytes):
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 3663/452
Datasets(exclude compact data): 15904/10200
Datatypes: 0/0
Groups:
B-tree/List: 0
Heap: 0
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Small groups (with 0 to 9 links):
# of groups with 1 link(s): 1
# of groups with 2 link(s): 5
# of groups with 3 link(s): 5
# of groups with 5 link(s): 10
Total # of small groups: 21
Group bins:
# of groups with 1 - 9 links: 21
Total # of groups: 21
Dataset dimension information:
Max. rank of datasets: 3
Dataset ranks:
# of dataset with rank 1: 51
# of dataset with rank 2: 3
# of dataset with rank 3: 2
1-D Dataset information:
Max. dimension size of 1-D datasets: 624
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 36
# of datasets with dimension sizes 2: 2
# of datasets with dimension sizes 3: 2
Total # of small datasets: 40
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 40
# of datasets with dimension size 10 - 99: 2
# of datasets with dimension size 100 - 999: 9
Total # of datasets: 51
Dataset storage information:
Total raw data size: 50600
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 56
Dataset layout counts[CHUNKED]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 56
GZIP filter: 0
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 4
Dataset datatype #0:
Count (total/named) = (20/0)
Size (desc./elmt) = (14/8)
Dataset datatype #1:
Count (total/named) = (17/0)
Size (desc./elmt) = (22/8)
Dataset datatype #2:
Count (total/named) = (10/0)
Size (desc./elmt) = (22/8)
Dataset datatype #3:
Count (total/named) = (9/0)
Size (desc./elmt) = (14/8)
Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small # of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Summary of file space information:
File metadata: 19567 bytes
Raw data: 50600 bytes
Unaccounted space: 5057 bytes
Total space: 75224 bytes
Answer:
First, here is a reproducible example:
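import h5py
# NOTE: lena() has been removed from recent SciPy releases. Any compressible
# 2D array can be substituted, but the sizes reported below are for lena().
from scipy.misc import lena

img = lena()  # some compressible image data

# No compression (contiguous storage)
f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()

# gzip at the highest compression level
f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()

# gzip filter enabled, but at level 0 (no actual compression)
f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()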
Now let's take a look at the file sizes:
~$h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
File metadata: 1304 bytes
Raw data: 2097152 bytes
Unaccounted space: 840 bytes
Total space: 2099296 bytes
~$h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
File metadata: 11768 bytes
Raw data: 302850 bytes
Unaccounted space: 1816 bytes
Total space: 316434 bytes
~$h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
File metadata: 11768 bytes
Raw data: 2098560 bytes
Unaccounted space: 1816 bytes
Total space: 2112144 bytes
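In my example, compressing with gzip -9 makes sense: although it requires an extra ~10kB of metadata, this is far outweighed by a ~1794kB decrease in the size of the image data (a compression ratio of about 7:1). The net result is a ~6.6-fold reduction in total file size.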
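In your example, however, compression only reduces the size of your raw data by about 17kB (from 50600 to 33602 bytes, a compression ratio of about 1.5:1), which is massively outweighed by an increase of about 117kB in the size of the metadata (from 19567 to 136526 bytes). The reason the increase in metadata size is so much larger than in my example is probably that your file contains 56 datasets rather than just one.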
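Even if gzip magically reduced the size of your raw data to zero, you would still end up with a file about 1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to grow no faster than linearly with the size of your arrays, so if your datasets were much larger you would start to see some benefit from compressing them. As it stands, your arrays are so small that you are unlikely to gain anything from compression.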
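The trade-off can be checked directly from the two h5stat summaries in the question. The following is a minimal sketch that only redoes the arithmetic; the dictionary and variable names are my own, and the byte counts are copied from the "Summary of file space information" blocks.

```python
# Byte counts copied from the h5stat summaries in the question.
uncompressed = {"metadata": 19567, "raw": 50600, "unaccounted": 5057}
compressed = {"metadata": 136526, "raw": 33602, "unaccounted": 5111}

raw_saved = uncompressed["raw"] - compressed["raw"]
metadata_added = compressed["metadata"] - uncompressed["metadata"]
net_change = sum(compressed.values()) - sum(uncompressed.values())

# Size of the compressed file if gzip had shrunk the raw data to nothing,
# relative to the total size of the uncompressed file.
best_case = compressed["metadata"] + compressed["unaccounted"]
best_case_ratio = best_case / sum(uncompressed.values())

print(f"raw data saved:  {raw_saved} bytes")
print(f"metadata added:  {metadata_added} bytes")
print(f"net change:      {net_change:+d} bytes")
print(f"best-case ratio: {best_case_ratio:.2f}x the uncompressed file")
```

With these numbers the metadata growth is several times the raw-data saving, which is why the compressed file ends up larger overall.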
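Update:
The reason the compressed version needs so much more metadata is not really to do with compression per se, but rather with the fact that, in order to use compression filters, a dataset has to be split into fixed-size chunks. Presumably a lot of the extra metadata is being used to store the B-tree needed to index the chunks. This is consistent with your h5stat output, where the "Chunked datasets: Index" entry is 116824 bytes for the compressed file and 0 for the uncompressed one.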
f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()
f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()
f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
                  compression_opts=9)
f6.close()
The resulting file sizes:
~$h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
File metadata: 11768 bytes
Raw data: 2097152 bytes
Unaccounted space: 1816 bytes
Total space: 2110736 bytes
~$h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
File metadata: 3920 bytes
Raw data: 2097152 bytes
Unaccounted space: 96 bytes
Total space: 2101168 bytes
~$h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
File metadata: 3920 bytes
Raw data: 305051 bytes
Unaccounted space: 96 bytes
Total space: 309067 bytes
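It's clear that chunking, rather than compression, is what incurs the extra metadata: nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 above, and introducing compression to the single-chunk version in complevel_9_onechunk.h5 made no difference to the total amount of metadata.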
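In this example, increasing the chunk size so that the array is stored as a single chunk reduced the amount of metadata by a factor of about 3 (from 11768 to 3920 bytes). How much difference this would make in your case will probably depend on how h5py automatically selects a chunk size for your input datasets. Interestingly, this also resulted in a very slight reduction in the compression ratio, which is not what I would have predicted.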
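To get a feel for why the single-chunk layout needs less indexing, it helps to count how many chunks each layout produces. This is a rough sketch under the assumption that the index grows with the number of chunks; `count_chunks` is a helper of my own, not part of h5py, and it does not try to predict the metadata size in bytes.

```python
import math


def count_chunks(shape, chunk_shape):
    """Number of chunks needed to tile an array of `shape` with `chunk_shape`.

    Partial chunks at the edges are counted as whole chunks.
    """
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunk_shape))


image_shape = (512, 512)
auto_chunks = count_chunks(image_shape, (32, 64))   # chunk shape h5py picked above
one_chunk = count_chunks(image_shape, (512, 512))   # chunks=img.shape

print(auto_chunks, one_chunk)
```

The same helper can be applied to the shape and `.chunks` attribute of each of your own datasets to see how many chunks h5py's automatic choice creates for them.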
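Bear in mind that there are also disadvantages to having larger chunks. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset this can be disastrous for performance, but in your case the arrays are so small that it's probably not worth worrying about.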
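Another thing you should consider is whether you could store your datasets within a single array rather than in lots of small arrays. For example, if you have K 2D arrays of the same dtype, each with dimensions MxN, then you could store them more efficiently in a single KxMxN 3D array rather than in lots of small datasets. I don't know enough about your data to know whether this is feasible.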
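As an illustration of the stacking idea, here is a minimal sketch. The sizes K, M and N, the random stand-in data, the dataset name 'frames' and the output path are all hypothetical, and the sketch only shows the mechanics; whether it actually shrinks your files depends on your data.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in data: K small 2D float arrays of identical shape.
K, M, N = 20, 8, 8
rng = np.random.default_rng(0)
frames = [rng.random((M, N)) for _ in range(K)]

path = os.path.join(tempfile.mkdtemp(), "stacked.h5")
with h5py.File(path, "w") as f:
    # One K x M x N dataset stored as a single chunk, instead of K datasets.
    stacked = np.stack(frames)
    dset = f.create_dataset(
        "frames",
        data=stacked,
        chunks=stacked.shape,
        compression="gzip",
        compression_opts=9,
    )
    stored_shape = dset.shape
    stored_chunks = dset.chunks

with h5py.File(path, "r") as f:
    # Indexing along the first axis recovers an individual array.
    first = f["frames"][0]

print(stored_shape, stored_chunks, np.array_equal(first, frames[0]))
```

Running `h5stat -S` on the resulting file and on a file holding the same arrays as separate datasets would show whether the stacked layout pays off for your data.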