Apache POI 用eventmodel 读取大文件Excel (3) Xlsx格式内容补充

本文部分内容来自于ECMA-376
http://www.ecma-international.org/publications/standards/Ecma-376.htm

在上一节中,已经了解了OOXML大部分的构成,然鹅在实际操作中,遇到了t="shared"这样的共享公式,是上一篇中没有出现过的内容,*上查到是共享公式shared formula但是更详细的内容就没了。

在百度,谷歌,*上搜到的描述,讲解OOXML的信息很少,导致最后不得不追根溯源,去wiki上看了一下,发现OOXML是ECMA-376制定的。

.xlsx格式出现的时间是2007年,且它遵守了这个标准,那这个标准诞生的时间必须在2007年以及之前,经过查询发现ECMA-376 1st 第一版就是2006年发布的,所以把它下载下来,看看内容。

经过查询发现,关于.xlsx的XML的cell里的字段定义是在SpreadsheetML里定义的,以下是ECMA原文内容:

8.4 SpreadsheetML
This subclause introduces the overall form of a SpreadsheetML package, and identifies some of its main element types.
(See Part 3 for a more detailed introduction.) A SpreadsheetML package
has a relationship of type officeDocument, which specifies the
location of the main part in the package. For a SpreadsheetML
document, that part contains the workbook definition.

这里它说详细的内容在第三部分,于是我找到第三部分,发现果然是有的:
Apache POI 用eventmodel 读取大文件Excel (3) Xlsx格式内容补充

Apache POI 用eventmodel 读取大文件Excel (3) Xlsx格式内容补充

经过查找,补充一部分需要使用的内容,这些都来自于ECMA-376 1st part3

3.2.9.2.1 Shared Formulas

<row r="7" spans="4:8">
  <c r="H7" s="1">
    <f t="shared" ref="H7:H11" ce="1" si="0">SUM(E7:G7)</f>
    <v>1.0246225028914113</v>
  </c>
</row>
<row r="8" spans="4:8">
  <c r="H8" s="1">
    <f t="shared" ce="1" si="0">SUM(E8:G8)</f>
    <v>0.9063376048733931</v>
  </c>
</row>

Just as strings in cells can be extremely pervasive and redundant in a sheet (and therefore must be optimized), formulas are also extremely pervasive in a sheet, and often can be optimized.

Consider the table in the above example, where column H contains a formula that sums the numbers in columns E through G, for each row.

The only difference between the formulas in H6:H12 is that the reference increases by 1 row from one row to the next.

Therefore, an optimization is created where only the formula in H6 needs to be written out, with some additional information indicating how far to propagate the formula once loaded.

This enables the loading application to load and parse only the first of the shared formulas, and then more quickly apply the necessary transforms to produce the additional related formulas in subsequent cells.

Note that while formulas can be shared, it is desirable to enable easy access to the contents of a cell.

Therefore, it is allowed that all formulas may be written out, but only the primary formula in a shared formula need be loaded and parsed.

3.2.9 Cell

<c r="B3">
  <f>B2+1</f>
  <v>2</v>
</c>

The cell itself is expressed by the c collection.

Each cell indicates it’s location in the grid using A1-style reference notation.

A cell can also indicate a style identifier (attribute s) and a data type (attribute t).

The cell types include string, number, and Boolean.

In order to optimize load/save operations, default data values are not written out.

3.2.9.1 Cell Values

Cells contain values, whether the values were directly typed in (e.g., cell A2 in our example has the value External Link:) or are the result of a calculation (e.g., cell B3 in our example has the formula B2+1).

String values in a cell are not stored in the cell table unless they are the result of a calculation.

Therefore, instead of seeing External Link: as the content of the cell’s v node, instead you see a zero-based index into the shared string table where that string is stored uniquely.

This is done to optimize load/save performance and to reduce duplication of information.

To determine whether the 0 in v is a number or an index to a string, the cell’s data type must be examined.

When the data type indicates string, then it is an index and not a numeric value.

在ECMA part4中,终于找到了关于公式的定义

Apache POI 用eventmodel 读取大文件Excel (3) Xlsx格式内容补充

还有普通类型的定义
Apache POI 用eventmodel 读取大文件Excel (3) Xlsx格式内容补充

当前就补充到这里,如果后续有需要再返回来补充

上一篇:20.12.12 376. 摆动序列


下一篇:leetcode 376. 摆动序列 思考分析