[Software Engineering Practice] Pig Project 3: Source Analysis of the data Directory, Tuple (Part 2)

2022-09-11 19:15:18

2021SC@SDUSC

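The previous post covered TupleFactory (an abstract class) and TupleMaker (an interface). This post continues the source analysis, again with the help of the blog post excerpted below.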

Blog excerpt:

In BinSedesTupleFactory, the newTuple methods return BinSedesTuple objects. BinSedesTuple extends DefaultTuple, and DefaultTuple has a field List<Object> mFields, which is where the tuple's data is stored; the object that mFields holds is an ArrayList<Object>. Class diagram:

Here is the BinSedesTupleFactory code from the project:

/**
 * Default implementation of TupleFactory.
 */
@InterfaceAudience.Private
public class BinSedesTupleFactory extends TupleFactory {

    @Override
    public Tuple newTuple() {
        return new BinSedesTuple();
    }

    @Override
    public Tuple newTuple(int size) {
        return new BinSedesTuple(size);
    }

    @Override
    @SuppressWarnings("unchecked")
    public Tuple newTuple(List c) {
        return new BinSedesTuple(c);
    }

    @Override
    @SuppressWarnings("unchecked")
    public Tuple newTupleNoCopy(List list) {
        return new BinSedesTuple(list, 1);
    }

    @Override
    public Tuple newTuple(Object datum) {
        Tuple t = new BinSedesTuple(1);
        try {
            t.set(0, datum);
        } catch (ExecException e) {
            // The world has come to an end, we just allocated a tuple with one slot
            // but we can't write to that slot.
            throw new RuntimeException("Unable to write to field 0 in newly " +
                    "allocated tuple of size 1!", e);
        }
        return t;
    }

    @Override
    public Class<? extends Tuple> tupleClass() {
        return BinSedesTuple.class;
    }

    @Override
    public Class<? extends TupleRawComparator> tupleRawComparatorClass() {
        return BinSedesTuple.getComparatorClass();
    }

    @Override
    public boolean isFixedSize() {
        return false;
    }
}

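As you can see, newTuple is overloaded several times. From the previous post we know that most TupleFactory methods are abstract; every method shown here implements one of them, so each is an override. The declared return type of newTuple is Tuple, while the object actually returned is a new BinSedesTuple. Tuple is a supertype of BinSedesTuple. Strictly speaking, the relationship is the one in the diagram above: BinSedesTuple extends DefaultTuple, which extends AbstractTuple, which implements the Tuple interface.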

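This also shows the two sides of inheritance: on the one hand a subclass can reuse the parent's code; on the other hand it must supply implementations for the parent's abstract methods.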
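To make the overload/override distinction concrete, here is a minimal sketch of the same pattern. All names in it (MiniTuple, ListTuple, MiniTupleFactory, ListTupleFactory) are hypothetical stand-ins invented for illustration, not Pig's classes, and the storage is an assumption that only loosely follows the description of mFields above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Pig's classes.
interface MiniTuple {
    int size();
    Object get(int fieldNum);
    void set(int fieldNum, Object val);
}

class ListTuple implements MiniTuple {
    private final List<Object> fields;

    ListTuple(int size) {
        fields = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            fields.add(null); // pre-fill the slots so set(i, ...) is valid
        }
    }

    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
    @Override public void set(int fieldNum, Object val) { fields.set(fieldNum, val); }
}

abstract class MiniTupleFactory {
    // Overloading: one name, three different parameter lists.
    abstract MiniTuple newTuple();
    abstract MiniTuple newTuple(int size);
    abstract MiniTuple newTuple(Object datum);
}

class ListTupleFactory extends MiniTupleFactory {
    // Overriding: each method implements an abstract method of the parent.
    // The declared return type is the interface; the object is a ListTuple.
    @Override MiniTuple newTuple() { return new ListTuple(0); }
    @Override MiniTuple newTuple(int size) { return new ListTuple(size); }
    @Override MiniTuple newTuple(Object datum) {
        MiniTuple t = new ListTuple(1);
        t.set(0, datum);
        return t;
    }
}
```

A caller would hold the factory through its abstract type, for example `MiniTupleFactory f = new ListTupleFactory();`, and never needs to name the concrete tuple class. Note that `f.newTuple(3)` selects the int overload and creates three empty slots, whereas `f.newTuple("a")` selects the Object overload and creates one filled slot.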

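This raises a question: is an inheritance hierarchy in object-oriented design really necessary? Judging from the analysis so far, doesn't inheritance feel cumbersome, with five or six source files needed for what a single one could do? Let's take Tuple as an example.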

NULL: the marker for a field whose value is null.
/**
 * Marker for indicating whether the value this object holds
 * is a null
 */
public static byte NULL = 0x00;

NOTNULL: the marker for a field whose value is not null.
/**
 * Marker for indicating whether the value this object holds
 * is not a null
 */
public static byte NOTNULL = 0x01;

reference(Tuple): makes this tuple point at another tuple's contents without copying the underlying data. It is deprecated.
/**
 * Make this tuple reference the contents of another.  This method does not copy
 * the underlying data.   It maintains references to the data from the original
 * tuple (and possibly even to the data structure holding the data).
 * @param t Tuple to reference.
 */
@Deprecated
void reference(Tuple t);

size(): the number of fields in the tuple. The method was formerly named arity().
/**
 * Find the size of the tuple.  Used to be called arity().
 * @return number of fields in the tuple.
 */
int size();

isNull(int): whether the given field is null. It throws ExecException when the field number is greater than or equal to the number of fields.
/**
 * Find out if a given field is null.
 * @param fieldNum Number of field to check for null.
 * @return true if the field is null, false otherwise.
 * @throws ExecException if the field number given is greater
 * than or equal to the number of fields in the tuple.
 */
boolean isNull(int fieldNum) throws ExecException;

getType(int): the type of the given field, encoded as one of the byte values defined in DataType. For a null field, DataType.UNKNOWN is returned.
/**
 * Find the type of a given field.
 * @param fieldNum Number of field to get the type for.
 * @return type, encoded as a byte value.  The values are defined in
 * {@link DataType}.  If the field is null, then DataType.UNKNOWN
 * will be returned.
 * @throws ExecException if the field number is greater than or equal to
 * the number of fields in the tuple.
 */
byte getType(int fieldNum) throws ExecException;

get(int): the value of the given field, as an Object.
/**
 * Get the value in a given field.
 * @param fieldNum Number of the field to get the value for.
 * @return value, as an Object.
 * @throws ExecException if the field number is greater than or equal to
 * the number of fields in the tuple.
 */
Object get(int fieldNum) throws ExecException;

getAll(): all fields of the tuple, in order, as a list.
/**
 * Get all of the fields in the tuple as a list.
 * @return a list of objects containing the fields of the tuple
 * in order.
 */
List<Object> getAll();

set(int, Object): sets the value of a field. It never grows the tuple, so for a tuple created with newTuple(2) the only valid field numbers are 0 and 1.
/**
 * Set the value in a given field.  This should not be called unless
 * the tuple was constructed by {@link TupleFactory#newTuple(int)} with an
 * argument greater than the fieldNum being passed here.  This call will
 * not automatically expand the tuple size.  That is if you called 
 * {@link TupleFactory#newTuple(int)} with a 2, it is okay to call
 * this function with a 1, but not with a 2 or greater.
 * @param fieldNum Number of the field to set the value for.
 * @param val Object to put in the indicated field.
 * @throws ExecException if the field number is greater than or equal to
 * the number of fields in the tuple.
 */
void set(int fieldNum, Object val) throws ExecException;

append(Object): appends a field to the tuple. It may have to copy the existing data to grow the structure, so when the size is known it is better to call newTuple(int) and then fill the fields with set(int, Object).
/**
 * Append a field to a tuple.  This method is not efficient as it may
 * force copying of existing data in order to grow the data structure.
 * Whenever possible you should construct your Tuple with 
 * {@link TupleFactory#newTuple(int)} and then fill in the values with 
 * {@link #set(int, Object)}, rather
 * than construct it with {@link TupleFactory#newTuple()} and append values.
 * @param val Object to append to the tuple.
 */
void append(Object val);

getMemorySize(): an estimate, in bytes, of the tuple's size in memory. Data bags use it to determine their own memory size.
/**
 * Determine the size of tuple in memory.  This is used by data bags
 * to determine their memory size.  This need not be exact, but it
 * should be a decent estimation.
 * @return estimated memory size, in bytes.
 */
long getMemorySize();

toDelimitedString(String): joins the result of calling toString on each field, separated by the given delimiter. The declared ExecException is never thrown; it exists only for backward compatibility.
/** 
 * Write a tuple of values into a string. The output will be the result
 * of calling toString on each of the values in the tuple.
 * @param delim Delimiter to use in the string.
 * @return A string containing the tuple.
 * @throws ExecException this is never thrown. This only exists for backwards compatability reasons.
 */
String toDelimitedString(String delim) throws ExecException;
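The difference between set and append described in these comments can be sketched with a small stand-in class. SketchTuple below is hypothetical and is not Pig's implementation: it assumes list-backed storage, and it uses the unchecked IndexOutOfBoundsException where Pig declares the checked ExecException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the documented contract; not Pig's code.
class SketchTuple {
    private final List<Object> fields;

    // Analogous to newTuple(int): allocates `size` slots, all initially null.
    SketchTuple(int size) {
        fields = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            fields.add(null);
        }
    }

    int size() { return fields.size(); }

    boolean isNull(int fieldNum) { return fields.get(fieldNum) == null; }

    // Like the documented set(): writes into an existing slot, never grows.
    // Assumption: IndexOutOfBoundsException stands in for ExecException.
    void set(int fieldNum, Object val) {
        if (fieldNum < 0 || fieldNum >= fields.size()) {
            throw new IndexOutOfBoundsException(
                    "field " + fieldNum + " but size is " + fields.size());
        }
        fields.set(fieldNum, val);
    }

    // Like the documented append(): grows the tuple by one field.
    void append(Object val) { fields.add(val); }
}
```

With `new SketchTuple(2)`, calling `set(1, "b")` succeeds and `set(2, "c")` is rejected, while `append("c")` grows the tuple to three fields. This is the same rule the set comment states: a size of 2 allows field number 1 but not 2.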

Note: Tuple is an interface, so its methods are only declared, not implemented. AbstractTuple supplies implementations for some of them. Let's look at its source:

public abstract class AbstractTuple implements Tuple {

    @Override
    public Iterator<Object> iterator() {
        return getAll().iterator();
    }

    @Override
    public String toString() {
        return TupleFormat.format(this);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String toDelimitedString(String delim) throws ExecException {
        return Joiner.on(delim).useForNull("").join(this);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public byte getType(int fieldNum) throws ExecException {
        return DataType.findType(get(fieldNum));
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isNull(int fieldNum) throws ExecException {
        return (get(fieldNum) == null);
    }

    @Override
    public boolean equals(Object other) {
        return (compareTo(other) == 0);
    }

    @Override
    public void reference(Tuple t) {
        throw new RuntimeException("Tuple#reference(Tuple) is deprecated and should not be used");
    }
}

This raises a question: toString is not declared in Tuple, so why can it be overridden here? The reason is as follows.

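In Java only interfaces support multiple inheritance, and Tuple does extend more than one interface; the @Override on iterator(), for example, refers to a method inherited through Iterable. toString, however, does not come from any of those interfaces. It is declared in java.lang.Object, which every class, AbstractTuple included, extends implicitly, so what is overridden here is the method inherited from Object (the same holds for equals).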

AbstractTuple implements part of the interface. If a subclass does not override one of these methods, the AbstractTuple version is the one that gets called.
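The two points above (an abstract class that implements part of an interface, and toString being overridable because it comes from Object) can be sketched as follows. SimpleTuple, SkeletalTuple and ArrayBackedTuple are hypothetical names invented for this example. The hand-written join loop is an assumption that imitates what Joiner.on(delim).useForNull("") does in the real code, so that the sketch needs no external library.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-ins; not Pig's Tuple or AbstractTuple.
interface SimpleTuple extends Iterable<Object> {
    int size();
    Object get(int fieldNum);
    List<Object> getAll();
    boolean isNull(int fieldNum);
    String toDelimitedString(String delim);
}

// Partial implementation: everything that can be derived from get()/getAll().
abstract class SkeletalTuple implements SimpleTuple {
    @Override public Iterator<Object> iterator() { return getAll().iterator(); }

    @Override public boolean isNull(int fieldNum) { return get(fieldNum) == null; }

    @Override public String toDelimitedString(String delim) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Object o : this) {
            if (!first) {
                sb.append(delim);
            }
            sb.append(o == null ? "" : o.toString()); // null becomes ""
            first = false;
        }
        return sb.toString();
    }

    // Not declared in SimpleTuple: this overrides java.lang.Object#toString.
    @Override public String toString() { return "(" + toDelimitedString(",") + ")"; }
}

// Concrete subclass: supplies only storage and inherits the rest.
class ArrayBackedTuple extends SkeletalTuple {
    private final List<Object> fields;

    ArrayBackedTuple(Object... values) { fields = Arrays.asList(values); }

    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
    @Override public List<Object> getAll() { return fields; }
}
```

ArrayBackedTuple defines neither isNull, toDelimitedString nor toString, yet all three work on it, because the calls resolve to the versions in SkeletalTuple. A second storage strategy would only need to implement size, get and getAll again.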

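AbstractTuple calls into several other classes, such as TupleFormat. That file lives in the impl directory, so I won't analyze it here; only its class comment is given.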

TupleFormat is the default implementation of the tuple format, used by Dump and PigDump:

/**
 * Default implementation of format of Tuple. Dump and PigDump use this default
 * implementation
 */
public class TupleFormat

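After going round in circles, we suddenly realize that after all this analysis we still don't know what a Tuple actually stores! That is because the storage itself is implemented in DefaultTuple. My guess at the reason is that Tuple is meant to support several kinds of implementation, of which DefaultTuple is only one; the others extend AbstractTuple just as DefaultTuple does. Personally I find this design quite cumbersome: defining all the methods once at the top level would be enough. This is a problem you have to face when using inheritance, because inheritance is not always easier to work with than writing the code directly. As for supporting multiple types, that could in fact be done with generics (templates).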
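As a rough picture of what the storage looks like, here is a minimal sketch built only on what the blog excerpt says: the fields sit in a List<Object> named mFields that is backed by an ArrayList. The class name FieldStore and the three type codes are hypothetical; the codes are not Pig's DataType constants, and the instanceof lookup merely imitates the idea of DataType.findType.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of list-backed storage; not Pig's DefaultTuple.
class FieldStore {
    // Illustrative type codes, not Pig's DataType constants.
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 1;
    static final byte CHARARRAY = 2;

    // Same declaration the excerpt describes: fields of any type, in order.
    private final List<Object> mFields = new ArrayList<Object>();

    void append(Object val) { mFields.add(val); }

    Object get(int fieldNum) { return mFields.get(fieldNum); }

    int size() { return mFields.size(); }

    // Recovers a type code at run time, because the list only knows Object.
    byte getType(int fieldNum) {
        Object o = mFields.get(fieldNum);
        if (o instanceof Integer) {
            return INTEGER;
        }
        if (o instanceof String) {
            return CHARARRAY;
        }
        return UNKNOWN; // null and every type not listed above
    }
}
```

Because the element type is Object, one tuple can mix an integer, a string and a null in the same list, and callers get the values back as Object together with a type code.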

To close, one question: in this design, is such an inheritance hierarchy really necessary?

That's all for this post.
