Implementing an MVPtree in Java: building the core algorithm code

2022-12-16 21:32:38

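A project required me to implement the MVPtree, a fairly obscure data structure, in Java, but I could not find a finished Java implementation online. I am used to reading C++, yet the C++ implementations of a structure this complex still left me confused. Fortunately the client pointed me to a VPtree written in Java by someone on GitHub (a user named "job" something), and its overall structure can be adapted for an MVPtree.

For background on the MVPtree itself, please search elsewhere; this post covers only the algorithm implementation.

A point-search tree has two main problems to solve: how to avoid visiting points that cannot match, and how to reduce the number of distance computations. The first is the easier one to address. The point set can be split into two rectangular halves or, more imaginatively, split by distance into a disc and the ring around it, which is efficient because every query is itself expressed in terms of point distance. The second is less widely applicable; the usual optimization is to precompute each point's distance to the vantage points and use those stored distances to filter candidates at query time.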

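The VPtree is an example of partitioning a point set by distance. Each node holds a point set. An arbitrary point is chosen as the vantage point, and the set is split by distance to that point into two subsets of equal size, which go to the node's two children. A query descends the tree by the same rule. The VPtree does nothing about the second problem, so it suits cheap distance functions such as Manhattan or Euclidean distance.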
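
The median split described above can be sketched as follows. This is a minimal illustration, not code from the project: the class VantageSplitSketch, its methods, and the use of double[] points with Euclidean distance are all assumptions made for the example.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the original project.
final class VantageSplitSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Sorts the points in place by distance to the vantage point and returns
     * the distance of the middle element. Points before index size/2 form the
     * inner half, the remaining points form the outer half.
     */
    static double splitByMedian(List<double[]> points, double[] vantage) {
        points.sort(Comparator.comparingDouble(p -> euclidean(p, vantage)));
        return euclidean(points.get(points.size() / 2), vantage);
    }
}
```

A real tree would select the median without a full sort, but sorting keeps the sketch short.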

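The MVPtree inherits distance-based partitioning from the VPtree, but each node uses two vantage points to split its set into four subsets, and a per-point path array limits how often the distance function has to run. Four subsets rather than two give a finer partition and fewer irrelevant points. Storing the distances to a bounded number of vantage points reduces distance computations under heavy query load, and because those vantage points tend to be spread far apart, large irrelevant regions are ruled out, which works very well. This structure suits distance functions that are expensive to evaluate, such as Chebyshev distance and the like.

Next comes the implementation.

A point in an MVPtree differs from one in a VPtree in that it also carries an array of distances to vantage points, so a dedicated point type is needed: the MVPtree point.

The core code is as follows: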

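import java.util.ArrayList;
import java.util.List;

/** A point together with its distances to the vantage points above it in the tree. */
public class MVPTreePoint<P> {
    // path.get(i) is the distance from this point to the i-th vantage point on its path.
    private final List<Double> path = new ArrayList<>();
    private final P point;
    // Upper bound on the number of distances stored per point.
    private final int maxLevel;

    public MVPTreePoint(final P point, final int maxLevel) {
        this.point = point;
        this.maxLevel = maxLevel;
    }

    public void addDistanceToSelf(final MVPTreePoint<P> vantagePointElement, final DistanceFunction<P> distanceFunction) {
        this.addDistanceToSelf(vantagePointElement.point, distanceFunction);
    }

    public void addDistanceToSelf(final P vantagePoint, final DistanceFunction<P> distanceFunction) {
        // Skip the (possibly expensive) distance call once the path is full.
        if (this.path.size() < this.maxLevel) {
            this.path.add(distanceFunction.getDistance(this.point, vantagePoint));
        }
    }

    public void addDistanceToSelf(final double distance) {
        if (this.path.size() < this.maxLevel) {
            this.path.add(distance);
        }
    }

    public void removeDistanceToSelf(final int position) {
        // Out-of-range positions are ignored, mirroring the capped add above.
        if (position >= 0 && position < this.path.size()) {
            this.path.remove(position);
        }
    }

    public double getDistanceToSelf(final int i) {
        return this.path.get(i);
    }

    public int size() {
        return this.path.size();
    }

    public void clearPath() {
        this.path.clear();
    }

    public P getPoint() {
        return this.point;
    }

    // Two tree points are equal when the wrapped points are equal; the path is ignored.
    @Override
    public boolean equals(final Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof MVPTreePoint)) {
            return false;
        }
        return this.point.equals(((MVPTreePoint<?>) o).point);
    }

    // Kept consistent with equals so the class is safe in hash-based collections.
    @Override
    public int hashCode() {
        return this.point.hashCode();
    }
}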

MVPTreePoint

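Keeping the distance array on the point class rather than folding it into the tree node class makes the structure clearer and makes it easy to read a distance straight from a point.

The MVPtree differs from the VPtree in many places, but most of the changes amount to renaming classes and replacing P and E with MVPTreePoint<P> and MVPTreePoint<E>. This post focuses on the core algorithms: building the tree and querying points.

Compared with a VPtree, building an MVPtree means choosing one more vantage point, partitioning the array two more times, and also storing the distance from each vantage point to every point.

capacity is the capacity of a leaf node. Choose a moderate value based on the size of the data.

The original paper takes the vantage points out of the point set and stores them separately. In this implementation, however, a vantage point serves only as a reference and stays in the point set during construction. The structure ends up storing only about quantityOfPoint/capacity extra points, and the code becomes much simpler to write.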
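
Before the real constructor, the four-way split can be sketched in isolation. This is an illustrative simplification, assuming a median split at every step: the class FourWaySplitSketch and its methods are hypothetical, and the project's own code delegates threshold selection to a strategy object instead of sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper for illustration; not part of the original project.
final class FourWaySplitSketch {

    /** Sorts a copy of the points by distance to the vantage point and cuts it in the middle. */
    static <P> List<List<P>> halve(List<P> points, P vantage, ToDoubleBiFunction<P, P> dist) {
        List<P> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingDouble((P p) -> dist.applyAsDouble(p, vantage)));
        int mid = sorted.size() / 2;
        List<List<P>> halves = new ArrayList<>();
        halves.add(new ArrayList<>(sorted.subList(0, mid)));
        halves.add(new ArrayList<>(sorted.subList(mid, sorted.size())));
        return halves;
    }

    /**
     * Splits by the first vantage point, then splits each half by the second.
     * Bucket order: [near, near], [near, far], [far, near], [far, far].
     */
    static <P> List<List<P>> split(List<P> points, P first, P second, ToDoubleBiFunction<P, P> dist) {
        List<List<P>> buckets = new ArrayList<>();
        for (List<P> half : halve(points, first, dist)) {
            buckets.addAll(halve(half, second, dist));
        }
        return buckets;
    }
}
```

Each half gets its own second threshold, which is why the node code keeps a two-element secondThreshold array next to the single firstThreshold.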

    // Requires java.util.ArrayList, java.util.Collection, java.util.List and java.util.Random.
    @SuppressWarnings("unchecked")
    public MVPTreeNode(
            final Collection<MVPTreePoint<E>> pointNodes,
            final DistanceFunction<P> distanceFunction,
            final MVPThresholdSelectionStrategy<P, E> thresholdSelectionStrategy,
            final int capacity, final int maxLevel) {
        if (capacity < 1) {
            throw new IllegalArgumentException("Capacity must be positive.");
        }
        if (pointNodes.isEmpty()) {
            throw new IllegalArgumentException(
                    "Cannot create a MVPTreeNode with an empty list of points.");
        }
        this.capacity = capacity;
        this.maxLevel = maxLevel;
        this.distanceFunction = distanceFunction;
        this.thresholdSelectionStrategy = thresholdSelectionStrategy;
        // Copy the input so that partitioning never reorders the caller's collection.
        this.pointNodes = new ArrayList<>(pointNodes);
        // children[i][j]: i is the side of the first threshold, j the side of the second.
        this.children = new MVPTreeNode[2][2];
        this.vantagePoint = (E[]) new Object[2];
        this.secondThreshold = new double[2];
        this.anneal();
    }

    @SuppressWarnings("unchecked")
    protected void anneal() {
        if (this.pointNodes == null) {
            // Internal node: count the points in the four children.
            int total = 0;
            boolean anyChildEmpty = false;
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {
                    final int size = this.children[i][j].size();
                    total += size;
                    anyChildEmpty |= size == 0;
                }
            }
            if (anyChildEmpty) {
                // One of the child nodes has become empty and needs to be pruned:
                // pull every point back into this node and rebuild it.
                this.pointNodes = new ArrayList<>(total);
                this.addAllPointsToCollection(this.pointNodes);
                for (final MVPTreePoint<E> pointNode : this.pointNodes) {
                    pointNode.clearPath();
                }
                for (int i = 0; i < 2; i++) {
                    for (int j = 0; j < 2; j++) {
                        this.children[i][j] = null;
                    }
                }
                this.anneal();
            } else {
                for (int i = 0; i < 2; i++) {
                    for (int j = 0; j < 2; j++) {
                        this.children[i][j].anneal();
                    }
                }
            }
        } else {
            // This node still holds its points: pick a random first vantage
            // point and partition the list around the first threshold.
            final Random random = new Random();
            final int firstVantagePointIndex = random.nextInt(this.pointNodes.size());
            this.vantagePoint[0] = this.pointNodes.get(firstVantagePointIndex).getPoint();
            this.firstThreshold = this.thresholdSelectionStrategy.selectThreshold(
                    this.pointNodes, this.vantagePoint[0], this.distanceFunction);
            final int firstIndexPastThreshold;
            try {
                firstIndexPastThreshold = MVPTreeNode.partitionPoints(
                        this.pointNodes, this.vantagePoint[0],
                        this.firstThreshold, this.distanceFunction);
            } catch (final PartitionException e) {
                // The points cannot be separated, so keep them all in this node.
                this.storeInOneNode();
                return;
            }
            if (this.pointNodes.size() > this.capacity) {
                // subTreeList[0] is the near half, subTreeList[1] the far half;
                // both are views onto this.pointNodes.
                final List<MVPTreePoint<E>>[] subTreeList = new List[2];
                subTreeList[0] = this.pointNodes.subList(0, firstIndexPastThreshold);
                subTreeList[1] = this.pointNodes.subList(
                        firstIndexPastThreshold, this.pointNodes.size());
                // if points can be divided into 2 parts, find second vantage
                // point (taken from the far half) and try to split point array
                final int secondVantagePointIndex = random.nextInt(subTreeList[1].size());
                this.vantagePoint[1] = subTreeList[1].get(secondVantagePointIndex).getPoint();
                final int[] splitPosition = new int[2];
                for (int i = 0; i < 2; i++) {
                    this.secondThreshold[i] = this.thresholdSelectionStrategy.selectThreshold(
                            subTreeList[i], this.vantagePoint[1], this.distanceFunction);
                    try {
                        splitPosition[i] = MVPTreeNode.partitionPoints(
                                subTreeList[i], this.vantagePoint[1],
                                this.secondThreshold[i], this.distanceFunction);
                    } catch (final PartitionException e) {
                        this.storeInOneNode();
                        return;
                    }
                }
                // Record both vantage-point distances on every point's path.
                for (final MVPTreePoint<E> pointNode : this.pointNodes) {
                    pointNode.addDistanceToSelf(this.distanceFunction.getDistance(
                            pointNode.getPoint(), this.vantagePoint[0]));
                    pointNode.addDistanceToSelf(this.distanceFunction.getDistance(
                            pointNode.getPoint(), this.vantagePoint[1]));
                }
                // Build the four children; each constructor copies its sublist.
                for (int i = 0; i < 2; i++) {
                    this.children[i][0] = new MVPTreeNode<>(
                            subTreeList[i].subList(0, splitPosition[i]),
                            this.distanceFunction,
                            this.thresholdSelectionStrategy, this.capacity,
                            this.maxLevel);
                    this.children[i][1] = new MVPTreeNode<>(
                            subTreeList[i].subList(splitPosition[i], subTreeList[i].size()),
                            this.distanceFunction,
                            this.thresholdSelectionStrategy, this.capacity,
                            this.maxLevel);
                }
                this.pointNodes = null;
            } else {
                this.storeInOneNode();
            }
        }
    }

    // Turns this node into a leaf: the second vantage point becomes the point
    // farthest from the first one, and all child references are cleared.
    private void storeInOneNode() {
        int maxIndex = 0;
        double maxDistance = this.distanceFunction.getDistance(
                this.pointNodes.get(0).getPoint(), this.vantagePoint[0]);
        for (int i = 1; i < this.pointNodes.size(); i++) {
            final double curDistance = this.distanceFunction.getDistance(
                    this.pointNodes.get(i).getPoint(), this.vantagePoint[0]);
            if (maxDistance < curDistance) {
                maxDistance = curDistance;
                maxIndex = i;
            }
        }
        this.vantagePoint[1] = this.pointNodes.get(maxIndex).getPoint();
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                this.children[i][j] = null;
            }
        }
    }

init MVPtree

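The original author provides two kinds of query: the k points nearest to the query point, and all points no farther than a distance u from the query point.

The k-nearest search can follow the VPtree approach: search the child that contains the query point first, then the other children. The order of the checks matters. First test whether the collector is full (while it is not, every point goes straight in). Only then test whether the collector's farthest distance from the query point (a fixed distance for the second kind of query) is smaller than the smallest possible distance between the query point and the point or point set. This check is useful at both internal nodes and leaves.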

    public void collectNearestNeighbors(
            final NearestNeighborCollector<P, E> collector, final int depth) {
        if (this.pointNodes == null) {
            // O1-Q: distance from the first vantage point to the query point
            final double distanceFromFirstVantagePointToQueryPoint = this.distanceFunction
                    .getDistance(this.vantagePoint[0], collector.getQueryPoint().getPoint());
            // O2-Q: distance from the second vantage point to the query point
            final double distanceFromSecondVantagePointToQueryPoint = this.distanceFunction
                    .getDistance(this.vantagePoint[1], collector.getQueryPoint().getPoint());
            // Extend the query path so leaves can filter with stored distances.
            collector.getQueryPoint().addDistanceToSelf(
                    distanceFromFirstVantagePointToQueryPoint);
            collector.getQueryPoint().addDistanceToSelf(
                    distanceFromSecondVantagePointToQueryPoint);
            // Search the child that contains the query point first.
            final MVPTreeNode<P, E> index = this
                    .getChildNodeForPoint(collector.getQueryPoint().getPoint());
            index.collectNearestNeighbors(collector, depth + 1);
            // O1-Q - O1-S1: gap to the near side; negated below for the far side
            double basicDistance = distanceFromFirstVantagePointToQueryPoint
                    - this.firstThreshold;
            for (int i = 0; i < 2; i++) {
                if (!collector.isFull() || basicDistance <= collector.getRadius()) {
                    // O2-Q - O2-S2: the same gap for the second vantage point
                    double touchDistance = distanceFromSecondVantagePointToQueryPoint
                            - this.secondThreshold[i];
                    for (int j = 0; j < 2; j++) {
                        if (index != this.children[i][j]
                                && (!collector.isFull() || touchDistance <= collector.getRadius())) {
                            this.children[i][j].collectNearestNeighbors(collector, depth + 1);
                        }
                        touchDistance *= -1;
                    }
                }
                basicDistance *= -1;
            }
            // Drop this level's two entries from the query path, last one first.
            collector.getQueryPoint().removeDistanceToSelf(depth + depth + 1);
            collector.getQueryPoint().removeDistanceToSelf(depth + depth);
        } else {
            // Leaf: accept everything until the collector is full, then filter.
            for (final MVPTreePoint<E> pointNode : this.pointNodes) {
                if (!collector.isFull() || this.isAbleToInsert(collector.getRadius(),
                        collector.getQueryPoint(), pointNode)) {
                    collector.offerPoint(pointNode.getPoint());
                }
            }
        }
    }

collectNearestNeighbors
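
The code above relies on a collector that exposes isFull, getRadius and an offer operation. The project's NearestNeighborCollector is not shown in this post, so the following is only a guess at how such a collector might behave, built on a max-heap. The class KNearestCollectorSketch, its Entry type and the int ids are all hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for illustration; not the project's NearestNeighborCollector.
final class KNearestCollectorSketch {

    private static final class Entry {
        final int id;
        final double distance;

        Entry(int id, double distance) {
            this.id = id;
            this.distance = distance;
        }
    }

    private final int k;
    // Max-heap on distance: the head is the farthest neighbor kept so far.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.distance).reversed());

    KNearestCollectorSketch(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be positive.");
        }
        this.k = k;
    }

    boolean isFull() {
        return heap.size() >= k;
    }

    /** Distance of the farthest kept neighbor; infinite while the collector is still filling. */
    double getRadius() {
        return isFull() ? heap.peek().distance : Double.POSITIVE_INFINITY;
    }

    /** Keeps the candidate if there is room or if it beats the current farthest neighbor. */
    void offer(int id, double distance) {
        if (!isFull()) {
            heap.add(new Entry(id, distance));
        } else if (distance < heap.peek().distance) {
            heap.poll();
            heap.add(new Entry(id, distance));
        }
    }

    int size() {
        return heap.size();
    }
}
```

Because the radius only shrinks once the collector is full, visiting the most promising child first lets later children be pruned more often.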

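The search for points within distance u is the algorithm described in the paper. Its steps overlap with the k-nearest search; the differences are that the limit is a fixed value that must be checked every time, and that the result set has no size limit.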

    public void collectAllWithinDistance(final MVPTreePoint<P> queryPoint,
            final double maxDistance, final Collection<E> collection, final int depth) {
        if (this.pointNodes == null) {
            final double distanceFromFirstVantagePointToQueryPoint = this.distanceFunction
                    .getDistance(this.vantagePoint[0], queryPoint.getPoint());
            final double distanceFromSecondVantagePointToQueryPoint = this.distanceFunction
                    .getDistance(this.vantagePoint[1], queryPoint.getPoint());
            // Extend the query path so leaves can filter with stored distances.
            queryPoint.addDistanceToSelf(distanceFromFirstVantagePointToQueryPoint);
            queryPoint.addDistanceToSelf(distanceFromSecondVantagePointToQueryPoint);
            // We want to search any of this node's children that intersect with
            // the query region
            if (distanceFromFirstVantagePointToQueryPoint <= this.firstThreshold + maxDistance) {
                // Near side of the first threshold.
                if (distanceFromSecondVantagePointToQueryPoint <= this.secondThreshold[0]
                        + maxDistance) {
                    this.children[0][0].collectAllWithinDistance(queryPoint,
                            maxDistance, collection, depth + 1);
                }
                if (distanceFromSecondVantagePointToQueryPoint + maxDistance >= this.secondThreshold[0]) {
                    this.children[0][1].collectAllWithinDistance(queryPoint,
                            maxDistance, collection, depth + 1);
                }
            }
            if (distanceFromFirstVantagePointToQueryPoint + maxDistance >= this.firstThreshold) {
                // Far side of the first threshold.
                if (distanceFromSecondVantagePointToQueryPoint <= this.secondThreshold[1]
                        + maxDistance) {
                    this.children[1][0].collectAllWithinDistance(queryPoint,
                            maxDistance, collection, depth + 1);
                }
                if (distanceFromSecondVantagePointToQueryPoint + maxDistance >= this.secondThreshold[1]) {
                    this.children[1][1].collectAllWithinDistance(queryPoint,
                            maxDistance, collection, depth + 1);
                }
            }
            // Drop this level's two entries from the query path, last one first.
            queryPoint.removeDistanceToSelf(depth + depth + 1);
            queryPoint.removeDistanceToSelf(depth + depth);
        } else {
            // Leaf: keep every point that passes the path filter and the real distance check.
            for (final MVPTreePoint<E> pointNode : this.pointNodes) {
                if (this.isAbleToInsert(maxDistance, queryPoint, pointNode)) {
                    collection.add(pointNode.getPoint());
                }
            }
        }
    }

collectAllWithinDistance
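
The four nested conditions above can be read as one small decision table. The sketch below restates them as a pure function, under the assumption that children[i][0] holds the points at or below the threshold and children[i][1] the rest. The class RangePruningSketch and its method are hypothetical.

```java
// Hypothetical helper for illustration; not part of the original project.
final class RangePruningSketch {

    /**
     * Returns visit[i][j] == true when the query ball of the given radius can
     * intersect child [i][j]. d1 and d2 are the distances from the query point
     * to the first and second vantage points.
     */
    static boolean[][] childrenToVisit(double d1, double d2, double firstThreshold,
                                       double[] secondThreshold, double radius) {
        boolean[][] visit = new boolean[2][2];
        boolean[] firstSide = {
            d1 - radius <= firstThreshold, // ball reaches the near side
            d1 + radius >= firstThreshold  // ball reaches the far side
        };
        for (int i = 0; i < 2; i++) {
            if (!firstSide[i]) {
                continue;
            }
            visit[i][0] = d2 - radius <= secondThreshold[i];
            visit[i][1] = d2 + radius >= secondThreshold[i];
        }
        return visit;
    }
}
```

With d1 = 10, a first threshold of 5 and a radius of 2, the ball lies entirely outside the first threshold, so both near-side children are skipped without evaluating any further distances.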

Both queries need to compare the precomputed distances, so that check is factored into one function:

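    public boolean isAbleToInsert(final double limitDistance,
            final MVPTreePoint<P> queryPoint, final MVPTreePoint<E> pointNode) {
        // Compare only the levels that both paths have recorded, so a shorter
        // point path cannot cause an out-of-bounds read.
        final int sharedLevels = Math.min(queryPoint.size(), pointNode.size());
        for (int i = 0; i < sharedLevels; i++) {
            final double disOffset = queryPoint.getDistanceToSelf(i)
                    - pointNode.getDistanceToSelf(i);
            // By the triangle inequality, |d(q, v) - d(p, v)| <= d(q, p), so a
            // gap larger than the limit rules the point out without a distance call.
            if (Math.abs(disOffset) > limitDistance) {
                return false;
            }
        }
        // Only points that survive the cheap filter pay for a real distance computation.
        return this.distanceFunction.getDistance(pointNode.getPoint(),
                queryPoint.getPoint()) <= limitDistance;
    }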

isAbleToInsert
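
The filter works because, for any metric, the difference between two points' distances to a shared vantage point can never exceed the distance between the points themselves. A minimal sketch of that bound, using plain arrays in place of the stored paths (the class PathFilterSketch and its methods are hypothetical):

```java
// Hypothetical helper for illustration; not part of the original project.
final class PathFilterSketch {

    /**
     * Largest |d(q, v_i) - d(p, v_i)| over the levels both paths share.
     * By the triangle inequality this never exceeds d(q, p).
     */
    static double lowerBound(double[] queryPath, double[] pointPath) {
        int n = Math.min(queryPath.length, pointPath.length);
        double bound = 0.0;
        for (int i = 0; i < n; i++) {
            bound = Math.max(bound, Math.abs(queryPath[i] - pointPath[i]));
        }
        return bound;
    }

    /** True when the point is provably farther than the limit, so no distance call is needed. */
    static boolean canSkip(double[] queryPath, double[] pointPath, double limit) {
        return lowerBound(queryPath, pointPath) > limit;
    }
}
```

As a worked case on the number line with vantage points at 0 and 10: a query at 5 has path {5, 5}, a point at 1 has path {1, 9}, and the bound is max(|5 - 1|, |5 - 9|) = 4, which here equals the true distance. With a limit of 3 the point is rejected without calling the distance function at all.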

Other functions need changes as well, but none as structural as these three.

-------------------------------------------------------------------------------

Code: https://coding.net/u/funcfans/p/MVPtree-for-Java/git
