Statistical Analysis with ElasticSearch Aggregations

2023-11-13 09:54:22

https://blog.csdn.net/zxjiayou1314/article/details/53837719/

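ElasticSearch's selling points are well known: a distributed search engine built on Lucene, a friendly RESTful API, and so on.

Most articles focus on the ELK Stack and full-text search. This post uses a small case study to show how powerful ElasticSearch Aggregations are for statistical analysis.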

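The form looks like this:

Requirement: compute statistics over the collected survey responses. The statistics may include:

Responses collected per week/day/hour (this can become a bar chart; everybody loves a dashboard)
The same as above, restricted to a time range (for example, the last 90 days)
For users who chose answer A on question 1, the distribution of their answers to the other questions
For users who chose answer A on question 1 and answer B on question 2, the distribution of their other answers

The first two requirements query fields at the root of the document; the rest search fields of the nested sub-documents.

Visualization uses Chart.js and Twitter Bootstrap. The glue language is, naturally, the best language in the world (PHP). Installation and startup are simple enough to skip.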

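1. First Encounter

Just as a newcomer would learn Postgres, the steps are:

Create an index ("index" is both a noun and a verb; here it is the noun)
Define the mapping (the equivalent of a schema)
Import data with the bulk API
Query (this is where ElasticSearch shows its strength)

Creating the index and defining the mapping

Creating an index in ElasticSearch is cheap. The code below creates the index and specifies its mapping in the same call. The string type with not_analyzed is ElasticSearch 1.x/2.x mapping syntax; 5.0 and later use the keyword type for the same purpose.

Only the key parts of the code are shown (you were not going to run it anyway).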

$client = Elasticsearch\ClientBuilder::create()->build();
$params = [
    'index' => 'your_awesome_data',
    'body' => [
        'mappings' => [
            'ur_radio_answers' => [
                'properties' => [
                    'answer_id' => [ # the field name
                        'type' => 'string', # field type (optional; elasticsearch guesses it if omitted)
                        'index' => 'not_analyzed' # do not tokenize this field; index and match it as one whole value
                    ],
                    'user_id' => ['type' => 'string', 'index' => 'not_analyzed'],
                    'questions' => [
                        'type' => 'nested',
                        'properties' => [
                            'page_id' => [
                                'type' => 'string',
                                'index' => 'not_analyzed'
                            ],
                            'question_id' => ['type' => 'string', 'index' => 'not_analyzed'],
                            'question' => ['type' => 'string', 'index' => 'not_analyzed'],
                            'option' => ['type' => 'string', 'index' => 'not_analyzed']
                        ]
                    ],
                    'start_at' => [
                        'type' => 'date',
                        'format' => 'yyyy-MM-dd HH:mm:ss'
                    ],
                    'ended_at' => ['type' => 'date', 'format' => 'yyyy-MM-dd HH:mm:ss']
                ]
            ]
        ]
    ]
];

$client->indices()->create($params);

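Importing data with the bulk API

There is not much to see in this code. The only takeaway is to use the bulk API for batch imports.

bulk is the API for inserting documents in batches. Typically a few thousand Documents go into one call, because every call is one HTTP request.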

$client = Elasticsearch\ClientBuilder::create()->build();
$connect = new mysqli('localhost', 'root', 'STUPIDPASSWORD', 'db');

$max = 823880;
$cursor = 1000;

while ($cursor < $max) {
    $result = $connect->query("select * from raw_answer_265033 where wd_oaid > {$cursor} order by wd_oaid asc limit 1000");
    $params = [];
    while ($obj = $result->fetch_array()) {
        $pages = json_decode($obj['wd_answer_json']);
        $answer = [
            'answer_id' => $obj['wd_oaid'],
            'user_id' => $obj['wd_uin'],
            'questions' => [],
            'ip' => $obj['wd_ip'],
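            // 'H' is the 24-hour clock, matching HH in the mapping's date format ('h' would emit 12-hour values)
            'start_at' => date('Y-m-d H:i:s', $obj['wd_starttime']),
            'ended_at' => date('Y-m-d H:i:s', $obj['wd_endtime'])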
        ];
        foreach ($pages as $page) {
            foreach ($page->questions as $question) {
                foreach ($question->options as $option) {
                    if (isset($option->checked) && $option->checked == 1) {
                        $answer['questions'][] = [
                            'page_id' => $page->id,
                            'question_id' => $question->id,
                            'question' => trim(strip_tags(htmlspecialchars_decode($question->title))),
                            'option' => trim(strip_tags(htmlspecialchars_decode($option->text))),
                        ];
                    }
                }
            }
        }
        $cursor = $obj['wd_oaid'];
        $params['body'][] = [
            // _type must be the type the mapping was defined for, or the nested mapping is not applied
            'index' => ['_index' => 'your_awesome_data', '_type' => 'ur_radio_answers']
        ];
        $params['body'][] = $answer;
    }
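    // Stop when a page comes back empty; otherwise $cursor never advances and the loop never ends
    if (empty($params['body'])) {
        break;
    }
    // This is the key part: one bulk request per page of 1000 rows
    $response = $client->bulk($params);
    $params = [];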
}
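The action/source pairing that the bulk API expects can be isolated from the MySQL loop. The following is a minimal sketch, not the author's code: buildBulkBody and bulkBatches are hypothetical helper names, and the batch size of 1000 simply mirrors the page size used above.

```php
<?php
// Hypothetical helper: turn a list of documents into the alternating
// action/source array that the bulk API expects.
function buildBulkBody(array $docs, string $index, string $type): array
{
    $body = [];
    foreach ($docs as $doc) {
        // Action line: tells ElasticSearch where the next document goes.
        $body[] = ['index' => ['_index' => $index, '_type' => $type]];
        // Source line: the document itself.
        $body[] = $doc;
    }
    return $body;
}

// Hypothetical helper: split a large import into bulk requests of at most
// $batchSize documents each.
function bulkBatches(array $docs, string $index, string $type, int $batchSize = 1000): array
{
    $batches = [];
    foreach (array_chunk($docs, $batchSize) as $chunk) {
        $batches[] = ['body' => buildBulkBody($chunk, $index, $type)];
    }
    return $batches;
}
```

Each element returned by bulkBatches() has the shape that $client->bulk() takes, so the import loop reduces to calling $client->bulk($params) once per batch.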

After the glue code above assembles it, a single Document looks like this when indexed:

{
    "answer_id": "192013",
    "user_id": "2971957289",
    "questions": [  # an array whose length varies per document (a Nested Document in ElasticSearch)
        {
            "page_id": "p-12-Y1cU",
            "question_id": "q-35-gJ9a",
            "question": "八月飘香香满园（打一地名）",
            "option": "桂林"
        },
        {
            "page_id": "p-1-e8fe",
            "question_id": "q-4-irlF",
            "question": "遥知不是雪，为有暗香来（打一《红楼梦》人名）",
            "option": "王作梅"
        },
        {
            "page_id": "p-2-8jI8",
            "question_id": "q-48-WG7d",
            "question": "单刀赴会 （打一《水浒传》人名）",
            "option": "林冲"
        }
    ],
    "ip": "223.88.92.21",
    "start_at": "2016-02-21 12:02:01",
    "ended_at": "2016-02-21 13:18:15"
}

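Below is the response to an empty search. The took property is the query time, so this empty search took 42 ms. hits.total is the number of Documents; it reports about 820,000, which shows that the bulk import succeeded.

{
    "took": 42,
    "timed_out": false,
    "_shards": { "total": 5, "successful": 5, "failed": 0 },
    "hits": { "total": 822880, "max_score": 1.0, "hits": [ # search results omitted ] }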
}

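Querying

All of the above was preparation. Now for the requirements:

With no filter conditions, compute the share of each option for every question

This is the query: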

{
    "aggs": {
        "answers": {
            "nested": {
                "path": "questions"
            },
            "aggs": {
                "questions": {
                    "terms": {
                        "field": "questions.question",
                        "size": 100,
                        "order": {
                            "_count": "desc"
                        }
                    },
                    "aggs": {
                        "options": {
                            "terms": {
                                "field": "questions.option",
                                "size": 100,
                                "order": {
                                    "_count": "desc"
                                }
                            }
                        }
                    }
                }
            }
        },
        "dates": {
            "date_histogram": {
                "field": "ended_at",
                "interval": "day",
                "min_doc_count": 0
            },
            "aggs": {
                "user_count": {
                    "cardinality": {
                        "field": "answer_id"
                    }
                }
            }
        }
    }
}
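Since the rest of the article drives ElasticSearch from PHP, the same request can be written as a PHP array. This is a sketch under assumptions: buildSurveyAggs is a hypothetical function name, and the 'size' => 0 entry (which suppresses the hit list) is an addition that is not part of the JSON above.

```php
<?php
// Sketch: the aggregation request above expressed as a PHP array.
function buildSurveyAggs(int $size = 100): array
{
    $byCountDesc = ['_count' => 'desc'];

    return [
        // Step into the nested "questions" sub-documents first.
        'answers' => [
            'nested' => ['path' => 'questions'],
            'aggs' => [
                // One bucket per question text ...
                'questions' => [
                    'terms' => ['field' => 'questions.question', 'size' => $size, 'order' => $byCountDesc],
                    'aggs' => [
                        // ... and inside each, one bucket per chosen option.
                        'options' => [
                            'terms' => ['field' => 'questions.option', 'size' => $size, 'order' => $byCountDesc],
                        ],
                    ],
                ],
            ],
        ],
        // Daily buckets on the root document, including empty days.
        'dates' => [
            'date_histogram' => ['field' => 'ended_at', 'interval' => 'day', 'min_doc_count' => 0],
            'aggs' => [
                'user_count' => ['cardinality' => ['field' => 'answer_id']],
            ],
        ],
    ];
}

$params = [
    'index' => 'your_awesome_data',
    // Assumption: size 0 means "return aggregations only, no hits".
    'body' => ['size' => 0, 'aggs' => buildSurveyAggs()],
];
```

With a client built as in the earlier snippets, $client->search($params) would send this request.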

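This is the response. It took only 155 ms and returned two sets of statistics (dates and answers) in a single request.

The aggregations used in this query are described in a later section.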

{
    "took": 155,
    "timed_out": false,
    "_shards": { "total": 5, "successful": 5, "failed": 0},
    "hits": {"total": 822880, "max_score": 0, "hits": []},
    "aggregations": {
        "dates": {
            "buckets": [
                {"key_as_string": "2016-02-22 00:00:00", "key": 1456099200000, "doc_count": 573855, "user_count": {"value": 613589}},
                {"key_as_string": "2016-02-23 00:00:00", "key": 1456185600000, "doc_count": 35533,  "user_count": {"value": 32221}}
                # more buckets like the two above omitted
            ]
        },
        "answers": {
            "doc_count": 2738528,
            "questions": {
                "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                "buckets": [
                    {   "key": "千条线，万条线， 掉到水里看不见（打一自然现象）",
                        "doc_count": 166145,
                        "options": {
                            "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                            "buckets": [
                                {"key": "雨", "doc_count": 147481},
                                {"key": "雪", "doc_count": 11717},
                                {"key": "雾", "doc_count": 6947}
                            ]
                        }
                    },
                    {   "key": "细白嫩肉裹紫衣，霜儿一打不成器（打一蔬菜）",
                        "doc_count": 164585,
                        "options": {
                            "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                            "buckets": [
                                {"key": "茄子", "doc_count": 136404},
                                {"key": "紫薯", "doc_count": 19811},
                                {"key": "萝卜", "doc_count": 8370}
                            ]
                        }
                    },
                    {   "key": "八月飘香香满园（打一地名）",
                        "doc_count": 164571,
                        "options": {
                            "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                            "buckets": [
                                {"key": "桂林", "doc_count": 148744},
                                {"key": "厦门", "doc_count": 8963},
                                {"key": "青岛", "doc_count": 6864}
                            ]
                        }
                    }
                    # similar entries omitted
                ]
            }
        }
    }
}
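The requirement asked for the share of each option, while the response only carries counts. A minimal sketch of the conversion follows, assuming the response has been decoded into an associative array; optionShares is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert the nested terms buckets into
// percentages per question, rounded to two decimals.
function optionShares(array $response): array
{
    $shares = [];
    $questions = $response['aggregations']['answers']['questions']['buckets'];
    foreach ($questions as $question) {
        // doc_count of the question bucket is the denominator.
        $total = $question['doc_count'];
        foreach ($question['options']['buckets'] as $option) {
            $shares[$question['key']][$option['key']] =
                $total > 0 ? round($option['doc_count'] / $total * 100, 2) : 0.0;
        }
    }
    return $shares;
}
```

The result maps each question text to an array of option => percentage, which is the shape a charting library usually wants for one pie or bar chart per question.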

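Visualized directly, it looks like the chart below.

A change of requirement: how did users who chose option A on question 1 answer the other questions?

Only the query part is shown here and aggs is omitted. The filtered query and the and filter are ElasticSearch 1.x/2.x syntax; later versions express the same thing with a bool query and its filter clause. This is the query: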

{
    "query": {
        "filtered": {
            "query": {
                "nested": {
                    "path": "questions",
                    "query": {
                        "bool": {
                            "must": [
                                {
                                    "term": {
                                        "questions.question": {
                                            "value": "千条线，万条线， 掉到水里看不见（打一自然现象）"
                                        }
                                    }
                                },
                                {
                                    "term": {
                                        "questions.option": {
                                            "value": "雨"
                                        }
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            "filter": {
                "and": [
                    {
                        "range": {
                            "ended_at": {
                                "from": "2016-02-14 00:00:00",
                                "to": "2016-03-15 23:59:59"
                            }
                        }
                    }
                ]
            }
        }
    },
    "aggs": {
        #
        .
        .
        .
    }
}

The response. The query time is in the same range, still very fast:

{
    "took": 63,
    "timed_out": false,
    "_shards": { "total": 5, "successful": 5, "failed": 0 },
    "hits": { "total": 147481, "max_score": 0, "hits": [ #... ] }
}
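The last requirement in the list at the top (answer A on question 1 and answer B on question 2) needs one nested clause per question. A nested query matches inside a single sub-document, so putting both conditions into one nested clause would match nothing. A sketch of a builder follows; nestedAnswerClause and buildAnswerQuery are hypothetical names.

```php
<?php
// Hypothetical helper: one nested clause that matches a single
// (question, option) pair inside the same sub-document.
function nestedAnswerClause(string $question, string $option): array
{
    return [
        'nested' => [
            'path' => 'questions',
            'query' => [
                'bool' => [
                    'must' => [
                        ['term' => ['questions.question' => ['value' => $question]]],
                        ['term' => ['questions.option' => ['value' => $option]]],
                    ],
                ],
            ],
        ],
    ];
}

// Hypothetical helper: AND several answers together. Each pair gets its
// own nested clause because each answer lives in a different sub-document.
function buildAnswerQuery(array $pairs): array
{
    $must = [];
    foreach ($pairs as $pair) {
        $must[] = nestedAnswerClause($pair['question'], $pair['option']);
    }
    return ['bool' => ['must' => $must]];
}
```

For example, buildAnswerQuery([['question' => 'Q1 text', 'option' => 'A'], ['question' => 'Q2 text', 'option' => 'B']]) yields a bool query with two nested clauses, which can take the place of the single nested query shown above.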

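Aggregations

ElasticSearch has two main kinds of aggregation: Metrics and Bucket. The query above uses both; each is explained in turn.

Metrics aggregations compute a result directly, like the SQL functions sum(), min(), max(), avg(), and count().

Bucket aggregations do not produce a metric directly. Instead they create a set of buckets (each reporting how many documents fall into it), which can then be aggregated further with Sub-Aggregations.

Nested Aggregation

Used by aggs.answers. This aggregation yields no statistic of its own beyond a doc_count; it tells ElasticSearch that a field is Nested so that aggregation can continue inside it.

Date Histogram Aggregation

aggs.dates in the example uses Date Histogram. It is the most commonly used aggregation and applies whenever the data contains a time field. Typical uses:

Counts per month/week/day/hour/minute. The interval need not be a single unit; it can also be every 2 days, every 3 hours, etc.
If an interval has no data, ElasticSearch can still emit a bucket for it with a count of 0 (min_doc_count set to 0, as in the query above)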
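To make the zero-filling behaviour concrete, here is a plain-PHP illustration of what a daily date_histogram with min_doc_count set to 0 produces. It is not ElasticSearch code, only an emulation over an in-memory list; dailyHistogram is a hypothetical name and the input is assumed to use the yyyy-MM-dd HH:mm:ss format from the mapping.

```php
<?php
// Plain-PHP emulation of a daily date_histogram with min_doc_count = 0.
function dailyHistogram(array $timestamps): array
{
    if ($timestamps === []) {
        return [];
    }

    // Bucket step: count documents per calendar day.
    $counts = [];
    foreach ($timestamps as $ts) {
        $day = substr($ts, 0, 10); // 'yyyy-MM-dd HH:mm:ss' -> 'yyyy-MM-dd'
        $counts[$day] = ($counts[$day] ?? 0) + 1;
    }

    // Zero-fill step: walk every day between the first and last bucket.
    $tz = new DateTimeZone('UTC');
    $days = array_keys($counts);
    $cursor = new DateTimeImmutable(min($days), $tz);
    $last = new DateTimeImmutable(max($days), $tz);

    $buckets = [];
    while ($cursor <= $last) {
        $key = $cursor->format('Y-m-d');
        $buckets[$key] = $counts[$key] ?? 0; // empty days still get a bucket
        $cursor = $cursor->modify('+1 day');
    }
    return $buckets;
}
```

Given timestamps on 2016-02-21 and 2016-02-23 only, the result still contains a 2016-02-22 entry with a count of 0, which is what keeps a bar chart's x-axis continuous.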

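Terms Aggregation

Used twice under aggs.answers.aggs.questions. It is the equivalent of SQL's group by and belongs to the Bucket Aggregations.

Cardinality Aggregation

The equivalent of SQL's count(distinct(FIELD)); it belongs to the Metrics Aggregations. The count is approximate, which is why user_count can exceed doc_count in the response above.

One more important concept: aggregating the result of an aggregation, i.e. Sub-Aggregations

In the example, aggs.answers.aggs.questions first aggregates by question and then aggregates the answers once more (see aggs.answers.aggs.questions.aggs.options). Without Sub-Aggregations there would be no way to place the answers under their question.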
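The terms-inside-terms structure can be pictured as a two-level group by. The sketch below emulates it in plain PHP over an in-memory list of rows; it is an illustration of the idea, not how ElasticSearch computes it, and termsWithSubTerms is a hypothetical name.

```php
<?php
// Plain-PHP emulation of a terms aggregation on "question" with a
// terms sub-aggregation on "option", both ordered by count descending.
function termsWithSubTerms(array $rows): array
{
    $buckets = [];
    foreach ($rows as $row) {
        $q = $row['question'];
        $o = $row['option'];
        // Outer bucket: one per question.
        $buckets[$q]['doc_count'] = ($buckets[$q]['doc_count'] ?? 0) + 1;
        // Inner bucket: one per option, scoped to this question only.
        $buckets[$q]['options'][$o] = ($buckets[$q]['options'][$o] ?? 0) + 1;
    }

    foreach ($buckets as &$bucket) {
        arsort($bucket['options']); // order: _count desc
    }
    unset($bucket);

    uasort($buckets, function (array $a, array $b): int {
        return $b['doc_count'] <=> $a['doc_count'];
    });
    return $buckets;
}
```

Because the inner counts are kept per outer bucket, each option total is relative to its own question, which is exactly what a flat group by on option alone could not provide.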

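2. Day-to-Day Use

Once the data is imported, what does routine maintenance involve?

Inserting a new Document, the equivalent of SQL insert
Updating an existing Document, the equivalent of SQL update
Deleting a Document, that is, SQL delete

Inserting a single Document (for example, a user has just completed a survey)

The examples below are all copied from the official documentation.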

curl -XPUT 'localhost:9200/customer/external/1' -d '
{
  "name": "John Doe"
}'

Updating an existing Document

curl -XPOST 'localhost:9200/customer/external/1/_update' -d '
{
  "doc": { "name": "Jane Doe" }
}'

Deleting a Document. No surprises: as you can see, it uses the DELETE method, very RESTful.

curl -XDELETE 'localhost:9200/customer/external/2'
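For readers following along in PHP rather than curl, the three calls above map onto parameter arrays for the PHP client. This is a sketch, assuming a client version that still accepts the 'type' key, as the earlier snippets in this article do; the variable names are illustrative.

```php
<?php
// Parameter arrays mirroring the three curl calls above.

// PUT /customer/external/1  (insert)
$indexParams = [
    'index' => 'customer',
    'type' => 'external',
    'id' => '1',
    'body' => ['name' => 'John Doe'],
];

// POST /customer/external/1/_update  (partial update via "doc")
$updateParams = [
    'index' => 'customer',
    'type' => 'external',
    'id' => '1',
    'body' => ['doc' => ['name' => 'Jane Doe']],
];

// DELETE /customer/external/2
$deleteParams = [
    'index' => 'customer',
    'type' => 'external',
    'id' => '2',
];
```

With a client built as in the earlier snippets, these would be passed to $client->index($indexParams), $client->update($updateParams) and $client->delete($deleteParams) respectively.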

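As long as the set of fields does not change, routine use is much like MySQL, with no major differences.

Summary

Query time

This is the key point. Real-time computation really matters (otherwise even validating an idea is expensive). In ElasticSearch, searching several million rows finishes within tens to hundreds of ms.

Initial import time

Reading everything from MySQL and pushing it into ElasticSearch took 420 seconds (7 minutes). With a simpler document structure it can be much faster (tens of thousands of documents per second).

Space usage

This example holds about 3.6 million Documents (each nested sub-document counts as one) and occupies only 434.4 MB.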
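The figures in this summary can be cross-checked against the responses shown earlier. The sketch below does the arithmetic; the variable names are illustrative, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the article does not say which unit the 434.4 MB figure uses.

```php
<?php
// Numbers taken from the article.
$rootDocs = 822880;       // hits.total of the empty search
$nestedDocs = 2738528;    // aggregations.answers.doc_count
$importSeconds = 420;     // initial import time
$indexMegabytes = 434.4;  // reported space usage

// Root documents plus nested sub-documents.
$totalDocs = $rootDocs + $nestedDocs;

// Import throughput in root documents per second.
$docsPerSecond = $rootDocs / $importSeconds;

// Assumption: 1 MB = 1024 * 1024 bytes.
$bytesPerDoc = $indexMegabytes * 1024 * 1024 / $totalDocs;

printf(
    "%d documents in total, %.0f root docs/s, %.0f bytes per document\n",
    $totalDocs,
    $docsPerSecond,
    $bytesPerDoc
);
```

The total comes to 3,561,408, which matches the "about 3.6 million" figure, and the import works out to a little under 2,000 survey responses per second.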

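Other notes

ElasticSearch is really fast, especially for data analysis. Do not let the "search" in its name fool you.

It can search and aggregate millions or tens of millions of records in real time while using little space, so building a poor man's Google Analytics is easy.

Why is ElasticSearch so fast? @wentao, a former colleague at IEG, wrote a series of articles on the subject that is strongly recommended reading: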

The Secrets of Time Series Databases (1): Introduction http://www.infoq.com/cn/articles/database-timestamp-01
The Secrets of Time Series Databases (2): Indexing http://www.infoq.com/cn/articles/database-timestamp-02
The Secrets of Time Series Databases (3): Loading and Distributed Computing http://www.infoq.com/cn/articles/database-timestamp-03
Criteria for Choosing a Time Series Database http://km.oa.com/group/24825/articles/show/223511
ElasticSearch test report https://segmentfault.com/a/1190000002688549
MongoDB test report https://segmentfault.com/a/1190000002690548
