1 Hands-on: Offline Data Analysis with EMR
1.1 Create resources and connect to the EMR cluster
Resources provisioned for this scenario
Log in with the RAM sub-account and find the public IP address of the master node.
Connect to the EMR cluster. The in-browser terminal provided by the scenario is awkward to use; a local PuTTY terminal can also connect to the master node and complete the remaining steps.
1.2 Import data into the EMR cluster
Create a directory on HDFS, then put the edited file onto the HDFS file system.
[root@emr-header-1 ~]# hdfs dfs -mkdir -p /data/student
[root@emr-header-1 ~]# vim u.txt
[root@emr-header-1 ~]# hdfs dfs -put u.txt /data/student
List the uploaded file and display its contents.
[root@emr-header-1 ~]# hdfs dfs -ls /data/student
Found 1 items
-rw-r----- 2 root hadoop 2391 2022-02-28 09:30 /data/student/u.txt
[root@emr-header-1 ~]# hdfs dfs -cat /data/student/u.txt
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
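The whitespace between fields in the `cat` output is literal tab characters: the Hive table created in the next step declares `FIELDS TERMINATED BY '\t'` and parses the file correctly, so each line must hold exactly four tab-separated fields (userid, movieid, rating, unixtime). A minimal Python sanity check of that layout (the `sample` string here is a two-line excerpt, not the whole file):

```python
# Check the u.txt layout: four tab-separated fields per line
# (userid, movieid, rating, unixtime), matching the Hive table schema.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

for line in sample.splitlines():
    fields = line.split("\t")
    assert len(fields) == 4, f"bad line: {line!r}"
    userid, movieid, rating, _unixtime = fields
    print(userid, movieid, rating)
```

Running this kind of check locally before `hdfs dfs -put` avoids silently loading NULL columns into Hive when the delimiter is wrong.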
Launch Hive, create a table, and load the data.
[root@emr-header-1 ~]# hive
Logging initialized using configuration in file:/etc/ecm/hive-conf-2.3.2-1.0.1/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> CREATE TABLE emrusers (
          userid INT,
          movieid INT,
          rating INT,
          unixtime STRING )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t';
OK
Time taken: 1.053 seconds
hive> LOAD DATA INPATH '/data/student/u.txt' INTO TABLE emrusers;
Loading data to table default.emrusers
OK
Time taken: 0.459 seconds
1.3 Query the table and run analytical SQL statements on it
View the first five rows of the table. A plain SELECT with LIMIT is served by a direct fetch without being compiled into a MapReduce job, so it returns almost instantly.
hive> select * from emrusers limit 5;
OK
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
Time taken: 0.069 seconds, Fetched: 5 row(s)
Count the total number of rows in the table. This SQL statement is compiled into a MapReduce job, so it takes noticeably longer (about 21 seconds here).
hive> select count(*) from emrusers;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220228110103_9aec542e-2d15-49de-b0fe-388ee617b755
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646010854736_0005, Tracking URL = http://emr-header-1.cluster-286405:20888/proxy/application_1646010854736_0005/
Kill Command = /usr/lib/hadoop-current/bin/hadoop job -kill job_1646010854736_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-02-28 11:01:11,438 Stage-1 map = 0%, reduce = 0%
2022-02-28 11:01:16,722 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
2022-02-28 11:01:22,891 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.28 sec
MapReduce Total cumulative CPU time: 2 seconds 280 msec
Ended Job = job_1646010854736_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.28 sec   HDFS Read: 10079 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 280 msec
OK
106
Time taken: 20.893 seconds, Fetched: 1 row(s)
Find the three highest-rated movies in the table. This query is compiled into two chained MapReduce jobs (one for the GROUP BY aggregation, one for the ORDER BY), so it takes the longest (about 36 seconds here).
hive> select movieid,sum(rating) as rat from emrusers group by movieid order by rat desc limit 3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220228110213_6733e92a-00ed-4d71-b289-5be55aaa26af
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646010854736_0006, Tracking URL = http://emr-header-1.cluster-286405:20888/proxy/application_1646010854736_0006/
Kill Command = /usr/lib/hadoop-current/bin/hadoop job -kill job_1646010854736_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-02-28 11:02:21,418 Stage-1 map = 0%, reduce = 0%
2022-02-28 11:02:25,532 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.0 sec
2022-02-28 11:02:30,664 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.0 sec
MapReduce Total cumulative CPU time: 2 seconds 0 msec
Ended Job = job_1646010854736_0006
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646010854736_0007, Tracking URL = http://emr-header-1.cluster-286405:20888/proxy/application_1646010854736_0007/
Kill Command = /usr/lib/hadoop-current/bin/hadoop job -kill job_1646010854736_0007
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-02-28 11:02:38,922 Stage-2 map = 0%, reduce = 0%
2022-02-28 11:02:43,038 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec
2022-02-28 11:02:48,162 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.14 sec
MapReduce Total cumulative CPU time: 2 seconds 140 msec
Ended Job = job_1646010854736_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.0 sec   HDFS Read: 9642 HDFS Write: 2131 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 2.14 sec   HDFS Read: 7869 HDFS Write: 143 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 140 msec
OK
144 13
274 10
304 9
Time taken: 36.114 seconds, Fetched: 3 row(s)
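The GROUP BY / ORDER BY logic of the query above can be mirrored in plain Python. The sketch below runs it over the nine sample rows shown earlier, not the full 106-row table, so the totals differ from the Hive output; ties are broken here by movieid, which Hive's ORDER BY does not guarantee:

```python
# Python equivalent of:
#   select movieid, sum(rating) as rat from emrusers
#   group by movieid order by rat desc limit 3;
# applied to the nine sample rows of u.txt (userid, movieid, rating).
from collections import defaultdict

rows = [
    (196, 242, 3), (186, 302, 3), (22, 377, 1),
    (244, 51, 2), (166, 346, 1), (298, 474, 4),
    (115, 265, 2), (253, 465, 5), (305, 451, 3),
]

totals = defaultdict(int)
for _userid, movieid, rating in rows:
    totals[movieid] += rating          # GROUP BY movieid, SUM(rating)

# ORDER BY rat DESC, with movieid as an explicit tie-breaker; LIMIT 3
top3 = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top3)  # [(465, 5), (474, 4), (242, 3)]
```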
2 Hands-on: Quickly Build an Intelligent O&M System with Alibaba Cloud Elasticsearch
2.1 Request resources and log in to the Elasticsearch cluster
The resources provisioned for this scenario are as follows.
After logging in with the sub-account, three Elasticsearch clusters are visible.
Double-check: the cluster provisioned for this lab should be the one whose ID starts with es-cn-jpy7.
Modify the Kibana configuration to enable network access, then open Kibana from the public internet.
2.2 Enable automatic index creation
The tricky part of this step is that the Dev Tools entry sits at the very bottom of the left navigation bar; it is unclear what ordering that navigation bar follows.
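For reference, on an open-source Elasticsearch cluster the same switch can be flipped from the Dev Tools console with a cluster settings update. This is a sketch of the standard API request; the Alibaba Cloud console also exposes this setting as a YML configuration item, and the exact request used in the lab may differ:

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "true"
  }
}
```

Without this, writes from Metricbeat to a not-yet-existing index can be rejected depending on the cluster's `action.auto_create_index` default.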
2.3 Create a Metricbeat collector
Select the ECS instance, then start the collector.
Check the collector status.
The collector status shows "Effective" (已生效).
A total of three collectors were created, but only one is actually running; a collector whose status reads "Effective 0/1" has in fact failed to deploy.
View the dashboard.
It shows the ECS instance's process count, CPU usage, system load, and other metrics.
2.4 Summary
This scenario has some difficulty. It is unclear why multiple Elasticsearch clusters appear in the scenario. As for the collectors, they can only be created: attempts to delete or restart them fail with an insufficient-permissions error. Two of the collectors I created failed to deploy, and the lab manual offers no analysis or workaround.
3 Getting Started with Recommender Systems: Product Recommendation via Collaborative Filtering
Apart from having to switch to the legacy version because of a product version change, this scenario matches the lab manual exactly; even the data and results are identical to those in the manual.
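The PAI experiment's components are black boxes, but the core idea behind item-based collaborative filtering can be sketched in a few lines of Python. This is a toy illustration with made-up purchase data, not the algorithm PAI actually ships: items frequently bought together are treated as similar, and a user is recommended items co-purchased with what they already own.

```python
# Toy item-based collaborative filtering: score candidate items by how
# often they co-occur with the user's purchases across all users.
from collections import defaultdict
from itertools import combinations

def item_cooccurrence(purchases):
    """Count, for each ordered item pair, how many users bought both."""
    counts = defaultdict(int)
    for items in purchases.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(purchases, user, top_n=2):
    """Rank items the user does not own by total co-occurrence count."""
    counts = item_cooccurrence(purchases)
    owned = set(purchases[user])
    scores = defaultdict(int)
    for (a, b), c in counts.items():
        if a in owned and b not in owned:
            scores[b] += c
    # highest score first; item name breaks ties deterministically
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

purchases = {
    "u1": ["milk", "bread", "eggs"],
    "u2": ["milk", "bread"],
    "u3": ["bread", "eggs", "jam"],
}
print(recommend(purchases, "u2"))  # ['eggs', 'jam']
```

Production systems (including PAI's components) use normalized similarity measures such as cosine or Jaccard rather than raw counts, but the co-occurrence idea is the same.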
Open the experiment.
Inspect the data.
Run the experiment.
Run complete.
Check the result of the join-1 node, which shows the similar item pairs.
Check the 全表统计-1 (full-table statistics 1) node, which shows the recommendation results.
Check the 全表统计-2 (full-table statistics 2) node, which shows the correlations.