Learning Objectives
- Understand the concepts of structured, semi-structured, and unstructured data
- Understand the concepts of big data
- Understand the Hadoop technology and architecture
- Learn to use the Hadoop command line for file storage
- Understand the MapReduce process and its applications
- Understand big data storage and data manipulation using Pig and Hive (SQL)
- Understand NoSQL data storage by implementing MongoDB
- Understand big data analytics and analytical tools
- Implement big data processing and analytics in case studies
TODO-LIST:
- Install MongoDB for NoSQL databases
- Install RapidMiner Studio Free
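Datasets
- https://github.com/datasets (important, commonly used datasets, published as high-quality, easy-to-use, open data packages)
- https://archive.ics.uci.edu/ml/index.php (the UC Irvine Machine Learning Repository, a collection of datasets)
- PhishTank.com
- https://cloud.tencent.com/developer/article/1369690
- A compilation of various datasets (continuously updated)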
Pre-knowledge
1. ETL
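ETL is short for Extract-Transform-Load: the process of extracting data from a source, transforming it, and loading it into a target store. ETL tools automate this process.
Reference articles
- ETL fundamentals
- Introduction to ETL and a comparison of ETL tools (I don't fully understand this one)
- ETL tools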
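The extract, transform, and load steps described above can be sketched with Python's standard library. This is only a minimal illustration, not a real ETL tool: the CSV content, the `people` table, and the function names `extract`, `transform`, and `load` are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,age,city
 Alice ,30,beijing
Bob,,shanghai
Carol,41,Shenzhen
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalize city names, drop rows without an age."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip(), int(age), row["city"].strip().title()))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, age, city FROM people ORDER BY name").fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors how ETL tools let you swap the source, the cleaning rules, or the target independently.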
2. Structured dataset
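A structured dataset.
Structured data, also called row data, is data that is logically expressed and implemented as a two-dimensional table. It strictly follows data format and length specifications and is mainly stored and managed in relational databases. [1]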
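To make the format and length rules concrete, here is a small sketch using Python's built-in `sqlite3` module. The `students` table, its columns, and the 20-character limit are hypothetical. SQLite does not enforce declared column lengths on its own, so the length rule is written as an explicit `CHECK` constraint here; other relational databases may enforce it through the column type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A fixed schema: every row has the same columns, with declared types and limits.
conn.execute(
    """
    CREATE TABLE students (
        id    INTEGER PRIMARY KEY,
        name  TEXT    NOT NULL CHECK (length(name) <= 20),
        score INTEGER NOT NULL CHECK (score BETWEEN 0 AND 100)
    )
    """
)

conn.execute("INSERT INTO students (id, name, score) VALUES (?, ?, ?)", (1, "Alice", 92))

# A row that breaks the length rule is rejected by the database.
try:
    conn.execute(
        "INSERT INTO students (id, name, score) VALUES (?, ?, ?)",
        (2, "x" * 50, 80),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT id, name, score FROM students").fetchall()
print(rows, rejected)
```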
3. Column-Oriented dataset
Semi-structured data. Some examples of column-oriented databases are:
- Apache HBase
- Cassandra
- Hypertable
- Google BigTable
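Newer distributed databases such as HBase, HP Vertica, and EMC Greenplum use columnar storage, although some of them, HP Vertica for example, also support row storage. As traditional relational databases and newer distributed databases continue to develop, columnar and row storage will keep converging, and database systems will offer both storage modes side by side. Commercial competition also pushes in this direction.
Reference blog post: Understanding column-oriented and row-oriented databases in one minute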
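The difference between the two layouts can be sketched in plain Python. This is a toy model, not how any of the databases above are implemented; the table contents and the helper `rows_to_columns` are hypothetical.

```python
# The same hypothetical table stored in two layouts.

# Row-oriented: each record is kept together.
rows = [
    {"id": 1, "city": "Beijing", "sales": 120},
    {"id": 2, "city": "Shanghai", "sales": 95},
    {"id": 3, "city": "Beijing", "sales": 60},
]

# Column-oriented: each column is kept together.
columns = {
    "id": [1, 2, 3],
    "city": ["Beijing", "Shanghai", "Beijing"],
    "sales": [120, 95, 60],
}


def rows_to_columns(records):
    """Convert a list of row dicts into a dict of column lists."""
    result = {}
    for record in records:
        for key, value in record.items():
            result.setdefault(key, []).append(value)
    return result


# An aggregate over one column: the row layout walks every record,
# while the column layout reads a single list.
total_from_rows = sum(record["sales"] for record in rows)
total_from_columns = sum(columns["sales"])

# Fetching one whole record: direct in the row layout,
# but it needs one lookup per column in the column layout.
second_record = {name: values[1] for name, values in columns.items()}

print(total_from_rows, total_from_columns, second_record)
```

This is why columnar storage tends to suit analytical queries that scan a few columns across many rows, while row storage tends to suit reading or writing whole records.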
4. Unstructured dataset
Assignment
- Find (download) 3 kinds of dataset from the following list of structures:
  - Structured dataset
  - Column-Oriented dataset
  - Key-value datasets
  - Document dataset -> MongoDB
  - Graph dataset
- Select the appropriate ETL tools / methods / applications / software for storing those 3 kinds of dataset.
- Describe and demonstrate each kind of dataset.
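As a starting point for describing the key-value, document, and graph structures from the list, here is a rough sketch of what each looks like as plain Python data. All keys, names, and values are made up for illustration, and the helper `neighbors_within_two_hops` is hypothetical; a real assignment would use downloaded datasets and an actual store such as MongoDB.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value_store = {
    "user:1": "Alice",
    "user:2": "Bob",
}

# Document: a self-describing, nested record (the JSON-like shape a document store holds).
document = {
    "_id": 1,
    "name": "Alice",
    "courses": [{"title": "Big Data", "grade": "A"}],
}

# Graph: nodes plus edges, here as an adjacency list.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Carol"],
    "Carol": [],
}


def neighbors_within_two_hops(adjacency, start):
    """Return every node reachable from start in one or two hops."""
    reached = set()
    for first in adjacency.get(start, []):
        reached.add(first)
        reached.update(adjacency.get(first, []))
    reached.discard(start)
    return reached


print(key_value_store["user:1"])
print(json.dumps(document))
print(sorted(neighbors_within_two_hops(graph, "Alice")))
```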
[1] https://baike.baidu.com/item/%E7%BB%93%E6%9E%84%E5%8C%96%E6%95%B0%E6%8D%AE/5910594