[Dataman] Week 1

Learning Objectives

  1. Understand the concepts of structured data, semi-structured data, and unstructured data
  2. Understand the concepts of big data
  3. Understand the Hadoop technology and architecture
  4. Learn to use the Hadoop command line for file storage
  5. Understand the MapReduce process and its applications
  6. Understand big data storage and data manipulation using Pig and Hive (SQL)
  7. Understand NoSQL data storage by implementing MongoDB
  8. Understand big data analytics and analytical tools
  9. Implement big data processing and analytics in case studies

TODO-LIST:

  1. Install MongoDB (for NoSQL databases)

  2. Install RapidMiner Studio Free

Datasets

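  1. https://github.com/datasets (important, commonly used datasets; high quality, easy to use, and published in an open form as data packages)
  2. https://archive.ics.uci.edu/ml/index.php (the UC Irvine Machine Learning Repository, a collection of datasets)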

    PhishTank.com
  3. https://cloud.tencent.com/developer/article/1369690
  4. A collection of various datasets (continuously updated)

Pre-knowledge

1. ETL

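ETL tools (ETL stands for Extract-Transform-Load, that is, the process of extracting, transforming, and loading data)
Reference articles

  1. ETL basics
  2. Introduction to ETL and a comparison of ETL tools (I do not fully understand this one yet)
  3. ETL tools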
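
To make the three ETL stages concrete, here is a minimal sketch using only the Python standard library. The sample CSV text, the `people` table, and the function names are all hypothetical and chosen only for illustration; dedicated ETL tools handle scheduling, error handling, and much larger volumes.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,age,city
alice,30,Sydney
bob,,Melbourne
carol,41,sydney
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing age, normalise casing, cast types."""
    cleaned = []
    for row in rows:
        if not row["age"]:
            continue  # skip incomplete records
        cleaned.append((row["name"].title(), int(row["age"]), row["city"].title()))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three stages against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute("SELECT name, age, city FROM people ORDER BY name").fetchall()


if __name__ == "__main__":
    print(run_etl())
```

The row for "bob" is dropped in the transform step because its age is empty, so only two records reach the table.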

2. Structured dataset

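Structured datasets.

Structured data, also called row data, is data that is logically expressed and implemented as a two-dimensional table. It strictly follows data format and length specifications and is mainly stored and managed in relational databases. [1]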

3. Column-Oriented dataset

  • Semi-structured data: some examples of column-oriented databases are:
    Apache HBase
    Cassandra
    Hypertable
    Google BigTable

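Emerging distributed databases such as HBase, HP Vertica, and EMC Greenplum use column-oriented storage. Some of them, HP Vertica for example, also support row-oriented storage.

As traditional relational databases and the newer distributed databases continue to develop, column-oriented and row-oriented storage will keep converging, and database systems will offer both storage modes. Commercial competition also pushes in this direction.
Reference blog post: Understanding column-oriented and row-oriented databases in one minute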
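
The difference between the two layouts can be sketched in plain Python. This is only an in-memory illustration with a hypothetical three-record table; real column stores add compression, on-disk formats, and distribution, none of which is modelled here.

```python
# Hypothetical table of three records, stored row by row.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]


def to_columns(row_store):
    """Pivot a row-oriented list of dicts into a column-oriented dict of lists."""
    columns = {}
    for row in row_store:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def average_age_row_store(row_store):
    # Row layout: every whole record is visited just to read one field.
    return sum(r["age"] for r in row_store) / len(row_store)


def average_age_column_store(column_store):
    # Column layout: only the "age" column is touched.
    ages = column_store["age"]
    return sum(ages) / len(ages)


columns = to_columns(rows)

if __name__ == "__main__":
    print(columns)
    print(average_age_row_store(rows), average_age_column_store(columns))
```

Both functions return the same average; the point is that the column layout answers an aggregate over one field without reading the other fields, which is why analytical workloads tend to favour it.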

4. Unstructured dataset

Structured, semi-structured, and unstructured data


Assignment

  • Find (download) 3 kinds of datasets from the following list of structures:
    Structured dataset
    Column-Oriented dataset
    Key-value dataset
    Document dataset -> MongoDB
    Graph dataset

  • Select the appropriate ETL tools / methods / applications / software for
    storing those 3 kinds of datasets.

  • Describe and demonstrate each kind of dataset.
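
As a starting point for describing the five structures, the same small set of hypothetical user records can be laid out in each shape with plain Python containers. All names and values below are invented for illustration; this sketch shows the shape of the data only, not how any particular database stores it.

```python
import json

# Structured: fixed columns, one tuple per row (as in a relational table).
structured = [("u1", "Alice", 30), ("u2", "Bob", 25)]

# Column-oriented: one list per column instead of one tuple per row.
column_oriented = {"id": ["u1", "u2"], "name": ["Alice", "Bob"], "age": [30, 25]}

# Key-value: a value looked up by a single key.
key_value = {"user:u1": "Alice", "user:u2": "Bob"}

# Document: nested records whose fields may differ from one record to the next
# (the kind of record MongoDB stores).
documents = [
    {"_id": "u1", "name": "Alice", "tags": ["admin"], "address": {"city": "Sydney"}},
    {"_id": "u2", "name": "Bob"},
]

# Graph: nodes plus labelled edges between them.
graph = {"nodes": ["u1", "u2"], "edges": [("u1", "follows", "u2")]}


def neighbours(g, node):
    """Return the nodes reachable from `node` over one outgoing edge."""
    return [dst for src, _, dst in g["edges"] if src == node]


if __name__ == "__main__":
    print(json.dumps(documents, indent=2))
    print(neighbours(graph, "u1"))
```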


  1. https://baike.baidu.com/item/%E7%BB%93%E6%9E%84%E5%8C%96%E6%95%B0%E6%8D%AE/5910594
