This article introduces the concepts and categories of machine learning, along with related learning resources.
History
-
Machine learning is a "field of study that gives computers the ability to learn without being explicitly programmed." --Arthur Samuel
-
Well-Posed Learning Problems
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. --Tom Mitchell
AI, ML, and DL
AI is a very broad term, which is part of why so many companies now claim that their products have AI. ML is a subset of AI and consists of the more advanced techniques and models that enable computers to figure things out from data and deliver AI applications. ML is the science of getting computers to act without being explicitly programmed.
Finally, DL is a newer area of ML that uses multi-layered artificial neural networks to deliver high accuracy in tasks such as object detection, speech recognition, and language translation, and it is behind many of the recent breakthroughs reported in the news.
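Categories of machine learning algorithms
Depending on whether the training data carries label information, learning tasks can be roughly divided into two broad categories: supervised learning and unsupervised learning. Classification and regression are representative of the former, and clustering is representative of the latter.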
- Supervised learning
Teach the computer how to do something
- Unsupervised learning
Let it learn by itself
Others:
- Reinforcement learning
- Recommender systems
The distinction between supervised learning and unsupervised learning is based on the form of the data you are given.
Supervised learning
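Give the algorithm a data set in which the "right answer" (label) is given for each example. The algorithm's task is to produce more of these right answers by learning from the given data set.
Given a labeled data set, the algorithm learns the mapping between inputs and outputs. It is like working through practice problems: you solve them yourself, then keep adjusting your methods and reasoning against the provided answers (labels) until you arrive at the correct answers.
Supervised learning is currently in fairly wide use and falls into two main categories: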
-
Regression problem
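Predict a continuous output value (for example: price, height, time)
Based on features extracted from the data samples, predict a continuous-valued result, such as a house price, a score, or GDP. A regression problem is like answering a calculation question.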
-
Classification problem
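Predict a discrete output value (for example: right or wrong, good or bad, a, b, or c)
Based on features extracted from the data samples, decide which of a finite number of classes a sample belongs to. Examples: spam detection (classes: yes or no), text sentiment recognition (classes: positive, negative), image content recognition (classes: cat, dog, person, other). A classification problem is like answering a multiple-choice question.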
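To make the contrast between the two problem types concrete, here is a minimal sketch in plain Python. It assumes a single feature per sample, fits a least-squares line for the regression case, and uses a nearest-centroid rule for the classification case. Both methods and all data values are illustrative choices, not something prescribed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value per class."""
    sums, counts = {}, {}
    for x, label in zip(xs, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Regression: hypothetical (area, price) pairs; the output is a continuous value.
areas = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 70.0 + intercept

# Classification: hypothetical count of suspicious words per email;
# the output is one of a finite set of classes.
word_counts = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
is_spam = ["no", "no", "no", "yes", "yes", "yes"]
centroids = fit_centroids(word_counts, is_spam)
predicted_class = classify(centroids, 7.0)
```

The regression model returns a number on a continuous scale, while the classifier can only return one of the labels it saw during training.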
Unsupervised learning
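We're given data that has no labels, or in which every example carries the same label. We expect the model to find some structure in the data.
Given a data set without labels, the program discovers the characteristics of the data on its own and learns a model from them.
Unsupervised learning is mainly concerned with clustering problems.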
-
Clustering algorithm
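Break the data into separate clusters. Examples include Google News grouping large numbers of news stories into clusters, organizing computing clusters, social network analysis, market segmentation, and astronomical data analysis.
Clustering algorithms split the data into several groups and, based on features extracted from the data samples, uncover association and aggregation patterns in the data.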
-
Cocktail party problem
Separate mixed signals that come from different sources, such as individual voices recorded together at a party.
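As an illustration of the clustering idea above, the following sketch implements k-means (Lloyd's algorithm) on one-dimensional data. The choice of k-means, the starting centers, and the sample points are assumptions made for the example; the text does not name a specific clustering algorithm.

```python
def assign(points, centers):
    """Attach every point to its nearest center; returns one list per center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, assign(points, centers)


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

No labels are supplied at any point; the grouping emerges only from the distances between the points.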
Semi-Supervised Learning
Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data).
It combines a small amount of labeled data with a large amount of unlabeled data during training.
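One simple way to combine a few labeled points with many unlabeled ones is self-training, sketched below under strong simplifying assumptions: one feature, a nearest-centroid classifier, and every unlabeled point pseudo-labeled in each round. Practical variants usually keep only confident pseudo-labels. The technique and the data are illustrative and are not taken from the text above.

```python
def centroids_of(labeled):
    """Mean feature value per class for a list of (x, label) pairs."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def self_train(labeled, unlabeled, rounds=5):
    """Self-training: pseudo-label the unlabeled points, refit, and repeat."""
    centroids = centroids_of(labeled)
    for _ in range(rounds):
        # Label each unlabeled point with the class of its nearest centroid.
        pseudo = [
            (x, min(centroids, key=lambda c: abs(centroids[c] - x)))
            for x in unlabeled
        ]
        # Refit on the true labels plus the current pseudo-labels.
        centroids = centroids_of(labeled + pseudo)
    return centroids


# Hypothetical data: two labeled points and six unlabeled ones.
labeled = [(1.0, "a"), (10.0, "b")]
unlabeled = [0.0, 2.0, 3.0, 8.0, 9.0, 12.0]
centroids = self_train(labeled, unlabeled)
```

The unlabeled points shift the class centroids away from the positions that the two labeled points alone would give.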
Reinforcement learning
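Reinforcement learning learns a mapping from the environment to actions. It studies how to act in an environment so as to maximize the expected reward, for example how to get the highest score in a game or how a robot can complete a task.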
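A small sketch can show what "learning a mapping from environment to action" looks like in code. It uses tabular Q-learning, one common algorithm that the text does not name, on a hypothetical five-state corridor in which reaching the rightmost state yields a reward of 1. The environment, the hyperparameters, and the purely random exploration are all assumptions of this example.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Hypothetical environment: reward 1 on reaching the goal, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Explore with uniformly random actions. Q-learning is off-policy,
            # so it can still estimate the value of acting greedily.
            a = rng.randrange(len(ACTIONS))
            next_state, reward = step(state, ACTIONS[a])
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy: in each non-goal state pick the action with the larger value.
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
```

The resulting `policy` list is the learned mapping: one action per state, chosen to maximize the estimated discounted reward.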
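Basic concepts and terminology
The following uses supervised learning, which is the more widely used approach, as an example to introduce the concepts of machine learning.
Data sets
- Training set: labeled data with correct answers, used for learning and generalization.
- Test set: data whose correct answers (labels) are not shown to the model, used to evaluate how good the model is.
For unsupervised learning, there is no real difference between the training set and the test set; they differ only in the purpose they are used for.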
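In supervised practice, the two sets are often produced by splitting one labeled data set. The helper below is a minimal sketch of such a split; the function name, the 25% test fraction, and the sample data are assumptions made for illustration.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and cut it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled samples: (feature, label)
samples = [(x, "high" if x >= 5 else "low") for x in range(10)]
train_set, test_set = train_test_split(samples)

# The model sees features and labels of train_set while learning, but only
# the features of test_set when predicting. The test labels are kept aside
# and used solely to score the predictions.
test_features = [x for x, _ in test_set]
test_labels = [y for _, y in test_set]
```

Fixing the seed keeps the split reproducible, so that results from different models can be compared on the same test set.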
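Each row of the data in the figure above is called an instance, an example, or a sample.
The header of each of the first three columns is called an attribute or a feature.
The value of a sample in a given column is called an attribute value or a feature value.
All the attributes together span an attribute space, and all the samples together form a sample space. The set of possible values of the input 'X' is the input space.
Attribute vector: each attribute has a column vector, and together these column vectors [x1, x2, ..., xn] form a feature vector.
The overall machine learning workflow: depending on the type and characteristics of the data, choose a learning algorithm from one of the learning approaches (supervised or unsupervised) and train it to obtain a model; then test the model, improve it, and iterate.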
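The train, test, and iterate loop can be sketched end to end with a very small model. The example below assumes a one-feature threshold rule (a decision stump) as the learning algorithm and accuracy as the performance measure; the data values are hypothetical.

```python
def train_stump(train_set):
    """Learning algorithm: pick the threshold with the fewest training errors."""
    candidates = sorted({x for x, _ in train_set})

    def errors(t):
        return sum((x >= t) != label for x, label in train_set)

    return min(candidates, key=errors)


def predict(threshold, x):
    """The learned model: a one-feature threshold rule."""
    return x >= threshold


def accuracy(threshold, data):
    """Fraction of samples in data that the model labels correctly."""
    correct = sum(predict(threshold, x) == label for x, label in data)
    return correct / len(data)


# Hypothetical data: (hours studied, passed the exam)
train_set = [(1, False), (2, False), (3, False), (6, True), (7, True), (8, True)]
test_set = [(2.5, False), (5, True), (9, True), (0.5, False)]

threshold = train_stump(train_set)             # 1. train
test_accuracy = accuracy(threshold, test_set)  # 2. test
# 3. If the accuracy is too low, revisit the features, the data, or the
#    learning algorithm, then train and test again.
```

Here the model fits the training set perfectly yet misclassifies one test sample, which is the kind of gap that the improve-and-iterate step is meant to address.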
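Model / hypothesis / learner: the estimated function, a prediction of the underlying rules and patterns
Learning machine (learner): the learning algorithm being used
Ground truth: the labels, that is, the reference answers
Sample = attributes/features + label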
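The terms above map naturally onto a small data structure. The sketch below is one possible representation; the class name, the attribute names, and the values are hypothetical and do not come from the figure referenced earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    features: List[float]  # attribute values, one per attribute
    label: str             # ground-truth answer for this sample


ATTRIBUTES = ["area_m2", "rooms", "age_years"]  # hypothetical attribute names

dataset = [
    Sample(features=[50.0, 2.0, 30.0], label="cheap"),
    Sample(features=[120.0, 4.0, 5.0], label="expensive"),
    Sample(features=[80.0, 3.0, 12.0], label="cheap"),
]

X = [s.features for s in dataset]  # one feature vector per sample (rows)
y = [s.label for s in dataset]     # labels, i.e. the ground truth
# All values of one attribute across the samples (one column of the table).
rooms_column = [row[ATTRIBUTES.index("rooms")] for row in X]
```

Keeping features and labels in separate lists (`X` and `y`) is a common convention, because a trained model receives only `X` at prediction time.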
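Books and resources
- Prof. Andrew Ng, Machine Learning, Stanford University
- Zhou Zhihua, 机器学习 (Machine Learning), Tsinghua University Press, 2016
- Python 数据分析与挖掘实战 (Python Data Analysis and Mining in Practice)
- 面向机器智能的 TensorFlow 实践 (TensorFlow for Machine Intelligence)
- 机器学习系统设计 (Machine Learning System Design)
- TensorFlow 技术解析与实战 (TensorFlow: Technical Analysis and Practice)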
- Scikit-learn
- Google Crash-Course