1.Preparing a Machine Learning Environment in Python
There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries:
- scipy
- numpy
- matplotlib
- pandas
- sklearn (installed with pip under the package name scikit-learn)
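A quick way to confirm the environment is ready is to import each library and print its version. The following is a minimal sketch; the helper name report_versions is made up for this example, and a library that is not installed is reported instead of raising an error.

```python
import importlib

# Import names of the five libraries listed above.
LIBRARIES = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]

def report_versions(names):
    """Return a dict mapping each import name to its version string, or None if missing."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

versions = report_versions(LIBRARIES)
for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```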
2.Load Data
We are going to use the iris flowers dataset. The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.
2.1 Import Libraries and Load Data
To load a CSV data file, we need the read_csv function from the pandas module:
# Load dataset
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
If you do have network problems, you can download the iris.csv file into your working directory and load it using the same method, changing URL to the local file name.
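Loading from a local file might look like the sketch below. To keep it self-contained, it writes a three-row excerpt to a temporary directory in place of a real download; the temporary path is an assumption of this example, and in practice you would pass the name of the iris.csv file in your working directory.

```python
import os
import tempfile
from pandas import read_csv

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# Three rows standing in for a downloaded iris.csv
# (the real file has 150 rows and no header line).
sample_rows = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

with tempfile.TemporaryDirectory() as workdir:
    local_path = os.path.join(workdir, "iris.csv")
    with open(local_path, "w") as handle:
        handle.write(sample_rows)
    # Same call as with the URL; only the first argument changes.
    dataset = read_csv(local_path, names=names)

print(dataset.shape)
```

Because the file has no header row, passing names= is what gives the columns their labels.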
3.Summarize the Dataset
Now it’s time to take a look at the data in a few different ways, as follows:
- Dimensions of the dataset.
- Peek at the data itself.
- Statistical summary of all attributes.
- Breakdown of the data by the class variable.
3.1 Dimensions of Dataset
# shape
print(dataset.shape)
(150, 5)
3.2 Peek at the Data
It is also a good idea to actually eyeball your data.
print(dataset.head(20))
3.3 Statistical Summary
Now we can take a look at a summary of each attribute which includes the count, mean, min and max as well as some percentiles.
print(dataset.describe())
3.4 Class Distribution
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
# class distribution
print(dataset.groupby('class').size())
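The same grouping can also be expressed as a share of all rows. The sketch below uses a tiny made-up frame rather than the real dataset (which has 50 rows per species), so the numbers are illustrative only.

```python
from pandas import DataFrame

# Toy labels for illustration only; not the real class distribution.
toy = DataFrame({'class': ['Iris-setosa', 'Iris-setosa',
                           'Iris-versicolor', 'Iris-virginica']})

counts = toy.groupby('class').size()  # absolute count per class
shares = counts / len(toy)            # fraction of all rows per class
print(counts)
print(shares)
```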
4.Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.
4.1 Univariate Plots
Given that the input variables are numeric, we can create box and whisker plots of each.
# box and whisker plots
from matplotlib import pyplot
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
We can also create a histogram of each input variable to get an idea of the distribution.
# histograms
dataset.hist()
pyplot.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
4.2 Multivariate Plots
Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
# scatter plot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
pyplot.show()
5.Evaluate Some Algorithms
Now it is time to create some models of the data and estimate their accuracy on unseen data.
The steps are as follows:
- Separate out a test dataset.
- Set up the test harness to use 10-fold cross-validation.
- Build multiple different models to predict species from flower measurements.
- Select the best model.
5.1 Create a Test Dataset
We need to know that the model we created is good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two:
- 80%, which we will use to train, evaluate and select among our models;
- 20%, which we will hold back as a test dataset to estimate the accuracy of the selected model on unseen data.
# Split-out validation dataset
from sklearn.model_selection import train_test_split
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later to make predictions.
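To see what the 80/20 split produces, the sketch below runs train_test_split on placeholder arrays with the same shape as the iris data. The array contents are arbitrary stand-ins chosen for this example; only the row counts are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the iris data (150 rows, 4 features).
# The values are arbitrary; only the row counts matter here.
X = np.arange(150 * 4).reshape(150, 4)
y = np.repeat(['a', 'b', 'c'], 50)

X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```

With 150 rows and test_size=0.20, 120 rows go to training and 30 are held back, and no row appears in both parts.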
5.2 Test Harness
We will use stratified 10-fold cross-validation to estimate model accuracy in order to select the best model.
This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.
Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.
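The effect of stratification can be checked by counting the classes in each held-out fold. This is a sketch on a placeholder training set that is assumed to be perfectly balanced (40 rows per class); a real training split may be slightly uneven, in which case the fold counts will only be approximately equal.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder training set: 120 rows, 40 per class (assumed balanced for illustration).
X = np.zeros((120, 4))
y = np.repeat([0, 1, 2], 40)

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
fold_class_counts = []
for train_idx, test_idx in kfold.split(X, y):
    # Count how many examples of each class landed in this fold's test part.
    fold_class_counts.append(np.bincount(y[test_idx], minlength=3).tolist())

print(fold_class_counts[0])  # [4, 4, 4]
```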
We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset.
The specific random seed does not matter. To learn more about pseudorandom number generators, see:
- Introduction to Random Number Generators for Machine Learning in Python
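The principle behind a fixed seed can be illustrated with Python's standard random module. Note that this is not the generator scikit-learn uses internally; the helper shuffled_indices is a made-up name for this sketch, which only shows that the same seed reproduces the same shuffle.

```python
import random

def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudorandom order determined by seed."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices

first = shuffled_indices(10, seed=1)
second = shuffled_indices(10, seed=1)
print(first == second)  # True: same seed, same order
```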
We are using the metric of accuracy to evaluate models.
This is the ratio of the number of correctly predicted instances to the total number of instances in the dataset, which can be multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring argument when we build and evaluate each model next.
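As a worked illustration of the ratio, here is a small hand-written accuracy function. It is a sketch for understanding the metric, not the scikit-learn implementation, and the labels are made up for the example.

```python
def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected label."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)

# Made-up labels for illustration: 3 of the 4 predictions match.
expected = ['setosa', 'versicolor', 'virginica', 'virginica']
predicted = ['setosa', 'versicolor', 'versicolor', 'virginica']
score = accuracy(expected, predicted)
print(score)        # 0.75
print(score * 100)  # 75.0
```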
5.3 Build Models
We don’t know which algorithms would be good on this problem or what configurations to use.
Let’s test 6 different algorithms:
- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)
This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.
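# Spot Check Algorithms
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))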
5.4 Select Best Model
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
In this case, we can see that it looks like Support Vector Machines (SVM) has the largest estimated accuracy score at about 0.98 or 98%.
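Selecting by mean cross-validation accuracy can also be done programmatically. The sketch below makes two simplifying assumptions: it uses the copy of the iris data bundled with scikit-learn (load_iris) instead of the CSV file, and it compares only two of the six models on the full dataset rather than on the training split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Bundled iris data, used here to avoid a network download.
X, y = load_iris(return_X_y=True)
models = [('KNN', KNeighborsClassifier()), ('NB', GaussianNB())]

mean_scores = {}
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    mean_scores[name] = cv_results.mean()

# Pick the name with the highest mean cross-validation accuracy.
best_name = max(mean_scores, key=mean_scores.get)
print(best_name, mean_scores[best_name])
```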
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross-validation).
A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy, and some pushing down into the high 80% accuracies.
5.5 Complete Example
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
6.Make Predictions
We must choose an algorithm to use to make predictions. The results in the previous section suggest that the SVM was perhaps the most accurate model, so we will use it as our final model.
6.1 Make Predictions
# make predictions on test dataset
model = SVC(gamma='auto')
model.fit(X_train,Y_train)
predictions = model.predict(X_validation)
You might also like to make predictions for single rows of data, or to save the model to a file and load it later to make predictions on new data.
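Both ideas can be sketched in a few lines. This example assumes the bundled load_iris data in place of the CSV file, uses Python's pickle module as one possible way to persist a model, and writes to a hypothetical file name (final_model.pkl) in a temporary directory.

```python
import os
import pickle
import tempfile
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Bundled copy of the iris data (an assumption made to avoid a network download).
X, y = load_iris(return_X_y=True)

model = SVC(gamma='auto')
model.fit(X, y)

# predict() expects a 2D array, so a single row is wrapped in an outer list.
single_row = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(single_row)
print(prediction)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "final_model.pkl")  # hypothetical file name
    # Save the fitted model to a file.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    # Load it back, as a later session would.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored.predict(single_row))
```

A pickled model should generally be loaded with the same library versions that were used to save it.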
6.2 Evaluate Predictions
We can evaluate the predictions by comparing them to the expected results in the test set, then calculate classification accuracy, as well as a confusion matrix and a classification report.
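# Evaluate predictions
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))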
- The confusion matrix provides an indication of the errors made.
- The classification report provides a breakdown of each class by precision, recall, f1-score and support, showing excellent results (granted the validation dataset was small).
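To make these terms concrete, the sketch below computes confusion counts, precision and recall by hand for a handful of made-up labels. It is an illustration of the definitions, not the scikit-learn implementation; the function names are invented for this example. Support is simply the number of expected instances of a class.

```python
from collections import Counter

def confusion_counts(expected, predicted):
    """Count (expected_label, predicted_label) pairs."""
    return Counter(zip(expected, predicted))

def precision_recall(expected, predicted, label):
    """Precision and recall for one class, computed from the pair counts."""
    pairs = confusion_counts(expected, predicted)
    true_pos = pairs[(label, label)]
    predicted_as_label = sum(n for (e, p), n in pairs.items() if p == label)
    actually_label = sum(n for (e, p), n in pairs.items() if e == label)
    precision = true_pos / predicted_as_label if predicted_as_label else 0.0
    recall = true_pos / actually_label if actually_label else 0.0
    return precision, recall

# Made-up labels for illustration only.
expected = ['a', 'a', 'b', 'b', 'b', 'c']
predicted = ['a', 'b', 'b', 'b', 'c', 'c']

print(confusion_counts(expected, predicted)[('a', 'b')])  # 1: one 'a' was predicted as 'b'
print(precision_recall(expected, predicted, 'c'))         # (0.5, 1.0)
```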
You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.