Machine Learning: Predictions on Iris Flowers

2024-01-22 22:09:00

1.Preparing a Machine Learning Environment in Python

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries:

scipy
numpy
matplotlib
pandas
scikit-learn (imported as sklearn)
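
One quick way to confirm the environment is ready is to try importing each library and print its version. The snippet below is a minimal sketch; the helper name check_libraries is made up for this illustration and is not part of any library.

```python
# Report which of the five libraries are importable, and their versions.
import importlib

# Import names as used in Python code; scikit-learn is imported as "sklearn".
LIBRARIES = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]


def check_libraries(names):
    """Hypothetical helper: map each import name to its version, or None if missing."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


versions = check_libraries(LIBRARIES)
for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any library reported as NOT INSTALLED needs to be installed before continuing.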

2.Load Data

We are going to use the iris flowers dataset. The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

2.1 Import Libraries and Load Data

To load a CSV data file, we need the read_csv function from the pandas module:

# Load dataset
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
# The file has no header row, so we supply the column names ourselves
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

If you have network problems, you can download the iris.csv file into your working directory and load it using the same method, replacing the URL with the local file name.
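
As a rough sketch of the local-file variant, the snippet below writes a three-row stand-in for iris.csv into a temporary directory and loads it with the same read_csv call. The temporary directory and the three-row sample are assumptions made so the example runs without a network connection; with a real download you would simply pass the path of your own iris.csv.

```python
import os
import tempfile

from pandas import read_csv

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# Three-row stand-in for a downloaded iris.csv (the real file has 150 rows and no header).
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "iris.csv")
    with open(path, "w") as handle:
        handle.write(sample)
    # Same call as with the URL; only the first argument changes.
    dataset = read_csv(path, names=names)

print(dataset.shape)
print(dataset.head())
```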

3.Summarize the Dataset

Now it’s time to take a look at the data in a few different ways, as follows:

Dimensions of the dataset.
Peek at the data itself.
Statistical summary of all attributes.
Breakdown of the data by the class variable.

3.1 Dimensions of Dataset

# shape
print(dataset.shape)
(150, 5)

3.2 Peek at the Data

It is also a good idea to actually eyeball your data.

print(dataset.head(20))

3.3 Statistical Summary

Now we can take a look at a summary of each attribute which includes the count, mean, min and max as well as some percentiles.

print(dataset.describe())

3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

# class distribution
print(dataset.groupby('class').size())
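
The same breakdown can also be expressed as percentages. The sketch below is an assumption-laden variant: to stay runnable offline it builds the DataFrame from the copy of the iris data bundled with scikit-learn (load_iris), whose class labels read 'setosa' rather than 'Iris-setosa' as in the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data instead of the CSV from the URL.
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.DataFrame(iris.data, columns=names[:4])
dataset['class'] = iris.target_names[iris.target]

# Absolute count per class, then the same figure as a share of all rows.
counts = dataset.groupby('class').size()
percent = 100.0 * counts / len(dataset)

summary = pd.DataFrame({'count': counts, 'percent': percent})
print(summary)
```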

4.Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

Univariate plots to better understand each attribute.
Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

Given that the input variables are numeric, we can create box and whisker plots of each.

# box and whisker plots
from matplotlib import pyplot
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

We can also create a histogram of each input variable to get an idea of the distribution.

# histograms
dataset.hist()
pyplot.show()

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

4.2 Multivariate Plots

Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

# scatter plot matrix
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
pyplot.show()
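
The relationships visible in the scatter plots can also be put into numbers with a correlation matrix. This is a supplementary sketch rather than a step from the walkthrough above, and it assumes the data comes from scikit-learn's bundled load_iris copy so that it runs without a download.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: bundled iris data, with the same column names used in this article.
iris = load_iris()
feature_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
features = pd.DataFrame(iris.data, columns=feature_names)

# Pearson correlation between every pair of numeric attributes.
# Values near +1 or -1 correspond to the tight diagonal bands in the scatter plot matrix.
correlations = features.corr()
print(correlations.round(2))
```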

5.Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.
The steps are as follows:

Separate out a validation dataset.
Set up the test harness to use 10-fold cross-validation.
Build multiple different models to predict species from flower measurements.
Select the best model.

5.1 Create a Test Dataset

We need to know that the model we created is good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two:

80% of which we will use to train, evaluate and select among our models;
20% that we will hold back as a validation dataset to estimate the accuracy of the selected model on unseen data.

# Split-out validation dataset
array = dataset.values
# First four columns are the measurements, the fifth is the species
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

You now have training data in X_train and Y_train for preparing models, and the X_validation and Y_validation sets that we can use later to make predictions.
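
To see what the 80/20 split produces, the following sketch prints the shapes of the four arrays. It assumes the data is taken from scikit-learn's bundled load_iris copy rather than the CSV, so it can run on its own; the split call itself matches the one above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data (150 rows, 4 measurements) in place of the CSV.
iris = load_iris()
X, y = iris.data, iris.target

# Hold back 20% of the rows; random_state fixes the shuffle so the split is repeatable.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print("train:", X_train.shape, Y_train.shape)
print("validation:", X_validation.shape, Y_validation.shape)
```

With 150 rows, 120 go to training and 30 are held back.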

5.2 Test Harness

We will use stratified 10-fold cross validation to estimate model accuracy in order to select the best model.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.

Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.
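
The effect of stratification is easy to inspect by counting the classes in each held-out fold. The sketch below is only an illustration: it assumes the full bundled load_iris data (50 flowers per species) rather than the 80% training portion, because the perfectly balanced classes make the result easy to read.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Assumption: full bundled iris data, 50 rows per class.
iris = load_iris()
X, y = iris.data, iris.target

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y), start=1):
    # Count how many rows of each class landed in this held-out fold.
    _, counts = np.unique(y[test_idx], return_counts=True)
    fold_counts.append(counts.tolist())
    print(f"fold {fold}: held-out class counts = {counts.tolist()}")
```

Each held-out fold contains the same number of flowers from every species, mirroring the class balance of the whole dataset.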

For more on the k-fold cross-validation technique, see the tutorial:

A Gentle Introduction to k-fold Cross-Validation

We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset.

The specific random seed does not matter. Learn more about pseudorandom number generators here:

Introduction to Random Number Generators for Machine Learning in Python

We are using the metric of accuracy to evaluate models.

This is the number of correctly predicted instances divided by the total number of instances in the dataset. scikit-learn reports it as a fraction between 0 and 1, which can be multiplied by 100 to give a percentage (e.g. 95% accurate). We will pass scoring='accuracy' when we build and evaluate each model next.
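
As a small worked example of the metric, the sketch below computes accuracy by hand and compares it with scikit-learn's accuracy_score. The five labels are toy values invented for this illustration, not results from the iris models.

```python
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration only.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

# Accuracy by hand: correct predictions divided by total predictions.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
manual_accuracy = correct / len(y_true)

# The same quantity from scikit-learn, returned as a fraction between 0 and 1.
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
print(f"{100 * library_accuracy:.0f}% accurate")
```

Four of the five toy predictions match, so both values are 0.8.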

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use.

Let’s test 6 different algorithms:

Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

In this case, we can see that it looks like Support Vector Machines (SVM) has the largest estimated accuracy score at about 0.98 or 98%.
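
Picking the model with the highest mean score can also be done in code instead of by reading the printed output. The following is a sketch under assumptions: it uses the bundled load_iris data, evaluates only three of the six algorithms to stay short, and the mean_scores dictionary is a name introduced here for illustration. Which model wins may differ from the figures quoted above, depending on the data source and library version.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: bundled iris data in place of the CSV.
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# A shortened model list; the full walkthrough uses six algorithms.
models = [
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC(gamma='auto')),
]

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

# Mean cross-validated accuracy per model name.
mean_scores = {}
for name, model in models:
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    mean_scores[name] = cv_results.mean()

# The model whose mean accuracy across the 10 folds is highest.
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best by mean accuracy:", best_name)
```

Mean accuracy alone ignores the spread across folds, which is why the comparison plot is still worth a look.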

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross-validation).

A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.

# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy, and some pushing down into the high 80% accuracies.

5.5 Complete Example

# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
	
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

6.Make Predictions

We must choose an algorithm to use to make predictions. The results in the previous section suggest that the SVM was perhaps the most accurate model, so we will use it as our final model.

6.1 Make Predictions

# make predictions on test dataset
model = SVC(gamma='auto')
model.fit(X_train,Y_train)
predictions = model.predict(X_validation)

You might also like to make predictions for single rows of data. For examples on how to do that, see the tutorial:

How to Make Predictions with scikit-learn

You might also like to save the model to file and load it later to make predictions on new data. For examples on how to do this, see the tutorial:

Save and Load Machine Learning Models in Python with scikit-learn
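
For a rough idea of what both look like, here is a minimal sketch that predicts the species of one flower and then saves and reloads the model with Python's standard pickle module. The measurements in new_flower, the file name iris_svm.pkl and the use of a temporary directory are all assumptions for this illustration, and the model is fitted on the bundled load_iris data so the snippet runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: bundled iris data; labels are 'setosa', 'versicolor', 'virginica'.
iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]

model = SVC(gamma='auto')
model.fit(X, y)

# predict() expects a 2D array: one row per flower, four measurements per row.
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])  # hypothetical measurements in cm
single_prediction = model.predict(new_flower)
print(single_prediction)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "iris_svm.pkl")  # hypothetical file name
    # Save the fitted model to disk...
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    # ...and load it back, as a later session would.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print("restored model matches:", same)
```

A pickled model should only be loaded with a compatible scikit-learn version and from a source you trust.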

6.2 Evaluate Predictions

We can evaluate the predictions by comparing them to the expected results in the validation set, then calculate classification accuracy, as well as a confusion matrix and a classification report.

# Evaluate predictions
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

The confusion matrix provides an indication of the errors made.
The classification report provides a breakdown of each class by precision, recall, f1-score and support, showing excellent results (granted, the validation dataset was small).
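
To make the confusion matrix and the per-class figures concrete, the sketch below works through a tiny example. The six labels are toy values invented for this illustration, not predictions from the iris model. Rows of the matrix are the true classes and columns are the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration only.
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']

# Rows = true class, columns = predicted class, both in sorted label order (a, b, c).
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal.
correct_per_class = np.diag(cm)

# Recall: of the rows that truly belong to a class, how many were found (row sums).
recall = correct_per_class / cm.sum(axis=1)
# Precision: of the rows predicted as a class, how many were right (column sums).
precision = correct_per_class / cm.sum(axis=0)

print("recall:", recall)
print("precision:", precision)
```

Off-diagonal entries show which classes get confused with each other; here one 'a' was predicted as 'b' and one 'c' as 'a'.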

You do not need to know how the algorithms work at first. Knowing their limitations and how to configure them is important, but learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.
