1 Loading an example dataset
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the diabetes dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits datasets. Our notational convention is that $ denotes the shell prompt while >>> denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the .data member, which is an array of shape (n_samples, n_features). In the case of a supervised problem, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
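As a minimal sketch of the two members described above, the following loads the iris dataset and inspects the shapes of .data and .target. The variable names n_samples and n_features are chosen here only to mirror the text.

```python
from sklearn import datasets

iris = datasets.load_iris()

# .data holds the feature matrix: one row per sample, one column per feature.
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# .target holds the response variable: one class label per sample.
print(iris.target.shape)
```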
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples:
>>> print(digits.data)
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
and digits.target gives the ground truth for the digits dataset, that is, the number corresponding to each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
Shape of the data arrays
The data is always a 2D array of shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
The simple example on this dataset illustrates how, starting from the original problem, one can shape the data for consumption in scikit-learn.
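One way to shape the image data into the 2D layout described above is to flatten each (8, 8) image into a row of 64 values. The sketch below uses NumPy's reshape for this and compares the result with digits.data; the name flat is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# The original samples are stored as a stack of 8x8 images.
print(digits.images.shape)

# Flatten each image into one row; -1 lets NumPy infer the 64 columns.
n_samples = len(digits.images)
flat = digits.images.reshape((n_samples, -1))
print(flat.shape)

# Check whether the flattened images match the 2D array in digits.data.
print(np.array_equal(flat, digits.data))
```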
Loading from external datasets
To load from an external dataset, please refer to loading external datasets.
2 Dataset loading utilities
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.
This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the ‘real world’.
To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data.
General dataset API
There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.
The dataset loaders. They can be used to load small standard datasets, described in the Toy datasets section.
The dataset fetchers. They can be used to download and load larger datasets, described in the Real world datasets section.
Both loader and fetcher functions return a Bunch object holding at least two items: an array of shape n_samples * n_features with key data (except for 20newsgroups) and a numpy array of length n_samples, containing the target values, with key target.
The Bunch object is a dictionary that exposes its keys as attributes. For more information about the Bunch object, see Bunch.
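To illustrate what "a dictionary that exposes its keys as attributes" means in practice, the short sketch below reads the same item from a Bunch both ways. The iris loader is used only as a convenient example; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# Dictionary-style and attribute-style access refer to the same item.
data_by_key = bunch["data"]
data_by_attr = bunch.data
print(data_by_key is data_by_attr)

# Membership tests work as they do for a regular dictionary.
print("target" in bunch)
```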
It’s also possible for almost all of these functions to constrain the output to be a tuple containing only the data and the target, by setting the return_X_y parameter to True.
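A brief sketch of the return_X_y option, again using the iris loader as an example. The names X and y follow the usual scikit-learn convention and are not mandated by the API.

```python
from sklearn.datasets import load_iris

# With return_X_y=True the loader returns (data, target) instead of a Bunch.
X, y = load_iris(return_X_y=True)

print(X.shape)
print(y.shape)
```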
The datasets also contain a full description in their DESCR attribute and some contain feature_names and target_names. See the dataset descriptions below for details.
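The metadata attributes can be inspected as follows. This sketch assumes the iris dataset, which is one of those that provide feature_names and target_names; other datasets may lack these optional fields.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# DESCR is a plain string; print only the beginning to keep the output short.
print(iris.DESCR[:200])

# Optional metadata: one name per feature column and one per target class.
print(iris.feature_names)
print(list(iris.target_names))
```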
The dataset generation functions. They can be used to generate controlled synthetic datasets, described in the Generated datasets section.
These functions return a tuple (X, y) consisting of an n_samples * n_features numpy array X and an array of length n_samples containing the targets y.
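As an illustrative sketch of a generation function, the example below uses make_classification to produce a synthetic classification problem of a chosen size. The specific values for n_samples, n_features, and random_state are arbitrary choices made for this example.

```python
from sklearn.datasets import make_classification

# Generate a synthetic dataset with a controlled number of samples and features.
# random_state fixes the random draw so that the result is reproducible.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)
```

Varying n_samples and n_features in such a call is one way to study how an algorithm scales while the statistical properties of the data stay under control.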
In addition, there are also miscellaneous tools to load datasets of other formats or from other locations, described in the Loading other datasets section.