《Feature Engineering for Machine Learning》Chapter 2

Content from 《Feature Engineering for Machine Learning》

Scalars, Vectors, and Spaces

2.2 Dealing with Counts

2.2.1 Binarization

    In the Million Song Dataset, the raw listen count is not a robust measure of user taste. (In statistical lingo, robustness means that the method works under a wide variety of conditions.) Users have different listening habits. Some people might put their favorite songs on infinite loop, while others might savor them only on special occasions. We can't necessarily say that someone who listens to a song 20 times must like it twice as much as someone else who listens to it 10 times.

    A more robust representation of user preference is to binarize the count and clip all counts greater than 1 to 1. In other words, if the user listened to a song at least once, then we count it as the user liking the song. This way, the model will not need to spend cycles on predicting the minute differences between the raw counts. The binary target is a simple and robust measure of user preference.
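The clipping described above can be sketched in a few lines of NumPy. The listen counts here are made-up illustrative values, not taken from the Million Song Dataset:

```python
import numpy as np

# Hypothetical raw listen counts for five user-song pairs
listen_counts = np.array([0, 1, 7, 20, 3])

# Binarize: any count of 1 or more becomes 1, so the feature
# records only whether the user listened at all
binarized = (listen_counts > 0).astype(int)
print(binarized)  # [0 1 1 1 1]
```

The comparison produces a boolean array, and the cast to int yields the 0/1 feature the model actually sees.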

2.2.2 Quantization or Binning

    Raw counts that span several orders of magnitude are problematic for many models. In a linear model, the same linear coefficient would have to work for all possible values of the count. Large counts could also wreak havoc in unsupervised learning methods such as k-means clustering, which uses Euclidean distance as a similarity function to measure the similarity between data points. A large count in one element of the data vector would outweigh the similarity in all other elements, which could throw off the entire similarity measurement.

    One solution is to contain the scale by quantizing the count. In other words, we group the counts into bins, and get rid of the actual count values. Quantization maps a continuous number to a discrete one. We can think of the discretized numbers as an ordered sequence of bins that represent a measure of intensity.

1. Fixed-width binning

    With fixed-width binning, each bin contains a specific numeric range. The ranges can be custom designed or automatically segmented, and they can be linearly scaled or exponentially scaled. For example, we can group people into age ranges by decade: 0–9 years old in bin 1, 10–19 years in bin 2, etc. To map from the count to the bin, we simply divide by the width of the bin and take the integer part.

    When the numbers span multiple magnitudes, it may be better to group by powers of 10 (or powers of any constant): 0–9, 10–99, 100–999, 1000–9999, etc. The bin widths grow exponentially, going from O(10), to O(100), O(1000), and beyond.

>>> import numpy as np
# Generate 20 random integers uniformly between 0 and 99
>>> small_counts = np.random.randint(0, 100, 20)
>>> small_counts
array([30, 64, 49, 26, 69, 23, 56,  7, 69, 67, 87, 14, 67, 33, 88, 77, 75,
       47, 44, 93])
# Map to evenly spaced bins 0-9 by division
>>> np.floor_divide(small_counts, 10)
array([3, 6, 4, 2, 6, 2, 5, 0, 6, 6, 8, 1, 6, 3, 8, 7, 7, 4, 4, 9], dtype=int32)
# An array of counts that span several magnitudes
>>> large_counts = [296, 8286, 64011, 80, 3, 725, 867, 2215, 7689, 11495, 91897,
... 44, 28, 7971, 926, 122, 22222]
# Map to exponential-width bins via the log function
>>> np.floor(np.log10(large_counts))
array([ 2.,  3.,  4.,  1.,  0.,  2.,  2.,  3.,  3.,  4.,  4.,  1.,  1.,
        3.,  2.,  2.,  4.])


2. Quantile binning

    Fixed-width binning is easy to compute. But if there are large gaps in the counts, then there will be many empty bins with no data. This problem can be solved by adaptively positioning the bins based on the distribution of the data. This can be done using the quantiles of the distribution.

    Quantiles are values that divide the data into equal portions. For example, the median divides the data in halves; half the data points are smaller and half larger than the median. The quartiles divide the data into quarters, the deciles into tenths, etc.

    pandas.DataFrame.quantile and pandas.Series.quantile compute the quantiles. pandas.qcut maps data into a desired number of quantiles.

# Continue example 2-3 with large_counts
>>> import pandas as pd
# Map the counts to quartiles
>>> pd.qcut(large_counts, 4, labels=False)
array([1, 2, 3, 0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 2, 1, 0, 3], dtype=int64)
# Compute the quantiles themselves
>>> large_counts_series = pd.Series(large_counts)
>>> large_counts_series.quantile([0.25, 0.5, 0.75])
0.25     122.0
0.50     926.0
0.75    8286.0
dtype: float64


2.3 Log Transformation

......
