An overconfident model is poorly calibrated: its predicted probabilities are consistently higher than its actual accuracy. For example, it may predict a probability of 0.9 for inputs on which its accuracy is only 0.6.
Notice that models with small test errors can still be overconfident, and therefore can benefit from label smoothing.
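To see this concretely, here is a minimal NumPy sketch (the function name and the arguments `probs` and `labels` are hypothetical) that compares a model's average confidence with its accuracy:

```python
import numpy as np

def confidence_vs_accuracy(probs, labels):
    # probs: [N, K] array of predicted class probabilities (rows sum to 1)
    # labels: [N] array of true class indices
    preds = probs.argmax(axis=1)           # predicted classes
    confidence = probs.max(axis=1).mean()  # average top-class probability
    accuracy = (preds == labels).mean()    # fraction of correct predictions
    return confidence, accuracy            # confidence >> accuracy signals overconfidence
```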
Label smoothing replaces the one-hot encoded label vector y_hot with a mixture of y_hot and the uniform distribution:

y_ls = (1 - α) * y_hot + α / K

where K is the number of label classes, and α is a hyperparameter that determines the amount of smoothing. If α = 0, we recover the original one-hot encoded y_hot. If α = 1, we get the uniform distribution.
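As a quick sanity check of the formula, a small NumPy sketch (the array values are made up for illustration):

```python
import numpy as np

def smooth(y_hot, alpha):
    K = y_hot.size  # number of label classes
    return (1 - alpha) * y_hot + alpha / K

y_hot = np.array([0., 0., 1.])  # one-hot label, K = 3
print(smooth(y_hot, 0.0))  # [0. 0. 1.]                -> original one-hot
print(smooth(y_hot, 0.1))  # ~[0.0333 0.0333 0.9333]
print(smooth(y_hot, 1.0))  # ~[0.3333 0.3333 0.3333]   -> uniform distribution
```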
Reference implementation:
```python
def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing. See 5.4 and https://arxiv.org/abs/1512.00567.

    inputs: 3d tensor. [N, T, V], where V is the number of vocabulary.
    epsilon: Smoothing rate.

    For example,

    import tensorflow as tf
    inputs = tf.convert_to_tensor([[[0, 0, 1],
                                    [0, 1, 0],
                                    [1, 0, 0]],
                                   [[1, 0, 0],
                                    [1, 0, 0],
                                    [0, 1, 0]]], tf.float32)
    outputs = label_smoothing(inputs)
    with tf.Session() as sess:
        print(sess.run([outputs]))
    >>
    [array([[[ 0.03333334,  0.03333334,  0.93333334],
             [ 0.03333334,  0.93333334,  0.03333334],
             [ 0.93333334,  0.03333334,  0.03333334]],
            [[ 0.93333334,  0.03333334,  0.03333334],
             [ 0.93333334,  0.03333334,  0.03333334],
             [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]
    '''
    V = inputs.get_shape().as_list()[-1]  # V: number of classes (channels)
    return ((1 - epsilon) * inputs) + (epsilon / V)
```
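Note that the snippet above targets TensorFlow 1.x (hence tf.Session). In TensorFlow 2, label smoothing is built into the Keras cross-entropy losses, so you rarely need to implement it by hand; for example:

```python
import tensorflow as tf

# label_smoothing here plays the role of α in the formula above
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

y_true = tf.constant([[0., 0., 1.]])
y_pred = tf.constant([[0.1, 0.1, 0.8]])
print(loss_fn(y_true, y_pred).numpy())  # cross-entropy against the smoothed labels
```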