Sentiment analysis on the IMDB dataset.
The data is downloaded through keras.datasets; pay attention to the structure of what is returned, then preprocess it.
    from keras.datasets import imdb
    from keras import preprocessing

    # Number of words to consider as features
    max_features = 10000
    # Cut texts after this number of words
    # (among top max_features most common words)
    maxlen = 20

    # Load the data as lists of integers.
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
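x_train
- type: numpy.ndarray
- shape: (25000,); each review has a different length, so the sequences must be zero-padded or truncated to a common length
- every review is a list of integers, and each integer corresponds to a word
y_train: binary labels, 0 (negative) and 1 (positive)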
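To make the "integers correspond to words" idea concrete, here is a minimal sketch of a frequency-rank encoding on a toy corpus. The corpus and the helpers `build_word_index` and `encode` are hypothetical illustrations, not part of Keras, and the real IMDB encoding differs in detail (for example, it reserves some low indices for special markers).

```python
from collections import Counter

# Toy corpus standing in for the reviews (illustrative data, not from IMDB).
reviews = [
    "this movie was great great fun",
    "this movie was terrible",
    "great cast but terrible plot",
]


def build_word_index(texts):
    """Rank words by frequency: the most common word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(text, word_index, num_words, oov=0):
    """Replace each word by its rank; ranks above num_words become `oov`."""
    return [
        word_index[w] if word_index.get(w, num_words + 1) <= num_words else oov
        for w in text.split()
    ]


word_index = build_word_index(reviews)
# Keep only the 5 most frequent words, in the spirit of max_features.
encoded = [encode(text, word_index, num_words=5) for text in reviews]

for seq in encoded:
    print(len(seq), seq)
```

The encoded reviews come out with different lengths, which is why a padding or truncation step is needed before they can be stacked into one 2D array.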
The sequence lengths need to be made uniform:
    # This turns our lists of integers
    # into a 2D integer tensor of shape `(samples, maxlen)`
    x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
    x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
The length is set to maxlen=20.
The resulting matrix can be used directly as the input to an Embedding layer.
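Why the padded matrix fits an Embedding layer can be illustrated with a plain-Python sketch of an embedding lookup: every integer selects one row of a table, so an input of shape (samples, maxlen) becomes an output of shape (samples, maxlen, dim). This is only a conceptual sketch, not Keras code; `make_embedding_table`, `embed`, the sample batch, and `embedding_dim = 8` are assumptions chosen for illustration.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a (vocab_size, dim) table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(batch, table):
    """Look up one vector per integer: (samples, maxlen) -> (samples, maxlen, dim)."""
    return [[table[idx] for idx in row] for row in batch]


max_features = 10000  # vocabulary size, as in the text
maxlen = 20           # sequence length, as in the text
embedding_dim = 8     # assumed value, not given in the text

table = make_embedding_table(max_features, embedding_dim)

# Two made-up sequences, already padded at the start to length maxlen.
batch = [
    [0] * 18 + [14, 22],
    [0] * 17 + [5, 9, 3],
]

out = embed(batch, table)
print(len(out), len(out[0]), len(out[0][0]))
```

Because every row has the same length, the lookup yields a regular 3D structure; ragged input would not.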
Syntax:
    keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)
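Converts a list of nb_samples sequences (sequences of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps). If maxlen is given, nb_timesteps=maxlen; otherwise it is the length of the longest sequence. Sequences shorter than nb_timesteps are padded with zeros until they reach that length, and sequences longer than nb_timesteps are truncated to fit. Where padding and truncation take place is determined by padding and truncating respectively; both default to 'pre', that is, the start of the sequence.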
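Parameters:
- sequences: a list of lists of integers or floats
- maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated and shorter ones are padded with zeros (at the start by default).
- dtype: the data type of the returned numpy array
- padding: 'pre' or 'post', whether to pad at the start or at the end of a sequence
- truncating: 'pre' or 'post', whether to remove values from the start or from the end of a sequence that is too long
- value: float, the value used for padding instead of the default 0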
Returns:
A 2D tensor of shape (nb_samples, nb_timesteps)
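The padding and truncation rules above can be sketched in plain Python. `pad_sequences_sketch` is a hypothetical re-implementation written only to illustrate the semantics; it returns nested lists rather than a numpy array and ignores dtype, so it is not a substitute for the Keras function.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (illustrative only)."""
    if padding not in ("pre", "post") or truncating not in ("pre", "post"):
        raise ValueError("padding and truncating must be 'pre' or 'post'")
    if maxlen is None:
        # Without maxlen, the target length is that of the longest sequence.
        maxlen = max((len(seq) for seq in sequences), default=0)

    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' puts them behind.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


a = [[2, 3], [3, 4, 6], [7, 8, 9, 10]]
print(pad_sequences_sketch(a, maxlen=3))
print(pad_sequences_sketch(a, maxlen=3, padding="post", truncating="post"))
print(pad_sequences_sketch(a))
```

With the defaults, both the fill values and the dropped items are at the start of the sequence, so the most recent tokens are the ones kept.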
Example:
    >>> import numpy as np
    >>> # dtype=object is required for a ragged array in recent NumPy versions
    >>> a = np.array([[2, 3], [3, 4, 6], [7, 8, 9, 10]], dtype=object)
    >>> a
    array([list([2, 3]), list([3, 4, 6]), list([7, 8, 9, 10])], dtype=object)
    >>> import keras
    Using TensorFlow backend.
    >>> # default: pad at the start
    >>> b = keras.preprocessing.sequence.pad_sequences(a, maxlen=10)
    >>> b
    array([[ 0,  0,  0,  0,  0,  0,  0,  0,  2,  3],
           [ 0,  0,  0,  0,  0,  0,  0,  3,  4,  6],
           [ 0,  0,  0,  0,  0,  0,  7,  8,  9, 10]])
    >>> # pad at the end
    >>> c = keras.preprocessing.sequence.pad_sequences(a, maxlen=10, padding='post')
    >>> c
    array([[ 2,  3,  0,  0,  0,  0,  0,  0,  0,  0],
           [ 3,  4,  6,  0,  0,  0,  0,  0,  0,  0],
           [ 7,  8,  9, 10,  0,  0,  0,  0,  0,  0]])
    >>> # pad at the end, truncate at the start (default truncating='pre')
    >>> d = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post')
    >>> d
    array([[ 2,  3,  0],
           [ 3,  4,  6],
           [ 8,  9, 10]])
    >>> # pad and truncate at the start
    >>> e = keras.preprocessing.sequence.pad_sequences(a, maxlen=3)
    >>> e
    array([[ 0,  2,  3],
           [ 3,  4,  6],
           [ 8,  9, 10]])
    >>> # pad and truncate at the end
    >>> f = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post', truncating='post')
    >>> f
    array([[2, 3, 0],
           [3, 4, 6],
           [7, 8, 9]])