Background
With feature_column, a shared embedding is very easy to implement:
tf.feature_column.shared_embedding_columns(shared_column_list, embedding_size)
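For context, a minimal sketch of that feature_column usage (the column names and bucket size here are made-up placeholders, not from any real dataset):

import tensorflow as tf

watched = tf.feature_column.categorical_column_with_identity(
    'watched_video_id', num_buckets=1000)     # hypothetical feature
impression = tf.feature_column.categorical_column_with_identity(
    'impression_video_id', num_buckets=1000)  # hypothetical feature
# Both columns are looked up in one shared (1000, 10) embedding matrix.
shared_columns = tf.feature_column.shared_embedding_columns(
    [watched, impression], dimension=10)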
But after switching to Keras, there is no equivalent interface.
After some digging, I implemented a shared embedding as follows.
Core code
from keras.layers import Input, Embedding

first_input = Input(shape=your_shape_tuple)
second_input = Input(shape=your_shape_tuple)
...
# One Embedding layer owns one weight matrix; calling it on several
# inputs makes them all share that matrix.
embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_size)
first_input_encoded = embedding_layer(first_input)
second_input_encoded = embedding_layer(second_input)
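A concrete, runnable version of the same idea (the vocabulary size, sequence lengths, and feature names below are assumptions for illustration only):

import tensorflow as tf

clicked_ids = tf.keras.layers.Input(shape=(5,), dtype='int32')  # hypothetical feature
bought_ids = tf.keras.layers.Input(shape=(3,), dtype='int32')   # hypothetical feature
shared_embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8)
clicked_encoded = shared_embedding(clicked_ids)  # shape (None, 5, 8)
bought_encoded = shared_embedding(bought_ids)    # shape (None, 3, 8)
# Both branches read from a single trainable (100, 8) matrix:
print(len(shared_embedding.weights))  # 1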
How the Embedding lookup works
import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x: run eagerly so tensors print directly
import numpy as np
np.random.seed(0)  # fixed seed so the example is reproducible
Constructing the data
test_array = np.random.randint(10, size=(2, 2, 5))
print("特征的原始值:\n", test_array)
Raw feature values:
 [[[5 0 3 3 7]
  [9 3 5 2 4]]

 [[7 6 8 8 1]
  [6 7 7 8 1]]]
The embedding layer
embedding_layer = tf.keras.layers.Embedding(input_dim=10,  # vocabulary size: ids 0..9
                                            output_dim=5,  # embedding dimension
                                            # input_length=10
                                            )
# Before the layer has been called, its weights are not built yet,
# so print(embedding_layer.embeddings) would fail here.
# A slice with a colon keeps the dimension; indexing with a single
# integer drops it.
input_array = test_array[:, 1, :]
print("Input:")
print(input_array)
outputs = embedding_layer(input_array)
print("All embeddings; they can only be printed after the layer has been called:")
print(embedding_layer.embeddings)
Input:
[[9 3 5 2 4]
 [6 7 7 8 1]]
All embeddings; they can only be printed after the layer has been called:
<tf.Variable 'embedding_19/embeddings:0' shape=(10, 5) dtype=float32, numpy=
array([[-0.01669228, -0.03277595, -0.02553627,  0.04471892,  0.02014923],
       [ 0.03088691,  0.01974137,  0.01614926,  0.02858892, -0.0177582 ],
       [-0.00453576,  0.01963988, -0.01710666,  0.01563035,  0.01191955],
       [-0.03036073, -0.02238812,  0.03199982,  0.04594323,  0.03707203],
       [ 0.04210759,  0.02329702, -0.00898039, -0.00615982,  0.04915803],
       [-0.02301152, -0.04580457, -0.02101959, -0.04137435,  0.00952489],
       [-0.00083036,  0.04911815, -0.00913552,  0.00430775, -0.03765134],
       [-0.013464  ,  0.02491002,  0.0139573 ,  0.03407213,  0.02752925],
       [ 0.01323656, -0.0487511 , -0.01865114,  0.02167027, -0.00414805],
       [ 0.03010296, -0.02651706, -0.00619526, -0.02087659, -0.02629858]],
      dtype=float32)>
Given our input [[9 3 5 2 4] [6 7 7 8 1]], calling outputs = embedding_layer(input_array) simply
looks up rows in the full embedding matrix above, by index:
Value 9: index 9, the 10th row, i.e. [ 0.03010296, -0.02651706, -0.00619526, -0.02087659, -0.02629858]
Value 3: index 3, the 4th row, i.e. [-0.03036073, -0.02238812,  0.03199982,  0.04594323,  0.03707203]
Value 5: index 5, the 6th row, i.e. [-0.02301152, -0.04580457, -0.02101959, -0.04137435,  0.00952489]
...
Verification
print("\n 多值的序列特征 Output \n", outputs)
Multi-valued sequence feature output:
 tf.Tensor(
[[[ 0.03010296 -0.02651706 -0.00619526 -0.02087659 -0.02629858]
  [-0.03036073 -0.02238812  0.03199982  0.04594323  0.03707203]
  [-0.02301152 -0.04580457 -0.02101959 -0.04137435  0.00952489]
  [-0.00453576  0.01963988 -0.01710666  0.01563035  0.01191955]
  [ 0.04210759  0.02329702 -0.00898039 -0.00615982  0.04915803]]

 [[-0.00083036  0.04911815 -0.00913552  0.00430775 -0.03765134]
  [-0.013464    0.02491002  0.0139573   0.03407213  0.02752925]
  [-0.013464    0.02491002  0.0139573   0.03407213  0.02752925]
  [ 0.01323656 -0.0487511  -0.01865114  0.02167027 -0.00414805]
  [ 0.03088691  0.01974137  0.01614926  0.02858892 -0.0177582 ]]], shape=(2, 5, 5), dtype=float32)
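The row-by-row comparison above can also be done programmatically. A small check, using the variables already defined in this post (nothing new, just gathering rows of the same matrix by index):

# Do the index lookup by hand with tf.gather, then compare
# against what the layer produced.
manual_lookup = tf.gather(embedding_layer.embeddings, input_array)
print(np.allclose(outputs.numpy(), manual_lookup.numpy()))  # True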
See? Exactly as expected, folks.
Shared embedding
Now construct more data. To share the embedding, just call the same embedding_layer again: every call looks up rows in embedding_layer.embeddings, so all callers share one matrix, which is exactly a shared_embedding.
input_array_1col = test_array[:, 1, 1]
print("\nSingle-valued feature (one column):", input_array_1col)
out_put_1col = embedding_layer(input_array_1col)
print("Result for the 1-D feature:\n", out_put_1col)
Single-valued feature (one column): [3 7]
Result for the 1-D feature:
 tf.Tensor(
[[-0.03036073 -0.02238812  0.03199982  0.04594323  0.03707203]
 [-0.013464    0.02491002  0.0139573   0.03407213  0.02752925]], shape=(2, 5), dtype=float32)
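One caveat when porting from feature_column: shared_embedding_columns also pools multi-valued features (its default combiner is 'mean'), whereas the Keras layer returns one vector per id. A sketch of reproducing that pooling, reusing the outputs tensor from the verification step above:

# Mimic feature_column's default combiner='mean': average the looked-up
# vectors over the sequence axis, giving one pooled vector per example.
pooled = tf.reduce_mean(outputs, axis=1)
print(pooled.shape)  # (2, 5)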
A surprise