PySpark code practice 11: VectorAssembler

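VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for merging raw features and features produced by different feature transformers into one feature vector, which can then be used to train ML models such as logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, the boolean type, and the vector type. In each row, the values of the input columns are concatenated into one vector in the order specified.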
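
To make the concatenation rule concrete, here is a minimal pure-Python sketch of what happens to a single row. It is only a conceptual illustration, not Spark's implementation: the helper `assemble_row` and the column names `hour`, `mobile`, `userFeatures` and `clicked` are hypothetical, and a vector is modeled as a plain list of floats.

```python
from numbers import Number
from typing import Any, Dict, List


def assemble_row(row: Dict[str, Any], input_cols: List[str]) -> List[float]:
    """Concatenate the values of input_cols, in the given order, into one flat list.

    Hypothetical helper that mimics the rule described above:
    numbers and booleans contribute one slot, vectors contribute all their slots.
    """
    features: List[float] = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, Number):
            # bool counts as a number in Python: True -> 1.0, False -> 0.0
            features.append(float(value))
        else:
            # Treat anything else as a vector (here: a list of floats)
            features.extend(float(x) for x in value)
    return features


# Illustrative row with a numeric, a float, a vector and a boolean column
row = {"hour": 18, "mobile": 1.0, "userFeatures": [0.0, 10.0, 0.5], "clicked": True}
print(assemble_row(row, ["hour", "mobile", "userFeatures", "clicked"]))
# [18.0, 1.0, 0.0, 10.0, 0.5, 1.0]
```

Changing the order of the column list changes the order of the slots in the result, which is why the order of inputCols matters when the assembled vector is interpreted later.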


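from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler

# Reuses the existing session when run in the pyspark shell or a notebook
spark = SparkSession.builder.getOrCreate()
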
# Toy data: "amt" is a one-element dense vector, "value" is a string category
df3 = spark.createDataFrame([
    (Vectors.dense(10.2,), "a"),
    (Vectors.dense(1.6,), "b"),
    (Vectors.dense(23.6,), "c"),
    (Vectors.dense(35.7,), "e"),
    (Vectors.dense(4.8,), "e"),
    (Vectors.dense(50.8,), "e")
], ["amt", "value"])


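# Scale "amt" to unit standard deviation without centering it
scaler = StandardScaler(inputCol="amt", outputCol="scaledAmt", withStd=True, withMean=False)

# Index the string categories, then one-hot encode the indices;
# handleInvalid="keep" assigns labels not seen during fit to an extra index
stringIndexer = StringIndexer(inputCol="value", outputCol="valueIndex").setHandleInvalid("keep")
encoder = OneHotEncoder(inputCol="valueIndex", outputCol="valueIndexVec")

# Fit the three stages together and apply them to the data
pipeline = Pipeline(stages=[stringIndexer, encoder, scaler])
model = pipeline.fit(df3)
transformed = model.transform(df3)
transformed.show()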

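# Concatenate the one-hot vector and the scaled amount, in that order, into "features"
assembler = VectorAssembler(
    inputCols=["valueIndexVec", "scaledAmt"],
    outputCol="features")

# VectorAssembler is a pure transformer, so no fit step is needed
output = assembler.transform(transformed)
output.select("features", "amt").show(truncate=False)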
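
As a rough check on the last slot of each assembled vector, the scaledAmt values can be approximated outside Spark. The sketch below assumes that StandardScaler with withStd=True and withMean=False divides each value by the corrected (n - 1) sample standard deviation of the column, as the Spark documentation describes; it uses only the standard library and the amt values from the example above.

```python
import statistics

# The "amt" values from the example DataFrame above
amt = [10.2, 1.6, 23.6, 35.7, 4.8, 50.8]

# Assumption: the scaler uses the corrected sample standard deviation (n - 1)
std = statistics.stdev(amt)

# withMean=False: no centering, each value is only divided by the std
scaled = [x / std for x in amt]

for original, value in zip(amt, scaled):
    print(f"{original:>5} -> {value:.4f}")
```

Because the values are not centered, they keep their sign and relative order, and only the spread is normalized to one.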
