PySpark Feature Engineering: Binarizer

Binarizer: binarizes a column of continuous features against a given threshold.

class pyspark.ml.feature.Binarizer(threshold=0.0, inputCol=None, outputCol=None) [[source]](https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer)

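threshold: applies to a single column; thresholds: applies to multiple columns (not supported in the version used here, 2.4.5)

threshold is the cutoff value: values greater than the threshold are mapped to 1.0, and values less than or equal to it are mapped to 0.0.

inputCol: applies to a single column; inputCols: applies to multiple columns (not supported in the version used here, 2.4.5)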
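
To make the threshold rule concrete, here is a minimal plain-Python sketch of what happens to each value. The function name binarize and the sample inputs are hypothetical and only illustrate the comparison; this is not Spark's implementation.

```python
def binarize(value: float, threshold: float = 0.0) -> float:
    """Plain-Python stand-in for the rule applied to each value."""
    # Strictly greater than the threshold -> 1.0; otherwise (including equal) -> 0.0.
    return 1.0 if value > threshold else 0.0


# Hypothetical inputs chosen to show the boundary behaviour.
print(binarize(0.5))                 # above the default threshold of 0.0
print(binarize(0.0))                 # equal to the default threshold
print(binarize(2.4, threshold=2.4))  # equal to a custom threshold
print(binarize(2.5, threshold=2.4))  # just above a custom threshold
```

Note that a value exactly equal to the threshold falls on the 0.0 side.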

01. Create the SparkSession object

from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("Binarize").master("local[*]").getOrCreate()

02. Create the data

data = spark.createDataFrame([
    (0.1,),
    (2.3,),
    (1.1,),
    (4.2,),
    (2.5,),
    (6.8,),
],["values"])
data.show()

Output:

+------+
|values|
+------+
|   0.1|
|   2.3|
|   1.1|
|   4.2|
|   2.5|
|   6.8|
+------+

03. Create a Binarizer object, specifying the input column, threshold, and output column

binarizer = Binarizer(threshold=2.4,inputCol="values",outputCol="features")

04. Transform the original data and view the result

res = binarizer.transform(data)
res.show()

Output:

+------+--------+
|values|features|
+------+--------+
|   0.1|     0.0|
|   2.3|     0.0|
|   1.1|     0.0|
|   4.2|     1.0|
|   2.5|     1.0|
|   6.8|     1.0|
+------+--------+
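
The table can be cross-checked without Spark. The sketch below recomputes the features column from the values in step 02 and the threshold in step 03; the variable names are illustrative only.

```python
# Values and threshold copied from steps 02 and 03.
values = [0.1, 2.3, 1.1, 4.2, 2.5, 6.8]
threshold = 2.4

# (value, feature) pairs computed with the "strictly greater than" rule.
rows = [(v, 1.0 if v > threshold else 0.0) for v in values]
for value, feature in rows:
    print(f"{value:>6} | {feature}")
```

Only 4.2, 2.5 and 6.8 exceed 2.4, which matches the three 1.0 rows in the table.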

05. Inspect the schema

res.printSchema()

Output:

root
 |-- values: double (nullable = true)
 |-- features: double (nullable = true)