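Spark MLlib provides a statistics method called colStats(). Calling it returns an instance of MultivariateStatisticalSummary, from which we can read column-wise statistics: the maximum, minimum, mean, and variance of each column, the row count, and so on.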
Sample data (sample_stat.txt, tab-separated, one row per line):
1	2	3	4	5
6	7	1	5	9
3	5	6	3	1
3	1	1	5	6
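import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Load the tab-separated file and parse each line into an array of doubles
val data_path = "file:///Users/walle/Documents/D3/sparkmlib/sample_stat.txt"
val data = sc.textFile(data_path).map(_.split("\t")).map(row => row.map(_.toDouble))
// Convert each row into a dense MLlib vector
val data1 = data.map(row => Vectors.dense(row))
// Compute the column-wise summary statistics
val stat1 = Statistics.colStats(data1)
stat1.max       // maximum of each column
stat1.min       // minimum of each column
stat1.mean      // mean of each column
stat1.variance  // sample variance of each column
stat1.normL1    // L1 norm of each column
stat1.normL2    // L2 (Euclidean) norm of each column
stat1.count     // number of rows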
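
To make the meaning of each field concrete, the following is a minimal pure-Scala sketch that computes the same kinds of column statistics by hand, without Spark. It is not MLlib's implementation: the object `ColStatsSketch`, the case class `Summary`, and the small input matrix in the usage note are hypothetical names and data introduced only for illustration. The sketch assumes every row has the same length, and it returns a variance of 0.0 for a single row, which is a choice made here.

```scala
// Hypothetical helper (not part of MLlib): column-wise statistics computed by hand.
object ColStatsSketch {
  final case class Summary(
      count: Long,
      max: Vector[Double],
      min: Vector[Double],
      mean: Vector[Double],
      variance: Vector[Double],
      normL1: Vector[Double],
      normL2: Vector[Double])

  def colStats(rows: Seq[Array[Double]]): Summary = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.length
    // Transpose so that each inner vector holds one column (assumes equal row lengths)
    val cols: Vector[Vector[Double]] = rows.map(_.toVector).toVector.transpose
    val mean = cols.map(_.sum / n)
    // Unbiased sample variance (divide by n - 1); 0.0 for a single row in this sketch
    val variance = cols.zip(mean).map { case (c, m) =>
      if (n > 1) c.map(x => (x - m) * (x - m)).sum / (n - 1) else 0.0
    }
    Summary(
      count = n.toLong,
      max = cols.map(_.max),
      min = cols.map(_.min),
      mean = mean,
      variance = variance,
      // L1 norm: sum of absolute values in the column
      normL1 = cols.map(c => c.map(x => math.abs(x)).sum),
      // L2 norm: square root of the sum of squares in the column
      normL2 = cols.map(c => math.sqrt(c.map(x => x * x).sum)))
  }
}
```

In spark-shell or any Scala REPL, `ColStatsSketch.colStats(Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 9.0)))` yields a mean of `Vector(3.0, 5.0)` and a variance of `Vector(4.0, 13.0)` for this hypothetical three-row input.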