Posted to user@spark.apache.org by Sa...@wellsfargo.com on 2015/07/03 17:04:06 UTC
Spark-csv into labeled points with null values
Hello all,
I am learning Scala with Spark and working through some applications with data I have. Please allow me to ask a couple of questions:
spark-csv: The data I have is not malformed, but some rows contain empty values. These rows are properly comma-separated, so they are not caught by "DROPMALFORMED" mode.
The empty values are read in as nulls. My final goal is to create LabeledPoint vectors for MLlib, so my steps are:
a. load csv
b. cast column types to have a proper DataFrame schema
c. apply map() to create a LabeledPoint with a dense vector, using map( Row => Row.getDouble(col_index) )
To this point:
res173: org.apache.spark.mllib.regression.LabeledPoint = (-1.530132691E9,[162.89431,13.55811,18.3346818,-1.6653182])
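The mapping in step c can be sketched without a cluster. This is a minimal illustration only: the case classes below are hypothetical stand-ins for MLlib's LabeledPoint and dense Vector, and the layout (first column = label, rest = features) is assumed from the output above.

```scala
// Hypothetical stand-ins for the MLlib types, for illustration only
case class DenseVec(values: Array[Double])
case class Labeled(label: Double, features: DenseVec)

// Assumed layout: first column is the label, remaining columns are the features
def toLabeled(row: IndexedSeq[Double]): Labeled =
  Labeled(row.head, DenseVec(row.tail.toArray))

val lp = toLabeled(Vector(-1.530132691e9, 162.89431, 13.55811, 18.3346818, -1.6653182))
// lp.label is the first column; lp.features holds the remaining four values
```

With the real API the same shape comes from Row.getDouble(i) over each column and Vectors.dense over the features.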
When I run the following code:
val model = new LogisticRegressionWithLBFGS().
setNumClasses(2).
setValidateData(true).
run(data_map)
java.lang.RuntimeException: Failed to check null bit for primitive double value.
Debugging this, I am pretty sure the cause is rows that look like: -2.593849123898,392.293891,,,,
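To make the failure mode concrete: splitting such a row yields empty strings for the trailing fields, and those become nulls, on which a primitive getDouble cannot succeed. A hedged pure-Scala sketch (parseFields is a hypothetical helper, not a Spark API) shows how the empty fields surface:

```scala
// Hypothetical helper: parse a comma-separated line, mapping empty fields to None.
// split(",", -1) keeps trailing empty fields, matching how the row is shaped.
def parseFields(line: String): Array[Option[Double]] =
  line.split(",", -1).map { s =>
    val t = s.trim
    if (t.isEmpty) None else Some(t.toDouble)
  }

val fields = parseFields("-2.593849123898,392.293891,,,,")
// fields(0) and fields(1) are defined; fields(2) onward are None
```

In DataFrame terms, guarding with Row.isNullAt(i) before calling Row.getDouble(i), or filtering such rows out first, would avoid reading a null as a primitive.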
Any suggestions on how to get around this?
Saif