Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/07/11 00:28:44 UTC

incorrect labels being read by MLUtils.loadLabeledData()

Hi,

I have a CSV data file, which I have organized in the following format to
be read as a LabeledPoint (following the example in
mllib/data/sample_tree_data.csv):

1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2

The first column is the binary label (1 or 0) and the remaining columns are
features. I am using the logistic regression classifier in MLlib to build a
model from the training data and predict the (binary) class of the test
data. I use MLUtils.loadLabeledData to read the data file. My prediction
accuracy is quite low compared to the results I got for the same data in
R, so I tried to debug by first verifying that the labeled data is being
read correctly.

I find that some of the labels are not read correctly. For example, the
first 40 points of the training data have a class of 1, whereas the training
data read by loadLabeledData has label 0 for point 12 and point 14. I would
like to know whether this is because of the distributed algorithm that MLlib
uses or because something is wrong with the format shown above.

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/incorrect-labels-being-read-by-MLUtils-loadLabeledData-tp9356.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: incorrect labels being read by MLUtils.loadLabeledData()

Posted by Yana Kadiyska <ya...@gmail.com>.

I do not believe the order of points in a distributed RDD is guaranteed in
any way, so printing the RDD back will not necessarily give you the points
in file order. For a simple test, you can add a last column that is an id
(make it a double and put it in the feature vector), then match each point
back to its input row by that id. If you don't want to go that far, you can
examine the full feature vector carefully: if ordering is the cause, points
12 and 14 should differ from your input CSV in the feature vector as well
as the label.
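
The id-column check described above can be sketched roughly as follows. This
is plain Python rather than Spark, and it assumes the all-comma row layout
from the original message. The helper names (with_row_ids, parse_line,
labels_by_id) are hypothetical and are not part of MLlib, and the shuffle is
only a stand-in for a distributed collection returning rows in arbitrary
order.

```python
import random

# Sample rows from the original message: label first, then four features.
RAW_LINES = [
    "1,5.1,3.5,1.4,0.2",
    "1,4.9,3,1.4,0.2",
    "1,4.7,3.2,1.3,0.2",
    "1,4.6,3.1,1.5,0.2",
]


def with_row_ids(lines):
    """Append each row's file position as a trailing column, stored as a double."""
    return [f"{line},{float(i)}" for i, line in enumerate(lines)]


def parse_line(line):
    """Split an all-comma row into (label, features).

    Hypothetical parser for illustration; it does not reproduce MLUtils.
    """
    values = [float(v) for v in line.split(",")]
    return values[0], values[1:]


def labels_by_id(points):
    """Map the id carried in the last feature back to the label that was read."""
    return {int(features[-1]): label for label, features in points}


tagged = with_row_ids(RAW_LINES)
points = [parse_line(line) for line in tagged]

# Assumption: simulate rows coming back in an order unrelated to the file.
random.Random(0).shuffle(points)

recovered = labels_by_id(points)
expected = {i: float(line.split(",")[0]) for i, line in enumerate(RAW_LINES)}

# Compare by id, not by position, so reordering cannot cause false alarms.
mismatches = sorted(i for i in expected if recovered[i] != expected[i])
print("rows whose label changed:", mismatches)
```

If the list of mismatches is empty, the labels survived loading and the
apparent differences come only from ordering. Any id that does show up
points at a row worth comparing against the input file by hand.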
