Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/07/12 02:12:23 UTC

ML classifier and data format for dataset with variable number of features

Hi,

I need to perform binary classification on an image dataset. Each image is a
data point described by a JSON object. The feature set for each image is a
set of feature vectors, each corresponding to a distinct object in the
image. For example, an image with 5 objects has a feature set of 5 feature
vectors, whereas an image with 3 objects has a feature set of 3 feature
vectors. So the number of feature vectors may differ between images,
although each feature vector has the same number of attributes. The
classification depends on the features of the individual objects, so I
cannot aggregate them all into a flat vector.

I have looked through the MLlib examples, and it appears that the LIBSVM data
format and the LabeledPoint format that MLlib uses require all points to
have the same number of features and read in a flat feature vector. I would
like to know whether any of the MLlib supervised learning classifiers can be
used with JSON data, and whether they can classify points with different
numbers of features as described above.

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ML-classifier-and-data-format-for-dataset-with-variable-number-of-features-tp9486.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ML classifier and data format for dataset with variable number of features

Posted by Xiangrui Meng <me...@gmail.com>.

You can load the dataset as an RDD of JSON objects and use a flatMap to
extract feature vectors at the object level. Then you can filter the
training examples you want for binary classification. If you want to
try multiclass, check out DB's PR at
https://github.com/apache/spark/pull/1379
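
As a rough illustration of the flatMap-then-filter idea, here is a minimal
sketch in plain Python, using only the standard library so it runs without a
cluster. The record layout (the "label" and "objects" fields), the sample
data, and the names to_object_examples and NUM_ATTRIBUTES are assumptions
made for illustration, not details of the actual dataset. It also assumes
each object simply inherits the label of its image, which may or may not
suit the real problem.

```python
import json
from itertools import chain

# Assumed number of attributes per object-level feature vector.
NUM_ATTRIBUTES = 2

# Hypothetical input: one JSON object per image, holding an image-level
# "label" (0 or 1) and an "objects" list with one feature vector per object.
SAMPLE_LINES = [
    '{"id": "img1", "label": 1, "objects": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]}',
    '{"id": "img2", "label": 0, "objects": [[0.7, 0.8], [0.9]]}',
]


def to_object_examples(line):
    """Turn one image record into one (label, features) pair per object."""
    record = json.loads(line)
    label = float(record["label"])
    return [(label, [float(x) for x in vec]) for vec in record["objects"]]


# Local stand-in for a flatMap: every image contributes zero or more
# object-level examples, and the results are concatenated into one list.
flattened = list(
    chain.from_iterable(to_object_examples(line) for line in SAMPLE_LINES)
)

# Filter step: keep only the examples with the expected vector length.
examples = [(label, feats) for label, feats in flattened
            if len(feats) == NUM_ATTRIBUTES]

for label, feats in examples:
    print(label, feats)
```

With PySpark, the same function could be passed to flatMap on an RDD of
JSON lines, and each resulting pair wrapped in a LabeledPoint before
training. Every example then has a fixed-length flat feature vector, which
is what the MLlib classifiers expect.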

Best,
Xiangrui
