Posted to dev@spark.apache.org by Devl Devel <de...@gmail.com> on 2015/01/08 16:58:39 UTC

K-Means And Class Tags

Hi All,

I'm trying a simple K-Means example as per the website:

val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

but I want to write a Java-based validation method first so that missing
values are either omitted or replaced with 0.

import java.util.ArrayList;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.rdd.RDD;

public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
    JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
        public Iterable<Vector> call(String s) {
            String[] split = s.split(",");
            ArrayList<Vector> add = new ArrayList<Vector>();
            if (split.length != 2) {
                // malformed line: replace it with a zero vector
                add.add(Vectors.dense(0, 0));
            } else {
                add.add(Vectors.dense(Double.parseDouble(split[0]),
                        Double.parseDouble(split[1])));
            }
            return add;
        }
    });

    return words.rdd();
}

When I then call it from Scala:

val parsedData=dc.prepareKMeans(data);
val p=parsedData.collect();

I get:

Exception in thread "main" java.lang.ClassCastException:
[Ljava.lang.Object; cannot be cast to
[Lorg.apache.spark.mllib.linalg.Vector;

Why is the class tag Object rather than Vector?

1) How do I get this working correctly using the Java validation example
above? Or
2) how can I modify val parsedData = data.map(s =>
Vectors.dense(s.split(',').map(_.toDouble))) so that lines where
s.split(',') produces fewer than 2 fields are ignored? Or
3) is there a better way to do input validation first?

Using Spark and MLlib:
libraryDependencies += "org.apache.spark" % "spark-core_2.10" %  "1.2.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"

Many thanks in advance
Dev


Re: K-Means And Class Tags

Posted by Joseph Bradley <jo...@databricks.com>.
(After asking around:) retag() is private[spark] in Scala, but Java ignores
the private[X] modifier, making retag (unintentionally) public when called
from Java.

Currently, your solution of retagging from Java is the best hack I can
think of. It may take a bit of engineering to create a proper fix for the
long term.
Joseph


Re: K-Means And Class Tags

Posted by Devl Devel <de...@gmail.com>.
Hi Joseph

Thanks for the suggestion; however, retag is a private method, and when I
call it in Scala:

val retaggedInput = parsedData.retag(classOf[Vector])

I get:

Symbol retag is inaccessible from this place

However, I can do this from Java, and it works in Scala:

return words.rdd().retag(Vector.class);
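
With the retagged RDD the Scala side then behaves as expected, e.g.
(untested):

val parsedData = dc.prepareKMeans(data) // RDD[Vector] carrying a Vector class tag
val p = parsedData.collect()            // Array[Vector], no ClassCastException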

Dev




Re: K-Means And Class Tags

Posted by Joseph Bradley <jo...@databricks.com>.
I believe you're running into an erasure issue which we found in
DecisionTree too.  Check out:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134

That code retags RDDs that were created from Java, to prevent the exception
you're running into.
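
The key line there is essentially the following (a paraphrased sketch of
that file, not a verbatim copy):

// RDDs built through the Java API carry an Object class tag because of type
// erasure; retag re-attaches the concrete element type so that collect() and
// friends allocate arrays of the right runtime class.
val retaggedInput = input.retag(classOf[LabeledPoint])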

Hope this helps!
Joseph


Re: K-Means And Class Tags

Posted by Devl Devel <de...@gmail.com>.
Thanks for the suggestion. Can anyone offer any advice on the
ClassCastException going from Java to Scala? Why does JavaRDD.rdd()
followed by a collect() result in this exception?

Re: K-Means And Class Tags

Posted by Yana Kadiyska <ya...@gmail.com>.
How about

data.map(s => s.split(","))
  .filter(_.length > 1)
  .map(goodEntry => Vectors.dense(goodEntry(0).toDouble, goodEntry(1).toDouble))
(Full disclosure: I didn't actually run this.) After the first map you
should have an RDD[Array[String]]; the filter then discards everything
shorter than 2, and the final map converts the rest to dense vectors. In
fact, if you're expecting length exactly 2, you might want to filter on
_.length == 2 instead...
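
Putting it together end to end it would look something like this (again
untested; the k = 2 and 20 iterations passed to KMeans.train are just
placeholder values):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// parse, drop malformed lines, convert the rest to dense vectors
val parsed = data.map(_.split(","))
  .filter(_.length == 2)
  .map(a => Vectors.dense(a(0).toDouble, a(1).toDouble))

val model = KMeans.train(parsed, 2, 20)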

