You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Kousuke Saruta (Code Review)" <ge...@cloudera.org> on 2016/12/14 04:30:36 UTC

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Kousuke Saruta has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5496

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................

KuduRDD.collect fails because of NoSerializableException

RowResult and some classes are not Serializable so KuduRDD#collect,
take and similar operations fail, throwing NoSerializableException.

To fix this issue, I made RowResult, Schema, ColumnSchema
and Slice serializable.

In addition, I've removed toString method from KuduRow because
Row which KuduRow implements already defines toString and we can
get appearance like as follows.

```
scala> kuduContext.collect

...

res1: Array[org.apache.spark.sql.Row] = Array([2,John], [1,Bob])
```

Change-Id: If0463424481a33c66fd7464345c305062420cfe9
---
M java/kudu-client/src/main/java/org/apache/kudu/ColumnSchema.java
M java/kudu-client/src/main/java/org/apache/kudu/Schema.java
M java/kudu-client/src/main/java/org/apache/kudu/client/RowResult.java
M java/kudu-client/src/main/java/org/apache/kudu/util/Slice.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduContextTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala
7 files changed, 44 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/5496/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Kousuke Saruta (Code Review)" <ge...@cloudera.org>.
Kousuke Saruta has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

Hi Dan,

Thank you for the review! I have a question and a comment.

 > I'm not keen on adding io.Serializable to the java client classes
 > due to compatibility concerns with the Java Serializable API. 

Can we have any compatibility issue? Do you have any examples?

 > looked into this issue, and it seems our biggest spark users are
 > using Kryo serialization instead of Java serialization, since it
 > provides much better performance (and it should be compatible with
 > KuduRDD).  Is it an option to use Kryo?  Using it should be as
 > simple as setting the "spark.serializer" option on the SparkConf:
 > 
 > new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

As you mentioned, Kryo is more efficient than Java serializer but unfortunately,
we can't serialize/deserialize those classes by Kryo. 
When we try to serialize those classes by Kryo, we will get exception like as follows.

```
16/12/15 15:40:58 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
columnsByIndex (org.apache.kudu.Schema)
schema (org.apache.kudu.client.RowResult)
rowResult (org.apache.kudu.spark.kudu.KuduRow)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:396)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:307)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
        at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:327)
        at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException
        at org.apache.kudu.client.shaded.com.google.common.collect.ImmutableCollection.add(ImmutableCollection.java:96)
        at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
        at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        ... 21 more
```

The reason why we get the exception above is that Kryo can't serialize/deserialize guava's ImmutableList which is the type of columnsByIndex in Schema.

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

> Can we have any compatibility issue? Do you have any examples?

Yes, by implementing io.Serializable a class representation is constrained to the fields in the original format.  See Effective Java item 74 for more reasons why the Serializable API should be avoided.

>  Kryo can't serialize/deserialize guava's ImmutableList

Kryo support for guava collections is provided by https://github.com/magro/kryo-serializers, if you add that jar to the classpath I believe that error should go away.

    <dependency>
        <groupId>de.javakaffee</groupId>
        <artifactId>kryo-serializers</artifactId>
        <version>0.41</version>
    </dependency>

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

I've filed an issue and added some notes in https://issues.apache.org/jira/browse/KUDU-1824.  Now I'm going to try and figure out why the DataFrame API is not affected, and evaluate Kousuke's suggestion.  Will update the JIRA with findings.

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Chris George <ch...@rms.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Kousuke Saruta (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5496

to look at the new patch set (#2).

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................

KuduRDD.collect fails because of NoSerializableException

RowResult and some classes are not Serializable so KuduRDD#collect,
take and similar operations fail, throwing NoSerializableException.

To fix this issue, I made RowResult, Schema, ColumnSchema
and Slice serializable.

In addition, I've removed toString method from KuduRow because
Row which KuduRow implements already defines toString and we can
get appearance like as follows.

```
scala> kuduContext.collect

...

res1: Array[org.apache.spark.sql.Row] = Array([2,John], [1,Bob])
```

Change-Id: If0463424481a33c66fd7464345c305062420cfe9
---
M java/kudu-client/src/main/java/org/apache/kudu/ColumnSchema.java
M java/kudu-client/src/main/java/org/apache/kudu/Schema.java
M java/kudu-client/src/main/java/org/apache/kudu/client/RowResult.java
M java/kudu-client/src/main/java/org/apache/kudu/util/Slice.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduContextTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala
7 files changed, 45 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/5496/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Kousuke Saruta (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5496

to look at the new patch set (#3).

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................

KuduRDD.collect fails because of NoSerializableException

RowResult and some classes are not Serializable so KuduRDD#collect,
take and similar operations fail, throwing NoSerializableException.

To fix this issue, I made RowResult, Schema, ColumnSchema
and Slice serializable.

In addition, I've removed toString method from KuduRow because
Row which KuduRow implements already defines toString and we can
get appearance like as follows.

```
scala> kuduContext.collect

...

res1: Array[org.apache.spark.sql.Row] = Array([2,John], [1,Bob])
```

Change-Id: If0463424481a33c66fd7464345c305062420cfe9
---
M java/kudu-client/src/main/java/org/apache/kudu/ColumnSchema.java
M java/kudu-client/src/main/java/org/apache/kudu/Schema.java
M java/kudu-client/src/main/java/org/apache/kudu/client/RowResult.java
M java/kudu-client/src/main/java/org/apache/kudu/util/Slice.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduContextTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala
7 files changed, 44 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/5496/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4: Code-Review-1

Hi Kousuke,

I'm not keen on adding io.Serializable to the java client classes due to compatibility concerns with the Java Serializable API.  I looked into this issue, and it seems our biggest spark users are using Kryo serialization instead of Java serialization, since it provides much better performance (and it should be compatible with KuduRDD).  Is it an option to use Kryo?  Using it should be as simple as setting the "spark.serializer" option on the SparkConf:

    new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Kousuke Saruta (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5496

to look at the new patch set (#4).

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................

KuduRDD.collect fails because of NoSerializableException

RowResult and some classes are not Serializable so KuduRDD#collect,
take and similar operations fail, throwing NoSerializableException.

To fix this issue, I made RowResult, Schema, ColumnSchema
and Slice serializable.

In addition, I've removed toString method from KuduRow because
Row which KuduRow implements already defines toString and we can
get appearance like as follows.

```
scala> kuduContext.collect

...

res1: Array[org.apache.spark.sql.Row] = Array([2,John], [1,Bob])
```

Change-Id: If0463424481a33c66fd7464345c305062420cfe9
---
M java/kudu-client/src/main/java/org/apache/kudu/ColumnSchema.java
M java/kudu-client/src/main/java/org/apache/kudu/Schema.java
M java/kudu-client/src/main/java/org/apache/kudu/client/RowResult.java
M java/kudu-client/src/main/java/org/apache/kudu/util/Slice.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduContextTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala
8 files changed, 51 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/5496/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

I've put up a new gerrit based on Kousuke's suggestion: https://gerrit.cloudera.org/#/c/5636/

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Chris George <ch...@rms.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

I'm still keen to merge https://gerrit.cloudera.org/#/c/5636/ instead of this one.  Will try and get hta one pushed through for 1.3.

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Chris George <ch...@rms.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

https://issues.apache.org/jira/browse/KUDU-1824

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Chris George <ch...@rms.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

Chris and Dan -- now that the holidays are behind us, any further thoughts on this discussion?

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Chris George <ch...@rms.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Chris George (Code Review)" <ge...@cloudera.org>.
Chris George has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4: Code-Review+1

overall looks good... I don't think this is an issue if someone is using kyro serialization... but in looking at cassandra connector they do something similar

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Chris George <ch...@rms.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KuduRDD.collect fails because of NoSerializableException

Posted by "Kousuke Saruta (Code Review)" <ge...@cloudera.org>.
Kousuke Saruta has posted comments on this change.

Change subject: KuduRDD.collect fails because of NoSerializableException
......................................................................


Patch Set 4:

> > Can we have any compatibility issue? Do you have any examples?
 > 
 > Yes, by implementing io.Serializable a class representation is
 > constrained to the fields in the original format.  See Effective
 > Java item 74 for more reasons why the Serializable API should be
 > avoided.
 > 
 > >  Kryo can't serialize/deserialize guava's ImmutableList
 > 
 > Kryo support for guava collections is provided by https://github.com/magro/kryo-serializers,
 > if you add that jar to the classpath I believe that error should go
 > away.
 > 
 > <dependency>
 > <groupId>de.javakaffee</groupId>
 > <artifactId>kryo-serializers</artifactId>
 > <version>0.41</version>
 > </dependency>

Thanks for the reply.

 > Yes, by implementing io.Serializable a class representation is
 > constrained to the fields in the original format.  See Effective
 > Java item 74 for more reasons why the Serializable API should be
 > avoided.

Yeah, I know the generalization but what I want to know is whether the constraint really affects Kudu with Spark.

 > Kryo support for guava collections is provided by https://github.com/magro/kryo-serializers,
 > if you add that jar to the classpath I believe that error should go
 > away.
 > 
 > <dependency>
 > <groupId>de.javakaffee</groupId>
 > <artifactId>kryo-serializers</artifactId>
 > <version>0.41</version>
 > </dependency>

Thanks for the information. Actually I know the library but I think it's not good solution because of following two reasons.

First, to use the library, we need to register the serializer for ImmutableList like "ImmutableListSerializer.register(kryo)"
and this method require a Kryo instance but Spark doesn't expose the instance to users. So we can't leverage the library.
Second, why users need to know concrete class representation (it's columnIndexByName in this case ). I don't think it's good
design to enforce users to know the concrete representation.


I have another option which doesn't need to make the classes Serializable but need to modify KuduRDD and KuduRow a little like as follows.

```
/**
  * A Spark SQL [[Row]] iterator which wraps a [[KuduScanner]].
  * @param scanner the wrapped scanner
  */
private[spark] class RowResultIteratorScala(private val scanner: KuduScanner) extends Iterator[Row] {

  <skip>

  override def next(): Row = {
    val rowResult = currentIterator.next()
    val length = rowResult.getColumnProjection.getColumnCount
    val array = new Array[Any](length)
    var i = 0

    while (i < length) {
      array(i) = 
        if (rowResult.isNull(i)) null
        else rowResult.getColumnType(i) match {
          case Type.BOOL => rowResult.getBoolean(i)
          case Type.INT8 => rowResult.getByte(i)
          case Type.INT16 => rowResult.getShort(i)
          case Type.INT32 => rowResult.getInt(i)
          case Type.INT64 => rowResult.getLong(i)
          case Type.UNIXTIME_MICROS => KuduRelation.microsToTimestamp(rowResult.getLong(i))
          case Type.FLOAT => rowResult.getFloat(i)
          case Type.DOUBLE => rowResult.getDouble(i)
          case Type.STRING => rowResult.getString(i)
          case Type.BINARY => rowResult.getBinaryCopy(i)
        }
        i += 1
    }
    new KuduRow(array)
  }
}

/**
  * A Spark SQL [[Row]] which wraps a Kudu [[RowResult]].
  * @param values the row result represented as array.
  */
private[spark] class KuduRow(private val values: Array[Any]) extends Row {
  override def length: Int = values.length

  override def get(i: Int): Any = values(i)

  override def copy(): Row = Row.fromSeq(Range(0, length).map(get))
}
```

How about this option?

-- 
To view, visit http://gerrit.cloudera.org:8080/5496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If0463424481a33c66fd7464345c305062420cfe9
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No