Posted to user@spark.apache.org by alberskib <al...@gmail.com> on 2015/10/09 18:05:52 UTC

Issue with the class generated from avro schema

Hi all, 

I have a piece of code written in Spark that loads data from HDFS into Java
classes generated from Avro IDL. On the RDD created that way I run a simple
operation whose result depends on whether I cache the RDD beforehand. That
is, if I run the code below

val loadedData = loadFromHDFS[Data](path,...)
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
the program prints 200000. On the other hand, running the following code

val loadedData = loadFromHDFS[Data](path,...).cache()
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
results in 1 being printed to stdout.

When I inspect values of the fields after reading cached data it seems

I am pretty sure that the root cause of the described problem is an issue
with serialization of the classes generated from Avro IDL, but I do not know
how to resolve it. I tried using Kryo, registering the generated class (Data),
and registering different serializers from chill_avro for that class
(SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of
those ideas helped.

I posted exactly the same question on Stack Overflow but did not receive any
response: link
<http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema>

What is more, I created a minimal working example that makes the problem
easy to reproduce:
link <https://github.com/alberskib/spark-avro-serialization-issue>

How can I solve this problem?


Thanks,
Bartek



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-the-class-generated-from-avro-schema-tp24997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Issue with the class generated from avro schema

Posted by Igor Berman <ig...@gmail.com>.

I think there is a deepCopy method for generated Avro classes.
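
As a rough illustration of that suggestion: Avro exposes deepCopy(schema, value) on GenericData (SpecificData, used for generated classes, inherits it). The sketch below uses a GenericRecord with a made-up two-field schema as a stand-in for the generated Data class; the schema, the field values, and the DeepCopyDemo name are assumptions for the example, not taken from the thread. It requires the Avro library on the classpath.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object DeepCopyDemo {
  // Hypothetical schema standing in for the generated Data class.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Data","fields":[
      |  {"name":"userId","type":"string"},
      |  {"name":"date","type":"string"}
      |]}""".stripMargin)

  val original: GenericRecord = new GenericData.Record(schema)
  original.put("userId", "u1")
  original.put("date", "2015-10-09")

  // deepCopy(schema, value) builds a new record with copied field values.
  val copy: GenericRecord = GenericData.get().deepCopy(schema, original)

  // Overwrite the original, as a reader that reuses objects would do
  // when it moves on to the next row.
  original.put("userId", "u2")

  // The copy still holds the value it had when it was taken.
  val copiedUserId: String = copy.get("userId").toString
}
```

For an RDD of generated classes, the analogous step would presumably be something like rdd.map(r => SpecificData.get().deepCopy(r.getSchema, r)) placed before cache(); that has not been checked against the reproduction project linked in the original post.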

Re: Issue with the class generated from avro schema

Posted by Bartłomiej Alberski <al...@gmail.com>.

I knew that one possible solution would be to map each loaded object into
another class right after reading from HDFS.
I was looking for a solution that allows reusing the Avro-generated classes.
That would be useful when a record has more than 22 fields,
because you would not need to write boilerplate code for mapping to and from
the class, e.g. loading records as instances of the Avro-generated class,
updating some fields, removing duplicates, and saving the results with
exactly the same schema.

Thank you for the answer; at least I know that there is no way to make it
work.

Re: Issue with the class generated from avro schema

Posted by Igor Berman <ig...@gmail.com>.

You should create a copy of your Avro data before working with it, i.e. just
after loadFromHDFS, map each record into a new instance that is a deep copy
of the object. It's connected to the way the Spark/Avro reader reads Avro
files (it reuses some buffer or something).
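
The effect described above can be reproduced without Spark or Avro. The following is a minimal simulation in plain Scala, assuming the reader hands back one shared mutable object and only overwrites its fields for each row; Record, ReuseDemo, and read are hypothetical names invented for this sketch, and the Vector merely stands in for an in-memory cache that stores object references.

```scala
// A mutable record, similar in spirit to a class generated from a schema.
final class Record(var userId: String, var date: String) {
  def copyOf: Record = new Record(userId, date)
}

object ReuseDemo {
  // Hypothetical stand-in for the input format: a single shared instance
  // is refilled for every row and returned again.
  def read(rows: Seq[(String, String)]): Iterator[Record] = {
    val shared = new Record("", "")
    rows.iterator.map { case (u, d) =>
      shared.userId = u
      shared.date = d
      shared
    }
  }

  val rows: Seq[(String, String)] = (1 to 5).map(i => (s"user$i", "2015-10-09"))

  // Streaming: each record is consumed before the next row overwrites it.
  val streamed: Int = read(rows).map(r => r.userId + r.date).toSet.size

  // Storing references: every entry points at the one shared object,
  // which by now holds only the last row.
  val cached: Vector[Record] = read(rows).toVector
  val afterCache: Int = cached.map(r => r.userId + r.date).toSet.size

  // Copying each record before storing it keeps the values apart.
  val cachedCopies: Vector[Record] = read(rows).map(_.copyOf).toVector
  val afterCopy: Int = cachedCopies.map(r => r.userId + r.date).toSet.size

  def main(args: Array[String]): Unit =
    println(s"streamed=$streamed cached=$afterCache copied=$afterCopy")
    // prints: streamed=5 cached=1 copied=5
}
```

This mirrors the 200000 versus 1 difference in the original report: without caching, each value is read before it is overwritten, while the stored references all end up showing the last row.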