Posted to dev@gora.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2013/02/06 19:31:53 UTC

Gora-174 in Gora-Cassandra

Hi all,

This is a really long overdue email. Finally I got the time to get
around to this while I am on holidays (:

I've made some changes to Gora-Cassandra to support Avro Union data
types, even though Cassandra doesn't rely on Avro for serializing data.
So what has been done is a workaround to save specialized data
types, e.g. unions. I faced the same problems and doubts that Alfonso
described, and Alfonso, your post was very illustrative, mate ;)

I will just explain the general approach so that the changes can be
understood. The changes themselves can be found in the code, or you
can reply to this email to talk about them.

** For storing Union data **
We create a new column only at the moment we flush the data into the
data store. This generated column stores the index of the schema used
within the Union data type.
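
As a rough illustration of the idea (a sketch in plain Java, not the
actual patch): the class name, the "_UnionIndex" column suffix, and the
map standing in for the batch of columns to flush are all assumptions
made up for this example.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the Gora-Cassandra code.
class UnionIndexColumnSketch {
    // Assumed naming convention for the generated column.
    static final String INDEX_SUFFIX = "_UnionIndex";

    // Builds the columns written at flush time for one union field:
    // the serialized value plus a generated column holding the index
    // of the union branch that the value belongs to.
    static Map<String, ByteBuffer> columnsToFlush(String field,
                                                  ByteBuffer serializedValue,
                                                  int unionIndex) {
        Map<String, ByteBuffer> columns = new LinkedHashMap<>();
        columns.put(field, serializedValue);

        ByteBuffer index = ByteBuffer.allocate(Integer.BYTES);
        index.putInt(0, unionIndex); // absolute put, position stays at 0
        columns.put(field + INDEX_SUFFIX, index);
        return columns;
    }
}
```

For a field "content" declared as ["null", "string"] and holding a
string, this would produce the columns "content" and
"content_UnionIndex", the latter containing 1.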

** For retrieving Union data **
Gora can already retrieve the data from Cassandra by itself. The
problem was determining which serializer to use when reading the data
back. So the first step is to get the value stored in the generated
column and use it to select the appropriate serializer. After that, it
is just a matter of using what Gora already has.
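
The read side could look roughly like this. Again, this is only a
sketch under assumptions: the class name is made up, the union is
assumed to be ["null", "string", "long"], and the lambdas stand in for
the real serializers.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the Gora-Cassandra code.
class UnionReadSketch {
    // One deserializer per union branch, in schema order:
    // ["null", "string", "long"] is assumed here.
    static final List<Function<ByteBuffer, Object>> DESERIALIZERS =
        Arrays.<Function<ByteBuffer, Object>>asList(
            bytes -> null,
            bytes -> StandardCharsets.UTF_8.decode(bytes.duplicate()).toString(),
            bytes -> Long.valueOf(bytes.getLong(bytes.position())));

    // Reads the generated index column first, then uses that index to
    // pick the deserializer for the data column.
    static Object read(ByteBuffer indexColumn, ByteBuffer dataColumn) {
        int index = indexColumn.getInt(indexColumn.position());
        return DESERIALIZERS.get(index).apply(dataColumn);
    }
}
```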

** For generating classes **
I am not particularly happy with the changes I've made here. I changed
GoraCompiler directly to create the extra field that stores the
selected schema of the Union data type. I tried to just add a new field
to the schema before compiling and then let the compiler do its work,
but I kept getting a lock exception from Avro, which prevented me from
making the change the way I wanted. If anybody can help me out on how
to do it, then give me a shout! :)

I didn't know where to upload this patch: either to Gora-174, because
it addresses an issue caused by it, or to a new issue created to handle
Avro Unions per data store.
Thanks for reading until the end!


Renato M.

Re: Gora-174 in Gora-Cassandra

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Alfonso,

My replies are inline.

2013/2/6 Alfonso Nishikawa <al...@gmail.com>:
> Hi Renato,
>
> I saw in the code that Cassandra has its own serializers. Can you give us a
> small summary of how it works and what it affects, before your
> modifications? This will help in understanding your approaches.

This means that data is serialized for Cassandra by using a specific
serializer for each data type it supports, so it doesn't rely on Avro
to decide how to store the data. Serializers convert objects to and
from ByteBuffer. Clients such as Hector and Astyanax provide
serializers for most basic types.
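
To show what I mean by a serializer, here is a minimal sketch. The
interface and class names below are hypothetical and written only for
this email; they are similar in spirit to what the client libraries
offer, but they are not their actual API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical interface: one implementation per supported data type.
interface SketchSerializer<T> {
    ByteBuffer toByteBuffer(T value);
    T fromByteBuffer(ByteBuffer bytes);
}

// Strings are stored as their UTF-8 bytes.
class StringSketchSerializer implements SketchSerializer<String> {
    public ByteBuffer toByteBuffer(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }
    public String fromByteBuffer(ByteBuffer bytes) {
        // duplicate() so that decoding does not move the caller's position
        return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
    }
}

// Longs are stored as 8 big-endian bytes.
class LongSketchSerializer implements SketchSerializer<Long> {
    public ByteBuffer toByteBuffer(Long value) {
        ByteBuffer bytes = ByteBuffer.allocate(Long.BYTES);
        bytes.putLong(0, value);
        return bytes;
    }
    public Long fromByteBuffer(ByteBuffer bytes) {
        return bytes.getLong(bytes.position());
    }
}
```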

> Does Cassandra have some penalties for the new column?

The penalties are storing an extra column even though the user has not
specified it, and then having to find out how the data was originally
stored. I think this touches on another point we should put on Gora's
roadmap, which is benchmarking Gora's usage against native access.
This would help us get deeper knowledge of how Gora behaves,
help us find performance issues, and of course, get some good
publicity for the project ;)

> In HBase that approach is not necessary since the union-index gets serialized (by Avro)
> and stored before the proper data (I know you know that :) just
> remembering).

This I did not know, my friend. Could you please point me to some
documentation or anywhere else I can read a little about this? I am
not very literate in HBase matters ): The things I have read about
HBase + Avro were similar to [1, 2, 3], which use an extra layer to
store Avro data inside HBase.

> About generating classes, there's no need to modify the compiler (check if
> you really need to modify it). Taking into account that a union can't have
> two branches of the same type (Avro specs):
> - When you are writing, you can implement the approach of Avro shown in
> GenericData#resolveUnion():333 [0] (avro 1.3.3) called from [1], which
> iterates over the union types until one matches the type of the data being written.

Please excuse me if I am not following you on this one. The thing is
that Cassandra does not natively store or read Avro data;
Gora-Cassandra relies on the Avro specification to work with the
StateManager and get hold of the schema. So Gora-Cassandra uses the
data type to find the proper serializer for writing data to and
reading data from Cassandra. Are we talking about the same thing, or
am I talking about something completely different? ):
The problem is not really writing, because, as I explained,
Gora-Cassandra relies on the data type to select the proper
serializer. When reading, though, it gets a ByteBuffer and doesn't
know which serializer to use to restore the data to its original form.
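
A tiny example of why the bytes alone are not enough (a sketch with a
made-up class name): the same 8 bytes are both a valid long and a valid
8-character string, so the reader needs extra information to choose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: two equally valid readings of the same bytes.
class AmbiguousBytesSketch {
    // Interprets the buffer as an 8-byte big-endian long.
    static long asLong(ByteBuffer bytes) {
        return bytes.getLong(bytes.position());
    }

    // Interprets the very same buffer as UTF-8 text.
    static String asString(ByteBuffer bytes) {
        return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
    }
}
```

Given the bytes of "ABCDEFGH", asString returns the text back, while
asLong returns 0x4142434445464748L without any complaint.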

> - When reading, you know the index. The approach of Avro is in [2].

Avro's approach is very attractive, though. I wish I had more time to
rewrite some of the Cassandra serializing mechanism to make it more
similar to Avro's (which I find less obscure).

> I suggest not modifying (if possible) because for HBase it gets a
> duplicated state, where one will be ignored and becomes noise in the
> structures.

I totally agree with this one. We shouldn't complicate things for
other data stores. I mean, if Gora-Cassandra needs it, then only
Gora-Cassandra should use it. We should start thinking about better
integration of different compilers for each data store, because each
data store might have its own needs.


> My opinion, of course :)
> Thanks for all!!

Thank you for taking the time!


Renato M.


[1] https://issues.apache.org/jira/browse/HBASE-6553
[2] https://github.com/spullara/havrobase
[3] http://blog.cloudera.com/blog/2011/07/avro-data-interop/

> Best regards,
>
> Alfonso Nishikawa
>
> [0] -
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/avro/1.3.3/org/apache/avro/generic/GenericData.java?av=f#333
> [1] - GenericDatumWriter#write():59 -
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/avro/1.3.3/org/apache/avro/generic/GenericDatumWriter.java?av=f#59
> [2] - GenericDatumReader#read():84 -
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/avro/1.3.3/org/apache/avro/generic/GenericDatumReader.java?av=f#77
>
>
> --
> "Drinking bloody marys all night will make you feel like a corpse in the
> morning."

Re: Gora-174 in Gora-Cassandra

Posted by Alfonso Nishikawa <al...@gmail.com>.
I forgot about your last question.

I suggest creating a sub-task. Can you create one? If not, I will create
it for you (Menu "More Actions > Create sub-task").

Best regards,

Alfonso Nishikawa




Re: Gora-174 in Gora-Cassandra

Posted by Alfonso Nishikawa <al...@gmail.com>.
Hi Renato,

I saw in the code that Cassandra has its own serializers. Can you give us a
small summary of how it works and what it affects, before your
modifications? This will help in understanding your approaches.

Does Cassandra have some penalties for the new column? In HBase that
approach is not necessary, since the union index gets serialized (by Avro)
and stored before the actual data (I know you know that :) just a
reminder).
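
As a small sketch of what Avro's binary encoding does for a union
(written in plain Java without the Avro library; the class and method
names are made up for this example): the zero-based branch index is
written first as a zig-zag variable-length integer, followed by the
value encoded with that branch's schema. Only the string branch is
covered here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the layout, not Avro's own encoder.
class AvroUnionEncodingSketch {
    // Zig-zag, variable-length encoding used by Avro for int and long.
    static void writeLong(long n, ByteArrayOutputStream out) {
        long v = (n << 1) ^ (n >> 63);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encodes a string stored in the given union branch:
    // branch index first, then the length-prefixed UTF-8 bytes.
    static byte[] encodeStringBranch(int unionIndex, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(unionIndex, out);
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeLong(utf8.length, out);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }
}
```

For the union ["null", "string"] and the value "hi", this gives the
bytes 02 04 68 69: index 1, length 2, then "h" and "i". The reader
therefore always sees the index before the data.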

About generating classes, there's no need to modify the compiler (check if
you really need to modify it). Taking into account that a union can't have
two branches of the same type (Avro specs):
- When you are writing, you can implement the approach of Avro shown in
GenericData#resolveUnion():333 [0] (avro 1.3.3) called from [1], which
iterates over the union types until one matches the type of the data being written.
- When reading, you know the index. The approach of Avro is in [2].
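
The writing side could be sketched like this. It only mimics the idea
of resolveUnion in plain Java: the class name is made up, the union
branches are represented by Java classes instead of Avro schemas, and
Void.class is an assumed stand-in for the "null" branch.

```java
import java.util.List;

// Hypothetical sketch of the resolveUnion idea, not Avro's code.
class ResolveUnionSketch {
    // Returns the position of the first branch that matches the datum.
    // Since a union cannot contain the same type twice, at most one
    // branch of the same type can match.
    static int resolveUnion(List<Class<?>> branches, Object datum) {
        int i = 0;
        for (Class<?> branch : branches) {
            boolean matches = (branch == Void.class)
                ? datum == null
                : branch.isInstance(datum);
            if (matches) {
                return i;
            }
            i++;
        }
        throw new IllegalArgumentException("Not in union: " + datum);
    }
}
```

With the index computed this way at write time, no extra field is
needed in the generated classes.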

I suggest not modifying it (if possible), because for HBase it results in
duplicated state, where one copy will be ignored and becomes noise in the
structures.
My opinion, of course :)

Thanks for all!!

Best regards,

Alfonso Nishikawa

[0] -
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/avro/1.3.3/org/apache/avro/generic/GenericData.java?av=f#333
[1] - GenericDatumWriter#write():59 -
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/avro/1.3.3/org/apache/avro/generic/GenericDatumWriter.java?av=f#59
[2] - GenericDatumReader#read():84 -
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/avro/1.3.3/org/apache/avro/generic/GenericDatumReader.java?av=f#77



