Posted to dev@gora.apache.org by Ed Kohlwey <ek...@gmail.com> on 2012/08/24 14:23:34 UTC

Compilers and data stores

So I just reviewed the Dynamo compiler, and I have a few questions,
followed by a few thoughts.

Questions:

   1. Are annotations the only way to implement the desired features?
   2. What if other data stores have other annotations? Will we create more
   compilers for them?
   3. Renato had mentioned that Gora supports "data services" now
   (presumably in addition to databases). I'm not sure I understand this
   distinction. I have heard Dynamo is a managed database that implements a
   model similar to Cassandra. Can you elaborate on this statement?

Thoughts:

   1. I'm concerned that there is currently some marginal reliance on
   accessing code that is generated by compilers and cannot be declared in a
   supertype. The exact instance of this that I'm aware of is accessing the
   static field _SCHEMA on Avro types generated by the 1.3 compiler via
   reflection. The current preference in the Avro community is to use the name
   SCHEMA$ instead. Issues like this cannot be caught by static compilation
   checks and are real no-no's in my opinion, unless the structure of the API
   is well-documented and enforced by regression tests. If there is a
   proliferation of compilers this problem could become more severe.
   2. Making objects inherit from SpecificRecord (an Avro class) makes them
   convenient to use in RPC's or map/reduce. I think this is one of the most
   attractive features of Gora.
   3. The current mechanism used to track the dirty state of gora-compiled
   objects must be improved for Avro 1.7, since the Avro 1.7 API is
   structured in a way that makes the current methodology almost impossible
   if you engage in any degree of code reuse. I believe the following
   requirements are necessary for an improved dirty state tracking system:
      1. The system must be able to represent the original state of the
      object as it was deserialized from the store prior to mutation. The
      motivation for this is to be able to create the most generalized
      mapping support possible. Some of this is currently done via the
      stateful map, but I believe the implementation could be improved and
      generalized. There are lots of mapping schemes that are not currently
      possible because there is not enough information stored in objects to
      allow erasure of key/values afterwards. A few examples:
         1. Objects of arbitrary structure could be stored with each field
         (including those of child objects) represented as a single record
         in HBase, Accumulo, or Cassandra.
         2. Child objects could be stored in column families with their
         fields in column qualifiers, reserving one column family for the
         fields of the parent object. Without storing the state of objects,
         this could result in values getting "lost" in the database if a
         union type is used, for instance.
         3. Maps of maps
      2. The system should be implemented entirely in the over-the-wire
      protocol that is used to transmit objects.
      3. The system will not be represented in the serialized
      representation that the "primary" data store uses since its
      representation is authoritative.
      4. The improved system should have one representation and access
      pattern in the API (currently both a state tracker object and the
      persistent object itself describe the mutation state).
   4. I'd eventually like to see Avro/Gora objects used as both DTO's and
   DAO's using an Avro javascript implementation (there are two that I am
   aware of). Continued reliance on Avro for serialization on the wire
   supports this.
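
The dirty state requirements above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not Gora code:
TrackedRecord and its method names are hypothetical, and a plain map stands
in for an Avro record. The object keeps a snapshot of the state as it was
loaded next to the current state, so a store can work out both which fields
changed and which keys must be erased, through a single access pattern.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical sketch, not part of Gora: one object holds the current
 * field values plus a snapshot of the values as loaded from the store.
 */
class TrackedRecord {
    private final Map<String, Object> original; // state as deserialized
    private final Map<String, Object> current;  // state after mutation

    TrackedRecord(Map<String, Object> loaded) {
        this.original = new HashMap<>(loaded);
        this.current = new HashMap<>(loaded);
    }

    Object get(String field) { return current.get(field); }

    void put(String field, Object value) { current.put(field, value); }

    void remove(String field) { current.remove(field); }

    /** Fields that were added or whose value changed since load. */
    Set<String> dirtyFields() {
        Set<String> dirty = new TreeSet<>();
        for (Map.Entry<String, Object> e : current.entrySet()) {
            boolean isNew = !original.containsKey(e.getKey());
            if (isNew || !Objects.equals(original.get(e.getKey()), e.getValue())) {
                dirty.add(e.getKey());
            }
        }
        return dirty;
    }

    /** Fields present at load time but absent now; a store must erase these. */
    Set<String> deletedFields() {
        Set<String> deleted = new TreeSet<>(original.keySet());
        deleted.removeAll(current.keySet());
        return deleted;
    }
}
```

With something like this, a mapping that spreads fields over several columns
could consult deletedFields() to erase stale key/values instead of leaving
them behind in the database.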

Re: Compilers and data stores

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

On Sat, Aug 25, 2012 at 8:04 AM, Renato Marroquín Mogrovejo
<re...@gmail.com> wrote:
> Hi Ed,

>>
>>    1. Are annotations the only way to implement the desired features?
>
> No, they are not the only way to implement the desired features.
> Amazon DynamoDB provides the possibility of writing items through a
> map of attribute names and an Amazon data type called 'AttributeValue'
> [1] which contains the actual values to be stored.

Thanks for the clarification on this, Renato; this one passed me by unawares.

Lewis

[1] http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LowLevelJavaItemCRUD.html

Re: Compilers and data stores

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi Ed,

Thanks for taking the time to look into this. My answers are inline.

2012/8/24 Ed Kohlwey <ek...@gmail.com>:
> So I just reviewed the Dynamo compiler, and I have a few questions,
> followed by a few thoughts.
>
> Questions:
>
>    1. Are annotations the only way to implement the desired features?

No, they are not the only way to implement the desired features.
Amazon DynamoDB provides the possibility of writing items through a
map of attribute names and an Amazon data type called 'AttributeValue'
[1] which contains the actual values to be stored. As Lewis said, we
decided to use the DynamoDBMapper class because of the time frame and
because it seemed a reasonable option at the time. I started
looking into this mapping class because Gora generates classes based
on Avro schemas, and as we don't have Avro schemas for web services we
decided to create DynamoDB annotated classes to persist them. This
helped us make the code much less convoluted.
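
To illustrate the annotation-free alternative mentioned above, here is a
minimal sketch of turning plain field values into a map of attribute names
to typed values. It deliberately does not use the AWS SDK: AttrValue and
ItemMapper are hypothetical stand-ins, and the real 'AttributeValue' class
is richer than what is shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for a typed attribute value; NOT the AWS SDK class. */
final class AttrValue {
    final String s; // string payload, or null
    final String n; // numeric payload kept as a decimal string, or null

    private AttrValue(String s, String n) {
        this.s = s;
        this.n = n;
    }

    static AttrValue ofString(String v) { return new AttrValue(v, null); }

    static AttrValue ofNumber(Number v) { return new AttrValue(null, v.toString()); }
}

/** Hypothetical helper that builds an item without any annotations. */
final class ItemMapper {
    /** Converts plain field values into a name -> typed value map. */
    static Map<String, AttrValue> toItem(Map<String, Object> fields) {
        Map<String, AttrValue> item = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Number) {
                item.put(e.getKey(), AttrValue.ofNumber((Number) v));
            } else if (v != null) {
                item.put(e.getKey(), AttrValue.ofString(v.toString()));
            }
            // Null fields are simply skipped in this sketch.
        }
        return item;
    }
}
```

The trade-off is that the mapping logic lives in code like toItem rather
than in annotations on the generated classes.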

>    2. What if other data stores have other annotations? Will we create more
>    compilers for them?

Well ... yeah. The idea would be to refactor Gora's main compiler to
make it more intelligent, so it could decide what the classes are
being compiled into. For example, Google App Engine uses JPA and POJOs
to persist data, so an alternative would be to compile the XML mapping
file into fully annotated classes to persist them. While thinking about
your email, Ed, I looked for Avro RPC libraries and found this one [2];
maybe you are more familiar with it. Do you think that we could use
that to make all of our data stores Avro based?

>    3. Renato had mentioned that Gora supports "data services" now
>    (presumably in addition to databases). I'm not sure I understand this
>    distinction. I have heard Dynamo is a managed database that implements a
>    model similar to Cassandra. Can you elaborate on this statement?

Lewis's answer on this needs nothing more from me.

> Thoughts:
>
>    1. I'm concerned that there is currently some marginal reliance on
>    accessing code that is generated by compilers and cannot be declared in a
>    supertype. The exact instance of this that I'm aware of is accessing the
>    static field _SCHEMA on Avro types generated by the 1.3 compiler via
>    reflection. The current preference in the Avro community is to use the name
>    SCHEMA$ instead. Issues like this cannot be caught by static compilation
>    checks and are real no-no's in my opinion, unless the structure of the API
>    is well-documented and enforced by regression tests. If there is a
>    proliferation of compilers this problem could become more severe.

All Avro based data stores should share the same compiler, so I
totally agree with you on improving the structure of the API by
writing better documentation and by enforcing regression tests. So I guess
we would have to change the way in which all the data stores manage
their schema. We would have to do this as well for the web based
data stores, so Gora's API remains the same across the data stores.
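
As a rough idea of what such a regression test could check, the sketch below
looks up the static schema field by reflection and reports which name, if
any, a generated class exposes. Everything in it is an assumption made for
illustration: the three bean classes are stand-ins for compiler output, the
schema is a plain String rather than an Avro Schema, and SchemaFieldCheck is
a hypothetical helper.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/** Stand-ins for classes emitted by two different compiler versions. */
class OldStyleBean { public static final String _SCHEMA = "{\"type\":\"record\"}"; }
class NewStyleBean { public static final String SCHEMA$ = "{\"type\":\"record\"}"; }
class BrokenBean { }

/** Hypothetical helper for a regression test over generated classes. */
final class SchemaFieldCheck {
    /** Field names to try, in order of preference. */
    private static final String[] CANDIDATES = { "SCHEMA$", "_SCHEMA" };

    /** Returns the name of the static schema field, or null if none exists. */
    static String findSchemaField(Class<?> type) {
        for (String name : CANDIDATES) {
            try {
                Field f = type.getDeclaredField(name);
                if (Modifier.isStatic(f.getModifiers())) {
                    return name;
                }
            } catch (NoSuchFieldException ignored) {
                // Not declared under this name; try the next candidate.
            }
        }
        return null;
    }
}
```

A test suite could run findSchemaField over every generated class and fail
the build when it returns null, which turns a silent reflection failure into
a visible one.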

>    2. Making objects inherit from SpecificRecord (an Avro class) makes them
>    convenient to use in RPC's or map/reduce. I think this is one of the most
>    attractive features of Gora.

True that, but how could we use Avro to write directly to a web
service backed database?

>    3. The current mechanism used to track the dirty state of gora-compiled
>    objects must be improved for Avro 1.7, since the Avro 1.7 API is
>    structured in a way that makes the current methodology almost impossible
>    if you engage in any degree of code reuse. I believe the following
>    requirements are necessary for an improved dirty state tracking system:
>       1. The system must be able to represent the original state of the
>       object as it was deserialized from the store prior to mutation. The
>       motivation for this is to be able to create the most generalized
>       mapping support possible. Some of this is currently done via the
>       stateful map, but I believe the implementation could be improved and
>       generalized. There are lots of mapping schemes that are not currently
>       possible because there is not enough information stored in objects to
>       allow erasure of key/values afterwards. A few examples:
>          1. Objects of arbitrary structure could be stored with each field
>          (including those of child objects) represented as a single record
>          in HBase, Accumulo, or Cassandra.
>          2. Child objects could be stored in column families with their
>          fields in column qualifiers, reserving one column family for the
>          fields of the parent object. Without storing the state of objects,
>          this could result in values getting "lost" in the database if a
>          union type is used, for instance.
>          3. Maps of maps
>       2. The system should be implemented entirely in the over-the-wire
>       protocol that is used to transmit objects.

We haven't modelled this functionality in the DynamoDB store because
DynamoDB is managed by a third party.
Just another question here, Ed: what do you mean by over-the-wire
protocol? RPC, Thrift, etc.?

>       3. The system will not be represented in the serialized
>       representation that the "primary" data store uses since its
>       representation is authoritative.

Do you mean that we should abstract the serialized representation? But
how would we handle services in which we don't use disk based
serialization?

>       4. The improved system should have one representation and access
>       pattern in the API (currently both a state tracker object and the
>       persistent object itself describe the mutation state).
>    4. I'd eventually like to see Avro/Gora objects used as both DTO's and
>    DAO's using an Avro javascript implementation (there are two that I am
>    aware of). Continued reliance on Avro for serialization on the wire
>    supports this.

Thanks Ed for discussing this. We really need to decide on how Gora's
API should work for both types of data stores.


Renato M.

[1] http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LowLevelJavaItemCRUD.html
[2] https://github.com/phunt/avro-rpc-quickstart

Re: Compilers and data stores

Posted by Lewis John Mcgibbney <le...@gmail.com>.

In addition to this, I reminded myself of one of the many pros Gora
has going for it.

"Gora uses Avro for bean definition, not byte code enhancement or annotations."

Mmmmm... as I've mentioned, the restrictions imposed upon users via
the DynamoDB compiler are far from the desired aim and even further
from the defined project scope, so we are fully engaged in
improving this where we can.

Lewis


Re: Compilers and data stores

Posted by Ed Kohlwey <ek...@gmail.com>.

I agree. It's unfortunate that Amazon's API forces you to tightly couple
your DAOs to their data store. Given that fact, though, the decision makes
sense.

One eventual solution could be using something like AspectJ to enhance the
Persistent implementations with the necessary annotations.

I think that having multiple compilers is a workable, if not ideal, option.
If that's the case, though, there should be some planning around what
the official spec is, and we should focus efforts on creating a test suite
that verifies the compliance of data stores and objects against the written
spec.
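
A compliance suite of that kind could look roughly like the sketch below.
It is only an illustration: SimpleStore is a hypothetical, much reduced
store contract (not Gora's DataStore interface), MemoryStore is a reference
implementation used to exercise the checks, and the rules listed in check()
are made-up examples of what a written spec might require.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal store contract; not Gora's DataStore interface. */
interface SimpleStore {
    void put(String key, String value);
    String get(String key);      // null when the key is absent
    boolean delete(String key);  // true when something was removed
}

/** Reference in-memory implementation used to exercise the checks. */
class MemoryStore implements SimpleStore {
    private final Map<String, String> data = new HashMap<>();
    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }
    public boolean delete(String key) { return data.remove(key) != null; }
}

/** Runs the same spec rules against any store implementation. */
final class StoreCompliance {
    /** Returns one message per violated rule; empty means compliant. */
    static List<String> check(SimpleStore store) {
        List<String> failures = new ArrayList<>();
        store.put("k", "v1");
        if (!"v1".equals(store.get("k"))) failures.add("get must return the value last put");
        store.put("k", "v2");
        if (!"v2".equals(store.get("k"))) failures.add("put must overwrite an existing value");
        if (!store.delete("k")) failures.add("delete must report a removal");
        if (store.get("k") != null) failures.add("get after delete must return null");
        if (store.delete("k")) failures.add("deleting an absent key must return false");
        return failures;
    }
}
```

Each data store module would then run StoreCompliance.check against its own
implementation, so every store is held to the same written rules regardless
of which compiler produced its objects.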

Re: Compilers and data stores

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Ed,

I've provided (where I can) answers for your questions to get us
started here. Please see inline.

On Fri, Aug 24, 2012 at 1:23 PM, Ed Kohlwey <ek...@gmail.com> wrote:

> Questions:
>
>    1. Are annotations the only way to implement the desired features?

Yes AFAIK. If I'm preempting your thoughts here, I agree that this is
really restrictive (with respect to obtaining a more generalised
design approach for a Gora compiler) but annotations are essential
for DynamoDBMapper [0] to achieve the correct mappings. As we were
operating under the restrictions of the GSoC timeline, the decision
was made to implement a DynamoDB specific compiler at this stage with
the view of developing a more widely applicable implementation as the
webservices API matured.

>    2. What if other data stores have other annotations? Will we create more
>    compilers for them?

This relates to my answer above... ideally the utopian vision would be
to use implementations which do not restrict us to annotations for
mappings. For now, though (and as no other options seemed immediately
available), we have one compiler for Avro based implementations and
one for Amazon DynamoDB (with the aim of having one to cover all web
services backed stores if possible), but maybe individual compilers for
other web services stores... this is far from ideal.

>    3. Renato had mentioned that Gora supports "data services" now
>    (presumably in addition to databases). I'm not sure I understand this
>    distinction. I have heard Dynamo is a managed database that implements a
>    model similar to Cassandra. Can you elaborate on this statement?

OK so this relates to the proposed addition of web services to the
gora-core API (namely classes such as QueryWSBase [1] and
WSDataStoreBase [2], plus specific packages for implementing web
backed datastores of varying natures, e.g. file-backed and
webservice-backed) which do not rely upon Avro for serialization
to and from the data store. For clarification on this point, Gora's
Persistent class extends from Avro's SpecificRecord, which is certainly
not appropriate for DynamoDB, as Amazon uses web service requests for
serializing to and from the data store and retrieving records.
Additionally, Avro based stores remain implementations of
PersistentBase (still extending SpecificRecord) whilst the web
services backed stores now implement PersistentWSBase [3].

The reports which can be found here [4] contain commentary on all of
this as and when we became aware of it during the design and
progression of the project.

I've intentionally not included replies to your thoughts section as I
think it's best for me to leave this to simmer for a bit... also
because I'll most certainly need to read it once or twice more for
it to lodge properly ;)

Hopefully Renato can chime in here with his thoughts, if there is
anything I've failed to include or have stated incorrectly.

Thanks
Lewis

[0] http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/dynamodb/datamodeling/DynamoDBMapper.html
[1] http://svn.apache.org/repos/asf/gora/branches/goraamazon/gora-core/src/main/java/org/apache/gora/query/ws/impl/QueryWSBase.java
[2] http://svn.apache.org/repos/asf/gora/branches/goraamazon/gora-core/src/main/java/org/apache/gora/store/ws/impl/WSDataStoreBase.java
[3] http://svn.apache.org/repos/asf/gora/branches/goraamazon/gora-core/src/main/java/org/apache/gora/persistency/ws/impl/PersistentWSBase.java
[4] http://svn.apache.org/repos/asf/gora/branches/goraamazon/gora-dynamodb/reporting/