Posted to dev@gora.apache.org by Keith Turner <ke...@deenlo.com> on 2011/12/01 22:07:45 UTC

accumulo backend for gora

I have been writing an Accumulo [1] backend for Gora.  I am pretty
far along, but not finished.  When I am finished, I plan to post a
patch on a JIRA ticket.  If anyone would like to review it, let me
know.

I have a question about storing arrays.  I am wondering what the
expected behavior is given the following?


{
  "type": "record",
  "name": "Foo",
  "namespace": "test",
  "fields": [
    {"name": "data", "type": {"type": "array", "items": "string"}}
  ]
}


// First write: three elements under key 42.
Foo foo1 = new test.Foo();
foo1.addToData("d1");
foo1.addToData("d2");
foo1.addToData("d3");
datastore.put(42L, foo1);

datastore.flush();

// Second write: a shorter array under the same key.
Foo foo2 = new test.Foo();
foo2.addToData("d4");
foo2.addToData("d5");
datastore.put(42L, foo2);

datastore.flush();

Foo foo3 = datastore.get(42L);
// What would you expect this to print for the data array?  d4,d5?
System.out.println(foo3);


[1]: http://incubator.apache.org/accumulo

Re: accumulo backend for gora

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Dec 1, 2011 at 5:16 PM, Enis Söztutar <en...@gmail.com> wrote:
> Wow, this is great news. If you upload the patch, I am sure there will be
> interest for review and we can add it to the code base.
>
> Coming to the array storage, one of the strengths of Gora is that it
> delegates the mapping to the data store, since each one has its own data
> model. In HBase, and I believe in Accumulo as well, you can store arrays in
> at least three ways:
>  (1) serialize the array and store it in one cell
>  - Adding or deleting items will read and reserialize the whole array. This
> is perfect for small, mostly read-only arrays.
>  (2) serialize each item in its own cell, sharing the same column family and
> having consecutive column numbers. Like family:0 -> arr[0],
> family:1 -> arr[1], ...
>  (3) serialize each item in columns sharing the same column family, but
> with empty cells. Like family:arr[0] -> 'dummy', family:arr[1], ... .
>  - The array elements will be stored in sorted order.
>
> So, the question is what to choose? It turns out that depending on how you
> want to access the data and on the characteristics of the data (like
> read-only, size, etc.), you should be able to choose any of them for your
> fields. And depending on how you do the data layout in your storage, the
> semantics and/or the performance for the use case you mentioned can change.
> In HBase, we have only option (2), but ideally gora-hbase and gora-accumulo
> should be able to work with all 3. And if you think about the semantics of
> deleting an item from an array, it gets a little bit more involved. For
> example in gora-hbase, your use case will probably print d4,d5,d3 (since d1
> and d2 will be overridden, but d3 won't be deleted). However, I think the
> correct semantics should be to print only d4 and d5. If you go with (3), on
> the other hand, I think the correct semantics is to print d1,d2,d3,d4,d5.

Looking at the current HBase implementation, I thought it might yield
d4,d5,d3, but I was not sure. I think with option 2 you could also
store a length or end-of-array marker; then just d4 and d5 would be
returned.  I was thinking of doing this for the Accumulo datastore,
but then its behavior would differ from the HBase store.  So what
should the behavior be?  Should different Gora stores have the same
behavior even if they have different implementations?  It seems like
this would be good for the Gora user, since it makes it easier to
switch between implementations.  The behavior could be specified in
the interfaces and enforced via tests.  It seems like there are
already some tests that check for some behaviors across implementations.
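
To make the length-marker idea concrete, here is a minimal sketch. It is
not Gora, HBase, or Accumulo code: a TreeMap stands in for the cells of one
row, and the class name, method names, and the "len" qualifier are all
made up for illustration. It only shows that recording the array length
next to the per-index cells lets a read ignore stale cells left by an
earlier, longer array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch only: a TreeMap stands in for the cells of one row.
// The class, method, and qualifier names are assumptions, not a real API.
class LengthMarkerSketch {
    static final String LENGTH_QUALIFIER = "len";

    final TreeMap<String, String> row = new TreeMap<String, String>();

    // Write each element under its index, then record the array length.
    // Cells left over from a longer, earlier array are not deleted.
    void put(List<String> array) {
        for (int i = 0; i < array.size(); i++) {
            row.put(Integer.toString(i), array.get(i));
        }
        row.put(LENGTH_QUALIFIER, Integer.toString(array.size()));
    }

    // Read only the first "len" elements; anything past the marker is ignored.
    List<String> get() {
        List<String> result = new ArrayList<String>();
        String len = row.get(LENGTH_QUALIFIER);
        int length = (len == null) ? 0 : Integer.parseInt(len);
        for (int i = 0; i < length; i++) {
            result.add(row.get(Integer.toString(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        LengthMarkerSketch store = new LengthMarkerSketch();
        store.put(Arrays.asList("d1", "d2", "d3"));
        store.put(Arrays.asList("d4", "d5"));
        System.out.println(store.get()); // prints [d4, d5]
    }
}
```

The stale d3 cell is still physically present after the second put; whether
a real store should also delete it is a separate design question that this
sketch does not address.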

>
> So, as I said, the "correct" semantics depends on the data model, and gora
> should be flexible enough so that we can utilize different models suitable
> for the job.
>
> Thanks,
> Enis

Re: accumulo backend for gora

Posted by Enis Söztutar <en...@gmail.com>.
Wow, this is great news. If you upload the patch, I am sure there will be
interest for review and we can add it to the code base.

Coming to the array storage, one of the strengths of Gora is that it
delegates the mapping to the data store, since each one has its own data
model. In HBase, and I believe in Accumulo as well, you can store arrays in
at least three ways:
 (1) serialize the array and store it in one cell
 - Adding or deleting items will read and reserialize the whole array. This
is perfect for small, mostly read-only arrays.
 (2) serialize each item in its own cell, sharing the same column family and
having consecutive column numbers. Like family:0 -> arr[0],
family:1 -> arr[1], ...
 (3) serialize each item in columns sharing the same column family, but
with empty cells. Like family:arr[0] -> 'dummy', family:arr[1], ... .
 - The array elements will be stored in sorted order.

So, the question is what to choose? It turns out that depending on how you
want to access the data and on the characteristics of the data (like
read-only, size, etc.), you should be able to choose any of them for your
fields. And depending on how you do the data layout in your storage, the
semantics and/or the performance for the use case you mentioned can change.
In HBase, we have only option (2), but ideally gora-hbase and gora-accumulo
should be able to work with all 3. And if you think about the semantics of
deleting an item from an array, it gets a little bit more involved. For
example in gora-hbase, your use case will probably print d4,d5,d3 (since d1
and d2 will be overridden, but d3 won't be deleted). However, I think the
correct semantics should be to print only d4 and d5. If you go with (3), on
the other hand, I think the correct semantics is to print d1,d2,d3,d4,d5.
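
As a rough illustration of how the three layouts lead to the three different
results mentioned above, the following sketch replays the two puts against a
toy in-memory row. It is an assumption-laden model, not code from Gora or
any real store: sorted maps stand in for the cells of a row, a
comma-joined string stands in for real serialization, and every class and
method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: sorted maps model the cells of a single row.
// Nothing here is a real Gora, HBase, or Accumulo API.
class ArrayLayoutSketch {

    // Option (1): the whole array lives in one cell, so each put replaces it.
    static List<String> wholeArrayInOneCell(List<List<String>> puts) {
        String cell = "";
        for (List<String> array : puts) {
            cell = String.join(",", array); // naive stand-in for serialization
        }
        if (cell.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(cell.split(","));
    }

    // Option (2): one cell per index; a shorter array leaves old cells behind.
    static List<String> oneCellPerIndex(List<List<String>> puts) {
        TreeMap<Integer, String> row = new TreeMap<Integer, String>();
        for (List<String> array : puts) {
            for (int i = 0; i < array.size(); i++) {
                row.put(i, array.get(i)); // overwrites index i, deletes nothing
            }
        }
        return new ArrayList<String>(row.values());
    }

    // Option (3): the element itself is the qualifier, with a dummy value.
    static List<String> elementAsQualifier(List<List<String>> puts) {
        TreeMap<String, String> row = new TreeMap<String, String>();
        for (List<String> array : puts) {
            for (String element : array) {
                row.put(element, "dummy"); // accumulates, kept in sorted order
            }
        }
        return new ArrayList<String>(row.keySet());
    }

    public static void main(String[] args) {
        List<List<String>> puts = Arrays.asList(
            Arrays.asList("d1", "d2", "d3"),
            Arrays.asList("d4", "d5"));
        System.out.println(wholeArrayInOneCell(puts)); // prints [d4, d5]
        System.out.println(oneCellPerIndex(puts));     // prints [d4, d5, d3]
        System.out.println(elementAsQualifier(puts));  // prints [d1, d2, d3, d4, d5]
    }
}
```

Note that under option (3) the result behaves like a sorted set, so
duplicate elements and the original insertion order would be lost.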

So, as I said, the "correct" semantics depends on the data model, and gora
should be flexible enough so that we can utilize different models suitable
for the job.

Thanks,
Enis
