Posted to user@hbase.apache.org by N Kapshoo <nk...@gmail.com> on 2010/06/21 22:33:45 UTC

composite value vs composite qualifier

Is there any querying value in separating out values tied to each
other vs. keeping them in a serialized object? I am guessing the
second option would be much faster considering it is one composite
value on the disk, but I would like to know if there are any specific
advantages to doing things the other way. Thanks.
The values themselves are very small: basic information stored as Strings.

Eg:

DocInfo: <docId><type> = value1
DocInfo: <docId><priority> = value2
DocInfo: <docId><etcetc> = value3


Vs

DocInfo: docId = value (JSON(type, priority, etcetc))

Thank you.

Re: composite value vs composite qualifier

Posted by Andrey Stepachev <oc...@gmail.com>.
2010/6/22 Jonathan Gray <jg...@facebook.com>

> One minor correction to Andrey's thoughts...
>
> All updates to a given row are atomic.  Two operations from two different
> clients against the same row will always be done serially (updates to
> multiple columns will not be interleaved, one request will go and then the
> other will go... there is row-level locking).  If you are doing a
> read/modify/write operation, then this is different and to get atomicity
> there you would need to use something like checkAndPut.
>

Sorry for the unclear description. What I meant is that with serialisation
you can't update different fields simultaneously from different clients
(because you have to serialise the object as a whole). With separate
columns you can.
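
A minimal sketch of that difference, assuming a plain Python dict as a stand-in for one HBase row (no HBase client is involved; `update_serialized` and `update_qualifier` are hypothetical helper names). Two clients read the same snapshot and then each change a different field:

```python
import json

def update_serialized(row, doc_id, snapshot, field, new_value):
    """Rewrite the whole JSON value from a snapshot the client read earlier."""
    doc = json.loads(snapshot)
    doc[field] = new_value
    row[doc_id] = json.dumps(doc, sort_keys=True)

def update_qualifier(row, doc_id, field, new_value):
    """Write a single <docId><field> cell and leave the others alone."""
    row[doc_id + ":" + field] = new_value

# Serialized layout: both clients start from the same snapshot.
blob_row = {"123213": json.dumps({"priority": "low", "type": "memo"}, sort_keys=True)}
snap_a = blob_row["123213"]
snap_b = blob_row["123213"]
update_serialized(blob_row, "123213", snap_a, "type", "report")    # client A
update_serialized(blob_row, "123213", snap_b, "priority", "high")  # client B
blob_result = json.loads(blob_row["123213"])  # client A's change was overwritten

# One qualifier per field: the two writes touch different cells.
cell_row = {"123213:type": "memo", "123213:priority": "low"}
update_qualifier(cell_row, "123213", "type", "report")    # client A
update_qualifier(cell_row, "123213", "priority", "high")  # client B
```

Each individual write is still applied atomically in both layouts; the lost update comes from the stale snapshot, not from interleaving inside a row.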

RE: composite value vs composite qualifier

Posted by Jonathan Gray <jg...@facebook.com>.
One minor correction to Andrey's thoughts...

All updates to a given row are atomic.  Two operations from two different clients against the same row will always be done serially (updates to multiple columns will not be interleaved, one request will go and then the other will go... there is row-level locking).  If you are doing a read/modify/write operation, then this is different and to get atomicity there you would need to use something like checkAndPut.
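
A rough sketch of the read/modify/write case, assuming an in-memory `Row` class whose `check_and_put` method imitates the compare-then-write idea behind checkAndPut (the class, its lock, and `set_field` are illustrative stand-ins, not the HBase API):

```python
import json
import threading

class Row:
    """Hypothetical stand-in for a single row holding one serialized cell."""

    def __init__(self, value):
        self._lock = threading.Lock()  # plays the role of the row lock
        self._value = value

    def get(self):
        with self._lock:
            return self._value

    def check_and_put(self, expected, new_value):
        """Write new_value only if the cell still holds expected."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new_value
            return True

def set_field(row, field, new_value):
    """Read, modify, and conditionally write; retry if someone else wrote first."""
    while True:
        old = row.get()
        doc = json.loads(old)
        doc[field] = new_value
        if row.check_and_put(old, json.dumps(doc, sort_keys=True)):
            return

row = Row(json.dumps({"priority": "low", "type": "memo"}, sort_keys=True))
clients = [
    threading.Thread(target=set_field, args=(row, "type", "report")),
    threading.Thread(target=set_field, args=(row, "priority", "high")),
]
for t in clients:
    t.start()
for t in clients:
    t.join()
final = json.loads(row.get())  # both changes survive, whatever the thread order
```

The retry loop is the design choice here: a failed conditional write means another client changed the value in between, so the client re-reads and tries again instead of overwriting.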

JG


RE: composite value vs composite qualifier

Posted by Jonathan Gray <jg...@facebook.com>.
Not sure there is a right/wrong way.  You should probably just do what you're most comfortable with / what makes the most sense to you.

> -----Original Message-----
> From: N Kapshoo [mailto:nkapshoo@gmail.com]
> Sent: Monday, June 21, 2010 3:23 PM
> To: user@hbase.apache.org
> Subject: Re: composite value vs composite qualifier
> 
> Does it still make sense to follow the previous id generation we
> talked about? (for performance reasons instead of storing an entire
> string?)
> 
> <docId><byte1> = value1
> <docId><byte2> = value2
> 
> instead of
> <docId><author> = value1
> <docId><status> = value2
> etc?
> 
> 

Re: composite value vs composite qualifier

Posted by N Kapshoo <nk...@gmail.com>.
Does it still make sense to follow the previous id generation we
talked about? (for performance reasons instead of storing an entire
string?)

<docId><byte1> = value1
<docId><byte2> = value2

instead of
<docId><author> = value1
<docId><status> = value2
etc?
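
For a sense of scale, here is a small sketch of the two qualifier encodings, assuming the docId is an 8-byte long and that the application keeps a field-name-to-byte mapping (`FIELD_IDS` and both function names are made up for illustration):

```python
import struct

# Hypothetical mapping maintained by the application, not by HBase.
FIELD_IDS = {"author": 1, "status": 2}

def qualifier_with_name(doc_id, field):
    """<docId><fieldName>: 8-byte big-endian long followed by the UTF-8 name."""
    return struct.pack(">q", doc_id) + field.encode("utf-8")

def qualifier_with_id(doc_id, field):
    """<docId><byte>: 8-byte big-endian long followed by a one-byte field id."""
    return struct.pack(">qB", doc_id, FIELD_IDS[field])

long_form = qualifier_with_name(123213, "author")
short_form = qualifier_with_id(123213, "author")
saved_per_cell = len(long_form) - len(short_form)
```

The saving is a handful of bytes per cell, paid for with qualifiers that are no longer readable without the mapping.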


On Mon, Jun 21, 2010 at 5:19 PM, N Kapshoo <nk...@gmail.com> wrote:
> Aha. That makes sense (both atomic writes and Filters).
>
> I am definitely only looking to filter within a given user, so looks
> like what you describe below might work for me.
>
> Thanks so much for all your help, Jonathan. You have saved me (at
> least) 2 weeks of tinkering and poking around!

Re: composite value vs composite qualifier

Posted by N Kapshoo <nk...@gmail.com>.
Aha. That makes sense (both atomic writes and Filters).

I am definitely only looking to filter within a given user, so looks
like what you describe below might work for me.

Thanks so much for all your help, Jonathan. You have saved me (at
least) 2 weeks of tinkering and poking around!

On Mon, Jun 21, 2010 at 5:10 PM, Jonathan Gray <jg...@facebook.com> wrote:
> It would be inefficient to run that query against this schema, if you're talking about finding all documents with a given author across all users.  In that case you'd want to use an additional table that had row keys as authors.
>
> If you want to search for documents with a specific author within a given user's documents (single row) then you could use filters, and as Andrey said, it would be simpler if it was broken up into individual qualifiers but could also be done with a custom filter to read the serialized value.
>
> To answer your question, you'd want a QualifierFilter that matched against qualifiers of the form <anylong><author> and then a ValueFilter which matched the value against the specific author you're looking for.
>
> JG

RE: composite value vs composite qualifier

Posted by Jonathan Gray <jg...@facebook.com>.
It would be inefficient to run that query against this schema, if you're talking about finding all documents with a given author across all users.  In that case you'd want to use an additional table that had row keys as authors.
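
A sketch of the additional-table idea, assuming two in-memory dicts in place of the main table and the author-keyed table (`docs_by_user`, `docs_by_author`, and `put_author` are hypothetical names):

```python
from collections import defaultdict

docs_by_user = {}                  # main table: user row -> {(doc_id, field): value}
docs_by_author = defaultdict(set)  # extra table: author row -> {(user, doc_id)}

def put_author(user, doc_id, author):
    """Write the author cell and the matching entry in the author-keyed table."""
    docs_by_user.setdefault(user, {})[(doc_id, "author")] = author
    docs_by_author[author].add((user, doc_id))

put_author("alice", 123213, "abc")
put_author("bob", 42, "abc")
put_author("bob", 43, "xyz")

# Finding every document by "abc" across all users is now a single-row lookup.
hits = sorted(docs_by_author["abc"])
```

In a real deployment the two writes go to different tables, so they are not atomic together and the application has to cope with one succeeding without the other.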

If you want to search for documents with a specific author within a given user's documents (single row) then you could use filters, and as Andrey said, it would be simpler if it was broken up into individual qualifiers but could also be done with a custom filter to read the serialized value.

To answer your question, you'd want a QualifierFilter that matched against qualifiers of the form <anylong><author> and then a ValueFilter which matched the value against the specific author you're looking for.
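
The matching logic of that filter pair can be sketched without an HBase client, assuming qualifiers are an 8-byte big-endian docId followed by the field name and that one dict stands in for a single user's row (all function names here are illustrative; on a real cluster the two tests would be expressed as a QualifierFilter and a ValueFilter):

```python
import struct

def qualifier(doc_id, field):
    """Build <docId><field> as 8 bytes of long plus the UTF-8 field name."""
    return struct.pack(">q", doc_id) + field.encode("utf-8")

def matches_qualifier(q, field):
    """Qualifier test: any 8-byte docId followed by exactly the field name."""
    return len(q) > 8 and q[8:] == field.encode("utf-8")

def docs_with_author(row, author):
    """Keep cells that pass both the qualifier test and the value test."""
    wanted = author.encode("utf-8")
    return [
        struct.unpack(">q", q[:8])[0]
        for q, v in sorted(row.items())
        if matches_qualifier(q, "author") and v == wanted
    ]

row = {
    qualifier(123213, "author"): b"abc",
    qualifier(123213, "status"): b"draft",
    qualifier(123214, "author"): b"xyz",
    qualifier(123215, "author"): b"abc",
}
found = docs_with_author(row, "abc")
```

Both conditions must hold for the same cell, which is why the two filters are combined rather than applied as alternatives.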

JG

> -----Original Message-----
> From: N Kapshoo [mailto:nkapshoo@gmail.com]
> Sent: Monday, June 21, 2010 2:59 PM
> To: user@hbase.apache.org
> Subject: Re: composite value vs composite qualifier
> 
> I am not sure how to use filters in my case since I do not know the
> column name.
> Eg:
> DocInfo: 123213+author = "abc"
> 
> 123213 is the docId. If I want to look for authors named 'abc' in all
> docs, how would I go about specifying a filter?
> 
> Thanks.
> 

Re: composite value vs composite qualifier

Posted by N Kapshoo <nk...@gmail.com>.
I am not sure how to use filters in my case since I do not know the
column name.
Eg:
DocInfo: 123213+author = "abc"

123213 is the docId. If I want to look for authors named 'abc' in all
docs, how would I go about specifying a filter?

Thanks.

On Mon, Jun 21, 2010 at 4:20 PM, Andrey Stepachev <oc...@gmail.com> wrote:
> 2010/6/22 N Kapshoo <nk...@gmail.com>
>
>> Is there any querying value in separating out values tied to each
>> other vs. keeping them in a serialized object? I am guessing the
>> second option would be much faster considering it is one composite
>> value on the disk, but I would like to know if there are any specific
>> advantages to doing things the other way. Thanks.
>> The values themselves are very small, basic information in String.
>>
>> Eg:
>>
>> DocInfo: <docId><type> = value1
>> DocInfo: <docId><priority> = value2
>> DocInfo: <docId><etcetc> = value3
>>
>>
>> Vs
>>
>> DocInfo: docId = value (JSON(type, priority, etcetc))
>>
>> Thank you.
>>
>
> This mostly depends on the usage pattern.
>
> 1. Each value in storage carries its full key (row key/family/qualifier/
> timestamp), so the KeyValue size grows with separate qualifiers (this
> overhead can be largely offset by compression). The serialised form will
> therefore be smaller, take less disk I/O, and can be faster.
>
> 2. The second option gives you atomic updates of the whole object (i.e. all
> data arrives as one "piece"), whereas the first option lets you update
> fields concurrently (and of course gives each field its own history, as
> opposed to the serialised object, which keeps history only for the whole
> object).
>
> 3. With the serialised form you can't use server-side filters out of the
> box (you would have to patch HBase to support a custom filter that
> deserialises the object or runs JSONPath on its serialised form); with the
> first option you can.
>

Re: composite value vs composite qualifier

Posted by Andrey Stepachev <oc...@gmail.com>.
2010/6/22 N Kapshoo <nk...@gmail.com>

> Is there any querying value in separating out values tied to each
> other vs. keeping them in a serialized object? I am guessing the
> second option would be much faster considering it is one composite
> value on the disk, but I would like to know if there are any specific
> advantages to doing things the other way. Thanks.
> The values themselves are very small, basic information in String.
>
> Eg:
>
> DocInfo: <docId><type> = value1
> DocInfo: <docId><priority> = value2
> DocInfo: <docId><etcetc> = value3
>
>
> Vs
>
> DocInfo: docId = value (JSON(type, priority, etcetc))
>
> Thank you.
>

This mostly depends on the usage pattern.

1. Each value in storage carries its full key (row key/family/qualifier/
timestamp), so the KeyValue size grows with separate qualifiers (this
overhead can be largely offset by compression). The serialised form will
therefore be smaller, take less disk I/O, and can be faster.

2. The second option gives you atomic updates of the whole object (i.e. all
data arrives as one "piece"), whereas the first option lets you update
fields concurrently (and of course gives each field its own history, as
opposed to the serialised object, which keeps history only for the whole
object).

3. With the serialised form you can't use server-side filters out of the
box (you would have to patch HBase to support a custom filter that
deserialises the object or runs JSONPath on its serialised form); with the
first option you can.
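
Point 1 can be made concrete with a back-of-the-envelope sketch. It assumes the classic uncompressed KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row, family, qualifier, and value bytes); the row key, field names, and values below are invented, and the docId is written as text for simplicity:

```python
import json

# Assumed fixed per-cell bytes: key len + value len + row len + family len
# + timestamp + type. Real on-disk sizes also depend on compression.
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

def cell_size(row, family, qualifier, value):
    """Estimated uncompressed size in bytes of one stored cell."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)

row, family = b"user1", b"DocInfo"
doc_id = b"123213"
fields = {"type": "memo", "priority": "low", "status": "draft"}

# Option 1: one cell per field, qualifier = <docId><field>.
per_field = sum(
    cell_size(row, family, doc_id + name.encode("utf-8"), value.encode("utf-8"))
    for name, value in fields.items()
)

# Option 2: one cell per document, value = compact JSON of all fields.
blob = json.dumps(fields, separators=(",", ":")).encode("utf-8")
serialized = cell_size(row, family, doc_id, blob)
```

With values this small the repeated key dominates the per-field layout, which is the effect compression is expected to soften.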