Posted to user@hbase.apache.org by Jack Levin <ma...@gmail.com> on 2011/01/08 05:01:14 UTC

question about merge-join (or AND operator between columns)

Hello all, I have a scanner question, we have this table:

hbase(main):002:0> scan 'mattest'
ROW                                          COLUMN+CELL
 1                                           column=generic:,
timestamp=1294454057618, value=1
 1                                           column=photo:,
timestamp=1294453830339, value=1
 1                                           column=type:,
timestamp=1294453812716, value=photo
 1                                           column=type:photo,
timestamp=1294453884174, value=photo
 2                                           column=generic:,
timestamp=1294454061156, value=1
 2                                           column=type:,
timestamp=1294453851757, value=video
 2                                           column=type:video,
timestamp=1294453877719, value=video
 2                                           column=video:,
timestamp=1294453842722, value=1

We need to run this query:

hbase(main):004:0> scan 'mattest', {COLUMNS => ['generic', 'photo']}
ROW                                          COLUMN+CELL
 1                                           column=generic:,
timestamp=1294454057618, value=1
 1                                           column=photo:,
timestamp=1294453830339, value=1
 2                                           column=generic:,
timestamp=1294454061156, value=1

Note that ['generic', 'photo'] utilizes the 'OR' operator, and not
'AND'.  Is it possible to create a scanner that will not AND and not
OR?, in which case something like this:

scan 'mattest', {COLUMNS => ['generic' AND 'photo']}
ROW                                          COLUMN+CELL
 1                                           column=generic:,
timestamp=1294454057618, value=1
 1                                           column=photo:,
timestamp=1294453830339, value=1

Thanks in advance.

-Jack
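
The "AND" behavior being asked for can be emulated client-side by keeping only the rows in which every requested family is present. A minimal sketch in Python, with in-memory rows mirroring the scan output above (a real client would go through the HBase API; this illustrates the semantics, not HBase code):

```python
# Rows from the 'mattest' scan above, as {row_key: set of column families}.
rows = {
    "1": {"generic", "photo", "type"},
    "2": {"generic", "type", "video"},
}

def scan_or(rows, wanted):
    """What COLUMNS => ['generic', 'photo'] actually does: any family may match."""
    return sorted(k for k, fams in rows.items() if wanted & fams)

def scan_and(rows, wanted):
    """The desired semantics: every requested family must be present."""
    return sorted(k for k, fams in rows.items() if wanted <= fams)
```

Here `scan_or(rows, {"generic", "photo"})` returns both rows, while `scan_and(rows, {"generic", "photo"})` returns only row 1, matching the output sketched above.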

Re: question about merge-join (or AND operator between columns)

Posted by Andrey Stepachev <oc...@gmail.com>.
2011/1/9 Jack Levin <ma...@gmail.com>

> Future wise we plan to have millions of rows, probably across multiple
> regions, even if IO is not a problem, doing millions of filter operations
> does not make much sense.
>

It depends on the selectivity of your photo column. If it is a rare case
(say 1% of rows have photos), it is wiser to scan only the photo family and
then get the other families. If selectivity is high, you will have only a
small number of mismatches.

But I agree that HBase doesn't have a feature like "first check this
family, and if it has a value, proceed to the others", and in some cases
that could be very useful (for in-place indexing).
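
The "scan the rare family first, then fetch the other families per match" strategy can be sketched like this; the dicts are hypothetical in-memory stand-ins for a family scan and per-row gets:

```python
# Hypothetical stores: family -> {row_key: value}.
photo_family = {"row10": "http://example/photo.jpg"}            # sparse family
text_family = {f"row{i:02d}": f"text-{i}" for i in range(11)}   # dense family

def scan_then_get(sparse, dense):
    """Scan only the sparse family; for each hit, do a point lookup
    against the dense family. Cost is one get per sparse-family row,
    instead of a full scan plus filtering of the dense family."""
    results = {}
    for row_key, photo in sparse.items():
        text = dense.get(row_key)   # stands in for an HBase Get
        if text is not None:
            results[row_key] = (photo, text)
    return results
```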



Re: question about merge-join (or AND operator between columns)

Posted by Jack Levin <ma...@gmail.com>.
Suppose we used different families; how would that help?

-Jack


On Jan 8, 2011, at 6:47 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Jack,
> 
> Why not put photos and texts in separate column families?
> 
> -Todd

Re: question about merge-join (or AND operator between columns)

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jack,

Why not put photos and texts in separate column families?

-Todd

On Sat, Jan 8, 2011 at 2:57 PM, Jack Levin <ma...@gmail.com> wrote:

> Future wise we plan to have millions of rows, probably across multiple
> regions, even if IO is not a problem, doing millions of filter operations
> does not make much sense.
>
> -Jack



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: question about merge-join (or AND operator between columns)

Posted by Jack Levin <ma...@gmail.com>.
Future-wise we plan to have millions of rows, probably across multiple
regions.  Even if IO is not a problem, doing millions of filter operations
does not make much sense.

-Jack

On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <oc...@gmail.com> wrote:

> Ok. Understand.
>
> But do you check is it really an issue? I think that it is only 1 IO here,
> (especially
> if compression used)? You have big rows?

Re: question about merge-join (or AND operator between columns)

Posted by Andrey Stepachev <oc...@gmail.com>.
Ok, understood.

But did you check whether it is really an issue? I think it is only 1 IO
here (especially if compression is used). Do you have big rows?



2011/1/9 Jack Levin <ma...@gmail.com>

> Sorting is not the issue, the location of data can be in the beginning,
> middle or end, or any combination of thereof.  I only given the worst case
> scenario example, I understand that filtering will produce results we want
> but at cost of examining every row and offloading AND/join logic to the
> application.
>
> -Jack
>

Re: question about merge-join (or AND operator between columns)

Posted by Jack Levin <ma...@gmail.com>.
Sorting is not the issue; the data can be at the beginning, the middle, the
end, or any combination thereof.  I only gave the worst-case scenario as an
example.  I understand that filtering will produce the results we want, but
at the cost of examining every row and offloading the AND/join logic to the
application.

-Jack

On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <oc...@gmail.com> wrote:

> More details on binary sorting you can read
>
> http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
>

Re: question about merge-join (or AND operator between columns)

Posted by Andrey Stepachev <oc...@gmail.com>.
More details on binary sorting can be found at:
http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
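
The core trick in the linked article is to encode values so that lexicographic byte order matches the logical order; for a signed 64-bit integer that means flipping the sign bit before writing it big-endian (big-endian is what HBase's `Bytes.toBytes(long)` produces). A sketch:

```python
import struct

def long_to_sortable_bytes(n):
    """Encode a signed 64-bit int so that unsigned byte-wise comparison
    agrees with numeric comparison. Adding 2**63 (equivalent to flipping
    the sign bit of the two's-complement form) maps the signed range
    [-2**63, 2**63) monotonically onto [0, 2**64)."""
    return struct.pack(">Q", n + (1 << 63))
```

Without the sign-bit flip, raw two's-complement bytes sort all negative numbers after the positive ones, which is what breaks naive key encodings.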


Re: question about merge-join (or AND operator between columns)

Posted by Andrey Stepachev <oc...@gmail.com>.
Hm. But what is the problem with using Long.MAX_VALUE - dayNum instead of
dayNum? In that case you get all the data sorted in reverse order, and the
scan returns the latest entries first.
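
A quick sketch of the reverse-key idea: since HBase returns rows in ascending key order, storing `Long.MAX_VALUE - dayNum` in the key makes larger day numbers sort first. The key format below is an assumption for illustration:

```python
JAVA_LONG_MAX = (1 << 63) - 1   # Java's Long.MAX_VALUE

def reverse_day_key(day_num, uid):
    """Zero-padded so that string order matches numeric order; a bigger
    day_num (an older entry) yields a smaller key, hence sorts first."""
    return "%019d:%s" % (JAVA_LONG_MAX - day_num, uid)

keys = sorted(reverse_day_key(d, "uid1") for d in range(11))
```

With the layout in this thread, that puts the `10days` row (the one with the photo) at the front of the scan instead of the end.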


Re: question about merge-join (or AND operator between columns)

Posted by Jack Levin <ma...@gmail.com>.
Basic problem described:

A user uploaded 1 image and created some text 10 days ago, then created
1000 text messages between 9 days ago and today:

row key        | fm:type --> value

00days:uid     | type:text --> text_id
.
.
09days:uid     | type:text --> text_id

10days:uid     | type:photo --> URL
               | type:text --> text_id

We want to skip all the way to the 10days:uid row without reading the
00days:uid - 09days:uid rows.  Ideally we do not want to read all 1000
entries that have _only_ text; we want to get to the last entry in the most
efficient way possible.


-Jack




On Sat, Jan 8, 2011 at 11:43 AM, Stack <st...@duboce.net> wrote:
> Strike that.  This is a Scan, so can't do blooms + filter.  Sorry.
> Sounds like a coprocessor then.  You'd have your query 'lean' on the
> column that you know has the lesser items and then per item, you'd do
> a get inside the coprocessor against the column of many entries.  The
> get would go via blooms.
>
> St.Ack
>

Re: question about merge-join (or AND operator between columns)

Posted by Stack <st...@duboce.net>.
Strike that.  This is a Scan, so it can't do blooms + filter.  Sorry.
Sounds like a coprocessor then.  You'd have your query 'lean' on the
column that you know has fewer items, and then, per item, you'd do a
get inside the coprocessor against the column with many entries.  The
get would go via blooms.

St.Ack


On Sat, Jan 8, 2011 at 11:39 AM, Stack <st...@duboce.net> wrote:
> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <ma...@gmail.com> wrote:
>> Yes, we thought about using filters, the issue is, if one family
>> column has 1ml values, and second family column has 10 values at the
>> bottom, we would end up scanning and filtering 99990 records and
>> throwing them away, which seems inefficient.
>
> Blooms+filters?
> St.Ack
>

Re: question about merge-join (or AND operator between columns)

Posted by Stack <st...@duboce.net>.
On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <ma...@gmail.com> wrote:
> Yes, we thought about using filters, the issue is, if one family
> column has 1ml values, and second family column has 10 values at the
> bottom, we would end up scanning and filtering 99990 records and
> throwing them away, which seems inefficient.

Blooms+filters?
St.Ack

Re: question about merge-join (or AND operator between columns)

Posted by Jack Levin <ma...@gmail.com>.
Yes, we thought about using filters.  The issue is, if one family's column
has 1 million values, and a second family's column has 10 values at the
bottom, we would end up scanning and filtering 999,990 records and throwing
them away, which seems inefficient.  The only solution is to break the
tables apart and do a pseudo-JOIN on some row key in the application
itself.  There is no contrib package that allows a merged index across
multiple families or columns, is there?

-Jack
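
The pseudo-JOIN described here amounts to a merge-join over sorted row-key streams, which is cheap on the client side because HBase scanners already return keys in ascending order. A sketch of the intersection step, with plain lists standing in for the two scanners:

```python
def merge_join(left, right):
    """Single-pass intersection of two ascending streams of row keys."""
    li, ri = iter(left), iter(right)
    out = []
    try:
        l, r = next(li), next(ri)
        while True:
            if l == r:            # key present in both streams: a join hit
                out.append(l)
                l, r = next(li), next(ri)
            elif l < r:           # advance whichever stream is behind
                l = next(li)
            else:
                r = next(ri)
    except StopIteration:         # either stream exhausted: done
        return out
```

Each input is consumed exactly once, so the cost is proportional to the shorter of the two scans rather than to a full filtered scan of the dense column.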

On Sat, Jan 8, 2011 at 11:30 AM, Andrey Stepachev <oc...@gmail.com> wrote:
> I don't think that it is possible on scanner level with bloomfilters
> (families are in separate files, so
> they scanned independently).
> But you can use filters, to filter out unneeded data.
>

Re: question about merge-join (or AND operator between columns)

Posted by Andrey Stepachev <oc...@gmail.com>.
I don't think it is possible at the scanner level with bloom filters
(families are stored in separate files, so they are scanned independently).
But you can use filters to filter out unneeded data.


Re: question about merge-join (or AND operator between columns)

Posted by Stack <st...@duboce.net>.
Sounds like you need to write a little filter, Jack: one that filters
out rows that do not have values in all of the query columns.  Maybe
you can manhandle SkipFilter into doing the job?
http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/filter/SkipFilter.html

St.Ack
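Roughly, SkipFilter turns a per-cell filter into a per-row one: if the wrapped
filter rejects any cell in a row, the entire row is skipped. The toy sketch
below (plain Python, not the HBase API, and not a drop-in AND filter) shows
just that row-skipping behavior.

```python
def skip_filter_scan(rows, cell_ok):
    """Toy SkipFilter: drop an entire row if cell_ok() rejects any of
    its cells; otherwise pass the row through unchanged."""
    out = {}
    for row_key, cells in rows.items():
        if all(cell_ok(col, val) for col, val in cells.items()):
            out[row_key] = cells
    return out

# Example: drop any row that contains a cell in the 'video' family.
rows = {
    "1": {"generic:": "1", "photo:": "1"},
    "2": {"generic:": "1", "video:": "1"},
}
survivors = skip_filter_scan(rows, lambda col, val: not col.startswith("video"))
print(survivors)  # only row "1" remains
```

Getting true AND-of-presence semantics out of this still takes some
manhandling, since "no cell fails" is not the same as "all families present".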

On Sat, Jan 8, 2011 at 10:30 AM, Jack Levin <ma...@gmail.com> wrote:
> Sorry, my mistake: right now it's only OR, and we really need AND.
> I would think that with bloomfilters this could be a sweet feature to
> add if it's not already there.
>
>
> -Jack
>
> On Fri, Jan 7, 2011 at 10:50 PM, Phil Whelan <ph...@gmail.com> wrote:
>> Hi Jack,
>>
>> I'm just trying to follow the logic and I'm a bit confused.
>>
>>> Note that  ['generic', 'photo'], utilizes 'OR' operator, and not
>>> 'AND'.   Is it possible to create a scanner that will not AND and not
>>> OR?, in which case something like this:
>>
>> Am I right in thinking you meant "AND and not OR" instead of "not AND
>> and not OR"?
>>
>> Thanks,
>> Phil
>>
>> On Fri, Jan 7, 2011 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>>> Hello all, I have a scanner question, we have this table:
>>>
>>> hbase(main):002:0> scan 'mattest'
>>> ROW                                          COLUMN+CELL
>>>  1                                           column=generic:,
>>> timestamp=1294454057618, value=1
>>>  1                                           column=photo:,
>>> timestamp=1294453830339, value=1
>>>  1                                           column=type:,
>>> timestamp=1294453812716, value=photo
>>>  1                                           column=type:photo,
>>> timestamp=1294453884174, value=photo
>>>  2                                           column=generic:,
>>> timestamp=1294454061156, value=1
>>>  2                                           column=type:,
>>> timestamp=1294453851757, value=video
>>>  2                                           column=type:video,
>>> timestamp=1294453877719, value=video
>>>  2                                           column=video:,
>>> timestamp=1294453842722, value=1
>>>
>>> We need to run this query:
>>>
>>> hbase(main):004:0> scan 'mattest', {COLUMNS => ['generic', 'photo']}
>>> ROW                                          COLUMN+CELL
>>>  1                                           column=generic:,
>>> timestamp=1294454057618, value=1
>>>  1                                           column=photo:,
>>> timestamp=1294453830339, value=1
>>>  2                                           column=generic:,
>>> timestamp=1294454061156, value=1
>>>
>>> Note that  ['generic', 'photo'], utilizes 'OR' operator, and not
>>> 'AND'.   Is it possible to create a scanner that will not AND and not
>>> OR?, in which case something like this:
>>>
>>> scan 'mattest', {COLUMNS => ['generic' AND 'photo']}
>>> ROW                                          COLUMN+CELL
>>>  1                                           column=generic:,
>>> timestamp=1294454057618, value=1
>>>  1                                           column=photo:,
>>> timestamp=1294453830339, value=1
>>>
>>> Thanks in advance.
>>>
>>> -Jack
>>>
>>
>

Re: question about merge-join (or AND operator between columns)

Posted by Jack Levin <ma...@gmail.com>.
Sorry, my mistake: right now it's only OR, and we really need AND.
I would think that with bloomfilters this could be a sweet feature to
add if it's not already there.


-Jack

On Fri, Jan 7, 2011 at 10:50 PM, Phil Whelan <ph...@gmail.com> wrote:
> Hi Jack,
>
> I'm just trying to follow the logic and I'm a bit confused.
>
>> Note that  ['generic', 'photo'], utilizes 'OR' operator, and not
>> 'AND'.   Is it possible to create a scanner that will not AND and not
>> OR?, in which case something like this:
>
> Am I right in thinking you meant "AND and not OR" instead of "not AND
> and not OR"?
>
> Thanks,
> Phil
>
> On Fri, Jan 7, 2011 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>> Hello all, I have a scanner question, we have this table:
>>
>> hbase(main):002:0> scan 'mattest'
>> ROW                                          COLUMN+CELL
>>  1                                           column=generic:,
>> timestamp=1294454057618, value=1
>>  1                                           column=photo:,
>> timestamp=1294453830339, value=1
>>  1                                           column=type:,
>> timestamp=1294453812716, value=photo
>>  1                                           column=type:photo,
>> timestamp=1294453884174, value=photo
>>  2                                           column=generic:,
>> timestamp=1294454061156, value=1
>>  2                                           column=type:,
>> timestamp=1294453851757, value=video
>>  2                                           column=type:video,
>> timestamp=1294453877719, value=video
>>  2                                           column=video:,
>> timestamp=1294453842722, value=1
>>
>> We need to run this query:
>>
>> hbase(main):004:0> scan 'mattest', {COLUMNS => ['generic', 'photo']}
>> ROW                                          COLUMN+CELL
>>  1                                           column=generic:,
>> timestamp=1294454057618, value=1
>>  1                                           column=photo:,
>> timestamp=1294453830339, value=1
>>  2                                           column=generic:,
>> timestamp=1294454061156, value=1
>>
>> Note that  ['generic', 'photo'], utilizes 'OR' operator, and not
>> 'AND'.   Is it possible to create a scanner that will not AND and not
>> OR?, in which case something like this:
>>
>> scan 'mattest', {COLUMNS => ['generic' AND 'photo']}
>> ROW                                          COLUMN+CELL
>>  1                                           column=generic:,
>> timestamp=1294454057618, value=1
>>  1                                           column=photo:,
>> timestamp=1294453830339, value=1
>>
>> Thanks in advance.
>>
>> -Jack
>>
>

Re: question about merge-join (or AND operator between columns)

Posted by Phil Whelan <ph...@gmail.com>.
Hi Jack,

I'm just trying to follow the logic and I'm a bit confused.

> Note that  ['generic', 'photo'], utilizes 'OR' operator, and not
> 'AND'.   Is it possible to create a scanner that will not AND and not
> OR?, in which case something like this:

Am I right in thinking you meant "AND and not OR" instead of "not AND
and not OR"?

Thanks,
Phil

On Fri, Jan 7, 2011 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
> Hello all, I have a scanner question, we have this table:
>
> hbase(main):002:0> scan 'mattest'
> ROW                                          COLUMN+CELL
>  1                                           column=generic:,
> timestamp=1294454057618, value=1
>  1                                           column=photo:,
> timestamp=1294453830339, value=1
>  1                                           column=type:,
> timestamp=1294453812716, value=photo
>  1                                           column=type:photo,
> timestamp=1294453884174, value=photo
>  2                                           column=generic:,
> timestamp=1294454061156, value=1
>  2                                           column=type:,
> timestamp=1294453851757, value=video
>  2                                           column=type:video,
> timestamp=1294453877719, value=video
>  2                                           column=video:,
> timestamp=1294453842722, value=1
>
> We need to run this query:
>
> hbase(main):004:0> scan 'mattest', {COLUMNS => ['generic', 'photo']}
> ROW                                          COLUMN+CELL
>  1                                           column=generic:,
> timestamp=1294454057618, value=1
>  1                                           column=photo:,
> timestamp=1294453830339, value=1
>  2                                           column=generic:,
> timestamp=1294454061156, value=1
>
> Note that  ['generic', 'photo'], utilizes 'OR' operator, and not
> 'AND'.   Is it possible to create a scanner that will not AND and not
> OR?, in which case something like this:
>
> scan 'mattest', {COLUMNS => ['generic' AND 'photo']}
> ROW                                          COLUMN+CELL
>  1                                           column=generic:,
> timestamp=1294454057618, value=1
>  1                                           column=photo:,
> timestamp=1294453830339, value=1
>
> Thanks in advance.
>
> -Jack
>