
Posted to user@hbase.apache.org by Stefan Comanita <co...@yahoo.com> on 2011/05/10 14:56:39 UTC

HBase filtered scan problem

Hi all, 

I want to scan a number of rows, each row having multiple columns, and filter out some of these columns based on their values. For example, if I have the following rows:

plainRow:col:value1                     column=T:19, timestamp=19, value=                                                                                   
plainRow:col:value1                     column=T:2, timestamp=2, value=U                                                                                    
plainRow:col:value1                     column=T:3, timestamp=3, value=U                                                                                    
plainRow:col:value1                     column=T:4, timestamp=4, value=

and

secondRow:col:value1                     column=T:1, timestamp=1, value=                                                                                   
secondRow:col:value1                     column=T:2, timestamp=2, value=                                                                                    
secondRow:col:value1                     column=T:3, timestamp=3, value=U                                                                                   
secondRow:col:value1                     column=T:4, timestamp=4, value=


and I want to select all the rows, but only with the columns that don't have the value "U", something like:

plainRow:col:value1                     column=T:19, timestamp=19, value=                                                                                   
plainRow:col:value1                     column=T:4, timestamp=4, value=
secondRow:col:value1                     column=T:1, timestamp=1, value=                                                                                   
secondRow:col:value1                     column=T:2, timestamp=2, value=
secondRow:col:value1                     column=T:4, timestamp=4, value=

To achieve this, I tried the following:

Scan scan = new Scan();
        
scan.setStartRow(stringToBytes(rowIdentifier));
scan.setStopRow(stringToBytes(rowIdentifier + Constants.MAX_CHAR));
scan.addFamily(Constants.TERM_VECT_COLUMN_FAMILY);

if(includeFilter) {
    Filter filter = new ValueFilter(CompareOp.EQUAL, 
            new BinaryComparator(stringToBytes("U")));        
    scan.setFilter(filter);
}

If I execute this scan I get the rows with the columns that have the value "U", which is correct. But when I set CompareOp.NOT_EQUAL, expecting to get the other columns, it doesn't work the way I want: it gives me back all the rows, including the ones that have the value "U". The same happens when I use:
Filter filter = new ValueFilter(CompareOp.EQUAL, new BinaryComparator(stringToBytes(""))); 

I should mention that the columns have the values "U" and "" (empty string), and that I also saw the same behavior with the RegexStringComparator and SubstringComparator.

Any idea would be very much appreciated, sorry for the long mail, thank you.

Stefan Comanita
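
The result expected from CompareOp.NOT_EQUAL in the message above can be stated as a plain per-cell rule: keep a cell only if its value differs from "U". Below is a minimal sketch of that rule in plain Java, with no HBase dependency. The class ExpectedValueFilter, the Cell type and the helper names are hypothetical; the sketch models only the expected outcome on the sample data, not how ValueFilter is implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class ExpectedValueFilter {

    /** Hypothetical stand-in for one stored cell: row key, qualifier in family T, value. */
    static final class Cell {
        final String row;
        final String qualifier;
        final String value;

        Cell(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " column=T:" + qualifier + ", value=" + value;
        }
    }

    /** Keeps every cell whose value differs from the excluded value (NOT_EQUAL semantics). */
    static List<Cell> keepNotEqual(List<Cell> cells, String excluded) {
        List<Cell> kept = new ArrayList<>();
        for (Cell c : cells) {
            if (!c.value.equals(excluded)) {
                kept.add(c);
            }
        }
        return kept;
    }

    /** The eight sample cells from the question. */
    static List<Cell> sampleCells() {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("plainRow:col:value1", "19", ""));
        cells.add(new Cell("plainRow:col:value1", "2", "U"));
        cells.add(new Cell("plainRow:col:value1", "3", "U"));
        cells.add(new Cell("plainRow:col:value1", "4", ""));
        cells.add(new Cell("secondRow:col:value1", "1", ""));
        cells.add(new Cell("secondRow:col:value1", "2", ""));
        cells.add(new Cell("secondRow:col:value1", "3", "U"));
        cells.add(new Cell("secondRow:col:value1", "4", ""));
        return cells;
    }

    public static void main(String[] args) {
        // Print only the cells that survive the NOT_EQUAL "U" rule.
        for (Cell c : keepNotEqual(sampleCells(), "U")) {
            System.out.println(c);
        }
    }
}
```

On the eight sample cells this keeps five: T:19 and T:4 of the first row, and T:1, T:2 and T:4 of the second row, which matches the listing in the question.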

Re: HBase filtered scan problem

Posted by Iulia Zidaru <iu...@1and1.ro>.

Thank you very much, St.Ack.
It sounds like we have to create another filter.
Iulia

On 05/12/2011 08:07 PM, Stack wrote:
> On Thu, May 12, 2011 at 6:42 AM, Iulia Zidaru<iu...@1and1.ro>  wrote:
>>   Hi,
>>
>> Thank you for your answer St. Ack.
>> Yes, both coordinates are the same. It is impossible for the filter to
>> decide that a value is old. I still don't understand why the HBase server
>> has both values or how long it keeps both.
> Well, it's hard to 'overwrite' if one value is in the memstore and the
> other is out on the filesystem.
>
> It'll do the cleanup on major compaction.
>
> The filter should be able to pick up ordering hints from its context;
> it's just not doing it.
>
>> The same thing happens if
>> puts have different timestamps.
>>
> With the filter, you mean?  I'd think the filter should distinguish these.
> St.Ack

Re: HBase filtered scan problem

Posted by Stack <st...@duboce.net>.

On Thu, May 12, 2011 at 6:42 AM, Iulia Zidaru <iu...@1and1.ro> wrote:
>  Hi,
>
> Thank you for your answer St. Ack.
> Yes, both coordinates are the same. It is impossible for the filter to
> decide that a value is old. I still don't understand why the HBase server
> has both values or how long it keeps both.

Well, it's hard to 'overwrite' if one value is in the memstore and the
other is out on the filesystem.

It'll do the cleanup on major compaction.

The filter should be able to pick up ordering hints from its context;
it's just not doing it.

> The same thing happens if
> puts have different timestamps.
>

With the filter, you mean?  I'd think the filter should distinguish these.
St.Ack
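
The point above about the memstore, the filesystem and major compaction can be pictured with a toy model in plain Java. Everything in it is an assumption made for illustration: the Entry type, the seq write-order counter and the compact method are hypothetical and do not mirror HBase internals. The sketch assumes a column family limited to one version, so compaction keeps a single entry per row and qualifier, preferring the highest timestamp and, on a tie, the latest write.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {

    /** Hypothetical stored entry; seq stands in for write order and is not an HBase field name. */
    static final class Entry {
        final String row;
        final String qualifier;
        final long timestamp;
        final long seq;
        final String value;

        Entry(String row, String qualifier, long timestamp, long seq, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.seq = seq;
            this.value = value;
        }
    }

    /** True if a should win over b: higher timestamp first, then later write. */
    static boolean isNewer(Entry a, Entry b) {
        return a.timestamp > b.timestamp || (a.timestamp == b.timestamp && a.seq > b.seq);
    }

    /** Toy "major compaction": merges both sources and keeps one entry per row/qualifier. */
    static List<Entry> compact(List<Entry> memstore, List<Entry> storeFile) {
        List<Entry> all = new ArrayList<>(storeFile);
        all.addAll(memstore);
        Map<String, Entry> newest = new LinkedHashMap<>();
        for (Entry e : all) {
            String key = e.row + "|" + e.qualifier;
            Entry seen = newest.get(key);
            if (seen == null || isNewer(e, seen)) {
                newest.put(key, e);
            }
        }
        return new ArrayList<>(newest.values());
    }

    /** Counts entries whose value equals the wanted value, with no regard to write order. */
    static int countMatching(List<Entry> entries, String wanted) {
        int n = 0;
        for (Entry e : entries) {
            if (e.value.equals(wanted)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // First put (empty value) has already been flushed; second put ("U") is still in memory.
        List<Entry> storeFile = new ArrayList<>();
        storeFile.add(new Entry("plainRow:col:value1", "2", 2L, 1L, ""));
        List<Entry> memstore = new ArrayList<>();
        memstore.add(new Entry("plainRow:col:value1", "2", 2L, 2L, "U"));

        List<Entry> before = new ArrayList<>(storeFile);
        before.addAll(memstore);
        System.out.println("empty-value matches before compaction: " + countMatching(before, ""));
        System.out.println("empty-value matches after compaction: "
                + countMatching(compact(memstore, storeFile), ""));
    }
}
```

In this model the stale empty value is still matched while both entries exist, and stops being matched once the toy compaction has dropped it.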

Re: HBase filtered scan problem

Posted by Iulia Zidaru <iu...@1and1.ro>.

  Hi,

Thank you for your answer St. Ack.
Yes, both coordinates are the same. It is impossible for the filter to 
decide that a value is old. I still don't understand why the HBase 
server has both values or how long it keeps both. The same thing 
happens if puts have different timestamps.

Regards,
Iulia

On 05/11/2011 08:05 PM, Stack wrote:
> On Wed, May 11, 2011 at 2:05 AM, Iulia Zidaru<iu...@1and1.ro>  wrote:
>>   Hi,
>> I'll try to rephrase the problem...
>> We have a table where we add an empty value. (The same thing also happens if
>> we have a value.)
>> Afterward we put a value inside. (Same put, just another value.) When scanning
>> for empty values (the first values inserted), the result is wrong because the
>> filter gets called for both values (the empty one, which matches, and the non-empty
>> one, which doesn't). The table has only one version. It looks like the heap
>> object in StoreScanner has both objects. Do you have any idea if this is
>> normal behavior and if we can avoid this somehow?
>>
> Both entries exist in the hbase server, yes.
>
> The coordinates for both are the same?  If exactly the same
> row/cf/qualifier/timestamp then it's going to be hard to distinguish
> between the two entries.  The filter is probably not smart enough to
> take insertion order into account.
>
> St.Ack

Re: HBase filtered scan problem

Posted by Stack <st...@duboce.net>.

On Wed, May 11, 2011 at 2:05 AM, Iulia Zidaru <iu...@1and1.ro> wrote:
>  Hi,
> I'll try to rephrase the problem...
> We have a table where we add an empty value. (The same thing also happens if
> we have a value.)
> Afterward we put a value inside. (Same put, just another value.) When scanning
> for empty values (the first values inserted), the result is wrong because the
> filter gets called for both values (the empty one, which matches, and the non-empty
> one, which doesn't). The table has only one version. It looks like the heap
> object in StoreScanner has both objects. Do you have any idea if this is
> normal behavior and if we can avoid this somehow?
>

Both entries exist in the hbase server, yes.

The coordinates for both are the same?  If exactly the same
row/cf/qualifier/timestamp then it's going to be hard to distinguish
between the two entries.  The filter is probably not smart enough to
take insertion order into account.

St.Ack
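
The situation described above, two entries with exactly the same row/cf/qualifier/timestamp and a filter that looks only at values, can be reproduced in a few lines of plain Java. This is a sketch under stated assumptions, not HBase code: the Entry type and the matchValue helper are hypothetical, and the list order merely stands in for insertion order, which the value check never consults.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateCoordinates {

    /** Hypothetical stored entry identified by row, qualifier and timestamp. */
    static final class Entry {
        final String row;
        final String qualifier;
        final long timestamp;
        final String value;

        Entry(String row, String qualifier, long timestamp, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }

        /** True when both entries share row, qualifier and timestamp. */
        boolean sameCoordinates(Entry other) {
            return row.equals(other.row)
                    && qualifier.equals(other.qualifier)
                    && timestamp == other.timestamp;
        }
    }

    /** EQUAL-style value check applied to every stored entry, ignoring write order. */
    static List<Entry> matchValue(List<Entry> stored, String wanted) {
        List<Entry> hits = new ArrayList<>();
        for (Entry e : stored) {
            if (e.value.equals(wanted)) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Entry> stored = new ArrayList<>();
        // First put: empty value.
        stored.add(new Entry("plainRow:col:value1", "2", 2L, ""));
        // Second put: same coordinates, new value.
        stored.add(new Entry("plainRow:col:value1", "2", 2L, "U"));

        System.out.println("same coordinates: "
                + stored.get(0).sameCoordinates(stored.get(1)));
        System.out.println("entries matching the empty value: "
                + matchValue(stored, "").size());
    }
}
```

Because nothing in the coordinates separates the two entries, a check on the value alone still finds the older, empty one even though a newer value was written to the same cell.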

Re: HBase filtered scan problem

Posted by Iulia Zidaru <iu...@1and1.ro>.

  Hi,
I'll try to rephrase the problem...
We have a table where we add an empty value. (The same thing also happens 
if we have a value.)
Afterward we put a value inside. (Same put, just another value.) When 
scanning for empty values (the first values inserted), the result is wrong 
because the filter gets called for both values (the empty one, which matches, 
and the non-empty one, which doesn't). The table has only one version. 
It looks like the heap object in StoreScanner has both objects. Do you 
have any idea if this is normal behavior and if we can avoid this somehow?

Thank you,
Iulia

On 05/10/2011 03:56 PM, Stefan Comanita wrote:
> Hi all,
>
> I want to scan a number of rows, each row having multiple columns, and filter out some of these columns based on their values. For example, if I have the following rows:
>
> plainRow:col:value1                     column=T:19, timestamp=19, value=                                                                                   
> plainRow:col:value1                     column=T:2, timestamp=2, value=U                                                                                    
> plainRow:col:value1                     column=T:3, timestamp=3, value=U                                                                                    
> plainRow:col:value1                     column=T:4, timestamp=4, value=
>
> and
>
> secondRow:col:value1                     column=T:1, timestamp=1, value=                                                                                   
> secondRow:col:value1                     column=T:2, timestamp=2, value=                                                                                    
> secondRow:col:value1                     column=T:3, timestamp=3, value=U                                                                                   
> secondRow:col:value1                     column=T:4, timestamp=4, value=
>
>
> and I want to select all the rows, but only with the columns that don't have the value "U", something like:
>
> plainRow:col:value1                     column=T:19, timestamp=19, value=
> plainRow:col:value1                     column=T:4, timestamp=4, value=
> secondRow:col:value1                     column=T:1, timestamp=1, value=
> secondRow:col:value1                     column=T:2, timestamp=2, value=
> secondRow:col:value1                     column=T:4, timestamp=4, value=
>
> To achieve this, I tried the following:
>
> Scan scan = new Scan();
>          
> scan.setStartRow(stringToBytes(rowIdentifier));
> scan.setStopRow(stringToBytes(rowIdentifier + Constants.MAX_CHAR));
> scan.addFamily(Constants.TERM_VECT_COLUMN_FAMILY);
>
> if(includeFilter) {
>      Filter filter = new ValueFilter(CompareOp.EQUAL,
>              new BinaryComparator(stringToBytes("U")));        
>      scan.setFilter(filter);
> }
>
> If I execute this scan I get the rows with the columns that have the value "U", which is correct. But when I set CompareOp.NOT_EQUAL, expecting to get the other columns, it doesn't work the way I want: it gives me back all the rows, including the ones that have the value "U". The same happens when I use:
> Filter filter = new ValueFilter(CompareOp.EQUAL, new BinaryComparator(stringToBytes("")));
>
> I should mention that the columns have the values "U" and "" (empty string), and that I also saw the same behavior with the RegexStringComparator and SubstringComparator.
>
> Any idea would be very much appreciated, sorry for the long mail, thank you.
>
> Stefan Comanita

