Posted to dev@hbase.apache.org by lars hofhansl <la...@apache.org> on 2013/12/07 02:31:26 UTC

Re: HBase returns old values even with max versions = 1

+ dev list

Specifically:

Currently the workflow in ScanQueryMatcher is something like this:

1. <versions> = min(<CF versions>, <scan versions>)
2. filter by timerange
3. filter out columns (i.e. columns not specified in the scan)
4. apply custom filters
5. filter by <versions>

Every KV is passed through this filtering process.

What we should do is this:

1. filter by <CF versions>
2. filter by timerange
3. filter out columns (i.e. columns not specified in the scan)
4. apply custom filters
5. filter by <scan versions>

The trick will be doing that efficiently.
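
To illustrate the difference between the two orderings, here is a small standalone simulation. It is only a sketch and not HBase code: the function names are made up for this example, cells are plain (timestamp, value) tuples sorted newest first, and steps 3 and 4 (column and custom filters) are left out because they do not change the outcome here. The time range is treated as half-open, [min, max).

```python
# Standalone simulation of the two filter orderings; NOT HBase code.
# All names here are hypothetical and exist only for this illustration.

def in_range(ts, timerange):
    """True if ts falls inside the half-open range [lo, hi)."""
    lo, hi = timerange
    return lo <= ts < hi


def current_order(cells, cf_versions, scan_versions, timerange):
    """Filter by time range first, then count versions."""
    versions = min(cf_versions, scan_versions)
    survivors = [c for c in cells if in_range(c[0], timerange)]
    return survivors[:versions]


def proposed_order(cells, cf_versions, scan_versions, timerange):
    """Apply the column family limit first, then the time range,
    then the scan's own version limit."""
    retained = cells[:cf_versions]
    survivors = [c for c in retained if in_range(c[0], timerange)]
    return survivors[:scan_versions]


# The three puts from the shell session, newest first.
cells = [(3000, "Three"), (2000, "Two"), (1000, "One")]

print(current_order(cells, 1, 1, (0, 1500)))   # [(1000, 'One')]
print(proposed_order(cells, 1, 1, (0, 1500)))  # []
```

With the data from the shell session in this thread, the current ordering returns the old value 'One', while the proposed ordering returns nothing.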

-- Lars



________________________________
 From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Friday, December 6, 2013 5:10 PM
Subject: Re: HBase returns old values even with max versions = 1
 

The old versions can still be around until a flush and/or compaction.

During a user-level scan, HBase first filters by timerange and then counts the versions.
I agree, this is counterintuitive in this case. In other cases people want to first limit by timerange, and then get x number of versions back.
We might need to start to distinguish between the number of versions configured for the column family and the number of versions configured for the scan.
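
A rough model of why the result can depend on whether a flush or compaction has happened yet. This is a sketch under stated assumptions, not HBase internals: `prune_versions` stands in for whatever step physically removes excess versions, and both function names are hypothetical.

```python
# Toy model: old versions stay readable until something prunes them.
# NOT HBase code; names and data layout are assumptions for illustration.

def prune_versions(cells, max_versions):
    """Keep only the newest max_versions cells of one column."""
    newest_first = sorted(cells, key=lambda c: c[0], reverse=True)
    return newest_first[:max_versions]


def get_in_timerange(cells, timerange, versions=1):
    """Filter by the half-open time range first, then count versions."""
    lo, hi = timerange
    newest_first = sorted(cells, key=lambda c: c[0], reverse=True)
    matches = [c for c in newest_first if lo <= c[0] < hi]
    return matches[:versions]


store = [(1000, "One"), (2000, "Two"), (3000, "Three")]

# Before pruning, the old cell is still physically present and matches.
before = get_in_timerange(store, (0, 1500))

# After pruning to VERSIONS => 1, only the newest cell is left.
store = prune_versions(store, 1)
after = get_in_timerange(store, (0, 1500))

print(before)  # [(1000, 'One')]
print(after)   # []
```

In this model the same query gives different answers before and after pruning, which is why it cannot be relied on to detect stale columns.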

Mind filing a jira? Can discuss solutions there.

Thanks.

-- Lars



________________________________

From: Niels Basjes <Ni...@basjes.nl>
To: user <us...@hbase.apache.org> 
Sent: Friday, December 6, 2013 8:05 AM
Subject: HBase returns old values even with max versions = 1


Hi,

I have the desire to find the columns that have not been updated for more
than a specific time period.

So I want to do a scan against the columns with a timerange.
The normal behavior of HBase is that you then get the latest value in that
time range (which is not what I want).

As far as I understand the way HBase should work is that if you set the
maximum number of versions for the values in a column family to '1' it
should retain only the last value that was put into the cell.

What I found is different.

If I run the following commands in the hbase shell

    create 't1', {NAME => 'c1', VERSIONS => 1}
    put 't1', 'r1', 'c1', 'One', 1000
    put 't1', 'r1', 'c1', 'Two', 2000
    put 't1', 'r1', 'c1', 'Three', 3000
    get 't1', 'r1'
    get 't1', 'r1' , {TIMERANGE => [0,1500]}

the result is this:

    get 't1', 'r1'
    COLUMN                     CELL
     c1:                       timestamp=3000, value=Three
    1 row(s) in 0.0780 seconds

    get 't1', 'r1' , {TIMERANGE => [0,1500]}
    COLUMN                     CELL
     c1:                       timestamp=1000, value=One
    1 row(s) in 0.1390 seconds

Why does the second query return a value even though I've set the max
versions to only 1?
I expect that it only 'knows' about the latest value ('Three') and thus
should return an empty result in the above example.
What is the correct way to obtain what I'm looking for?

My current workaround is that I simply retrieve the latest value for all my
columns and filter them in my application code.

The HBase version I currently have installed here is HBase 0.94.6-cdh4.4.0

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: HBase returns old values even with max versions = 1

Posted by lars hofhansl <la...@apache.org>.
With the restrictions and optimizations that HBase applies, this change in behavior is simply not possible.
In your use case, can't you just retrieve the latest state of a row and then check the timestamps of the KeyValues?
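
A minimal sketch of that client-side check, assuming the latest version of each column has already been fetched with a plain get (no time range) and collected into a dict. The function name, the dict layout, and the column names in the example are hypothetical, not part of any HBase client API.

```python
import time


def stale_columns(latest_cells, max_age_ms, now_ms=None):
    """Return the columns whose newest version is older than max_age_ms.

    latest_cells maps column name -> (timestamp_ms, value), holding only
    the newest version of each column.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff = now_ms - max_age_ms
    # Keep a column only if its latest write happened before the cutoff.
    return {col: cell for col, cell in latest_cells.items() if cell[0] < cutoff}


# Example with a fixed "now" so the result is reproducible.
row = {"c1:a": (1000, "old"), "c1:b": (9000, "new")}
print(stale_columns(row, max_age_ms=5000, now_ms=10000))
# {'c1:a': (1000, 'old')}
```

Because only the newest version per column is inspected, this gives the same answer regardless of whether older versions have been compacted away yet.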

-- Lars



________________________________
 From: Niels Basjes <Ni...@basjes.nl>
To: dev@hbase.apache.org; larsh@apache.org 
Sent: Sunday, December 22, 2013 12:31 PM
Subject: Re: HBase returns old values even with max versions = 1
 

I saw that the issue was closed with a "Won't fix".
Now I would like to know what the optimal solution is for the functional
effect I'm looking for?
"Get the latest value for only those columns that have not been modified
for more than X time"

Is it writing a custom filter or is there a better way?

Niels

On Dec 8, 2013 10:01 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:

> Thanks for clarifying this,
> I know now why my code didn't work as expected.
>
> For now I think that creating a simple custom Filter for my situation is
> the most efficient workaround.
>
> Niels Basjes
>
>
> On Sat, Dec 7, 2013 at 3:26 AM, lars hofhansl <la...@apache.org> wrote:
>
>> Filed https://issues.apache.org/jira/browse/HBASE-10102
>>
>>
>>
>> ________________________________
>>  From: lars hofhansl <la...@apache.org>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; hbase-dev <
>> dev@hbase.apache.org>
>> Sent: Friday, December 6, 2013 5:31 PM
>> Subject: Re: HBase returns old values even with max versions = 1
>>
>>
>> + dev list
>>
>> Specifically:
>>
>> Currently the workflow in ScanQueryMatcher is something like this:
>>
>> 1. <versions> = min(<CF versions>, <scan versions>)
>> 2. filter by timerange
>> 3. filter out columns (i.e. columns not specified in the scan)
>> 4. apply custom filters
>> 5. filter by <versions>
>>
>> Every KV is passed through this filtering process.
>>
>> What we should do is this:
>>
>> 1. filter by <CF versions>
>> 2. filter by timerange
>> 3. filter out columns (i.e. columns not specified in the scan)
>> 4. apply custom filters
>> 5. filter by <scan versions>
>>
>> The trick will be doing that efficiently.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>
>> From: lars hofhansl <la...@apache.org>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Sent: Friday, December 6, 2013 5:10 PM
>> Subject: Re: HBase returns old values even with max versions = 1
>>
>>
>> The old versions can still be around until a flush and/or compaction.
>>
>> During a user-level scan, HBase first filters by timerange and then
>> counts the versions.
>> I agree, this is counterintuitive in this case. In other cases people
>> want to first limit by timerange, and then get x number of versions back.
>> We might need to start to distinguish between the number of versions
>> configured for the column family and the number of versions configured for
>> the scan.
>>
>> Mind filing a jira? Can discuss solutions there.
>>
>> Thanks.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>
>> From: Niels Basjes <Ni...@basjes.nl>
>> To: user <us...@hbase.apache.org>
>> Sent: Friday, December 6, 2013 8:05 AM
>> Subject: HBase returns old values even with max versions = 1
>>
>>
>> Hi,
>>
>> I have the desire to find the columns that have not been updated for more
>> than a specific time period.
>>
>> So I want to do a scan against the columns with a timerange.
>> The normal behavior of HBase is that you then get the latest value in that
>> time range (which is not what I want).
>>
>> As far as I understand the way HBase should work is that if you set the
>> maximum number of versions for the values in a column family to '1' it
>> should retain only the last value that was put into the cell.
>>
>> What I found is different.
>>
>> If I run the following commands in the hbase shell
>>
>>     create 't1', {NAME => 'c1', VERSIONS => 1}
>>     put 't1', 'r1', 'c1', 'One', 1000
>>     put 't1', 'r1', 'c1', 'Two', 2000
>>     put 't1', 'r1', 'c1', 'Three', 3000
>>     get 't1', 'r1'
>>     get 't1', 'r1' , {TIMERANGE => [0,1500]}
>>
>> the result is this:
>>
>>     get 't1', 'r1'
>>     COLUMN                     CELL
>>      c1:                       timestamp=3000, value=Three
>>     1 row(s) in 0.0780 seconds
>>
>>     get 't1', 'r1' , {TIMERANGE => [0,1500]}
>>     COLUMN                     CELL
>>      c1:                       timestamp=1000, value=One
>>     1 row(s) in 0.1390 seconds
>>
>> Why does the second query return a value even though I've set the max
>> versions to only 1?
>> I expect that it only 'knows' about the latest value ('Three') and thus
>> should return an empty result in the above example.
>> What is the correct way to obtain what I'm looking for?
>>
>> My current workaround is that I simply retrieve the latest value for all
>> my
>> columns and filter them in my application code.
>>
>> The HBase version I currently have installed here is HBase 0.94.6-cdh4.4.0
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>

Re: HBase returns old values even with max versions = 1

Posted by Niels Basjes <Ni...@basjes.nl>.
I saw that the issue was closed with a "Won't fix".
Now I would like to know what the optimal solution is for the functional
effect I'm looking for?
"Get the latest value for only those columns that have not been modified
for more than X time"

Is it writing a custom filter or is there a better way?

Niels
On Dec 8, 2013 10:01 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:

> Thanks for clarifying this,
> I know now why my code didn't work as expected.
>
> For now I think that creating a simple custom Filter for my situation is
> the most efficient workaround.
>
> Niels Basjes
>
>
> On Sat, Dec 7, 2013 at 3:26 AM, lars hofhansl <la...@apache.org> wrote:
>
>> Filed https://issues.apache.org/jira/browse/HBASE-10102
>>
>>
>>
>> ________________________________
>>  From: lars hofhansl <la...@apache.org>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; hbase-dev <
>> dev@hbase.apache.org>
>> Sent: Friday, December 6, 2013 5:31 PM
>> Subject: Re: HBase returns old values even with max versions = 1
>>
>>
>> + dev list
>>
>> Specifically:
>>
>> Currently the workflow in ScanQueryMatcher is something like this:
>>
>> 1. <versions> = min(<CF versions>, <scan versions>)
>> 2. filter by timerange
>> 3. filter out columns (i.e. columns not specified in the scan)
>> 4. apply custom filters
>> 5. filter by <versions>
>>
>> Every KV is passed through this filtering process.
>>
>> What we should do is this:
>>
>> 1. filter by <CF versions>
>> 2. filter by timerange
>> 3. filter out columns (i.e. columns not specified in the scan)
>> 4. apply custom filters
>> 5. filter by <scan versions>
>>
>> The trick will be doing that efficiently.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>
>> From: lars hofhansl <la...@apache.org>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Sent: Friday, December 6, 2013 5:10 PM
>> Subject: Re: HBase returns old values even with max versions = 1
>>
>>
>> The old versions can still be around until a flush and/or compaction.
>>
>> During a user-level scan, HBase first filters by timerange and then
>> counts the versions.
>> I agree, this is counterintuitive in this case. In other cases people
>> want to first limit by timerange, and then get x number of versions back.
>> We might need to start to distinguish between the number of versions
>> configured for the column family and the number of versions configured for
>> the scan.
>>
>> Mind filing a jira? Can discuss solutions there.
>>
>> Thanks.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>
>> From: Niels Basjes <Ni...@basjes.nl>
>> To: user <us...@hbase.apache.org>
>> Sent: Friday, December 6, 2013 8:05 AM
>> Subject: HBase returns old values even with max versions = 1
>>
>>
>> Hi,
>>
>> I have the desire to find the columns that have not been updated for more
>> than a specific time period.
>>
>> So I want to do a scan against the columns with a timerange.
>> The normal behavior of HBase is that you then get the latest value in that
>> time range (which is not what I want).
>>
>> As far as I understand the way HBase should work is that if you set the
>> maximum number of versions for the values in a column family to '1' it
>> should retain only the last value that was put into the cell.
>>
>> What I found is different.
>>
>> If I run the following commands in the hbase shell
>>
>>     create 't1', {NAME => 'c1', VERSIONS => 1}
>>     put 't1', 'r1', 'c1', 'One', 1000
>>     put 't1', 'r1', 'c1', 'Two', 2000
>>     put 't1', 'r1', 'c1', 'Three', 3000
>>     get 't1', 'r1'
>>     get 't1', 'r1' , {TIMERANGE => [0,1500]}
>>
>> the result is this:
>>
>>     get 't1', 'r1'
>>     COLUMN                     CELL
>>      c1:                       timestamp=3000, value=Three
>>     1 row(s) in 0.0780 seconds
>>
>>     get 't1', 'r1' , {TIMERANGE => [0,1500]}
>>     COLUMN                     CELL
>>      c1:                       timestamp=1000, value=One
>>     1 row(s) in 0.1390 seconds
>>
>> Why does the second query return a value even though I've set the max
>> versions to only 1?
>> I expect that it only 'knows' about the latest value ('Three') and thus
>> should return an empty result in the above example.
>> What is the correct way to obtain what I'm looking for?
>>
>> My current workaround is that I simply retrieve the latest value for all
>> my
>> columns and filter them in my application code.
>>
>> The HBase version I currently have installed here is HBase 0.94.6-cdh4.4.0
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>

Re: HBase returns old values even with max versions = 1

Posted by Niels Basjes <Ni...@basjes.nl>.
Thanks for clarifying this,
I know now why my code didn't work as expected.

For now I think that creating a simple custom Filter for my situation is
the most efficient workaround.

Niels Basjes


On Sat, Dec 7, 2013 at 3:26 AM, lars hofhansl <la...@apache.org> wrote:

> Filed https://issues.apache.org/jira/browse/HBASE-10102
>
>
>
> ________________________________
>  From: lars hofhansl <la...@apache.org>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>; hbase-dev <
> dev@hbase.apache.org>
> Sent: Friday, December 6, 2013 5:31 PM
> Subject: Re: HBase returns old values even with max versions = 1
>
>
> + dev list
>
> Specifically:
>
> Currently the workflow in ScanQueryMatcher is something like this:
>
> 1. <versions> = min(<CF versions>, <scan versions>)
> 2. filter by timerange
> 3. filter out columns (i.e. columns not specified in the scan)
> 4. apply custom filters
> 5. filter by <versions>
>
> Every KV is passed through this filtering process.
>
> What we should do is this:
>
> 1. filter by <CF versions>
> 2. filter by timerange
> 3. filter out columns (i.e. columns not specified in the scan)
> 4. apply custom filters
> 5. filter by <scan versions>
>
> The trick will be doing that efficiently.
>
> -- Lars
>
>
>
> ________________________________
>
> From: lars hofhansl <la...@apache.org>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Friday, December 6, 2013 5:10 PM
> Subject: Re: HBase returns old values even with max versions = 1
>
>
> The old versions can still be around until a flush and/or compaction.
>
> During a user-level scan, HBase first filters by timerange and then counts
> the versions.
> I agree, this is counterintuitive in this case. In other cases people
> want to first limit by timerange, and then get x number of versions back.
> We might need to start to distinguish between the number of versions
> configured for the column family and the number of versions configured for
> the scan.
>
> Mind filing a jira? Can discuss solutions there.
>
> Thanks.
>
> -- Lars
>
>
>
> ________________________________
>
> From: Niels Basjes <Ni...@basjes.nl>
> To: user <us...@hbase.apache.org>
> Sent: Friday, December 6, 2013 8:05 AM
> Subject: HBase returns old values even with max versions = 1
>
>
> Hi,
>
> I have the desire to find the columns that have not been updated for more
> than a specific time period.
>
> So I want to do a scan against the columns with a timerange.
> The normal behavior of HBase is that you then get the latest value in that
> time range (which is not what I want).
>
> As far as I understand the way HBase should work is that if you set the
> maximum number of versions for the values in a column family to '1' it
> should retain only the last value that was put into the cell.
>
> What I found is different.
>
> If I run the following commands in the hbase shell
>
>     create 't1', {NAME => 'c1', VERSIONS => 1}
>     put 't1', 'r1', 'c1', 'One', 1000
>     put 't1', 'r1', 'c1', 'Two', 2000
>     put 't1', 'r1', 'c1', 'Three', 3000
>     get 't1', 'r1'
>     get 't1', 'r1' , {TIMERANGE => [0,1500]}
>
> the result is this:
>
>     get 't1', 'r1'
>     COLUMN                     CELL
>      c1:                       timestamp=3000, value=Three
>     1 row(s) in 0.0780 seconds
>
>     get 't1', 'r1' , {TIMERANGE => [0,1500]}
>     COLUMN                     CELL
>      c1:                       timestamp=1000, value=One
>     1 row(s) in 0.1390 seconds
>
> Why does the second query return a value even though I've set the max
> versions to only 1?
> I expect that it only 'knows' about the latest value ('Three') and thus
> should return an empty result in the above example.
> What is the correct way to obtain what I'm looking for?
>
> My current workaround is that I simply retrieve the latest value for all my
> columns and filter them in my application code.
>
> The HBase version I currently have installed here is HBase 0.94.6-cdh4.4.0
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: HBase returns old values even with max versions = 1

Posted by lars hofhansl <la...@apache.org>.
Filed https://issues.apache.org/jira/browse/HBASE-10102



________________________________
 From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; hbase-dev <de...@hbase.apache.org> 
Sent: Friday, December 6, 2013 5:31 PM
Subject: Re: HBase returns old values even with max versions = 1
 

+ dev list

Specifically:

Currently the workflow in ScanQueryMatcher is something like this:

1. <versions> = min(<CF versions>, <scan versions>)
2. filter by timerange
3. filter out columns (i.e. columns not specified in the scan)
4. apply custom filters
5. filter by <versions>

Every KV is passed through this filtering process.

What we should do is this:

1. filter by <CF versions>
2. filter by timerange
3. filter out columns (i.e. columns not specified in the scan)
4. apply custom filters
5. filter by <scan versions>

The trick will be doing that efficiently.

-- Lars



________________________________

From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Friday, December 6, 2013 5:10 PM
Subject: Re: HBase returns old values even with max versions = 1


The old versions can still be around until a flush and/or compaction.

During a user-level scan, HBase first filters by timerange and then counts the versions.
I agree, this is counterintuitive in this case. In other cases people want to first limit by timerange, and then get x number of versions back.
We might need to start to distinguish between the number of versions configured for the column family and the number of versions configured for the scan.

Mind filing a jira? Can discuss solutions there.

Thanks.

-- Lars



________________________________

From: Niels Basjes <Ni...@basjes.nl>
To: user <us...@hbase.apache.org> 
Sent: Friday, December 6, 2013 8:05 AM
Subject: HBase returns old values even with max versions = 1


Hi,

I have the desire to find the columns that have not been updated for more
than a specific time period.

So I want to do a scan against the columns with a timerange.
The normal behavior of HBase is that you then get the latest value in that
time range (which is not what I want).

As far as I understand the way HBase should work is that if you set the
maximum number of versions for the values in a column family to '1' it
should retain only the last value that was put into the cell.

What I found is different.

If I run the following commands in the hbase shell

    create 't1', {NAME => 'c1', VERSIONS => 1}
    put 't1', 'r1', 'c1', 'One', 1000
    put 't1', 'r1', 'c1', 'Two', 2000
    put 't1', 'r1', 'c1', 'Three', 3000
    get 't1', 'r1'
    get 't1', 'r1' , {TIMERANGE => [0,1500]}

the result is this:

    get 't1', 'r1'
    COLUMN                     CELL
     c1:                       timestamp=3000, value=Three
    1 row(s) in 0.0780 seconds

    get 't1', 'r1' , {TIMERANGE => [0,1500]}
    COLUMN                     CELL
     c1:                       timestamp=1000, value=One
    1 row(s) in 0.1390 seconds

Why does the second query return a value even though I've set the max
versions to only 1?
I expect that it only 'knows' about the latest value ('Three') and thus
should return an empty result in the above example.
What is the correct way to obtain what I'm looking for?

My current workaround is that I simply retrieve the latest value for all my
columns and filter them in my application code.

The HBase version I currently have installed here is HBase 0.94.6-cdh4.4.0

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
