Posted to dev@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2009/06/05 20:09:07 UTC

[jira] Created: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Wrong or indeterminate behavior when there are duplicate versions of a column
-----------------------------------------------------------------------------

                 Key: HBASE-1485
                 URL: https://issues.apache.org/jira/browse/HBASE-1485
             Project: Hadoop HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.20.0
            Reporter: Jonathan Gray
             Fix For: 0.20.1


As of now, both gets and scanners will end up returning all duplicate versions of a column.  The ordering of them is indeterminate.

We need to decide what the desired/expected behavior should be and make it happen.

Note:  It's nearly impossible for this to work with Gets as they are now implemented in HBASE-1304, so this is really a Scanner issue.  To implement this correctly with Gets, we would have to undo basically all the optimizations that Gets do, making them far slower than a Scanner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Posted by Ryan Rawson <ry...@gmail.com>.

The sequenceid in the file tells you the newest (largest=newest). If the
heap used that we might be sitting pretty.

We want to avoid using ts for filename I think, not sure what assumptions
might break.

On Jul 9, 2009 11:42 AM, "Jonathan Gray (JIRA)" <ji...@apache.org> wrote:

[
https://issues.apache.org/jira/browse/HBASE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729379#action_12729379]

Jonathan Gray commented on HBASE-1485:
--------------------------------------

I've had at least three people with a use case for this.

Might create a couple sub-tasks here so we can at least head in the right
direction.

First, we need to make scanners ignore duplicate versions of the same
column. The trickiest part is, how do we determine which to keep? We want
to always come from the latest storefile, but I believe their IDs are still
random and not timestamps? We might need to make that change to fix this.
Would also then require a modification to the KVHeap to take this into
account, all other things considered equal.

Once we have scanners working, that will mean the proper thing is enforced
on major (and if we want, minor) compactions.

Gets will only work once we re-implement Gets as an optimized scan (taking
advantage of bloom filters, mostly).

I remember why I punted this to 0.20.1, the tricky part at the beginning is
pretty tough and touches a good bit of core read-path code.

Revisiting now, we'll see. Anyone else interested in this / want to work on
it?

[jira] Commented: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729526#action_12729526 ] 

stack commented on HBASE-1485:
------------------------------

Yeah, as ryan suggested, we should exploit sequenceid -- maybe name file for sequenceid?
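
A minimal sketch of the sequenceid idea, not the actual KVHeap code. Everything here is hypothetical and for illustration only: the Cell class, its fields, and the scan() helper stand in for KeyValues read from store files. It assumes each store file carries a sequence id where largest means newest, and that the heap uses it only as the last tie-breaker once column and timestamp are equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SeqIdTieBreakSketch {

    /** Hypothetical stand-in for a KeyValue plus the sequence id of its store file. */
    static final class Cell {
        final String column;   // row/family/qualifier collapsed into one string
        final long timestamp;
        final String value;
        final long fileSeqId;  // assumption: larger means newer store file

        Cell(String column, long timestamp, String value, long fileSeqId) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
            this.fileSeqId = fileSeqId;
        }

        /** True when both cells are the same version of the same column. */
        boolean sameVersion(Cell other) {
            return column.equals(other.column) && timestamp == other.timestamp;
        }
    }

    /** Column ascending, timestamp descending, then newest store file first. */
    static final Comparator<Cell> ORDER = (a, b) -> {
        int c = a.column.compareTo(b.column);
        if (c != 0) return c;
        c = Long.compare(b.timestamp, a.timestamp);
        if (c != 0) return c;
        return Long.compare(b.fileSeqId, a.fileSeqId); // all else equal: newest file wins
    };

    /** Merges cells from all files and keeps only the newest copy of each version. */
    static List<Cell> scan(List<List<Cell>> storeFiles) {
        PriorityQueue<Cell> heap = new PriorityQueue<>(ORDER);
        for (List<Cell> file : storeFiles) {
            heap.addAll(file); // simplification: a real heap would hold one scanner per file
        }
        List<Cell> out = new ArrayList<>();
        Cell last = null;
        while (!heap.isEmpty()) {
            Cell next = heap.poll();
            if (last != null && last.sameVersion(next)) {
                continue; // duplicate version from an older store file, skip it
            }
            out.add(next);
            last = next;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Cell>> files = Arrays.asList(
            Arrays.asList(new Cell("row/family:column", 12345L, "bbb", 1L)),
            Arrays.asList(new Cell("row/family:column", 12345L, "aaa", 2L)),
            Arrays.asList(new Cell("row/family:column", 12345L, "ccc", 3L)));
        for (Cell c : scan(files)) {
            System.out.println(c.column + " @" + c.timestamp + " = " + c.value);
        }
    }
}
```

With three files that each hold the same column at timestamp 12345, the sketch emits only the value from the file with the largest sequence id. Whether the real heap can carry the file's sequence id alongside each KeyValue is exactly the open question in this issue.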

[jira] Commented: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729379#action_12729379 ] 

Jonathan Gray commented on HBASE-1485:
--------------------------------------

I've had at least three people with a use case for this.

Might create a couple sub-tasks here so we can at least head in the right direction.

First, we need to make scanners ignore duplicate versions of the same column.  The trickiest part is, how do we determine which to keep?  We want to always come from the latest storefile, but I believe their IDs are still random and not timestamps?  We might need to make that change to fix this.  Would also then require a modification to the KVHeap to take this into account, all other things considered equal.

Once we have scanners working, that will mean the proper thing is enforced on major (and if we want, minor) compactions.

Gets will only work once we re-implement Gets as an optimized scan (taking advantage of bloom filters, mostly).


I remember why I punted this to 0.20.1, the tricky part at the beginning is pretty tough and touches a good bit of core read-path code.

Revisiting now, we'll see.  Anyone else interested in this / want to work on it?

[jira] Commented: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716713#action_12716713 ] 

stack commented on HBASE-1485:
------------------------------

Just to say Gets and Scanners should work the same, whatever we decide.

[jira] Updated: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1485:
---------------------------------

    Fix Version/s:     (was: 0.20.1)
                   0.21.0

Bumped to 0.21

[jira] Commented: (HBASE-1485) Wrong or indeterminate behavior when there are duplicate versions of a column

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805304#action_12805304 ] 

stack commented on HBASE-1485:
------------------------------

From the list:

{code}
On Tue, Jan 26, 2010 at 9:36 AM, Rod Cope <rod.cope at openlogic dot com> wrote:
> Hi,
>
> I'm seeing behavior on 0.20.2 and 0.20.3 that doesn't seem quite right and
> would like to know if this is by design, a bug, or something I'm doing
> wrong.
>
> Background:
>
> When I do a put that includes a timestamp like this (conceptually - I know
> this is not the actual API), it works just fine.
>  put "table", "family", "column", "bbb", 12345
>
> Then, if I do another put in the same client code using the same timestamp
> like this...
>  put "table", "family", "column", "aaa", 12345
>
> ...and I create a scanner, grab a Result, and iterate over all values using
> list(), I get this...
>  "table", "family", "column", "aaa", 12345
>
> So far, so good.  Now, if I truncate the table from the shell and run a new
> program that does a flush() on the table between the two put's, but does it
> in the same client program back-to-back, I also get the same results from
> list().
>
> -----
>
> Problem:
>
> Here's where the trouble starts.  I truncate the table and run a new program
> that puts "bbb", flushes the table, and quits.  Here's what I get from
> list():
>  "table", "family", "column", "bbb", 12345
>
> Then I run another program that puts "aaa", flushes, and quits.  Here's what
> I get from list():
>  "table", "family", "column", "aaa", 12345
>  "table", "family", "column", "bbb", 12345
>
> And if I then run a third program that puts "ccc", flushes, and quits, I get
> this from list():
>  "table", "family", "column", "ccc", 12345
>  "table", "family", "column", "bbb", 12345
>  "table", "family", "column", "aaa", 12345
>
> I'm getting three different values for identical
> table/family/qualifier/timestamp tuples.  Does this seem right?  There also
> doesn't seem to be a defined sort order, probably because the timestamps are
> identical.
>
> Also, if instead of using list(), I use getMap(), then I always only get a
> single result.  The single result is always the last item in the lists above
> (i.e., "bbb" then "bbb" then "aaa").  I get identical results from using
> getNoVersionMap().
>
> I suspect that this same behavior could occur when HBase decides to flush on
> its own, but I could be wrong.  As you can imagine, this can cause problems
> because clients can't know from the results of calling list() which value is
> "right" or "newest".  They also can't rely on getMap() or getNoVersionMap()
> because the single result that gets returned is not necessarily "right" or
> "newest".
>
> I've reproduced everything above in a stand-alone installation and also with
> a 7 regionserver cluster with the final 0.20.3.  I started down this
> debugging path originally because I ran into this problem on the 7
> regionserver cluster with one table of 100+ regions.  I was flushing
> programmatically at the end of some large imports because I'm doing
> setWriteToWAL(false) for load performance.
>
> Am I doing something wrong?  Did I miss an HBase assumption about flushing
> and/or identical timestamps?
>
> Any help would be much appreciated.
{code}
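
One plausible reading of the getMap() observation above, sketched under assumptions: if the map is built by inserting each listed cell into a map keyed by timestamp, then a later cell with the same timestamp overwrites an earlier one, so the last item in list() order is the one that survives. The Version class and toVersionMap() helper below are hypothetical and only model that overwrite; they are not the Result implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TimestampMapSketch {

    /** Hypothetical stand-in for one returned cell of a single column. */
    static final class Version {
        final long timestamp;
        final String value;

        Version(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Builds a timestamp-to-value map by inserting cells in the order they were listed. */
    static NavigableMap<Long, String> toVersionMap(List<Version> listed) {
        NavigableMap<Long, String> versions = new TreeMap<>();
        for (Version v : listed) {
            versions.put(v.timestamp, v.value); // same timestamp: later cell replaces earlier
        }
        return versions;
    }

    public static void main(String[] args) {
        // Order taken from the third list() result reported above.
        List<Version> listed = Arrays.asList(
            new Version(12345L, "ccc"),
            new Version(12345L, "bbb"),
            new Version(12345L, "aaa"));
        System.out.println(toVersionMap(listed));
    }
}
```

Under that assumption the map holds a single entry for 12345, and its value is whichever cell happened to come last in the list, which matches the report that the survivor is not necessarily the newest write.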
