You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@accumulo.apache.org by "Christopher Tubbs (JIRA)" <ji...@apache.org> on 2012/09/25 12:56:07 UTC

[jira] [Created] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Christopher Tubbs created ACCUMULO-775:
------------------------------------------

             Summary: Optimize iterator seek() method when seeking forward
                 Key: ACCUMULO-775
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
             Project: Accumulo
          Issue Type: Improvement
          Components: tserver
            Reporter: Christopher Tubbs
            Assignee: Keith Turner
             Fix For: 1.5.0


At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.

This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.

I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.

Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466383#comment-13466383 ] 

Christopher Tubbs commented on ACCUMULO-775:
--------------------------------------------

David, I think we have to first understand why it is faster sometimes to scan instead of seek (even the fast seek that Adam refers to). I'm not actually sure why it should happen, except perhaps it's just more expensive to make the decision than it is for users to simply know their data well enough to guess at a reasonable number of entries over which to scan before seeking.

If somebody can simulate this, or demonstrate where the slow-down is, I'd love to know, but considering that we already have the optimization Adam mentioned, I'm thinking this is probably a low priority thing to investigate.
                
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>            Priority: Minor
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.
> Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Posted by "Adam Fuchs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462690#comment-13462690 ] 

Adam Fuchs commented on ACCUMULO-775:
-------------------------------------

RFile already implements this optimization. Take a look at the RFile._seek(Range) method, roughly lines 658-692 of RFile.java.
                
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.
> Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs updated ACCUMULO-775:
---------------------------------------

    Description: 
At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.

This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.

I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.

Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

  was:
At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.

This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.

I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.

Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

    
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.
> Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs updated ACCUMULO-775:
---------------------------------------

    Priority: Minor  (was: Major)
    
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>            Priority: Minor
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.
> Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Posted by "David Medinets (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465843#comment-13465843 ] 

David Medinets commented on ACCUMULO-775:
-----------------------------------------

Is the scanning something that can be simulated? I'm thinking that a simulation might show the new technique has a prediced x% speedup.
                
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.
> Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-775) Optimize iterator seek() method when seeking forward

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465604#comment-13465604 ] 

Christopher Tubbs commented on ACCUMULO-775:
--------------------------------------------

That's great, but I still see performance gains with scanning over X entries before deciding to seek, so it seems there is some room for further optimization, which we may be able to predict at a lower layer more reliably than the person writing (or the person using, in the case of a configurable iterator) the iterator that applies this scan-before-seek pattern.
                
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet, it is a very common case, especially when writing filtering/consuming/skipping iterators to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression), rather than continuing to iterate. A common solution to the problem of whether to scan or seek is to continue to scan for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled in to the lower level iterator, so that iterators on top don't have to do this. They can seek, and the underlying source iterator can simply consume the next X entries when it makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine that the places where this would make the most sense is if the data currently being sought (seek'd) is in the current compressed block from the underlying file, especially if it is forward, relative to the current pointer. A better seek method should be able to tell where one currently is, and whether the requested data is within reach without doing all the expensive operations to re-seek to the same compressed block that is already loaded, reload it, decompress it, and scan to the requested starting point.
> Having such an optimization would eliminate the need for users to try to calibrate their own such scan vs. seek optimization based on guessing whether their data is in the current block or another one, while still getting that same performance benefit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira