Posted to dev@accumulo.apache.org by "John Vines (Created) (JIRA)" <ji...@apache.org> on 2012/04/05 19:44:23 UTC

[jira] [Created] (ACCUMULO-516) Column family search with sparse files is painfully long

Column family search with sparse files is painfully long
--------------------------------------------------------

                 Key: ACCUMULO-516
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-516
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
    Affects Versions: 1.3.5, 1.4.0
            Reporter: John Vines
            Assignee: Keith Turner
            Priority: Critical
             Fix For: 1.4.1


Background: a tablet with 3 files, coming in at ~500MB, 200MB, and ~20MB. One of the files (I believe the smallest) did not have the column of interest at all. I was running a query that filters on a column family/qualifier pair. I can scan the entire table in ~30 minutes, but I aborted a scan for just that column after 2 hours.

Cause: Keith and I investigated. Major compacting the tablet brought the column scan down to under 7 minutes. Dumping the largest file and grepping for the column of interest showed a large dead spot for that column, which took minutes to grep over. After looking it over, the problem is how we do column family filtering. We handle colf filtering below the multi-iterator, which handles the merge read between multiple files. We do it at this level because we keep column info in the RFile metadata for quick filtering of entire files. The problem is that one of the files has that column but has no relevant data across a large stretch of the file. So every time we seek, which happens for each batch of the query, each file below the multi-iterator seeks to its first hit for the column(s) of interest. This means we constantly spend minutes fetching a key of interest that lies so far ahead in that file that we won't merge read it for many, MANY batches.
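
The cost described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Accumulo code: keys are simplified to (row, family) tuples, each "file" is a sorted Python list, and the names `seek_filtered`, `scan_filter_below`, `sparse`, and `dense` are all hypothetical. It counts how many non-matching entries are read when the family filter sits below the merge and every batch starts with a fresh seek.

```python
import bisect

WANTED = "want"  # hypothetical column family name


def seek_filtered(keys, after, wanted_cf):
    """Seek just past `after`, then skip forward to the next key in wanted_cf.

    Returns (index, skipped); skipped counts the non-matching entries read.
    """
    i = bisect.bisect_right(keys, after)
    skipped = 0
    while i < len(keys) and keys[i][1] != wanted_cf:
        i += 1
        skipped += 1
    return i, skipped


def scan_filter_below(files, wanted_cf, batch_size):
    """Batched scan with the family filter applied per file, below the merge."""
    results, examined = [], 0
    after = (-1, "")  # sorts before every key in this toy data
    while True:
        # Every batch re-seeks every file; each seek runs to that file's
        # next wanted key, however far away it is.
        pos = []
        for keys in files:
            i, skipped = seek_filtered(keys, after, wanted_cf)
            pos.append(i)
            examined += skipped
        batch = []
        while len(batch) < batch_size:
            live = [n for n in range(len(files)) if pos[n] < len(files[n])]
            if not live:
                break
            # Merge step: take the smallest current key among the files.
            f = min(live, key=lambda n: files[n][pos[n]])
            key = files[f][pos[f]]
            batch.append(key)
            pos[f], skipped = seek_filtered(files[f], key, wanted_cf)
            examined += skipped
        results.extend(batch)
        if len(batch) < batch_size:
            return results, examined
        after = batch[-1]


# One file with a long dead spot before its wanted keys, one dense file.
sparse = [(r, "other") for r in range(9000)] + [(r, WANTED) for r in range(9000, 9100)]
dense = [(r, WANTED) for r in range(1000)]

results, examined = scan_filter_below([sparse, dense], WANTED, batch_size=100)
```

In this model the sparse file's dead spot is walked again at the start of every batch, even though none of its wanted keys are returned until the dense file is exhausted, so `examined` ends up several times larger than the sparse file itself.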

Proposed Solution: Split the column family filter into two separate pieces. Keep the RFile-optimized portion, as it can only occur at this level. But move the actual column family filter, for files that contain that column, above the MultiIterator. This will prevent the constant repetition of a large, painful seek.
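
A rough sketch of the proposed split, again as a toy model rather than the actual patch: file-level pruning stands in for the RFile metadata check, and the per-entry family filter runs above a merge of the remaining files. Keys are simplified to (row, family) tuples, and `make_file`, `scan_filter_above`, and the sample data are hypothetical names introduced for illustration.

```python
import bisect
import heapq

WANTED = "want"  # hypothetical column family name


def make_file(keys):
    """Toy stand-in for an RFile: sorted keys plus the set of families present."""
    return {"keys": sorted(keys), "families": {cf for _, cf in keys}}


def scan_filter_above(files, wanted_cf, batch_size):
    """Batched scan: prune whole files first, then filter above the merge."""
    # File-level pruning: drop files that never contain the family.
    candidates = [f["keys"] for f in files if wanted_cf in f["families"]]
    results, examined = [], 0
    after = (-1, "")  # sorts before every key in this toy data
    while True:
        # Each batch re-seeks, but a seek only positions each file; it does
        # not run ahead looking for the next wanted key.
        sources = [keys[bisect.bisect_right(keys, after):] for keys in candidates]
        batch = []
        for key in heapq.merge(*sources):
            examined += 1
            if key[1] == wanted_cf:  # the filter sits above the merge
                batch.append(key)
                if len(batch) == batch_size:
                    break
        results.extend(batch)
        if len(batch) < batch_size:
            return results, examined
        after = batch[-1]


sparse = make_file(
    [(r, "other") for r in range(9000)] + [(r, WANTED) for r in range(9000, 9100)]
)
dense = make_file([(r, WANTED) for r in range(1000)])
unrelated = make_file([(r, "misc") for r in range(500)])  # pruned entirely

results, examined = scan_filter_above([sparse, dense, unrelated], WANTED, batch_size=100)
```

In this model each entry of the two remaining files is read once in merge order, so the cost is bounded by a full read of those files instead of growing with the number of batches. That is consistent with the observation that a full table scan finished well before the filtered scan did.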

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (ACCUMULO-516) Column family search with sparse files is painfully long

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250721#comment-13250721 ] 

Keith Turner commented on ACCUMULO-516:
---------------------------------------

One possible workaround is to manually place the column family filtering iterator higher up in the stack. This would avoid the filtering at the lower level that is causing the problem. However, this solution will not work well on a table that has locality groups configured, because the iterator will drop the info needed by rfile to make smart locality group decisions.

One issue with this workaround is that this iterator does not have an init method, so you would need to extend it and add an init method.
                


        

[jira] [Updated] (ACCUMULO-516) Column family search with sparse files is painfully long

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-516:
----------------------------------

    Fix Version/s: 1.5.0
    


        

[jira] [Resolved] (ACCUMULO-516) Column family search with sparse files is painfully long

Posted by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-516.
-----------------------------------

    Resolution: Fixed
    
